Merging publicly available data from different sources

 

There is a wealth of publicly available data that can be used to answer questions related to agriculture, food systems, nutrition, and health. Merging data from different sources can allow us to answer innovative research questions. However, combining datasets can be challenging, especially for researchers hoping to answer questions that touch on multiple disciplines. In this use case, we’ll explore strategies for and challenges of merging data from two different publicly available databases.

For instance, we might be interested in understanding the connections between trade, development, and health. How is food trade related to national economies? To delve into this topic, I decided to look at the effect of the African Continental Free Trade Area (AfCFTA) on African countries. Is free trade within the African continent related to development or health outcomes?

To answer this question, I used data from World Bank’s World Development Indicators and the Food and Agriculture Organization of the United Nations’ (FAO) Food Balance Sheets.

 

Find the datasets and check for compatibility

To access World Bank Data, go to the SCANR Research Guidance Data repositories section. From there, I accessed the World Bank DataBank. On this page, you can select from 82 available databases, among which I selected World Development Indicators (WDI), which provides collection of development indicators.

  • SCANR tip: To review all the available data from World Bank, look at the World Bank Data Catalog, which provides access to over 3,000 datasets and 14,000 indicators with essential metadata. 

To access FAO data, go back to the SCANR Research Guidance Data repositories section and click on from FAOSTAT. There I clicked on the “Data” tab and then “Food balances” to access the FAO Food Balance Sheets (FBS).

Among the countries who are members of the AfCFTA, I decided to look at Egypt, Ghana, and South Africa. To make sure that the countries are named in the same way in the World Bank and FAO datasets, I checked the ISO country codes for each country, which are the internationally recognized abbreviations for country names.          

  • SCANR tip: Before combining data from different sources, you must check if the data are compatible. One way to do this is to check the metadata or codebooks of the datasets and understand how the variables are defined or scaled. For example, you can view the definitions and standards used in World Bank dataset here; you can view the definitions and standards used in FAOSTAT here.
 
Download data from each source

From the World Bank World Development Indicators, I downloaded GNI per capita, PPP, and life expectancy at birth up from 1960-2013. From the World Bank Health Nutrition and Population Statistics, I downloaded prevalence of overweight and cause of death by noncommunicable disease. Once you open one of the World Bank DataBank datasets, you can use the options on the left side to view different databases, countries, series (variables), and time periods. To check the ISO country codes for your selected countries, click on the “Metadata” button.

From the Food Balance Sheets, I downloaded data on import quantity and export quantity of major food groups, including animal products, cereals, starchy roots, sugar & sweeteners, and vegetal products. To check the ISO country codes in FAOSTAT, look at the “Definitions and Standards” tab.

  • SCANR tip: Make sure to include “Null values” when you download your datasets to view missing items. Otherwise, the Food Balance Sheet default is to download available data only.
  • SCANR tip: The World Bank’s default data format is different than FAOSTAT. World Bank is downloaded as a table by default, while FAO data is downloaded as a list by default. In the World Bank DataBank, you can change the data format to “List” using the “Download options” button. Or, you can change the data format in FAOSTAT by changing the output type to “Pivot.” Alternatively, you can reformat your data after downloading each dataset by pivoting the data in Excel or other reshaping in Stata or other statistical software.

 

Merge datasets from different sources in Excel

Once I downloaded each dataset in CSV format, I looked at my datasets and noticed their formats. For all the datasets, country column is the left most column. I see that in the World Bank DataBank datasets, the units are included in the series name. The starting year for the Food Balance Sheets is 1961, while the starting year for the World Bank DataBank datasets is 1960. Missing values are marked as NA in the World Bank DataBank datasets, but left blank in the Food Balance Sheets.

Then, I combined them in Excel by matching the “Series” from the World Bank DataBank datasets to the “Items” from the FAOSTAT dataset by year (temporal component) and the respective countries (cross-sectional component).

 

Merge datasets from different sources in Stata

If you are familiar with statistical software packages such as Stata or R, you can also use these programs to merge datasets.

To do this, I imported my Excel data to Stata. Next, I cleaned the datasets by keeping only the necessary variables using the “drop” command, re-named the variables to have the same variable names using the “rename” command, and sorted the data by the key variables identified, country and year, using the “sort” command. Next, I reshaped the dataset to wide format so that each variable had its own column there was one observation of each variable for each country and year. 

Once the datasets were prepared for merging, I used the merge command to combine the datasets. To use the merge command, we need to identify the common variable(s). The “merge 1:1 using `var’” command is used when the identifying variable uniquely identifies observations on each dataset. You can learn more about the options for the “merge” command by typing “help merge” in Stata or look at the Internet Guide to Stata.

  • SCANR tip: You can also combine datasets in Stata using the “append” command. However, append is not the same as merge command. The append command allows you to combine two datasets with different observations of the same variables; the merge command allows you to combine two datasets with different variables. To append datasets, you must have variables with exactly the same name. To learn more about append command, you can type “help append” in Stata or look at UCLA’s Statistical Consulting on appending data files.

 

Answer new research questions using your combined dataset!

Now that the datasets are merged, I can start the analysing my data! For resources and tips on data analysis check out SCANR Research Guidance Analysing data section.

 

SCANR resources for merging data

There are many pitfalls when combining data from different sources. Attention to details and double checking your datasets will be an invaluable asset. Here are some resources and tips you can check out:

  • Data Ladder: In this posting, you can review the key concepts and potential pitfalls of merging data.
  • Stata Blog: In this blog post, you can review common mistakes and tips to avoid these mistakes when merging data.
  • University of Wisconsin-Madison: In this Data Wrangling in Stata series, you can learn to understand the structure of datasets to merge them together.
  • Analytic Vidhya: In this blog post, you can learn about the 9 challenges that R and Python users often face in merging datasets.
  • SAS: In this paper, you walk through an example of merging datasets in SAS and learn about the common issues and how to avoid them.

 

This use case was prepared by Hyomin Lee, MS Candidate at the Friedman School of Nutrition Science and Policy at Tufts University. May 6, 2022.