Cleaning data

Once you have found and imported your dataset, you will need to prepare it for analysis. This page contains resources and tips for using statistical software to clean secondary data.

Resources for data cleaning and management

Excel
- Top Excel Tips for Data Analysis: A set of tips for data cleaning, analysis, and visualization in Excel.
Python
- Python Data Science Handbook: This free online book includes chapters on data cleaning and manipulation.
SPSS
- SPSS Online Training Workshop: This set of tutorials, videos, and datasets includes tutorials on defining and modifying variables, merging data, transforming variables, and restructuring variables in SPSS.
Stata
- Stata YouTube channel: The StataYouTube channel includes hundreds of videos on data cleaning and management in Stata. Each short video addresses a specific topic, such as merging files, appending files, reshaping data, identifying and removing duplicates, and many more.
R
- Manual for using R for HCES: A step-by-step Github manual designed for users who are new to R and RStudio that provides instructions for getting started in R and using it to analyse household consumption and expenditure surveys. This manual includes guidance for understanding data types and data structures and manipulating data in R. This manual was created by the Micronutrient Action Policy Support (MAPS) Project.
- R Bootcamp: This set of R tutorials includes tutorials on sorting, reshaping, and cleaning data.

General resources on data cleaning and management:

Best Practices for Data and Code Management: This guide created by Innovations for Poverty Action provides best practices for organizing folder and files; writing and organizing code; dealing with missing values; documenting data and code; and keeping data secure.
J-PAL Research Resources: This collection of resources from the Abdul Latif Jameel Poverty Action Lab (J-PAL) includes guidance on data cleaning and analysis.

Resources and guidance for merging datasets from different sources:

Technical guidance for merging data: There are many pitfalls when combining data from different sources. These resources provide information about merging data in Stata, R, SAS, and Python.
- Data Ladder: In this posting, you can review the key concepts and potential pitfalls of merging data.
- Stata Blog: In this blog post, you can review common mistakes and tips to avoid these mistakes when merging data.
- University of Wisconsin-Madison: In this Data Wrangling in Stata series, you can learn to understand the structure of datasets to merge them together.
- Analytic Vidhya: In this blog post, you can learn about the 9 challenges that R and Python users often face in merging datasets.
- SAS: In this paper, you walk through an example of merging datasets in SAS and learn about the common issues and how to avoid them.

SCANR tips on data cleaning

What is metadata?
- Metadata is information that describes the data and helps future users understand how it was collected, who collected or created it, what is contains, and how it can be referenced. Metadata may include elements such as a dataset persistent ID (DOI), publication date, title, author(s), contact information, description, subject(s), keyword(s), topic classification(s), language, producing and distributing organization(s), production date and place, names of contributors, funding information, time period, data type, software, geographic location, and unit of analysis.
What is a codebook?
- The codebook tells you what questions were asked/answered, what variables were measured, and/or how the answers/results were recorded. Use the codebook to find your variables of interest and explore other aspects of the dataset. When cleaning data, refer to the codebook for the variable names, variable values, and value labels. For some datasets, the “codebook” might be a survey questionnaire, survey guide or protocol, or other data collection tool.
What if some data are missing?
- First, figure out how missing data is coded in your dataset. Missing values may be represented by a blank, a period (i.e., a dot or full stop), a large number (e.g., 9999 or -9999), or some other symbol.
- Figure out which values are plausible. Compare the values in the dataset to the values listed in the codebook. Values not listed in the codebook may reflect data entry errors. Then, check for biologically implausible values. These may reflect data entry errors (e.g., age = 200 years). If possible, contact the people or organization who collected the data to see if changes to the data collection tool or data entry protocols could help you to understand implausible values.
- Check for skip patterns denoted in the questionnaire or other data collection tool or guide. For example, in a survey, some questions may be automatically skipped if a respondent gives a certain answer to an earlier question. These questions are not applicable to the respondent but should not be considered missing.