Module 14. Data entry and cleaning
Ensuring the quality of data
Processes for data entry, examples of data entry software and management of the database
Cleaning and processing data to finalize the database for analysis
Introduction
Confidence in survey results is heavily dependent on the quality of data collection and analysis. Data quality is affected by such factors as database design, data entry, cleaning and processing. There is usually an expert subcommittee to manage and provide oversight for tasks related to data management and analysis. More information is provided in Box 8.2 of Module 8: Survey supervision and personnel and in Module 15: Data processing and analysis.
Ensuring the quality of data
The procedure for entering and managing data, whether collected electronically or on paper, should include guidance on:
- key staff and supervisors responsible for each stage (database design, data entry, cleaning, management, processing and analysis);
- the process for entering data;
- software and equipment to be used;
- developing the data entry screen with appropriate validation checks and skips;
- the flow and tracking of data from the field to the final format; and
- data confidentiality and security.
The choice of paper-based or electronic data collection must be made early on in the survey planning process. Electronic data collection is the preferred method, as it reduces time and improves data quality. In addition, electronic data can be backed up every day while teams are still in the field, whereas paper forms risk being lost or misplaced before the data are entered into an electronic database. Further discussion of the advantages and disadvantages of each method can be found in Module 4: Survey design. When paper-based questionnaires are used, sufficient entry staff are needed to perform double data entry (entry of each questionnaire by two independent staff).
Each survey team member, from the interviewers to the Survey coordinator, should have a clear task related to data quality in his or her roles and responsibilities. During field work, the responsibility for data integrity and quality starts with the interviewers. Information should be entered in a standardized, legible format onto paper questionnaires and other data collection forms. Ultimately, team leaders are responsible for the consistency and completeness of the team’s data. Team leaders in the field need to review all data on the questionnaires to ensure that they are clear and complete, and check for potential data errors or mismatching between specimens and individual or household questionnaires. See Module 11: Data collection tools, field manual, and database and Module 13: Field logistics for more information on these checks.
Processes for data entry, examples of data entry software and management of the database
Constructing the data entry system
A strong data entry system is required to ensure high-quality data, whether data are collected electronically in the field or are entered from paper-based questionnaires. To improve data quality, the data entry program should have preprogrammed skips, correctly formatted fields for variables such as dates, and validation checks that set appropriate limits for certain variables, such as dates of birth for children under 5 years of age and values for haemoglobin levels. It should also include cross-checks for consistency between related variables, such as the ID code of a woman of reproductive age compared with the household ID and the stated number of women in this age group in the household. In this way, the system rejects any unexpected values and the variable is flagged for further review.
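The range and cross-checks described above can be sketched in code. The following is a minimal, hypothetical illustration in Python (the field names follow the example data dictionary in Box 14.1, and the limits shown are illustrative, not prescriptive):

```python
# Hypothetical sketch of field-level validation checks. A real data entry
# program would enforce these rules at entry time and flag rejected values.

def check_record(record):
    """Return a list of flag messages for values that fail validation."""
    flags = []
    # Range check: haemoglobin in g/dL (illustrative limits; see Box 14.1)
    if not (4.0 <= record["HB"] <= 18.0):
        flags.append("HB out of range: %s" % record["HB"])
    # Range check: age in months for children under 5 years of age
    if not (6.0 <= record["AGE"] <= 59.9):
        flags.append("AGE out of range: %s" % record["AGE"])
    # Cross-check: individual ID should be consistent with the household ID
    if not str(record["ID"]).startswith(str(record["HHID"])):
        flags.append("ID does not match household ID")
    return flags

record = {"ID": "0101", "HHID": "01", "HB": 21.5, "AGE": 24.0}
print(check_record(record))  # flags the implausible haemoglobin value
```

In a production system these rules would be preprogrammed into the entry screen itself, so the implausible value is rejected at the moment of entry rather than discovered during cleaning.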
Construction of a data entry system requires a complete data collection tool (questionnaire). It is common to develop the electronic data collection system, or the data entry system for paper-based tools, after the cognitive interviewing process (Module 11: Data collection tools, field manual, and database). This version can be used to train data entry staff (where applicable), who can practice with completed forms during the training and pilot test. Minor adjustments to finalize the tools may still be expected during the training and piloting process, and it is important to make sure that all changes are made to the final software version, whether uploaded for electronic collection or used to enter data from paper-based forms.
For double data entry of paper-based forms, a discrepancy check program needs to be developed to compare the independent entries.
The steps required to develop the system and enter data are illustrated in Fig. 14.2.
With electronic data collection, there is no need for double data entry, nor for the related discrepancy checks and reconciliation. It is possible to move straight to data checks, cleaning, and analysis. This is one of the principal advantages of electronic data collection.
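For paper-based double entry, the discrepancy check program mentioned above can be sketched as follows. This is a simplified, hypothetical comparison keyed by participant ID; a real program would also handle missing records and type differences:

```python
# Hypothetical sketch of a double-entry discrepancy check: compare two
# independent entries of the same questionnaires, keyed by participant ID.

def find_discrepancies(entry1, entry2):
    """Return rows for a discrepancy log: (ID, variable, value1, value2)."""
    log = []
    for pid in sorted(set(entry1) | set(entry2)):
        rec1 = entry1.get(pid, {})
        rec2 = entry2.get(pid, {})
        for var in sorted(set(rec1) | set(rec2)):
            v1, v2 = rec1.get(var), rec2.get(var)
            if v1 != v2:
                log.append((pid, var, v1, v2))
    return log

staff1 = {"001": {"HB": 12.4, "SEX": 1}, "002": {"HB": 10.8, "SEX": 2}}
staff2 = {"001": {"HB": 12.4, "SEX": 1}, "002": {"HB": 10.3, "SEX": 2}}
print(find_discrepancies(staff1, staff2))  # [('002', 'HB', 10.8, 10.3)]
```

Each row of the resulting log is then reconciled against the paper questionnaire by the database manager.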
Choosing software
Software for data entry from paper-based forms
Programs that can be used for data entry from paper-based questionnaires include Epi Info (http://www.cdc.gov/epiinfo), Epi Data (http://www.epidata.dk), and Census and Survey Processing System, CSPro (https://www.census.gov/data/software/cspro.html). Several Microsoft® Office programs, including Microsoft® Access (https://products.office.com/en-us/access), offer additional options.
Software for data entry for electronic data collection
Programs available for electronic data entry include Epi Info (http://www.cdc.gov/epiinfo) and Open Data Kit (ODK) (https://opendatakit.org/), a frequently used free and open access software. Factors to consider in choosing software include cost, capacity to generate relational (hierarchical) data files (for example, linking a woman of reproductive age to the household she is in and to a child she may have) and whether open access is an important feature.
In either case (paper-based or electronic), a program that can be modified by others relatively easily should be used in case the primary developer becomes unavailable.
Developing a data dictionary
A data dictionary defines all variables included in the survey questionnaire. It is required for developing the data entry program so that type, field width and validation checks (agreed upon acceptable values) can be programmed for each variable. The data dictionary also needs to define all variables created from the original data, for example, the variable “anaemia” may be defined from the result of the haemoglobin test together with the individual’s age and pregnancy status. A data dictionary is also essential for developing the data analysis syntax. Box 14.1 provides an example of a data dictionary.
Box 14.1 Example Data Dictionary
| Variable | Variable name | Variable type | Variable width | Example values/notes |
|---|---|---|---|---|
| Participant number | ID | Numeric | 3 | 001–999 |
| Household number | HHID | Numeric | 2 | 01–25 |
| Residence | URBAN_RURAL | Numeric | 1 | 1=Urban, 2=Rural |
| Region | STRATA | Numeric | 1 | 1–3 |
| Cluster number | CLUSTER | Numeric | 2 | 01–30 |
| Age in months | AGE | Numeric | 2.1 | 06.0–59.9 |
| Date of birth | DOB | dd/mm/yyyy | 2.1 | [values set according to survey date and expected age of respondent] |
| Sex | SEX | Numeric | 1 | 1=Male, 2=Female |
| Date of survey | SURVEY | dd/mm/yyyy | | 15/06/2004–20/08/2004 |
| Haemoglobin | HB | Numeric | 2.1 | 04.0–18.0 (a) |
| Urinary iodine concentration | UIC | Numeric | 4.1 | 0000.0–1000.0 µg/L |
| Retinol binding protein concentration | RBP | Numeric | 2.2 | 00.00–90.00 µmol/L |
| Iodine in salt based on rapid test kit | SALT_RTK | Numeric | 1 | 1=Yes, 0=No |
| Iodine level in salt based on titration | SALT_QUANT | Numeric | 3 | 000–120 mg/kg |

(a) Note: These may not be correct minimum and maximum values for use in populations living at high altitudes.
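The derived variable "anaemia" mentioned above can be sketched in code. The cutoffs below are illustrative only; the actual thresholds, and any altitude or smoking adjustments, must come from the survey protocol:

```python
# Sketch of a derived variable: "anaemia" defined from the haemoglobin
# result together with the individual's age and pregnancy status.
# Cutoffs are illustrative, not prescriptive.

def anaemia(hb_g_dl, age_months, pregnant=False):
    """Return 1 if anaemic, 0 if not, using illustrative cutoffs."""
    if pregnant:
        cutoff = 11.0
    elif age_months < 60:          # children 6-59 months
        cutoff = 11.0
    else:                          # non-pregnant women of reproductive age
        cutoff = 12.0
    return 1 if hb_g_dl < cutoff else 0

print(anaemia(10.2, age_months=24))    # 1 (anaemic child)
print(anaemia(12.5, age_months=300))   # 0 (non-anaemic woman)
```

Defining derived variables like this in the data dictionary, with the exact formula recorded, ensures the analysis syntax reproduces them consistently.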
Testing the data entry system
The data entry system requires extensive testing, preferably by a number of people entering different options that will, for example, test different skip patterns. After this testing, the system should be piloted among different groups, to assess:
- validation checks (expected data ranges/exclusion of implausible values and cross-checks with values for other entered, related variables)
- data entry formats
- skip patterns
- logical, user-friendly variable names, labels, format, and flow
Results of the pilot test may reveal that the data dictionary needs adjustment. Piloting of the data collection and data entry system should be done prior to training, so that enumerators use the final, optimized system during the training.
Data entry requirements
Data entry should start as soon as possible after the initiation of fieldwork. This will allow common errors to be identified early, reasons for errors to be determined and corrective action to be taken.
A micronutrient survey may require that a large amount of data be entered. For paper-based survey questionnaires, data can be entered into the electronic database either:
- At the end of the day by the survey team. This approach requires significant time in the field that could otherwise be spent on data collection. On the other hand, it allows for the quick correction of erroneous data by allowing the team to return to a cluster. It also enables data to be backed up onto a separate device to avoid loss of information that could result if the paper version of the completed questionnaire was lost.
- By double data entry at the central data management location. This is the most commonly used method when data collection is paper-based. It may improve data quality by reducing the rate of errors and inter-individual variability, because a limited number of experienced data entry personnel enter the data. This approach requires strong supervision and detailed checks in the field to ensure the legibility and quality of the data entered. This method requires:
- a minimum of two data entry staff assigned by the database manager to enter the data;
- entry of information from each questionnaire by each of these two people (double data entry);
- comparison of the two data files by the database manager using the discrepancy check program;
- reconciliation of any differences based on the paper version of the questionnaire; and
- monitoring of personnel performance and retraining where needed.
For accountability, the final and complete set of data files should include:
- A clean final master version of the data to be used for data analysis. The final master dataset will have a data dictionary with variable labels that link to specific questionnaires.
- The two sets of raw entry (to confirm double entry).
- A log of any discrepancies found. The log of discrepancies per variable could be presented as a table with the following headings:
| Variable | Data entry staff n°1 value | Data entry staff n°2 value | Resolved value |
|---|---|---|---|
Cleaning data to finalize the database for analysis
Whether collected on paper or electronically, data require checks and cleaning. Data cleaning is intended to identify potentially erroneously recorded data. For paper-based collection, checking and cleaning take place after data entry errors have been corrected by comparing and reconciling double-entered data. For electronic data that are continually uploaded to the server, checking and cleaning can be done on a regular basis during the field work period.
With electronic data collection, a data review exercise can be done several times a week. This allows feedback to be sent to the Survey coordinator on progress toward the expected numbers of interviews and specimens. For example, it might be found that consent for blood collection among children under 5 years of age is lower than expected, which may prompt follow-up to find the cause and advise teams accordingly. Again, any changes to procedures should be documented to avoid biasing the sample.
Duplicate entries
Unexpected duplicate entries need to be fixed immediately. Household and individual ID numbers are unique and should occur only once in a data file. Duplicate ID numbers may have several causes: for example, a single questionnaire may have been entered twice, two different individuals may have been assigned the same ID number, or one of the two ID numbers may have been entered incorrectly. This last issue can be avoided by using barcode labels and barcode readers for unique ID numbers in the field and at the laboratory.
Dates and other identifiers are very useful in data cleaning processes. Data tracking documents that describe the clusters, dates and individuals should be used as management tools to help disentangle duplicate data entries and other irregular findings. Within the data tracking system, there can be a list of ID numbers and the date that data were collected for each individual. For example, if you know that person 1234 belongs to cluster 567 and you have two entries for person 1234, then you can check the cluster number to determine which entry is incorrect. This part of data cleaning, determining which of the duplicate entries is correct, requires time and attention to detail.
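A duplicate-ID check of this kind can be sketched as follows (a hypothetical illustration; the field names are assumed, not taken from any particular software):

```python
# Sketch of a duplicate-ID check: unique individual IDs should occur only
# once in a data file. Cluster numbers from the data tracking documents
# help identify which of two duplicate entries was keyed incorrectly.
from collections import Counter

def duplicate_ids(rows):
    """Return IDs that appear more than once in the data file."""
    counts = Counter(row["ID"] for row in rows)
    return sorted(pid for pid, n in counts.items() if n > 1)

rows = [
    {"ID": "1234", "CLUSTER": "567"},
    {"ID": "1234", "CLUSTER": "568"},  # same ID, different cluster: suspect
    {"ID": "1235", "CLUSTER": "567"},
]
print(duplicate_ids(rows))  # ['1234']
```

Once the duplicate IDs are listed, each one is traced back through the tracking documents to decide which entry to correct.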
Implausible values
The most common method of checking data is to produce a frequency table for every variable and to identify values that fall outside of a normally acceptable range. This range should be defined in the data dictionary (see Box 14.1). Where outlying values are found, they should be traced back to the original questionnaire to see if it could be a simple data entry error, due to handwriting that is difficult to read or to an incorrect decimal place. In general, valid data entry types and ranges are pre-set in any electronic data collection form in order to reduce the likelihood of “out of normal range” errors. Where there appears to be a consistent unexpected value for a specific cluster, the Team leader should be notified and he or she should verify whether the value reflects something unusual about that cluster. If the checks are being conducted on an ongoing basis during data collection and there is a consistently unexpected value produced by one interviewer, the Team leader should follow up and monitor the performance of that person. Often the outlying values cannot be verified and corrected, and a decision needs to be made regarding changing the variable outcome to a ‘missing’ value. Any such findings and changes need to be documented.
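The frequency-table check described above can be sketched as follows (a minimal illustration; the range limits shown are assumed from the example data dictionary, not fixed rules):

```python
# Sketch of the frequency-table check: tabulate each variable and flag
# values outside the acceptable range given in the data dictionary.
from collections import Counter

def frequency_check(values, low, high):
    """Return (frequency table, values outside [low, high])."""
    freq = Counter(values)
    outliers = sorted(v for v in freq if not (low <= v <= high))
    return freq, outliers

hb_values = [11.2, 12.4, 11.2, 41.0, 10.8]   # 41.0 likely a decimal error
freq, outliers = frequency_check(hb_values, low=4.0, high=18.0)
print(outliers)  # [41.0]
```

Each flagged value is then traced back to the original questionnaire; 41.0 here would plausibly be a mis-keyed 4.10 or 14.0, which only the paper form can confirm.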
Logical errors found during the data cleaning should be investigated and, when possible, corrected. This is relevant for electronic or paper-based data collection.
Examples of logical errors include:
- the date of birth is recorded as after the survey date;
- the date of birth does not fit with the expected age of the individual, for example the age calculated from the date of survey and date of birth is not the same as (or within an acceptable range of) the stated age;
- the designation of “urban” and “rural” is inconsistent among households within the same cluster;
- body mass index (BMI) values indicate that the height and weight measurements may have been entered in the wrong boxes, or that a decimal place has been entered incorrectly.
Logical error checks should be pre-set in electronic data collection forms. By correctly programming the electronic data collection system, it is possible to ensure that these errors cannot be entered, and such values are immediately flagged and can be rectified. For example, if a BMI is outside of the expected range, the participant’s weight and height can be measured again.
All errors must be either corrected or deleted from the database, and the process should continue until the data are considered “clean.”
Missing data
Missing data may have been entered as 99.9 or 999.99, depending on the questionnaire instructions. Missing data, including refusal codes, need to be appropriately recoded so they do not skew the summary statistics. In addition, the number of missing responses for each variable needs to be investigated. If there are many missing values, check that these are not a result of a database or data entry error.
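Recoding sentinel missing codes can be sketched as follows (the codes 99.9 and 999.99 are taken from the example above; real codes come from the questionnaire instructions):

```python
# Sketch of recoding sentinel missing codes to a true missing value so
# that they do not skew summary statistics such as the mean.
MISSING_CODES = {99.9, 999.99}   # assumed codes from questionnaire instructions

def recode_missing(values):
    return [None if v in MISSING_CODES else v for v in values]

hb = [11.2, 99.9, 12.4, 10.8]
clean = recode_missing(hb)
observed = [v for v in clean if v is not None]
print(round(sum(observed) / len(observed), 1))  # 11.5, not inflated by 99.9
```

Counting the `None` values per variable also gives the number of missing responses that needs to be investigated.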
Merged data
Laboratory data that are not measured during the data collection period (for example, haemoglobin levels or the presence of malaria) usually become available well after the final database has been approved. These laboratory data will need to be merged with the questionnaire data, using the household or individual’s unique ID number.
Here is an example of how to verify merged data: If the survey data file shows 800 women of reproductive age eligible for specimen collection, and the corresponding laboratory data file includes only 700, you would expect that 100 eligible women refused consent to provide a specimen or that the specimen volume was insufficient. However, after merging data by unique ID, it might be that only 650 lines of data match. In this case investigations to resolve the discrepancy may include:
- reviewing the Specimen transfer form to compare the IDs of women of reproductive age against specimens collected and specimens sent to the laboratory; or
- verifying the use of a barcode reader to enter ID numbers at the laboratory (on arrival and during recording of analysis results). If the barcode reader was not used, it is possible that IDs were incorrectly entered at the laboratory and that some specimen ID numbers do not match with a corresponding ID of women of reproductive age;
- reviewing the response to specimen collection for all individual IDs where laboratory data are missing to assess the reason. Reasons include a declined test, inadequate specimen collected, unable to be measured (for example if the blood was haemolyzed), or lost specimen.
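The merge verification described above can be sketched as follows (a simplified illustration using sets of IDs; a real merge would join full records):

```python
# Sketch of a merge check: after merging laboratory results with the
# survey file by unique ID, list the IDs that failed to match so each
# discrepancy can be investigated.

def merge_check(survey_ids, lab_ids):
    survey, lab = set(survey_ids), set(lab_ids)
    return {
        "matched": len(survey & lab),
        "survey_only": sorted(survey - lab),   # e.g. refusals, lost specimens
        "lab_only": sorted(lab - survey),      # e.g. mis-keyed specimen IDs
    }

result = merge_check(["001", "002", "003", "004"], ["001", "002", "005"])
print(result)
```

Here "005" appearing only in the laboratory file would point to an ID keyed incorrectly at the laboratory, while "003" and "004" would be checked against the Specimen transfer form and the recorded reasons for non-collection.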
Creating individual-level and household-level data sets
There will be individual data sets and household data sets. The individual data sets may need to be cross-linked, for example, mother and child pair. They will also be linked to the household. During the planning stages, unique IDs and linking variables were ideally created in the data entry form to enable linking at the data management stage. Software should be selected that allows for hierarchical linking of data where needed.
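Hierarchical linking can be sketched as follows (a hypothetical structure; the ID and field names are assumptions for illustration):

```python
# Sketch of hierarchical linking: each child record carries its mother's
# ID and household ID, so individual data sets can be joined to one
# another and to the household data set.

households = {"01": {"HHID": "01", "URBAN_RURAL": 1}}
women = {"0101": {"ID": "0101", "HHID": "01", "HB": 12.4}}
children = [{"ID": "0102", "HHID": "01", "MOTHER_ID": "0101", "AGE": 24}]

def link_child(child):
    """Attach the household and mother records to a child record."""
    return {
        **child,
        "household": households[child["HHID"]],
        "mother": women.get(child["MOTHER_ID"]),
    }

linked = link_child(children[0])
print(linked["mother"]["HB"])  # 12.4
```

Planning these linking variables into the data entry form from the start is what makes such joins possible at the data management stage.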
Managing the database
All data, collected on paper or electronically, should be entered and maintained securely in a central database. Typically, a Database manager is responsible for developing the database and maintaining backups. However, depending on the complexity of the survey, multiple people will work on managing the survey database. A Data coordinator needs to work with the Database manager as well as software programmers, statisticians, and other specialists to ensure that the data are entered, linked, and maintained in a secure, organized way. Saving data in two different networks or servers, with different access permissions, helps ensure that there is no loss of data or risk of files being deleted or altered in error. The Database manager and Data coordinator need to have strong experience working with large databases.