  • Original Article

Mass Data Massage: an automated data processing system used for NHEXAS, Arizona

Abstract

Data entry and management are critical components of all large survey projects; data quality objectives must be met and data must be quickly and readily accessible. We developed a comprehensive system for data entry and management utilizing scannable forms with bubble fields and handwriting recognition. This 'Mass Data Massage' (MDM) system had three components: (1) form creation and database definition; (2) programming of data dictionaries for documentation and preliminary logic and range checks; and (3) data entry, management and documentation using the 'Mass Data Cleaning Program' (MDCP). Scannable forms were written in Teleform, where the data field definition, variable names and ranges were defined as the form was created. Completed forms were returned from the field, subjected to final field quality control (QC) checks, and transferred to the data management section. They were batched and coded as necessary. Once a batch of data was scanned and visually verified, the operator called up the menu for the MDCP. The MDCP had 31 program modules with 500–1200 lines of code each. The operator could select and run the appropriate dictionary on each data batch 'correcting' apparent errors in responses. This process was iterative until the data batch passed all dictionary checks. Proposed 'changes' were forwarded to the data coordinator (DC) for acceptance or rejection. After all errors had been resolved, each data batch was subjected to a 10% quality assurance (QA) check. The original data batch and associated file of applied changes were archived. Time expenditure using the scanning approach varied with the number of questions and the types of responses (handwritten or bubble fields). One-page forms took 42–60% of the time needed for hand entry; forms longer than 10 pages took 35–38% of the time. Use of faster machines will further speed the process. The main advantage of the system was the reduction of systematic errors. 
Scanning alone reduced errors found on 995 NHEXAS Baseline Questionnaires. Overall, the dictionary identified 0.55% errors on the scanned forms. Ten percent QC checks, performed on corrected batches ready to be appended to the master database, revealed an overall error rate of 0.02%. Similar checks on a laboratory form scanned from numeric handwriting detected 0.3% errors following dictionary application and 0.2% errors during the 10% QA check. This system was faster, more accurate, and more cost-effective than hand entry of data. A batch of data that took >1 week to process using the hand entry method was processed within 1 day using MDM. Human coding of specific answers and the final verification were the most time-consuming processes.
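
The abstract describes the workflow only in outline, so the following is a minimal sketch of the general idea and not the MDCP itself: a data dictionary of range and logic checks is run on a batch, proposed changes are logged separately so the original batch can be archived untouched, and a 10% sample is drawn for the QA check. All field names, limits, and function names (`DICTIONARY`, `check_batch`, `propose_change`, `qa_sample`) are hypothetical and chosen for illustration.

```python
import copy
import random

# Hypothetical data dictionary: field name -> (lowest, highest) allowed value.
# The real NHEXAS variables and ranges are not given in the abstract.
DICTIONARY = {
    "age_years": (0, 120),
    "household_size": (1, 30),
}


def logic_checks(record):
    """Hypothetical cross-field (logic) check; returns a list of messages."""
    problems = []
    if record.get("smoker") == "no" and (record.get("cigarettes_per_day") or 0) > 0:
        problems.append("cigarettes_per_day > 0 but smoker == 'no'")
    return problems


def check_batch(batch):
    """Run range and logic checks; return (record_index, field, message) flags."""
    flags = []
    for i, record in enumerate(batch):
        for field, (low, high) in DICTIONARY.items():
            value = record.get(field)
            if value is None:
                flags.append((i, field, "missing"))
            elif not (low <= value <= high):
                flags.append((i, field, f"out of range: {value}"))
        for message in logic_checks(record):
            flags.append((i, "logic", message))
    return flags


def propose_change(working, change_log, index, field, new_value):
    """Apply a proposed change to the working copy and record it in the log."""
    change_log.append(
        {"record": index, "field": field,
         "old": working[index].get(field), "new": new_value}
    )
    working[index][field] = new_value


def qa_sample(n_records, one_in=10, seed=0):
    """Pick roughly one record in `one_in` (10% by default) for a QA check."""
    k = max(1, n_records // one_in)
    return sorted(random.Random(seed).sample(range(n_records), k))


# Usage: the original batch is kept unchanged; edits go to a working copy.
batch = [
    {"age_years": 34, "household_size": 4, "smoker": "no", "cigarettes_per_day": 0},
    {"age_years": 340, "household_size": 2, "smoker": "yes", "cigarettes_per_day": 10},
]
working = copy.deepcopy(batch)
change_log = []

flags = check_batch(working)  # record 1 has an out-of-range age
propose_change(working, change_log, 1, "age_years", 34)  # illustrative fix
remaining = check_batch(working)  # repeat until no flags remain
```

Keeping the change log apart from the batch mirrors the abstract's statement that the original data batch and the file of applied changes were archived; in this sketch the log would be what a data coordinator reviews before accepting or rejecting each change.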

Author information

Corresponding author

Correspondence to MARY KAY O'ROURKE.

About this article

Cite this article

O'ROURKE, M., FERNANDEZ, L., BITTEL, C. et al. Mass Data Massage: an automated data processing system used for NHEXAS, Arizona. J Expo Sci Environ Epidemiol 9, 471–484 (1999). https://doi.org/10.1038/sj.jea.7500043

  • DOI: https://doi.org/10.1038/sj.jea.7500043
