What is data management?

Data management is a multistep process that involves obtaining, cleaning, and storing data to allow accurate analysis and produce meaningful results. While data management has broad applications (and meanings) across many fields and industries, in clinical research the term is frequently used in the context of clinical trials.1 This editorial introduces early-career researchers to the practices of data management more generally, as applied to all types of clinical research studies.

Outlining a data management strategy prior to the initiation of a research study plays an essential role in ensuring that both scientific integrity (i.e., that the data generated can accurately test the hypotheses proposed) and regulatory requirements are met. Data management can be divided into three steps: data collection, data cleaning and transformation, and data storage. These steps are not necessarily chronological and often occur simultaneously. Different aspects of the process may require the expertise of different people, necessitating a team effort for the effective completion of all steps.

Data collection

Data source

Data collection is a critical first step in the data management process and may be broadly classified as “primary data collection” (collection of data directly from the subjects specifically for the study) or “secondary use of data” (repurposing data that were collected for some other reason, either for clinical care in the subject’s medical record or for a different research study). While the terms retrospective and prospective data collection are occasionally used,2 these terms are more applicable to how the data are utilized than to how they are collected. Data used in a retrospective study are almost always secondary data; data collected as part of a prospective study typically involve primary data collection but may also involve secondary use of data collected as part of ongoing routine clinical care for study subjects. Primary data collected for a specific study may be categorized as secondary data when used to investigate a new hypothesis, different from the question for which the data were originally collected. Primary data collection has the advantages of being specific to the study question, minimizing missingness in key information, and providing an opportunity for data correction in real time. As a result, this type of data is considered more accurate, but its collection increases the time and cost of study procedures. Secondary use of data includes data abstracted from medical records, administrative data such as from the hospital’s data warehouse or insurance claims, and secondary use of primary data collected for a different research study. Secondary use of data offers access to large amounts of data that are already collected but often requires further cleaning and codification to align the data with the study question.

Data forms

A case report form (CRF) is a powerful tool for effective data collection. A CRF is a paper or electronic questionnaire designed to record pertinent information from study subjects as outlined in the study protocol.3 CRFs are always required in primary data collection but can also be useful in secondary use of data to preemptively identify, define, and, if necessary, derive critical variables for the study question. For instance, medical records provide a wide array of information that may not be required or useful for the study question. A CRF with well-defined variables and parameters helps the chart reviewer focus only on the relevant data, makes data collection more objective and unbiased, and protects patient confidentiality by minimizing the amount of patient information abstracted. Tools like REDCap (Research Electronic Data Capture) provide electronic CRFs and offer advanced features such as validation rules that minimize errors during data collection.4 Designing an effective CRF upfront, during the study planning phase, helps streamline the data collection process and makes it more efficient.3
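To illustrate how an electronic CRF can enforce structure at the point of data entry, the sketch below defines a few hypothetical fields with simple validation rules in Python. The field names, types, and plausibility ranges are invented for illustration and are not drawn from REDCap or any specific study instrument.

```python
# Hypothetical CRF field definitions: each field has an expected type and a plausibility rule.
crf_fields = {
    "study_id":  {"type": str,   "check": lambda v: len(v) > 0},
    "age_years": {"type": float, "check": lambda v: 0 <= v <= 120},
    "weight_kg": {"type": float, "check": lambda v: 0.3 <= v <= 400},
    "sex":       {"type": str,   "check": lambda v: v in {"male", "female", "unknown"}},
}

def validate_record(record):
    """Return a list of validation problems for one CRF record."""
    errors = []
    for field, spec in crf_fields.items():
        if record.get(field) is None:
            errors.append(f"{field}: missing value")
            continue
        value = record[field]
        if not isinstance(value, spec["type"]) or not spec["check"](value):
            errors.append(f"{field}: implausible or mistyped value ({value!r})")
    return errors

# An age of 250 years fails the range check and is flagged before the record is saved.
print(validate_record({"study_id": "S001", "age_years": 250.0, "weight_kg": 70.0, "sex": "male"}))
```

Checking each value against its expected type and range at entry time, as electronic CRF validation rules do, catches typographical errors when they are easiest to correct.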

Data cleaning and transformation

Quality checks

Data collected may have errors that arise from multiple sources—data manually entered in a CRF may have typographical errors, whereas data obtained from data warehouses or administrative databases may have missing data, implausible values, and nonrandom misclassification errors. Having a systematic approach to identify and rectify these errors, while maintaining a log of the steps performed in the process, can prevent many roadblocks during analysis.

First, it is important to check for missing data. Missing data are defined as values that are not available and that would be meaningful for analysis if they were observed.5 Missing data can bias the results of the study, depending on how much data are missing and how the missing data are distributed across the study cohort. Many methods for handling missing data have been published; Kang6 provides a practical review. If missing data cannot be retrieved and are limited to only a small number of subjects, one approach is to exclude these subjects from the study. Missing data in different variables across many subjects often require more sophisticated approaches to account for the “missingness.” These may include creating a category of “missing” (for categorical variables), simple imputation (e.g., substituting missing values in a variable with the average of the non-missing values), or multiple imputation (substituting missing values with the most probable value derived from other variables in the dataset).7
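As a minimal sketch of these ideas, the example below uses Python with pandas on a small, made-up dataset to quantify missingness, apply simple mean imputation to a numeric variable, and create an explicit “missing” category for a categorical variable; multiple imputation would ordinarily rely on dedicated routines rather than this simplified approach.

```python
import numpy as np
import pandas as pd

# Small, invented dataset with missing values.
df = pd.DataFrame({
    "study_id":  ["S001", "S002", "S003", "S004"],
    "weight_kg": [72.0, np.nan, 65.5, np.nan],
    "smoker":    ["no", None, "yes", "no"],
})

# Quantify missingness per variable before deciding how to handle it.
print(df.isna().sum())

# Simple imputation: replace missing weights with the mean of the observed weights.
df["weight_kg_imputed"] = df["weight_kg"].fillna(df["weight_kg"].mean())

# Categorical variable: create an explicit "missing" category.
df["smoker"] = df["smoker"].fillna("missing")
```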

Second, errors in the data can be identified by running a series of data validation checks. Some examples of data validation rules for identifying implausible values are shown in Table 1. Automated algorithms for detecting and correcting implausible values may be available for cleaning specific variables in large datasets (e.g., growth measurements).8 After identification, data errors can either be corrected, if possible, or marked for deletion. Other approaches, similar to those for dealing with missing data, can also be used for managing data errors.
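A minimal sketch of such validation checks, assuming a pandas DataFrame with hypothetical column names and plausibility limits chosen only for illustration, might look like the following.

```python
import pandas as pd

df = pd.DataFrame({
    "study_id":       ["S001", "S002", "S003"],
    "age_years":      [34, -2, 130],  # -2 and 130 are implausible
    "admission_date": pd.to_datetime(["2021-03-01", "2021-01-20", "2020-12-30"]),
    "discharge_date": pd.to_datetime(["2021-03-02", "2021-01-15", "2020-12-31"]),
})

# Range check: age must lie within a plausible interval.
implausible_age = ~df["age_years"].between(0, 120)

# Consistency check: discharge cannot precede admission.
inconsistent_dates = df["discharge_date"] < df["admission_date"]

# Flag records for review rather than silently deleting them, so corrections can be logged.
flagged = df[implausible_age | inconsistent_dates]
print(flagged["study_id"].tolist())
```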

Table 1 Examples of validation rules that can assist in data quality checks.

Data transformation

The data collected may not be in the form required for analysis. Data transformation includes recategorizing and recoding the collected data, as well as deriving new variables, to align with the study analytic plan. Examples include categorizing body mass index collected as a continuous variable into under- and overweight categories, recoding free-text values such as “growth of an organism” or “no growth” into a binary “positive” or “negative,” and deriving new variables, such as average weight per year, from multiple weight values over time available in the dataset. Maintaining a codebook of definitions for all variables, predefined and derived, can help a data analyst better understand the data.
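The sketch below illustrates these transformations in Python with pandas on invented data; the column names are hypothetical, and the body mass index cut-points are standard adult values used here purely for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "study_id":       ["S001", "S002", "S003"],
    "bmi":            [17.9, 23.4, 31.2],
    "culture_result": ["growth of an organism", "no growth", "no growth"],
})

# Recategorize a continuous variable (BMI) into categories.
df["bmi_category"] = pd.cut(
    df["bmi"],
    bins=[0, 18.5, 25, float("inf")],
    labels=["underweight", "normal", "overweight"],
    right=False,
)

# Recode free-text values into a binary variable.
df["culture_positive"] = df["culture_result"].map(
    {"growth of an organism": "positive", "no growth": "negative"}
)
```

Each derived variable, and the rule used to create it, should be recorded in the codebook so that the analyst can trace every transformed value back to the original data.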

Data storage

Securely storing data is especially important in clinical research, as the data may contain protected health information of the study subjects.9 Most institutions that support clinical research have guidelines on safeguards to prevent accidental data breaches.

Data are collected in paper or electronic formats. Paper data should be stored in secure file cabinets inside a locked office at the site approved by the institutional review board. Electronic data should be stored on a secure, approved institutional server and should never be transported using unencrypted portable media devices (e.g., “thumb drives”). If not all study team members require access to the study data, access should be granted selectively based on each member’s role.

Another important aspect of data storage is data de-identification. Data de-identification is a process by which identifying characteristics of the study participants are removed from the data in order to mitigate privacy risks to individuals.10 Identifying characteristics of a study subject include name, medical record number, date of birth/death, and so on. To de-identify data, these characteristics should either be removed from the data or modified (e.g., changing medical record numbers to study IDs, changing dates to ages/durations, etc.). If feasible, study data should be de-identified at the time of storage. If reidentification of the study participants may be required in the future, the data can be separated into two files: one containing only the de-identified data of the study participants and one containing all the identifying information, with both files containing a common linking variable (e.g., study ID) that is unique for every subject or record. The linking variable can be used to merge the two files when reidentification is required to carry out additional analyses or to obtain further data. The link key should be maintained on a secure institutional server accessible only to authorized individuals who need access to the identifiers.
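A minimal sketch of this separation, using pandas and entirely fictitious subject data, is shown below; the column names and file names are assumptions made for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "study_id":      ["S001", "S002"],
    "name":          ["Jane Doe", "John Roe"],
    "mrn":           ["123456", "654321"],
    "date_of_birth": ["1980-05-01", "1975-11-12"],
    "age_years":     [43, 48],
    "outcome":       ["improved", "unchanged"],
})

identifiers = ["name", "mrn", "date_of_birth"]

# Link key: identifying information only, kept on a restricted-access server.
link_file = df[["study_id"] + identifiers]

# De-identified analysis file: identifiers removed, study_id retained for later linkage.
analysis_file = df.drop(columns=identifiers)

link_file.to_csv("link_key_restricted.csv", index=False)
analysis_file.to_csv("analysis_deidentified.csv", index=False)
```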

To conclude, effective data management is important for the successful completion of research studies and for ensuring the validity of the results. Outlining the steps of the data management process upfront will help streamline the process and reduce the time and effort subsequently required. Assigning team members responsibility for specific steps and maintaining a date- and time-stamped log documenting each action as it happens, whether collecting, cleaning, or storing data, can ensure that all required steps are done correctly and make any errors easier to identify. Effective documentation is a regulatory requirement for many clinical trials and helps ensure that all team members are on the same page. When interpreting results, it will serve as an important tool for assessing whether the interpretations are valid and unbiased. Last, it will ensure the reproducibility of the study findings.