To the Editor:

With the rapid accumulation of data in all areas of chemical biology research, scientists rely increasingly on historical chemogenomics data and computational models to guide small-molecule bioactivity screens and chemical probe development. However, there is growing public concern about the frequent irreproducibility of experimental data reported in peer-reviewed scientific publications1,2. An editorial in this journal3 emphasized a critical need to address this problem, an issue that has also received attention from the US National Institutes of Health (NIH) leadership4. Because the successful development of chemical probes and robust screening assays, a central objective of chemical biology, relies on prior art in the field, it is critical that researchers establish the highest possible quality standards for data deposited in chemogenomics databases.

Concerning the impact of poor data in chemogenomics databases, we5 and others6 have shown that inaccurate and inconsistent representations of chemical structures in available molecular datasets result in models of poor accuracy, whereas data curation improves the modeling outcome. Researchers relying on non-curated historical data risk corrupting their results owing to the following 'five I's': data may be incomplete, inaccurate, imprecise, incompatible and/or irreproducible. These considerations emphasize the need for thorough curation as the first critical step of any data analysis study, both to ensure the stability and reliability of the models and to guide experimental follow-up5.

As one means of addressing the data quality problem, we propose a general chemical and biological data curation workflow (Fig. 1) that relies on existing cheminformatics approaches to flag, and in some cases correct, possibly erroneous entries in large chemogenomics datasets. This workflow begins with chemical data curation following a previously established protocol5 (step 1 in Fig. 1), resulting in the identification and correction of structural errors. Duplicate analysis (step 2) assesses data quality and removes duplicate chemical structures and contradictory records. Analysis of intra- and interlaboratory experimental variability (step 3) and exclusion of unreliable data sources (step 4) help increase data quality and aid decision-making about combining data from different sources. Detection and verification of activity 'cliffs' (step 5) and calculation and tuning of the dataset modelability index7 (step 6), which estimates the feasibility of obtaining predictive quantitative structure-activity relationship (QSAR) models for a given dataset, serve as additional indicators of data quality. Consensus QSAR modeling (step 7), used for the identification and correction of potentially erroneous values or categories of compound bioactivities (step 8), concludes the workflow.
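To illustrate how such a workflow can be assembled from existing tools, the following sketch implements steps 1 and 2 (structure standardization and duplicate analysis) with the open-source RDKit toolkit; the function names and example records are ours and purely illustrative, not part of the published protocol5.

```python
# Minimal sketch of steps 1 and 2: standardize structures, then flag
# duplicate records whose activity labels disagree. Illustrative only.
from collections import defaultdict
from rdkit import Chem
from rdkit.Chem.SaltRemover import SaltRemover

_salt_remover = SaltRemover()

def standardize(smiles):
    """Parse a SMILES string, strip common counterions and return a
    canonical InChIKey; return None for unparsable structures (step 1)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = _salt_remover.StripMol(mol)
    return Chem.MolToInchiKey(mol)

def flag_contradictions(records):
    """Group records by standardized structure and report groups whose
    activity labels disagree (step 2: duplicates and contradictory records)."""
    labels_by_structure = defaultdict(set)
    for smiles, activity in records:
        key = standardize(smiles)
        if key is not None:
            labels_by_structure[key].add(activity)
    return {k: v for k, v in labels_by_structure.items() if len(v) > 1}

# Aspirin written as two different SMILES strings collapses to one InChIKey;
# the conflicting activity calls would be flagged for manual review.
records = [("CC(=O)Oc1ccccc1C(=O)O", "active"),
           ("OC(=O)c1ccccc1OC(C)=O", "inactive")]
print(flag_contradictions(records))
```

A fuller treatment would also normalize tautomers, charge states and stereochemistry annotations before deduplication, as discussed in the cited curation protocol5.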

Figure 1: General workflow for comprehensive curation of chemogenomics datasets.

Each step can be done using existing cheminformatics techniques and software tools. The workflow ensures the detection and elimination of the following: nonstandardized and duplicated chemical structures (steps 1 and 2); records associated with unreliable data sources or high experimental variability (steps 3 and 4); structural outliers and unverified activity cliffs (steps 5 and 6). Some mislabeled compounds can thereby be identified and corrected (steps 7 and 8).
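To make the modelability assessment (step 6) concrete, the sketch below computes a simple nearest-neighbour version of the dataset modelability index for a classification dataset: for each activity class, the fraction of compounds whose nearest neighbour in descriptor space carries the same class label, averaged over the classes. The descriptor matrix and labels are random placeholders; in practice they would come from the curated dataset, and the precise definition and recommended thresholds are given in the cited work7.

```python
# Sketch of a nearest-neighbour dataset modelability index (MODI, step 6).
# Descriptors and labels below are random placeholders for illustration.
import numpy as np

def modi(descriptors, labels):
    """descriptors: (n_compounds, n_descriptors) array; labels: class labels."""
    X = np.asarray(descriptors, dtype=float)
    y = np.asarray(labels)
    # Euclidean distance matrix with self-distances masked out
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)
    same_as_neighbour = y[dist.argmin(axis=1)] == y
    # Average the per-class agreement fractions over all classes
    return float(np.mean([same_as_neighbour[y == c].mean()
                          for c in np.unique(y)]))

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))        # placeholder descriptor matrix
y = rng.integers(0, 2, size=200)      # placeholder activity classes
print(f"MODI = {modi(X, y):.2f}")     # higher values suggest a more modelable set
```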

As a community, we must take multifaceted approaches to ensure the quality and reproducibility of chemogenomics data through better data generation and reporting. The Nature family of journals8 has taken steps in this direction by removing space restrictions for methods sections and by engaging external statisticians to verify the correctness of statistical tests reported in some manuscripts considered for publication. The NIH is also developing plans to encourage researchers to enhance the reproducibility of their research results (http://grants.nih.gov/grants/guide/notice-files/NOT-OD-15-103.html). It is also crucial for journals to support and encourage the use of standardized electronic protocols and formats (such as MIABE9) for chemical data sharing and to require authors to upload their data to public repositories at the time of manuscript submission.

Among other measures, the chemical biology community should adopt a culture of curation as a mandatory component of primary data processing and a prerequisite for data sharing. Chemical and biological data curation workflows can be developed further and used to flag (and, where possible, fix) erroneous records, ultimately improving the quality of data analysis and the predictive performance of modeling approaches. Experimental and computational scientists should convene to agree on standards and best practices for the generation, reporting and curation of chemogenomics data, which will improve data reproducibility and accelerate the progression from data to knowledge in chemical biology research.