Identification of discrepant data in aggregated databases is a key step in data curation and remediation. We have applied the ALATIS approach, which is based on the International Chemical Identifier (InChI) model, to the full PubChem Compound database to generate unique and reproducible compound and atom identifiers for all entries for which three-dimensional structures were available. This exercise also served to identify entries with discrepancies between structures and chemical formulas or InChI strings. The use of unique compound identifiers and atom nomenclature should support more rigorous links between small-molecule databases including those containing atom-specific information of the type available from crystallography and spectroscopy. The comprehensive results from this analysis are publicly available through our webserver [http://alatis.nmrfam.wisc.edu/].
Organic compounds with low molecular weights (usually less than 1,000 Da) are commonly categorized as small molecules. Small molecules are major targets for drug and biomarker discoveries 1,2 . For example, more than 60% of approved drugs in the 2017 United States Food and Drug Administration list are small molecules. The availability of accurate data about small molecules, including chemical, physical, and biomedical properties, is indispensable to many fields of endeavor. As the host to more than 94 million small molecule entries, PubChem 3 is the most important resource for retrieving these types of metadata. The vast amount of information archived in PubChem, generated either internally or aggregated from about 2,000 external resources and databases [ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/source/], serves as the central hub for inquiries related to small molecules. The accuracy of the archived and cross-referenced data in PubChem is a critical factor in the reproducibility of biomedical research studies that utilize this database.
Other databases contain essential data on the physical and chemical properties (e.g., mass spectrometry and nuclear magnetic resonance spectroscopy databases) and biological and toxicological properties (e.g., the Chemical Entities of Biological Interest (ChEBI) 4 ) of small molecules. PubChem offers pre-aggregated data from several of these archives. However, the process of identifying and collating information on a specific compound from different data sources can be a challenging problem, partly owing to the absence of a system for unique compound identification that can provide an “anchor” for linking entries across different reference databases 5 . This impediment becomes evident when attempts are made to create cross-references linking the physical and chemical properties of the compounds archived in the ever-expanding number of databases containing experimental data on small molecules (e.g., BMRB 6 , HMDB 7 , PDB ligand expo [http://ligand-expo.rcsb.org/], METLIN 8 , MassBank of North America [http://mona.fiehnlab.ucdavis.edu/]) with reference databases such as PubChem.
In order to forge valid links in aggregated data and to detect and remediate discrepancies, a vital first step is to implement unique and reproducible identifiers for compounds and their constituent atoms. We recently created a software technology called ALATIS (atom label assignment tool using InChI string) 5 that takes the structure file of a small molecule as input and produces the International Chemical Identifier (InChI) 9 as the unique compound name, and further expands this identifier to uniquely label all constituent atoms of the compound. The ALATIS software program utilizes the “inchi-1” program (https://www.inchi-trust.org/, InChI version 1, software version 1.04) to generate standard InChI strings for input compounds. ALATIS utilizes the Open Babel 10 software package to interconvert structure file formats and also to project 2D structures to 3D structures when needed. We recently used ALATIS in analyzing several databases: BMRB, HMDB, PDB Ligand Expo, and a small subset (0.05%) of PubChem entries. The results revealed the presence of non-standard InChI strings, inconsistent atom labeling, and inaccurate cross-references among the databases 5 .
In this study, we have expanded the ALATIS analysis to the entire PubChem database. The substantial computing resources needed for this work were provided through the enterprise-level computing platform of NMRbox 11 . The ALATIS results for the entire PubChem content have provided insights about the current state of this important resource. Most notably, the results provide unique and reproducible compound and atom identifiers associated with PubChem entries for use in improved curation and more accurate data retrieval.
We downloaded two sets of archived PubChem structure files on 20 December 2017: (i) the “Current-Full” dataset consisting of 94,201,188 entries with their corresponding two-dimensional (2D) structures stored in SDF 12 format, and (ii) the “Compound_3D” dataset consisting of 91,699,620 entries with their corresponding three-dimensional (3D) structures stored in SDF format. The “Current-Full” dataset was needed because it contains metadata that are not available in the “Compound_3D” files. More than 2.5 million entries in PubChem did not have a 3D structure file. Interestingly, none of the compounds with more than 152 atoms had 3D structures (Fig. 1).
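The relationship between the two dataset sizes can be checked with a few lines of arithmetic. The sketch below uses only the entry counts quoted above; the variable names are illustrative.

```python
# Entry counts quoted in the text for the two PubChem downloads.
current_full_entries = 94_201_188   # "Current-Full" (2D structures plus metadata)
compound_3d_entries = 91_699_620    # "Compound_3D" (3D structures)

# Entries that have a 2D record but no 3D structure file.
missing_3d = current_full_entries - compound_3d_entries

# Share of PubChem entries covered by the 3D dataset, in percent.
coverage_pct = 100 * compound_3d_entries / current_full_entries

print(f"entries without a 3D structure: {missing_3d:,}")
print(f"3D coverage: {coverage_pct:.2f}%")
```

The difference comes to slightly more than 2.5 million entries, which matches the statement in the text.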
In order to probe the correctness of atom chirality, we processed the Compound_3D dataset with ALATIS software. This step generated unique identifiers for more than 91 million compounds and their constituent atoms (Data Citation 1: Open Science Framework https://doi.org/10.17605/OSF.IO/JZDYQ). The output for each entry consisted of: (i) structure files in SDF, PDB, and XYZ formats containing ALATIS-based identifiers (labels) for all atoms, (ii) a map linking the input atom labels to the unique atom labels, (iii) a file containing a standard InChI string as the unique compound identifier (called ‘inchi.inchi’), (iv) two text files, named ‘warnings.txt’ and ‘error.txt’, that contain warnings or errors related to the ALATIS analysis of a particular compound, and (v) a comma-separated values (CSV) file, named ‘meta_data.csv’, containing the metadata associated with that entry. The metadata file contains, in addition to the PubChem compound identifier (CID), molecular formula, weight, and exact mass as reported by PubChem, the corresponding standard InChI string as generated by ALATIS. To facilitate side-by-side comparison of results, including comparison of input 3D structures and ALATIS output structures annotated with unique atom identifiers, we have generated a web page for each compound, which includes download links to all the data. We used the software Jmol [Jmol: an open-source Java viewer for chemical structures in 3D. http://www.jmol.org/] to create displays of the 3D structures. The unique compound and atom identifiers, along with information associated with PubChem entries, can be accessed through the ALATIS website [http://alatis.nmrfam.wisc.edu]. Users can query the search engine on this website with a PubChem CID or a compound name to retrieve the corresponding ALATIS output.
We used the ALATIS-curated data to analyze the consistency of the data stored for each entry in PubChem. Note that the synonyms and metadata are archived separately from the 3D structure files: the synonyms are located at [ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-Synonym-filtered.gz], and the metadata are stored as part of the SDF files archived in the “Current-Full” dataset [ftp://ftp.ncbi.nlm.nih.gov/pubchem/Compound/CURRENT-Full/SDF/]. The synonyms were used in creating a user-friendly search engine on the ALATIS webserver. The metadata were needed for the subsequent consistency analysis. We highlight below the two major outcomes of our study.
Inconsistency between the archived 3D structures and formulas
The chemical formula of a compound archived in PubChem normally follows the Hill convention 13 and represents the core parent structure of the compound 9 . However, the PubChem archive includes 1,239,752 charged chemical formulas, in which the charge is denoted by a symbol at the end of the chemical formula. The core parent structure of a compound indicates the composition of the compound before any charges are imposed through the addition or subtraction of hydrogen atoms. As illustrated by the examples in Fig. 2, it is not always possible to determine the core parent structure of a compound from its charged chemical formula. This is because, rather than resulting from the addition or subtraction of protons, the charge could be intrinsic to the covalent structure of the compound. Thus, large-scale computational processing and curation of the database could lead to inconsistent or ambiguous results in identifying the atom compositions of the compounds. This problem can be addressed by utilizing standard InChI strings. The formula layer of a standard InChI string provides the composition of the core parent of a compound, and the net charge (“/q”) and protonation (“/p”) layers represent the compound’s charges. This separation of charges from formulas facilitates extraction of the precise number of atoms in a compound’s structure file or chemical formula, as well as indicating the types of charges associated with the compound. We have produced a complete list of PubChem CIDs with charged chemical formulas, along with their corresponding ALATIS formulas in Hill format as extracted from standard InChI strings. These data are available on the ALATIS website.
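The separation of formula and charge layers described above can be illustrated with a short parsing sketch. This is not the ALATIS implementation. The function name is hypothetical, and the two example strings (acetate and tetramethylammonium) are illustrative standard InChI strings that are not taken from the analysis. The sketch assumes that the formula layer is the only layer without a lowercase prefix letter, and it keeps only the first occurrence of each layer letter.

```python
def split_inchi_layers(inchi: str) -> dict:
    """Split an InChI string into version, formula, and lettered layers."""
    prefix, *parts = inchi.split("/")
    layers = {"version": prefix.split("=", 1)[-1], "formula": ""}
    # Assumption: the formula layer, when present, comes first and does not
    # start with a lowercase letter (lettered layers such as /c, /h, /q do).
    if parts and parts[0] and not parts[0][0].islower():
        layers["formula"] = parts.pop(0)
    for part in parts:
        if part:
            # Keep the first occurrence of each layer letter only.
            layers.setdefault(part[0], part[1:])
    return layers


# Illustrative strings: a deprotonated acid (/p) and an intrinsic charge (/q).
acetate = split_inchi_layers("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)/p-1")
tetramethylammonium = split_inchi_layers("InChI=1S/C4H12N/c1-5(2,3)4/h1-4H3/q+1")

for name, layers in [("acetate", acetate), ("tetramethylammonium", tetramethylammonium)]:
    print(name, layers["formula"], "q=" + layers.get("q", "0"), "p=" + layers.get("p", "0"))
```

In the first string the formula layer still describes the neutral parent, and the charge appears only in the “/p” layer. In the second, the charge is intrinsic to the covalent structure and is reported in “/q”.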
Inconsistency between the archived 3D structures and InChI strings
We compared the deposited PubChem InChI strings to those generated by ALATIS (ALATIS utilizes InChI program v. 1.04 [http://www.inchi-trust.org/download/104/InChI_TechMan.pdf]). Standard InChI strings represent unique compound identifiers that can be used for cross-referencing entries from different databases 5 . These strings consist of several layers of information, including compound formulas, covalent connectivity between heavy atoms, the number of hydrogen atoms associated with heavy atoms, a layer to represent chirality, and other layers associated with isotopically labeled atoms and compound charges 9 . We used ALATIS to process the 3D structure files deposited in PubChem, and flagged entries for which the corresponding deposited InChI strings failed to match those reported by ALATIS. Table 1 shows different categories of these flagged PubChem entries. In this table, the ‘Atom connectivity’ category reports the number of entries flagged because of discrepancies in (a) covalent connectivity between heavy atoms (reported in the “/c” layer of InChI strings) or (b) the number of hydrogen atoms assigned to the heavy atoms (“/h” layer of InChI strings). The ‘Charge’ category reports the number of flagged entries that represent different (de)protonation (“/p” layer of InChI) or intrinsic covalent charges (“/q” layer). The ‘Stereochemistry’ category shows the number of entries that have been flagged because of discrepancies in (a) the “/b” layer of InChI strings, which reports the sp2 double bond stereochemistry of the compounds, or (b) the “/t” layer, which reports the orientations of chiral centers. We note that a compound could be flagged and reported in multiple categories. Overall, our analyses flagged 32,036,565 entries (about 35% of the PubChem entries with 3D structures) as having a discrepancy between the archived InChI string and that generated from the corresponding 3D structure by ALATIS.
Improper representation of stereochemistry was the most common reason for discrepancy, followed by charge, and atom connectivity (Table 1). Complete lists of these flagged entries are reported on our website [http://alatis.nmrfam.wisc.edu/databases].
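The categorization used in Table 1 can be sketched as a layer-by-layer comparison of two InChI strings. The code below is a minimal illustration under stated assumptions, not the ALATIS code. The function names are hypothetical, only the first occurrence of each layer letter is compared, and the two example strings describe 2-butene with a defined and an undefined double bond configuration rather than an actual PubChem entry.

```python
# Layer letters grouped as in the three categories described in the text.
CATEGORIES = {
    "Atom connectivity": ("c", "h"),
    "Charge": ("p", "q"),
    "Stereochemistry": ("b", "t"),
}


def lettered_layers(inchi: str) -> dict:
    """Map each layer letter to its content (first occurrence only)."""
    layers = {}
    for part in inchi.split("/")[1:]:
        if part and part[0].islower():
            layers.setdefault(part[0], part[1:])
    return layers


def flag_discrepancies(deposited: str, generated: str) -> list:
    """Return the categories in which the two InChI strings differ."""
    a, b = lettered_layers(deposited), lettered_layers(generated)
    return [
        name
        for name, letters in CATEGORIES.items()
        if any(a.get(letter) != b.get(letter) for letter in letters)
    ]


# Illustrative pair: undefined ("?") versus trans ("+") double bond.
deposited = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3?"
generated = "InChI=1S/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3+"

print(flag_discrepancies(deposited, generated))
print(flag_discrepancies(generated, generated))
```

Because each category is tested independently, a single entry can appear in more than one category, as noted in the text.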
We provide below examples from the three categories of flagged inconsistencies.
(a) Inconsistency in atom connectivity
As noted above, the layers “/c” and “/h” in the standard InChI string represent the connectivity of heavy atoms and the number of associated hydrogen atoms to the heavy atoms, respectively. The PubChem entry shown in Fig. 3 illustrates a case in which the 3D structure file and the deposited InChI strings represent distinct covalent bonds between heavy atoms. Correct identification of 3D structure is essential in functional investigations of compounds, and this category of inconsistency could lead to erroneous conclusions.
(b) Inconsistency in charge distribution
As mentioned above, distinct charges due to (de)protonation or intrinsic covalent charges of compounds are represented in the “/p” and “/q” layers of InChI strings. The flagged PubChem entries in this category are ones in which the archived 3D structure and InChI strings represent different charge states. Figure 4 shows an example from this category.
(c) Inconsistency in stereochemistry
(c.1) Inconsistency in double bond sp2 stereochemistry
The orientation of the structure of a compound about a double bond, whether the configuration is cis or trans, is captured precisely in standard InChI strings. These orientations, which can only be identified in 3D structures, are indicated in the “/b” layer of InChI strings. Figure 5 shows an example of a discrepancy between the configuration of the archived 3D structure and its associated InChI string. In this example, the InChI string of the PubChem entry (CID 1551886) contains a question mark in its “/b” layer, which indicates that the configuration of the compound is ambiguous. However, the archived 3D structure represents the trans configuration of the compound.
(c.2) Inconsistency in stereochemistry of chiral centers
The stereochemistry (chirality) of small molecules plays a vital role in determining their function. Among the more than 91 million PubChem entries with 3D structures, our computations using ALATIS indicated that more than 55% of the entries (50,508,180 entries) contained at least one chiral center. About 60% of these entries (30,236,352 entries) were flagged during our analysis, owing to inconsistencies between the stereochemistry layer of the deposited InChI strings in PubChem and those generated by ALATIS from the structures. The complete list of these entries is accessible from the ALATIS website. Figure 6 shows one example from these flagged entries.
The unique representation of chemical compounds and their constituent atoms is of paramount importance to the correct functioning of biochemical databases, including correct storage and retrieval of the compounds and their associated metadata. Unique IDs are essential to the creation and maintenance of cross-references between different databases and in the detection of discrepant data. The ALATIS naming system, which is based on the standard InChI string for a compound generated from its 3D structure, was utilized previously to evaluate the consistency of contents of three reference small molecule databases (BMRB, HMDB, and PDB Ligand Expo) 5 . The ALATIS naming system has subsequently been fully implemented in the Biological Magnetic Resonance Data Bank (BMRB) and has been recently adopted by the NMReData initiative 14 . In the work described here, we utilized the powerful hardware capabilities of the NMRbox project to process the PubChem Compound 3D dataset with ALATIS software to further expand the domain of accurate and reproducible compound and atom identifiers of small molecules. The PubChem database provides an important share of information routinely utilized in biomedical investigations of small molecules. Therefore, the accuracy and consistency of the deposited information and the capability of cross-referencing compounds from other databases to PubChem have a direct effect on the cogency of results in biomedical investigations that utilize PubChem.
For entries with stored 3D structures, which cover the vast majority (97.34%) of compounds in PubChem, our analysis created unique compound identifiers (standard InChI) and InChI-based identifiers for all atoms. In addition, we were able to identify PubChem entries with InChI strings that failed to match those generated from the archived structures (about 35% of those analyzed). This information is available for further curation of this key reference database.
We defined and programmed a workflow (Fig. 7) and used it to process and analyze the entire set of 3D structure files in PubChem. Because the workflow is fully defined, it is reproducible and can be rerun as needed for future reprocessing of the database.
The three-dimensional structure files of the entire PubChem were downloaded in SDF format from the PubChem FTP server on 20 December 2017. At the time of this analysis, the entries consisted of 5,260 compressed SDF files, each containing approximately 25,000 PubChem 3D compound structures. Non-standard cases were handled through processing scripts designed for each specific case. One such script dealt with instances in which the structure file of a compound had been updated on several occasions at different times, and all previous structures for the compound had been preserved in the compressed downloadable file. These obsoleted structures did not have an impact on the results of our analysis, but they increased the computational load because additional structures needed to be processed.
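A multi-record SDF file can be split into individual compound records, and repeated structures for the same compound can be detected, along the following lines. This is a sketch rather than one of the processing scripts mentioned above. The function names are hypothetical, the toy records are not valid structures, and the code relies on two conventions of the format: records end with a `$$$$` line, and PubChem stores the CID in a `PUBCHEM_COMPOUND_CID` data item.

```python
import io
from collections import Counter


def iter_sdf_records(stream):
    """Yield one record at a time from a text stream of concatenated SDF records."""
    record = []
    for line in stream:
        if line.strip() == "$$$$":      # record terminator in SDF files
            yield "".join(record)
            record = []
        else:
            record.append(line)
    if any(line.strip() for line in record):
        yield "".join(record)           # trailing record without a terminator


def read_cid(record):
    """Return the CID stored in the PUBCHEM_COMPOUND_CID data item, or None."""
    lines = record.splitlines()
    for i, line in enumerate(lines[:-1]):
        if line.startswith(">") and "<PUBCHEM_COMPOUND_CID>" in line:
            return int(lines[i + 1].strip())
    return None


# Toy input: three records, two of which carry the same (made-up) CID.
TOY_SDF = (
    "toy record A\n> <PUBCHEM_COMPOUND_CID>\n101\n\n$$$$\n"
    "toy record B\n> <PUBCHEM_COMPOUND_CID>\n102\n\n$$$$\n"
    "toy record A, second copy\n> <PUBCHEM_COMPOUND_CID>\n101\n\n$$$$\n"
)

counts = Counter(read_cid(rec) for rec in iter_sdf_records(io.StringIO(TOY_SDF)))
repeated_cids = sorted(cid for cid, n in counts.items() if n > 1)
print(counts, repeated_cids)
```

For a compressed download, the same generator could be fed with `gzip.open(path, "rt")` instead of the in-memory stream.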
In order to process the large number of entries in a reasonable time, we designated ten servers from the NMRbox platform for the computational task. Each server provided 30 CPU cores (for a total of 300 computing cores). A preprocessing module was developed to split the input SDF files such that 18 SDF files were assigned to one CPU core.
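The figure of 18 files per core follows from dividing the 5,260 input files over the 300 available cores and rounding up. The sketch below reproduces that arithmetic and builds the corresponding batches. The file names are placeholders, and the actual preprocessing module may assign files differently.

```python
import math

n_files = 5_260            # compressed SDF files in the download
n_cores = 10 * 30          # ten servers with 30 CPU cores each

# Smallest per-core batch size that fits all files onto the available cores.
files_per_core = math.ceil(n_files / n_cores)

# Placeholder file names; the real archive uses its own naming scheme.
files = [f"input_{i:04d}.sdf.gz" for i in range(n_files)]

# Consecutive slices of at most files_per_core files, one slice per core.
batches = [files[i:i + files_per_core] for i in range(0, n_files, files_per_core)]

print(files_per_core, len(batches), len(batches[-1]))
```

With 18 files per batch, 293 batches are needed, so a few of the 300 cores remain idle and the last batch holds only four files.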
We implemented a distributed computing paradigm wherein ALATIS binary files were executed as a batch job on the correspondingly assigned set of SDF files. The size of the data archive made it impractical to store the reprocessed data (ALATIS output) and associated information as individual files on local storage devices. Therefore, we utilized the open-source PostgreSQL database [https://www.postgresql.org/] to archive the data. We designed a relational database with three tables: (i) ALATIS outputs. This table consists of 11 columns for compound IDs (CID), input structure file, output in SDF format, output in PDB format, output in XYZ format, standard InChI string, formula as generated by ALATIS, warning file, error file, mapping between the input and output atom labels, and an additional map file for cases in which the input structure file contains multiple compounds 5 . (ii) PubChem metadata. This table contains four columns for storing the CID, molecular formula, molecular weight, and the compound’s mass. (iii) PubChem names. This table provides the association between the CID and the compound names and synonyms.
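The three-table layout can be sketched as follows. To keep the example self-contained, it uses an in-memory SQLite database from the Python standard library as a stand-in for PostgreSQL. All table names, column names, and column types are assumptions chosen to mirror the description above, and the inserted row is a made-up placeholder.

```python
import sqlite3

# Hypothetical schema mirroring the three tables described in the text.
SCHEMA = """
CREATE TABLE alatis_outputs (
    cid                INTEGER PRIMARY KEY,
    input_structure    TEXT,
    output_sdf         TEXT,
    output_pdb         TEXT,
    output_xyz         TEXT,
    standard_inchi     TEXT,
    alatis_formula     TEXT,
    warnings           TEXT,
    errors             TEXT,
    atom_label_map     TEXT,
    multi_compound_map TEXT
);
CREATE TABLE pubchem_metadata (
    cid               INTEGER PRIMARY KEY,
    molecular_formula TEXT,
    molecular_weight  REAL,
    exact_mass        REAL
);
CREATE TABLE pubchem_names (
    cid  INTEGER,
    name TEXT
);
"""

conn = sqlite3.connect(":memory:")   # SQLite stand-in, not PostgreSQL
conn.executescript(SCHEMA)


def column_count(table: str) -> int:
    """Number of columns in a table, read from SQLite's table metadata."""
    return len(conn.execute(f"PRAGMA table_info({table})").fetchall())


# Placeholder row to show how a CID ties the tables together.
conn.execute("INSERT INTO alatis_outputs (cid, standard_inchi) VALUES (1, 'placeholder')")
conn.execute("INSERT INTO pubchem_names (cid, name) VALUES (1, 'toy compound')")
row = conn.execute(
    "SELECT o.cid, n.name FROM alatis_outputs o JOIN pubchem_names n ON n.cid = o.cid"
).fetchone()
print(column_count("alatis_outputs"), column_count("pubchem_metadata"), row)
```

Keying every table on the CID lets a single query assemble the ALATIS output, the PubChem metadata, and the names for one entry.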
The ALATIS website is equipped with a search engine for these tables, and users can query on a PubChem CID or compound name. We note that PubChem does not provide a specific name for a compound: instead, the synonym file contains a list of synonyms for every PubChem CID. Because these lists may contain several hundred synonyms for a compound (for example, PubChem CID 23978 lists 2,286 synonyms), it is not time-efficient to search the entire lists of synonyms when users query a compound name. Therefore, in order to provide a fast search engine, we utilize only the first provided synonym for each entry as the representative name of the compound. However, the complete list of synonyms for each entry is provided on the associated ALATIS webpage.
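The first-synonym strategy can be sketched as two lookup structures: a small name index built from the first synonym of each CID, and a full synonym list that is consulted only when an entry page is displayed. The sketch assumes that the synonym file has one tab-separated CID and synonym per line, with synonyms for a CID listed consecutively. The CIDs and names are toy values, and the function names are hypothetical.

```python
from collections import defaultdict

# Toy lines in the assumed "CID<TAB>synonym" layout.
TOY_LINES = [
    "1\talpha compound",
    "1\tcompound A",
    "2\tbeta compound",
    "2\tcompound B",
]


def build_indexes(lines):
    """Return (name -> CID index using first synonyms only, CID -> all synonyms)."""
    name_to_cid = {}
    all_synonyms = defaultdict(list)
    for line in lines:
        cid_text, name = line.rstrip("\n").split("\t", 1)
        cid = int(cid_text)
        if cid not in all_synonyms:
            # First synonym seen for this CID becomes its representative name.
            name_to_cid.setdefault(name.lower(), cid)
        all_synonyms[cid].append(name)
    return name_to_cid, dict(all_synonyms)


name_to_cid, all_synonyms = build_indexes(TOY_LINES)


def lookup(query: str):
    """Case-insensitive search restricted to representative names."""
    return name_to_cid.get(query.strip().lower())


print(lookup("Alpha Compound"), lookup("compound A"), all_synonyms[1])
```

The trade-off is visible in the second lookup: a query that matches only a later synonym finds nothing in the fast index, although that synonym is still listed on the entry page.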
Because the database contains more than 91 million entries, it was not practical to pre-generate the entry webpages for storage. Instead, we utilized the Jinja2 templating engine [http://jinja.pocoo.org/] to organize and display ALATIS webpages. This template-based web design allows us to create the webpage for an entry from the PostgreSQL database upon request, without the need to physically archive the webpages.
ALATIS is available as a public web service [http://alatis.nmrfam.wisc.edu/] that runs on a high-throughput computing platform 15 . The executable binary of the software program is available through the NMRbox project 11 , and the source code is available through GitHub [https://github.com/htdashti/ALATIS].
Unique compound identifiers (standard InChI) and all atom identifiers for the analyzed PubChem entries are available through a search engine on the ALATIS website [http://alatis.nmrfam.wisc.edu/]. The processed entries are also available through the Open Science Framework (Data Citation 1: https://doi.org/10.17605/OSF.IO/JZDYQ).
How to cite this article: Dashti, H. et al. Automated evaluation of consistency within the PubChem Compound database. Sci. Data. 6:190023 https://doi.org/10.1038/sdata.2019.23 (2019).
Data Citation 1: Dashti, H., Wedell, J. R., Westler, W. M., Markley, J. L. & Eghbalnia, H. R. Open Science Framework https://doi.org/10.17605/OSF.IO/JZDYQ (2018).
We thank the PubChem support team for answering our inquiries and facilitating easy download of the data entries. This study made use of the National Magnetic Resonance Facility at Madison, which is supported by National Institutes of Health (NIH) grant P41GM103399. H.D., J.R.W., and H.R.E. were supported in part by the National Center for Biomolecular NMR Data Processing and Analysis, which is supported by NIH grant P41GM111135 (NIGMS). H.D. is supported in part by National Heart Lung and Blood Institute T32 HL007575. Marvin (Marvin 16.7.11, 2016, ChemAxon http://www.chemaxon.com) was used primarily for drawing, displaying, and characterizing chemical structures, except as otherwise indicated.