Background & Summary

The rapid identification of small molecules in metabolomics, exposomics, and environmental monitoring studies increasingly involves the use of high resolution mass spectrometry (HRMS) and non-targeted analysis (NTA) techniques1,2,3. NTA experiments generally incorporate complementary software (open and commercial tools) and chemistry databases to enable effective and accurate compound identification4,5,6. Different instrumentation and NTA approaches require different tools for effective annotation. For example, compound identification strategies based on MS1 data (yielding exact mass and molecular formula) typically rely on chemical metadata (e.g. the number of data sources or literature references linked to a chemical)6, whereas those based on MS/MS data (yielding MS1 data and fragment ions) are enhanced by matching of empirical fragmentation spectra with library spectra7,8. Recent studies have shown that melding of these approaches leads to improved identification accuracy over spectral matching alone7,9. Thus, coupling robust metadata with spectral matching capabilities must now be the focus of research efforts to maximize yield from NTA identification techniques.

When considering the number of known compounds in public databases, the availability of library MS/MS spectra is quite limited2,10. Open spectral libraries such as MassBank (, MoNA (, METLIN12, and mzCloud ( are rich with empirical spectra, but do not yet fully cover the broad chemical space that may be monitored in NTA studies. Instrument vendors further provide empirical spectral data for users to purchase (with matching algorithms executed within vendor software), but access and coverage remains limited13. To address these gaps, researchers have developed in silico fragmenters and MS/MS prediction models, including MetFrag7, CSI Finger-ID14, and CFM-ID8, among a number of others available commercially (e.g. ACD/MS Fragmenter15, Mass Frontier16). Use of predicted MS/MS spectra in identification workflows has proven effective5, but requires the incorporation of command line utilities and/or on-the-fly processing of data for single chemicals. Prediction of MS/MS spectra en masse and mapping pre-computed spectra to structures and metadata within chemistry databases can enhance identification schemes and enable integration into various software systems and workflows.

The US EPA’s DSSTox database is a comprehensive chemistry resource, containing more than 760,000 distinct chemical substances, associated chemical structures, and metadata17, and serves as the underpinning for EPA’s CompTox Chemicals Dashboard ( Among its many functionalities, the Dashboard enables searching of masses and formulae generated from HRMS experiments. The data and algorithms associated with Dashboard searching have been shown to outperform the much larger ChemSpider database (ca. 67 million chemicals as of July 2018) using data source ranking for the identification of unknowns6. As an example, consider a search for the formula C15H16O2 which produces a total of 263 results. Rank ordering the results based on data source or literature reference counts brings the most likely chemical (Bisphenol A) to the top of the search results (Fig. 1).

Fig. 1
figure 1

Search results from an MS-Ready Formula search of C15H16O2 Candidate chemical structures are denoted by a DTXSID and preferred name and contain linked metadata such as the Number of Sources, CPDat Count, and PubMed Ref. Count. Rank ordering by metadata brings the most likely chemicals to the top of the search results list.

Additional metadata are now being optimized in a combined ranking scheme to further improve identifications. To improve Dashboard capabilities that support NTA research, we are generating, storing, and mapping predicted MS/MS spectra for all structures in the database.

Herein we describe: (1) the generation and storage of predicted MS/MS spectra for all chemical structures contained with DSSTox; (2) the validation and mapping of spectra to structures and substances; and (3) the publication of the comprehensive dataset for public dissemination (including the complete SQL database and schema). MS/MS spectra were predicted using competitive fragmentation modelling (CFM) and the open command line tools developed by Allen et al.8,19,20 and named CFM-ID (available here: All remaining data are sourced from the US EPA’s DSSTox database and available via the EPA’s CompTox Chemicals Dashboard ( Open and accessible data, integrated and provided in this dataset, enables NTA practitioners an improved means of small molecule identification when using MS/MS data from HRMS experiments.


Generation of predicted MS/MS Data

To maximize use of predicted MS/MS data, both for our processes13,21 and the mass spectral community at large, “MS-Ready” structures were used in the prediction model. An MS-Ready structure represents the form of a structure that would be observed via HRMS; these structures are de-salted, de-solvated, and processed such that chemical mixtures are separated22. These structures are stored in the DSSTox database with unique chemical identifiers (DTXCIDs) and linked to unique substance identifiers (DTXSIDs) to enable use of the structures and associated substance-level metadata in HRMS applications.

MS/MS spectra were predicted using CFM-ID with pre-trained parameters as defined by CFM-ID literature and described by Allen et al.8,19,20. All source code was downloaded from the CFM-ID SourceForge site: The input data were 843,113 MS-Ready chemical structures as SMILES strings. Additional data associated with chemical structures included DTXCIDs, molecular formulas, standard InChIKeys generated using the Indigo Toolkit (, and monoisotopic masses. The obtained chemicals were saved in a local tab separated file.

MS/MS spectra were generated for each structure in the following ionization modes: electrospray ionization in both positive and negative modes (ESI+ and ESI-, respectively) at three collision energies (Energy0–10 eV, Energy1–20 eV, and Energy2–40 eV), and electron impact ionization (EI). Spectra were predicted using standard parameters provided with the software and available via the CFM-ID SourceForge site with no limits placed on the number of MS/MS spectra calculated for a given structure.

The mass spectra calculations were performed on a large-scale Linux cluster at the US EPA National Computer Center ( A master shell script was used to generate over 4,000 Slurm ( queueing system run scripts that calculated EI, ESI+, and ESI- MS/MS spectra for 200 chemicals each. A small fraction of chemicals (<700) was excluded from CFM-ID calculations due to missing data and/or structural issues expected to fail in processing (such as SMILES notations of radicals, e.g. CC(C=C)=C[Al] |^3:5|). An additional 56 chemicals failed during calculation of all three prediction types. This was believed to occur due to the structural constraints of the models and ionization types as many of the failed chemicals were permanently charged species and metals (“Chemical Structures that failed during mass spectral prediction”, data available at Mode-specific failures occurred as follows: ~1000 chemicals failed during calculation of EI spectra, ~2000 failed during calculation of ESI+ spectra, and ~18,000 failed during calculation of ESI- spectra. The substantially higher number of failures occurring in ESI- mode are primarily driven by permanently charged species unlikely to ionize in negative electrospray.

For each type of mass spectra (EI, ESI+ and ESI-), the log files were merged and a Python script was used to separate the contents into a final output file (metadata followed by mass spectrum data for each chemical) and an error file (CFM-ID error messages for failed and timed out calculations). The final output file was a .dat ASCII file for each ionization mode (“Predicted EI-MS Spectra of CompTox Chemicals Dashboard Structures”, “Predicted MS/MS Spectra in ESI-positive mode of CompTox Chemicals Dashboard Structures”, “Predicted MS/MS Spectra in ESI-negative mode of CompTox Chemicals Dashboard Structures”, data available at

Data storage and database structure

The raw output of the predicted MS/MS data described above required parsing and manipulation in order to generate MySQL loadable data. A Java application was developed to parse the data and generate MySQL load statements to load the database (described below). The resulting database required ~137 GB of storage and took 10 hours to load.

Mapping to chemical metadata with DSSTox and associated databases

MS-Ready structures, denoted by individual DTXCIDs, are stored in a structure relationship mapping table linking MS-Ready structures to original DSSTox structures and associated chemical substances (DTXSIDs). Chemical substances are associated with a variety of identifiers (e.g. InChI strings and keys, synonyms, database identifiers) and data (e.g. physicochemical properties, toxicity data, bioactivity data). Additional details regarding the relationship between DTXCIDs and DTXSIDs are explained in more detail elsewhere18.

The CompTox Chemicals Dashboard ( enables users to search and peruse the data contained within multiple databases (see Table 2 in Williams, et al.18 for a list of all databases). Many of the data contained within these databases are of value for ranking candidate chemicals in search results, including the number of data sources associated with a chemical in PubChem (, the number of associated articles in PubMed (, and the number of unique consumer product categories associated with a chemical in the Chemical and Products Database (CPDat; As discussed above, ranking based on such metadata sources has already proven to be a valuable approach6.

To facilitate search and identification of unknowns using HRMS data, an export file from DSSTox was generated to include all DTXCIDs used to generated MS/MS data and valuable metadata, described below. Access to both substance-level metadata and predicted MS/MS data is made possible through the linked DTXCID identifier and database structure.

Data Records

The data described in this work is available in three primary formats: a SQL relational database, .dat ASCII files containing all predicted spectra, and as a complete export file in comma-separated format (.csv). Two types of data are presented to facilitate compound identification: predicted MS/MS spectral data and chemical metadata, described below and defined as data linked to a chemical structure. Access and use of the data are enabled by the inclusion of unique chemical identifiers (DTXCIDs) within all records to connect chemical structures to their associated data.

Spectral data

MS/MS spectra were generated for each structure in the following ionization modes: ESI+, ESI-, and EI. Each data record generated for a structure in ESI+ and ESI- contains MS/MS predictions for three collision energy levels while each record for EI contains results from a single collision energy only. Collision energy levels predicted for ESI are as follows: Energy0 (10 eV), Energy1 (20 eV), and Energy2 (40 eV). Preceding spectral predictions for a given structure are the following chemical structure metadata fields (see an example in Fig. 2):

  • Date/time: indicating the date and time the prediction was computed

  • CFM-ID version: indicating the version of the command line tools

  • DTXCID: the unique DSSTox chemical identifier for the structure

  • SMILES: the MS-Ready SMILES for the structure

  • MASS: the neutral MS-Ready monoisotopic mass

  • FORMULA: the MS-Ready molecular formula

  • INCHI_KEY: the standard InChI Key for the structure

Fig. 2
figure 2

Chemical structure metadata information followed by predicted MS/MS data included in the .dat ASCII prediction files using the example of DTXCID80539702 in ESI-positive mode. Only the first ~50 lines of predictions are shown and structural annotations with SMILES succeeding predictions are not included in the image.

Immediately following the chemical structure metadata fields are predicted MS/MS fragments in order of collision energy level (Energy0, Energy1, Energy2), when appropriate. Provided within each collision energy level are all fragments generated according to the CFM model, ordered from lowest m/z to highest with a single fragment per row of the table. A fragment is indicated by its m/z, relative intensity, structural annotation number, and annotation-specific intensities in parentheses, respectively. When multiple structural annotations are predicted for a single fragment, the relative intensities of each are provided sequentially and tab-separated in parentheses (see m/z 150.0105033 in Fig. 2 for an example). Fragment structural annotations are defined using SMILES at the end of each prediction (not pictured in the example Fig. 2). The files “spectra_EI-MS.dat” (“Predicted EI-MS Spectra of CompTox Chemicals Dashboard Structures”, data available at, “spectra_ESI-MSMS-neg.dat” (“Predicted MS/MS Spectra in ESI-negative mode of CompTox Chemicals Dashboard Structures”, data available at, and “spectra_ESI-MSMS-pos.dat” (Predicted MS/MS Spectra in ESI-positive mode of CompTox Chemicals Dashboard Structures”, data available at contain all predictions consecutively within the.dat files. Entries are separated by the presence of the chemical structure metadata fields described above.

SQL database

In addition to raw files containing the predicted MS/MS spectra, data was stored in a SQL relational database (“Database of Predicted Spectra of CompTox Chemicals Dashboard Structures”, data available at Each chemical structure processed through CFM-ID resulted in MS/MS data from multiple ionization modes and collision energies. This collection of data (chemical structure, identifier, fragments and intensities) is identified as a single job.

These relationships are reflected in the Enhanced Entity Relationship (EER) Diagram (see Fig. 3) and provided as an SQL schema in a separate file (“Database Schema File of Predicted Spectra of CompTox Chemicals Dashboard Structures”, data available at The “chemical” table contains the list of all processed chemicals, denoted by a unique DTXCID. The “job” table represents the processing of a chemical for a selected spectrum and provides links into the “peak” and “fragment” tables. In addition, the “peak” table is linked to the “fragintensity” table which contains the fragment intensities and structural annotations for a given peak.

Fig. 3
figure 3

Enhanced Entity Relationship (EER) Diagram of the MySQL database created to host predicted MS/MS data generated using CFM-ID.

Access to the database is made available through a Python script. In addition to querying the database the script is also capable of ranking the matched chemicals according to their cosine dot product score25,26. Relevant information, including the mass of the parent ion, the DTXCID of the parent mass, the masses and intensities of the fragments, and the collision energy, are all provided by the querying script to perform the ranking. The MySQL database is accessed through the PyMySQL module in Python. A query is constructed to combine the fragmentation information from different tables, based on an initial search of the mass of the parent ion or the chemical formula. When the mass is searched, an accuracy level (typically within 10 ppm) is provided. The query will then search for all chemicals with masses within the defined accuracy window, and the predicted fragments for all three collision energies are provided. This information is then loaded into a DataFrame using the Pandas27 module in Python, and further calculations, including relative intensities, cosine dot product, and ranking of the matched chemicals are performed.

Chemical metadata

Chemical metadata linked through the DTXCID are provided for all records for which predicted MS/MS spectra exist. An example of chemical metadata for a subset of structures is provided in Table 1. Metadata are provided in the “CFM-ID_metadata_DTXCID.csv” file for the following categories (“Chemical Metadata from the CompTox Chemicals Dashboard Linked to Predicted Spectra”, data available at

  • DTXCID: the unique DSSTox chemical identifier for the structure

  • DTXSID: the unique DSSTox substance identifier

  • Preferred Name

  • Chemical Abstracts Service Registry Number (CASRN)

  • MS-Ready Molecular Formula

  • MS-Ready Monoisotopic Mass

  • MS-Ready SMILES

  • Data Sources: the number of data sources in which a chemical is found within EPA’s DSSTox database

  • PubMed Reference Count: the number of references within PubMed associated with a given DTXSID queried using Medical Subject Heading (MeSH) annotation

  • PubChem Data Sources: the number of data sources within PubChem for a given DTXSID

  • CPDat Product Occurrence Count: the number of unique consumer products in which a chemical has been reported24

  • Presence in the following lists: NORMAN Merged Suspect List: SusDat28,29, STOFF-IDENT Database (!home), ToxCast30.

Table 1 Chemical metadata for a subset of chemicals defined by DTXCID.

Technical Validation

The reliability and accuracy of predicted MS/MS spectra using CFM-ID have been reviewed and validated in multiple publications19,20,26 and subsequent applications5,9. Therefore, to verify the accuracy and ultimate utility of the present work, simple and small scale comparisons were conducted between predictions generated using the CFM-ID web application ( and our own implementation of the command line tools. MS/MS spectra for three randomly selected structures in all three ionization types (for a total of nine comparison points) were predicted using each method and saved as text files (Supplementary Files 1 and 2). Supplementary Files 1 and 2 present the output data copied from each source for a single collision energy for each ionization type. The CFM-ID web application truncates the number of predicted spectra output19 and as such slight differences in predicted relative intensities and total number of spectra between the web application and our implementation were expected. As expected, comparison indicated exact output matching for smaller structures with fewer fragments (e.g. DTXCID107640/OC(CC(O)=O)C(O)=O) and highly similar outputs when spectra were truncated in the web application output (e.g. DTXCID00224961/NC(N)=NCCCC(NC=O)C(O)=O). In the instances where exact replication was not observed, only the relative intensities differ and do so by ~1%. Predicted fragments in all cases have identical m/z values between the two sources, indicating agreement between our implementation and the web application output.

Chemical metadata validation results from structural curation efforts and mapping within DSSTox between structural identifiers. To certify appropriate mapping between predicted spectra, chemical structures, and selected chemical metadata, a semi-automated process is conducted to link unique chemical identifiers with curated data. Mappings between MS-Ready DTXCIDs and linked DTXSIDs are stored in a structure relationship mapping table to facilitate access to pertinent chemical metadata associated with a DTXSID. The DSSTox database structure, MS-Ready linkages, and chemistry data have been previously described and validated18.

Usage Notes

Predicted MS/MS data are often used by researchers to compare an unidentified chemical (observed via HRMS) to a list of potential candidate chemicals. Empirically collected MS/MS data are scored against predicted spectra of a list of candidate chemicals to identify the best match. Spectral match scores provide an important piece of confirmatory data towards ultimate compound identification. A match score can be calculated between two sets of peaks using a variety of mathematical formulas25,26,31, any of which can be executed with simple queries of the present data. The most common use case will require a user to first query the database (or exported file converted to a data frame, for example) based on the parent mass or molecular formula of interest (i.e. observed via HRMS experimentation). The resulting set of structures from the defined search parameters will contain predicted MS/MS data. These data must then be parsed, and ionization mode identified (if desired) in order to match and ultimately score peaks. Here we provide the means to conduct these searches using code developed in Python and match scores computed using the cosine dot product ( The matched chemicals, along with their fragments and the corresponding intensities at specific collision energies, are fed into a Python script that matches predicted with experimental spectra. A mass accuracy window (within a few ppm) is needed to search for matches between the fragments of the two spectra. Fragments that fall within this accuracy window are considered a match and are used in the final calculation of the cosine dot product score. The calculation as implemented in our work is computed at all three predicted energy levels. The matched chemicals are then ranked based on individual energy scores or their sum, depending on the user’s preference.

Another potentially less common use case with these data involves a user interested in the predicted MS/MS spectra of a single structure. In this case again, a simple query of the database using structural identifiers will return the desired result. Ultimately, users will be able to conduct the aforementioned queries and calculations within a web interface via the CompTox Chemicals Dashboard. Development is in progress as of December 2018 and the prototype (with the scoring algorithm implemented in Java) enables users to input a mass or formula along with observed MS/MS data and query the database for matches. Users with experience in Python and/or with data requiring customization of the match code will find the Python code of greater value while the Dashboard represents the most accessible means with which to access these data.

Additional chemical metadata linked via structural identifiers presents more options for users to increase the certainty of identifications of unknowns. These data can be accessed directly by querying the full comma-separated export using candidate chemicals. Once retrieved, data source counts associated with candidate chemicals can be used to rank within the set: the greater the number of data sources the more likely the chemical would occur in a sample6,32. Preliminary research indicates that data sources contained within DSSTox merged with CFM-ID match scores substantially boosts the number of correct identifications from unknowns. Optimization of combined scoring metrics is under development for implementation via the Dashboard.