Linking in silico MS/MS spectra with chemistry data to improve identification of unknowns

Confident identification of unknown chemicals in high resolution mass spectrometry (HRMS) screening studies requires cohesive workflows and complementary data, tools, and software. Chemistry databases, screening libraries, and chemical metadata have become fixtures in identification workflows. To increase confidence in compound identifications, the use of structural fragmentation data collected via tandem mass spectrometry (MS/MS or MS2) is vital. However, the availability of empirically collected MS/MS data for identification of unknowns is limited. Researchers have therefore turned to in silico generation of MS/MS data for use in HRMS-based screening studies. This paper describes the generation en masse of predicted MS/MS spectra for the entirety of the US EPA’s DSSTox database using competitive fragmentation modelling and a freely available open source tool, CFM-ID. The generated dataset comprises predicted MS/MS spectra for ~700,000 structures, and mappings between predicted spectra, structures, associated substances, and chemical metadata. Together, these resources facilitate improved compound identifications in HRMS screening studies. These data are accessible via an SQL database, a comma-separated export file (.csv), and EPA’s CompTox Chemicals Dashboard.

www.nature.com/scientificdata
vendors further provide empirical spectral data for users to purchase (with matching algorithms executed within vendor software), but access and coverage remain limited 13 . To address these gaps, researchers have developed in silico fragmenters and MS/MS prediction models, including MetFrag 7 , CSI Finger-ID 14 , and CFM-ID 8 , among a number of others available commercially (e.g. ACD/MS Fragmenter 15 , Mass Frontier 16 ). Use of predicted MS/MS spectra in identification workflows has proven effective 5 , but requires the incorporation of command line utilities and/or on-the-fly processing of data for single chemicals. Prediction of MS/MS spectra en masse and mapping pre-computed spectra to structures and metadata within chemistry databases can enhance identification schemes and enable integration into various software systems and workflows.
The US EPA's DSSTox database is a comprehensive chemistry resource, containing more than 760,000 distinct chemical substances, associated chemical structures, and metadata 17 , and serves as the underpinning for EPA's CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard) 18 . Among its many functionalities, the Dashboard enables searching of masses and formulae generated from HRMS experiments. The data and algorithms associated with Dashboard searching have been shown to outperform the much larger ChemSpider database (ca. 67 million chemicals as of July 2018) using data source ranking for the identification of unknowns 6 . As an example, consider a search for the formula C15H16O2, which produces a total of 263 results. Rank ordering the results based on data source or literature reference counts brings the most likely chemical (Bisphenol A) to the top of the search results (Fig. 1).
Additional metadata are now being optimized in a combined ranking scheme to further improve identifications. To improve Dashboard capabilities that support non-targeted analysis (NTA) research, we are generating, storing, and mapping predicted MS/MS spectra for all structures in the database.
Herein we describe: (1) the generation and storage of predicted MS/MS spectra for all chemical structures contained within DSSTox; (2) the validation and mapping of spectra to structures and substances; and (3) the publication of the comprehensive dataset for public dissemination (including the complete SQL database and schema). MS/MS spectra were predicted using competitive fragmentation modelling (CFM) and the open command line tools developed by Allen et al. 8,19,20 and named CFM-ID (available here: http://sourceforge.net/projects/cfm-id). All remaining data are sourced from the US EPA's DSSTox database and available via the EPA's CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard). Open and accessible data, integrated and provided in this dataset, give NTA practitioners an improved means of small molecule identification when using MS/MS data from HRMS experiments.

Generation of predicted MS/MS Data.
To maximize use of predicted MS/MS data, both for our processes 13,21 and the mass spectral community at large, "MS-Ready" structures were used in the prediction model. An MS-Ready structure represents the form of a structure that would be observed via HRMS; these structures are de-salted, de-solvated, and processed such that chemical mixtures are separated 22 . These structures are stored in the DSSTox database with unique chemical identifiers (DTXCIDs) and linked to unique substance identifiers (DTXSIDs) to enable use of the structures and associated substance-level metadata in HRMS applications.
MS/MS spectra were predicted using CFM-ID with pre-trained parameters as defined by CFM-ID literature and described by Allen et al. 8,19,20 . All source code was downloaded from the CFM-ID SourceForge site: http://sourceforge.net/projects/cfm-id. The input data were 843,113 MS-Ready chemical structures as SMILES strings. Additional data associated with chemical structures included DTXCIDs, molecular formulas, standard InChIKeys generated using the Indigo Toolkit (http://lifescience.opensource.epam.com/indigo/), and monoisotopic masses. The resulting chemical records were saved in a local tab-separated file.
MS/MS spectra were generated for each structure in the following ionization modes: electrospray ionization in both positive and negative modes (ESI+ and ESI-, respectively) at three collision energies (Energy0: 10 eV, Energy1: 20 eV, and Energy2: 40 eV), and electron impact ionization (EI). Spectra were predicted using standard parameters provided with the software and available via the CFM-ID SourceForge site, with no limits placed on the number of MS/MS spectra calculated for a given structure.
The mass spectra calculations were performed on a large-scale Linux cluster at the US EPA National Computer Center (https://www.epa.gov/greeningepa/national-computer-center). A master shell script was used to generate over 4,000 Slurm (https://slurm.schedmd.com/) queueing system run scripts that calculated EI, ESI+, and ESI- MS/MS spectra for 200 chemicals each. A small fraction of chemicals (<700) was excluded from CFM-ID calculations due to missing data and/or structural issues expected to fail in processing (such as SMILES notations of radicals, e.g. CC(C=C)=C[Al] |^3:5|). An additional 56 chemicals failed during calculation of all three prediction types. These failures are believed to stem from the structural constraints of the models and ionization types, as many of the failed chemicals were permanently charged species and metals ("Chemical Structures that failed during mass spectral prediction", data available at https://doi.org/10.23645/epacomptox.7776212.v1) 23 . Mode-specific failures occurred as follows: ~1,000 chemicals failed during calculation of EI spectra, ~2,000 failed during calculation of ESI+ spectra, and ~18,000 failed during calculation of ESI- spectra. The substantially higher number of failures in ESI- mode is primarily driven by permanently charged species unlikely to ionize in negative electrospray.
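The batching step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the master shell script used in this work: the file naming scheme, the job names, and the `PREDICT_CMD` placeholder are hypothetical, and the actual CFM-ID invocation and its arguments are deliberately not reproduced. Only the batch size of 200 chemicals per run script is taken from the text.

```python
from typing import Iterator, List, Sequence, Tuple

BATCH_SIZE = 200  # chemicals per run script, as stated in the text


def batches(items: Sequence[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def run_script(job_index: int, input_file: str) -> str:
    """Return the text of one Slurm run script (names are hypothetical)."""
    name = f"cfmid_{job_index:05d}"
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={name}",
        f"#SBATCH --output={name}.log",
        "# PREDICT_CMD is a hypothetical placeholder, not the real CFM-ID call",
        f"PREDICT_CMD {input_file}",
    ]
    return "\n".join(lines) + "\n"


def plan_jobs(identifiers: Sequence[str]) -> List[Tuple[str, List[str], str]]:
    """Return (input file name, batch of identifiers, script text) per batch."""
    jobs = []
    for index, batch in enumerate(batches(identifiers)):
        input_file = f"batch_{index:05d}.tsv"
        jobs.append((input_file, batch, run_script(index, input_file)))
    return jobs
```

If all 843,113 input structures were batched this way, `plan_jobs` would return 4,216 jobs, the last holding 113 chemicals, which is consistent with the "over 4,000" run scripts reported; the exclusions noted above would reduce this count slightly.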
For each type of mass spectra (EI, ESI+ and ESI-), the log files were merged and a Python script was used to separate the contents into a final output file (metadata followed by mass spectrum data for each chemical) and an error file (CFM-ID error messages for failed and timed-out calculations). The final output file was a .dat ASCII file for each ionization mode ("Predicted EI-MS Spectra of CompTox Chemicals Dashboard Structures", "Predicted MS/MS Spectra in ESI-positive mode of CompTox Chemicals Dashboard Structures", "Predicted MS/MS Spectra in ESI-negative mode of CompTox Chemicals Dashboard Structures", data available at https://doi.org/10.23645/epacomptox.7776212.v1) 23 .
Data storage and database structure. The raw output of the predicted MS/MS data described above required parsing and manipulation in order to generate MySQL loadable data. A Java application was developed to parse the data and generate MySQL load statements to load the database (described below). The resulting database required ~137 GB of storage and took 10 hours to load.
Mapping to chemical metadata with DSSTox and associated databases. MS-Ready structures, denoted by individual DTXCIDs, are stored in a structure relationship mapping table linking MS-Ready structures to original DSSTox structures and associated chemical substances (DTXSIDs). Chemical substances are associated with a variety of identifiers (e.g. InChI strings and keys, synonyms, database identifiers) and data (e.g. physicochemical properties, toxicity data, bioactivity data). Additional details regarding the relationship between DTXCIDs and DTXSIDs are provided elsewhere 18 .
The CompTox Chemicals Dashboard (https://comptox.epa.gov/dashboard/) enables users to search and peruse the data contained within multiple databases (see Table 2 in Williams, et al. 18 for a list of all databases). Many of the data contained within these databases are of value for ranking candidate chemicals in search results, including the number of data sources associated with a chemical in PubChem (https://pubchem.ncbi.nlm.nih.gov/), the number of associated articles in PubMed (https://www.ncbi.nlm.nih.gov/pubmed/), and the number of unique consumer product categories associated with a chemical in the Chemical and Products Database (CPDat; https://www.epa.gov/chemical-research/chemical-and-products-database-cpdat) 24 . As discussed above, ranking based on such metadata sources has already proven to be a valuable approach 6 .
To facilitate search and identification of unknowns using HRMS data, an export file from DSSTox was generated to include all DTXCIDs used to generate MS/MS data, together with valuable metadata, described below. Access to both substance-level metadata and predicted MS/MS data is made possible through the linked DTXCID identifier and database structure.

Data Records
The data described in this work are available in three primary formats: a SQL relational database, .dat ASCII files containing all predicted spectra, and a complete export file in comma-separated format (.csv). Two types of data are presented to facilitate compound identification: predicted MS/MS spectral data and chemical metadata, described below and defined as data linked to a chemical structure. Access and use of the data are enabled by the inclusion of unique chemical identifiers (DTXCIDs) within all records to connect chemical structures to their associated data.
Spectral data. MS/MS spectra were generated for each structure in the following ionization modes: ESI+, ESI-, and EI. Each data record generated for a structure in ESI+ and ESI- contains MS/MS predictions for three collision energy levels, while each record for EI contains results from a single collision energy only. Collision energy levels predicted for ESI are as follows: Energy0 (10 eV), Energy1 (20 eV), and Energy2 (40 eV). Preceding the spectral predictions for a given structure are chemical structure metadata fields (see an example in Fig. 2).
SQL database. In addition to raw files containing the predicted MS/MS spectra, data were stored in a SQL relational database ("Database of Predicted Spectra of CompTox Chemicals Dashboard Structures", data available at https://doi.org/10.23645/epacomptox.7776212.v1) 23 . Each chemical structure processed through CFM-ID resulted in MS/MS data from multiple ionization modes and collision energies. This collection of data (chemical structure, identifier, fragments and intensities) is identified as a single job.
These relationships are reflected in the Enhanced Entity Relationship (EER) Diagram (see Fig. 3) and provided as an SQL schema in a separate file ("Database Schema File of Predicted Spectra of CompTox Chemicals Dashboard Structures", data available at https://doi.org/10.23645/epacomptox.7776212.v1) 23 . The "chemical" table contains the list of all processed chemicals, denoted by a unique DTXCID. The "job" table represents the processing of a chemical for a selected spectrum and provides links into the "peak" and "fragment" tables. In addition, the "peak" table is linked to the "fragintensity" table which contains the fragment intensities and structural annotations for a given peak.
Access to the database is made available through a Python script. In addition to querying the database the script is also capable of ranking the matched chemicals according to their cosine dot product score 25,26 . Relevant information, including the mass of the parent ion, the DTXCID of the parent mass, the masses and intensities of the fragments, and the collision energy, are all provided by the querying script to perform the ranking. The MySQL database is accessed through the PyMySQL module in Python. A query is constructed to combine the fragmentation information from different tables, based on an initial search of the mass of the parent ion or the chemical formula. When the mass is searched, an accuracy level (typically within 10 ppm) is provided. The query will then search for all chemicals with masses within the defined accuracy window, and the predicted fragments for all three collision energies are provided. This information is then loaded into a DataFrame using the Pandas 27 module in Python, and further calculations, including relative intensities, cosine dot product, and ranking of the matched chemicals are performed.
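The parent-mass search described above reduces to a tolerance window expressed in parts per million. The sketch below shows that arithmetic on in-memory records; it is an illustration only and does not reproduce the published query script. The record layout (dictionaries with `dtxcid` and `mass` keys) and the function names are assumptions, and plain Python lists stand in for the MySQL tables and the Pandas DataFrame.

```python
from typing import Dict, List, Tuple


def ppm_window(mass: float, ppm: float = 10.0) -> Tuple[float, float]:
    """Return the (low, high) mass bounds for a tolerance given in ppm."""
    delta = mass * ppm / 1e6
    return mass - delta, mass + delta


def candidates_by_mass(records: List[Dict], observed_mass: float,
                       ppm: float = 10.0) -> List[Dict]:
    """Keep records whose monoisotopic mass lies inside the ppm window.

    `records` is a hypothetical in-memory stand-in for rows of the
    "chemical" table; the real column names are not given in the text.
    """
    low, high = ppm_window(observed_mass, ppm)
    return [r for r in records if low <= r["mass"] <= high]
```

At 10 ppm, an observed mass of 200.0 gives a window of 199.998 to 200.002. In a database setting the same two bounds could be passed as parameters to a range condition on the mass column.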
Chemical metadata. Chemical metadata linked through the DTXCID are provided for all records for which predicted MS/MS spectra exist. An example of chemical metadata for a subset of structures is provided in Table 1. Metadata are provided in the "CFM-ID_metadata_DTXCID.csv" file for the following categories ("Chemical Metadata from the CompTox Chemicals Dashboard Linked to Predicted Spectra", data available at https://doi.org/10.23645/epacomptox.7776212.v1) 23 :
• DTXCID: the unique DSSTox chemical identifier for the structure
• DTXSID: the unique DSSTox substance identifier
• Preferred Name

Technical Validation
The reliability and accuracy of predicted MS/MS spectra using CFM-ID have been reviewed and validated in multiple publications 19,20,26 and subsequent applications 5,9 . Therefore, to verify the accuracy and ultimate utility of the present work, simple and small-scale comparisons were conducted between predictions generated using the CFM-ID web application (http://cfmid.wishartlab.com/) and our own implementation of the command line tools. MS/MS spectra for three randomly selected structures in all three ionization types (for a total of nine comparison points) were predicted using each method and saved as text files (Supplementary Files 1 and 2). Supplementary Files 1 and 2 present the output data copied from each source for a single collision energy for each ionization type. The CFM-ID web application truncates the number of predicted spectra output 19 and as such slight differences in predicted relative intensities and total number of spectra between the web application and our implementation were expected. As expected, comparison indicated exact output matching for smaller structures with fewer fragments (e.g. DTXCID107640/OC(CC(O)=O)C(O)=O) and highly similar outputs when spectra were truncated in the web application output (e.g. DTXCID00224961/NC(N)=NCCCC(NC=O)C(O)=O). In the instances where exact replication was not observed, only the relative intensities differ and do so by ~1%. Predicted fragments in all cases have identical m/z values between the two sources, indicating agreement between our implementation and the web application output.
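A comparison of this kind can be expressed as a small check on two peak lists. The sketch below is illustrative only and is not the procedure used to produce the Supplementary Files; the function name, the representation of a spectrum as (m/z, relative intensity) pairs, and the rounding of m/z to five decimals are all assumptions.

```python
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, relative intensity)


def compare_spectra(a: List[Peak], b: List[Peak], mz_decimals: int = 5) -> Dict:
    """Report m/z agreement and the largest intensity gap between two spectra."""
    by_mz_a = {round(mz, mz_decimals): inten for mz, inten in a}
    by_mz_b = {round(mz, mz_decimals): inten for mz, inten in b}
    shared = by_mz_a.keys() & by_mz_b.keys()
    only_a = sorted(by_mz_a.keys() - by_mz_b.keys())
    only_b = sorted(by_mz_b.keys() - by_mz_a.keys())
    # Largest absolute intensity difference over peaks present in both spectra.
    max_diff = max((abs(by_mz_a[m] - by_mz_b[m]) for m in shared), default=0.0)
    return {
        "same_mz": not only_a and not only_b,
        "only_in_a": only_a,
        "only_in_b": only_b,
        "max_intensity_diff": max_diff,
    }
```

With intensities on a 0 to 100 relative scale, a `max_intensity_diff` near 1 with `same_mz` true would correspond to the situation described above, where m/z values agree and only relative intensities differ slightly.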
Chemical metadata validation results from structural curation efforts and mapping within DSSTox between structural identifiers. To certify appropriate mapping between predicted spectra, chemical structures, and selected chemical metadata, a semi-automated process is conducted to link unique chemical identifiers with curated data. Mappings between MS-Ready DTXCIDs and linked DTXSIDs are stored in a structure relationship mapping table to facilitate access to pertinent chemical metadata associated with a DTXSID. The DSSTox database structure, MS-Ready linkages, and chemistry data have been previously described and validated 18 .

Usage Notes
Predicted MS/MS data are often used by researchers to compare an unidentified chemical (observed via HRMS) to a list of potential candidate chemicals. Empirically collected MS/MS data are scored against predicted spectra of a list of candidate chemicals to identify the best match. Spectral match scores provide an important piece of confirmatory data towards ultimate compound identification. A match score can be calculated between two sets of peaks using a variety of mathematical formulas 25,26,31 , any of which can be executed with simple queries of the present data. The most common use case will require a user to first query the database (or exported file converted to a data frame, for example) based on the parent mass or molecular formula of interest (i.e. observed via HRMS experimentation). The resulting set of structures from the defined search parameters will contain predicted MS/MS data. These data must then be parsed, and ionization mode identified (if desired) in order to match and ultimately score peaks. Here we provide the means to conduct these searches using code developed in Python and match scores computed using the cosine dot product (https://github.com/USEPA/CFM-ID_generation_of_CompTox_Chemicals_Dashboard_Structures_Paper). The matched chemicals, along with their fragments and the corresponding intensities at specific collision energies, are fed into a Python script that matches predicted with experimental spectra. A mass accuracy window (within a few ppm) is needed to search for matches between the fragments of the two spectra. Fragments that fall within this accuracy window are considered a match and are used in the final calculation of the cosine dot product score. The calculation as implemented in our work is computed at all three predicted energy levels. The matched chemicals are then ranked based on individual energy scores or their sum, depending on the user's preference.
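The matching and scoring steps described above can be sketched as follows. This is a minimal illustration and not the code in the GitHub repository: it assumes an unweighted cosine score, a greedy nearest-peak pairing, a default fragment tolerance of 10 ppm, and a hypothetical candidate layout mapping each identifier to its spectra by energy label. Published variants of the dot product often weight intensities by m/z, which is omitted here.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def match_peaks(observed: List[Peak], predicted: List[Peak],
                ppm: float = 10.0) -> List[Tuple[float, float]]:
    """Pair each observed peak with the closest unused predicted peak in tolerance."""
    pairs, used = [], set()
    for mz_obs, int_obs in observed:
        tol = mz_obs * ppm / 1e6
        best = None  # (index of predicted peak, absolute m/z difference)
        for j, (mz_pred, _) in enumerate(predicted):
            diff = abs(mz_pred - mz_obs)
            if j not in used and diff <= tol and (best is None or diff < best[1]):
                best = (j, diff)
        if best is not None:
            used.add(best[0])
            pairs.append((int_obs, predicted[best[0]][1]))
    return pairs


def cosine_score(observed: List[Peak], predicted: List[Peak],
                 ppm: float = 10.0) -> float:
    """Unweighted cosine similarity; unmatched peaks add to the norms only."""
    norm_obs = math.sqrt(sum(i * i for _, i in observed))
    norm_pred = math.sqrt(sum(i * i for _, i in predicted))
    if norm_obs == 0.0 or norm_pred == 0.0:
        return 0.0
    dot = sum(a * b for a, b in match_peaks(observed, predicted, ppm))
    return dot / (norm_obs * norm_pred)


def rank_candidates(observed: List[Peak],
                    candidates: Dict[str, Dict[str, List[Peak]]],
                    ppm: float = 10.0) -> List[Tuple[str, float, Dict[str, float]]]:
    """Score every candidate at each energy and sort by the summed score."""
    rows = []
    for cid, spectra in candidates.items():
        scores = {energy: cosine_score(observed, peaks, ppm)
                  for energy, peaks in spectra.items()}
        rows.append((cid, sum(scores.values()), scores))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

Each returned row carries the per-energy scores alongside their sum, so a user can rank on either quantity, as described in the text.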
Another potentially less common use case with these data involves a user interested in the predicted MS/MS spectra of a single structure. In this case again, a simple query of the database using structural identifiers will return the desired result. Ultimately, users will be able to conduct the aforementioned queries and calculations within a web interface via the CompTox Chemicals Dashboard. Development is in progress as of December 2018 and the prototype (with the scoring algorithm implemented in Java) enables users to input a mass or formula along with observed MS/MS data and query the database for matches. Users with experience in Python and/or with data requiring customization of the match code will find the Python code of greater value, while the Dashboard represents the most accessible means with which to access these data.
Additional chemical metadata linked via structural identifiers present more options for users to increase the certainty of identifications of unknowns. These data can be accessed directly by querying the full comma-separated export using candidate chemicals. Once retrieved, data source counts associated with candidate chemicals can be used to rank within the set: the greater the number of data sources, the more likely the chemical would occur in a sample 6,32 . Preliminary research indicates that combining data source counts from DSSTox with CFM-ID match scores substantially boosts the number of correct identifications of unknowns. Optimization of combined scoring metrics is under development for implementation via the Dashboard.
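Ranking a candidate set by data source count amounts to a descending sort. The sketch below is an illustration under stated assumptions: the keys `dtxcid` and `data_sources` are hypothetical field names rather than the column headers of the export file, and ties are broken by identifier only to keep the order deterministic. It does not implement the combined scoring metric, which the text describes as still under development.

```python
from typing import Dict, List


def rank_by_data_sources(candidates: List[Dict]) -> List[Dict]:
    """Order candidates by descending data source count (ties: identifier)."""
    return sorted(candidates, key=lambda c: (-c["data_sources"], c["dtxcid"]))
```

For example, candidates with counts of 3, 120, and 3 would be returned with the 120-source chemical first, followed by the two tied entries in identifier order.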

Code Availability
All code for predicting the MS/MS spectra, including model parameters and settings, is available via http://sourceforge.net/projects/cfm-id. Additional scripts used to implement the prediction algorithm and query the compiled database are available on GitHub (https://github.com/USEPA/CFM-ID_generation_of_CompTox_Chemicals_Dashboard_Structures_Paper).