To the Editor — We introduce a web-enabled mass spectrometry (MS) search engine, named Mass Spectrometry Search Tool (MASST; https://masst.ucsd.edu). By enabling searches of all small-molecule tandem MS (MS/MS) data in public metabolomics repositories, we posit that MASST will unlock these resources for clinical, environmental and natural product applications.
Introduced in 1990, a tool for discovering related protein or gene sequences named Basic Local Alignment Search Tool (BLAST) enabled researchers to query entire public sequence data repositories through a web interface (WebBLAST; https://blast.ncbi.nlm.nih.gov/Blast.cgi)1. WebBLAST is one of the most widely cited and used bioinformatics tools because it permits any researcher to answer simple questions, such as ‘is a protein or DNA sequence common or rare?’. In the early days of public gene and protein databases, metadata, which include descriptions of sample, population or technical details, were limited. No deposition standards existed, except for the Short Read Archive and European Nucleotide Archive, which include experimental details for sequencing, instrumental details and sample description, such as the source of a sample. The current status of much MS data in the public domain is reminiscent of the DNA databanks of the 1990s. To increase usage and unlock the potential of openly available MS resources, we set out to build an infrastructure to enable WebBLAST for MS.
Algorithms developed for MS data, including molecular networking2 and fragmentation trees3, enable similarity searches against reference libraries of known molecules, whereas powerful metabolomics analysis software infrastructures, such as MS-DIAL4, MetaboAnalyst5, XCMS Online6 and HMDB7, focus on annotation of MS/MS spectra, or finding statistical relationships between molecular features. However, none of the existing tools enables searching public repository data, including data from unknown molecules, for MS/MS spectra that are identical or analogous to a single query MS/MS spectrum. Finding specific MS/MS spectra of interest, including unannotated spectra or structural analogs, in public repositories of metabolomics and natural product MS data has therefore not been possible. Deposition of untargeted MS data in the public domain is experiencing rapid growth. In March 2017, 910 metabolomics datasets were available8; by January 2019, there were >2,000 downloadable metabolomics datasets (about half of these datasets contain MS/MS data)9. Despite the availability of metabolomics and natural product data, including environmental and clinical MS datasets, public small-molecule MS data are hardly reused10. Now that a large number of small-molecule untargeted MS datasets are publicly available (~1,100 untargeted datasets and ~110,000,000 spectra in ~150,000 files as of December 11, 2018), we felt that the time was right to develop MASST, to enable reuse of these MS data.
MASST comprises a web-based system to search the public data repository part of the GNPS/MassIVE knowledge base11 and an analysis infrastructure for a single MS/MS spectrum. The developments required for MASST searches included converting deposited public data to a uniform open format12 (irrespective of instrument type and original data format), the ability to trace the file from which each MS/MS spectrum originated, and a reporting system that shows all identical or similar MS/MS spectra found in public data along with their associated metadata. MASST development has become possible for two main reasons: first, adoption of universal, non-vendor-specific MS data formats has increased, which means that multiple publicly available datasets have been converted to the same data format13, and second, the ability to connect all public data in GNPS/MassIVE and to link each MS/MS spectrum to its metadata entries has only recently been developed.
A MASST report also includes matches to any reference spectra in public MS/MS spectral libraries, if the matches are within the user-specified search parameters. Libraries include GNPS user-contributed spectra11, GNPS libraries11, all three MassBanks14 (https://massbank.eu/MassBank/, https://mona.fiehnlab.ucdavis.edu/), ReSpect15, MIADB/Beniddir16, Sumner/Bruker, CASMI17, PNNL lipids18, Sirenas/Gates, EMBL MCF and several other libraries, listed at https://gnps.ucsd.edu/ProteoSAFe/libraries.jsp. Visualization of the MASST matches uses a mirror view (Fig. 1).
MASST can search against various repositories, including GNPS/MassIVE11, Metabolomics Workbench19, MetaboLights20 or the non-redundant (nr) MS/MS library of all unique MS/MS spectra from all three repositories combined. MASST searching using multiple repositories was enabled by converting data uploaded to the Metabolomics Workbench and MetaboLights repositories to the same open MS format in the GNPS/MassIVE data storage environment. Instructions on how to upload to GNPS/MassIVE can be found at https://ccms-ucsd.github.io/GNPSDocumentation/datasets/.
All public data in GNPS/MassIVE become MASST-searchable. MASST searches output results according to user-defined search parameters. The report returns the origin of each matched MS/MS spectrum (its dataset and file) together with any metadata associated with the file (Fig. 1). Datasets and files can be tagged with sample or spectral information by the community of MASST users, and this information then becomes part of the metadata reported back in future MASST searches. We also curated ~34,000 additional MS files with ~340,000 tags, mostly from human-associated samples, but also from microbes, food and indoor and outdoor environments, to provide a good foundation for MASST searches.
Metadata can be associated with MS/MS spectra in the GNPS/MassIVE upload portal at the dataset level, file level or single annotated spectrum level. Examples of metadata include instrument type, phylogeny (according to the National Center for Biotechnology Information (NCBI) taxonomy) and keywords at the dataset level; phylogeny, sample type, age, sex, body site (defined using the Uberon anatomy ontology21) and disease22 at the file level; and source, biological activity and structural class information at the single annotated spectrum level. In addition, GNPS/MassIVE is compatible with metadata formats from other software tools (e.g., QIIME2 and Qiita), which are used to analyze microbiome data and have a controlled vocabulary that can be imported23,24. Any sample information uploaded to GNPS/MassIVE from another repository (e.g., from MetaboLights and Metabolomics Workbench) is also included in the MASST report.
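The three metadata levels described above can be pictured as a small hierarchy that is flattened into one report row per matched spectrum. The following Python sketch is illustrative only: the class names, field names, placeholder values and the rule that more specific levels override more general ones are assumptions made for this example, not the GNPS/MassIVE data model.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    accession: str  # hypothetical identifier, not a real accession
    metadata: dict = field(default_factory=dict)  # e.g. instrument, keywords


@dataclass
class DataFile:
    name: str
    dataset: Dataset  # link back to the dataset the file belongs to
    metadata: dict = field(default_factory=dict)  # e.g. sample type, body site


@dataclass
class Spectrum:
    scan: int
    source_file: DataFile  # link back to the file the spectrum came from
    metadata: dict = field(default_factory=dict)  # e.g. structural class


def report_row(spectrum):
    """Flatten dataset-, file- and spectrum-level metadata into one row.

    Assumption for this sketch: when the same key appears at several
    levels, the most specific level (spectrum, then file) wins.
    """
    data_file = spectrum.source_file
    merged = {
        **data_file.dataset.metadata,
        **data_file.metadata,
        **spectrum.metadata,
    }
    return {
        "dataset": data_file.dataset.accession,
        "file": data_file.name,
        "scan": spectrum.scan,
        **merged,
    }


# Placeholder values chosen only to mirror the metadata examples in the text.
dataset = Dataset("DATASET_001", {"instrument": "example-instrument",
                                  "keywords": ["skin", "microbiome"]})
data_file = DataFile("sample_01.mzML", dataset,
                     {"phylogeny": "Homo sapiens", "sample_type": "swab",
                      "body_site": "skin", "sex": "female"})
spectrum = Spectrum(1234, data_file, {"structural_class": "example-class"})

row = report_row(spectrum)
print(row)
```

Keeping a reference from each spectrum to its file, and from each file to its dataset, is what allows a single spectral match to be reported together with the sample information recorded at every level.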
At present, there are only limited metadata at the dataset and file level, but the metadata in the public domain can provide insights into the types of MS/MS signals being analyzed (Box 1 contains examples of usage). Although the amount and quality of metadata are increasing25, datasets do not always have detailed metadata. To allay this problem, re-annotation of metadata as knowledge increases, while retaining provenance of all changes, is possible in GNPS11. If insufficient metadata are available to interpret the results of a public dataset search, the original depositors of the public data can be contacted. We expect this feature in MASST to foster collaborations worldwide.
MASST can be accessed at https://proteosafe-extensions.ucsd.edu/masst/ by copying and pasting the MS/MS peak list, given as the mass-to-charge ratio (m/z) and intensity of each fragment ion (also known as product ion) separated by a space; peak lists can also be extracted from open MS formats (e.g., .mzML, .mzXML and .MGF). Alternatively, MASST can be accessed as part of a GNPS data analysis. Manual entry at https://proteosafe-extensions.ucsd.edu/masst/ provides researchers with the ability to enter data from theoretical spectra, or spectra from published papers or supporting information, without needing access to the original experimental data. In GNPS, users can launch a MASST search using links provided in the classic and feature-based molecular networking output created within the GNPS infrastructure11; clicking the MASST spectrum button redirects to the MASST search page with the spectral data prepopulated. The MS/MS spectrum provided via the MASST website or as a link-out from a GNPS search is then searched against all public data with user-defined parameters: the minimum number of ions to match, the precursor (parent) and product (fragment) ion tolerances, and whether to run analog similarity searches based on non-identical precursor masses2. An instruction video for running MASST jobs is available at https://youtu.be/4yBKomKzEKU. MASST searches retrieve all associated sample information (dataset and files) that match the MS/MS input spectrum query. A typical search takes about 10–20 min. Multiple search queries are placed in a queue for parallel execution as resources become available.
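To make the input format and the search parameters concrete, the following Python sketch parses a pasted peak list and scores it against a toy library. It is a simplified illustration under stated assumptions, not the MASST implementation: it assumes one "m/z intensity" pair per line, uses a plain greedy cosine score, and the function names, default tolerances, score threshold and example spectra are all hypothetical.

```python
import math


def parse_peak_list(text):
    """Parse pasted text with one 'm/z intensity' pair per line (assumed layout)."""
    peaks = []
    for line in text.strip().splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        peaks.append((float(parts[0]), float(parts[1])))
    return sorted(peaks)


def normalize(peaks):
    """Scale intensities so the spectrum has unit Euclidean norm."""
    norm = math.sqrt(sum(i * i for _, i in peaks)) or 1.0
    return [(mz, i / norm) for mz, i in peaks]


def match_score(query, reference, fragment_tol, shift=0.0):
    """Greedy cosine score; returns (score, number of matched peaks).

    shift is the precursor m/z difference (reference minus query). When it
    is non-zero, fragments may match directly or offset by that difference.
    """
    q, r = normalize(query), normalize(reference)
    candidates = []
    for i, (qmz, qint) in enumerate(q):
        for j, (rmz, rint) in enumerate(r):
            delta = rmz - qmz
            direct = abs(delta) <= fragment_tol
            shifted = shift != 0.0 and abs(delta - shift) <= fragment_tol
            if direct or shifted:
                candidates.append((qint * rint, i, j))
    candidates.sort(reverse=True)  # pair the strongest peaks first
    used_q, used_r, score = set(), set(), 0.0
    for product, i, j in candidates:
        if i in used_q or j in used_r:
            continue  # each peak may be used at most once
        used_q.add(i)
        used_r.add(j)
        score += product
    return score, len(used_q)


def search(query, query_precursor, library, precursor_tol=2.0,
           fragment_tol=0.5, min_matched=6, min_score=0.7, analog=False):
    """Return (id, score, matched peaks) for library entries passing all filters."""
    hits = []
    for entry in library:
        shift = entry["precursor_mz"] - query_precursor
        if not analog and abs(shift) > precursor_tol:
            continue  # exact mode: precursor masses must agree
        score, n = match_score(query, entry["peaks"], fragment_tol,
                               shift if analog else 0.0)
        if n >= min_matched and score >= min_score:
            hits.append((entry["id"], round(score, 3), n))
    return sorted(hits, key=lambda h: -h[1])


# Invented example spectra; the 'analog' entry carries a +14.02 precursor shift.
pasted = """
85.03 120
127.04 300
145.05 1000
163.06 450
"""
query = parse_peak_list(pasted)
library = [
    {"id": "same", "precursor_mz": 181.07, "peaks": query},
    {"id": "analog", "precursor_mz": 195.09,
     "peaks": [(85.03, 120), (127.04, 300), (159.07, 1000), (177.08, 450)]},
]
exact_hits = search(query, 181.07, library, fragment_tol=0.02, min_matched=4)
analog_hits = search(query, 181.07, library, fragment_tol=0.02,
                     min_matched=4, analog=True)
print(exact_hits)
print(analog_hits)
```

In this toy case the exact search skips the second entry because its precursor mass differs by more than the tolerance, whereas the analog search also considers fragments offset by the precursor mass difference. Tightening the tolerances or raising the minimum number of matched ions narrows the result set, which is one reason a search can return no matches even when related spectra exist.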
To promote data analysis reproducibility, the results of every job are stored in each user’s space and can be found under the ‘Jobs’ tab accessible through the banner in the GNPS browser (http://gnps.ucsd.edu). Only MASST jobs run while logged in on GNPS will be retained. Search parameters are also retained with each job and constitute a provenance record that can be provided as hyperlinks to share with others (e.g., collaborators) or in publications. These jobs can be shared, cloned and rerun with or without alterations of the input parameters (examples of links to jobs are shown in Box 1). This feature could enable new matches to be made when relevant public data are uploaded. The matches of MS/MS spectra among datasets are equivalent to level two (putative annotation based on spectral library similarity) or level three (putatively characterized compound class based on spectral similarity to known compounds of a chemical class) according to the 2007 Metabolomics Standards Initiative26. Similar to short sequence read searches, MASST searches will not distinguish chemicals that have nearly identical fragmentation patterns, such as isomeric compounds, which would require an authentic standard and the use of an orthogonal property (such as the retention time). When a MASST search returns no matches, it is possible either that there are no matching data or that matching MS/MS spectra exist but fall outside the specified search parameters. MASST should be used with these caveats in mind.
MASST, like WebBLAST, will likely find broad application. Uses of MASST might include translation of in vitro or in vivo data from model organisms to humans, or broad ecological questions. Box 1 contains ten example uses to highlight the types of discoveries possible with access, via MASST, to the entire body of public MS/MS data. These examples are illustrative, and we expect the user community to find multiple, innovative ways to use MASST.
All data used for testing and validating MASST are deposited in GNPS/MassIVE. MASST is a web-based application that is embedded in GNPS, which is a community service in which all deposited public data are openly accessible. All data underlying figures presented in the Supplementary Note are included as Supplementary Data 1 and 2. We cannot provide server installation, software engineers or administrator support for individual installations of MASST. The MASST platform is built as a workflow on top of the web repository workflow platform ProteoSAFe (https://github.com/CCMS-UCSD/ProteoSAFe). Each step of the MASST query is written in Python. Web rendering of the results is displayed by ProteoSAFe in the browser.
Altschul, S. F. et al. J. Mol. Biol. 215, 403–410 (1990).
Watrous, J. et al. Proc. Natl Acad. Sci. USA 109, E1743–E1752 (2012).
Rasche, F. et al. Anal. Chem. 83, 1243–1251 (2011).
Lai, Z. et al. Nat. Methods 15, 53–56 (2018).
Chong, J. et al. Nucleic Acids Res. 46, W486–W494 (2018).
Tautenhahn, R. et al. Anal. Chem. 84, 5035–5039 (2012).
Wishart, D. S. et al. Nucleic Acids Res. 46, D608–D617 (2018).
Aksenov, A. A. et al. Nat. Rev. Chem. 1, 0054 (2017).
Perez-Riverol, Y. et al. Nat. Biotechnol. 35, 406–409 (2017).
Rocca-Serra, P. et al. Metabolomics 12, 14 (2016).
Wang, M. et al. Nat. Biotechnol. 34, 828–837 (2016).
Kirchner, M. et al. J. Proteome Res. 9, 2762–2763 (2010).
Kessner, D. et al. Bioinformatics 24, 2534–2536 (2008).
Horai, H. et al. J. Mass Spectrom. 45, 703–714 (2010).
Sawada, Y. et al. Phytochemistry 82, 38–45 (2012).
Otogo N’Nang, E. et al. Org. Lett. 20, 6596–6600 (2018).
Schymanski, E. L. et al. Metabolites 3, 517–538 (2013).
Kyle, J. E. et al. Bioinformatics 33, 1744–1746 (2017).
Haug, K. et al. Nucleic Acids Res. 41, D781–D786 (2013).
Sud, M. et al. Nucleic Acids Res. 44, D463–D470 (2016).
Mungall, C. J. et al. Genome Biol. 13, R5 (2012).
Schriml, L. M. et al. Nucleic Acids Res. 47, D955–D962 (2019).
Bolyen, E. et al. Nat. Biotechnol. 37, 852–857 (2019).
Gonzalez, A. et al. Nat. Methods 15, 796–798 (2018).
Jarmusch, A. K. et al. Preprint at bioRxiv https://doi.org/10.1101/750471 (2019).
Sumner, L. W. et al. Metabolomics 3, 211–221 (2007).
Conversion of data from different repositories was supported by R03 CA211211 on reuse of metabolomics data. The development of a user-friendly interface was in part supported by the Gordon and Betty Moore Foundation through grant GBMF7622. The UC San Diego Center for Microbiome Innovation supported the campus-wide SEED grant awards for data collection that enabled the development of much of this infrastructure. A.K.J. thanks the American Society for Mass Spectrometry for the 2018 Postdoctoral Career Development Award. We acknowledge C. O’Donovan and K. Haug for help with navigating the MetaboLights data repository. J.V.D.H. was supported by an ASDI eScience grant (ASDI.2017.030) from the Netherlands eScience Center (NLeSC). E.I.Z. and L.L.-B. were supported by NIH grants AI081923 and AI113923. A.M.C.R., K.E.K., S.P.P., J.L.K., M.J.B. and P.C.D. were supported by NSF grant IOS-1656475. A.B. was supported by National Institute of Justice Award 2015-DN-BX-K047. F.V. was supported by the Department of Navy, Office of Naval Research Multidisciplinary University Research Initiative (MURI) Award, award number N00014-15-1-2809. D.P. was supported by the German Research Foundation (DFG) with grant PE 2600/1. Additional support for data acquisition and data storage was provided by P41 GM103484 Center for Computational Mass Spectrometry, and instrument support was provided through NIH S10RR029121. R.L. is supported by NIH grants R01DK106419, 5P42ES010337 and 5UL1TR001442, and J.D.W. is supported by NIH K01DK116917. The development of the web interface and harmonization with Qiita was in part supported by the Sloan Foundation.
M.W. is the founder of Ometa Labs LLC and consults for Sirenas, and P.C.D. is on the scientific advisory board of Sirenas.
Underlying data for figure panel in the Supplementary Note.
Wang, M., Jarmusch, A.K., Vargas, F. et al. Mass spectrometry searches using MASST. Nat Biotechnol 38, 23–26 (2020). https://doi.org/10.1038/s41587-019-0375-9