Abstract
Mass spectrometry is a predominant experimental technique in metabolomics and related fields, but metabolite structural elucidation remains highly challenging. We report SIRIUS 4 (https://bio.informatik.uni-jena.de/sirius/), which provides a fast computational approach for molecular structure identification. SIRIUS 4 integrates CSI:FingerID for searching in molecular structure databases. Using SIRIUS 4, we achieved identification rates of more than 70% on challenging metabolomics datasets.
Code availability
SIRIUS 4 is written in Java; is open source under the GNU General Public License (version 3); and works on Windows, macOS X, and Linux. In addition to the graphical front end, a comprehensive command-line version allows batch processing and integration into workflows; integration into GNPS (ref. 1), OpenMS (ref. 2), and MZmine (ref. 4) is ongoing. We also provide source code, executable binaries, documentation, support, non-commercial training data, example files, and additional information on the SIRIUS website (https://bio.informatik.uni-jena.de/sirius/); a source copy is hosted on GitHub (https://github.com/boecker-lab/sirius). You can retrieve the InChIs of all compounds used to train CSI:FingerID from the web service (https://www.csi-fingerid.uni-jena.de/webapi/trainingstructures.csv?predictor=pos and https://www.csi-fingerid.uni-jena.de/webapi/trainingstructures.csv?predictor=neg).
Data availability
Data for the CASMI 2016 re-evaluation are available from https://bio.informatik.uni-jena.de/data under a Creative Commons CC-BY license. Cross-validation data for the GNPS search re-evaluation are available from https://bio.informatik.uni-jena.de/data/ (Creative Commons CC0 1.0 Universal license). Data for the American Gut project are available in the MassIVE database (MSV000080186 and MSV000080187; Creative Commons CC0 1.0 Universal license). The analysis can be accessed via the GNPS website (http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=9bd16822c8d448f59a03e6cc8f017f43 and http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=d26ae082b1154f73ac050796fcaa6bda). Data for the study of clothing with antibacterial properties are available at MassIVE (MSV000081379; Creative Commons CC0 1.0 Universal license). Analysis is available at the GNPS website (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=a5e8ca1b7a9c42cfb45fbb2855e36721). Source data for Supplementary Figs. 6–8 are available online.
References
Wang, M. et al. Nat. Biotechnol. 34, 828–837 (2016).
Röst, H. L. et al. Nat. Methods 13, 741–748 (2016).
Tsugawa, H. et al. Nat. Methods 12, 523–526 (2015).
Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. BMC Bioinformatics 11, 395 (2010).
Smith, C. A., Want, E. J., O’Maille, G., Abagyan, R. & Siuzdak, G. Anal. Chem. 78, 779–787 (2006).
Böcker, S., Letzel, M. C., Lipták, Z. & Pervukhin, A. Bioinformatics 25, 218–224 (2009).
Böcker, S. & Rasche, F. Bioinformatics 24, i49–i55 (2008).
Böcker, S. & Dührkop, K. J. Cheminform. 8, 5 (2016).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Shen, H., Dührkop, K., Böcker, S. & Rousu, J. Bioinformatics 30, i157–i164 (2014).
Heinonen, M., Shen, H., Zamboni, N. & Rousu, J. Bioinformatics 28, 2333–2341 (2012).
Pirhaji, L. et al. Nat. Methods 13, 770–776 (2016).
Hatzimanikatis, V. et al. Bioinformatics 21, 1603–1609 (2005).
Meusel, M. et al. Anal. Chem. 88, 7556–7566 (2016).
Ludwig, M., Dührkop, K. & Böcker, S. Bioinformatics 34, i333–i340 (2018).
Kim, S. et al. Nucleic Acids Res. 44, D1202–D1213 (2016).
Jeffryes, J. G. et al. J. Cheminform. 7, 44 (2015).
Schymanski, E. L. et al. J. Cheminform. 9, 22 (2017).
Pence, H. E. & Williams, A. J. Chem. Educ. 87, 1123–1124 (2010).
CASMI 2017 Team. And the results are. CASMI 2017 http://www.casmi-contest.org/2017/results.shtml (2017).
Cohen, L. J. et al. Nature 549, 48–53 (2017).
Dührkop, K., Ludwig, M., Meusel, M. & Böcker, S. in Algorithms in Bioinformatics (WABI 2013) (eds Darling, A. & Stoye, J.) 45–58 (Springer, Berlin, 2013).
Böcker, S. & Lipták, Zs. Algorithmica 48, 413–432 (2007).
Böcker, S., Letzel, M. C., Lipták, Zs. & Pervukhin, A. in Algorithms in Bioinformatics (WABI 2006) (eds Bücher, P. & Moret, B. M. E.) 12–23 (Springer, Berlin, 2006).
Rauf, I., Rasche, F., Nicolas, F. & Böcker, S. J. Comput. Biol. 20, 311–321 (2013).
White, W. T. J., Beyer, S., Dührkop, K., Chimani, M. & Böcker, S. in Computing and Combinatorics (COCOON 2015) (eds Xu, D., Du, D. & Du, D.) 310–322 (Springer, Cham, 2015).
Dührkop, K., Lataretu, M. A., White, W. T. J. & Böcker, S. in Proc. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018) (eds Parida, L. & Ukkonen, E.) 23:1–23:14 (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2018).
GNU Linear Programming Kit (GLPK) v. 4.60 https://www.gnu.org/software/glpk (Free Software Foundation, 2016).
CPLEX v. 12.8 https://www.ibm.com/analytics/cplex-optimizer (IBM, 2017).
Senior, J. Am. J. Math. 73, 663–689 (1951).
Pluskal, T., Uehara, T. & Yanagida, M. Anal. Chem. 84, 4396–4403 (2012).
Dührkop, K., Hufsky, F. & Böcker, S. Mass Spectrom. (Tokyo) 3, S0037 (2014).
LeCun, Y., Bengio, Y. & Hinton, G. Nature 521, 436–444 (2015).
Böcker, S. & Mäkinen, V. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 91–100 (2008).
Cortes, C., Mohri, M. & Rostamizadeh, A. J. Mach. Learn. Res. 13, 795–828 (2012).
Shen, H., Szedmak, S., Brouard, C. & Rousu, J. in Discovery Science (DS 2016) (eds Calders, T., Ceci, M. & Malerba, D.) 427–441 (Springer, Cham, 2016).
Horai, H. et al. J. Mass. Spectrom. 45, 703–714 (2010).
Brodley, C. E. & Friedl, M. A. J. Artif. Intell. Res. 11, 131–167 (1999).
Rogers, D. & Hahn, M. J. Chem. Inf. Model. 50, 742–754 (2010).
Willighagen, E. L. et al. J. Cheminform. 9, 33 (2017).
Wang, R., Gao, Y. & Lai, L. Perspect. Drug Discov. Des. 19, 47–66 (2000).
Steinbeck, C. et al. J. Chem. Inf. Comput. Sci. 43, 493–500 (2003).
SIRIUS v. 4.0.1 https://bio.informatik.uni-jena.de/software/sirius (Friedrich-Schiller-University Jena, 2018).
Melnik, A. et al. Data generation and analysis with SIRIUS 4 on two biological case studies. Protocol Exchange https://doi.org/10.1038/protex.2018.133 (2019).
Acknowledgements
We gratefully acknowledge financial support by the Deutsche Forschungsgemeinschaft (BO 1910/20) to S.B. and the Academy of Finland (310107/MACOME) to J.R. We thank the GNPS community, S. Stein, F. Kuhlmann, and Agilent Technologies Inc. (Santa Clara, CA, USA) for providing data that were used to estimate the hyperparameters of SIRIUS 4 and to train CSI:FingerID. We also thank F. Kuhlmann and Agilent Technologies for data used to evaluate the isotope scoring.
Author information
Contributions
K.D., P.C.D., J.R., and S.B. designed the research. K.D., M.F., M.L., and J.R. developed computational methods. K.D., M.F., M.L., and M.M. implemented computational methods and performed method evaluations, coordinated by S.B. A.A.A. and A.V.M. performed the biological case studies, coordinated by P.C.D. S.B. wrote the manuscript, to which K.D., M.F., M.L., A.A.A., and A.V.M. contributed, in cooperation with all other authors.
Ethics declarations
Competing interests
S.B. holds patents (Japanese patent 5559816 and US patent 8263931) whose value might be affected by this publication.
Additional information
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
Supplementary Figure 1 SIRIUS 4 graphical user interface.
a, The molecular formula ‘SIRIUS Overview’ tab displays all molecular formula candidates of a query compound in a single display. b, The ‘Spectra view’ tab shows the individual mass spectra. c, Similarly, the ‘Tree view’ tab allows a closer look at the fragmentation trees for each molecular formula candidate. d,e, The next two tabs show the result of the CSI:FingerID molecular structure search: (d) the ‘CSI:FingerID Overview’ tab summarizes all molecular formula candidates, whereas (e) the ‘CSI:FingerID Details’ tab presents results for each molecular formula candidate individually. Results can be filtered and searched; database links are provided for candidates when possible. f, Finally, the ‘Predicted Fingerprint’ tab presents the fingerprint predicted by CSI:FingerID, independently of any database searching.
Supplementary Figure 2 Job View in the SIRIUS 4 graphical user interface.
The job view displays name, type, state (running, queued, waiting and failed) and progress of SIRIUS 4 jobs in the job-scheduling system. Jobs can be canceled individually, and the job scheduler automatically handles potential dependencies. Logging information can be shown individually for each job.
Supplementary Figure 3 Example of SIRIUS 4 isotope pattern analysis for MS/MS data.
a, MSE spectrum for CASMI 2017 challenge 226, a derivative of cyclochlorotine with sum formula C24H31Cl2N5O7. b, A single isotope pattern in the spectrum is highlighted and shown in detail. The simulated isotope pattern of C23H28Cl2NO7 is drawn below in red. c, Part of the fragmentation graph corresponding to this spectrum. Yellow nodes correspond to the isotope peaks of C23H28Cl2NO7. We see that the first isotope peak of C23H28Cl2NO7 can also be explained as the monoisotopic peak of C24H24ClN3O7.
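The ambiguity noted in panel c can be checked with a short mass calculation. The following Python sketch is illustrative only and is not part of SIRIUS 4. It assumes that the first isotope peak arises from a single 13C substitution and it works with neutral masses, ignoring the charge, which cancels in the difference. The function name `monoisotopic_mass` is hypothetical.

```python
import re

# Monoisotopic masses (Da) of the lightest stable isotope of each element.
MONO = {
    "C": 12.0,
    "H": 1.00782503207,
    "N": 14.0030740048,
    "O": 15.99491461956,
    "Cl": 34.96885268,
}
C13_SHIFT = 1.0033548378  # mass difference between 13C and 12C (Da)


def monoisotopic_mass(formula):
    """Sum the monoisotopic element masses of a formula such as 'C23H28Cl2NO7'."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONO[element] * (int(count) if count else 1)
    return mass


# Assumption: the first isotope peak is the monoisotopic mass plus one 13C.
first_isotope_peak = monoisotopic_mass("C23H28Cl2NO7") + C13_SHIFT
competing_mono = monoisotopic_mass("C24H24ClN3O7")

delta_da = competing_mono - first_isotope_peak
delta_ppm = delta_da / first_isotope_peak * 1e6
print(f"difference: {delta_da * 1000:.2f} mDa ({delta_ppm:.1f} ppm)")
```

Under these assumptions the two explanations differ by only a few millidaltons, so at typical mass accuracy both remain plausible for the same peak.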
Supplementary Figure 4 Support Vector Machine for classifying molecular formulas of biomolecules.
a, Histogram and kernel density plots of the linear Support Vector Machine (SVM); plotted are molecular formulas from the biomolecule structure database (Supplementary Table 2), PubChem, and a random subset of decompositions. b, Receiver operating characteristic (ROC) plot of the classifier, biomolecules versus random decompositions. The area under the curve (AUC) of ROC is 0.965.
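For readers unfamiliar with the AUC statistic reported in panel b, a minimal Python sketch follows. It computes the area under the ROC curve as the fraction of (positive, negative) pairs in which the positive example receives the higher score. The score values are hypothetical toy numbers, not data from this study, and the function name `roc_auc` is an illustrative choice.

```python
def roc_auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random
    negative; tied pairs count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical classifier decision values (NOT taken from the paper).
biomolecule_scores = [2.1, 1.4, 0.9, 0.3, -0.2]
random_decomposition_scores = [0.5, -0.4, -1.0, -1.7]

auc = roc_auc(biomolecule_scores, random_decomposition_scores)
print(f"AUC = {auc:.3f}")
```

An AUC of 1 would mean every biomolecule formula scores above every random decomposition, and 0.5 corresponds to chance.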
Supplementary Figure 5 Predicted fingerprint of the query clobutinol.
Only molecular properties with at least four heavy atoms are displayed. Different from Fig. 1b–f, a second molecular property predicted to be present (green bars) has been selected; again, SIRIUS 4 displays a few example structures that contain the corresponding property. The two substructures (Fig. 1f and here) allow the user to deduce information about the query structure, without the need to query a molecular structure database.
Supplementary Figure 6 Comparing SIRIUS 3.0 and current SIRIUS 4.
Both versions are compared using isotope pattern (MS1) and MS/MS data from 3,965 compounds with mass ranging from 75 Da to 1,289 Da. a, Evaluation of molecular formula annotation. We report the number of instances where the correct molecular formula was ranked in the top k, for k = 1, …, 10. We evaluate exclusively the isotope pattern scoring of SIRIUS 3.0 (green diamonds) and SIRIUS 4 (blue diamonds), as well as the combined analysis of isotope patterns and MS/MS data (blue and green circles). b, Running time comparison, combined analysis. We sort compounds by mass, and report the time SIRIUS 3.0 and SIRIUS 4 require for computing the k% lightest compounds in the dataset. Note the logarithmic y-scale. SIRIUS 3.0 stopped after 154 h of computation with a timeout/memory exception, failing to compute the 90 heaviest compounds.
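The top-k evaluation in panel a can be sketched in a few lines of Python. The snippet assumes that, for each compound, the rank of the correct molecular formula in the candidate list is already known (1-based, or `None` if the correct formula was not among the candidates). The ranks shown are hypothetical and the function name `top_k_counts` is illustrative.

```python
def top_k_counts(ranks, max_k=10):
    """For k = 1..max_k, count compounds whose correct formula is ranked
    within the top k. A rank of None means the formula was not found."""
    return [
        sum(1 for r in ranks if r is not None and r <= k)
        for k in range(1, max_k + 1)
    ]


# Hypothetical ranks of the correct formula for eight compounds.
ranks = [1, 1, 3, 2, None, 1, 7, 12]

counts = top_k_counts(ranks)
for k, count in enumerate(counts, start=1):
    print(f"top {k}: {count} of {len(ranks)} compounds")
```

The counts are non-decreasing in k by construction, which is why such curves are usually plotted cumulatively.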
Supplementary Figure 7 Methods evaluation, structural elucidation searching GNPS in PubChem.
The percentage of correctly identified structures found in the top k output of a method. Searching N = 3,868 compounds from GNPS in PubChem (15 September 2014). CSI:FingerID 1.1 is evaluated here; identification rates for all other methods are taken from ref. 43 in the Supplementary Information reference list.
Supplementary Figure 8 Methods evaluation, contribution of method enhancements and new data.
Cumulative contribution of different aspects from version 1.0 to 1.1 of CSI:FingerID. New kernels (green) add ECFP fingerprints (red) and additional training data (violet).
Supplementary Figure 9 Identification of leukotrienes on human skin: network cluster.
A large network cluster indicating a number of structurally related compounds, with no compounds annotated via library search throughout the entire cluster.
Supplementary Figure 10 Identification of leukotrienes on human skin: fragmentation tree.
Fragmentation spectra and tree that explains the experimentally observed MS/MS fragmentation pattern of the ion with m/z 440.246.
Supplementary Figure 11 Identification of leukotrienes on human skin: compound structure.
Structure of the compound with the highest score, 14,15-leukotriene E4 (LTE4). This structure served as a starting point for annotation of other compounds in the cluster.
Supplementary Figure 12 Identification of leukotrienes on human skin: spectral validation.
Validation of LTE4 predicted by SIRIUS 4 with CSI:FingerID by spectral and retention time match with synthetic standard.
Supplementary information
Supplementary Text and Figures
Supplementary Figures 1–12, Supplementary Tables 1 and 2, Supplementary Notes 1–10 and Supplementary Results 1–6
Supplementary Protocol
Data generation and analysis with SIRIUS 4 on two biological case studies.
Supplementary Table 3
Molecular formula identification with SIRIUS 3 and SIRIUS 4 using solely MS1 data or MS1 and MS/MS data.
Supplementary Table 4
CASMI 2016 data reanalysis with SIRIUS 4.
Supplementary Table 5
Molecular structure identification rates using CSI:FingerID version 1.0 versus 1.1.
About this article
Cite this article
Dührkop, K., Fleischauer, M., Ludwig, M. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods 16, 299–302 (2019). https://doi.org/10.1038/s41592-019-0344-8