Abstract

Mass spectrometry is a predominant experimental technique in metabolomics and related fields, but metabolite structural elucidation remains highly challenging. We report SIRIUS 4 (https://bio.informatik.uni-jena.de/sirius/), which provides a fast computational approach for molecular structure identification. SIRIUS 4 integrates CSI:FingerID for searching in molecular structure databases. Using SIRIUS 4, we achieved identification rates of more than 70% on challenging metabolomics datasets.

Code availability

SIRIUS 4 is written in Java; is open source under the GNU General Public License (version 3); and works on Windows, macOS X, and Linux. In addition to the graphical front end, a comprehensive command-line version allows batch processing and integration into workflows; integration into GNPS (ref. 1), OpenMS (ref. 2), and MZmine (ref. 4) is ongoing. We also provide source code, executable binaries, documentation, support, non-commercial training data, example files, and additional information on the SIRIUS website (https://bio.informatik.uni-jena.de/sirius/); a source copy is hosted on GitHub (https://github.com/boecker-lab/sirius). You can retrieve the InChIs of all compounds used to train CSI:FingerID from the web service (https://www.csi-fingerid.uni-jena.de/webapi/trainingstructures.csv?predictor=pos and https://www.csi-fingerid.uni-jena.de/webapi/trainingstructures.csv?predictor=neg).
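
As a minimal sketch of how the two training-structure lists mentioned above could be retrieved programmatically, the following Python snippet builds the web-service URLs and saves the raw response to disk. The function names and output file names are hypothetical and chosen for illustration only. The column layout of the returned file is not described here, so the sketch stores the text unchanged rather than guessing at a parser.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Endpoint quoted in the Code availability statement.
BASE_URL = "https://www.csi-fingerid.uni-jena.de/webapi/trainingstructures.csv"

# The statement lists exactly these two predictor values (positive and negative ion mode).
PREDICTORS = ("pos", "neg")


def training_structures_url(predictor: str) -> str:
    """Return the download URL for one predictor (hypothetical helper)."""
    if predictor not in PREDICTORS:
        raise ValueError(f"predictor must be one of {PREDICTORS}, got {predictor!r}")
    return f"{BASE_URL}?{urlencode({'predictor': predictor})}"


def download_training_structures(predictor: str, path: str, timeout: float = 60.0) -> int:
    """Download the list for one predictor and write it to `path` unchanged.

    Returns the number of bytes written. The file format is an assumption-free
    pass-through: no parsing is attempted.
    """
    with urlopen(training_structures_url(predictor), timeout=timeout) as response:
        payload = response.read()
    with open(path, "wb") as handle:
        handle.write(payload)
    return len(payload)


# Example usage (requires network access, so it is left commented out):
# for mode in PREDICTORS:
#     download_training_structures(mode, f"trainingstructures_{mode}.csv")
```

Keeping the URL construction separate from the download makes it possible to check the request without contacting the server.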

Data availability

Data for the CASMI 2016 re-evaluation are available from https://bio.informatik.uni-jena.de/data under a Creative Commons CC-BY license. Cross-validation data for the GNPS search re-evaluation are available from https://bio.informatik.uni-jena.de/data/ (Creative Commons CC0 1.0 Universal license). Data for the American Gut project are available in the MassIVE database (MSV000080186 and MSV000080187; Creative Commons CC0 1.0 Universal license). The analysis can be accessed via the GNPS website (http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=9bd16822c8d448f59a03e6cc8f017f43 and http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=d26ae082b1154f73ac050796fcaa6bda). Data for the study of clothing with antibacterial properties are available at MassIVE (MSV000081379; Creative Commons CC0 1.0 Universal license). Analysis is available at the GNPS website (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=a5e8ca1b7a9c42cfb45fbb2855e36721). Source data for Supplementary Figs. 6–8 are available online.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

  1. Wang, M. et al. Nat. Biotechnol. 34, 828–837 (2016).

  2. Röst, H. L. et al. Nat. Methods 13, 741–748 (2016).

  3. Tsugawa, H. et al. Nat. Methods 12, 523–526 (2015).

  4. Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. BMC Bioinformatics 11, 395 (2010).

  5. Smith, C. A., Want, E. J., O’Maille, G., Abagyan, R. & Siuzdak, G. Anal. Chem. 78, 779–787 (2006).

  6. Böcker, S., Letzel, M. C., Lipták, Z. & Pervukhin, A. Bioinformatics 25, 218–224 (2009).

  7. Böcker, S. & Rasche, F. Bioinformatics 24, i49–i55 (2008).

  8. Böcker, S. & Dührkop, K. J. Cheminform. 8, 5 (2016).

  9. Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).

  10. Shen, H., Dührkop, K., Böcker, S. & Rousu, J. Bioinformatics 30, i157–i164 (2014).

  11. Heinonen, M., Shen, H., Zamboni, N. & Rousu, J. Bioinformatics 28, 2333–2341 (2012).

  12. Pirhaji, L. et al. Nat. Methods 13, 770–776 (2016).

  13. Hatzimanikatis, V. et al. Bioinformatics 21, 1603–1609 (2005).

  14. Meusel, M. et al. Anal. Chem. 88, 7556–7566 (2016).

  15. Ludwig, M., Dührkop, K. & Böcker, S. Bioinformatics 34, i333–i340 (2018).

  16. Kim, S. et al. Nucleic Acids Res. 44, D1202–D1213 (2016).

  17. Jeffryes, J. G. et al. J. Cheminform. 7, 44 (2015).

  18. Schymanski, E. L. et al. J. Cheminform. 9, 22 (2017).

  19. Pence, H. E. & Williams, A. J. Chem. Educ. 87, 1123–1124 (2010).

  20. CASMI 2017 Team. And the results are. CASMI 2017 http://www.casmi-contest.org/2017/results.shtml (2017).

  21. Cohen, L. J. et al. Nature 549, 48–53 (2017).

  22. Dührkop, K., Ludwig, M., Meusel, M. & Böcker, S. in Algorithms in Bioinformatics (WABI 2013) (eds Darling, A. & Stoye, J.) 45–58 (Springer, Berlin, 2013).

  23. Böcker, S. & Lipták, Zs. Algorithmica 48, 413–432 (2007).

  24. Böcker, S., Letzel, M. C., Lipták, Zs. & Pervukhin, A. in Algorithms in Bioinformatics (WABI 2006) (eds Bücher, P. & Moret, B. M. E.) 12–23 (Springer, Berlin, 2006).

  25. Rauf, I., Rasche, F., Nicolas, F. & Böcker, S. J. Comput. Biol. 20, 311–321 (2013).

  26. White, W. T. J., Beyer, S., Dührkop, K., Chimani, M. & Böcker, S. in Computing and Combinatorics (COCOON 2015) (eds Xu, D., Du, D. & Du, D.) 310–322 (Springer, Cham, 2015).

  27. Dührkop, K., Lataretu, M. A., White, W. T. J. & Böcker, S. in Proc. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018) (eds Parida, L. & Ukkonen, E.) 23:1–23:14 (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2018).

  28. GNU Linear Programming Kit (GLPK) v. 4.60 https://www.gnu.org/software/glpk (Free Software Foundation, 2016).

  29. CPLEX v. 12.8 https://www.ibm.com/analytics/cplex-optimizer (IBM, 2017).

  30. Senior, J. Am. J. Math. 73, 663–689 (1951).

  31. Pluskal, T., Uehara, T. & Yanagida, M. Anal. Chem. 84, 4396–4403 (2012).

  32. Dührkop, K., Hufsky, F. & Böcker, S. Mass Spectrom. (Tokyo) 3, S0037 (2014).

  33. LeCun, Y., Bengio, Y. & Hinton, G. Nature 521, 436–444 (2015).

  34. Böcker, S. & Mäkinen, V. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 91–100 (2008).

  35. Cortes, C., Mohri, M. & Rostamizadeh, A. J. Mach. Learn. Res. 13, 795–828 (2012).

  36. Shen, H., Szedmak, S., Brouard, C. & Rousu, J. in Discovery Science (DS 2016) (eds Calders, T., Ceci, M. & Malerba, D.) 427–441 (Springer, Cham, 2016).

  37. Horai, H. et al. J. Mass. Spectrom. 45, 703–714 (2010).

  38. Brodley, C. E. & Friedl, M. A. J. Artif. Intell. Res. 11, 131–167 (1999).

  39. Rogers, D. & Hahn, M. J. Chem. Inf. Model. 50, 742–754 (2010).

  40. Willighagen, E. L. et al. J. Cheminform. 9, 33 (2017).

  41. Wang, R., Gao, Y. & Lai, L. Perspect. Drug Discov. Des. 19, 47–66 (2000).

  42. Steinbeck, C. et al. J. Chem. Inf. Comput. Sci. 43, 493–500 (2003).

  43. SIRIUS v. 4.0.1 https://bio.informatik.uni-jena.de/software/sirius (Friedrich-Schiller-University Jena, 2018).

  44. Melnik, A. et al. Data generation and analysis with SIRIUS 4 on two biological case studies. Protocol Exchange https://doi.org/10.1038/protex.2018.133 (2019).

Acknowledgements

We gratefully acknowledge financial support by the Deutsche Forschungsgemeinschaft (BO 1910/20) to S.B. and the Academy of Finland (310107/MACOME) to J.R. We thank the GNPS community, S. Stein, F. Kuhlmann, and Agilent Technologies Inc. (Santa Clara, CA, USA) for providing data that were used to estimate the hyperparameters of SIRIUS 4 and to train CSI:FingerID. We also thank F. Kuhlmann and Agilent Technologies for data used to evaluate the isotope scoring.

Author information

Author notes

  1. These authors contributed equally: Kai Dührkop, Markus Fleischauer, Marcus Ludwig.

Affiliations

  1. Chair for Bioinformatics, Friedrich-Schiller University, Jena, Germany

    Kai Dührkop, Markus Fleischauer, Marcus Ludwig, Marvin Meusel & Sebastian Böcker

  2. Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, La Jolla, San Diego, CA, USA

    Alexander A. Aksenov, Alexey V. Melnik & Pieter C. Dorrestein

  3. Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, La Jolla, San Diego, CA, USA

    Alexander A. Aksenov, Alexey V. Melnik & Pieter C. Dorrestein

  4. Department of Microbial Natural Products, Helmholtz Institute for Pharmaceutical Research Saarland, Helmholtz Centre for Infection Research and Pharmaceutical Biotechnology, Saarland University, Saarbrücken, Germany

    Marvin Meusel

  5. Helsinki Institute for Information Technology, Department of Computer Science, Aalto University, Espoo, Finland

    Juho Rousu

Contributions

K.D., P.C.D., J.R., and S.B. designed the research. K.D., M.F., M.L., and J.R. developed computational methods. K.D., M.F., M.L., and M.M. implemented computational methods and performed method evaluations, coordinated by S.B. A.A.A. and A.V.M. performed the biological case studies, coordinated by P.C.D. S.B. wrote the manuscript, to which K.D., M.F., M.L., A.A.A., and A.V.M. contributed, in cooperation with all other authors.

Competing interests

S.B. holds patents (Japanese patent 5559816 and US patent 8263931) whose value might be affected by this publication.

Corresponding author

Correspondence to Sebastian Böcker.

Integrated supplementary information

  1. Supplementary Figure 1 SIRIUS 4 graphical user interface.

    a, The molecular formula ‘SIRIUS Overview’ tab displays all molecular formula candidates of some query compound in a single display. b, The ‘Spectra view’ tab shows the individual mass spectra. c, Similarly, the ‘Tree view’ tab allows a closer look at the fragmentation trees, for each molecular formula candidate. d,e, The next two tabs show the result of the CSI:FingerID molecular structure search: (d) the ‘CSI:FingerID Overview’ tab summarizes all molecular formula candidates, whereas (e) the ‘CSI:FingerID Details’ tab presents results for each molecular formula candidate individually. Results can be filtered and searched; database links are provided for candidates when possible. f, Finally, the ‘Predicted Fingerprint’ tab presents the fingerprint predicted by CSI:FingerID, independently of any database searching.

  2. Supplementary Figure 2 Job View in the SIRIUS 4 graphical user interface.

    The job view displays name, type, state (running, queued, waiting and failed) and progress of SIRIUS 4 jobs in the job-scheduling system. Jobs can be canceled individually, and the job scheduler automatically handles potential dependencies. Logging information can be shown individually for each job.

  3. Supplementary Figure 3 Example of SIRIUS 4 isotope pattern analysis for MS/MS data.

    a, MSE spectrum for CASMI 2017 challenge 226, a derivative of cyclochlorotine with sum formula C24H31Cl2N5O7. b, A single isotope pattern in the spectrum is highlighted and shown in detail. The simulated isotope pattern of C23H28Cl2NO7 is drawn below in red. c, Part of the fragmentation graph corresponding to this spectrum. Yellow nodes correspond to the isotope peaks of C23H28Cl2NO7. We see that the first isotope peak of C23H28Cl2NO7 can also be explained as the monoisotopic peak of C24H24ClN3O7.

  4. Supplementary Figure 4 Support Vector Machine for classifying molecular formulas of biomolecules.

    a, Histogram and kernel density plots of the linear Support Vector Machine (SVM); plotted are molecular formulas from the biomolecule structure database (Supplementary Table 2), PubChem, and a random subset of decompositions. b, Receiver operating characteristic (ROC) plot of the classifier, biomolecules versus random decompositions. The area under the curve (AUC) of ROC is 0.965.

  5. Supplementary Figure 5 Predicted fingerprint of the query clobutinol.

    Only molecular properties with at least four heavy atoms are displayed. Different from Fig. 1b–f, a second molecular property predicted to be present (green bars) has been selected; again, SIRIUS 4 displays a few example structures that contain the corresponding property. The two substructures (Fig. 1f and here) allow the user to deduce information about the query structure, without the need to query a molecular structure database.

  6. Supplementary Figure 6 Comparing SIRIUS 3.0 and current SIRIUS 4.

    Both versions are compared using isotope pattern (MS1) and MS/MS data from 3,965 compounds with mass ranging from 75 Da to 1,289 Da. a, Evaluation of molecular formula annotation. We report the number of instances where the correct molecular formula was ranked in the top k, for k = 1, …, 10. We evaluate exclusively the isotope pattern scoring of SIRIUS 3.0 (green diamonds) and SIRIUS 4 (blue diamonds), as well as the combined analysis of isotope patterns and MS/MS data (blue and green circles). b, Running time comparison, combined analysis. We sort compounds by mass, and report the time SIRIUS 3.0 and SIRIUS 4 require for computing the k% lightest compounds in the dataset. Note the logarithmic y-scale. SIRIUS 3.0 stopped after 154 h of computation with a timeout/memory exception, failing to compute the 90 heaviest compounds.

  7. Supplementary Figure 7 Methods evaluation, structural elucidation searching GNPS in PubChem.

    The percentage of correctly identified structures found in the top k output of a method. Searching N = 3,868 compounds from GNPS in PubChem (15 September 2014). CSI:FingerID 1.1 is evaluated here; identification rates for all other methods are taken from ref. 43 in the Supplementary Information reference list.

  8. Supplementary Figure 8 Methods evaluation, contribution of method enhancements and new data.

    Cumulative contribution of different aspects from version 1.0 to 1.1 of CSI:FingerID. New kernels (green) add ECFP fingerprints (red) and additional training data (violet).

  9. Supplementary Figure 9 Identification of leukotrienes on human skin: network cluster.

    A large network cluster indicating a number of structurally related compounds, with no compounds annotated via library search throughout the entire cluster.

  10. Supplementary Figure 10 Identification of leukotrienes on human skin: fragmentation tree.

    Fragmentation spectra and tree that explains the experimentally observed MS/MS fragmentation pattern of the ion with m/z 440.246.

  11. Supplementary Figure 11 Identification of leukotrienes on human skin: compound structure.

    Structure of the compound with the highest score, 14,15-leukotriene E4 (LTE4). This structure served as a starting point for annotation of other compounds in the cluster.

  12. Supplementary Figure 12 Identification of leukotrienes on human skin: spectral validation.

    Validation of LTE4 predicted by SIRIUS 4 with CSI:FingerID by spectral and retention time match with synthetic standard.
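
The top-k evaluation reported in Supplementary Figures 6 and 7 (how often the correct candidate appears among the first k results) can be illustrated with a short Python sketch. This is not the evaluation code used by the authors: the function names are hypothetical, the input is assumed to be the 1-based rank of the correct candidate for each query compound (None if it was not returned at all), and ties between equally scored candidates are ignored.

```python
from typing import List, Optional, Sequence


def top_k_counts(ranks: Sequence[Optional[int]], max_k: int = 10) -> List[int]:
    """Count, for k = 1..max_k, the queries whose correct candidate is ranked in the top k.

    `ranks[i]` is the 1-based rank of the correct candidate for query i,
    or None if the correct candidate is missing from the result list.
    """
    # Histogram of exact ranks, restricted to 1..max_k.
    exact = [0] * max_k
    for rank in ranks:
        if rank is not None and 1 <= rank <= max_k:
            exact[rank - 1] += 1

    # Cumulative sum turns "ranked exactly k" into "ranked in the top k".
    counts = []
    running_total = 0
    for hits in exact:
        running_total += hits
        counts.append(running_total)
    return counts


def top_k_rates(ranks: Sequence[Optional[int]], max_k: int = 10) -> List[float]:
    """Same as top_k_counts, expressed as a percentage of all queries."""
    if not ranks:
        raise ValueError("ranks must contain at least one query")
    return [100.0 * count / len(ranks) for count in top_k_counts(ranks, max_k)]


# Toy data: six queries; one has no correct candidate, one is ranked outside the top 3.
example_ranks = [1, 3, None, 2, 1, 12]
print(top_k_counts(example_ranks, max_k=3))  # [2, 3, 4]
```

Queries whose correct candidate is missing count towards the denominator but never towards a hit, so the reported rate cannot be inflated by dropping difficult compounds.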

Supplementary information

  1. Supplementary Text and Figures

    Supplementary Figures 1–12, Supplementary Tables 1 and 2, Supplementary Notes 1–10 and Supplementary Results 1–6

  2. Reporting Summary

  3. Supplementary Protocol

    Data generation and analysis with SIRIUS 4 on two biological case studies.

  4. Supplementary Table 3

    Molecular formula identification with SIRIUS 3 and SIRIUS 4 using solely MS1 data or MS1 and MS/MS data.

  5. Supplementary Table 4

    CASMI 2016 data reanalysis with SIRIUS 4.

  6. Supplementary Table 5

    Molecular structure identification rates using CSI:FingerID version 1.0 versus 1.1.

About this article

DOI

https://doi.org/10.1038/s41592-019-0344-8