Mass spectrometry is a predominant experimental technique in metabolomics and related fields, but metabolite structural elucidation remains highly challenging. We report SIRIUS 4 (https://bio.informatik.uni-jena.de/sirius/), which provides a fast computational approach for molecular structure identification. SIRIUS 4 integrates CSI:FingerID for searching in molecular structure databases. Using SIRIUS 4, we achieved identification rates of more than 70% on challenging metabolomics datasets.
SIRIUS 4 is written in Java; is open source under the GNU General Public License (version 3); and works on Windows, macOS X, and Linux. In addition to the graphical front end, a comprehensive command-line version allows batch processing and integration into workflows; integration into GNPS (ref. 1), OpenMS (ref. 2), and MZmine (ref. 4) is ongoing. We also provide source code, executable binaries, documentation, support, non-commercial training data, example files, and additional information on the SIRIUS website (https://bio.informatik.uni-jena.de/sirius/); a source copy is hosted on GitHub (https://github.com/boecker-lab/sirius). You can retrieve the InChIs of all compounds used to train CSI:FingerID from the web service (https://www.csi-fingerid.uni-jena.de/webapi/trainingstructures.csv?predictor=pos and https://www.csi-fingerid.uni-jena.de/webapi/trainingstructures.csv?predictor=neg).
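As an illustration of how the training-structure lists might be consumed in a script, the following minimal Python sketch downloads one of the files named above and collects every InChI it contains. The column layout of the file is not described here, so the sketch makes an assumption: rather than relying on a particular delimiter or header, it simply picks out every token that starts with "InChI=" and ends at whitespace or a quotation mark. The function names are hypothetical, and the web service may have changed since publication.

```python
import re
from urllib.request import urlopen

# URL quoted in the availability statement (positive ion mode predictor).
TRAINING_URL = (
    "https://www.csi-fingerid.uni-jena.de/webapi/"
    "trainingstructures.csv?predictor=pos"
)

# Assumption: the file layout is unknown, so we match InChI tokens directly.
# An InChI contains no whitespace; we also stop at a double quote in case
# the fields are quoted.
INCHI_PATTERN = re.compile(r'InChI=[^\s"]+')


def extract_inchis(text):
    """Return all InChI strings found in the text, in order of appearance."""
    return INCHI_PATTERN.findall(text)


def fetch_training_inchis(url=TRAINING_URL):
    """Download the training-structure file and return its InChI strings."""
    with urlopen(url) as response:
        return extract_inchis(response.read().decode("utf-8"))


if __name__ == "__main__":
    inchis = fetch_training_inchis()
    print(f"Retrieved {len(inchis)} training structures")
```

Replacing `predictor=pos` with `predictor=neg` in the URL would address the second file in the same way.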
Data for the CASMI 2016 re-evaluation are available from https://bio.informatik.uni-jena.de/data under a Creative Commons CC-BY license. Cross-validation data for the GNPS search re-evaluation are available from https://bio.informatik.uni-jena.de/data/ (Creative Commons CC0 1.0 Universal license). Data for the American Gut project are available in the MassIVE database (MSV000080186 and MSV000080187; Creative Commons CC0 1.0 Universal license). The analysis can be accessed via the GNPS website (http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=9bd16822c8d448f59a03e6cc8f017f43 and http://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=d26ae082b1154f73ac050796fcaa6bda). Data for the study of clothing with antibacterial properties are available at MassIVE (MSV000081379; Creative Commons CC0 1.0 Universal license). Analysis is available at the GNPS website (https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=a5e8ca1b7a9c42cfb45fbb2855e36721). Source data for Supplementary Figs. 6–8 are available online.
Wang, M. et al. Nat. Biotechnol. 34, 828–837 (2016).
Röst, H. L. et al. Nat. Methods 13, 741–748 (2016).
Tsugawa, H. et al. Nat. Methods 12, 523–526 (2015).
Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. BMC Bioinformatics 11, 395 (2010).
Smith, C. A., Want, E. J., O’Maille, G., Abagyan, R. & Siuzdak, G. Anal. Chem. 78, 779–787 (2006).
Böcker, S., Letzel, M. C., Lipták, Z. & Pervukhin, A. Bioinformatics 25, 218–224 (2009).
Böcker, S. & Rasche, F. Bioinformatics 24, i49–i55 (2008).
Böcker, S. & Dührkop, K. J. Cheminform. 8, 5 (2016).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Shen, H., Dührkop, K., Böcker, S. & Rousu, J. Bioinformatics 30, i157–i164 (2014).
Heinonen, M., Shen, H., Zamboni, N. & Rousu, J. Bioinformatics 28, 2333–2341 (2012).
Pirhaji, L. et al. Nat. Methods 13, 770–776 (2016).
Hatzimanikatis, V. et al. Bioinformatics 21, 1603–1609 (2005).
Meusel, M. et al. Anal. Chem. 88, 7556–7566 (2016).
Ludwig, M., Dührkop, K. & Böcker, S. Bioinformatics 34, i333–i340 (2018).
Kim, S. et al. Nucleic Acids Res. 44, D1202–D1213 (2016).
Jeffryes, J. G. et al. J. Cheminform. 7, 44 (2015).
Schymanski, E. L. et al. J. Cheminform. 9, 22 (2017).
Pence, H. E. & Williams, A. J. Chem. Educ. 87, 1123–1124 (2010).
CASMI 2017 Team. And the results are. CASMI 2017 http://www.casmi-contest.org/2017/results.shtml (2017).
Cohen, L. J. et al. Nature 549, 48–53 (2017).
Dührkop, K., Ludwig, M., Meusel, M. & Böcker, S. in Algorithms in Bioinformatics (WABI 2013) (eds Darling, A. & Stoye, J.) 45–58 (Springer, Berlin, 2013).
Böcker, S. & Lipták, Zs. Algorithmica 48, 413–432 (2007).
Böcker, S., Letzel, M. C., Lipták, Zs. & Pervukhin, A. in Algorithms in Bioinformatics (WABI 2006) (eds Bücher, P. & Moret, B. M. E.) 12–23 (Springer, Berlin, 2006).
Rauf, I., Rasche, F., Nicolas, F. & Böcker, S. J. Comput. Biol. 20, 311–321 (2013).
White, W. T. J., Beyer, S., Dührkop, K., Chimani, M. & Böcker, S. in Computing and Combinatorics (COCOON 2015) (eds Xu, D., Du, D. & Du, D.) 310–322 (Springer, Cham, 2015).
Dührkop, K., Lataretu, M. A., White, W. T. J. & Böcker, S. in Proc. 18th International Workshop on Algorithms in Bioinformatics (WABI 2018) (eds Parida, L. & Ukkonen, E.) 23:1–23:14 (Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany, 2018).
GNU Linear Programming Kit (GLPK) v. 4.60 https://www.gnu.org/software/glpk (Free Software Foundation, 2016).
CPLEX v. 12.8 https://www.ibm.com/analytics/cplex-optimizer (IBM, 2017).
Senior, J. Am. J. Math. 73, 663–689 (1951).
Pluskal, T., Uehara, T. & Yanagida, M. Anal. Chem. 84, 4396–4403 (2012).
Dührkop, K., Hufsky, F. & Böcker, S. Mass Spectrom. (Tokyo) 3, S0037 (2014).
LeCun, Y., Bengio, Y. & Hinton, G. Nature 521, 436–444 (2015).
Böcker, S. & Mäkinen, V. IEEE/ACM Trans. Comput. Biol. Bioinform. 5, 91–100 (2008).
Cortes, C., Mohri, M. & Rostamizadeh, A. J. Mach. Learn. Res. 13, 795–828 (2012).
Shen, H., Szedmak, S., Brouard, C. & Rousu, J. in Discovery Science (DS 2016) (eds Calders, T., Ceci, M. & Malerba, D.) 427–441 (Springer, Cham, 2016).
Horai, H. et al. J. Mass. Spectrom. 45, 703–714 (2010).
Brodley, C. E. & Friedl, M. A. J. Artif. Intell. Res. 11, 131–167 (1999).
Rogers, D. & Hahn, M. J. Chem. Inf. Model. 50, 742–754 (2010).
Willighagen, E. L. et al. J. Cheminform. 9, 33 (2017).
Wang, R., Gao, Y. & Lai, L. Perspect. Drug Discov. Des. 19, 47–66 (2000).
Steinbeck, C. et al. J. Chem. Inf. Comput. Sci. 43, 493–500 (2003).
SIRIUS v. 4.0.1 https://bio.informatik.uni-jena.de/software/sirius (Friedrich-Schiller-University Jena, 2018).
Melnik, A. et al. Data generation and analysis with SIRIUS 4 on two biological case studies. Protocol Exchange https://doi.org/10.1038/protex.2018.133 (2019).
We gratefully acknowledge financial support by the Deutsche Forschungsgemeinschaft (BO 1910/20) to S.B. and the Academy of Finland (310107/MACOME) to J.R. We thank the GNPS community, S. Stein, F. Kuhlmann, and Agilent Technologies Inc. (Santa Clara, CA, USA) for providing data that were used to estimate the hyperparameters of SIRIUS 4 and to train CSI:FingerID. We also thank F. Kuhlmann and Agilent Technologies for data used to evaluate the isotope scoring.
S.B. holds patents (Japanese patent 5559816 and US patent 8263931) whose value might be affected by this publication.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Integrated supplementary information
a, The molecular formula ‘SIRIUS Overview’ tab displays all molecular formula candidates of some query compound in a single display. b, The ‘Spectra view’ tab shows the individual mass spectra. c, Similarly, the ‘Tree view’ tab allows a closer look at the fragmentation trees, for each molecular formula candidate. d,e, The next two tabs show the result of the CSI:FingerID molecular structure search: (d) the ‘CSI:FingerID Overview’ tab summarizes all molecular formula candidates, whereas (e) the ‘CSI:FingerID Details’ tab presents results for each molecular formula candidate individually. Results can be filtered and searched; database links are provided for candidates when possible. f, Finally, the ‘Predicted Fingerprint’ tab presents the fingerprint predicted by CSI:FingerID, independently of any database searching.
The job view displays name, type, state (running, queued, waiting and failed) and progress of SIRIUS 4 jobs in the job-scheduling system. Jobs can be canceled individually, and the job scheduler automatically handles potential dependencies. Logging information can be shown individually for each job.
a, MSE spectrum for CASMI 2017 challenge 226, a derivative of cyclochlorotine with sum formula C24H31Cl2N5O7. b, A single isotope pattern in the spectrum is highlighted and shown in detail. The simulated isotope pattern of C23H28Cl2NO7 is drawn below in red. c, Part of the fragmentation graph corresponding to this spectrum. Yellow nodes correspond to the isotope peaks of C23H28Cl2NO7. We see that the first isotope peak of C23H28Cl2NO7 can also be explained as the monoisotopic peak of C24H24ClN3O7.
a, Histogram and kernel density plots of the linear Support Vector Machine (SVM); plotted are molecular formulas from the biomolecule structure database (Supplementary Table 2), PubChem, and a random subset of decompositions. b, Receiver operating characteristic (ROC) plot of the classifier, biomolecules versus random decompositions. The area under the curve (AUC) of ROC is 0.965.
Only molecular properties with at least four heavy atoms are displayed. Different from Fig. 1b–f, a second molecular property predicted to be present (green bars) has been selected; again, SIRIUS 4 displays a few example structures that contain the corresponding property. The two substructures (Fig. 1f and here) allow the user to deduce information about the query structure, without the need to query a molecular structure database.
Both versions are compared using isotope pattern (MS1) and MS/MS data from 3,965 compounds with mass ranging from 75 Da to 1,289 Da. a, Evaluation of molecular formula annotation. We report the number of instances where the correct molecular formula was ranked in the top k, for k = 1, …, 10. We evaluate exclusively the isotope pattern scoring of SIRIUS 3.0 (green diamonds) and SIRIUS 4 (blue diamonds), as well as the combined analysis of isotope patterns and MS/MS data (blue and green circles). b, Running time comparison, combined analysis. We sort compounds by mass, and report the time SIRIUS 3.0 and SIRIUS 4 require for computing the k% lightest compounds in the dataset. Note the logarithmic y-scale. SIRIUS 3.0 stopped after 154 h of computation with a timeout/memory exception, failing to compute the 90 heaviest compounds.
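The top-k statistic reported in panel a can be sketched in a few lines of Python. This is only an illustration of the counting scheme described in the caption, not the evaluation code used for the figure; the function name and the input format (one rank per compound, with None when the correct molecular formula is absent from the candidate list) are assumptions.

```python
def top_k_counts(ranks, max_k=10):
    """Count, for each k = 1..max_k, the compounds whose correct molecular
    formula was ranked within the top k.

    ranks: iterable of 1-based ranks of the correct formula, one entry per
           compound; None means the correct formula was not among the
           candidates (hypothetical input convention).
    Returns a list where entry k-1 holds the count for top k.
    """
    counts = [0] * max_k
    for rank in ranks:
        if rank is None or rank > max_k:
            continue  # never counted within the first max_k positions
        # A compound ranked at position r counts for every k >= r.
        for k in range(rank, max_k + 1):
            counts[k - 1] += 1
    return counts


if __name__ == "__main__":
    # Toy ranks for five compounds; not data from the study.
    example_ranks = [1, 1, 3, None, 12]
    for k, count in enumerate(top_k_counts(example_ranks, max_k=5), start=1):
        print(f"top {k}: {count} of {len(example_ranks)}")
```

Dividing each count by the number of compounds gives the identification rate as a percentage of the dataset.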
The percentage of correctly identified structures found in the top k output of a method. Searching N = 3,868 compounds from GNPS in PubChem (15 September 2014). CSI:FingerID 1.1 is evaluated here; identification rates for all other methods are taken from ref. 43 in the Supplementary Information reference list.
Cumulative contribution of different aspects from version 1.0 to 1.1 of CSI:FingerID. New kernels (green) add ECFP fingerprints (red) and additional training data (violet).
A large network cluster indicating a number of structurally related compounds, with no compounds annotated via library search throughout the entire cluster.
Fragmentation spectra and tree that explains the experimentally observed MS/MS fragmentation pattern of the ion with m/z 440.246.
Structure of the compound with the highest score, 14,15-leukotriene E4 (LTE4). This structure served as a starting point for annotation of other compounds in the cluster.
Validation of LTE4 predicted by SIRIUS 4 with CSI:FingerID by spectral and retention time match with synthetic standard.
Supplementary Figures 1–12, Supplementary Tables 1 and 2, Supplementary Notes 1–10 and Supplementary Results 1–6
Data generation and analysis with SIRIUS 4 on two biological case studies.
Molecular formula identification with SIRIUS 3 and SIRIUS 4 using solely MS1 data or MS1 and MS/MS data.
CASMI 2016 data reanalysis with SIRIUS 4.
Molecular structure identification rates using CSI:FingerID version 1.0 versus 1.1.
About this article
Cite this article
Dührkop, K., Fleischauer, M., Ludwig, M. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat Methods 16, 299–302 (2019). https://doi.org/10.1038/s41592-019-0344-8