Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra

Abstract

Metabolomics using nontargeted tandem mass spectrometry can detect thousands of molecules in a biological sample. However, structural molecule annotation is limited to structures present in libraries or databases, restricting analysis and interpretation of experimental data. Here we describe CANOPUS (class assignment and ontology prediction using mass spectrometry), a computational tool for systematic compound class annotation. CANOPUS uses a deep neural network to predict 2,497 compound classes from fragmentation spectra, including all biologically relevant classes. CANOPUS explicitly targets compounds for which neither spectral nor structural reference data are available and predicts classes lacking tandem mass spectrometry training data. In evaluation using reference data, CANOPUS reached very high prediction performance (average accuracy of 99.7% in cross-validation) and outperformed four baseline methods. We demonstrate the broad utility of CANOPUS by investigating the effect of microbial colonization in the mouse digestive system, through analysis of the chemodiversity of different Euphorbia plants and regarding the discovery of a marine natural product, revealing biological insights at the compound class level.

Fig. 1: CANOPUS workflow.
Fig. 2: Method evaluation: number of ClassyFire compound classes predicted with a particular performance measure.
Fig. 3: Comparing the digestive system of GF and SPF mice.
Fig. 4: Molecular network of daidzein.
Fig. 5: Compound class distribution in Euphorbia species.
Fig. 6: Structural analysis of rivulariapeptolide 1155 using CANOPUS.
Fig. 7: Heterogeneous training for compound class prediction.

Data availability

Input mzML/mzXML files are available at MassIVE (https://massive.ucsd.edu/) with the accession nos. MSV000079949 (mice data) and MSV000081082 (Euphorbia data). The mass spectrometry data for Rivularia sp. cyanobacteria were deposited at MassIVE (accession no. MSV000085578). The spectra for rivulariapeptolide 1155 were annotated in the GNPS spectral library (accession nos. CCMSLIB00005723986 and CCMSLIB00005723388). The structure database with ClassyFire annotations, the publicly available part of the evaluation data and the Cytoscape files for network visualization can be downloaded from https://bio.informatik.uni-jena.de/data/ and https://doi.org/10.6084/m9.figshare.13073051. Source data are provided with this paper.

Code availability

CANOPUS is part of SIRIUS software and can be downloaded from https://bio.informatik.uni-jena.de/software/canopus/. The source code of CANOPUS is available at https://github.com/boecker-lab/sirius-libs. The scripts for analysis and visualization of CANOPUS results are available at https://github.com/kaibioinfo/canopus_treemap.

References

  1. Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
  2. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
  3. Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
  4. Kind, T. et al. Identification of small molecules using accurate mass MS/MS search. Mass Spectrom. Rev. 37, 513–532 (2018).
  5. Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI–MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
  6. Brouard, C. et al. Fast metabolite identification with Input Output Kernel Regression. Bioinformatics 32, i28–i36 (2016).
  7. Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
  8. Ridder, L. et al. Automatic chemical structure annotation of an LC-MSn based metabolic profile from green tea. Anal. Chem. 85, 6033–6040 (2013).
  9. Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).
  10. Tsugawa, H. et al. Hydrogen rearrangement rules: computational MS/MS fragmentation and structure elucidation using MS-FINDER Software. Anal. Chem. 88, 7946–7958 (2016).
  11. Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
  12. Blaženović, I., Kind, T., Ji, J. & Fiehn, O. Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites 8, 31 (2018).
  13. Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).
  14. Tsugawa, H. Advances in computational metabolomics and databases deepen the understanding of metabolisms. Curr. Opin. Biotechnol. 54, 10–17 (2018).
  15. Montenegro-Burke, J. R., Guijas, C. & Siuzdak, G. METLIN: a tandem mass spectral library of standards. Methods Mol. Biol. 2104, 149–163 (2020).
  16. Vinaixa, M. et al. Mass spectral databases for LC/MS- and GC/MS-based metabolomics: state of the field and future prospects. Trends Anal. Chem. 78, 23–35 (2016).
  17. Aksenov, A. A., Silva, R., Knight, R., Lopes, N. P. & Dorrestein, P. C. Global chemical analysis of biology by mass spectrometry. Nat. Rev. Chem. 1, 0054 (2017).
  18. Frainay, C. et al. Mind the gap: mapping mass spectral databases in genome-scale metabolic networks reveals poorly covered areas. Metabolites 8, 51 (2018).
  19. Venkataraghavan, R., McLafferty, F. W. & Lear, G. E. Computer-aided interpretation of mass spectra. Org. Mass Spectrom. 2, 1–15 (1969).
  20. Curry, B. & Rumelhart, D. E. MSnet: a neural network that classifies mass spectra. Tetrahedron Comput. Methodol. 3, 213–237 (1990).
  21. Werther, W., Lohninger, H., Stancl, F. & Varmuza, K. Classification of mass spectra: a comparison of yes/no classification methods for the recognition of simple structural properties. Chemom. Intell. Lab. Syst. 22, 63–76 (1994).
  22. Heinonen, M., Shen, H., Zamboni, N. & Rousu, J. Metabolite identification and molecular fingerprint prediction via machine learning. Bioinformatics 28, 2333–2341 (2012).
  23. Hastings, J. et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 41, D456–D463 (2013).
  24. Rogers, F. B. Communications to the editor. Bull. Med. Libr. Assoc. 51, 114–116 (1963).
  25. Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).
  26. Blaženović, I. et al. Structure annotation of all mass spectra in untargeted metabolomics. Anal. Chem. 91, 2155–2162 (2019).
  27. Ernst, M. et al. Assessing specialized metabolite diversity in the cosmopolitan plant genus Euphorbia L. Front. Plant Sci. 10, 846 (2019).
  28. Tsugawa, H. et al. A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. Nat. Methods 16, 295–298 (2019).
  29. Barupal, D. K. & Fiehn, O. Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets. Sci. Rep. 7, 14567 (2017).
  30. Rasche, F. et al. Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84, 3417–3426 (2012).
  31. Treutler, H. et al. Discovering regulated metabolite families in untargeted metabolomics studies. Anal. Chem. 88, 8082–8090 (2016).
  32. Ernst, M. et al. MolNetEnhancer: enhanced molecular networks by integrating metabolome mining and annotation tools. Metabolites 9, 144 (2019).
  33. Lowry, S. R. et al. Comparison of various K-nearest neighbor voting schemes with the self-training interpretive and retrieval system for identifying molecular substructures from mass spectral data. Anal. Chem. 49, 1720–1722 (1977).
  34. Askenazi, M. & Linial, M. ARISTO: ontological classification of small molecules by electron ionization-mass spectrometry. Nucleic Acids Res. 39, W505–W510 (2011).
  35. Peters, K. et al. Chemical diversity and classification of secondary metabolites in nine bryophyte species. Metabolites 9, 222 (2019).
  36. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  37. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).
  38. Wolf, S., Schmidt, S., Müller-Hannemann, M. & Neumann, S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics 11, 148 (2010).
  39. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
  40. Cooper, B. T. et al. Hybrid search: a method for identifying metabolites absent from tandem mass spectrometry libraries. Anal. Chem. 91, 13924–13932 (2019).
  41. Allard, P.-M. et al. Integration of molecular networking and in-silico MS/MS fragmentation for natural products dereplication. Anal. Chem. 88, 3317–3323 (2016).
  42. Silva, R. R. et al. Propagating annotations of molecular networks using in silico fragmentation. PLoS Comput. Biol. 14, e1006089 (2018).
  43. Fox Ramos, A. E. et al. CANPA: computer-assisted natural products anticipation. Anal. Chem. 91, 11247–11252 (2019).
  44. Quinn, R. A. et al. Global chemical effects of the microbiome include new bile-acid conjugations. Nature 579, 123–129 (2020).
  45. Minamida, K. et al. Production of equol from daidzein by Gram-positive rod-shaped bacterium isolated from rat intestine. J. Biosci. Bioeng. 102, 247–250 (2006).
  46. Quinn, R. A. et al. Molecular networking as a drug discovery, drug metabolism, and precision medicine strategy. Trends Pharmacol. Sci. 38, 143–154 (2017).
  47. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
  48. Hooft, J. J. J., Wandy, J., Barrett, M. P., Burgess, K. E. V. & Rogers, S. Topic modeling for untargeted substructure exploration in metabolomics. Proc. Natl Acad. Sci. USA 113, 13738–13743 (2016).
  49. Vasas, A. & Hohmann, J. Euphorbia diterpenes: isolation, structure, biological activity, and synthesis (2008–2012). Chem. Rev. 114, 8579–8612 (2014).
  50. Yang, M. et al. Studies on the fragmentation pathways of ingenol esters isolated from Euphorbia esula using IT-MSn and Q-TOF-MS/MS methods in electrospray ionization mode. Int. J. Mass Spectrom. 323–324, 55–62 (2012).
  51. Riina, R. et al. A worldwide molecular phylogeny and classification of the leafy spurges, Euphorbia subgenus Esula (Euphorbiaceae). TAXON 62, 316–342 (2013).
  52. Horn, J. W. et al. Phylogenetics and the evolution of major structural characters in the giant genus Euphorbia L. (Euphorbiaceae). Mol. Phylogenet. Evol. 63, 305–326 (2012).
  53. Horn, J. W. et al. Evolutionary bursts in Euphorbia (Euphorbiaceae) are linked with photosynthetic pathway. Evolution 68, 3485–3504 (2014).
  54. Peirson, J. A., Bruyns, P. V., Riina, R., Morawetz, J. J. & Berry, P. E. A molecular phylogeny and classification of the largely succulent and mainly African Euphorbia subg. Athymalus (Euphorbiaceae). TAXON 62, 1178–1199 (2013).
  55. Dorsey, B. L. et al. Phylogenetics, morphological evolution, and classification of Euphorbia subgenus Euphorbia. TAXON 62, 291–315 (2013).
  56. Yang, Y. et al. Molecular phylogenetics and classification of Euphorbia subgenus Chamaesyce (Euphorbiaceae). TAXON 61, 764–789 (2012).
  57. Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, 395 (2010).
  58. Nothias, L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).
  59. Schmid, R. et al. Ion identity molecular networking in the GNPS Environment. Preprint at bioRxiv https://doi.org/10.1101/2020.05.11.088948 (2020).
  60. Röst, H. L. et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat. Methods 13, 741–748 (2016).
  61. Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).
  62. Shinbo, Y. et al. in Plant Metabolomics Vol. 57 (eds Saito, K. et al.) 165–181 (Springer, 2006).
  63. Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 46, D608–D617 (2018).
  64. Kanehisa, M., Goto, S., Kawashima, S. & Nakaya, A. The KEGG databases at GenomeNet. Nucleic Acids Res. 30, 42–46 (2002).
  65. Bobach, C., Böhme, T., Laube, U., Püschel, A. & Weber, L. Automated compound classification using a chemical ontology. J. Cheminform. 4, 40 (2012).
  66. Klekota, J. & Roth, F. P. Chemical substructures that enrich for biological activity. Bioinformatics 24, 2518–2525 (2008).
  67. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
  68. Willighagen, E. L. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. 9, 33 (2017).
  69. Hähnke, V. D., Kim, S. & Bolton, E. E. PubChem chemical structure standardization. J. Cheminform. 10, 36 (2018).
  70. Rogers, D. J. & Tanimoto, T. T. A computer program for classifying plants. Science 132, 1115–1118 (1960).
  71. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
  72. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
  73. Abadi, M. N. et al. in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (eds Keeton, K. & Roscoe, T.) 265–283 (USENIX, 2016).
  74. Platt, J. C. Advances in Large Margin Classifiers (MIT Press, 2000).
  75. Böcker, S. & Dührkop, K. Fragmentation trees reloaded. J. Cheminform. 8, 5 (2016).
  76. Ludwig, M. et al. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat. Mach. Intell. 2, 629–641 (2020).
  77. Moorthy, A. S., Wallace, W. E., Kearsley, A. J., Tchekhovskoi, D. V. & Stein, S. E. Combining fragment-ion and neutral-loss matching during mass spectral library searching: a new general purpose algorithm applicable to illicit drug identification. Anal. Chem. 89, 13261–13268 (2017).
  78. Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947).
  79. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

Acknowledgements

We thank the Deutsche Forschungsgemeinschaft for providing financial support (no. BO 1910/20 to S.B., K.D. and M.L. and no. PE 2600/1 to D.P.), and the Academy of Finland (no. 310107/MACOME to J.R.). P.C.D., R.R. and W.H.G. were supported by the Gordon and Betty Moore Foundation (no. GBMF7622) and by the US National Institutes of Health (NIH; no. R01 GM107550). P.C.D. was supported by NIH grants nos. P41 GM103484 and R03 CA211211. L.-F.N. was supported by NIH grant no. R01 GM107550 and by the European Union’s Horizon 2020 program (MSCA-GF, no. 704786). We thank F. Kuhlmann and Agilent Technologies, Inc. for providing data used in the evaluation of CANOPUS. We thank Y. Djoumbou Feunang, D. Arndt and D. Wishart for providing ClassyFire annotations for a database of molecular structures. We thank K. Alexander, E. Caro-Diaz and B. Naman for assistance with the collection of Rivularia sp. Further, we thank S. Whitner and K. Joosten for 16S ribosomal DNA analysis. We thank M. Ernst for valuable discussions on the Euphorbia plant study, and J. van der Hooft and S. Rogers for feedback on the manuscript.

Author information

Contributions

K.D., J.R. and S.B. designed the research. K.D. and S.B. developed the computational method. K.D. implemented the computational method with contributions from M.L., M.F. and M.A.H. M.F. integrated CANOPUS into SIRIUS v.4.4. K.D., L.-F.N. and P.C.D. applied and evaluated the method in the mouse and Euphorbia studies. R.R. isolated rivulariapeptolide 1155 and applied CANOPUS (on mass spectrometry data collected and analyzed by D.P. and R.R. and supervised by W.H.G.) and one-/two-dimensional NMR analysis for its structural elucidation. K.D., S.B., L.-F.N. and R.R. wrote the manuscript, in concert with all authors.

Corresponding author

Correspondence to Sebastian Böcker.

Ethics declarations

Competing interests

S.B., K.D., M.L., M.F. and M.A.H. are cofounders of Bright Giant GmbH. P.C.D. is scientific advisor for Sirenas LLC.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 CANOPUS performance sunburst plot.

Matthews correlation coefficient (MCC) for the 782 of 2,497 compound classes with at least 50 positive examples (SVM training dataset). A darker green coloring corresponds to better prediction performance for the class. The size of each slice is chosen such that all classes fit into the figure and has no further meaning. Inner slices represent parent classes of outer slices.
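For readers unfamiliar with the measure, the MCC of a single compound class can be computed from the four confusion-matrix counts. The sketch below is a generic illustration and not the CANOPUS evaluation code; the function name and the counts are hypothetical, chosen only to show that accuracy can stay very high for a rare class while the MCC is noticeably lower.

```python
import math


def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when the denominator vanishes (a common convention,
    assumed here for illustration).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical counts for one rare compound class (not taken from the paper):
# 50 true members among 10,000 query spectra.
tp, fp, tn, fn = 40, 10, 9940, 10

accuracy = (tp + tn) / (tp + fp + tn + fn)
mcc = matthews_corrcoef(tp, fp, tn, fn)

print(f"accuracy = {accuracy:.3f}, MCC = {mcc:.3f}")
```

Because true negatives dominate for rare classes, accuracy alone says little about how well the positives are recovered, which is one reason a per-class MCC is informative.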

Extended Data Fig. 2 Effect of removing a subclass from the MS/MS training data.

a–c, Regular evaluation setup: classes and subclasses are distributed into cross-validation folds, ensuring that methods are never evaluated on the same MS/MS data or structures they were trained on. d–f, We remove all flavonoid glycosides (the subclass) from the MS/MS training data (d), and then evaluate the predictor for glycosides (the class) on these removed MS/MS spectra (e). A perfect method would still classify all flavonoid glycoside MS/MS spectra as glycosides (f). CANOPUS exhibits only a small drop (98% to 97%) in correct classifications (c,f). In contrast, direct prediction performed mostly on par with CANOPUS before removing flavonoid glycosides from the MS/MS training data (c), but misses almost all of them (8%) afterwards (f). We were able to attribute this to the presence of isoflavonoid glycosides in the training data; these do not belong to the flavonoid class, but have highly similar structures and MS/MS spectra, except for the presence of a sugar residue. We observed that direct prediction in (d–f) uses the presence of a sugar residue to infer that an MS/MS spectrum is not a glycoside. In contrast, CANOPUS does not fall for this ‘bait’; heterogeneous training allows us to integrate the substantially more comprehensive structure data in its predictions.

Extended Data Fig. 3 Relative number of compounds annotated at varying ClassyFire class levels in the mice study (a) and the Euphorbia plant study (b).

The ClassyFire ChemOnt ontology is organized as a tree, where the Kingdom is either Organic compounds or Inorganic compounds. Superclasses such as Lipids and lipid-like molecules and Benzenoids are children of a Kingdom. Flavonoids and Steroids and steroid derivatives are examples of the Class level, while Flavonoid glycosides and Bile acids, alcohols and derivatives are examples of subclasses. There can be up to 11 levels in the ontology. c, ClassyFire classes of compounds in the biological databases. We observe a similar distribution of class levels as for the two biological datasets, indicating that CANOPUS is comprehensively classifying compounds at all possible compound class levels.
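The tree structure described above can be sketched with a simple child-to-parent mapping. This is a minimal illustration under stated assumptions: the dictionary holds only a tiny hand-picked fragment of ChemOnt, and the names `PARENT`, `lineage` and `level_name` are hypothetical helpers, not part of ClassyFire or CANOPUS.

```python
# Hypothetical fragment of the ChemOnt tree: child -> parent (None marks a Kingdom).
PARENT = {
    "Organic compounds": None,
    "Phenylpropanoids and polyketides": "Organic compounds",
    "Flavonoids": "Phenylpropanoids and polyketides",
    "Flavonoid glycosides": "Flavonoids",
}

# Names of the first four levels; deeper levels are simply numbered here.
LEVEL_NAMES = ["Kingdom", "Superclass", "Class", "Subclass"]


def lineage(cls: str) -> list[str]:
    """Return the path from the Kingdom down to the given class."""
    path = []
    node = cls
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path[::-1]


def level_name(cls: str) -> str:
    """Name of the ontology level at which the class sits."""
    depth = len(lineage(cls)) - 1
    if depth < len(LEVEL_NAMES):
        return LEVEL_NAMES[depth]
    return f"Level {depth + 1}"


print(lineage("Flavonoid glycosides"))
print(level_name("Flavonoids"))
```

A compound assigned to a subclass implicitly belongs to every class on its lineage, which is why annotations can be counted at any level of the tree.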

Extended Data Fig. 4 Molecular network and compound class annotations (single class annotations) for the mice digestive system.

Node colors indicate the compound class annotated by CANOPUS; displayed compound classes were manually selected. When a compound is annotated with multiple classes, the class with the larger structural pattern is selected. Nodes are connected by an edge if the spectral similarity is 0.7 or higher.
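The edge rule stated above (connect two nodes when their spectral similarity is 0.7 or higher) can be sketched as a simple filter over pairwise scores. The sketch assumes the similarities have already been computed by some other tool; the node labels, the scores and the function name `build_edges` are hypothetical.

```python
THRESHOLD = 0.7  # minimum spectral similarity for an edge, as stated in the caption


def build_edges(similarity: dict[tuple[str, str], float],
                threshold: float = THRESHOLD) -> list[tuple[str, str]]:
    """Return the sorted node pairs whose similarity is at least the threshold."""
    return sorted(pair for pair, score in similarity.items() if score >= threshold)


# Hypothetical precomputed pairwise similarities between three spectra.
similarity = {
    ("A", "B"): 0.82,
    ("A", "C"): 0.70,
    ("B", "C"): 0.41,
}

edges = build_edges(similarity)
print(edges)
```

Note that the comparison is inclusive, so a pair scoring exactly 0.7 is connected.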

Extended Data Fig. 5 Molecular network and compound class annotations (multiple class annotations) for the mice digestive system.

Node colors indicate the compound class annotated by CANOPUS; compound classes are the same as in Supplementary Fig. 4. Compounds belonging to multiple classes are displayed as multicolored nodes. Nodes are connected by an edge if the spectral similarity is 0.7 or higher.

Extended Data Fig. 6 Number of compounds detected for each Euphorbia subgenus.

Orange bars indicate the number of compounds detected here, black ticks indicate the number of compounds reported in the original study. Higher numbers of detected features are not a measure of quality for the two methods, but depend mainly on the preprocessing executed before compound classification.

Extended Data Fig. 7 Number of compounds annotated as diterpenoids in different species of Euphorbia.

Left: absolute number of compounds. Right: relative number of compounds, that is, number of diterpenoids divided by total number of compounds in each species. Black ticks in the left figure mark the reported number of diterpenoids in the original study by Ernst et al.
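The relative number defined above is a per-species ratio. A minimal sketch, assuming hypothetical species labels and counts (none of these values come from the study):

```python
def relative_counts(class_counts: dict[str, int],
                    total_counts: dict[str, int]) -> dict[str, float]:
    """Fraction of compounds in each species that belong to the class of interest."""
    return {
        species: class_counts.get(species, 0) / total
        for species, total in total_counts.items()
        if total > 0  # skip species without any detected compounds
    }


# Hypothetical counts for two placeholder species.
diterpenoids = {"species_1": 30, "species_2": 5}
all_compounds = {"species_1": 120, "species_2": 50}

fractions = relative_counts(diterpenoids, all_compounds)
print(fractions)
```

Normalizing by the total makes species comparable even when preprocessing yields very different numbers of detected compounds.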

Extended Data Fig. 8 Number of compounds annotated as triterpenoids in different species of Euphorbia.

Left: absolute number of compounds. Right: relative number of compounds, that is, number of triterpenoids divided by total number of compounds in each species. Black ticks in the left figure mark the reported number of triterpenoids in the original study by Ernst et al.

Extended Data Fig. 9 Number of diterpenoids in different species of Euphorbia.

Black bars show the number of diterpenoids that have a benzoic acid ester (a), fatty acid ester (b) or two carboxylic acids (c).

Supplementary information

Supplementary Information

Supplementary Tables 4, 5 and 7, Figs. 1–10 and Notes 1 and 2.

Reporting Summary

Supplementary Data. Interactive comparison of Euphorbia plants. The user can select any two plant species to be compared; two sunburst plots then show the number of compounds annotated by CANOPUS for each compound class. Mouse-over displays details of a compound class, including the number and percentage of compounds belonging to this class and the ClassyFire ontology and description of the class.

Supplementary Table 1

Compound classes from ChemOnt ontology not predicted by CANOPUS.

Supplementary Table 2

Evaluation results for all query MS/MS from the SVM training dataset.

Supplementary Table 3

Performance of all evaluated methods for individual compound classes; evaluation on the independent dataset.

Supplementary Table 6

Standardized SMILES of all compound structures from MassBank and GNPS used as the SVM training dataset.

Cite this article

Dührkop, K., Nothias, LF., Fleischauer, M. et al. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0740-8
