Database-independent molecular formula annotation using Gibbs sampling through ZODIAC

A Publisher Correction to this article was published on 21 October 2020

This article has been updated

A preprint version of the article is available at bioRxiv.

The confident high-throughput identification of small molecules is one of the most challenging tasks in mass spectrometry-based metabolomics. Annotating the molecular formula of a compound is the first step towards its structural elucidation. Yet even the annotation of molecular formulas remains highly challenging. This is particularly so for large compounds above 500 daltons, and for de novo annotations, for which we consider all chemically feasible formulas. Here we present ZODIAC, a network-based algorithm for the de novo annotation of molecular formulas. Uniquely, it enables fully automated and swift processing of complete experimental runs, providing high-quality, high-confidence molecular formula annotations. This allows us to annotate novel molecular formulas that are absent from even the largest public structure databases. Our method re-ranks molecular formula candidates by considering joint fragments and losses between fragmentation trees. We employ Bayesian statistics and Gibbs sampling. Thorough algorithm engineering ensures fast processing in practice. We evaluate ZODIAC on five datasets, producing results substantially (up to 16.5-fold) better than for several other methods, including SIRIUS, which is the state-of-the-art algorithm for molecular formula annotation at present. Finally, we report and verify several novel molecular formulas annotated by ZODIAC.
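The re-ranking idea can be illustrated with a toy Gibbs sampler. This is a minimal sketch under stated assumptions, not the ZODIAC implementation: each compound has a few candidate formulas with a prior probability (standing in for a SIRIUS-derived score), and a hypothetical `edge_weight` function returns a non-negative support value for two candidates of different compounds (standing in for joint fragments and losses between fragmentation trees). The function name `gibbs_rerank`, the exponential form of the conditional distribution and all numbers are illustrative assumptions.

```python
import math
import random
from collections import Counter


def gibbs_rerank(priors, edge_weight, sweeps=2000, burn_in=500, seed=0):
    """Estimate marginal probabilities of candidates by Gibbs sampling.

    priors: list with one dict per compound, mapping candidate -> prior.
    edge_weight(f, g): non-negative support between two candidates
    (hypothetical stand-in for shared fragments and losses).
    """
    rng = random.Random(seed)
    # Start every compound at its candidate with the highest prior.
    state = [max(p, key=p.get) for p in priors]
    counts = [Counter() for _ in priors]
    for sweep in range(sweeps):
        for i, cand in enumerate(priors):
            formulas = list(cand)
            weights = []
            for f in formulas:
                # Support from the currently selected candidates of all
                # other compounds.
                support = sum(edge_weight(f, state[j])
                              for j in range(len(priors)) if j != i)
                weights.append(cand[f] * math.exp(support))
            # Resample compound i conditioned on the rest of the network.
            state[i] = rng.choices(formulas, weights=weights, k=1)[0]
        if sweep >= burn_in:
            for i, f in enumerate(state):
                counts[i][f] += 1
    kept = sweeps - burn_in
    return [{f: counts[i][f] / kept for f in cand}
            for i, cand in enumerate(priors)]


# Toy input (invented values): candidate "A" has the lower prior but is
# supported by the only candidate "C" of a second compound.
toy_priors = [{"A": 0.4, "B": 0.6}, {"C": 1.0}]
toy_support = {frozenset(("A", "C")): 2.0}


def toy_edge_weight(f, g):
    return toy_support.get(frozenset((f, g)), 0.0)


posterior = gibbs_rerank(toy_priors, toy_edge_weight)
```

In this toy network the supported candidate "A" overtakes "B" despite its lower prior, which is the kind of re-ranking effect the abstract describes.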

Fig. 1: Molecular formula annotation error rates.
Fig. 2: Percentage of correct annotations and number of compounds in relation to ZODIAC score.
Fig. 3: Annotation of a novel bromine-containing compound in the diatoms dataset.
Fig. 4: Annotation of a novel chlorine- and iodine-containing compound in the diatoms dataset.
Fig. 5: Running time comparison of SIRIUS and ZODIAC on five datasets.
Fig. 6: Gibbs sampling.

Data availability

Input mzML/mzXML files for the five datasets are available at MassIVE (https://massive.ucsd.edu/) under the following accession numbers: dendroides (MSV000080502), NIST1950 (MSV000081364), tomato (MSV000081463), diatoms (MSV000081731) and mice stool (MSV000079949). SIRIUS and ZODIAC results and a virtual machine on which to reproduce the data are available from https://bio.informatik.uni-jena.de/data/ and https://doi.org/10.6084/m9.figshare.12911171. Source data are provided with this paper.

Code availability

ZODIAC has been integrated into the SIRIUS software and is written in Java. It is open source under the GNU General Public License (version 3), and works on Windows, macOS X and Linux. A command-line version allows batch processing, and results can be visualized in a graphical user interface. We provide executable binaries, example files and additional information on the ZODIAC website (https://bio.informatik.uni-jena.de/software/zodiac/). A source copy is hosted on GitHub (https://github.com/boecker-lab/sirius-libs; ref. 60); the branch ‘zodiac_in_sirius_4_release’ contains the SIRIUS and ZODIAC code used for evaluation in this paper.

Change history

  • 21 October 2020

    An amendment to this paper has been published and can be accessed via a link at the top of the paper.

References

  1. Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucl. Acids Res. 46, D608–D617 (2018).

  2. Kim, S. et al. PubChem substance and compound databases. Nucl. Acids Res. 44, D1202–D1213 (2016).

  3. Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).

  4. Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).

  5. Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).

  6. Samaraweera, M. A., Hall, L. M., Hill, D. W. & Grant, D. F. Evaluation of an artificial neural network retention index model for chemical structure identification in nontargeted metabolomics. Anal. Chem. 90, 12752–12760 (2018).

  7. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).

  8. Dührkop, K. et al. Classes for the masses: systematic classification of unknowns using fragmentation spectra. Preprint at https://www.biorxiv.org/content/10.1101/2020.04.17.046672v1 (2020).

  9. Kind, T. & Fiehn, O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinform. 8, 105 (2007).

  10. Stein, S. E. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).

  11. Vinaixa, M. et al. Mass spectral databases for LC/MS- and GC/MS-based metabolomics: state of the field and future prospects. Trends Anal. Chem. 78, 23–35 (2016).

  12. Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).

  13. da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).

  14. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).

  15. Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).

  16. Alon, T. & Amirav, A. Isotope abundance analysis methods and software for improved sample identification with supersonic gas chromatography/mass spectrometry. Rapid Commun. Mass Spectrom. 20, 2579–2588 (2006).

  17. Böcker, S., Letzel, M., Lipták, Z. S. & Pervukhin, A. Decomposing metabolomic isotope patterns. In Proc. Works. Algorithms in Bioinformatics (WABI 2006) Vol. 4175, 12–23 (Springer, Berlin, 2006).

  18. Ojanperä, S. et al. Isotopic pattern and accurate mass determination in urine drug screening by liquid chromatography/time-of-flight mass spectrometry. Rapid Commun. Mass Spectrom. 20, 1161–1167 (2006).

  19. Böcker, S., Letzel, M., Lipták, Zs. & Pervukhin, A. SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics 25, 218–224 (2009).

  20. Pluskal, T., Uehara, T. & Yanagida, M. Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal. Chem. 84, 4396–4403 (2012).

  21. Valkenborg, D., Mertens, I., Lemière, F., Witters, E. & Burzykowski, T. The isotopic distribution conundrum. Mass Spectrom. Rev. 31, 96–109 (2012).

  22. Loos, M., Gerber, C., Corona, F., Hollender, J. & Singer, H. Accelerated isotope fine structure calculation using pruned transition trees. Anal. Chem. 87, 5738–5744 (2015).

  23. Böcker, S. & Rasche, F. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinformatics 24, i49–i55 (2008).

  24. Stravs, M. A., Schymanski, E. L., Singer, H. P. & Hollender, J. Automatic recalibration and processing of tandem mass spectra using formula annotation. J. Mass Spectrom. 48, 89–99 (2013).

  25. Böcker, S. & Dührkop, K. Fragmentation trees reloaded. J. Cheminform. 8, 5 (2016).

  26. Rogers, S., Scheltema, R. A., Girolami, M. & Breitling, R. Probabilistic assignment of formulas to mass peaks in metabolomics experiments. Bioinformatics 25, 512–518 (2009).

  27. Daly, R. et al. MetAssign: probabilistic annotation of metabolites from LC-MS data using a Bayesian clustering approach. Bioinformatics 30, 2764–2771 (2014).

  28. da Silva, R. R. et al. ProbMetab: an R package for Bayesian probabilistic annotation of LC-MS-based metabolomics. Bioinformatics 30, 1336–1337 (2014).

  29. Del Carratore, F. et al. Integrated probabilistic annotation: a Bayesian-based annotation method for metabolomic profiles integrating biochemical connections, isotope patterns, and adduct relationships. Anal. Chem. 91, 12799–12807 (2019).

  30. Tziotis, D., Hertkorn, N. & Schmitt-Kopplin, P. Kendrick-analogous network visualisation of ion cyclotron resonance Fourier transform mass spectra: improved options for the assignment of elemental compositions and the classification of organic molecular complexity. Eur. J. Mass Spectrom. 17, 415–421 (2011).

  31. Watrous, J. et al. Mass spectral molecular networking of living microbial colonies. Proc. Natl Acad. Sci. USA 109, E1743–E1752 (2012).

  32. Morreel, K. et al. Systematic structural characterization of metabolites in Arabidopsis via candidate substrate-product pair networks. Plant Cell 26, 929–945 (2014).

  33. Quinn, R. A. et al. Global chemical effects of the microbiome include new bile-acid conjugations. Nature 579, 123–129 (2020).

  34. Esposito, M. et al. Euphorbia dendroides latex as a source of jatrophane esters: isolation, structural analysis, conformational study, and anti-CHIKV activity. J. Natural Prod. 79, 2873–2882 (2016).

  35. Röst, H. L. et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat. Methods 13, 741–748 (2016).

  36. Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).

  37. Nothias, L.-F. et al. Bioactivity-based molecular networking for the discovery of drug leads in natural product bioassay-guided fractionation. J. Natural Prod. 81, 758–767 (2018).

  38. Pence, H. E. & Williams, A. ChemSpider: an online chemical information resource. J. Chem. Ed. 87, 1123–1124 (2010).

  39. Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinform. 11, 395 (2010).

  40. Nothias, L. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).

  41. Simón-Manso, Y. et al. Metabolite profiling of a NIST standard reference material for human plasma (SRM 1950): GC-MS, LC-MS, NMR, and clinical laboratory analyses, libraries, and web-based resources. Anal. Chem. 85, 11725–11731 (2013).

  42. Vos, R. C. H. D. et al. Untargeted large-scale plant metabolomics using liquid chromatography coupled to mass spectrometry. Nat. Protocols 2, 778–791 (2007).

  43. Agarwal, V. et al. Complexity of naturally produced polybrominated diphenyl ethers revealed via mass spectrometry. Environ. Sci. Technol. 49, 1339–1346 (2015).

  44. Andersen, R. & America, P. S. Algal Culturing Techniques (Elsevier Science, 2005).

  45. Dittmar, T., Koch, B., Hertkorn, N. & Kattner, G. A simple and efficient method for the solid-phase extraction of dissolved organic matter (SPE-DOM) from seawater. Limnol. Oceanogr. Meth. 6, 230–235 (2008).

  46. Petras, D. et al. High-resolution liquid chromatography tandem mass spectrometry enables large scale molecular characterization of dissolved organic matter. Front. Mar. Sci. 4, 405 (2017).

  47. Meusel, M. et al. Predicting the presence of uncommon elements in unknown biomolecules from isotope patterns. Anal. Chem. 88, 7556–7566 (2016).

  48. Sumner, L. W. et al. Proposed minimum reporting standards for chemical analysis. Metabolomics 3, 211–221 (2007).

  49. Karp, R. M. in Complexity of Computer Computations (eds Miller, R. E. & Thatcher, J. W.) 85–103 (Plenum Press, 1972).

  50. Downey, R. G. & Fellows, M. R. Parameterized Complexity (Springer, Berlin, 1999).

  51. Zuckerman, D. Linear degree extractors and the inapproximability of max clique and chromatic number. In Proc. ACM Symp. on Theory of Computing (STOC 2006) 681–690 (2006).

  52. Chen, J., Huang, X., Kanj, I. A. & Xia, G. Strong computational lower bounds via parameterized complexity. J. Comp. Syst. Sci. 72, 1346–1367 (2006).

  53. Impagliazzo, R. & Paturi, R. On the complexity of k-SAT. J. Comp. Syst. Sci. 62, 367–375 (2001).

  54. Rasche, F. et al. Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84, 3417–3426 (2012).

  55. Geman, S. & Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984).

  56. Ludwig, M., Dührkop, K. & Böcker, S. Bayesian networks for mass spectrometric metabolite identification via molecular fingerprints. Bioinformatics 34, i333–i340 (2018).

  57. Li, L. et al. MyCompoundID: using an evidence-based metabolome library for metabolite identification. Anal. Chem. 85, 3401–3408 (2013).

  58. Meringer, M., Reinker, S., Zhang, J. & Muller, A. MS/MS data improves automated determination of molecular formulas by mass spectrometry. MATCH Commun. Math. Comput. Chem. 65, 259–290 (2011).

  59. Heuerding, S. & Clerc, J. T. Simple tools for the computer-aided interpretation of mass spectra. Chemometr. Intell. Lab. Syst. 20, 57–69 (1993).

  60. Dührkop, K. et al. boecker-lab/sirius-libs: SIRIUS 4.0.1 including ZODIAC (Version v4.0.1_with_ZODIAC). https://doi.org/10.5281/zenodo.3985859 (2020).

Acknowledgements

We thank M. Witting for discussions and F. Kretschmer for the fragmentation tree visualization. We acknowledge financial support by the Deutsche Forschungsgemeinschaft to S.B., K.D., M.F., M.A.H. and M.L. (grant BO 1910/20) and D.P. (grant PE 2600/1). I.K. acknowledges funding from the Blasker Environmental Grant, San Diego Foundation. F.V. was funded by the Department of Navy, Office of Naval Research Multidisciplinary University Research Initiative (MURI) Award (award number N00014-15-1-2809). L.-F.N. was supported by European Union’s Horizon 2020 grants (MSCA-GF, 704786). M.M. acknowledges funding from the National Science Foundation (award number 1354050). We acknowledge financial support by the US National Institutes of Health to P.C.D. for the Center for Computational Mass Spectrometry (grant P41 GM103484), the re-use of metabolomics data (grant R03 CA211211) and the tools for rapid and accurate structure elucidation of natural products (grant R01 GM107550). P.C.D. also acknowledges support from the Sloan Foundation and from the Gordon and Betty Moore Foundation.

Author information

Contributions

S.B. designed the research. S.B. and M.L. developed the computational method with help from K.D. M.L. implemented the computational method with contributions from K.D. and M.F. M.L. and L.-F.N. performed the method evaluation, coordinated by S.B. L.-F.N., I.K. and L.A. contributed to the interpretation of results. M.F. and M.L. integrated ZODIAC into SIRIUS. M.A.H. contributed to the visualization of the novel compound’s data. Mass spectrometry experiments were performed for the dendroides dataset by L.-F.N., for the NIST1950 dataset by F.V., for the tomato dataset by L.-F.N. and M.M. and for the diatoms dataset by I.K. and D.P. L.A. and P.C.D. coordinated the experimental part of the study. S.B. and M.L. wrote the manuscript, to which L.-F.N. and I.K. contributed, in cooperation with all other authors.

Corresponding author

Correspondence to Sebastian Böcker.

Ethics declarations

Competing interests

S.B., K.D., M.F., M.A.H. and M.L. are founders of Bright Giant GmbH. P.C.D. is the scientific advisor for Sirenas LLC.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Statistics on compounds with annotated ground truth molecular formulas.

Shown are the total number of compounds, the number of compounds with a ground truth molecular formula and the number of these that are in the top 50 of SIRIUS-ranked candidates. The median m/z and the 25th and 75th percentiles consider only candidates in the top 50. We report the maximum absolute value of all relative mass errors in a dataset. Finally, sample standard deviations (STD) of relative mass errors are computed assuming a mean mass error of zero.
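The mass error statistics named in this legend can be sketched as follows. This is an illustration under assumptions rather than the evaluation code of the paper: errors are expressed in parts per million, and the standard deviation with a fixed mean of zero is taken as the root mean square with divisor n. The function names are hypothetical.

```python
import math


def relative_errors_ppm(measured, theoretical):
    """Relative mass error of each measurement in parts per million."""
    return [(m - t) / t * 1e6 for m, t in zip(measured, theoretical)]


def max_abs_error(errors):
    """Maximum absolute value of all relative mass errors."""
    return max(abs(e) for e in errors)


def std_zero_mean(errors):
    """Standard deviation assuming a mean error of zero.

    With the mean fixed at zero, no degree of freedom is spent on
    estimating it, so the divisor is n (an assumption of this sketch).
    """
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Fixing the mean at zero means that a systematic calibration offset inflates the reported deviation rather than being absorbed into an estimated mean.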

Extended Data Fig. 2 Distribution of compound masses.

Distribution of precursor ion m/z of the compounds used as ground truth for the evaluation of the molecular formula annotation on the five datasets. Bins of width 100 are centred at 100, 200, …, 800 m/z.
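The binning described here, with bins of width 100 centred at 100, 200, …, 800 m/z, can be reproduced with a short sketch. Two details are assumptions not stated in the legend: bins are treated as half-open intervals such as [50, 150), and values outside 50 to 850 m/z are dropped. The function names are hypothetical.

```python
import math


def bin_centre(mz, width=100.0):
    """Centre of the bin containing mz, for bins centred at multiples of width."""
    return int(math.floor(mz / width + 0.5) * width)


def mz_histogram(mz_values, first=100, last=800, width=100):
    """Count precursor m/z values per bin; out-of-range values are ignored."""
    counts = {centre: 0 for centre in range(first, last + width, width)}
    for mz in mz_values:
        centre = bin_centre(mz, width)
        if centre in counts:
            counts[centre] += 1
    return counts
```

For example, `mz_histogram([149.9, 150.0, 455.2])` places the three values in the bins centred at 100, 200 and 500, respectively.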

Extended Data Fig. 3 ZODIAC processing and evaluation workflow.

(1) Each LC-MS/MS run is processed individually; input mzML/mzXML files are processed using OpenMS, performing feature and adduct detection and producing files in SIRIUS input format. Resulting features combine MS1, MS/MS and adduct information. (2), (3) Filtering is performed on feature, MS/MS and peak level. (4) Similar features are merged between different runs using hierarchical clustering; MS/MS are combined and a best isotope pattern is selected per feature. (5) Missing isotope peaks are searched in MS1 spectra to extend isotope patterns. (6) A final feature filtering step is performed; the remaining features are considered as compounds. (7) SIRIUS is executed. (8) Compounds with few explained peaks are discarded, since a badly explained MS/MS spectrum indicates low quality. (9) ZODIAC is run on the remaining compounds. (10) SIRIUS and ZODIAC are evaluated on the same set of compounds.

Extended Data Fig. 4 Molecular formula annotation error rates.

Error rates on five datasets. Methods are SIRIUS; ZODIAC (without anchors); exact mass over elements carbon, hydrogen, nitrogen and oxygen (‘exact mass (CHNO)’); exact mass over CHNO plus phosphorus and sulfur (‘exact mass (CHNOPS)’); Seven Golden Rules with elements CHNOPS (‘7GR (CHNOPS)’); Seven Golden Rules with elements CHNOPS plus bromine and chlorine (‘7GR (CHNOPSBrCl)’); and GenForm. Between 44 and 271 compounds were processed per dataset; see Extended Data Fig. 1 for details. GenForm is the only publicly available tool for molecular formula inference besides SIRIUS, and considers both the isotope pattern and the fragmentation spectrum. GenForm was restricted to elements CHNOPS, and 7GR (CHNOPSBrCl) cannot annotate iodine-containing compounds; consequently, only SIRIUS and ZODIAC are in theory capable of annotating the two novel molecular formulas C24H47BrNO8P and C15H30ClIO5 reported here. Error rates are based on all compounds with established ground truth, resulting in slightly higher error rates for SIRIUS and ZODIAC on dendroides, tomato and mice stool compared to Fig. 1. Error rates on the five datasets agree well with the mass of compounds in the respective dataset (see Extended Data Fig. 1): larger compounds result in substantially more candidates to be considered, in particular for a larger set of elements, and hence in worse annotation rates. For evaluation details see the Methods section.
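To see why the number of candidates grows with compound mass and element set, an exact-mass search over CHNO can be sketched by brute force. This is an illustrative sketch, not the baseline implementation evaluated in the paper: the 10 ppm tolerance, the neutral-mass input and the function name `chno_candidates` are assumptions, and no chemical feasibility filter is applied.

```python
# Monoisotopic masses (Da) of the most abundant isotopes.
MONO = {"C": 12.0, "H": 1.00782503, "N": 14.0030740, "O": 15.9949146}


def chno_candidates(target_mass, ppm=10.0):
    """Enumerate (C, H, N, O) counts whose monoisotopic mass lies within
    the given ppm tolerance of a neutral target mass."""
    tol = target_mass * ppm * 1e-6
    found = []
    for c in range(int((target_mass + tol) // MONO["C"]) + 1):
        rest_c = target_mass - c * MONO["C"]
        for n in range(int((rest_c + tol) // MONO["N"]) + 1):
            rest_n = rest_c - n * MONO["N"]
            for o in range(int((rest_n + tol) // MONO["O"]) + 1):
                rest_o = rest_n - o * MONO["O"]
                # Fill the remaining mass with hydrogen atoms.
                h = round(rest_o / MONO["H"])
                if h < 0:
                    continue
                mass = (c * MONO["C"] + h * MONO["H"]
                        + n * MONO["N"] + o * MONO["O"])
                if abs(mass - target_mass) <= tol:
                    found.append((c, h, n, o))
    return found
```

Running this for increasing target masses, or extending `MONO` with further elements, shows the candidate list growing quickly, which is consistent with the trend described in the legend.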

Extended Data Fig. 5 Seven Golden Rules applied to annotated molecular formulas.

For each ZODIAC molecular formula annotation, we test whether it meets the molecular formula subset of the Seven Golden Rules (7GR). Each dot represents one annotated compound; molecular formulas are sorted by ZODIAC score.

Extended Data Fig. 6 Novel molecular formulas.

All molecular formulas are absent from the largest molecular structure databases PubChem and ChemSpider. A molecular formula annotation is reported only if it has a ZODIAC score of at least 0.98, at least 95% of the MS/MS spectrum intensity is explained by the SIRIUS fragmentation tree, and at least one molecular formula of the compound is connected to five or more compounds. More than one hypothetical compound in an LC-MS run may be annotated with the same molecular formula, potentially corresponding to different isomers. For such cases, ‘#comp.’ is the number of hypothetical compounds annotated with the given molecular formula, and ‘max score’ is the maximum ZODIAC score among these annotations. The corresponding compounds are given in Supplementary Table 5. For 90.00% of the compounds, SIRIUS top-ranks the same molecular formula.
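The reporting criteria in this legend can be written as a simple predicate. The thresholds (0.98, 95% and 5) are taken from the legend; the record layout, the field names and the example values are hypothetical and serve only to illustrate the filter.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    formula: str                 # top-ranked molecular formula
    zodiac_score: float          # between 0 and 1
    explained_intensity: float   # fraction of MS/MS intensity explained by the tree
    max_connections: int         # largest number of compounds connected to any candidate
    in_structure_db: bool        # present in PubChem or ChemSpider


def is_reportable_novel(a: Annotation) -> bool:
    """Apply the thresholds stated in the legend to one annotation."""
    return (not a.in_structure_db
            and a.zodiac_score >= 0.98
            and a.explained_intensity >= 0.95
            and a.max_connections >= 5)


# Invented example records, not results from the paper.
examples = [
    Annotation("formula_A", 0.99, 0.97, 8, False),
    Annotation("formula_B", 0.99, 0.90, 8, False),
    Annotation("formula_C", 0.99, 0.97, 8, True),
]
reported = [a.formula for a in examples if is_reportable_novel(a)]
```

Only the first record passes: the second explains too little of the spectrum intensity, and the third is already present in a structure database.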

Supplementary information

Supplementary Information

Supplementary Figs. 1–7, Supplementary Table 4, Supplementary Note 1.

Reporting Summary

Supplementary Table 1

Manually annotated molecular formulas for compounds in the dendroides dataset. These molecular formulas serve as ground truth for evaluation of SIRIUS and ZODIAC.

Supplementary Table 2

Spectral library hits for datasets NIST1950, tomato, diatoms and mice stool. The molecular formulas of these library hits serve as ground truth for evaluation of SIRIUS and ZODIAC.

Supplementary Table 3

List of input files used for evaluation of five datasets. The included files in mzML/mzXML format correspond to LC-MS/MS runs which were used for evaluation. These runs are subsets of the data provided at MassIVE repository.

Supplementary Table 5

Compounds with a novel molecular formula. Detailed information is provided for the compounds corresponding to the novel molecular formulas in Extended Data Fig. 6. All molecular formulas are absent from the largest molecular structure databases PubChem and ChemSpider.

Source data

Source Data Fig. 1

Statistical source data

Source Data Fig. 2

Statistical source data

Source Data Fig. 3

Mass spectra

Source Data Fig. 4

Mass spectra

Source Data Fig. 5

Statistical source data

Source Data Extended Data Fig. 2

Statistical source data

Source Data Extended Data Fig. 4

Statistical source data

Source Data Extended Data Fig. 5

Statistical source data

About this article

Cite this article

Ludwig, M., Nothias, LF., Dührkop, K. et al. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat Mach Intell 2, 629–641 (2020). https://doi.org/10.1038/s42256-020-00234-6
