The confident high-throughput identification of small molecules is one of the most challenging tasks in mass spectrometry-based metabolomics. Annotating the molecular formula of a compound is the first step towards its structural elucidation. Yet even the annotation of molecular formulas remains highly challenging. This is particularly so for large compounds above 500 daltons, and for de novo annotations, for which we consider all chemically feasible formulas. Here we present ZODIAC, a network-based algorithm for the de novo annotation of molecular formulas. Uniquely, it enables fully automated and swift processing of complete experimental runs, providing high-quality, high-confidence molecular formula annotations. This allows us to annotate novel molecular formulas that are absent from even the largest public structure databases. Our method re-ranks molecular formula candidates by considering joint fragments and losses between fragmentation trees. We employ Bayesian statistics and Gibbs sampling. Thorough algorithm engineering ensures fast processing in practice. We evaluate ZODIAC on five datasets, producing results substantially (up to 16.5-fold) better than for several other methods, including SIRIUS, which is the state-of-the-art algorithm for molecular formula annotation at present. Finally, we report and verify several novel molecular formulas annotated by ZODIAC.
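The re-ranking idea can be illustrated with a minimal sketch of Gibbs sampling over a small network. This is not ZODIAC's implementation. It assumes each compound has a few candidate formulas with hypothetical log-prior scores, and a hypothetical `edge_weight` table that rewards pairs of candidates whose fragmentation trees would share fragments and losses. All names and numbers are illustrative.

```python
import math
import random
from collections import Counter


def gibbs_rerank(priors, edge_weight, sweeps=2000, burn_in=500, seed=0):
    """Estimate marginal probabilities of candidate formulas by Gibbs sampling.

    priors: {compound: {formula: log_prior}} (hypothetical scores).
    edge_weight: {frozenset({(compound_a, formula_a), (compound_b, formula_b)}): bonus}
    Returns {compound: {formula: estimated marginal probability}}.
    """
    rng = random.Random(seed)
    compounds = list(priors)
    # Start from the best candidate according to the prior alone.
    state = {c: max(priors[c], key=priors[c].get) for c in compounds}
    counts = {c: Counter() for c in compounds}

    for sweep in range(sweeps):
        for c in compounds:
            formulas = list(priors[c])
            # Log-score of each candidate given the current assignment of all other compounds.
            logs = []
            for f in formulas:
                score = priors[c][f]
                for other in compounds:
                    if other != c:
                        pair = frozenset({(c, f), (other, state[other])})
                        score += edge_weight.get(pair, 0.0)
                logs.append(score)
            top = max(logs)
            weights = [math.exp(x - top) for x in logs]  # subtract max for numerical stability
            state[c] = rng.choices(formulas, weights=weights)[0]
        if sweep >= burn_in:
            for c in compounds:
                counts[c][state[c]] += 1

    kept = sweeps - burn_in
    return {c: {f: counts[c][f] / kept for f in priors[c]} for c in compounds}


# Toy input (not real data): two compounds, two candidates each, one supporting edge.
toy_priors = {
    "A": {"C6H12O6": 0.0, "C7H16O5": 0.2},
    "B": {"C6H10O5": 0.0, "C5H6N4O": 0.1},
}
toy_edges = {frozenset({("A", "C6H12O6"), ("B", "C6H10O5")}): 3.0}
toy_result = gibbs_rerank(toy_priors, toy_edges)
```

In this toy network the edge lets the two mutually supporting candidates overtake candidates that score slightly better on their own, which is the effect a joint re-ranking is meant to capture.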
Input mzML/mzXML files for the five datasets are available at MassIVE (https://massive.ucsd.edu/) under the following accession numbers: dendroides (MSV000080502), NIST1950 (MSV000081364), tomato (MSV000081463), diatoms (MSV000081731) and mice stool (MSV000079949). SIRIUS and ZODIAC results and a virtual machine on which to reproduce the data are available from https://bio.informatik.uni-jena.de/data/ and https://doi.org/10.6084/m9.figshare.12911171. Source data are provided with this paper.
ZODIAC has been integrated into the SIRIUS software and is written in Java. It is open source under the GNU General Public License (version 3), and works on Windows, macOS and Linux. A command-line version allows batch processing and results can be visualized in a graphical user interface. We provide executable binaries, example files and additional information on the ZODIAC website (https://bio.informatik.uni-jena.de/software/zodiac/). A source copy is hosted on GitHub (https://github.com/boecker-lab/sirius-libs; ref. 60); the branch ‘zodiac_in_sirius_4_release’ contains the SIRIUS and ZODIAC code used for evaluation in this paper.
Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucl. Acids Res. 46, D608–D617 (2018).
Kim, S. et al. PubChem substance and compound databases. Nucl. Acids Res. 44, D1202–D1213 (2016).
Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).
Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI-MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
Samaraweera, M. A., Hall, L. M., Hill, D. W. & Grant, D. F. Evaluation of an artificial neural network retention index model for chemical structure identification in nontargeted metabolomics. Anal. Chem. 90, 12752–12760 (2018).
Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
Dührkop, K. et al. Classes for the masses: systematic classification of unknowns using fragmentation spectra. Preprint at https://www.biorxiv.org/content/10.1101/2020.04.17.046672v1 (2020).
Kind, T. & Fiehn, O. Seven Golden Rules for heuristic filtering of molecular formulas obtained by accurate mass spectrometry. BMC Bioinform. 8, 105 (2007).
Stein, S. E. Mass spectral reference libraries: an ever-expanding resource for chemical identification. Anal. Chem. 84, 7274–7282 (2012).
Vinaixa, M. et al. Mass spectral databases for LC/MS- and GC/MS-based metabolomics: state of the field and future prospects. Trends Anal. Chem. 78, 23–35 (2016).
Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
da Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).
Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
Alon, T. & Amirav, A. Isotope abundance analysis methods and software for improved sample identification with supersonic gas chromatography/mass spectrometry. Rapid Commun. Mass Spectrom. 20, 2579–2588 (2006).
Böcker, S., Letzel, M., Lipták, Z. S. & Pervukhin, A. Decomposing metabolomic isotope patterns. In Proc. Works. Algorithms in Bioinformatics (WABI 2006) Vol. 4175, 12–23 (Springer, Berlin, 2006).
Ojanperä, S. et al. Isotopic pattern and accurate mass determination in urine drug screening by liquid chromatography/time-of-flight mass spectrometry. Rapid Commun. Mass Spectrom. 20, 1161–1167 (2006).
Böcker, S., Letzel, M., Lipták, Zs. & Pervukhin, A. SIRIUS: decomposing isotope patterns for metabolite identification. Bioinformatics 25, 218–224 (2009).
Pluskal, T., Uehara, T. & Yanagida, M. Highly accurate chemical formula prediction tool utilizing high-resolution mass spectra, MS/MS fragmentation, heuristic rules, and isotope pattern matching. Anal. Chem. 84, 4396–4403 (2012).
Valkenborg, D., Mertens, I., Lemière, F., Witters, E. & Burzykowski, T. The isotopic distribution conundrum. Mass Spectrom. Rev. 31, 96–109 (2012).
Loos, M., Gerber, C., Corona, F., Hollender, J. & Singer, H. Accelerated isotope fine structure calculation using pruned transition trees. Anal. Chem. 87, 5738–5744 (2015).
Böcker, S. & Rasche, F. Towards de novo identification of metabolites by analyzing tandem mass spectra. Bioinformatics 24, i49–i55 (2008).
Stravs, M. A., Schymanski, E. L., Singer, H. P. & Hollender, J. Automatic recalibration and processing of tandem mass spectra using formula annotation. J. Mass Spectrom. 48, 89–99 (2013).
Böcker, S. & Dührkop, K. Fragmentation trees reloaded. J. Cheminform. 8, 5 (2016).
Rogers, S., Scheltema, R. A., Girolami, M. & Breitling, R. Probabilistic assignment of formulas to mass peaks in metabolomics experiments. Bioinformatics 25, 512–518 (2009).
Daly, R. et al. MetAssign: probabilistic annotation of metabolites from LC-MS data using a Bayesian clustering approach. Bioinformatics 30, 2764–2771 (2014).
da Silva, R. R. et al. ProbMetab: an R package for Bayesian probabilistic annotation of LC-MS-based metabolomics. Bioinformatics 30, 1336–1337 (2014).
Del Carratore, F. et al. Integrated probabilistic annotation: a Bayesian-based annotation method for metabolomic profiles integrating biochemical connections, isotope patterns, and adduct relationships. Anal. Chem. 91, 12799–12807 (2019).
Tziotis, D., Hertkorn, N. & Schmitt-Kopplin, P. Kendrick-analogous network visualisation of ion cyclotron resonance Fourier transform mass spectra: improved options for the assignment of elemental compositions and the classification of organic molecular complexity. Eur. J. Mass Spectrom. 17, 415–421 (2011).
Watrous, J. et al. Mass spectral molecular networking of living microbial colonies. Proc. Natl Acad. Sci. USA 109, E1743–E1752 (2012).
Morreel, K. et al. Systematic structural characterization of metabolites in Arabidopsis via candidate substrate-product pair networks. Plant Cell 26, 929–945 (2014).
Quinn, R. A. et al. Global chemical effects of the microbiome include new bile-acid conjugations. Nature 579, 123–129 (2020).
Esposito, M. et al. Euphorbia dendroides latex as a source of jatrophane esters: isolation, structural analysis, conformational study, and anti-CHIKV activity. J. Natural Prod. 79, 2873–2882 (2016).
Röst, H. L. et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat. Methods 13, 741–748 (2016).
Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
Nothias, L.-F. et al. Bioactivity-based molecular networking for the discovery of drug leads in natural product bioassay-guided fractionation. J. Natural Prod. 81, 758–767 (2018).
Pence, H. E. & Williams, A. ChemSpider: an online chemical information resource. J. Chem. Ed. 87, 1123–1124 (2010).
Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinform. 11, 395 (2010).
Nothias, L. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).
Simón-Manso, Y. et al. Metabolite profiling of a NIST standard reference material for human plasma (SRM 1950): GC-MS, LC-MS, NMR, and clinical laboratory analyses, libraries, and web-based resources. Anal. Chem. 85, 11725–11731 (2013).
Vos, R. C. H. D. et al. Untargeted large-scale plant metabolomics using liquid chromatography coupled to mass spectrometry. Nat. Protocols 2, 778–791 (2007).
Agarwal, V. et al. Complexity of naturally produced polybrominated diphenyl ethers revealed via mass spectrometry. Environ. Sci. Technol. 49, 1339–1346 (2015).
Andersen, R. & America, P. S. Algal Culturing Techniques (Elsevier Science, 2005).
Dittmar, T., Koch, B., Hertkorn, N. & Kattner, G. A simple and efficient method for the solid-phase extraction of dissolved organic matter (SPE-DOM) from seawater. Limnol. Oceanogr. Meth. 6, 230–235 (2008).
Petras, D. et al. High-resolution liquid chromatography tandem mass spectrometry enables large scale molecular characterization of dissolved organic matter. Front. Mar. Sci. 4, 405 (2017).
Meusel, M. et al. Predicting the presence of uncommon elements in unknown biomolecules from isotope patterns. Anal. Chem. 88, 7556–7566 (2016).
Sumner, L. W. et al. Proposed minimum reporting standards for chemical analysis. Metabolomics 3, 211–221 (2007).
Karp, R. M. in Complexity of Computer Computations (eds Miller, R. E. & Thatcher, J. W.) 85–103 (Plenum Press, 1972).
Downey, R. G. & Fellows, M. R. Parameterized Complexity (Springer, Berlin, 1999).
Zuckerman, D. Linear degree extractors and the inapproximability of max clique and chromatic number. In Proc. ACM Symp. on Theory of Computing (STOC 2006) 681–690 (2006).
Chen, J., Huang, X., Kanj, I. A. & Xia, G. Strong computational lower bounds via parameterized complexity. J. Comp. Syst. Sci. 72, 1346–1367 (2006).
Impagliazzo, R. & Paturi, R. On the complexity of k-SAT. J. Comp. Syst. Sci. 62, 367–375 (2001).
Rasche, F. et al. Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84, 3417–3426 (2012).
Geman, S. & Geman, D. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Mach. Intell. 6, 721–741 (1984).
Ludwig, M., Dührkop, K. & Böcker, S. Bayesian networks for mass spectrometric metabolite identification via molecular fingerprints. Bioinformatics 34, i333–i340 (2018).
Li, L. et al. MyCompoundID: using an evidence-based metabolome library for metabolite identification. Anal. Chem. 85, 3401–3408 (2013).
Meringer, M., Reinker, S., Zhang, J. & Muller, A. MS/MS data improves automated determination of molecular formulas by mass spectrometry. MATCH Commun. Math. Comput. Chem. 65, 259–290 (2011).
Heuerding, S. & Clerc, J. T. Simple tools for the computer-aided interpretation of mass spectra. Chemometr. Intell. Lab. Syst. 20, 57–69 (1993).
Dührkop, K. et al. boecker-lab/sirius-libs: SIRIUS 4.0.1 including ZODIAC (Version v4.0.1_with_ZODIAC). https://doi.org/10.5281/zenodo.3985859 (2020).
We thank M. Witting for discussions and F. Kretschmer for the fragmentation tree visualization. We acknowledge financial support by the Deutsche Forschungsgemeinschaft to S.B., K.D., M.F., M.A.H. and M.L. (grant BO 1910/20) and D.P. (grant PE 2600/1). I.K. acknowledges funding from the Blasker Environmental Grant, San Diego Foundation. F.V. was funded by the Department of Navy, Office of Naval Research Multidisciplinary University Research Initiative (MURI) Award (award number N00014-15-1-2809). L.-F.N. was supported by European Union’s Horizon 2020 grants (MSCA-GF, 704786). M.M. acknowledges funding from the National Science Foundation (award number 1354050). We acknowledge financial support by the US National Institutes of Health to P.C.D. for the Center for Computational Mass Spectrometry (grant P41 GM103484), the re-use of metabolomics data (grant R03 CA211211) and the tools for rapid and accurate structure elucidation of natural products (grant R01 GM107550). P.C.D. also acknowledges support from the Sloan Foundation and from the Gordon and Betty Moore Foundation.
S.B., K.D., M.F., M.A.H. and M.L. are founders of Bright Giant GmbH. P.C.D. is the scientific advisor for Sirenas LLC.
Given are the total number of compounds, the number of compounds with a ground-truth molecular formula, and the number of these that are in the top 50 of SIRIUS-ranked candidates. The median m/z and the 25th and 75th percentiles consider only candidates in the top 50. We report the maximum absolute value of all relative mass errors in a dataset. Finally, sample standard deviations (STD) of relative mass errors are computed assuming a mean mass error of zero.
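The mass error statistics described here can be sketched as follows. The function names, the parts-per-million unit and the choice to divide by n rather than n - 1 are assumptions made for illustration; the caption does not specify them.

```python
import math


def relative_error_ppm(measured_mz, theoretical_mz):
    """Relative mass error in parts per million (unit is an assumption)."""
    return (measured_mz - theoretical_mz) / theoretical_mz * 1e6


def max_abs_error(errors):
    """Maximum absolute value over all relative mass errors."""
    return max(abs(e) for e in errors)


def zero_mean_std(errors):
    """Standard deviation with the mean fixed at zero.

    Because the mean is assumed rather than estimated, each deviation is
    simply the error itself; dividing by n is an assumption of this sketch.
    """
    return math.sqrt(sum(e * e for e in errors) / len(errors))
```

Fixing the mean at zero means a systematic calibration offset inflates the reported STD instead of being absorbed into an estimated mean.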
Distribution of precursor ion m/z of the compounds used as ground truth for the evaluation of the molecular formula annotation on the five datasets. Bins of width 100 are centred at 100, 200, …, 800 m/z.
(1) Each LC-MS/MS run is processed individually; input mzML/mzXML files are processed using OpenMS, performing feature and adduct detection and producing files in SIRIUS input format. Resulting features combine MS1, MS/MS and adduct information. (2), (3) Filtering is performed on feature, MS/MS and peak level. (4) Similar features are merged between different runs using hierarchical clustering; MS/MS are combined and a best isotope pattern is selected per feature. (5) Missing isotope peaks are searched in MS1 spectra to extend isotope patterns. (6) A final feature filtering step is performed; the remaining features are considered as compounds. (7) SIRIUS is executed. (8) Compounds with few explained peaks are discarded, since a badly explained MS/MS spectrum indicates low quality. (9) ZODIAC is run on the remaining compounds. (10) SIRIUS and ZODIAC are evaluated on the same set of compounds.
Error rates on five datasets. Methods are SIRIUS; ZODIAC (without anchors); exact mass over elements carbon, hydrogen, nitrogen and oxygen (‘exact mass (CHNO)’); exact mass over CHNO plus phosphorus and sulfur (‘exact mass (CHNOPS)’); Seven Golden Rules with elements CHNOPS (‘7GR (CHNOPS)’); Seven Golden Rules with elements CHNOPS plus bromine and chlorine (‘7GR (CHNOPSBrCl)’); and GenForm. Between 44 and 271 compounds were processed per dataset; see Extended Data Fig. 1 for details. GenForm is the only publicly available tool for molecular formula inference besides SIRIUS, and considers both the isotope pattern and the fragmentation spectrum. GenForm was restricted to elements CHNOPS, and 7GR (CHNOPSBrCl) cannot annotate iodine-containing compounds; consequently, only SIRIUS and ZODIAC are in theory capable of annotating the two novel molecular formulas C24H47BrNO8P and C15H30ClIO5 reported here. Error rates are based on all compounds with established ground truth, resulting in slightly higher error rates for SIRIUS and ZODIAC on dendroides, tomato and mice stool compared to Fig. 1. Error rates on the five datasets agree well with the mass of compounds in the respective dataset (see Extended Data Fig. 1): larger compounds yield substantially more candidates to consider, in particular for a larger set of elements, and therefore worse annotation rates. For evaluation details see the Methods section.
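One plausible reading of an ‘exact mass’ baseline is to enumerate every formula over a fixed element set whose monoisotopic mass matches the measured neutral mass within a tolerance, and to rank candidates by mass deviation. The brute-force sketch below does this for CHNO. The function names, the 5 ppm default and the absence of any chemical feasibility filter or adduct correction are assumptions of this sketch; it does not reproduce the evaluation reported here.

```python
# Monoisotopic masses in daltons.
MONO = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}


def format_formula(counts):
    """Render element counts such as [("C", 6), ("H", 12)] as 'C6H12'."""
    return "".join(f"{el}{n if n > 1 else ''}" for el, n in counts if n > 0)


def decompose_chno(neutral_mass, ppm=5.0):
    """List CHNO formulas within `ppm` of `neutral_mass`, closest first.

    Assumes the tolerance is far below the mass of one hydrogen, so at most
    one hydrogen count can fit for each choice of C, N and O.
    """
    tol = neutral_mass * ppm * 1e-6
    budget = neutral_mass + tol
    hits = []
    for c in range(int(budget // MONO["C"]) + 1):
        left_c = budget - c * MONO["C"]
        for n in range(int(left_c // MONO["N"]) + 1):
            left_n = left_c - n * MONO["N"]
            for o in range(int(left_n // MONO["O"]) + 1):
                heavy = c * MONO["C"] + n * MONO["N"] + o * MONO["O"]
                # Fill the remaining mass with the nearest whole number of hydrogens.
                h = max(0, round((neutral_mass - heavy) / MONO["H"]))
                error = abs(heavy + h * MONO["H"] - neutral_mass)
                if error <= tol:
                    counts = [("C", c), ("H", h), ("N", n), ("O", o)]
                    hits.append((error, format_formula(counts)))
    return [formula for _, formula in sorted(hits)]
```

Adding an element to the set adds another nested loop, which is one way to see why the number of candidates grows quickly with both compound mass and the size of the element set.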
For each ZODIAC molecular formula annotation, we test whether it meets the molecular formula subset of the Seven Golden Rules (7GR). Each dot represents one annotated compound; molecular formulas are sorted by ZODIAC score.
All molecular formulas are absent from the largest molecular structure databases PubChem and ChemSpider. Only molecular formula annotations with a minimum ZODIAC score of 0.98 are reported, and only if at least 95% of the MS/MS spectrum intensity is explained by the SIRIUS fragmentation tree and at least one molecular formula candidate of the compound is connected to 5 or more compounds. More than one hypothetical compound in an LC-MS run may be annotated with the same molecular formula, potentially corresponding to different isomers. For such cases, ‘#comp.’ is the number of hypothetical compounds annotated with the given molecular formula, and ‘max score’ is the maximum ZODIAC score among these annotations. The corresponding compounds are given in Supplementary Table 5. For 90.00% of the compounds, SIRIUS top-ranks the same molecular formula.
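The reporting criteria and the ‘#comp.’ and ‘max score’ columns can be expressed as a small filter and aggregation. The sketch below is illustrative only: the `Annotation` record, its field names and the example values are hypothetical, and only the three thresholds come from the caption.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    formula: str
    zodiac_score: float          # score of the top-ranked formula
    explained_intensity: float   # fraction of MS/MS intensity explained by the tree
    max_connections: int         # compounds connected to the best-connected candidate


def is_reportable(a, min_score=0.98, min_explained=0.95, min_connections=5):
    """Apply the three thresholds stated in the caption."""
    return (
        a.zodiac_score >= min_score
        and a.explained_intensity >= min_explained
        and a.max_connections >= min_connections
    )


def summarize(annotations):
    """Return {formula: (number of compounds, maximum score)} for reportable annotations."""
    summary = {}
    for a in annotations:
        if is_reportable(a):
            n, best = summary.get(a.formula, (0, 0.0))
            summary[a.formula] = (n + 1, max(best, a.zodiac_score))
    return summary
```

Grouping by formula after filtering keeps possible isomers together while still counting each hypothetical compound once.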
Supplementary Figs. 1–7, Supplementary Table 4, Supplementary Note 1.
Manually annotated molecular formulas for compounds in the dendroides dataset. These molecular formulas serve as ground truth for evaluation of SIRIUS and ZODIAC.
Spectral library hits for datasets NIST1950, tomato, diatoms and mice stool. The molecular formulas of these library hits serve as ground truth for evaluation of SIRIUS and ZODIAC.
List of input files used for evaluation of five datasets. The included files in mzML/mzXML format correspond to LC-MS/MS runs which were used for evaluation. These runs are subsets of the data provided at MassIVE repository.
Compounds with a novel molecular formula. Detailed information is provided for the compounds corresponding to the novel molecular formulas in Extended Data Fig. 6. All molecular formulas are absent from the largest molecular structure databases PubChem and ChemSpider.
Statistical source data
Cite this article
Ludwig, M., Nothias, LF., Dührkop, K. et al. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat Mach Intell 2, 629–641 (2020). https://doi.org/10.1038/s42256-020-00234-6