Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics

Lai, Zijuan; Tsugawa, Hiroshi; Wohlgemuth, Gert; Mehta, Sajjan; Mueller, Matthew; Zheng, Yuxuan; Ogiwara, Atsushi; Meissen, John; Showalter, Megan; Takeuchi, Kohei; Kind, Tobias; Beal, Peter; Arita, Masanori; Fiehn, Oliver

doi:10.1038/nmeth.4512

Brief Communication
Published: 27 November 2017

Zijuan Lai¹,² (equal contribution),
Hiroshi Tsugawa³,⁴ (equal contribution; ORCID: orcid.org/0000-0002-2015-3958),
Gert Wohlgemuth¹,
Sajjan Mehta¹ (ORCID: orcid.org/0000-0002-7764-3886),
Matthew Mueller¹,
Yuxuan Zheng²,
Atsushi Ogiwara⁵,
John Meissen¹,
Megan Showalter¹,
Kohei Takeuchi⁶,
Tobias Kind¹,
Peter Beal²,
Masanori Arita³,⁷ &
Oliver Fiehn¹,⁸ (ORCID: orcid.org/0000-0002-6261-8928)

Nature Methods volume 15, pages 53–56 (2018)

Abstract

Novel metabolites distinct from canonical pathways can be identified through the integration of three cheminformatics tools: BinVestigate, which queries the BinBase gas chromatography–mass spectrometry (GC-MS) metabolome database to match unknowns with biological metadata across over 110,000 samples; MS-DIAL 2.0, a software tool for chromatographic deconvolution of high-resolution GC-MS or liquid chromatography–mass spectrometry (LC-MS); and MS-FINDER 2.0, a structure-elucidation program that uses a combination of 14 metabolome databases in addition to an enzyme promiscuity library. We showcase our workflow by annotating N-methyl-uridine monophosphate (UMP), lysomonogalactosyl-monopalmitin, N-methylalanine, and two propofol derivatives.

**Figure 1: A workflow for the functional and structural identification of unknown metabolites.**

**Figure 2: Metabolomic meta-analysis for origin exploration by BinVestigate.**

**Figure 3: The identification of N-methyl-UMP by MS-DIAL 2.0 and MS-FINDER 2.0.**

Acknowledgements

This work was supported by the US National Science Foundation (NSF)–Japan Science and Technology Agency (JST) Strategic International Collaborative Research Program (SICORP) for Japan–United States metabolomics. We appreciate funding from the US National Science Foundation (projects MCB 113944 and MCB 1611846 to O.F.), the US National Institutes of Health (U24 DK097154 to O.F.), and AMED–Core Research for Evolutionary Science and Technology (AMED-CREST) and JSPS KAKENHI (grants 15K01812, 15H05897, 15H05898, 17H03621 to M.A.).

Author information

Zijuan Lai and Hiroshi Tsugawa: These authors contributed equally to this work.

Authors and Affiliations

West Coast Metabolomics Center, UC Davis, Davis, California, USA
Zijuan Lai, Gert Wohlgemuth, Sajjan Mehta, Matthew Mueller, John Meissen, Megan Showalter, Tobias Kind & Oliver Fiehn
Department of Chemistry, UC Davis, Davis, California, USA
Zijuan Lai, Yuxuan Zheng & Peter Beal
RIKEN Center for Sustainable Resource Science, Yokohama, Japan
Hiroshi Tsugawa & Masanori Arita
RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
Hiroshi Tsugawa
Reifycs Inc., Tokyo, Japan
Atsushi Ogiwara
Perfume Development Research Laboratory, Kao Corporation, Tokyo, Japan
Kohei Takeuchi
National Institute of Genetics, Mishima, Japan
Masanori Arita
Department of Biochemistry, King Abdulaziz University, Jeddah, Saudi Arabia
Oliver Fiehn

Contributions

Z.L., H.T., M.A., and O.F. designed the research. G.W. and S.M. developed the BinVestigate program. H.T. developed the MS-DIAL 2.0 and MS-FINDER 2.0 programs. Z.L. performed the sample preparation, instrumental analysis, and data processing for unknown-compound identification. M.S. contributed biological and LC-MS studies for the identification of N-methyl-UMP. T.K. trained Z.L. in cheminformatics and contributed to validation of MS-FINDER. M.M. wrote the front end for BinVestigate. Z.L. and H.T. performed performance validation and program comparison for MS-DIAL 2.0 and MS-FINDER 2.0. Y.Z. and P.B. synthesized the N-methyl-UMP standard compound. A.O. improved the raw data file reader in ABF conversion. J.M., K.T., and O.F. contributed to the identification of lyso-MGMP and propofol derivatives. Z.L., H.T., M.A., and O.F. thoroughly discussed this project and wrote the manuscript.

Corresponding authors

Correspondence to Masanori Arita or Oliver Fiehn.

Ethics declarations

Competing interests

A.O. is a developer at Reifycs Inc., which provides the ABF converter for mass spectral data for free at http://www.reifycs.com/AbfConverter/.

Integrated supplementary information

Supplementary Figure 1 Cross-study specificity and relevance analysis of unknown BinBase ID 21735 with BinVestigate.

Supplementary Figure 2 Methodology and example for MS-DIAL 2.0 program.

(a) Peak spotting: to determine fragment ions for GC-MS spectra, the detected m/z-RT features are termed 'peak spots', each with computed peak quality and peak sharpness values. (b) Feature detection: all peak spots with identical peak widths and peak-top retention times are combined into a single array. For each array, peak sharpness values are totaled and a second Gaussian derivative filter is applied to construct 'peak groups'. (c) Deconvolution and identification: MS1Dec chromatogram deconvolution and the open-access MoNA mass spectral database are used to annotate the coeluting metabolites (phosphate, leucine, and glycerol), whose peak tops differ by 0.4–0.6 s. The terms “Match” and “R.Match” denote the dot-product and reverse dot-product values, respectively, as calculated in the NIST MS Search program.
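The difference between the dot-product and reverse dot-product values mentioned in (c) can be illustrated with a minimal sketch. It assumes a plain, unweighted cosine similarity on nominal m/z values; the NIST MS Search program applies its own weighting and scaling, which is not reproduced here. The function names and the example intensities are hypothetical.

```python
import math

def dot_product(query, library):
    """Cosine similarity over all m/z values present in either spectrum."""
    mzs = set(query) | set(library)
    num = sum(query.get(mz, 0.0) * library.get(mz, 0.0) for mz in mzs)
    norm_q = math.sqrt(sum(v * v for v in query.values()))
    norm_l = math.sqrt(sum(v * v for v in library.values()))
    den = norm_q * norm_l
    return num / den if den else 0.0

def reverse_dot_product(query, library):
    """Cosine similarity using only query peaks that also occur in the library
    spectrum, so extra query peaks (e.g. from a coeluting compound) are ignored."""
    trimmed = {mz: i for mz, i in query.items() if mz in library}
    return dot_product(trimmed, library)

# Invented spectra: the query carries one contaminating peak at m/z 86.
library = {73: 100.0, 147: 50.0, 299: 30.0}
query = {73: 100.0, 147: 50.0, 299: 30.0, 86: 80.0}
print(dot_product(query, library), reverse_dot_product(query, library))
```

In this toy case the reverse value stays at its maximum while the forward value drops, which is why a large gap between the two can hint at an impure, incompletely deconvoluted spectrum.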

Supplementary Figure 3 MS-DIAL 2.0 deconvolution example for Agilent GC-Q(MS).

The accuracy of GC-MS chromatogram deconvolution was confirmed by analyzing a biological sample on an Agilent GC-Q(MS) system. Further examples using LECO GC-TOF(MS), Shimadzu GC-Q(MS), Bruker GC-Q(MS), and Thermo GC-QExactive (MS) data are shown in Supplementary Data 1.

Supplementary Figure 4 Data stream of the MS-DIAL 2.0 program.

MS-DIAL 2.0 is designed as universal software for MS data processing. First, MS vendor-format or common-format (mzML/CDF) data are converted to the ABF binary format for rapid data retrieval; the common formats mzML and netCDF can also be imported directly. MS-DIAL 2.0 then performs chromatogram deconvolution with support for any MS analytical platform, ranging from low- and high-resolution GC-MS (MS/MS) or LC-MS (MS/MS) to data-dependent or data-independent acquisition methods. Finally, the program performs compound annotation by matching against a mass spectral library, followed by statistical analysis.

Supplementary Figure 5 Workflow for the MS-FINDER 2.0 program.

(a) Accurate-mass GC-EI-MS data are used as input, with a defined molecular ion and its adduct type. (b) Derivatized formulas are computationally generated and ranked based on valence and elemental ratio checks in combination with the original MS-FINDER formula scoring algorithm. (c) Structure candidates are retrieved from multiple databases. After each candidate is computationally derivatized, the candidates are ranked by the results of substructure assignments from computational mass fragmentation.
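The valence and elemental ratio checks in (b) can be sketched as simple filters on a candidate formula. This is only an illustration under stated assumptions: the ring-plus-double-bond equivalent (RDBE) is used as the valence test, the ratio limits are illustrative placeholders rather than the thresholds used by MS-FINDER 2.0, and the formula scoring and derivatization steps are not modeled. All function names are hypothetical.

```python
def rdbe(formula):
    """Rings plus double bonds equivalent for a neutral formula.
    Si is counted like C, halogens like H, and P like N."""
    c = formula.get("C", 0) + formula.get("Si", 0)
    h = sum(formula.get(el, 0) for el in ("H", "F", "Cl", "Br", "I"))
    n = formula.get("N", 0) + formula.get("P", 0)
    return c - h / 2 + n / 2 + 1

def passes_checks(formula, max_ratios=None):
    """Return True if the formula passes the valence and element-ratio filters."""
    # Illustrative upper limits for element-to-carbon ratios (assumed values).
    ratios = max_ratios or {"H": 3.1, "N": 1.3, "O": 1.2}
    value = rdbe(formula)
    # Valence check: a neutral molecule needs a non-negative, integer RDBE.
    if value < 0 or value != int(value):
        return False
    carbons = formula.get("C", 0)
    if carbons == 0:
        return False
    return all(formula.get(el, 0) / carbons <= limit for el, limit in ratios.items())

print(passes_checks({"C": 4, "H": 9, "N": 1, "O": 2}))   # candidate from Supplementary Figure 8
print(passes_checks({"C": 4, "H": 10, "N": 1, "O": 2}))  # odd electron count, rejected
```

Filters of this kind are cheap, so applying them before any scoring keeps the candidate list small.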

Supplementary Figure 6 Performance validation of the MS-FINDER 2.0 program.

(a) Comparison of compound logP and natural product likeness between the metabolite dataset used for the accuracy test (denoted 'GCMS library') and the databases in MS-FINDER 2.0 (FINDMetDB, MINE, and PubChem). (b) Performance test results for MS-FINDER 2.0 and a random sampling method with three structure resource sets.

Supplementary Figure 7 Authentic standard validation for the identification of N-methyl-UMP.

Mass spectra and retention times in GC-MS (a) and LC-MS/MS (b) were compared between BinBase ID 106699 in a cancer cell sample and the chemically synthesized N-methyl-UMP standard, as well as other isomeric compounds including 2'-O-methyl, 3'-O-methyl, and 5-methyl-UMP(UTP).

Supplementary Figure 8 A workflow for GC-MS and LC-MS/MS identification of N-methylalanine in MS-FINDER 2.0.

The workflow is the same as that shown in Figure 3. High-resolution GC-MS was used for structure elucidation (left), and LC-MS/MS was then applied as an additional line of evidence (right). Unknown discovery: fragment ions and molecular adduct ions of BinBase ID 160842 were deconvoluted by MS-DIAL 2.0. Formula prediction: C4H9NO2 was scored and ranked first in MS-FINDER 2.0 based on mass errors, isotope ratio errors, and subformula assignments. Formula validation: for the GC-MS flow, chemical ionization data with different derivatization methods (MSTFA vs. MSTFAd9) were obtained to verify the formula and to yield the number of acidic protons; for the LC-MS flow, the mass errors between theoretical and experimental values were only 1 mDa, and the isotopic ratio errors were within 1%. Structure prediction: structure candidates were retrieved from MINE DB in addition to the internal metabolome database and were fragmented in silico based on hydrogen rearrangement rules, bond dissociation energies, and a comprehensive fragmentation rule library (including GC-EI-MS and LC-ESI-MS/MS). N-methyl-alanine was ranked as the most likely structure in MS-FINDER 2.0 with computationally assigned substructures. Structure validation: the mass spectra and retention times in GC-MS (left) and LC-MS/MS (right) matched those of a chemically synthesized N-methyl-alanine standard.
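The mass-error comparison in the formula validation step amounts to simple arithmetic, sketched below for C4H9NO2. The sketch assumes a protonated ion ([M+H]+) for the LC-MS flow, which the legend does not specify, and the observed m/z is a made-up value chosen only to show the calculation. Function names are hypothetical.

```python
# Monoisotopic masses of the most abundant isotopes (Da).
MONOISOTOPIC = {"C": 12.0, "H": 1.00782503207, "N": 14.0030740048, "O": 15.99491461956}
PROTON = 1.00727646688  # mass of a proton (Da)

def monoisotopic_mass(formula):
    """Sum of monoisotopic element masses for a formula given as {element: count}."""
    return sum(MONOISOTOPIC[el] * n for el, n in formula.items())

def mass_error_mda(observed_mz, theoretical_mz):
    """Signed mass error in millidaltons."""
    return (observed_mz - theoretical_mz) * 1000.0

formula = {"C": 4, "H": 9, "N": 1, "O": 2}
theoretical = monoisotopic_mass(formula) + PROTON  # assumed [M+H]+ adduct
observed = 104.0716  # hypothetical measurement, not taken from the paper

print(f"theoretical m/z: {theoretical:.4f}")
print(f"mass error: {mass_error_mda(observed, theoretical):.2f} mDa")
```

An absolute error in mDa is used here because the legend reports it that way; a relative error in ppm would divide the same difference by the theoretical m/z.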

Supplementary Figure 9 GC-MS identification of lyso-monogalactosyl-monopalmitin with in silico fragmentation and substructure assignments in MS-FINDER 2.0.

After the structure candidates were suggested by MS-FINDER 2.0, the molecular skeleton was confirmed by the result of substructure assignments with manual inspection.

Supplementary Figure 10 GC-MS identification of 4-hydroxypropofol-1-glucuronide with in silico fragmentation and substructure assignments in MS-FINDER 2.0.

After the structure candidates were suggested by MS-FINDER 2.0, the molecular skeleton was confirmed by the result of substructure assignments with manual inspection. Finally, the structure was confirmed against the authentic standard compound.

Supplementary Figure 11 GC-MS identification of 4-hydroxypropofol-4-glucuronide with in silico fragmentation and substructure assignments in MS-FINDER 2.0.

After the structure candidates were suggested by MS-FINDER 2.0, the molecular skeleton was confirmed by the result of substructure assignments with manual inspection. Finally, the structure was confirmed against the authentic standard compound.

Supplementary Figure 12 Cross-study specificity and relevance analysis of unknown BinBase ID 8270 with BinVestigate.

Supplementary Figure 13 Investigation for the unique mass ratio of m/z 352 to m/z 315 among the EI-MS spectra of UMP and N-methyl-UMP in BinBase.

The x- and y-axes show the ratio of m/z 352 to m/z 315 and the count of EI-MS records, respectively.
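A histogram of this kind can be built by computing the ion ratio per spectrum and counting records per ratio bin. The following is a minimal sketch with invented intensities; the bin width, the function names, and the choice to skip spectra lacking m/z 315 are assumptions, not details taken from BinBase.

```python
import math
from collections import Counter

def ion_ratio(spectrum, numerator_mz=352, denominator_mz=315):
    """Intensity ratio of two ions, or None when the denominator ion is absent."""
    denom = spectrum.get(denominator_mz, 0.0)
    if denom == 0:
        return None
    return spectrum.get(numerator_mz, 0.0) / denom

def ratio_histogram(spectra, bin_width=0.5):
    """Count spectra per ratio bin; keys are the lower edges of the bins."""
    counts = Counter()
    for spectrum in spectra:
        ratio = ion_ratio(spectrum)
        if ratio is not None:
            counts[math.floor(ratio / bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))

# Invented EI-MS records as {m/z: intensity}.
records = [
    {352: 40.0, 315: 100.0},
    {352: 45.0, 315: 100.0},
    {352: 200.0, 315: 100.0},
    {352: 10.0},  # no m/z 315 signal, skipped
]
print(ratio_histogram(records))
```

If the ratios of two compounds fall into clearly separated bins, a single threshold on the ratio is enough to tell their spectra apart.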

Supplementary Figure 14 MS-DIAL 2.0 background subtraction in peak detection.

(a) Peaks were excluded as spike noise if the ion abundance of a point neighboring the peak top was zero in the unsmoothed raw chromatogram. (b) Peaks were excluded as baseline noise if 4 spike noise signals were programmatically detected within a ±5 APW region of the peak top.
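The two exclusion rules can be sketched as small predicates on a raw chromatogram. This is an illustration under several assumptions: positions are scan indices, APW is supplied by the caller as a number in the same units, points beyond the ends of the chromatogram are treated as zero, and "4 spike noise signals" is read as "at least four". The function names are hypothetical and do not correspond to MS-DIAL code.

```python
def is_spike(raw, top):
    """Rule (a): the peak top has a zero-intensity immediate neighbor
    in the unsmoothed chromatogram."""
    left = raw[top - 1] if top > 0 else 0.0
    right = raw[top + 1] if top + 1 < len(raw) else 0.0
    return left == 0 or right == 0

def is_baseline_noise(peak_top, spike_tops, apw, min_spikes=4, window=5):
    """Rule (b): at least `min_spikes` spike signals lie within
    +/- window * APW of the peak top."""
    nearby = [t for t in spike_tops if abs(t - peak_top) <= window * apw]
    return len(nearby) >= min_spikes

# Invented raw intensities: an isolated spike at index 2, a real peak at index 7.
raw = [0, 0, 50, 0, 0, 10, 40, 90, 45, 12, 0]
print(is_spike(raw, 2), is_spike(raw, 7))
print(is_baseline_noise(100, [90, 95, 104, 110], apw=3))
```

Checking the unsmoothed trace matters for rule (a), because smoothing spreads a single-point spike across its neighbors and hides the zero values the rule relies on.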

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 2475 kb)

Life Sciences Reporting Summary (PDF 129 kb)

Supplementary Table 1

Software comparison for MS-FINDER 2.0 versus CFM-ID, MetFrag, Molecular Structure Correlator (MSC), and MassFrontier (XLSX 222 kb)

Supplementary Table 2

Software comparison for MS-DIAL 2.0 versus AMDIS, AnalyzerPro, and ChromaTOF (XLSX 11 kb)

Supplementary Table 3

Summary of false discovery rate studies (XLSX 10 kb)

Supplementary Data Set 1

Examples of mass spectral deconvolution results from different vendors' instruments obtained with MS-DIAL 2.0. (a) Leco nominal mass GC-TOF-MS. (b) Shimadzu nominal mass GC-QMS. (c) Thermo Fisher accurate mass GC-QExactive MS. (d) Bruker accurate mass GC-TOF-MS. (PDF 342 kb)

About this article

Cite this article

Lai, Z., Tsugawa, H., Wohlgemuth, G. et al. Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics. Nat Methods 15, 53–56 (2018). https://doi.org/10.1038/nmeth.4512

Received: 31 August 2017
Accepted: 26 October 2017
Published: 27 November 2017
Issue Date: 01 January 2018
DOI: https://doi.org/10.1038/nmeth.4512
