Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics


Novel metabolites distinct from canonical pathways can be identified through the integration of three cheminformatics tools: BinVestigate, which queries the BinBase gas chromatography–mass spectrometry (GC-MS) metabolome database to match unknowns with biological metadata across over 110,000 samples; MS-DIAL 2.0, a software tool for chromatographic deconvolution of high-resolution GC-MS or liquid chromatography–mass spectrometry (LC-MS); and MS-FINDER 2.0, a structure-elucidation program that uses a combination of 14 metabolome databases in addition to an enzyme promiscuity library. We showcase our workflow by annotating N-methyl-uridine monophosphate (UMP), lysomonogalactosyl-monopalmitin, N-methylalanine, and two propofol derivatives.


Figure 1: A workflow for the functional and structural identification of unknown metabolites.
Figure 2: Metabolomic meta-analysis for origin exploration by BinVestigate.
Figure 3: The identification of N-methyl-UMP by MS-DIAL 2.0 and MS-FINDER 2.0.




This work was supported by the US National Science Foundation (NSF)–Japan Science and Technology Agency (JST) Strategic International Collaborative Research Program (SICORP) for Japan–United States metabolomics. We appreciate funding from the US National Science Foundation (projects MCB 113944 and MCB 1611846 to O.F.), the US National Institutes of Health (U24 DK097154 to O.F.), and AMED–Core Research for Evolutionary Science and Technology (AMED-CREST) and JSPS KAKENHI (grants 15K01812, 15H05897, 15H05898, 17H03621 to M.A.).

Author information




Z.L., H.T., M.A., and O.F. designed the research. G.W. and S.M. developed the BinVestigate program. H.T. developed the MS-DIAL 2.0 and MS-FINDER 2.0 programs. Z.L. performed the sample preparation, instrumental analysis, and data processing for unknown-compound identification. M.S. contributed biological and LC-MS studies for the identification of N-methyl-UMP. T.K. trained Z.L. in cheminformatics and contributed to validation of MS-FINDER. M.M. wrote the front end for BinVestigate. Z.L. and H.T. performed performance validation and program comparison for MS-DIAL 2.0 and MS-FINDER 2.0. Y.Z. and P.B. synthesized the N-methyl-UMP standard compound. A.O. improved the raw data file reader in ABF conversion. J.M., K.T., and O.F. contributed to the identification of lyso-MGMP and propofol derivatives. Z.L., H.T., M.A., and O.F. thoroughly discussed this project and wrote the manuscript.

Corresponding authors

Correspondence to Masanori Arita or Oliver Fiehn.

Ethics declarations

Competing interests

A.O. is a developer in Reifycs Inc., which provides the ABF converter of mass spectral data for free at

Integrated supplementary information

Supplementary Figure 1 Cross-study specificity and relevance analysis of unknown BinBase ID 21735 with BinVestigate.

Supplementary Figure 2 Methodology and example for MS-DIAL 2.0 program.

(a) Peak spotting: to determine fragment ions for GC-MS spectra, the detected m/z-RT features are termed 'peak spots', each with computed peak quality and peak sharpness values. (b) Feature detection: all peak spots with identical peak widths and peak-top retention times are combined into a single array. For each array, peak sharpness values are summed and a second Gaussian derivative filter is applied to construct 'peak groups'. (c) Deconvolution and identification: MS1Dec chromatogram deconvolution and the open-access MoNA mass spectral database are used to annotate the coeluting metabolites (phosphate, leucine, and glycerol), whose peak tops differ by 0.4–0.6 s. The terms “Match” and “R.Match” denote the dot-product and reverse dot-product values, respectively, as calculated in the NIST MS Search program.
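The difference between a forward and a reverse dot-product score can be illustrated with a minimal sketch. It uses a plain cosine similarity on nominal-mass spectra; the function name `cosine_match` and the example spectra are hypothetical, and the intensity and m/z weighting applied by the NIST MS Search program is deliberately not reproduced here.

```python
from math import sqrt


def cosine_match(query, library, reverse=False):
    """Cosine similarity (0 to 1) between two spectra given as {m/z: intensity}.

    Simplified illustration only: no intensity or m/z weighting is applied.
    """
    # Forward match scores every ion found in either spectrum.
    # Reverse match scores only ions present in the library spectrum, so
    # extra query ions (for example from a coeluting compound) are ignored.
    mz_values = sorted(set(library) if reverse else set(query) | set(library))
    q = [query.get(mz, 0.0) for mz in mz_values]
    lib = [library.get(mz, 0.0) for mz in mz_values]
    numerator = sum(a * b for a, b in zip(q, lib))
    denominator = sqrt(sum(a * a for a in q)) * sqrt(sum(b * b for b in lib))
    return numerator / denominator if denominator > 0 else 0.0


# Hypothetical spectra: the query carries one extra ion (m/z 299)
# that is absent from the library entry.
library_spectrum = {73: 100.0, 147: 50.0}
query_spectrum = {73: 100.0, 147: 50.0, 299: 80.0}

forward = cosine_match(query_spectrum, library_spectrum)
backward = cosine_match(query_spectrum, library_spectrum, reverse=True)
print(f"forward={forward:.3f} reverse={backward:.3f}")
```

In this toy case the reverse score is higher than the forward score because the contaminating ion is excluded from the comparison, which is why a reverse score is informative for incompletely deconvoluted spectra.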

Supplementary Figure 3 MS-DIAL 2.0 deconvolution example for Agilent GC-Q(MS).

The accuracy of GC-MS chromatogram deconvolution was confirmed by analyzing a biological sample on an Agilent GC-Q(MS) system. Other examples using LECO GC-TOF(MS), Shimadzu GC-Q(MS), Bruker GC-Q(MS), and Thermo GC-QExactive (MS) data are shown in Supplementary Data 1.

Supplementary Figure 4 Data stream of the MS-DIAL 2.0 program.

MS-DIAL 2.0 is designed as universal software for MS data processing. First, data in MS vendor formats or common formats (mzML/CDF) are converted to the ABF binary format for rapid data retrieval; the common formats mzML and netCDF can also be imported directly. MS-DIAL 2.0 then performs chromatogram deconvolution with support for any MS analytical platform, ranging from low- and high-resolution GC-MS (MS/MS) or LC-MS (MS/MS) to data-dependent or data-independent acquisition methods. Finally, the program annotates compounds by matching against a mass spectral library and supports further statistical analysis.

Supplementary Figure 5 Workflow for the MS-FINDER 2.0 program.

(a) Accurate mass GC-EI-MS data are used as input, with a defined molecular ion and its adduct type. (b) Derivatized formulas are computationally generated and ranked on the basis of valence and elemental ratio checks in combination with the original MS-FINDER formula scoring algorithm. (c) Structure candidates are retrieved from multiple databases. After the candidates are computationally derivatized, they are ranked by the result of substructure assignments from computational mass fragmentation.

Supplementary Figure 6 Performance validation of the MS-FINDER 2.0 program.

(a) Comparison of compound logP and natural product likeness between the metabolite dataset used for the accuracy test (denoted as 'GCMS library') and the databases in MS-FINDER 2.0 (FINDMetDB, MINE, and PubChem). (b) Performance test results of MS-FINDER 2.0 and a random sampling method with three structure resource sets.

Supplementary Figure 7 Authentic standard validation for the identification of N-methyl-UMP.

Mass spectra and retention times in GC-MS (a) and LC-MS/MS (b) were compared between BinBase ID 106699 in a cancer cell sample and the chemically synthesized N-methyl-UMP standard, as well as other isomeric compounds including 2'-O-methyl, 3'-O-methyl, and 5-methyl-UMP(UTP).

Supplementary Figure 8 A workflow for GC-MS and LC-MS/MS identification of N-methylalanine in MS-FINDER 2.0.

The workflow is the same as that shown in Figure 3. High-resolution GC-MS analysis was used for structure elucidation (left), and LC-MS/MS was then applied as an additional line of evidence (right). Unknown discovery: fragment ions and molecular adduct ions of BinBase ID 160842 were deconvoluted by MS-DIAL 2.0. Formula prediction: C4H9NO2 was ranked first in MS-FINDER 2.0 on the basis of mass errors, isotope ratio errors, and subformula assignments. Formula validation: for the GC-MS flow, chemical ionization data with different derivatization methods (MSTFA vs. MSTFAd9) were obtained to verify the formula and to yield the number of acidic protons; for the LC-MS flow, the mass errors between theoretical and experimental values were only 1 mDa, and the isotopic ratio errors were within 1%. Structure prediction: structure candidates were retrieved from MINE DB in addition to the internal metabolome database, and fragmented in silico on the basis of hydrogen rearrangement rules, bond dissociation energy, and a comprehensive fragmentation rule library (including GC-EI-MS and LC-ESI-MS/MS). N-methyl-alanine was ranked as the most likely structure in MS-FINDER 2.0 with computationally assigned substructures. Structure validation: the mass spectra and retention times in GC-MS (left) and LC-MS/MS (right) matched those of the chemically synthesized N-methyl-alanine standard.
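The mass-error check in the formula validation step can be sketched as follows. This is an illustrative calculation only, not the MS-FINDER scoring algorithm: the helper names, the assumption of a protonated [M+H]+ adduct, and the measured m/z value are all hypothetical.

```python
import re

# Monoisotopic masses (u) of the most abundant isotopes.
MONOISOTOPIC_MASS = {
    "C": 12.0,
    "H": 1.00782503223,
    "N": 14.00307400443,
    "O": 15.99491461957,
}
PROTON_MASS = 1.007276  # u, rounded; assumes an [M+H]+ adduct


def monoisotopic_mass(formula):
    """Monoisotopic mass of a simple formula such as 'C4H9NO2' (no brackets)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += MONOISOTOPIC_MASS[element] * (int(count) if count else 1)
    return total


def mass_error_mda(measured_mz, theoretical_mz):
    """Signed mass error in millidaltons."""
    return (measured_mz - theoretical_mz) * 1000.0


neutral = monoisotopic_mass("C4H9NO2")
theoretical_mz = neutral + PROTON_MASS
measured_mz = 104.0716  # hypothetical measurement, not taken from the paper
error = mass_error_mda(measured_mz, theoretical_mz)
print(f"neutral={neutral:.4f} [M+H]+={theoretical_mz:.4f} error={error:.2f} mDa")
```

A candidate formula would be kept only if the absolute error falls within the instrument tolerance; the tolerance itself is an experimental choice that this sketch does not fix.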

Supplementary Figure 9 GC-MS identification of lyso-monogalactosyl-monopalmitin with in silico fragmentation and substructure assignments in MS-FINDER 2.0.

After the structure candidates were suggested by MS-FINDER 2.0, the molecular skeleton was confirmed by the result of substructure assignments with manual inspection.

Supplementary Figure 10 GC-MS identification of 4-hydroxypropofol-1-glucuronide with in silico fragmentation and substructure assignments in MS-FINDER 2.0.

After the structure candidates were suggested by MS-FINDER 2.0, the molecular skeleton was confirmed by the result of substructure assignments with manual inspection. Finally, the structure was identified by the authentic standard compound.

Supplementary Figure 11 GC-MS identification of 4-hydroxypropofol-4-glucuronide with in silico fragmentation and substructure assignments in MS-FINDER 2.0.

After the structure candidates were suggested by MS-FINDER 2.0, the molecular skeleton was confirmed by the result of substructure assignments with manual inspection. Finally, the structure was identified by the authentic standard compound.

Supplementary Figure 12 Cross-study specificity and relevance analysis of unknown BinBase ID 8270 with BinVestigate.

Supplementary Figure 13 Investigation for the unique mass ratio of m/z 352 to m/z 315 among the EI-MS spectra of UMP and N-methyl-UMP in BinBase.

The x- and y-axes show the ratio of m/z 352 to m/z 315 and the count of EI-MS records, respectively.

Supplementary Figure 14 MS-DIAL 2.0 background subtraction in peak detection.

(a) Peaks were excluded as spike noise if the ion abundance of one neighboring point of the peak top was zero in the unsmoothed raw chromatogram. (b) Peaks were excluded as baseline noise if 4 spike noise signals were programmatically detected within a ±5 APW region of the peak top.
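The two exclusion rules in this legend can be expressed as a minimal sketch. It is an interpretation under stated assumptions rather than the MS-DIAL 2.0 source code: APW is taken here to mean average peak width, "4 spike noise signals" is read as "at least 4", a peak top at the edge of the chromatogram is treated as having a zero neighbor, and the function names are hypothetical.

```python
def is_spike_noise(raw_intensities, top_index):
    """Rule (a): a point next to the peak top has zero abundance.

    `raw_intensities` is the unsmoothed chromatogram of one m/z trace.
    Assumption: a missing neighbor at the array edge counts as zero.
    """
    left = raw_intensities[top_index - 1] if top_index > 0 else 0
    last = len(raw_intensities) - 1
    right = raw_intensities[top_index + 1] if top_index < last else 0
    return left == 0 or right == 0


def is_baseline_noise(peak_rt, spike_rts, apw, min_spikes=4, window_factor=5):
    """Rule (b): enough spike signals lie within +/- window_factor * APW.

    `spike_rts` holds retention times of other peaks already flagged as
    spike noise; `apw` is in the same time unit as the retention times.
    """
    window = window_factor * apw
    nearby = sum(1 for rt in spike_rts if abs(rt - peak_rt) <= window)
    return nearby >= min_spikes


# Hypothetical traces: an isolated one-point spike and a well-shaped peak.
print(is_spike_noise([0, 0, 500, 0, 0], top_index=2))
print(is_spike_noise([0, 120, 500, 130, 0], top_index=2))
print(is_baseline_noise(100.0, [98.0, 99.0, 101.0, 103.0], apw=1.0))
```

Keeping the two rules as separate functions mirrors the two panels of the figure: spikes are flagged first from the raw trace, and the baseline rule then reuses those flags.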

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 2475 kb)

Life Sciences Reporting Summary (PDF 129 kb)

Supplementary Table 1

Software comparison for MS-FINDER 2.0 versus CFM-ID, MetFrag, Molecular Structure Correlator (MSC), and MassFrontier (XLSX 222 kb)

Supplementary Table 2

Software comparison for MS-DIAL 2.0 versus AMDIS, AnalyzerPro, and ChromaTOF (XLSX 11 kb)

Supplementary Table 3

Summary of false discovery rate studies (XLSX 10 kb)

Supplementary Data Set 1

Examples of mass spectral deconvolution results from different vendors' instruments obtained with MS-DIAL 2.0. (a) LECO nominal mass GC-TOF-MS. (b) Shimadzu nominal mass GC-QMS. (c) Thermo Fisher accurate mass GC-QExactive MS. (d) Bruker accurate mass GC-TOF-MS. (PDF 342 kb)

Cite this article

Lai, Z., Tsugawa, H., Wohlgemuth, G. et al. Identifying metabolites by integrating metabolome databases with mass spectrometry cheminformatics. Nat Methods 15, 53–56 (2018).
