Auto-deconvolution and molecular networking of gas chromatography–mass spectrometry data


We engineered a machine learning approach, MSHub, to enable auto-deconvolution of gas chromatography–mass spectrometry (GC–MS) data. We then designed workflows to enable the community to store, process, share, annotate, compare and perform molecular networking of GC–MS data within the Global Natural Product Social (GNPS) Molecular Networking analysis platform. MSHub/GNPS performs auto-deconvolution of compound fragmentation patterns via unsupervised non-negative matrix factorization and quantifies the reproducibility of fragmentation patterns across samples.
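To make the idea of deconvolution by non-negative matrix factorization concrete, the following is a minimal sketch on synthetic data. It is not the MSHub implementation: it uses the textbook multiplicative-update rule of Lee and Seung as a stand-in, and the function name `nmf`, the synthetic spectra and the elution profiles are all illustrative assumptions. The data matrix has m/z bins as rows and scans as columns, so the columns of `W` play the role of fragmentation patterns and the rows of `H` the role of elution profiles.

```python
import numpy as np


def nmf(X, rank, n_iter=2000, seed=0, eps=1e-12):
    """Factor a non-negative matrix X (m/z bins x scans) as W @ H.

    Uses multiplicative updates for the Frobenius-norm objective.
    Hypothetical helper for illustration only, not MSHub code.
    """
    rng = np.random.default_rng(seed)
    n_bins, n_scans = X.shape
    W = rng.random((n_bins, rank)) + eps   # columns: candidate fragmentation patterns
    H = rng.random((rank, n_scans)) + eps  # rows: candidate elution profiles
    for _ in range(n_iter):
        # Each update keeps the factors non-negative and does not increase the error.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H


# Synthetic example (assumed values): two compounds with overlapping elution peaks.
scans = np.arange(60)
profiles = np.vstack([
    np.exp(-0.5 * ((scans - 25) / 4.0) ** 2),
    np.exp(-0.5 * ((scans - 32) / 4.0) ** 2),
])
spectra = np.array([
    [100.0, 0.0],
    [40.0, 10.0],
    [0.0, 80.0],
    [5.0, 60.0],
    [20.0, 0.0],
])
X = spectra @ profiles  # observed intensities: 5 m/z bins x 60 scans

W, H = nmf(X, rank=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.4f}")
```

In this sketch the number of components is fixed by hand, and the factors are recovered only up to scaling and ordering; real data would also require choosing the rank and handling noise.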

Fig. 1: The processing pipeline and performance.
Fig. 2: Analysis and molecular networking of GC–MS data.

Data availability

All of the data used in the preparation of this manuscript are publicly available at the MassIVE repository at the University of California, San Diego Center for Computational Mass Spectrometry website. The data set accession numbers are: #1 (MSV000084033), #2 (MSV000085136), #3 (MSV000084034), #4 (MSV000084036), #5 (MSV000084032), #6 (MSV000084038), #7 (MSV000084042), #8 (MSV000084039), #9 (MSV000084040), #10 (MSV000084037), #11 (MSV000084211), #12 (MSV000083598), #13 (MSV000080892), #14 (MSV000080892), #15 (MSV000080892), #16 (MSV000084337), #17 (MSV000083658), #18 (MSV000083743), #19 (MSV000084226), #20 (MSV000083859), #21 (MSV000083294), #22 (MSV000084349), #23 (MSV000081340), #24 (MSV000084348), #25 (MSV000084378), #26 (MSV000084338), #27 (MSV000084339), #28 (MSV000081161), #29 (MSV000084350), #30 (MSV000084377), #31 (MSV000084145), #32 (MSV000084144), #33 (MSV000084146), #34 (MSV000084379), #35 (MSV000084380), #36 (MSV000084276), #37 (MSV000084277) and #38 (MSV000084212).

All of the GNPS analysis jobs for all of the studies are summarized in Supplementary Table 1.

Code availability

The source code of the MSHub software, including low- and high-resolution data processing versions, is available online at GitHub (version used in GNPS) and at Bitbucket (standalone version in the MSHub developers’ repository, both high and low resolution). Scripts used to parse, filter and organize data and to generate the plots in the manuscript are available online at GitHub. A script for merging individual .mgf files into a single file for creating a global network is also available at GitHub.
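As an illustration of the merging step mentioned above, the following is a minimal sketch of combining several .mgf files into one. It is not the authors' script: the function name `merge_mgf` is hypothetical, and the choice to renumber `SCANS=` lines sequentially so that spectrum identifiers stay unique in the merged file is an assumption. It relies only on the standard `BEGIN IONS` / `END IONS` block delimiters of the MGF format.

```python
from pathlib import Path


def merge_mgf(paths, out_path):
    """Concatenate MGF files into out_path and return the number of spectra written.

    Hypothetical helper for illustration. SCANS= lines are renumbered
    sequentially (an assumption) so identifiers do not collide across files.
    """
    count = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            for line in Path(path).read_text(encoding="utf-8").splitlines():
                line = line.strip()
                if not line:
                    continue  # drop blank lines; one is re-added after each spectrum
                if line == "BEGIN IONS":
                    count += 1
                elif line.startswith("SCANS="):
                    line = f"SCANS={count}"
                out.write(line + "\n")
                if line == "END IONS":
                    out.write("\n")
    return count
```

A call such as `merge_mgf(["study_a.mgf", "study_b.mgf"], "merged.mgf")` (file names hypothetical) would write all spectra from both inputs to `merged.mgf`. Spectra without a `SCANS=` line are copied unchanged.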

The three-dimensional model, the feature table with coordinates used for the mapping and the snapshots shown in Fig. 4a–d are available online. The GC–MS-adapted MolNetEnhancer code, with an example Jupyter notebook, is also available online. Source data are provided with this paper.



The conversion of the data from different repositories was supported by grant R03 CA211211 on reuse of metabolomics data, to build enabling chemical analysis tools for the ocean symbiosis program, and the development of a user-friendly interface for GC–MS analysis was supported by the Gordon and Betty Moore Foundation through grant GBMF7622. The University of California, San Diego Center for Microbiome Innovation supported the campus-wide seed grant awards for data collection that enabled the development of some of this infrastructure. P.C.D. was supported by the National Science Foundation (grant no. IOS-1656475) and the National Institutes of Health (NIH) (grant nos. U19 AG063744 01, P41 GM103484, R03 CA211211 and R01 GM107550). K.V. and I.L. are very grateful for the support of the Vodafone Foundation as part of the DRUGS/DreamLab project. The MSHub platform development was supported by NIH/NIAAA grant (R21 AA028432) on integrated machine learning for mass spectrometry data in liver disease, Intelligify Limited and Vodafone Foundation’s DRUGS/CORONA-AI projects on network machine learning for drug repositioning and discovery of hyperfoods with antiviral/anticancer molecules. M.E. was supported by the University of Corsica. L.F.N. was supported by the NIH (R01 GM107550) and the European Union’s Horizon 2020 Research and Innovation Programme (MSCA-GF, 704786). A.B. was supported by the National Institute of Justice Award (2015-DN-BX-K047). Additional support for data acquisition and data storage was provided by the Center for Computational Mass Spectrometry (P41 GM103484). The collection of data from the HomeChem Project was supported by the Sloan Foundation. G.B.H., S.L.F.D., I.L., K.V. and I.B. are grateful for the support of the OG cancer breath analysis study by the National Institute for Health Research London Invitro Diagnostic Co-operative and the NIHR Imperial Biomedical Research Centre, the Rosetrees and Stonegate Trusts and the Imperial College Charity. 
D.V. acknowledges support from ERC-Consolidator grant 724228 (LEMAN). I.B. acknowledges the contribution of Q. Wen and M. Colavita in the production of the training video. C. Callewaert was supported by the Research Foundation Flanders, with support from the industrial research fund of Ghent University. W.B. was supported by the Research Foundation Flanders. A.A.O. acknowledges the support of the Fulbright Commission and Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET-Argentina). The work of R.L. and P.L.B. on data set 30 was supported by Metaboscope, part of the ‘Platform 3A’, funded by the European Regional Development Fund, the French Ministry of Research, Higher Education and Innovation, the Provence-Alpes-Côte d’Azur region, the Departmental Council of Vaucluse and the Urban Community of Avignon. S.A. and A.R.F. acknowledge the PlantaSYST project by the European Union’s Horizon 2020 Research and Innovation Programme (SGA-CSA nos. 664621 and 739582 under FPA no. 664620). V.V. acknowledges support from the National Institute on Alcohol Abuse and Alcoholism award R24AA022057. M. Guma and R.C. acknowledge the support of the Krupp Endowed Fund grant. A portion of the mass spectra in the public reference library was produced within the framework of the State Task for the Topchiev Institute of Petrochemical Synthesis RAS and with the support of the RUDN University Program 5-100. R.S.B. acknowledges support of the State Task for the Topchiev Institute of Petrochemical Synthesis RAS. L.N.K. acknowledges support of the RUDN University Program 5-100. I.M. acknowledges support of the Israel Science Foundation (project no. 1947/19) and European Research Council under the European Union’s Horizon 2020 Research and Innovation Programme (project no. 640384). J.S. has been supported by NIH/NIAMS R03AR072182, the Colton Center for Autoimmunity, the Rheumatology Research Foundation, the Riley Family Foundation and the Snyder Family Foundation.
J. Manasson acknowledges support from the 2017 Group for Research and Assessment of Psoriasis and Psoriatic Arthritis Pilot Research Grant and NIH/NIAMS T32AR069515. R.G. is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship. J.J.J.v.d.H. acknowledges support from an ASDI eScience grant (ASDI.2017.030) from the Netherlands eScience Center-NLeSC. B.A. was supported by the National Science Foundation through the Graduate Research Fellowship Program. GC–MS analyses for collection of the MSV000083743 data set were supported by the Pacific Northwest National Laboratory, Laboratory-Directed Research and Development Program, and were contributed by the Microbiomes in Transition Initiative; data were collected in the Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the Department of Energy (DOE) Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory (PNNL). PNNL is operated by the Battelle Memorial Institute for the DOE under contract DEAC05-76RLO1830. R.C. was also funded by T32AR064194-07. The authors are grateful to R. da Silva for his contribution to developing the first prototype of the EI data network and his continuous assistance with further development and testing of the infrastructure. The authors are also grateful to M. Vance and D. Farmer, who organized the sampling for the HomeChem Indoor Chemistry Project that allowed the collection of samples for the MSV000083598 data set. B. Ross assisted with collecting data for the MSV000084348 data set.
GC–MS analyses for collection of the MSV000084211 and MSV000084212 data sets were supported by N757 Doctorados Nacionales and project EXT-2016-69-1713 from the Departamento Administrativo de Ciencia, Tecnología e Innovación (COLCIENCIAS), the seed project INV-2019-67-1747 and the FAPA project of Chiara Carazzone from the Faculty of Science at Universidad de los Andes and the grant FP80740-064-2016 of COLCIENCIAS. The authors are grateful to L. M. Garzón, P. Palacios, M. Gonzalez and J. Hernandez for their contributions to collecting the samples and to J. Oswaldo Turizo for designing and manufacturing the amphibian electrical stimulator. A.S. and X.D. acknowledge support from National Cancer Institute award U01CA235507. The authors are grateful to S. Neuman for feedback regarding the XCMS deconvolution tool.

Author information




P.C.D., A.A.A., M.W. and L.F.N. developed the concept of GNPS for GC–MS data. K.V. designed and supervised MSHub platform development. I.L., D.V., V.V. and K.V. developed the MSHub platform. M.W., Z.Z. and A.A.A. developed the workflows. A.A.A., Z.Z., M.W., B.B.M. and R.S.B. performed infrastructure testing and benchmarking. A.A.A. and Z.Z. assessed EI-based molecular networking. W.B. generated plots for MSHub algorithm performance testing and benchmarking against existing deconvolution tools. Z.Z., A.A. and M.E. generated molecular network plots. M.E. and J.J.J.v.d.H. adapted the MolNetEnhancer workflow for GC–MS molecular networks. A.S., X.D., A.A.A. and B.B.M. conducted comparative testing of MSHub with existing deconvolution tools. A.A.A., A.V.M., M.P., K.L.J. and K.D. conducted three-dimensional skin volatilome mapping studies. S.L.F.D., I.B. and G.B.H. conducted the breath analysis study for detection of esophageal and gastric cancers. A.A.A., Z.Z., M.P. and M.W. converted and added public libraries to GNPS. A.A.A., A.V.M., S.L.F.D., C. Callewaert, B.B.M., M. Gonzalez, C. Carazzone, A.A., J.T.M., R.A.Q., A.B., A.A.O., D.P., A.M.S., S.P.C., T.O.M., M.C.B., C.D.N., E.Z., V.A., E.H.-F., R.G., M.M.M., I.M., S.E., P.L.B., B.A., R.D., R.L., Y.G., S.P., A.P., G.D., B.L.B., A.F., N.S.P., K.G., C.S., R.C., M. Guma, J. Manasson, J.U.S., D.K.B., S.A. and A.R.F. generated GC–MS data. R.S.B., L.N.K., M.P. and A.A.A. assembled the initial version of the public reference spectra library. R.S. created the MZmine export module for GNPS GC–MS input files and RI markers file export. A.A.A., R.S., I.B., A.A.O., A.M.S., B.A., M. Gonzalez, K.N.M. and R.S.B. produced training videos. M.N.-E., A.A.A., M. Gonzalez, B.B.M., A.S. and L.F.N. wrote and compiled tutorials and documentation. P.C.D., A.A.A., W.B., K.V., R.M. and R.K. wrote the paper.

Corresponding authors

Correspondence to Pieter C. Dorrestein or Kirill Veselkov.

Ethics declarations

Competing interests

P.C.D. is a scientific advisor for Sirenas, Galileo and Cybele. P.C.D. is a scientific advisor and cofounder of Enveda and Ometa; this has been approved by UC San Diego. M.W. is a consultant for Sirenas and the founder of Ometa Labs. A.A.A. is a consultant for Ometa Labs.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–10, Tables 1–8 and Notes

Reporting Summary

Source data

Source Data for Supplementary Fig. 4

Deconvolution time testing source data

Source Data Supplementary Fig. 6

Global network cosine distribution source data

About this article

Cite this article

Aksenov, A.A., Laponogov, I., Zhang, Z. et al. Auto-deconvolution and molecular networking of gas chromatography–mass spectrometry data. Nat Biotechnol 39, 169–173 (2021).
