Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification

Li, Yuanyue; Kind, Tobias; Folz, Jacob; Vaniya, Arpana; Mehta, Sajjan Singh; Fiehn, Oliver

doi:10.1038/s41592-021-01331-z

Article
Published: 02 December 2021

Nature Methods volume 18, pages 1524–1531 (2021)

Abstract

Compound identification in small-molecule research, such as untargeted metabolomics or exposome research, relies on matching tandem mass spectrometry (MS/MS) spectra against experimental or in silico mass spectral libraries. Most software programs use dot product similarity scores. Here we introduce the concept of MS/MS spectral entropy to improve scoring results in MS/MS similarity searches via library matching. Entropy similarity outperformed 42 alternative similarity algorithms, including dot product similarity, when searching 434,287 spectra against the high-quality NIST20 library. Entropy similarity scores proved to be highly robust even when we added different levels of noise ions. When we applied entropy levels to 37,299 experimental spectra of natural products, false discovery rates of less than 10% were observed at entropy similarity score 0.75. Experimental human gut metabolome data were used to confirm that entropy similarity largely improved the accuracy of MS-based annotations in small-molecule research to false discovery rates below 10%, annotated new compounds and provided the basis to automatically flag poor-quality, noisy spectra.
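As a rough illustration of the two quantities named in the abstract, the sketch below computes a spectral entropy and an unweighted entropy similarity. It assumes that spectral entropy is the Shannon entropy (natural logarithm) of the intensities normalized to sum to 1, and that entropy similarity is one minus the entropy increase on merging two spectra, scaled by ln 4. It also assumes the two spectra have already been aligned to shared m/z bins. The function names and the dictionary representation are hypothetical; the authors' own implementation is the one in the repository listed under Code availability.

```python
import math


def spectral_entropy(intensities):
    """Shannon entropy (natural log) of intensities normalized to sum to 1."""
    total = sum(intensities)
    if total <= 0:
        return 0.0
    probs = [x / total for x in intensities if x > 0]
    return -sum(p * math.log(p) for p in probs)


def entropy_similarity(spec_a, spec_b):
    """Unweighted entropy similarity of two spectra given as {m/z bin: intensity}.

    Assumes both spectra are non-empty, have positive total intensity and
    share the same m/z binning (peak alignment is outside this sketch).
    """
    def normalize(spec):
        total = sum(spec.values())
        return {mz: i / total for mz, i in spec.items()}

    a, b = normalize(spec_a), normalize(spec_b)
    # Merge the two normalized spectra peak by peak (1:1 mixture).
    merged = {mz: a.get(mz, 0.0) + b.get(mz, 0.0) for mz in set(a) | set(b)}
    s_a = spectral_entropy(a.values())
    s_b = spectral_entropy(b.values())
    s_ab = spectral_entropy(merged.values())
    # Identical spectra give 1; spectra with no shared peaks give 0.
    return 1.0 - (2.0 * s_ab - s_a - s_b) / math.log(4)
```

Under these assumptions the score is bounded between 0 and 1, which is what makes a fixed cutoff such as the 0.75 quoted above meaningful across spectra with different numbers of peaks.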

**Fig. 1: Spectral entropy as a parameter to characterize MS/MS spectra.**

**Fig. 2: Schema for calculating similarity between MS/MS spectra of small molecules.**

**Fig. 3: ROC curves for 43 different spectral similarity algorithms for 434,287 accurate mass MS/MS spectra of the small-molecule NIST20 database.**

**Fig. 4: Robustness of MS/MS similarity algorithms against random noise ions.**

**Fig. 5: Relationship between FDR and spectral entropy levels for NIST20 accurate mass MS/MS spectra.**

**Fig. 6: Relationship between FDRs and spectral similarity in experimental spectra.**

Data availability

NIST Tandem Mass Spectral Library, 2020 release (NIST20) spectra are commercially available and can be purchased from multiple vendors. MassBank of North America (Massbank.us) spectra can be freely downloaded from https://massbank.us/. The metabolome dataset of the human upper gut intestinal tract can be freely downloaded from MetabolomicsWorkbench (https://www.metabolomicsworkbench.org/) with accession code ST001794. Source data are provided with this paper.

Code availability

The code for calculating spectral entropy and spectral similarity can be found at GitHub https://github.com/YuanyueLi/SpectralEntropy and at Zenodo https://doi.org/10.5281/zenodo.5591020.

Acknowledgements

This work was supported by the funding ‘West Coast Metabolomics Center for Compound Identification’, which was provided by the National Institutes of Health under award number NIH U2C ES030158 to O.F. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information

Authors and Affiliations

West Coast Metabolomics Center, UC Davis Genome Center, University of California, Davis, CA, USA
Yuanyue Li, Tobias Kind, Jacob Folz, Arpana Vaniya, Sajjan Singh Mehta & Oliver Fiehn
Olobion, Parc Científic de Barcelona, Barcelona, Spain
Sajjan Singh Mehta

Contributions

O.F. supervised and directed this project. Y.L. developed and performed the analysis, T.K. assisted with the concept of the study, J.F. and A.V. acquired and curated experimental data. S.S.M. tested and improved the source code. Y.L. and O.F. wrote the manuscript with comments from all the other authors.

Corresponding author

Correspondence to Oliver Fiehn.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Methods thanks Warwick Dunn, Junmin Peng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Example spectra with different spectral entropy.

Examples of experimental MS/MS spectra downloaded from MassBank.us that are characteristic for spectra with entropy levels 0-4. The precursor ion is shown in grey.

Extended Data Fig. 2 Impact of collision energy on spectral entropy.

(a) Spectral entropy violin density plots for low-energy (<15 eV), mid-energy (15–45 eV) and high-energy (>45 eV) high-resolution NIST20 MS/MS spectra. (b) Relationship between the increase of collision energy and the increase of spectral entropy in NIST20 MS/MS spectra for positively and negatively charged molecules.

Extended Data Fig. 3 Distribution of spectral entropy within compound classes in the NIST20 MS/MS library consisting of at least 50 molecules with collision energies between 20–40 eV.

The number of spectra is shown on the right of each compound class. For each boxplot, the center is the median. Left and right hinges depict the first and third quartiles. The whiskers stretch to 1.5-times the interquartile range of the corresponding hinge. (a) Positive mode electrospray MS/MS spectra (b) Negative mode electrospray MS/MS spectra.

Extended Data Fig. 4 NIST20 Orbitrap MS/MS spectra for ATP and ADP at collision energy 30 eV.

This comparison shows that similar structures may give highly similar mass spectra. Dot-product similarity yields much higher values, while differences between the mass spectra are most apparent for lower abundant ions that are emphasized by dynamic weighted similarity scores (for example for m/z 410).

Extended Data Fig. 5 Intensity weight for entropy similarity.

(a) The relationship between spectral entropy and weights to be used in the spectral entropy similarity algorithm. (b) The relationship between normalized AUC and entropy cutoff.
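The caption above refers to an entropy-dependent intensity weight. One scheme of that kind can be sketched as follows, assuming that spectra with entropy below a cutoff have their intensities raised to a power that grows linearly with entropy, so that minor ions in peak-poor spectra gain relative weight. The cutoff of 3 and the linear form 0.25 + 0.25 × entropy are assumptions made for illustration and should be checked against the authors' repository; the function name is hypothetical.

```python
import math


def entropy_weighted_intensities(intensities, entropy_cutoff=3.0):
    """Return normalized intensities, re-weighted when spectral entropy is low.

    Assumed scheme: below the cutoff, each normalized intensity is raised to
    the power 0.25 + 0.25 * entropy and the result is renormalized to sum to 1.
    At the cutoff the exponent equals 1, so the weighting is continuous.
    """
    total = sum(intensities)
    probs = [x / total for x in intensities if x > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    if entropy < entropy_cutoff:
        weight = 0.25 + 0.25 * entropy
        weighted = [p ** weight for p in probs]
        weighted_total = sum(weighted)
        probs = [w / weighted_total for w in weighted]
    return probs
```

With an exponent below 1, a spectrum dominated by one intense ion has its low-abundance ions boosted before similarity is computed, while spectra that already have high entropy pass through unchanged.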

Extended Data Fig. 6 Receiver-operator characteristic curves for using MS/MS spectra from the Massbank.us database search against Massbank.us database.

(a) Spectral similarity was compared without precursor ion removal. (b) Spectral similarity was compared with precursor ion removal.

Extended Data Fig. 7 Calculate the bond difference between molecules in chemical libraries.

Isomer molecules that differed only by one bond were considered highly similar and therefore used as ‘true positives’ in FDR calculations in similarity calculations.

Extended Data Fig. 8 Frequency distribution of identified and unknown metabolites detected in a new experimental dataset from samples of the upper human gut intestinal tract.

At (normalized entropy)⁴ > 0.8, identified metabolites are almost absent but spectra of unknown compounds show a rapid increase in frequency due to poor spectral quality.
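The threshold in this caption suggests a simple quality flag, sketched below. It assumes that normalized entropy is the spectral entropy divided by the natural logarithm of the number of peaks, which bounds it between 0 and 1; that definition, the function names and the default threshold argument are assumptions for illustration rather than the authors' code.

```python
import math


def normalized_entropy(intensities):
    """Spectral entropy divided by ln(number of peaks); 0 for fewer than 2 peaks."""
    peaks = [x for x in intensities if x > 0]
    if len(peaks) < 2:
        return 0.0
    total = sum(peaks)
    entropy = -sum((x / total) * math.log(x / total) for x in peaks)
    return entropy / math.log(len(peaks))


def looks_noisy(intensities, threshold=0.8):
    """Flag a spectrum whose (normalized entropy)^4 exceeds the threshold."""
    return normalized_entropy(intensities) ** 4 > threshold
```

A spectrum of many near-equal peaks, which is typical of noise, approaches a normalized entropy of 1 and is flagged, whereas a spectrum dominated by a few fragment ions stays well below the threshold.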

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2 and Supplementary Figs. 1–6

Reporting Summary

Supplementary Table 1

The AUC for the ROC curve in Figs. 3 and 4 and Supplementary Fig. 4.

Supplementary Table 2

Newly annotated compounds from the previously unannotated spectra of the human gut metabolome dataset.

Source data

Source Data Fig. 3

Statistical Source Data.

Source Data Fig. 4

Statistical Source Data.

Source Data Fig. 5

Statistical Source Data.

Source Data Fig. 6

Statistical Source Data.

Source Data Extended Data Fig. 6

Statistical Source Data.

About this article

Cite this article

Li, Y., Kind, T., Folz, J. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat Methods 18, 1524–1531 (2021). https://doi.org/10.1038/s41592-021-01331-z

Received: 12 February 2021
Accepted: 25 October 2021
Published: 02 December 2021
Issue Date: December 2021
DOI: https://doi.org/10.1038/s41592-021-01331-z
