Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification


Compound identification in small-molecule research, such as untargeted metabolomics or exposome research, relies on matching tandem mass spectrometry (MS/MS) spectra against experimental or in silico mass spectral libraries. Most software programs use dot product similarity scores. Here we introduce the concept of MS/MS spectral entropy to improve scoring results in MS/MS similarity searches via library matching. Entropy similarity outperformed 42 alternative similarity algorithms, including dot product similarity, when searching 434,287 spectra against the high-quality NIST20 library. Entropy similarity scores proved to be highly robust even when we added different levels of noise ions. When we applied entropy levels to 37,299 experimental spectra of natural products, false discovery rates of less than 10% were observed at entropy similarity score 0.75. Experimental human gut metabolome data were used to confirm that entropy similarity largely improved the accuracy of MS-based annotations in small-molecule research to false discovery rates below 10%, annotated new compounds and provided the basis to automatically flag poor-quality, noisy spectra.
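The abstract names two quantities, spectral entropy and entropy similarity, without giving formulas. As a rough sketch only: the code below assumes spectral entropy is the Shannon entropy of the intensity-normalized peaks, and that entropy similarity is one minus the entropy increase of a 1:1 mixed spectrum, scaled by ln 4. The function names, the dictionary representation of a spectrum, and the premise that peaks have already been matched to shared m/z keys are illustrative assumptions. Any intensity weighting or noise removal that a full implementation may apply is left out.

```python
import math


def normalize(peaks):
    """Scale intensities so they sum to 1 (peaks: dict of m/z key -> intensity)."""
    positive = {mz: i for mz, i in peaks.items() if i > 0}
    total = sum(positive.values())
    if total == 0:
        raise ValueError("spectrum has no positive intensities")
    return {mz: i / total for mz, i in positive.items()}


def spectral_entropy(peaks):
    """Shannon entropy (natural log) of the normalized intensities."""
    p = normalize(peaks)
    # "+ 0.0" turns a possible -0.0 (single-peak spectrum) into 0.0
    return -sum(v * math.log(v) for v in p.values()) + 0.0


def entropy_similarity(a, b):
    """Assumed form: 1 - (2*S_AB - S_A - S_B) / ln(4), with AB a 1:1 mix.

    Peaks are treated as matching only when their m/z keys are equal, so
    callers must bin or align m/z values beforehand (hypothetical
    simplification, not a tolerance-based match).
    """
    pa, pb = normalize(a), normalize(b)
    mixed = {
        mz: 0.5 * (pa.get(mz, 0.0) + pb.get(mz, 0.0))
        for mz in set(pa) | set(pb)
    }
    s_ab = spectral_entropy(mixed)
    return 1.0 - (2.0 * s_ab - spectral_entropy(pa) - spectral_entropy(pb)) / math.log(4)
```

Under these assumptions, two identical spectra score 1 and two spectra with no shared peaks score 0, so the value can be read on the same 0 to 1 scale as the 0.75 threshold quoted above.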

Fig. 1: Spectral entropy as a parameter to characterize MS/MS spectra.
Fig. 2: Schema for calculating similarity between MS/MS spectra of small molecules.
Fig. 3: ROC curves for 43 different spectral similarity algorithms for 434,287 accurate mass MS/MS spectra of the small-molecule NIST20 database.
Fig. 4: Robustness of MS/MS similarity algorithms against random noise ions.
Fig. 5: Relationship between FDR and spectral entropy levels for NIST20 accurate mass MS/MS spectra.
Fig. 6: Relationship between FDRs and spectral similarity in experimental spectra.

Data availability

NIST Tandem Mass Spectral Library, 2020 release (NIST20) spectra are commercially available and can be purchased from multiple vendors. MassBank of North America database spectra can be freely downloaded. The metabolome dataset of the human upper gut intestinal tract can be freely downloaded from Metabolomics Workbench with accession code ST001794. Source data are provided with this paper.

Code availability

The code for calculating spectral entropy and spectral similarity can be found on GitHub and Zenodo.



This work was supported by the funding ‘West Coast Metabolomics Center for Compound Identification’, which was provided by the National Institutes of Health under award number NIH U2C ES030158 to O.F. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript.

Author information


O.F. supervised and directed this project. Y.L. developed and performed the analysis, T.K. assisted with the concept of the study, J.F. and A.V. acquired and curated experimental data. S.S.M. tested and improved the source code. Y.L. and O.F. wrote the manuscript with comments from all the other authors.

Corresponding author

Correspondence to Oliver Fiehn.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Methods thanks Warwick Dunn, Junmin Peng and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Arunima Singh was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Example spectra with different spectral entropy.

Examples of downloaded experimental MS/MS spectra that are characteristic of spectra with entropy levels 0–4. The precursor ion is shown in grey.

Extended Data Fig. 2 Impact of collision energy on spectral entropy.

(a) Spectral entropy violin density plots for low-energy (<15 eV), mid-energy (15–45 eV) and high-energy (>45 eV) high-resolution NIST20 MS/MS spectra. (b) Relationship between the increase in collision energy and the increase in spectral entropy in NIST20 MS/MS spectra for positively and negatively charged molecules.

Extended Data Fig. 3 Distribution of spectral entropy within compound classes in the NIST20 MS/MS library consisting of at least 50 molecules with collision energies between 20–40 eV.

The number of spectra is shown on the right of each compound class. For each boxplot, the center is the median. Left and right hinges depict the first and third quartiles. The whiskers stretch to 1.5-times the interquartile range of the corresponding hinge. (a) Positive mode electrospray MS/MS spectra (b) Negative mode electrospray MS/MS spectra.

Extended Data Fig. 4 NIST20 Orbitrap MS/MS spectra for ATP and ADP at collision energy 30 eV.

This comparison shows that similar structures may give highly similar mass spectra. Dot-product similarity yields much higher values, while differences between the mass spectra are most apparent for lower-abundance ions, which are emphasized by dynamic weighted similarity scores (for example, m/z 410).

Extended Data Fig. 5 Intensity weight for entropy similarity.

(a) The relationship between spectral entropy and weights to be used in the spectral entropy similarity algorithm. (b) The relationship between normalized AUC and entropy cutoff.

Extended Data Fig. 6 Receiver-operator characteristic curves for using MS/MS spectra from the database search against database.

(a) Spectral similarity was compared without precursor ion removal. (b) Spectral similarity was compared with precursor ion removal.

Extended Data Fig. 7 Calculation of the bond difference between molecules in chemical libraries.

Isomeric molecules that differed by only one bond were considered highly similar and were therefore counted as ‘true positives’ in the FDR calculations for spectral similarity.

Extended Data Fig. 8 Frequency distribution of identified and unknown metabolites detected in a new experimental dataset from samples of the upper human gut intestinal tract.

At (normalized entropy)^4 > 0.8, identified metabolites are almost absent but spectra of unknown compounds show a rapid increase in frequency due to poor spectral quality.
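The cutoff in this caption suggests a simple quality filter. The sketch below is illustrative only and rests on an assumption not stated on this page: that normalized entropy is the Shannon entropy of the normalized intensities divided by the natural log of the number of peaks, so that it ranges from 0 to 1. The function names and the default threshold argument are hypothetical.

```python
import math


def normalized_entropy(intensities):
    """Assumed definition: Shannon entropy / ln(number of peaks), in [0, 1]."""
    positive = [i for i in intensities if i > 0]
    if len(positive) < 2:
        # With zero or one peak the entropy is 0 and ln(1) = 0, so return 0.
        return 0.0
    total = sum(positive)
    entropy = -sum((i / total) * math.log(i / total) for i in positive)
    return entropy / math.log(len(positive))


def looks_noisy(intensities, threshold=0.8):
    """Flag a spectrum when (normalized entropy)^4 exceeds the threshold."""
    return normalized_entropy(intensities) ** 4 > threshold
```

Under this assumed definition, a spectrum whose peaks all have roughly equal intensity approaches a normalized entropy of 1 and is flagged, while a spectrum dominated by a few fragment ions stays well below the cutoff.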

Supplementary information

Supplementary Information

Supplementary Notes 1 and 2 and Supplementary Figs. 1–6

Reporting Summary

Supplementary Table 1

The AUC for the ROC curve in Figs. 3 and 4 and Supplementary Fig. 4.

Supplementary Table 2

Newly annotated compounds from the previously unannotated spectra of the human gut metabolome dataset.

Source data

Source Data Fig. 3

Statistical Source Data.

Source Data Fig. 4

Statistical Source Data.

Source Data Fig. 5

Statistical Source Data.

Source Data Fig. 6

Statistical Source Data.

Source Data Extended Data Fig. 6

Statistical Source Data.

Cite this article

Li, Y., Kind, T., Folz, J. et al. Spectral entropy outperforms MS/MS dot product similarity for small-molecule compound identification. Nat Methods 18, 1524–1531 (2021).
