Abstract
Public repositories of metabolomics mass spectra encompass more than 1 billion entries. With open search, dot product or entropy similarity, comparisons of a single tandem mass spectrometry spectrum take more than 8 h. Flash entropy search speeds up calculations more than 10,000 times to query 1 billion spectra in less than 2 s, without loss in accuracy. It benefits from using multiple threads and GPU calculations. This algorithm can fully exploit large spectral libraries with little memory overhead for any mass spectrometry laboratory.
Data availability
All spectra from MassBank.us (https://massbank.us/) and GNPS (https://gnps-external.ucsd.edu/gnpslibrary/ALL_GNPS.mgf) were downloaded on 3 March 2023. Additional MS/MS spectra from public repositories were downloaded from MassIVE/GNPS (https://gnps.ucsd.edu/ProteoSAFe/datasets.jsp#%7B%22query%22%3A%7B%7D%2C%22title_input%22%3A%22GNPS%22%7D), MetabolomicsWorkbench.org (https://www.metabolomicsworkbench.org/) and MetaboLights (https://www.ebi.ac.uk/metabolights/) in December 2022. In total, more than 939 million spectra were available (237,185,147 negative ESI and 701,996,947 positive ESI MS/MS spectra). All the spectra from those sources were used in this study. Source data are provided with this paper.
Code availability
The original source code and benchmark data for the Flash entropy search are available under the Apache License 2.0 on GitHub (https://github.com/YuanyueLi/FlashEntropySearch) and Zenodo (https://doi.org/10.5281/zenodo.7972082), as well as on CodeOcean (https://doi.org/10.24433/CO.8809500.v1). The GUI can be downloaded from the GitHub repository: https://github.com/YuanyueLi/EntropySearch. Flash entropy search is also integrated into the ‘MSEntropy’ package, available for download from https://github.com/YuanyueLi/MSEntropy. Comprehensive documentation for the ‘MSEntropy’ package can be found at https://msentropy.readthedocs.io.
Acknowledgements
This study was funded by National Institutes of Health grants U2C ES030158 and R03 OD034497 (to O.F.).
Author information
Contributions
Y.L. and O.F. conceptualized the study. Y.L. designed the algorithm and performed the benchmarking. O.F. supervised the project. Y.L. and O.F. wrote the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Methods thanks Xusheng Wang and Jianguo Xia for their contribution to the peer review of this work. Peer reviewer reports are available. Primary Handling Editor: Arunima Singh, in collaboration with the Nature Methods team.
Extended data
Extended Data Fig. 1 Examples for calculating Flash entropy similarity.
(a) Example in which all ions match between the query spectrum (top) and the library spectrum (bottom). (b) Example in which only one pair of ions matches between the query and library spectra. Note that the ion intensities in each spectrum are normalized to sum to 0.5 (see Supplementary Note 1 for equations). Hence, mismatched ions do not contribute directly to the similarity score, but they are taken into account during normalization.
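The two cases in this caption can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the unweighted entropy similarity, in which each matched ion pair with normalized intensities a and b contributes f(a + b) - f(a) - f(b), where f(x) = x log2(x). It also assumes centroided peaks sorted by m/z and a fixed tolerance of 0.02 Da. The function names and the tolerance value are hypothetical, and the intensity weighting for low-entropy spectra is omitted.

```python
import math


def normalize_half(intensities):
    """Scale intensities so that they sum to 0.5 (assumed normalization)."""
    total = sum(intensities)
    return [0.5 * x / total for x in intensities]


def xlog2x(x):
    """Return x * log2(x), with the limit value 0 at x = 0."""
    return x * math.log2(x) if x > 0 else 0.0


def entropy_similarity(query, library, tol=0.02):
    """Unweighted entropy similarity of two spectra.

    query, library: lists of (mz, intensity) tuples sorted by mz.
    tol: hypothetical m/z tolerance in Da; peaks within one spectrum
         are assumed to be further apart than this value.
    """
    q_int = normalize_half([i for _, i in query])
    l_int = normalize_half([i for _, i in library])
    score = 0.0
    j = 0
    for (q_mz, _), a in zip(query, q_int):
        # Skip library peaks that lie below the tolerance window.
        while j < len(library) and library[j][0] < q_mz - tol:
            j += 1
        # Only matched pairs add to the score; unmatched peaks have
        # already lowered the others' intensities via normalization.
        if j < len(library) and abs(library[j][0] - q_mz) <= tol:
            b = l_int[j]
            score += xlog2x(a + b) - xlog2x(a) - xlog2x(b)
            j += 1
    return score


# Case (a): every ion matches.
all_match = entropy_similarity([(100.0, 3.0), (200.0, 1.0)],
                               [(100.0, 3.0), (200.0, 1.0)])
# Case (b): only the ion at m/z 100 matches.
one_match = entropy_similarity([(100.0, 1.0), (200.0, 1.0)],
                               [(100.0, 1.0), (300.0, 1.0)])
```

Under these assumptions, identical spectra score 1 and spectra with no shared ions score 0. In case (b), the matched pair carries half of each spectrum's intensity, so the score is 0.5.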
Extended Data Fig. 2 Distributions of spectral entropies when sampling spectra from different MS/MS repositories for benchmarking studies.
(a) MassBank.us, (b) GNPS for annotated compounds (library), (c) all combined experimental public MS/MS repositories including MassIVE/GNPS, MetaboLights, MetabolomicsWorkbench and West Coast Metabolomics Center.
Extended Data Fig. 3 Computation time required to perform ‘open search’ queries using entropy similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.
MS/MS spectra were sampled from (a) GNPS, (b) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) GNPS, (b) public repositories.
Extended Data Fig. 4 Computation time required to perform ‘open search’ queries using dot product similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.
MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.
Extended Data Fig. 5 Computation time required to perform ‘neutral loss’ searches with entropy similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.
MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.
Extended Data Fig. 6 Computation time required to perform ‘hybrid searches’ with entropy similarity for 100 positive ESI and 100 negative ESI mass spectra against spectral libraries of different sizes.
MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories. Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 200 independent MS/MS spectra randomly sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.
Extended Data Fig. 7 Calculation time to open search 100 positive ESI and 100 negative ESI MS/MS spectra at different spectral entropy levels against randomly picked samples from the MassBank.us library.
Box plots display medians as horizontal lines inside the boxes that delineate interquartile ranges (IQR). Whiskers extend to the lowest or highest data point within 1.5x IQR of the 25% and 75% quartiles. N = 100 independent MS/MS spectra randomly sampled from MassBank.us.
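The box plot convention used in these captions can be sketched in Python as follows. This is only an illustration: the captions do not state how the quartiles were computed, so the "inclusive" quantile method is an assumption, and the function name and example timings are hypothetical.

```python
import statistics


def box_plot_stats(values):
    """Median, quartiles and whisker ends under the 1.5x IQR rule."""
    # Quartile method is an assumption; other methods shift q1 and q3 slightly.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        # Whiskers end at the most extreme data points inside the fences.
        "lower_whisker": min(v for v in values if v >= low_fence),
        "upper_whisker": max(v for v in values if v <= high_fence),
        # Points beyond the fences are outliers.
        "outliers": [v for v in values if v < low_fence or v > high_fence],
    }


# Hypothetical query times, with one slow outlier.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Here the upper whisker stops at 9 because 100 lies beyond the upper fence, so it is reported as an outlier and not as a whisker end.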
Extended Data Fig. 8 Comparison of the accuracy of similarity query results between Flash entropy search and BLINK.
Each dot shows the maximum similarity difference between the fast algorithms and their classic algorithm counterparts. 100 positive ESI and 100 negative ESI MS/MS spectra were sampled from (a) MassBank.us, (b) GNPS, (c) public repositories.
Supplementary information
Supplementary Information
Supplementary Notes 1 and 2 and Tables 1 and 2.
Source data
Source Data Fig. 2
Statistical source data.
Source Data Extended Data Fig. 3
Statistical source data.
Source Data Extended Data Fig. 4
Statistical source data.
Source Data Extended Data Fig. 5
Statistical source data.
Source Data Extended Data Fig. 6
Statistical source data.
About this article
Cite this article
Li, Y., Fiehn, O. Flash entropy search to query all mass spectral libraries in real time. Nat Methods 20, 1475–1478 (2023). https://doi.org/10.1038/s41592-023-02012-9
DOI: https://doi.org/10.1038/s41592-023-02012-9