High performance computing framework for tera-scale database search of mass spectrometry data

Haseeb, Muhammad; Saeed, Fahad

doi:10.1038/s43588-021-00113-z

Article
Published: 20 August 2021

Nature Computational Science volume 1, pages 550–561 (2021)

Abstract

Database peptide search algorithms deduce peptides from mass spectrometry data. There has been substantial effort to improve their computational efficiency in order to enable larger and more complex systems biology studies. However, modern serial and high-performance computing (HPC) algorithms exhibit suboptimal performance, mainly due to ineffective parallel designs (low resource utilization) and high overhead costs. We present an HPC framework, called HiCOPS, for efficient acceleration of database peptide search algorithms on distributed-memory supercomputers. HiCOPS provides, on average, more than tenfold improvement in speed and superior parallel performance over several existing HPC database search tools. We also formulate a mathematical model for performance analysis and optimization, and report near-optimal results for several key metrics including strong-scaling efficiency, hardware utilization, load balance, inter-process communication and I/O overheads. The core parallel design, techniques and optimizations presented in HiCOPS are search-algorithm-independent and can be extended to efficiently accelerate existing and future algorithms and software.
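As a rough illustration of two of the metrics named above, the following Python sketch computes strong-scaling efficiency and a simple load-imbalance figure from wall-clock timings. It uses common textbook definitions, which are an assumption here and may differ from the performance model formulated in the paper; the function names and the sample timings in the usage note are hypothetical.

```python
from statistics import mean
from typing import Sequence


def speedup(t_base: float, t_parallel: float) -> float:
    """Ratio of the baseline run time to the parallel run time."""
    return t_base / t_parallel


def strong_scaling_efficiency(
    t_base: float, n_base: int, t_parallel: float, n_parallel: int
) -> float:
    """Strong-scaling efficiency for a fixed problem size.

    Equals 1.0 when run time shrinks in exact proportion to the
    number of processing units added (ideal scaling).
    """
    return (t_base * n_base) / (t_parallel * n_parallel)


def load_imbalance(per_rank_times: Sequence[float]) -> float:
    """Relative gap between the slowest rank and the mean rank time.

    Returns 0.0 for perfectly balanced work; larger values mean the
    slowest rank is holding the others back.
    """
    return max(per_rank_times) / mean(per_rank_times) - 1.0
```

For example, with hypothetical timings of 100 s on 1 node and 12.5 s on 8 nodes, `strong_scaling_efficiency(100.0, 1, 12.5, 8)` evaluates to 1.0, that is, ideal scaling.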


Data availability

All of the datasets used in this study are publicly available from PXD and can be accessed via https://www.ebi.ac.uk/pride/archive/projects/<AccessionNum>, where AccessionNum is the accession number for each dataset mentioned in the text (for example, to access S₁ PXD009072, use https://www.ebi.ac.uk/pride/archive/projects/PXD009072). The Homo sapiens protein sequence database can be downloaded from UniProtKB via https://www.uniprot.org/proteomes/UP000005640. The UniProt SwissProt (reviewed) database can be downloaded via https://www.uniprot.org/uniprot/?query=reviewed:yes. Source data are provided with this paper.
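The URL pattern described above can be sketched as a small Python helper. The function name `pride_project_url` is hypothetical, and the validation rule (the prefix `PXD` followed by six digits) is an assumption based on the example accession in the text, not something the article specifies.

```python
import re

# URL template taken from the data availability statement above.
PRIDE_PROJECT_URL = "https://www.ebi.ac.uk/pride/archive/projects/{accession}"

# Assumed accession shape: "PXD" followed by six digits (e.g. PXD009072).
_ACCESSION_PATTERN = re.compile(r"PXD\d{6}")


def pride_project_url(accession: str) -> str:
    """Build the PRIDE project page URL for a dataset accession number."""
    normalized = accession.strip().upper()
    if not _ACCESSION_PATTERN.fullmatch(normalized):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return PRIDE_PROJECT_URL.format(accession=normalized)
```

Calling `pride_project_url("PXD009072")` yields the example URL given in the text; accessions that do not match the assumed pattern raise `ValueError` instead of producing a malformed link.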

Code availability

The HiCOPS software is implemented in object-oriented C++17 with MPI and OpenMP, together with Python, Bash and CMake. The instrumentation interface for performance analysis is implemented via Timemory⁴². Command-line tools for MPI task mapping (Supplementary Section 7), database processing, file format conversion and result post-processing are also distributed with the software. The source code is available as open source at https://doi.org/10.5281/zenodo.5094072 (ref. ⁵⁰) and https://github.com/hicops/hicops. HiCOPS is under active development, and documentation updates, source code releases and related material will be posted on the same web pages. Please refer to https://hicops.github.io for detailed documentation, licensing and future software updates.

References

Nesvizhskii, A. I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics 73, 2092–2123 (2010).
Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSFragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513 (2017).
McIlwain, S. et al. Crux: rapid open source protein tandem mass spectrometry analysis. J. Proteome Res. 13, 4488–4491 (2014).
Yuan, Z.-F. et al. pParse: a method for accurate determination of monoisotopic peaks in high-resolution mass spectra. Proteomics 12, 226–235 (2012).
Deng, Y. et al. pClean: an algorithm to preprocess high-resolution tandem mass spectra for database searching. J. Proteome Res. 18, 3235–3244 (2019).
Degroeve, S. & Martens, L. MS2PIP: a tool for MS/MS peak intensity prediction. Bioinformatics 29, 3199–3203 (2013).
Zhou, X.-X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).
Zhang, J. et al. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics 11, M111–010587 (2012).
Devabhaktuni, A. et al. TagGraph reveals vast protein modification landscapes from large tandem mass spectrometry datasets. Nat. Biotechnol. 37, 469–479 (2019).
Chi, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat. Biotechnol. 36, 1059–1061 (2018).
Bern, M., Cai, Y. & Goldberg, D. Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Anal. Chem. 79, 1393–1400 (2007).
Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spec. 5, 976–989 (1994).
Craig, R. & Beavis, R. C. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spec. 17, 2310–2316 (2003).
Diament, B. J. & Noble, W. S. Faster SEQUEST searching for peptide identification from tandem mass spectra. J. Proteome Res. 10, 3871–3879 (2011).
Eng, J. K., Fischer, B., Grossmann, J. & MacCoss, M. J. A fast SEQUEST cross correlation algorithm. J. Proteome Res. 7, 4598–4602 (2008).
Park, C. Y., Klammer, A. A., Käll, L., MacCoss, M. J. & Noble, W. S. Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res. 7, 3022–3027 (2008).
Geer, L. Y. et al. Open mass spectrometry search algorithm. J. Proteome Res. 3, 958–964 (2004).
Hebert, A. S. et al. The one hour yeast proteome. Mol. Cell. Proteomics 13, 339–347 (2014).
Nesvizhskii, A. I. et al. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol. Cell. Proteomics 5, 652–670 (2006).
Eng, J. K., Searle, B. C., Clauser, K. R. & Tabb, D. L. A face in the crowd: recognizing peptides through database search. Mol. Cell. Proteomics 10, R111.009522 (2011).
Haseeb, M. & Saeed, F. Efficient shared peak counting in database peptide search using compact data structure for fragment-ion index. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 275–278 (IEEE, 2019).
Williams, S., Waterman, A. & Patterson, D. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 65–76 (2009).
Chi, H. et al. pFIND–Alioth: a novel unrestricted database search algorithm to improve the interpretation of high-resolution MS/MS data. J. Proteomics 125, 89–97 (2015).
Marx, V. The big challenges of big data. Nature 498, 255–260 (2013).
Duncan, D. T., Craig, R. & Link, A. J. Parallel tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X! tandem. J. Proteome Res. 4, 1842–1847 (2005).
Bjornson, R. D. et al. X!!Tandem, an improved method for running X!Tandem in parallel on collections of commodity computers. J. Proteome Res. 7, 293–299 (2007).
Pratt, B., Howbert, J. J., Tasman, N. I. & Nilsson, E. J. MR-tandem: parallel X! Tandem using Hadoop MapReduce on Amazon Web Services. Bioinformatics 28, 136–137 (2011).
Li, C., Li, K., Li, K. & Lin, F. MCtandem: an efficient tool for large-scale peptide identification on many integrated core (MIC) architecture. BMC Bioinformatics 20, 397 (2019).
Li, C., Li, K., Chen, T., Zhu, Y. & He, Q. SW-Tandem: a highly efficient tool for large-scale peptide sequencing with parallel spectrum dot product on Sunway TaihuLight. Bioinformatics 35, 3861–3863 (2019).
Chen, L. et al. MS-PyCloud: an open-source, cloud computing-based pipeline for LC-MS/MS data analysis. Preprint at https://www.biorxiv.org/content/10.1101/320887v1 (2018).
Prakash, A., Ahmad, S., Majumder, S., Jenkins, C. & Orsburn, B. Bolt: a new age peptide search engine for comprehensive MS/MS sequencing through vast protein databases in minutes. J. Am. Soc. Mass Spec. 30, 2408–2418 (2019).
Kaiser, P. et al. High-resolution community analysis of deep-sea copepods using MALDI-TOF protein fingerprinting. Deep Sea Res. I 138, 122–130 (2018).
Rossel, S. & Arbizu, P. M. Revealing higher than expected diversity of Harpacticoida (Crustacea: Copepoda) in the North Sea using MALDI-TOF MS and molecular barcoding. Sci. Rep. 9, 1–14 (2019).
Yates III, J. R. Proteomics of communities: metaproteomics. J. Proteome Res. 18, 2359 (2019).
Saeed, F., Haseeb, M. & Iyengar, S. S. Communication lower-bounds for distributed-memory computations for mass spectrometry based omics data. Preprint at https://arxiv.org/abs/2009.14123v2 (2021).
Beyter, D., Lin, M. S., Yu, Y., Pieper, R. & Bafna, V. Proteostorm: an ultrafast metaproteomics database search framework. Cell Syst. 7, 463–467 (2018).
Valiant, L. G. A bridging model for parallel computation. Commun. ACM 33, 103–111 (1990).
Tiskin, A. BSP (Bulk Synchronous Parallelism) 192–199 (Springer, 2011); https://doi.org/10.1007/978-0-387-09766-4_311
Towns, J. et al. XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16, 62–74 (2014).
Eng, J. K., Jahan, T. A. & Hoopmann, M. R. Comet: an open-source MS/MS sequence database search tool. Proteomics 13, 22–24 (2013).
Craig, R. & Beavis, R. C. Tandem: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–1467 (2004).
Madsen, J. R. et al. Timemory: modular performance analysis for HPC. In International Conference on High Performance Computing 434–452 (Springer, 2020).
Stevens, R., Ramprakash, J., Messina, P., Papka, M. & Riley, K. Aurora: Argonne’s Next-Generation Exascale Supercomputer Technical Report (Argonne National Laboratory, 2019).
Liu, K., Li, S., Wang, L., Ye, Y. & Tang, H. Full-spectrum prediction of peptides tandem mass spectra using deep neural network. Anal. Chem. 92, 4275–4283 (2020).
Lin, Y.-M., Chen, C.-T. & Chang, J.-M. MS2CNN: predicting MS/MS spectrum based on protein sequence using deep convolutional neural networks. BMC Genomics 20, 1–10 (2019).
Haseeb, M., Afzali, F. & Saeed, F. LBE: a computational load balancing algorithm for speeding up parallel peptide search in mass-spectrometry based proteomics. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 191–198 (IEEE, 2019).
Ding, J., Shi, J., Poirier, G. G. & Wu, F.-X. A novel approach to denoising ion trap tandem mass spectra. Proteome Sci. 7, 9 (2009).
Fenyö, D. & Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 75, 768–774 (2003).
LaViola, J. J. Double exponential smoothing: an alternative to Kalman filter-based predictive tracking. In Proc. Workshop on Virtual Environments 2003 199–206 (The Eurographics Association, 2003).
Haseeb, M. & Saeed, F. hicops/hicops: HiCOPS v1.0.0—1st Public Release (Zenodo, 2021); https://doi.org/10.5281/zenodo.5094072
Haseeb, M. & Saeed, F. Source Data: High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry Data (Zenodo, 2021); https://doi.org/10.5281/zenodo.5076575

Acknowledgements

This work used the National Science Foundation (NSF) XSEDE supercomputers through allocations TG-CCR150017 and TG-ASC200004 (F.S.). This research was supported by the NIGMS of the National Institutes of Health (NIH) under award number: R01GM134384 (F.S.). The authors were further supported by the NSF under award number: NSF CAREER OAC-1925960 (F.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH and/or the NSF.

Author information

Authors and Affiliations

Knight Foundation School of Computing and Information Sciences, Florida International University, Miami, FL, USA
Muhammad Haseeb & Fahad Saeed
Biomolecular Sciences Institute (BSI), Florida International University, Miami, FL, USA
Fahad Saeed
Department of Human and Molecular Genetics, Herbert Wertheim School of Medicine, Florida International University, Miami, FL, USA
Fahad Saeed

Authors

Muhammad Haseeb
Fahad Saeed

Contributions

M.H. and F.S. designed the parallel computational framework. M.H. implemented the software. M.H. and F.S. designed and performed the experiments, performed calculations, analyzed the data and results, and wrote the manuscript.

Corresponding authors

Correspondence to Muhammad Haseeb or Fahad Saeed.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Reviewer recognition statement Nature Computational Science thanks Robert Bjornson, Benjamin Neely, Yasset Perez-Riverol and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editor recognition statement Handling editor: Ananya Rastogi, in collaboration with the Nature Computational Science team.

Supplementary information

Supplementary Figs. 1–8, Sections 1–9 and Algorithms 1–5.

About this article

Cite this article

Haseeb, M., Saeed, F. High performance computing framework for tera-scale database search of mass spectrometry data. Nat Comput Sci 1, 550–561 (2021). https://doi.org/10.1038/s43588-021-00113-z

Received: 02 December 2020
Accepted: 16 July 2021
Published: 20 August 2021
Issue Date: August 2021
DOI: https://doi.org/10.1038/s43588-021-00113-z
