High performance computing framework for tera-scale database search of mass spectrometry data

Abstract

Database peptide search algorithms deduce peptides from mass spectrometry data. Substantial effort has gone into improving their computational efficiency to enable larger and more complex systems biology studies. However, modern serial and high-performance computing (HPC) algorithms exhibit suboptimal performance, mainly because of ineffective parallel designs (low resource utilization) and high overhead costs. We present an HPC framework, called HiCOPS, for efficient acceleration of database peptide search algorithms on distributed-memory supercomputers. HiCOPS provides, on average, more than a tenfold improvement in speed and superior parallel performance over several existing HPC database search tools. We also formulate a mathematical model for performance analysis and optimization, and report near-optimal results for several key metrics, including strong-scale efficiency, hardware utilization, load balance, inter-process communication and I/O overheads. The core parallel design, techniques and optimizations presented in HiCOPS are search-algorithm-independent and can be extended to efficiently accelerate existing and future algorithms and software.
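The speed and strong-scale efficiency metrics named in the abstract can be illustrated with the textbook definitions of speedup and strong-scaling efficiency. The sketch below is a minimal illustration, not the performance model formulated in the paper; the function names and the runtimes are hypothetical and are not measurements from this study.

```python
def speedup(t_ref: float, t_parallel: float) -> float:
    """Speedup of a parallel run relative to a reference run (same problem size)."""
    if t_ref <= 0 or t_parallel <= 0:
        raise ValueError("runtimes must be positive")
    return t_ref / t_parallel


def strong_scaling_efficiency(
    t_ref: float, p_ref: int, t_parallel: float, p_parallel: int
) -> float:
    """Fraction of the ideal speedup achieved when the problem size is fixed
    and the number of processes grows from p_ref to p_parallel."""
    if p_ref <= 0 or p_parallel <= 0:
        raise ValueError("process counts must be positive")
    ideal_speedup = p_parallel / p_ref
    return speedup(t_ref, t_parallel) / ideal_speedup


# Hypothetical runtimes in seconds (illustrative only, not from the paper):
# 1200 s on 1 process versus 100 s on 16 processes.
print(speedup(1200.0, 100.0))                            # 12.0
print(strong_scaling_efficiency(1200.0, 1, 100.0, 16))   # 0.75
```

An efficiency of 1.0 would mean the run scales ideally with the added processes; values below 1.0 reflect overheads such as communication, I/O and load imbalance.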

Fig. 1: Methods overview.
Fig. 2: Correctness analysis.
Fig. 3: Speed comparisons.
Fig. 4: Performance Metrics.
Fig. 5: Overhead analysis.

Data availability

All of the datasets used in this study are publicly available from PXD and can be accessed via https://www.ebi.ac.uk/pride/archive/projects/<AccessionNum>, where AccessionNum is the accession number for each dataset mentioned in the text (for example, to access S1 PXD009072, use https://www.ebi.ac.uk/pride/archive/projects/PXD009072). The Homo sapiens protein sequence database can be downloaded from UniProtKB via https://www.uniprot.org/proteomes/UP000005640. The UniProt SwissProt (reviewed) database can be downloaded via https://www.uniprot.org/uniprot/?query=reviewed:yes. Source data are provided with this paper.
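The URL pattern described above can be sketched as a small helper that maps an accession number to its PRIDE project page. This is an illustrative sketch only: the function name `pride_project_url` is hypothetical, and the validation assumes accessions consist of `PXD` followed by six digits, as in the example above.

```python
import re

# Base URL taken from the data availability statement.
PRIDE_PROJECT_BASE = "https://www.ebi.ac.uk/pride/archive/projects/"


def pride_project_url(accession: str) -> str:
    """Return the PRIDE project page URL for a dataset accession number.

    Assumption: accessions look like 'PXD' followed by six digits.
    """
    accession = accession.strip().upper()
    if not re.fullmatch(r"PXD\d{6}", accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return PRIDE_PROJECT_BASE + accession


# Dataset S1 from the text.
print(pride_project_url("PXD009072"))
# https://www.ebi.ac.uk/pride/archive/projects/PXD009072
```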

Code availability

The HiCOPS software is implemented in object-oriented C++17 with MPI and OpenMP, together with Python, Bash and CMake. The instrumentation interface for performance analysis is implemented with Timemory (ref. 42). Command-line tools for MPI task mapping (Supplementary Section 7), database processing, file format conversion and result post-processing are also distributed with the software. The source code is openly available at https://doi.org/10.5281/zenodo.5094072 (ref. 50) and https://github.com/hicops/hicops. HiCOPS is under active development; please refer to https://hicops.github.io for detailed documentation, licensing, source code releases and future software updates.

References

  1. Nesvizhskii, A. I. A survey of computational methods and error rate estimation procedures for peptide and protein identification in shotgun proteomics. J. Proteomics 73, 2092–2123 (2010).

  2. Kong, A. T., Leprevost, F. V., Avtonomov, D. M., Mellacheruvu, D. & Nesvizhskii, A. I. MSfragger: ultrafast and comprehensive peptide identification in mass spectrometry-based proteomics. Nat. Methods 14, 513 (2017).

  3. McIlwain, S. et al. Crux: rapid open source protein tandem mass spectrometry analysis. J. Proteome Res. 13, 4488–4491 (2014).

  4. Yuan, Z.-F. et al. pParse: a method for accurate determination of monoisotopic peaks in high-resolution mass spectra. Proteomics 12, 226–235 (2012).

  5. Deng, Y. et al. pClean: an algorithm to preprocess high-resolution tandem mass spectra for database searching. J. Proteome Res. 18, 3235–3244 (2019).

  6. Degroeve, S. & Martens, L. Ms2pip: a tool for ms/ms peak intensity prediction. Bioinformatics 29, 3199–3203 (2013).

  7. Zhou, X.-X. et al. pDeep: predicting MS/MS spectra of peptides with deep learning. Anal. Chem. 89, 12690–12697 (2017).

  8. Zhang, J. et al. PEAKS DB: de novo sequencing assisted database search for sensitive and accurate peptide identification. Mol. Cell. Proteomics 11, M111–010587 (2012).

  9. Devabhaktuni, A. et al. TagGraph reveals vast protein modification landscapes from large tandem mass spectrometry datasets. Nat. Biotechnol. 37, 469–479 (2019).

  10. Chi, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat. Biotechnol. 36, 1059–1061 (2018).

  11. Bern, M., Cai, Y. & Goldberg, D. Lookup peaks: a hybrid of de novo sequencing and database search for protein identification by tandem mass spectrometry. Anal. Chem. 79, 1393–1400 (2007).

  12. Eng, J. K., McCormack, A. L. & Yates, J. R. An approach to correlate tandem mass spectral data of peptides with amino acid sequences in a protein database. J. Am. Soc. Mass Spec. 5, 976–989 (1994).

  13. Craig, R. & Beavis, R. C. A method for reducing the time required to match protein sequences with tandem mass spectra. Rapid Commun. Mass Spec. 17, 2310–2316 (2003).

  14. Diament, B. J. & Noble, W. S. Faster sequest searching for peptide identification from tandem mass spectra. J. Proteome Res. 10, 3871–3879 (2011).

  15. Eng, J. K., Fischer, B., Grossmann, J. & MacCoss, M. J. A fast sequest cross correlation algorithm. J. Proteome Res. 7, 4598–4602 (2008).

  16. Park, C. Y., Klammer, A. A., Kall, L., MacCoss, M. J. & Noble, W. S. Rapid and accurate peptide identification from tandem mass spectra. J. Proteome Res. 7, 3022–3027 (2008).

  17. Geer, L. Y. et al. Open mass spectrometry search algorithm. J. Proteome Res. 3, 958–964 (2004).

  18. Hebert, A. S. et al. The one hour yeast proteome. Mol. Cell. Proteomics 13, 339–347 (2014).

  19. Nesvizhskii, A. I. et al. Dynamic spectrum quality assessment and iterative computational analysis of shotgun proteomic data toward more efficient identification of post-translational modifications, sequence polymorphisms, and novel peptides. Mol. Cell. Proteomics 5, 652–670 (2006).

  20. Eng, J. K., Searle, B. C., Clauser, K. R. & Tabb, D. L. A face in the crowd: recognizing peptides through database search. Mol. Cell. Proteomics 10, R111.009522 (2011).

  21. Haseeb, M. & Saeed, F. Efficient shared peak counting in database peptide search using compact data structure for fragment-ion index. In 2019 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 275–278 (IEEE, 2019).

  22. Williams, S., Waterman, A. & Patterson, D. Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52, 65–76 (2009).

  23. Chi, H. et al. pFIND–Alioth: a novel unrestricted database search algorithm to improve the interpretation of high-resolution MS/MS data. J. Proteomics 125, 89–97 (2015).

  24. Marx, V. The big challenges of big data. Nature 498, 255–260 (2013).

  25. Duncan, D. T., Craig, R. & Link, A. J. Parallel tandem: a program for parallel processing of tandem mass spectra using PVM or MPI and X! tandem. J. Proteome Res. 4, 1842–1847 (2005).

  26. Bjornson, R. D. et al. X!!Tandem, an improved method for running X!Tandem in parallel on collections of commodity computers. J. Proteome Res. 7, 293–299 (2007).

  27. Pratt, B., Howbert, J. J., Tasman, N. I. & Nilsson, E. J. MR-tandem: parallel X! Tandem using Hadoop MapReduce on Amazon Web Services. Bioinformatics 28, 136–137 (2011).

  28. Li, C., Li, K., Li, K. & Lin, F. MCtandem: an efficient tool for large-scale peptide identification on many integrated core (MIC) architecture. BMC Bioinformatics 20, 397 (2019).

  29. Li, C., Li, K., Chen, T., Zhu, Y. & He, Q. SW-Tandem: a highly efficient tool for large-scale peptide sequencing with parallel spectrum dot product on Sunway TaihuLight. Bioinformatics 35, 3861–3863 (2019).

  30. Chen, L. et al. MS-PyCloud: an open-source, cloud computing-based pipeline for LC-MS/MS data analysis. Preprint at https://www.biorxiv.org/content/10.1101/320887v1 (2018).

  31. Prakash, A., Ahmad, S., Majumder, S., Jenkins, C. & Orsburn, B. Bolt: a new age peptide search engine for comprehensive MS/MS sequencing through vast protein databases in minutes. J. Am. Soc. Mass Spec. 30, 2408–2418 (2019).

  32. Kaiser, P. et al. High-resolution community analysis of deep-sea copepods using MALDI-TOF protein fingerprinting. Deep Sea Res. I 138, 122–130 (2018).

  33. Rossel, S. & Arbizu, P. M. Revealing higher than expected diversity of Harpacticoida (Crustacea: Copepoda) in the North Sea using MALDI-TOF MS and molecular barcoding. Sci. Rep. 9, 1–14 (2019).

  34. Yates III, J. R. Proteomics of communities: metaproteomics. J. Proteome Res. 18, 2359 (2019).

  35. Saeed, F., Haseeb, M. & Iyengar, S. S. Communication lower-bounds for distributed-memory computations for mass spectrometry based omics data. Preprint at https://arxiv.org/abs/2009.14123v2 (2021).

  36. Beyter, D., Lin, M. S., Yu, Y., Pieper, R. & Bafna, V. Proteostorm: an ultrafast metaproteomics database search framework. Cell Syst. 7, 463–467 (2018).

  37. Valiant, L. G. A bridging model for parallel computation. Commun. ACM 33, 103–111 (1990).

  38. Tiskin, A. BSP (Bulk Synchronous Parallelism) 192–199 (Springer, 2011); https://doi.org/10.1007/978-0-387-09766-4_311

  39. Towns, J. et al. XSEDE: accelerating scientific discovery. Comput. Sci. Eng. 16, 62–74 (2014).

  40. Eng, J. K., Jahan, T. A. & Hoopmann, M. R. Comet: an open-source MS/MS sequence database search tool. Proteomics 13, 22–24 (2013).

  41. Craig, R. & Beavis, R. C. Tandem: matching proteins with tandem mass spectra. Bioinformatics 20, 1466–1467 (2004).

  42. Madsen, J. R. et al. Timemory: modular performance analysis for HPC. In International Conference on High Performance Computing 434–452 (Springer, 2020).

  43. Stevens, R., Ramprakash, J., Messina, P., Papka, M. & Riley, K. Aurora: Argonne’s Next-Generation Exascale Supercomputer Technical Report (Argonne National Laboratory, 2019).

  44. Liu, K., Li, S., Wang, L., Ye, Y. & Tang, H. Full-spectrum prediction of peptides tandem mass spectra using deep neural network. Anal. Chem. 92, 4275–4283 (2020).

  45. Lin, Y.-M., Chen, C.-T. & Chang, J.-M. MS2CNN: predicting MS/MS spectrum based on protein sequence using deep convolutional neural networks. BMC Genomics 20, 1–10 (2019).

  46. Haseeb, M., Afzali, F. & Saeed, F. LBE: a computational load balancing algorithm for speeding up parallel peptide search in mass-spectrometry based proteomics. In 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) 191–198 (IEEE, 2019).

  47. Ding, J., Shi, J., Poirier, G. G. & Wu, F.-X. A novel approach to denoising ion trap tandem mass spectra. Proteome Sci. 7, 9 (2009).

  48. Fenyö, D. & Beavis, R. C. A method for assessing the statistical significance of mass spectrometry-based protein identifications using general scoring schemes. Anal. Chem. 75, 768–774 (2003).

  49. LaViola, J. J. Double exponential smoothing: an alternative to Kalman filter-based predictive tracking. In Proc. Workshop on Virtual Environments 2003 199–206 (The Eurographics Association, 2003).

  50. Haseeb, M. & Saeed, F. hicops/hicops: HiCOPS v1.0.0—1st Public Release (Zenodo, 2021); https://doi.org/10.5281/zenodo.5094072

  51. Haseeb, M. & Saeed, F. Source Data: High Performance Computing Framework for Tera-Scale Database Search of Mass Spectrometry Data (Zenodo, 2021); https://doi.org/10.5281/zenodo.5076575


Acknowledgements

This work used the National Science Foundation (NSF) XSEDE supercomputers through allocations TG-CCR150017 and TG-ASC200004 (F.S.). This research was supported by the NIGMS of the National Institutes of Health (NIH) under award number: R01GM134384 (F.S.). The authors were further supported by the NSF under award number: NSF CAREER OAC-1925960 (F.S.). The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH and/or the NSF.

Author information

Contributions

M.H. and F.S. designed the parallel computational framework. M.H. implemented the software. M.H. and F.S. designed and performed the experiments, performed calculations, analyzed the data and results, and wrote the manuscript.

Corresponding authors

Correspondence to Muhammad Haseeb or Fahad Saeed.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Reviewer recognition statement: Nature Computational Science thanks Robert Bjornson, Benjamin Neely, Yasset Perez-Riverol and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Editor recognition statement: Handling editor Ananya Rastogi, in collaboration with the Nature Computational Science team.

Supplementary information

Supplementary Information

Supplementary Figs. 1–8, Sections 1–9 and Algorithms 1–5.

Source data

Source Data Fig. 2

HiCOPS hyperscores and expectscores across serial and parallel runs and common peptide identifications for MSFragger and HiCOPS.

Source Data Fig. 3

Runtime profiles for several tools for speed comparison and other insights.

Source Data Fig. 4

Raw code instrumentation results for performance evaluation.

Source Data Fig. 5

Raw code instrumentation results for overhead evaluation.

Cite this article

Haseeb, M., Saeed, F. High performance computing framework for tera-scale database search of mass spectrometry data. Nat Comput Sci 1, 550–561 (2021). https://doi.org/10.1038/s43588-021-00113-z
