Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine

Chi, Hao; Liu, Chao; Yang, Hao; Zeng, Wen-Feng; Wu, Long; Zhou, Wen-Jing; Wang, Rui-Min; Niu, Xiu-Nan; Ding, Yue-He; Zhang, Yao; Wang, Zhao-Wei; Chen, Zhen-Lin; Sun, Rui-Xiang; Liu, Tao; Tan, Guang-Ming; Dong, Meng-Qiu; Xu, Ping; Zhang, Pei-Heng; He, Si-Min

doi:10.1038/nbt.4236

Brief Communication
Published: 08 October 2018

Hao Chi^1,2,na1 (ORCID: orcid.org/0000-0002-7898-2599),
Chao Liu^1,2,na1,
Hao Yang^1,2,na1,
Wen-Feng Zeng^1,2,
Long Wu^1,2,
Wen-Jing Zhou^1,2,
Rui-Min Wang^1,2,
Xiu-Nan Niu^1,2,
Yue-He Ding^3,
Yao Zhang^4,5,
Zhao-Wei Wang^1,2,
Zhen-Lin Chen^1,2,
Rui-Xiang Sun^1,2,
Tao Liu^1,
Guang-Ming Tan^1,
Meng-Qiu Dong^3 (ORCID: orcid.org/0000-0002-6094-1182),
Ping Xu^4,
Pei-Heng Zhang^1 &
Si-Min He^1,2

Nature Biotechnology volume 36, pages 1059–1061 (2018)

Abstract

We present a sequence-tag-based search engine, Open-pFind, to identify peptides in an ultra-large search space that includes coeluting peptides, unexpected modifications and digestions. Our method detects peptides with higher precision and speed than seven other search engines. Open-pFind identified 70–85% of the tandem mass spectra in four large-scale datasets and 14,064 proteins, each supported by at least two protein-unique peptides, in a human proteome dataset.

**Figure 1: Open-pFind outperforms other search engines on a metabolically labeled dataset.**

**Figure 2: Performance evaluation of Open-pFind and seven other search engines with diverse types of datasets.**

Acknowledgements

This work was supported by grants from the National Key Research and Development Program of China (No. 2016YFA0501300 to S.-M.H., 2017YFA0505100 and 2017YFC0906600 to P.X. and 2012CB316502 to P.-H.Z.), the Youth Innovation Promotion Association CAS (No. 2014091 to H.C.), the National Natural Science Foundation of China (31470805 to H.C., 31670834 to P.X., 31700727 to C.L., and 21475141 to S.-M.H.), the CAS Interdisciplinary Innovation Team (Y604061000 to S.-M.H.), the International Collaboration Program (2014DFB30020 to P.X.), and the Beijing Training Project for The Leading Talents in S&T (Z161100004916024 to P.X.).

Author information

Hao Chi, Chao Liu and Hao Yang: These authors contributed equally to this work.

Authors and Affiliations

Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, China
Hao Chi, Chao Liu, Hao Yang, Wen-Feng Zeng, Long Wu, Wen-Jing Zhou, Rui-Min Wang, Xiu-Nan Niu, Zhao-Wei Wang, Zhen-Lin Chen, Rui-Xiang Sun, Tao Liu, Guang-Ming Tan, Pei-Heng Zhang & Si-Min He
University of Chinese Academy of Sciences, Beijing, China
Hao Chi, Chao Liu, Hao Yang, Wen-Feng Zeng, Long Wu, Wen-Jing Zhou, Rui-Min Wang, Xiu-Nan Niu, Zhao-Wei Wang, Zhen-Lin Chen, Rui-Xiang Sun & Si-Min He
National Institute of Biological Sciences, Beijing, China
Yue-He Ding & Meng-Qiu Dong
State Key Laboratory of Proteomics, Beijing Proteome Research Center, National Center for Protein Sciences (Beijing), Beijing Institute of Lifeomics, Beijing, China
Yao Zhang & Ping Xu
State Key Laboratory of Biocontrol and Guangdong Provincial Key Laboratory of Plant Resources, College of Ecology and Evolution, Sun Yat-Sen University, Guangzhou, China
Yao Zhang

Contributions

H.C. developed the kernel algorithm and command-line tool of Open-pFind, analyzed the data and wrote the manuscript. C.L. developed the validation method using metabolically labeled datasets. H.Y. developed the post-processing tool pBuild and helped with data analysis. W.-F.Z. helped to develop the machine learning module. L.W. developed the pre-processing tool pParse. Y.-H.D. and M.-Q.D. provided the Dong-Ecoli-QE dataset. Y.Z. and P.X. provided the Xu-Yeast-QEHF dataset. W.-J.Z., R.-M.W., X.-N.N., Z.-W.W., Z.-L.C. and R.-X.S. helped with the development of the interface and with data analysis. T.L., G.-M.T. and P.-H.Z. helped with the performance test on the workstation. S.-M.H. coordinated the study. All of the authors helped to revise the manuscript.

Corresponding authors

Correspondence to Hao Chi or Si-Min He.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 The workflow of Open-pFind.

The MS data are first preprocessed by pParse, and then the MS/MS data are searched by the open search module. Next, the MS/MS data are re-searched by the restricted search module against a refined search space based on the learned information in the reranking step. Finally, the results obtained from both the open and restricted searches are merged, reranked again and reported.

Supplementary Figure 2 An overview of the Open-pFind results for the Dong-Ecoli-QE dataset.

This figure is a screenshot of the embedded software tool pBuild, which was designed to graphically represent pFind Studio results. a) Distributions of mass deviations (in ppm) of the identified PSMs in the five raw files. The number of PSMs for each raw file (N) is 16,007 (10mM), 13,418 (30mM), 16,931 (60mM), 18,575 (150mM3) and 24,523 (1000mM), respectively. Box-plot elements: center line, median; box limits, first and third quartiles (Q1 and Q3); whiskers, from Q1 − 1.5 × IQR to Q3 + 1.5 × IQR. b) Score distributions of the target (blue) and decoy (red) PSMs. c) Distribution of the enzymatic specificity of identified peptides. d) Percentages of the highly abundant modifications in all identified peptides. e) Distribution of the number of missed cleavage sites. f) Distribution of the number of peptides identified from one tandem mass spectrum, e.g., two peptides can be identified from 12.65% of the total spectra.
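The box-plot convention stated in panel a) can be made concrete with a short calculation. The sketch below is illustrative only: the function name `boxplot_elements`, the sample values, and the choice of the inclusive quartile method are assumptions, since the caption does not say how pBuild computes quartiles.

```python
import statistics


def boxplot_elements(values):
    """Return the median, quartiles and 1.5 x IQR whisker limits of a sample."""
    # Assumption: quartiles use the "inclusive" method; other tools may differ.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_whisker": q1 - 1.5 * iqr,
        "upper_whisker": q3 + 1.5 * iqr,
    }


# Hypothetical mass deviations in ppm (not taken from the dataset).
ppm_deviations = [-2.0, -1.0, 0.0, 1.0, 2.0]
elements = boxplot_elements(ppm_deviations)
print(elements)
```

For this toy sample Q1 is -1.0 and Q3 is 1.0, so the IQR is 2.0 and the whisker limits fall at -4.0 and 4.0.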

Supplementary Figure 3 Precision evaluation of search engines using the Dong-Ecoli-QE dataset.

Metabolically labeled datasets are searched against the protein database, with only the unlabeled peptides considered, and pQuant is then used independently to check the quantitation ratio of each PSM.

Supplementary Figure 4 Percentages of NaN-ratio PSMs identified in the Dong-Ecoli-QE dataset (I).

For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis.

a) Consistently identified PSMs from the comparison of every two search engines are considered.

b) Separately identified PSMs from the comparison of every two search engines are considered. The ¹⁵N- and unlabeled peptides are used to calculate the quantitation values in both a) and b).

c) Similar to a), but the ¹³C- and unlabeled peptides are used to calculate the quantitation values.

d) Similar to b), but the ¹³C- and unlabeled peptides are used to calculate the quantitation values.
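One way to read panels a) to d) is as set operations on the PSM lists of two engines. The sketch below assumes that a PSM is keyed by a (spectrum, peptide) pair, that "consistently identified" means both engines report the same pair, and that a NaN ratio is stored as a floating-point NaN. All names and the toy data are hypothetical and are not taken from pQuant or from any engine's output format.

```python
import math


def nan_ratio_percentage(psms, ratios):
    """Percentage of the given PSMs whose quantitation ratio is NaN."""
    if not psms:
        return 0.0
    n_nan = sum(1 for psm in psms if math.isnan(ratios[psm]))
    return 100.0 * n_nan / len(psms)


def compare_two_engines(psms_a, psms_b, ratios):
    """Split PSMs into consistent and separate sets and score each set."""
    consistent = psms_a & psms_b   # reported by both engines
    only_a = psms_a - psms_b       # reported by engine A only
    only_b = psms_b - psms_a       # reported by engine B only
    return {
        "consistent": nan_ratio_percentage(consistent, ratios),
        "only_a": nan_ratio_percentage(only_a, ratios),
        "only_b": nan_ratio_percentage(only_b, ratios),
    }


nan = float("nan")
engine_a = {("s1", "PEPTIDEK"), ("s2", "AAAK"), ("s3", "CCCR")}
engine_b = {("s1", "PEPTIDEK"), ("s2", "AAAK"), ("s4", "DDDK")}
ratios = {
    ("s1", "PEPTIDEK"): 1.0,
    ("s2", "AAAK"): nan,
    ("s3", "CCCR"): nan,
    ("s4", "DDDK"): 0.9,
}
summary = compare_two_engines(engine_a, engine_b, ratios)
print(summary)
```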

Supplementary Figure 5 Percentages of NaN-ratio PSMs identified in the Xu-Yeast-QEHF dataset.

For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis.

a) Consistently identified PSMs from the comparison of every two search engines are considered.

b) Separately identified PSMs from the comparison of every two search engines are considered.

Supplementary Figure 6 Percentages of NaN-ratio PSMs identified in the Dong-Ecoli-QE dataset (II).

All PSMs (both modified and unmodified ones) were considered in the analysis.

a) Consistently identified PSMs from the comparison of every two search engines are considered.

b) Separately identified PSMs from the comparison of every two search engines are considered. The ¹⁵N- and unlabeled peptides are used to calculate the quantitation values in both a) and b).

c) Similar to a), but the ¹³C- and unlabeled peptides are used to calculate the quantitation values.

d) Similar to b), but the ¹³C- and unlabeled peptides are used to calculate the quantitation values.

Supplementary Figure 7 Analysis of NaN-ratio proteins in the Dong-Ecoli-QE dataset.

Open-pFind is compared with each of the other seven engines.

a) The number of proteins that are consistently identified by two engines (&), separately identified by Open-pFind (+) or the other engine (−), and the percentages of NaN-ratio proteins in those three sets (&, + and −, respectively). The quantitation value of a protein is the NaN ratio if no supporting protein-unique peptides are assigned with normal quantitation values.

b) Similar to a), but proteins supported with at least two protein-unique peptides are considered. The quantitation value of a protein is the NaN ratio if the number of the supporting protein-unique peptides with normal quantitation values is less than two.

¹⁵N-labeled peptides were used to calculate the quantitation values in both a) and b).

c) Similar to a), but ¹³C-labeled peptides were used for calculating the quantitation values.

d) Similar to b), but ¹³C-labeled peptides were used for calculating the quantitation values.
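The two protein-level rules in panels a) and b) differ only in how many quantified protein-unique peptides are required. A minimal sketch follows, assuming that a "normal quantitation value" is any ratio that is not NaN; the function name and the threshold parameter `min_quantified` are hypothetical.

```python
import math


def is_nan_ratio_protein(unique_peptide_ratios, min_quantified=1):
    """Apply the caption's rule to the ratios of a protein's unique peptides.

    min_quantified=1 mirrors panel a): the protein is NaN-ratio if no
    protein-unique peptide has a normal value.
    min_quantified=2 mirrors panel b): the protein is NaN-ratio if fewer
    than two protein-unique peptides have normal values.
    """
    quantified = sum(1 for r in unique_peptide_ratios if not math.isnan(r))
    return quantified < min_quantified


# Hypothetical protein with one unquantified and one quantified unique peptide.
example = [float("nan"), 1.02]
print(is_nan_ratio_protein(example, min_quantified=1))  # False
print(is_nan_ratio_protein(example, min_quantified=2))  # True
```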

Supplementary Figure 8 Performance evaluation of the four open search engines using the entrapment strategy for the four published datasets.

The datasets were chosen as shown in Supplementary Table 1. Each database search process used the target database (human or mouse) appended by the entrapment database (Arabidopsis thaliana). All databases were downloaded from UniProt (http://www.uniprot.org), and only the reviewed proteins were used. The other database search parameters were the same as those shown in Supplementary Table 3.

In each subfigure, the x-axis denotes the number of total PSMs reported by the open search engines, and the y-axis denotes the fraction of the entrapment PSMs in the total PSMs, e.g., for the Mann-Human-Velos dataset, Open-pFind reported 23,876 PSMs at 1% FDR at the peptide level, in which 0.25% of the spectra were matched with peptides from the entrapment database.
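The y-axis quantity described above is a simple proportion. The sketch below shows the calculation on toy data and then checks the arithmetic of the caption's own example; the function name and the string labels for PSM origin are assumptions made for illustration.

```python
def entrapment_fraction(psm_sources, entrapment_label="entrapment"):
    """Fraction of reported PSMs matched to the entrapment database."""
    if not psm_sources:
        return 0.0
    hits = sum(1 for source in psm_sources if source == entrapment_label)
    return hits / len(psm_sources)


# Toy input: 399 target PSMs and 1 entrapment PSM give 1/400.
toy = ["target"] * 399 + ["entrapment"]
print(f"{entrapment_fraction(toy):.2%}")  # 0.25%

# Caption example: 0.25% of 23,876 PSMs corresponds to about 60 PSMs.
implied_entrapment_psms = round(23876 * 0.0025)
print(implied_entrapment_psms)  # 60
```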

Supplementary Figure 9 The distribution of the results reported by Open-pFind that were not included in the traditional search space.

For example, among all distinct peptides identified by Open-pFind for the Mann-Human-Velos dataset, semi-tryptic/non-specific peptides, peptides with unexpected modifications (including mutations), and peptides identified with additional precursors exported by pParse accounted for 4.3%, 17.1% and 27.7%, respectively; in total, 42.9% of the peptides fall outside the restricted search space.

Supplementary Figure 10 The search spaces for different search modes.

The four modes include the restricted search with no modifications (No mod), with common modifications (Common, with carbamidomethylation of C, oxidation of M, Gln→pyro-Glu at N-termini of peptides and acetylation at N-termini of proteins), and with phosphorylation (Phospho, with common modifications and phosphorylation of S, T, and Y), and the open search (Unimod, with at most one modification in Unimod), together with three different types of digestion. For each mode, 1,000 experiments were performed, each with 1,000 randomly selected proteins from the UniProt database, which were digested in silico into peptides. The distribution of the number of peptides from each experiment is then shown in one box-plot (N = 1,000). Box-plot elements: center line, median; box limits, first and third quartiles (Q1 and Q3); whiskers, from Q1 − 1.5 × IQR to Q3 + 1.5 × IQR.
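To give a feel for why the digestion type changes the size of the search space, here is a minimal in silico digestion sketch. It is not the procedure used for the figure: the cleavage rule (cut after K or R, with no proline exception), the length limits, the missed-cleavage limit and all function names are assumptions, and modifications are ignored entirely.

```python
def cleavage_sites(protein):
    """Cut positions: both protein termini plus every position after K or R."""
    sites = [0]
    sites += [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]
    sites.append(len(protein))
    return sites


def digest(protein, mode="specific", max_missed=2, min_len=6, max_len=30):
    """Enumerate distinct peptides for one digestion type.

    mode: "specific" (both ends at cut sites), "semi" (at least one end at a
    cut site) or "nonspecific" (any substring within the length limits).
    """
    sites = cleavage_sites(protein)
    site_set = set(sites)
    peptides = set()
    n = len(protein)
    for start in range(n):
        for end in range(start + min_len, min(n, start + max_len) + 1):
            if mode != "nonspecific":
                start_ok, end_ok = start in site_set, end in site_set
                if mode == "specific" and not (start_ok and end_ok):
                    continue
                if mode == "semi" and not (start_ok or end_ok):
                    continue
                # Missed cleavages are cut sites strictly inside the peptide.
                missed = sum(1 for s in sites if start < s < end)
                if missed > max_missed:
                    continue
            peptides.add(protein[start:end])
    return peptides


toy_protein = "AAKCCRDD"  # hypothetical sequence
for mode in ("specific", "semi", "nonspecific"):
    print(mode, len(digest(toy_protein, mode, min_len=1)))
```

Every fully specific peptide is also semi-specific, and every semi-specific peptide is a substring, so the three counts can only grow from one mode to the next.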

Supplementary Figure 11 Relationship between the protein database reduction and the running time of Open-pFind.

#PR denotes the number of proteins in the reduced protein database after open search, i.e., the first step of Open-pFind. #PO denotes the number of proteins in the original protein database specified before database searches. The red curve denotes the ratio of the running time of Open-pFind to the running time of pFind in the analyses of the six datasets. Each bar denotes the mean and each error bar (blue whisker) denotes the standard deviation (SD) measured among different datasets (N = 5, 22, 3, 24, 4, 24 for the number of raw files in the six datasets shown from left to right, respectively). Each grey dot denotes a data point in the corresponding dataset. For example, in the analysis of the Xu-Yeast-QEHF dataset, ~97% of the proteins in the original database were removed after the open search step on average; as a result, the speed of Open-pFind was approximately twice that of pFind.
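The bar heights and error bars described above amount to a mean and standard deviation of #PR/#PO over the raw files of a dataset. The sketch below uses hypothetical protein counts and assumes the sample standard deviation; the caption does not state which SD estimator was used.

```python
import statistics


def reduction_ratios(reduced_counts, original_count):
    """#PR / #PO for each raw file of one dataset."""
    return [pr / original_count for pr in reduced_counts]


# Hypothetical counts: four raw files searched against a 5,000-protein database.
ratios = reduction_ratios([120, 150, 180, 150], original_count=5000)
mean_ratio = statistics.mean(ratios)
sd_ratio = statistics.stdev(ratios)  # assumption: sample SD
print(f"mean #PR/#PO = {mean_ratio:.3f}, SD = {sd_ratio:.4f}")
print(f"proteins removed on average: {1 - mean_ratio:.0%}")
```

With these toy numbers the mean ratio is 0.03, that is, about 97% of the database is removed, which is the same order as the Xu-Yeast-QEHF example in the caption.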

Supplementary Figure 12 The comparison between the picked TDA approach and the common two-peptide TDA approach for protein inference with the Kim data.

The picked strategy reported 16,133 proteins, more than the 14,064 reported under the two-peptide rule, but it also detected seven additional olfactory receptors.

Supplementary Figure 13 An example of a mixed spectrum with three co-eluting peptides.

The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.dta, and three co-eluting peptides were identified from it. a) MS1 information for the three peptides in the isolation window around this tandem mass spectrum. b) The match between the spectrum and the three resulting peptides, taken from a screenshot of pBuild.

The three isotopic clusters are clearly distinguished from each other, and the observed fragmentation sites are near-complete for all three peptides.

Supplementary Figure 14 Similarity between the actual and predicted fragment ions for each of the three co-eluting peptides identified from one spectrum.

The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.2.dta. The predicted spectra are from the pDeep algorithm. Similarity was measured with the Pearson correlation coefficient (PCC): a) TLAEGQNVEFEIQDGQK, b) SEYLGDPDFVK and c) VSLAADPVEEIK. The number of singly charged b and y ions (N) for which the correlations were determined is 32, 20 and 22 for a)‒c), respectively.
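The similarity measure and the ion counts in this caption can be reproduced with a few lines. The sketch below computes a Pearson correlation on hypothetical intensity vectors (the real observed and pDeep-predicted intensities are not given here) and counts singly charged b and y ions as 2 × (L − 1) for a peptide of length L, which agrees with the N values quoted for the three peptides.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def n_singly_charged_by_ions(peptide):
    """A peptide of length L has L - 1 b ions and L - 1 y ions."""
    return 2 * (len(peptide) - 1)


peptides = ("TLAEGQNVEFEIQDGQK", "SEYLGDPDFVK", "VSLAADPVEEIK")
ion_counts = [n_singly_charged_by_ions(p) for p in peptides]
print(ion_counts)  # [32, 20, 22]

# Hypothetical relative intensities for four fragment ions.
observed = [0.10, 0.80, 0.30, 0.55]
predicted = [0.12, 0.75, 0.35, 0.50]
print(round(pearson(observed, predicted), 3))
```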

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 1660 kb)

Life Sciences Reporting Summary (PDF 132 kb)

Supplementary Tables and Notes

Supplementary Tables 1–10, Supplementary Notes 1–5 (PDF 4108 kb)

Supplementary Data 1

The number of identified PSMs, distinct peptides and distinct peptide sequences for the four published datasets (XLSX 14 kb)

Supplementary Data 2

The consistency of the identification results (XLSX 16 kb)

Supplementary Data 3

Detailed information for highly abundant modifications of cysteine residues in Kim data (XLSX 19 kb)

Supplementary Data 4

Detailed information for 694 semi-tryptic peptides verified by UniProt in Kim data (XLSX 72 kb)

Supplementary Code

Supplementary Code (EXE 21153 kb)

About this article

Cite this article

Chi, H., Liu, C., Yang, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat Biotechnol 36, 1059–1061 (2018). https://doi.org/10.1038/nbt.4236

Received: 03 December 2017
Accepted: 03 August 2018
Published: 08 October 2018
Issue Date: November 2018
DOI: https://doi.org/10.1038/nbt.4236
