
  • Brief Communication

Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine

Abstract

We present a sequence-tag-based search engine, Open-pFind, to identify peptides in an ultra-large search space that includes coeluting peptides, unexpected modifications and digestions. Our method detects peptides with higher precision and speed than seven other search engines. Open-pFind identified 70–85% of the tandem mass spectra in four large-scale datasets and 14,064 proteins, each supported by at least two protein-unique peptides, in a human proteome dataset.


Figure 1: Open-pFind outperforms other search engines on a metabolically labeled dataset.
Figure 2: Performance evaluation of Open-pFind and seven other search engines with diverse types of datasets.



Acknowledgements

This work was supported by grants from the National Key Research and Development Program of China (No. 2016YFA0501300 to S.-M.H., 2017YFA0505100 and 2017YFC0906600 to P.X. and 2012CB316502 to P.-H.Z.), the Youth Innovation Promotion Association CAS (No. 2014091 to H.C.), the National Natural Science Foundation of China (31470805 to H.C., 31670834 to P.X., 31700727 to C.L., and 21475141 to S.-M.H.), the CAS Interdisciplinary Innovation Team (Y604061000 to S.-M.H.), the International Collaboration Program (2014DFB30020 to P.X.), and the Beijing Training Project for The Leading Talents in S&T (Z161100004916024 to P.X.).

Author information

Contributions

H.C. developed the kernel algorithm and command-line tool of Open-pFind, analyzed the data and wrote the manuscript. C.L. developed the validation method using metabolically labeled datasets. H.Y. developed the post-processing tool pBuild and helped with data analysis. W.-F.Z. helped to develop the machine learning module. L.W. developed the pre-processing tool pParse. Y.-H.D. and M.-Q.D. provided the Dong-Ecoli-QE dataset. Y.Z. and P.X. provided the Xu-Yeast-QEHF dataset. W.-J.Z., R.-M.W., X.-N.N., Z.-W.W., Z.-L.C. and R.-X.S. helped with development of the interface and with data analysis. T.L., G.-M.T. and P.-H.Z. helped with the performance test on the workstation. S.-M.H. coordinated the study. All of the authors helped to revise the manuscript.

Corresponding authors

Correspondence to Hao Chi or Si-Min He.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 The workflow of Open-pFind.

The MS data are first preprocessed by pParse, and then the MS/MS data are searched by the open search module. Next, the MS/MS data are re-searched by the restricted search module against a refined search space based on the learned information in the reranking step. Finally, the results obtained from both the open and restricted searches are merged, reranked again and reported.

Supplementary Figure 2 An overview of the Open-pFind results for the Dong-Ecoli-QE dataset.

This figure is a screenshot from the embedded software tool pBuild, which was designed to graphically represent pFind Studio results. a) Distributions of mass deviations (in ppm) of the identified PSMs in the five raw files. The number of PSMs for each raw file (N) is 16,007 (10mM), 13,418 (30mM), 16,931 (60mM), 18,575 (150mM3) and 24,523 (1000mM), respectively. Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1–1.5 × IQR to Q3+1.5 × IQR. b) Score distributions of the target (blue) and decoy (red) PSMs. c) Distribution of the enzymatic specificity of identified peptides. d) Percentages of the highly abundant modifications in all identified peptides. e) Distribution of the number of missed cleavage sites. f) Distribution of the number of peptides identified from one tandem mass spectrum, e.g., two peptides can be identified from 12.65% of the total spectra.
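The box-plot elements named in this legend can be computed in a few lines of Python. This is a minimal sketch, not the pBuild implementation: the function name `boxplot_elements`, the choice of the "inclusive" quartile method, and the input values are all assumptions made for illustration.

```python
import statistics

def boxplot_elements(values):
    """Return the median, quartiles and whisker limits of a sample.

    Whisker limits follow the legend: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR.
    The quartile method ("inclusive") is an assumption; plotting tools differ.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "median": median,
        "q1": q1,
        "q3": q3,
        "lower_whisker": q1 - 1.5 * iqr,
        "upper_whisker": q3 + 1.5 * iqr,
    }

# Hypothetical mass deviations in ppm (not taken from the dataset).
example = boxplot_elements([1, 2, 3, 4, 5, 6, 7, 8, 9])
```

Many plotting libraries additionally clip each whisker to the most extreme data point inside these limits; the sketch returns the limits themselves.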

Supplementary Figure 3 Precision evaluation of search engines using the Dong-Ecoli-QE dataset.

Metabolically labeled datasets are searched against the protein database, considering only the unlabeled peptides, and pQuant is then used independently to check the quantitation ratio of each PSM.

Supplementary Figure 4 Percentages of NaN-ratio PSMs identified in the Dong-Ecoli-QE dataset (I).

For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis.

a) Consistently identified PSMs from the comparison of every two search engines are considered.

b) Separately identified PSMs from the comparison of every two search engines are considered. The 15N- and unlabeled peptides are used to calculate the quantitation values in both a) and b).

c) Similar to a), but the 13C- and unlabeled peptides are used to calculate the quantitation values.

d) Similar to b), but the 13C- and unlabeled peptides are used to calculate the quantitation values.
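The pairwise comparison in panels a) to d) can be sketched with set operations. The sketch assumes that a PSM is keyed by a (spectrum, peptide) pair, that "consistently identified" means both engines report the same pair, and that a failed quantitation is stored as a floating-point NaN; the function names and the toy data are hypothetical.

```python
import math

def nan_ratio_percent(psms, ratios):
    """Percentage of PSMs whose quantitation ratio is NaN."""
    if not psms:
        return 0.0
    n_nan = sum(1 for psm in psms if math.isnan(ratios[psm]))
    return 100.0 * n_nan / len(psms)

def compare_engines(psms_a, psms_b, ratios):
    """Split the PSMs of two engines into consistent and separate sets."""
    consistent = psms_a & psms_b   # reported by both engines
    only_a = psms_a - psms_b       # reported by engine A only
    only_b = psms_b - psms_a       # reported by engine B only
    return {
        "consistent": nan_ratio_percent(consistent, ratios),
        "only_a": nan_ratio_percent(only_a, ratios),
        "only_b": nan_ratio_percent(only_b, ratios),
    }

# Toy data: (spectrum id, peptide) pairs and their quantitation ratios.
nan = float("nan")
ratios = {("s1", "PEPA"): 1.0, ("s2", "PEPB"): nan,
          ("s3", "PEPC"): 0.9, ("s2", "PEPX"): nan}
engine_a = {("s1", "PEPA"), ("s2", "PEPB"), ("s3", "PEPC")}
engine_b = {("s1", "PEPA"), ("s2", "PEPX")}
summary = compare_engines(engine_a, engine_b, ratios)
```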

Supplementary Figure 5 Percentages of NaN-ratio PSMs identified in the Xu-Yeast-QEHF dataset.

For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis.

a) Consistently identified PSMs from the comparison of every two search engines are considered.

b) Separately identified PSMs from the comparison of every two search engines are considered.

Supplementary Figure 6 Percentages of NaN-ratio PSMs identified in the Dong-Ecoli-QE dataset (II).

All PSMs (both modified and unmodified ones) were considered in the analysis.

a) Consistently identified PSMs from the comparison of every two search engines are considered.

b) Separately identified PSMs from the comparison of every two search engines are considered. The 15N- and unlabeled peptides are used to calculate the quantitation values in both a) and b).

c) Similar to a), but the 13C- and unlabeled peptides are used to calculate the quantitation values.

d) Similar to b), but the 13C- and unlabeled peptides are used to calculate the quantitation values.

Supplementary Figure 7 Analysis of NaN-ratio proteins in the Dong-Ecoli-QE dataset.

Open-pFind is compared with each of the other seven engines.

a) The number of proteins that are consistently identified by two engines (&), separately identified by Open-pFind (+) or the other engine (−), and the percentages of NaN-ratio proteins in those three sets (&, + and −, respectively). The quantitation value of a protein is the NaN ratio if no supporting protein-unique peptides are assigned with normal quantitation values.

b) Similar to a), but proteins supported with at least two protein-unique peptides are considered. The quantitation value of a protein is the NaN ratio if the number of the supporting protein-unique peptides with normal quantitation values is less than two.

15N-labeled peptides were used to calculate the quantitation values in both a) and b).

c) Similar to a), but 13C-labeled peptides were used for calculating the quantitation values.

d) Similar to b), but 13C-labeled peptides were used for calculating the quantitation values.
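The two protein-level rules in panels a) and b) differ only in a threshold, which the following sketch makes explicit. It assumes that a "normal" quantitation value is any non-NaN number; the function name and the parameter `min_quantified` are hypothetical.

```python
import math

def is_nan_ratio_protein(unique_peptide_ratios, min_quantified=1):
    """Decide whether a protein is assigned the NaN ratio.

    unique_peptide_ratios: quantitation ratios of the protein-unique
    peptides supporting the protein (NaN marks a failed quantitation).
    min_quantified=1 corresponds to the rule in panel a),
    min_quantified=2 to the rule in panel b).
    """
    quantified = sum(1 for r in unique_peptide_ratios if not math.isnan(r))
    return quantified < min_quantified

nan = float("nan")
panel_a = is_nan_ratio_protein([1.0, nan], min_quantified=1)  # one normal value
panel_b = is_nan_ratio_protein([1.0, nan], min_quantified=2)  # fewer than two
```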

Supplementary Figure 8 Performance evaluation of the four open search engines using the entrapment strategy for the four published datasets.

The datasets were chosen as shown in Supplementary Table 1. Each database search process used the target database (human or mouse) appended by the entrapment database (Arabidopsis thaliana). All databases were downloaded from UniProt (http://www.uniprot.org), and only the reviewed proteins were used. The other database search parameters were the same as those shown in Supplementary Table 3.

In each subfigure, the x-axis denotes the number of total PSMs reported by the open search engines, and the y-axis denotes the fraction of the entrapment PSMs in the total PSMs, e.g., for the Mann-Human-Velos dataset, Open-pFind reported 23,876 PSMs at 1% FDR at the peptide level, in which 0.25% of the spectra were matched with peptides from the entrapment database.
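The y-axis quantity described above is a simple proportion. A minimal sketch, assuming each reported PSM has already been labeled as coming from the target or the entrapment database (the labels and the function name are hypothetical):

```python
def entrapment_fraction(psm_sources):
    """Fraction of reported PSMs matched to the entrapment database.

    psm_sources: one label per reported PSM, "target" or "entrapment".
    """
    labels = list(psm_sources)
    if not labels:
        return 0.0
    n_entrapment = sum(1 for label in labels if label == "entrapment")
    return n_entrapment / len(labels)

# Toy example: 1 entrapment PSM among 400 reported PSMs.
fraction = entrapment_fraction(["target"] * 399 + ["entrapment"])
```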

Supplementary Figure 9 The distribution of the results reported by Open-pFind that were not included in the traditional search space.

For example, among all distinct peptides identified by Open-pFind for the Mann-Human-Velos dataset, the semi-tryptic/non-specific peptides, peptides with unexpected modifications (including mutations) and peptides identified with additional precursors exported by pParse accounted for 4.3%, 17.1% and 27.7%, respectively, and a total of 42.9% of peptides fell outside the restricted search space.

Supplementary Figure 10 The search spaces for different search modes.

The four modes include the restricted search with no modifications (No mod), with common modifications (Common, with carbamidomethylation of C, oxidation of M, Gln→pyro-Glu at N-termini of peptides and acetylation at N-termini of proteins), and with phosphorylation (Phospho, with common modifications and phosphorylation of S, T, and Y), and the open search (Unimod, with at most one modification in Unimod), together with three different types of digestion. For each mode, 1,000 experiments were performed, each with 1,000 randomly selected proteins from the UniProt database, which were digested in silico into peptides. The distribution of the number of peptides from each experiment is then shown in one box-plot (N = 1,000). Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1–1.5 × IQR to Q3+1.5 × IQR.
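To illustrate why the digestion type changes the size of the search space, the sketch below counts peptides from one sequence under fully specific and non-specific digestion. It is a simplified illustration, not the procedure used for the figure: it assumes trypsin cleaves after every K or R (no proline rule), ignores modifications, and uses hypothetical function names, length limits and a made-up sequence.

```python
import re

def cleavage_pieces(protein):
    """Split a sequence after every K or R (simplified trypsin rule)."""
    return re.findall(r"[^KR]*[KR]|[^KR]+$", protein)

def specific_peptides(protein, max_missed=2, min_len=6, max_len=30):
    """Fully specific peptides with up to max_missed missed cleavages."""
    pieces = cleavage_pieces(protein)
    peptides = set()
    for start in range(len(pieces)):
        last = min(start + max_missed, len(pieces) - 1)
        for end in range(start, last + 1):
            peptide = "".join(pieces[start:end + 1])
            if min_len <= len(peptide) <= max_len:
                peptides.add(peptide)
    return peptides

def nonspecific_peptides(protein, min_len=6, max_len=30):
    """Every substring within the length limits (no enzyme rule)."""
    peptides = set()
    for start in range(len(protein)):
        for length in range(min_len, max_len + 1):
            if start + length > len(protein):
                break
            peptides.add(protein[start:start + length])
    return peptides

# Made-up sequence; short length limits keep the toy example readable.
toy = "MKTAYIAKQRQISFVK"
specific = specific_peptides(toy, max_missed=1, min_len=2)
nonspecific = nonspecific_peptides(toy, min_len=2)
```

Even for this 16-residue toy sequence the non-specific set is more than an order of magnitude larger than the specific one, before any modifications are considered.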

Supplementary Figure 11 Relationship between the protein database reduction and the running time of Open-pFind.

#PR denotes the number of proteins in the reduced protein database after open search, i.e., the first step of Open-pFind. #PO denotes the number of proteins in the original protein database specified before database searches. The red curve denotes the ratio of the running time of Open-pFind to the running time of pFind in the analyses of the six datasets. Each bar denotes the mean and each error bar (blue whisker) denotes the standard deviation (SD) measured among different datasets (N = 5, 22, 3, 24, 4, 24 for the number of raw files in the six datasets shown from left to right, respectively). Each grey dot denotes a data point in the corresponding dataset. For example, in the analysis of the Xu-Yeast-QEHF dataset, ~97% of the proteins in the original database were removed after the open search step on average; as a result, the speed of Open-pFind was approximately twice that of pFind.
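The quantities plotted in this figure reduce to simple arithmetic on #PR and #PO. A minimal sketch with made-up numbers follows; the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the legend does not say which SD estimator was used.

```python
import statistics

def fraction_removed(n_reduced, n_original):
    """Fraction of proteins removed after the open search: 1 - #PR / #PO."""
    if n_original <= 0:
        raise ValueError("n_original must be positive")
    return 1.0 - n_reduced / n_original

def mean_and_sd(values):
    """Mean and sample standard deviation across raw files."""
    return statistics.mean(values), statistics.stdev(values)

# Made-up example: 300 of 10,000 proteins remain after the open search.
removed = fraction_removed(300, 10_000)
# Made-up per-file fractions for a dataset with three raw files.
mean, sd = mean_and_sd([0.96, 0.97, 0.98])
```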

Supplementary Figure 12 The comparison between the picked TDA approach and the common two-peptide TDA approach for protein inference with the Kim data.

The number of proteins reported by the picked strategy was 16,133, which was larger than the number that was reported using the two-peptide rule (14,064), but seven additional olfactory receptors were detected with the picked strategy.

Supplementary Figure 13 An example of a mixed spectrum with three co-eluting peptides.

The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.dta, in which three co-eluting peptides were identified. a) MS1 information for the three peptides in the isolation window around this tandem mass spectrum. b) The match between the spectrum and the three resulting peptides, taken from a screenshot of pBuild.

Three isotopic clusters are clearly distinguished from each other, while the observed fragmentation sites were near-complete for all three peptides.

Supplementary Figure 14 Similarity between the actual and predicted fragment ions for each of the three co-eluting peptides identified from one spectrum.

The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.2.dta. The predicted spectra are from the pDeep algorithm. The similarity was measured with Pearson's correlation coefficient (PCC): a) TLAEGQNVEFEIQDGQK, b) SEYLGDPDFVK and c) VSLAADPVEEIK. The number of singly charged b and y ions (N), for which the correlations were determined, is 32, 20 and 22 for a)‒c), respectively.
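Pearson's correlation coefficient between observed and predicted fragment-ion intensities can be computed as below. This is a generic sketch of the statistic, not code from pDeep or pBuild; the function name and the intensity vectors are made up, and each vector is assumed to hold one intensity per singly charged b or y ion in a fixed order.

```python
import math

def pearson(observed, predicted):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_o * var_p)

# Made-up relative intensities for a handful of fragment ions.
pcc = pearson([0.1, 0.5, 1.0, 0.3], [0.2, 1.0, 2.0, 0.6])
```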

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 1660 kb)

Life Sciences Reporting Summary (PDF 132 kb)

Supplementary Tables and Notes

Supplementary Tables 1–10, Supplementary Notes 1–5 (PDF 4108 kb)

Supplementary Data 1

The number of identified PSMs, distinct peptides and distinct peptide sequences for the four published datasets (XLSX 14 kb)

Supplementary Data 2

The consistency of the identification results (XLSX 16 kb)

Supplementary Data 3

Detailed information for highly abundant modifications of cysteine residues in Kim data (XLSX 19 kb)

Supplementary Data 4

Detailed information for 694 semi-tryptic peptides verified by UniProt in Kim data (XLSX 72 kb)

Supplementary Code

Supplementary Code (EXE 21153 kb)


Cite this article

Chi, H., Liu, C., Yang, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat Biotechnol 36, 1059–1061 (2018). https://doi.org/10.1038/nbt.4236


  • DOI: https://doi.org/10.1038/nbt.4236
