Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine

Abstract

We present a sequence-tag-based search engine, Open-pFind, to identify peptides in an ultra-large search space that includes coeluting peptides, unexpected modifications and digestions. Our method detects peptides with higher precision and speed than seven other search engines. Open-pFind identified 70–85% of the tandem mass spectra in four large-scale datasets and 14,064 proteins, each supported by at least two protein-unique peptides, in a human proteome dataset.

Figure 1: Open-pFind outperforms other search engines on a metabolically labeled dataset.
Figure 2: Performance evaluation of Open-pFind and seven other search engines with diverse types of datasets.

Acknowledgements

This work was supported by grants from the National Key Research and Development Program of China (No. 2016YFA0501300 to S.-M.H., 2017YFA0505100 and 2017YFC0906600 to P.X. and 2012CB316502 to P.-H.Z.), the Youth Innovation Promotion Association CAS (No. 2014091 to H.C.), the National Natural Science Foundation of China (31470805 to H.C., 31670834 to P.X., 31700727 to C.L., and 21475141 to S.-M.H.), the CAS Interdisciplinary Innovation Team (Y604061000 to S.-M.H.), the International Collaboration Program (2014DFB30020 to P.X.), and the Beijing Training Project for The Leading Talents in S&T (Z161100004916024 to P.X.).

Author information

Contributions

H.C. developed the kernel algorithm and command-line tool of Open-pFind, analyzed the data and wrote the manuscript. C.L. developed the validation method using metabolically labeled datasets. H.Y. developed the post-processing tool pBuild and helped with data analysis. W.-F.Z. helped to develop the machine learning module. L.W. developed the pre-processing tool pParse. Y.-H.D. and M.-Q.D. provided the Dong-Ecoli-QE dataset. Y.Z. and P.X. provided the Xu-Yeast-QEHF dataset. W.-J.Z., R.-M.W., X.-N.N., Z.-W.W., Z.-L.C. and R.-X.S. helped with the development of the interface and with data analysis. T.L., G.-M.T. and P.-H.Z. helped with the performance test on the workstation. S.-M.H. coordinated the study. All of the authors helped to revise the manuscript.

Corresponding authors

Correspondence to Hao Chi or Si-Min He.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Integrated supplementary information

Supplementary Figure 1 The workflow of Open-pFind.

The MS data are first preprocessed by pParse, and then the MS/MS data are searched by the open search module. Next, the MS/MS data are re-searched by the restricted search module against a refined search space based on the learned information in the reranking step. Finally, the results obtained from both the open and restricted searches are merged, reranked again and reported.

Supplementary Figure 2 An overview of the Open-pFind results for the Dong-Ecoli-QE dataset.

This figure is a screenshot of the embedded software tool pBuild, which was designed to graphically represent pFind Studio results. a) Distributions of mass deviations (in ppm) of the identified PSMs in the five raw files. The number of PSMs for each raw file (N) is 16,007 (10mM), 13,418 (30mM), 16,931 (60mM), 18,575 (150mM3) and 24,523 (1000mM), respectively. Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1–1.5 × IQR to Q3+1.5 × IQR. b) Score distributions of the target (blue) and decoy (red) PSMs. c) Distribution of the enzymatic specificity of identified peptides. d) Percentages of the highly abundant modifications in all identified peptides. e) Distribution of the number of missed cleavage sites. f) Distribution of the number of peptides identified from one tandem mass spectrum, e.g., two peptides can be identified from 12.65% of the total spectra.
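The box-plot elements named in panel a) can be computed as in the following minimal sketch. The function name, the choice of the "inclusive" quartile method and the sample values are assumptions made for illustration; the quartile convention used by pBuild is not stated in the caption.

```python
import statistics


def box_plot_limits(values):
    """Return (Q1, median, Q3, lower whisker, upper whisker) for a sample.

    Whiskers follow the caption's definition: Q1 - 1.5 * IQR to Q3 + 1.5 * IQR.
    The "inclusive" quartile method is an assumption of this sketch.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, median, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical mass deviations in ppm (not values from the dataset).
deviations_ppm = [-2.0, -1.0, 0.0, 1.0, 2.0]
limits = box_plot_limits(deviations_ppm)
```

For the hypothetical sample above, the quartiles are -1, 0 and 1 ppm, so the IQR is 2 ppm and the whiskers extend from -4 to 4 ppm.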

Supplementary Figure 3 Precision evaluation of search engines using the Dong-Ecoli-QE dataset.

Metabolically labeled datasets are searched against the protein database, in which only the unlabeled peptides are considered, and pQuant is then used independently to check the quantitation ratio of each PSM.

Supplementary Figure 4 Percentages of NaN-ratio PSMs identified in the Dong-Ecoli-QE dataset (I).

For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis. a) Consistently identified PSMs from the comparison of every two search engines are considered. b) Separately identified PSMs from the comparison of every two search engines are considered. The 15N- and unlabeled peptides are used to calculate the quantitation values in both a) and b). c) Similar to a), but the 13C- and unlabeled peptides are used to calculate the quantitation values. d) Similar to b), but the 13C- and unlabeled peptides are used to calculate the quantitation values.

Supplementary Figure 5 Percentages of NaN-ratio PSMs identified in the Xu-Yeast-QEHF dataset.

For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis. a) Consistently identified PSMs from the comparison of every two search engines are considered. b) Separately identified PSMs from the comparison of every two search engines are considered.

Supplementary Figure 6 Percentages of NaN-ratio PSMs identified in the Dong-Ecoli-QE dataset (II).

All PSMs (both modified and unmodified ones) were considered in the analysis. a) Consistently identified PSMs from the comparison of every two search engines are considered. b) Separately identified PSMs from the comparison of every two search engines are considered. The 15N- and unlabeled peptides are used to calculate the quantitation values in both a) and b). c) Similar to a), but the 13C- and unlabeled peptides are used to calculate the quantitation values. d) Similar to b), but the 13C- and unlabeled peptides are used to calculate the quantitation values.

Supplementary Figure 7 Analysis of NaN-ratio proteins in the Dong-Ecoli-QE dataset.

Open-pFind is compared with each of the other seven engines. a) The number of proteins that are consistently identified by two engines (&), separately identified by Open-pFind (+) or the other engine (−), and the percentages of NaN-ratio proteins in those three sets (&, + and −, respectively). The quantitation value of a protein is the NaN ratio if no supporting protein-unique peptides are assigned with normal quantitation values. b) Similar to a), but proteins supported with at least two protein-unique peptides are considered. The quantitation value of a protein is the NaN ratio if the number of the supporting protein-unique peptides with normal quantitation values is less than two. 15N-labeled peptides were used to calculate the quantitation values in both a) and b). c) Similar to a), but 13C-labeled peptides were used for calculating the quantitation values. d) Similar to b), but 13C-labeled peptides were used for calculating the quantitation values.
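The two NaN-ratio rules for proteins described in panels a) and b) can be illustrated with a short sketch. It assumes that a peptide without a normal quantitation value is represented as a floating-point NaN; the function names, the data layout and the example ratios are hypothetical and are not taken from pQuant.

```python
import math


def protein_is_nan_ratio(peptide_ratios, min_quantified=1):
    """Apply the caption's rule to one protein.

    peptide_ratios: ratios of the supporting protein-unique peptides,
    with float("nan") standing for "no normal quantitation value"
    (a representation assumed for this sketch).
    min_quantified=1 corresponds to panel a), min_quantified=2 to panel b).
    """
    quantified = sum(1 for ratio in peptide_ratios if not math.isnan(ratio))
    return quantified < min_quantified


def nan_ratio_percentage(proteins, min_quantified=1):
    """Percentage of NaN-ratio proteins in a set of identified proteins."""
    if not proteins:
        return 0.0
    flagged = sum(
        protein_is_nan_ratio(ratios, min_quantified)
        for ratios in proteins.values()
    )
    return 100.0 * flagged / len(proteins)


nan = float("nan")
# Hypothetical proteins, each supported by at least two protein-unique peptides.
proteins = {
    "P1": [1.0, nan],
    "P2": [nan, nan],
    "P3": [0.9, 1.1],
    "P4": [1.2, nan, 0.8],
}
pct_a = nan_ratio_percentage(proteins, min_quantified=1)  # rule of panel a)
pct_b = nan_ratio_percentage(proteins, min_quantified=2)  # rule of panel b)
```

In this toy set only P2 is a NaN-ratio protein under the rule of panel a), whereas both P1 and P2 are NaN-ratio proteins under the stricter rule of panel b).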

Supplementary Figure 8 Performance evaluation of the four open search engines using the entrapment strategy for the four published datasets.

The datasets were chosen as shown in Supplementary Table 1. Each database search process used the target database (human or mouse) appended by the entrapment database (Arabidopsis thaliana). All databases were downloaded from UniProt (http://www.uniprot.org), and only the reviewed proteins were used. The other database search parameters were the same as those shown in Supplementary Table 3. In each subfigure, the x-axis denotes the number of total PSMs reported by the open search engines, and the y-axis denotes the fraction of the entrapment PSMs in the total PSMs, e.g., for the Mann-Human-Velos dataset, Open-pFind reported 23,876 PSMs at 1% FDR at the peptide level, in which 0.25% of the spectra were matched with peptides from the entrapment database.
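The y-axis quantity described above, the fraction of entrapment PSMs among all reported PSMs, reduces to a simple count. The sketch below assumes each reported PSM has already been labeled by the database its peptide came from; the label strings, the function name and the example counts are hypothetical.

```python
def entrapment_fraction(psm_sources):
    """Fraction of reported PSMs whose peptide comes from the entrapment database.

    psm_sources: one label per reported PSM, either "target" or "entrapment"
    (labels assumed for this sketch).
    """
    labels = list(psm_sources)
    if not labels:
        raise ValueError("no PSMs reported")
    return sum(label == "entrapment" for label in labels) / len(labels)


# Hypothetical result list: 1,197 target PSMs and 3 entrapment PSMs.
reported = ["target"] * 1197 + ["entrapment"] * 3
fraction = entrapment_fraction(reported)
```

With these made-up counts the entrapment fraction is 3 / 1,200, i.e., 0.25%.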

Supplementary Figure 9 The distribution of the results reported by Open-pFind that were not included in the traditional search space.

For example, among all distinct peptides identified by Open-pFind for the Mann-Human-Velos dataset, semi-tryptic/non-specific peptides, peptides with unexpected modifications (including mutations) and peptides identified with additional precursors exported by pParse accounted for 4.3%, 17.1% and 27.7%, respectively, and a total of 42.9% of the peptides fell outside the restricted search space.

Supplementary Figure 10 The search spaces for different search modes.

The four modes include the restricted search with no modifications (No mod), with common modifications (Common, with carbamidomethylation of C, oxidation of M, Gln→pyro-Glu at N-termini of peptides and acetylation at N-termini of proteins), and with phosphorylation (Phospho, with common modifications and phosphorylation of S, T, and Y), and the open search (Unimod, with at most one modification in Unimod), together with three different types of digestion. For each mode, 1,000 experiments were performed, each with 1,000 randomly selected proteins from the UniProt database, which were digested in silico into peptides. The distribution of the number of peptides across the experiments is then shown as one box-plot (N = 1,000). Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1–1.5 × IQR to Q3+1.5 × IQR.
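To give a feel for how the digestion type affects the size of the search space, the following sketch digests a single toy sequence in silico. Several things here are assumptions rather than details from the paper: the cleavage rule (after K or R, with no proline exception), the missed-cleavage and length limits, the function names and the sequence itself. Modifications and the random sampling of 1,000 proteins are not modeled.

```python
import re


def specific_peptides(protein, max_missed=2):
    """Fully specific peptides: both ends at cleavage sites or protein termini.

    Assumed rule for this sketch: cleave after every K or R.
    """
    pieces = re.findall(r".*?[KR]|.+$", protein)
    peptides = set()
    for i in range(len(pieces)):
        last = min(i + 1 + max_missed, len(pieces))
        for j in range(i + 1, last + 1):
            peptides.add("".join(pieces[i:j]))
    return peptides


def semi_specific_peptides(protein, max_missed=2):
    """Semi-specific peptides: at least one end at a cleavage site."""
    semi = set()
    for peptide in specific_peptides(protein, max_missed):
        for k in range(1, len(peptide) + 1):
            semi.add(peptide[:k])   # N-terminal end is specific
            semi.add(peptide[-k:])  # C-terminal end is specific
    return semi


def in_length_range(peptides, min_len=6, max_len=40):
    """Keep peptides within an assumed searchable length range."""
    return {p for p in peptides if min_len <= len(p) <= max_len}


toy_protein = "MKTAYIAKQRQISFVK"  # hypothetical sequence, not from UniProt
specific = in_length_range(specific_peptides(toy_protein, max_missed=1))
semi = in_length_range(semi_specific_peptides(toy_protein, max_missed=1))
```

Every fully specific peptide is also semi-specific, so the semi-specific set is always at least as large; counting such sets for many sampled proteins is one way to obtain distributions like those summarized in the box-plots.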

Supplementary Figure 11 Relationship between the protein database reduction and the running time of Open-pFind.

#PR denotes the number of proteins in the reduced protein database after open search, i.e., the first step of Open-pFind. #PO denotes the number of proteins in the original protein database specified before database searches. The red curve denotes the ratio of the running time of Open-pFind to the running time of pFind in the analyses of the six datasets. Each bar denotes the mean and each error bar (blue whisker) denotes the standard deviation (SD) measured among different datasets (N = 5, 22, 3, 24, 4, 24 for the number of raw files in the six datasets shown from left to right, respectively). Each grey dot denotes a data point in the corresponding dataset. For example, in the analysis of the Xu-Yeast-QEHF dataset, ~97% of the proteins in the original database were removed after the open search step on average; as a result, the speed of Open-pFind was approximately twice that of pFind.

Supplementary Figure 12 The comparison between the picked TDA approach and the common two-peptide TDA approach for protein inference with the Kim data.

The number of proteins reported by the picked strategy was 16,133, which was larger than the number that was reported using the two-peptide rule (14,064), but seven additional olfactory receptors were detected with the picked strategy.

Supplementary Figure 13 An example of a mixed spectrum with three co-eluting peptides.

The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.dta, in which three co-eluting peptides were identified. a) MS1 information for the three peptides in the isolation window around this tandem mass spectrum. b) The match between the spectrum and the three resulting peptides, taken from a screenshot of pBuild. Three isotopic clusters are clearly distinguished from each other, and the observed fragmentation sites are near-complete for all three peptides.

Supplementary Figure 14 Similarity between the actual and predicted fragment ions for each of the three co-eluting peptides identified from one spectrum.

The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.2.dta. The predicted spectra are from the pDeep algorithm. The similarity was measured with the Pearson correlation coefficient (PCC): a) TLAEGQNVEFEIQDGQK, b) SEYLGDPDFVK and c) VSLAADPVEEIK. The number of singly charged b and y ions (N), for which the correlations were determined, is 32, 20 and 22 for a)‒c), respectively.
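The similarity measure named above can be sketched as a plain Pearson correlation between two equally long intensity vectors, one entry per singly charged b or y ion. The function name and the intensity values are hypothetical; how pDeep predictions are aligned with observed peaks is not described here and is not modeled.

```python
import math


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient between two intensity vectors."""
    n = len(observed)
    if n != len(predicted) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    if var_o == 0 or var_p == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(var_o * var_p)


# Hypothetical relative intensities for four fragment ions.
observed = [0.1, 0.5, 1.0, 0.3]
predicted = [0.2, 1.0, 2.0, 0.6]
pcc = pearson_correlation(observed, predicted)
```

Because the predicted vector in this toy example is an exact multiple of the observed one, the coefficient is 1 up to floating-point rounding; real spectra would give lower values.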

Supplementary information

Supplementary Text and Figures

Supplementary Figures 1–14 (PDF 1660 kb)

Life Sciences Reporting Summary (PDF 132 kb)

Supplementary Tables and Notes

Supplementary Tables 1–10, Supplementary Notes 1–5 (PDF 4108 kb)

Supplementary Data 1

The number of identified PSMs, distinct peptides and distinct peptide sequences for the four published datasets (XLSX 14 kb)

Supplementary Data 2

The consistency of the identification results (XLSX 16 kb)

Supplementary Data 3

Detailed information for highly abundant modifications of cysteine residues in Kim data (XLSX 19 kb)

Supplementary Data 4

Detailed information for 694 semi-tryptic peptides verified by UniProt in Kim data (XLSX 72 kb)

Supplementary Code

Supplementary Code (EXE 21153 kb)

About this article

Cite this article

Chi, H., Liu, C., Yang, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat Biotechnol 36, 1059–1061 (2018). https://doi.org/10.1038/nbt.4236