We present a sequence-tag-based search engine, Open-pFind, to identify peptides in an ultra-large search space that includes coeluting peptides, unexpected modifications and digestions. Our method detects peptides with higher precision and speed than seven other search engines. Open-pFind identified 70–85% of the tandem mass spectra in four large-scale datasets and 14,064 proteins, each supported by at least two protein-unique peptides, in a human proteome dataset.
This work was supported by grants from the National Key Research and Development Program of China (No. 2016YFA0501300 to S.-M.H., 2017YFA0505100 and 2017YFC0906600 to P.X. and 2012CB316502 to P.-H.Z.), the Youth Innovation Promotion Association CAS (No. 2014091 to H.C.), the National Natural Science Foundation of China (31470805 to H.C., 31670834 to P.X., 31700727 to C.L., and 21475141 to S.-M.H.), the CAS Interdisciplinary Innovation Team (Y604061000 to S.-M.H.), the International Collaboration Program (2014DFB30020 to P.X.), and the Beijing Training Project for The Leading Talents in S&T (Z161100004916024 to P.X.).
The authors declare no competing financial interests.
Integrated supplementary information
The MS data are first preprocessed by pParse, and then the MS/MS data are searched by the open search module. Next, the MS/MS data are re-searched by the restricted search module against a refined search space based on the learned information in the reranking step. Finally, the results obtained from both the open and restricted searches are merged, reranked again and reported.
This figure is a screenshot of the embedded software tool pBuild, which was designed to graphically represent pFind Studio results. a) Distributions of mass deviations (in ppm) of the identified PSMs in the five raw files. The number of PSMs for each raw file (N) is 16,007 (10mM), 13,418 (30mM), 16,931 (60mM), 18,575 (150mM3) and 24,523 (1000mM), respectively. Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1–1.5 × IQR to Q3+1.5 × IQR. b) Score distributions of the target (blue) and decoy (red) PSMs. c) Distribution of the enzymatic specificity of identified peptides. d) Percentages of the highly abundant modifications in all identified peptides. e) Distribution of the number of missed cleavage sites. f) Distribution of the number of peptides identified from one tandem mass spectrum, e.g., two peptides can be identified from 12.65% of the total spectra.
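The quantities in panel a) can be illustrated with a minimal sketch. It assumes the usual definition of a mass deviation in ppm and one common quartile convention; the function names are hypothetical and are not taken from pBuild or pFind Studio.

```python
from statistics import quantiles


def ppm_deviation(observed_mass, theoretical_mass):
    """Relative mass error in parts per million (standard definition)."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def box_plot_limits(values):
    """Median, quartiles and the Q1 - 1.5*IQR / Q3 + 1.5*IQR limits.

    Assumption: quartiles use the 'inclusive' interpolation method;
    the plotting tool behind the figure may use a different convention.
    """
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower": q1 - 1.5 * iqr,
        "upper": q3 + 1.5 * iqr,
    }
```

Applying `ppm_deviation` to every PSM of a raw file and passing the resulting list to `box_plot_limits` would give the elements of one box in the plot.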
Metabolically labeled datasets are searched against the protein database, in which only the unlabeled peptides are considered, and pQuant is then independently used to check the quantitation ratio of each PSM.
For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis. a) Consistently identified PSMs from the comparison of every two search engines are considered. b) Separately identified PSMs from the comparison of every two search engines are considered. The 15N- and unlabeled peptides are used to calculate the quantitation values in both a) and b). c) Similar to a), but the 13C- and unlabeled peptides are used to calculate the quantitation values. d) Similar to b), but the 13C- and unlabeled peptides are used to calculate the quantitation values.
For the open search engines, only PSMs with no or common modifications specified in the restricted search engines were considered in the analysis. a) Consistently identified PSMs from the comparison of every two search engines are considered. b) Separately identified PSMs from the comparison of every two search engines are considered.
All PSMs (both modified and unmodified ones) were considered in the analysis. a) Consistently identified PSMs from the comparison of every two search engines are considered. b) Separately identified PSMs from the comparison of every two search engines are considered. The 15N- and unlabeled peptides are used to calculate the quantitation values in both a) and b). c) Similar to a), but the 13C- and unlabeled peptides are used to calculate the quantitation values. d) Similar to b), but the 13C- and unlabeled peptides are used to calculate the quantitation values.
Open-pFind is compared with each of the other seven engines. a) The number of proteins that are consistently identified by two engines (&), separately identified by Open-pFind (+) or the other engine (−), and the percentages of NaN-ratio proteins in those three sets (&, + and −, respectively). The quantitation value of a protein is the NaN ratio if no supporting protein-unique peptides are assigned with normal quantitation values. b) Similar to a), but proteins supported with at least two protein-unique peptides are considered. The quantitation value of a protein is the NaN ratio if the number of the supporting protein-unique peptides with normal quantitation values is less than two. 15N-labeled peptides were used to calculate the quantitation values in both a) and b). c) Similar to a), but 13C-labeled peptides were used for calculating the quantitation values. d) Similar to b), but 13C-labeled peptides were used for calculating the quantitation values.
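The NaN-ratio rule stated in this caption can be sketched as follows. This is an illustration under assumptions, not the authors' code: a "normal" quantitation value is taken to mean a finite number, and the function names and input layout are hypothetical.

```python
import math


def is_nan_ratio_protein(unique_peptide_ratios, min_valid=1):
    """True if fewer than min_valid protein-unique peptides carry a
    normal (assumed: finite) quantitation value.

    min_valid=1 corresponds to panels a) and c), min_valid=2 to b) and d).
    """
    valid = [
        r for r in unique_peptide_ratios
        if r is not None and math.isfinite(r)
    ]
    return len(valid) < min_valid


def nan_ratio_percentage(proteins, min_valid=1):
    """Percentage of NaN-ratio proteins in a mapping of
    protein name -> list of ratios of its protein-unique peptides."""
    if not proteins:
        return 0.0
    n_nan = sum(
        is_nan_ratio_protein(ratios, min_valid)
        for ratios in proteins.values()
    )
    return 100.0 * n_nan / len(proteins)
```

Computing this percentage separately for the &, + and − protein sets would yield the three values compared in each panel.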
Supplementary Figure 8 Performance evaluation of the four open search engines using the entrapment strategy for the four published datasets.
The datasets were chosen as shown in Supplementary Table 1. Each database search process used the target database (human or mouse) appended by the entrapment database (Arabidopsis thaliana). All databases were downloaded from UniProt (http://www.uniprot.org), and only the reviewed proteins were used. The other database search parameters were the same as those shown in Supplementary Table 3. In each subfigure, the x-axis denotes the number of total PSMs reported by the open search engines, and the y-axis denotes the fraction of the entrapment PSMs in the total PSMs, e.g., for the Mann-Human-Velos dataset, Open-pFind reported 23,876 PSMs at 1% FDR at the peptide level, in which 0.25% of the spectra were matched with peptides from the entrapment database.
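The y-axis value described above is a simple fraction. A minimal sketch, assuming each reported PSM is represented by the accession of its matched protein and that entrapment proteins carry a recognizable prefix (the prefix "ENTRAP_" and the function name are hypothetical):

```python
def entrapment_fraction(psm_proteins, entrapment_prefix="ENTRAP_"):
    """Fraction of reported PSMs whose protein comes from the
    entrapment database.

    Assumption: entrapment accessions are marked with a prefix;
    real search output may label them differently.
    """
    if not psm_proteins:
        return 0.0
    n_entrapment = sum(
        1 for accession in psm_proteins
        if accession.startswith(entrapment_prefix)
    )
    return n_entrapment / len(psm_proteins)
```

Because entrapment proteins should be absent from the sample, a lower fraction at a given number of reported PSMs suggests fewer false identifications.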
Supplementary Figure 9 The distribution of the results reported by Open-pFind that were not included in the traditional search space.
For example, among all distinct peptides identified by Open-pFind for the Mann-Human-Velos dataset, semi-tryptic or non-specific peptides, peptides with unexpected modifications (including mutations), and peptides identified with additional precursors exported by pParse accounted for 4.3%, 17.1% and 27.7%, respectively; in total, 42.9% of the peptides fell outside the restricted search space.
The four modes include the restricted search with no modifications (No mod), with common modifications (Common, with carbamidomethylation of C, oxidation of M, Gln→pyro-Glu at N-termini of peptides and acetylation at N-termini of proteins), and with phosphorylation (Phospho, with common modifications and phosphorylation of S, T, and Y), and the open search (Unimod, with at most one modification in Unimod), together with three different types of digestion. For each mode, 1,000 experiments were performed, each with 1,000 randomly selected proteins from the UniProt database, which were digested in silico into peptides. Then, the distribution of the number of peptides from each experiment is shown in one box-plot (N = 1,000). Box-plot elements: center line, median; box limits, first and third quartile (Q1 and Q3); whiskers, from Q1–1.5 × IQR to Q3+1.5 × IQR.
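To give a sense of how the digestion type alone changes the number of candidate peptides, here is a minimal in silico digestion sketch. It is not the procedure used for the figure: it assumes trypsin cleaves after every K or R (ignoring any proline rule), ignores modifications entirely, and uses hypothetical parameter names and default length limits.

```python
def cleavage_sites(protein):
    """Cut positions: protein start, after each K/R, and protein end."""
    sites = [0]
    sites += [i + 1 for i, aa in enumerate(protein[:-1]) if aa in "KR"]
    sites.append(len(protein))
    return sites


def digest(protein, mode="specific", max_missed=2, min_len=6, max_len=30):
    """Enumerate distinct peptides for one digestion type.

    mode: "specific" (both termini at cleavage sites),
          "semi" (at least one terminus at a cleavage site),
          "nonspecific" (any substring within the length limits).
    """
    sites = cleavage_sites(protein)
    site_set = set(sites)
    n = len(protein)
    peptides = set()
    for start in range(n):
        for end in range(start + min_len, min(n, start + max_len) + 1):
            if mode != "nonspecific":
                start_ok = start in site_set
                end_ok = end in site_set
                if mode == "specific" and not (start_ok and end_ok):
                    continue
                if mode == "semi" and not (start_ok or end_ok):
                    continue
                # Missed cleavages are the sites strictly inside the peptide.
                missed = sum(1 for s in sites if start < s < end)
                if missed > max_missed:
                    continue
            peptides.add(protein[start:end])
    return peptides
```

Calling `len(digest(sequence, mode))` for each of the three modes on the same sequence shows the growth from specific to semi-specific to non-specific digestion; modifications would multiply these counts further.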
Supplementary Figure 11 Relationship between the protein database reduction and the running time of Open-pFind.
#PR denotes the number of proteins in the reduced protein database after open search, i.e., the first step of Open-pFind. #PO denotes the number of proteins in the original protein database specified before database searches. The red curve denotes the ratio of the running time of Open-pFind to the running time of pFind in the analyses of the six datasets. Each bar denotes the mean and each error bar (blue whisker) denotes the standard deviation (SD) measured among different datasets (N = 5, 22, 3, 24, 4, 24 for the number of raw files in the six datasets shown from left to right, respectively). Each grey dot denotes a data point in the corresponding dataset. For example, in the analysis of the Xu-Yeast-QEHF dataset, ~97% of the proteins in the original database were removed after the open search step on average; as a result, the speed of Open-pFind was approximately twice that of pFind.
Supplementary Figure 12 The comparison between the picked TDA approach and the common two-peptide TDA approach for protein inference with the Kim data.
The number of proteins reported by the picked strategy was 16,133, which was larger than the number that was reported using the two-peptide rule (14,064), but seven additional olfactory receptors were detected with the picked strategy.
The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.dta, in which three co-eluting peptides were identified. a) MS1 information for the three peptides in the isolation window around this tandem mass spectrum. b) The match between the spectrum and the three resulting peptides, taken from a screenshot of pBuild. Three isotopic clusters are clearly distinguished from each other, while the observed fragmentation sites were near-complete for all three peptides.
Supplementary Figure 14 Similarity between the actual and predicted fragment ions for each of the three co-eluting peptides identified from one spectrum.
The spectrum is titled Ecoli-1to1to1-un-C13-N15-60mM-20150823.22783.22783.2.dta. The predicted spectra are from the pDeep algorithm. The similarity was measured with the Pearson correlation coefficient (PCC): a) TLAEGQNVEFEIQDGQK, b) SEYLGDPDFVK and c) VSLAADPVEEIK. The number of singly charged b and y ions (N), for which the correlations were determined, is 32, 20 and 22 for a)‒c), respectively.
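For reference, the Pearson correlation coefficient between observed and predicted fragment intensities can be computed as in the sketch below. The function name and the assumption that the two intensity lists are aligned ion by ion (same b and y ions in the same order) are illustrative and not taken from pDeep.

```python
import math


def pearson_correlation(observed, predicted):
    """Pearson correlation coefficient of two equally long intensity lists."""
    if len(observed) != len(predicted) or len(observed) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    cov = sum((x - mean_obs) * (y - mean_pred)
              for x, y in zip(observed, predicted))
    var_obs = sum((x - mean_obs) ** 2 for x in observed)
    var_pred = sum((y - mean_pred) ** 2 for y in predicted)
    if var_obs == 0 or var_pred == 0:
        raise ValueError("correlation is undefined for constant intensities")
    return cov / math.sqrt(var_obs * var_pred)
```

For peptide a), for example, each list would hold N = 32 intensities, one per singly charged b or y ion.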
Supplementary Figures 1–14 (PDF 1660 kb)
Supplementary Tables 1–10, Supplementary Notes 1–5 (PDF 4108 kb)
The number of identified PSMs, distinct peptides and distinct peptide sequences for the four published datasets (XLSX 14 kb)
The consistency of the identification results (XLSX 16 kb)
Detailed information for highly abundant modifications of cysteine residues in Kim data (XLSX 19 kb)
Detailed information for 694 semi-tryptic peptides verified by UniProt in Kim data (XLSX 72 kb)
Supplementary Code (EXE 21153 kb)
About this article
Cite this article
Chi, H., Liu, C., Yang, H. et al. Comprehensive identification of peptides in tandem mass spectra using an efficient open search engine. Nat Biotechnol 36, 1059–1061 (2018). https://doi.org/10.1038/nbt.4236