PECAN: library-free peptide detection for data-independent acquisition tandem mass spectrometry data

Journal name: Nature Methods
Volume: 14
Pages: 903–908
DOI: 10.1038/nmeth.4390

Abstract

Data-independent acquisition (DIA) is an emerging mass spectrometry (MS)-based technique for unbiased and reproducible measurement of protein mixtures. DIA tandem mass spectrometry spectra are often highly multiplexed, containing product ions from multiple cofragmenting precursors. Detecting peptides directly from DIA data is therefore challenging; most DIA data analyses require spectral libraries. Here we present PECAN (http://pecan.maccosslab.org), a library-free, peptide-centric tool that robustly and accurately detects peptides directly from DIA data. PECAN reports evidence of detection based on product ion scoring, which enables detection of low-abundance analytes with poor precursor ion signal. We demonstrate the chromatographic peak picking accuracy and peptide detection capability of PECAN, and we further validate its detection with data-dependent acquisition and targeted analyses. Lastly, we used PECAN to build a plasma proteome library from DIA data and to query known sequence variants.

Figures

  1. Figure 1: Overview of PECAN workflow.

    PECAN takes DIA data, peptides of interest, and a background proteome database as inputs, and it outputs evidence of detection with auxiliary scores for every query peptide and every PECAN-generated decoy peptide. Percolator uses the PECAN output to train a classifier that distinguishes correct from incorrect evidence, and then it outputs confident peptide and protein detections with an estimated FDR.

  2. Figure 2: PECAN peak picking performance on the SIS data set.

    422 stable-isotope-labeled standard (SIS) peptides were diluted in water (blue), yeast lysate (orange), or HeLa lysate (black) and measured in three replicates. (a,b) Combined percentage of correct SIS peaks (a) and total number of SIS peaks (b) reported by PECAN before FDR control. (c,d) Combined percentage of correct SIS peaks (c) and total number of reported SIS peaks (d) after the PECAN-reported evidence of detection was subjected to peptide-level FDR control per measurement at q-value < 0.01 by Percolator.

  3. Figure 3: Validation of PECAN detection with GST fusion proteins.

    Comparative analysis of peptide detection from DIA and DDA data from HeLa protein digest. Number of unique peptides (a) and unique proteins (b) detected by PECAN DIA and Comet DDA workflows. (c) SRM validation workflow for a set of analytical standards synthesized using in vitro transcription translation (IVTT). (d) Comparative analysis of retention time of HeLa peptides detected by PECAN from DIA data and IVTT peptides detected from SRM.

  4. Figure 4: Deep proteome measurement with gas-phase fractionation.

    Comparison of the numbers of unique peptides (a) and proteins (b) detected by PECAN from 1×GPF, 2×GPF, and 4×GPF DIA data when queried with the human UniProt Swiss-Prot database. (c) Retention time comparison of 12,952 PECAN-detected peptides from 1×GPF and 2×GPF relative to 4×GPF. (d) Number of peptides detected by either PECAN or DIA-Umpire, or by both, from the three GPF DIA data sets.

  5. Figure 5: Natural variants in the plasma library data.

    Full-length canonical sequences of serotransferrin (a) and apolipoprotein A1 (b) were obtained from accession numbers P02787 and P02647, respectively, in the human UniProt Swiss-Prot database. Blue boxes represent peptides that PECAN detected in the plasma library data when queried with the canonical sequences. Red boxes represent variant-specific peptides that PECAN detected in the plasma library data when queried with variant-specific tryptic peptides from 3,714 variants.

  6. Supplementary Fig. 1: Retention time analysis for common peptides from Comet-DDA and PECAN-DIA.

    Of the 5,182 peptides commonly detected by PECAN from 4×GPF DIA data and Comet from 4×GPF DDA data, 27 peptides were identified more than 2 minutes apart.

  7. Supplementary Fig. 2: Dynamic range of DIA plasma library.

    Relative concentration values of 248 plasma proteins are taken from the literature. (Source: Leigh Anderson, The Plasma Proteome Institute, Washington, DC, USA, modified from Mol. Cell. Proteomics 1, 845–847 (2002).) The color of each dot represents the number of peptides in the DIA plasma library that are unique to the protein or shared only by its isoforms. Note that some literature values are measurements of a protein complex or of specific fragments of a protein (e.g., the values for prothrombin and fibrinogen alpha chain), so the intact protein concentration could be higher.

  8. Supplementary Fig. 3: Assessment of background scores estimation with 1,000 random sampling.

    (a) Boxplot showing the distribution of 2,185 CVs of the RSEs from 1,000 random samplings at each decoy size. (b) The estimated background scores with 2,000 charge 2 and 2,000 charge 3 decoys for 2,185 MS/MS spectra, presented over retention time. Black lines trace the median of the decoy means from 1,000 estimations by random sampling, and the blue shading marks the range between the 25th and 75th percentiles. (c) Bonferroni-corrected p-values from Wilcoxon rank-sum tests between the 1,000 estimations using either 2,000 charge 2 or 2,000 charge 3 decoys for each individual spectrum. Grey lines indicate spectra for which the p-value is smaller than 0.05 and the null hypothesis is therefore rejected.

  9. Supplementary Fig. 4: Evidence qualifying procedure in PECAN.

    An evidence of detection (abbr. evidence) for a query peptide p at time t is the average of the calibrated primary scores from a short period of retention time (see Methods) centered at time t. Following this flowchart, PECAN reports a user-defined number of qualified evidence(s), each calculated from primary scores that have never been used to calculate any other qualified evidence(s).
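
The evidence-qualifying idea described for Supplementary Fig. 4 can be illustrated with a minimal Python sketch. This is not PECAN's implementation: the function names, the index-based window (half_width), and the greedy highest-first selection are assumptions made for illustration, and the flowchart in the paper may apply additional rules. Each evidence value is the mean of the calibrated primary scores in a window centered on a time point, and a window qualifies only if none of its primary scores was already used by a previously qualified evidence.

```python
from statistics import fmean


def window_evidence(scores, half_width):
    """Average the calibrated primary scores in a window centered on each index.

    Returns tuples of (evidence, center_index, window_start, window_end).
    Windows are truncated at the edges of the score trace (an assumption).
    """
    evidence = []
    for t in range(len(scores)):
        lo = max(0, t - half_width)
        hi = min(len(scores), t + half_width + 1)
        evidence.append((fmean(scores[lo:hi]), t, lo, hi))
    return evidence


def qualify_evidence(scores, half_width=1, n_report=2):
    """Greedily keep the highest-scoring windows that share no primary score."""
    used = set()      # indices of primary scores already consumed
    qualified = []    # (center_index, evidence) pairs, best first
    # Highest evidence first; ties are broken by the earlier time point.
    candidates = sorted(window_evidence(scores, half_width),
                        key=lambda e: (-e[0], e[1]))
    for value, t, lo, hi in candidates:
        if len(qualified) == n_report:
            break
        window = range(lo, hi)
        if used.isdisjoint(window):
            qualified.append((t, value))
            used.update(window)
    return qualified


# Hypothetical calibrated primary scores for one query peptide over time.
scores = [0.1, 0.2, 3.0, 3.3, 2.7, 0.2, 0.1, 1.5, 1.8, 1.2, 0.0]
result = qualify_evidence(scores, half_width=1, n_report=2)
print(result)
```

With these made-up scores, the sketch reports the windows centered at indices 3 and 8. The windows centered at indices 2 and 4 score higher than the one at index 8, but they are skipped because they reuse primary scores from the first qualified window.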

Author information

Affiliations

  1. Department of Genome Sciences, University of Washington, Seattle, Washington, USA.

    • Ying S Ting,
    • Jarrett D Egertson,
    • James G Bollinger,
    • Brian C Searle,
    • William Stafford Noble &
    • Michael J MacCoss
  2. Biological Sciences Division, Pacific Northwest National Laboratory, Richland, Washington, USA.

    • Samuel H Payne
  3. Department of Computer Science and Engineering, University of Washington, Seattle, Washington, USA.

    • William Stafford Noble

Contributions

Y.S.T. and M.J.M. designed the experiments. Y.S.T. developed the algorithms with input from J.D.E., S.H.P., B.C.S., W.S.N., and M.J.M. Y.S.T. performed the analyses. Y.S.T. and J.G.B. acquired the data. Software was written by Y.S.T. with input from J.D.E. and B.C.S. The manuscript was written by Y.S.T. with substantial input from J.D.E., S.H.P., W.S.N., and M.J.M.

Competing financial interests

The MacCoss Lab at the University of Washington has a sponsored research agreement with Thermo Fisher Scientific, the manufacturer of the instrumentation used in this research. Additionally, M.J.M. is a paid consultant for Thermo Fisher Scientific.

Supplementary information

Supplementary Figures

  1. Supplementary Figure 1: Retention time analysis for common peptides from Comet-DDA and PECAN-DIA. (81 KB)

  2. Supplementary Figure 2: Dynamic range of DIA plasma library. (383 KB)

  3. Supplementary Figure 3: Assessment of background scores estimation with 1,000 random sampling. (261 KB)

  4. Supplementary Figure 4: Evidence qualifying procedure in PECAN. (50 KB)

PDF files

  1. Supplementary Text and Figures (4,147 KB)

    Supplementary Figures 1–4, Supplementary Tables 1–3 and Supplementary Notes 1–6

  2. Reporting Summary (132 KB)

    Life Sciences Reporting Summary
