Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study


The Human Proteome Organization (HUPO) recently completed the first large-scale collaborative study to characterize the human serum and plasma proteomes. The study was carried out in different locations and used diverse methods and instruments to compare and integrate tandem mass spectrometry (MS/MS) data on aliquots of pooled serum and plasma from healthy subjects. Liquid chromatography (LC)-MS/MS data sets from 18 laboratories were matched to the International Protein Index database, and an initial integration exercise resulted in 9,504 proteins identified with one or more peptides and 3,020 proteins identified with two or more peptides. This article applies a rigorous statistical approach that takes into account the length of coding regions in genes and uses multiple hypothesis-testing techniques. On this basis, we now present a reduced set of 889 proteins identified with a confidence level of at least 95%. We also discuss the importance of such an integrated analysis in providing an accurate representation of a proteome, as well as the value of such data sets for the high-confidence identification of protein matches to novel exons, some of which may be localized in alternatively spliced forms of known plasma proteins and some in previously nonannotated gene sequences.
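
To illustrate why coding-region length and multiple-testing correction both matter, the following Python sketch scores each protein against a simple null model. It is not the procedure used in the study: the Poisson null (random peptide matches land on a protein in proportion to its length), the Bonferroni correction, the function names and the toy numbers are all assumptions chosen for illustration.

```python
import math


def poisson_tail(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    cdf = sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(k))
    return max(0.0, 1.0 - cdf)


def confident_proteins(peptide_counts, lengths, n_random_hits, alpha=0.05):
    """Return proteins whose peptide count is unlikely under the null.

    Assumed null model (hypothetical): n_random_hits false peptide matches
    are spread over the database in proportion to protein length, so a
    protein of length L expects n_random_hits * L / total_length of them.
    Each p-value is Bonferroni-corrected by the number of proteins tested.
    """
    total_length = sum(lengths.values())
    n_tests = len(lengths)
    kept = []
    for name, k in peptide_counts.items():
        lam = n_random_hits * lengths[name] / total_length
        p_value = poisson_tail(k, lam)
        if p_value * n_tests <= alpha:
            kept.append(name)
    return sorted(kept)


if __name__ == "__main__":
    # Toy database: two short proteins and one very long one.
    toy_lengths = {"A": 500, "B": 500, "C": 9000}
    toy_counts = {"A": 4, "B": 1, "C": 9}
    print(confident_proteins(toy_counts, toy_lengths, n_random_hits=10))
```

With these toy inputs, the short protein A passes with four peptides, while the long protein C does not pass with nine. Under the assumed null, about nine random matches are expected on a protein of that length. This is the sense in which a raw peptide count alone can overstate confidence for long sequences.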

Figure 1: Distribution of protein identifications.
Figure 2: Number of peptides identified as a function of protein concentration.
Figure 3: Distribution of peptides identified for β-2-glycoprotein 1.
Figure 4: Bar plot of the distribution of ORFs types by gene.
Figure 5: Novel ORFs in the APOE gene.



The collaborative HUPO Plasma Protein study and the data analysis presented here have been supported by a trans-National Institutes of Health grant supplement 84982 administered by the National Cancer Institute, by pharmaceutical and technology company sponsors and by voluntary efforts of collaborating laboratories.

Author information



Corresponding author

Correspondence to Samir M Hanash.

Ethics declarations

Competing interests

The authors declare no competing financial interests.

Supplementary information

Supplementary Fig. 1

Accrual of identifications as a function of sampling. (PDF 20 kb)

Supplementary Fig. 2

Complement component 3 isoforms. (PDF 20 kb)

Supplementary Table 1

Numbers of protein identifications by specimen and by methodologies applied in individual laboratories. (PDF 90 kb)

Supplementary Table 2

List of high-confidence protein identifications. (PDF 116 kb)

Supplementary Table 3

Intragenic peptides not in an annotated exon. (PDF 15 kb)

Supplementary Notes (PDF 25 kb)

About this article

Cite this article

States, D., Omenn, G., Blackwell, T. et al. Challenges in deriving high-confidence protein identifications from data gathered by a HUPO plasma proteome collaborative study. Nat Biotechnol 24, 333–338 (2006).
