A large-scale evaluation of computational protein function prediction

Journal name: Nature Methods
Volume: 10
Pages: 221–227
DOI: 10.1038/nmeth.2340

Abstract

Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.

Figures

  1. Figure 1: Experiment timeline and target analysis.

    (a) Timeline for the CAFA experiment. (b) Number of target sequences per organism. The graph shows the number of target sequences for each of the ontologies (Molecular Function and Biological Process) as well as the total number of targets, obtained as the union of the sequences in the two ontologies. Of 866 proteins, 531 had Molecular Function annotations and 587 had Biological Process annotations. (c) Distribution of target sequences in each ontology according to the number of leaf terms available for each protein sequence. For example, in the Molecular Function category, 79% of proteins had one leaf term, 16% had two leaf terms, and so on. A term is considered a leaf term for a particular target if no other GO term associated with that sequence is its descendant.

  2. Figure 2: Overall performance evaluation.

    (a,b) The maximum F-measure for the top-performing methods for the Molecular Function ontology (a) and the Biological Process ontology (b). Both panels show the top ten participating methods in each category as well as the BLAST and Naive baseline methods. Note that 33 models outperformed BLAST in the Molecular Function category, whereas 26 models outperformed BLAST in the Biological Process category (cutoff scores below which methods were excluded from the panels were 0.468 and 0.300 for the Molecular Function and Biological Process categories, respectively). In the Molecular Function category, proteins with “protein binding” as their only leaf term were excluded from the analysis because the protein binding term was not considered informative (results that include those proteins are presented in Supplementary Fig. 3). A perfect predictor would be characterized by Fmax = 1. Confidence intervals (95%) were determined using bootstrapping with n = 10,000 iterations on the set of target sequences. For cases in which a principal investigator participated in multiple teams, only the results of the best-scoring method are presented.

  3. Figure 3: Domain analysis and performance evaluation for single-domain versus multidomain eukaryotic targets.

    (a) Distribution of target proteins with respect to the number of Pfam domains they contain. (b) Performance evaluation in the Molecular Function category. Each of the ten top-performing methods showed higher accuracy (higher Fmax) on single-domain proteins. Confidence intervals (95%) were determined using bootstrapping with n = 10,000 iterations on the set of target sequences.

  4. Figure 4: Case study on the human PNPT1 gene.

    (a) Domain architecture of the human PNPT1 gene according to the Pfam classification. For each domain, the numbers of different leaf terms (for the Molecular Function and Biological Process categories) associated with any protein in the Swiss-Prot database containing this domain are shown. (b) Molecular Function terms (six of which are leaves) associated with the human PNPT1 gene in Swiss-Prot as of December 2011. Colored circles represent the predicted terms for three representative methods as well as two baseline methods. The prediction threshold for each method was selected to correspond to the point in the precision-recall space that provides the maximum F-measure. J (blue), Jones-UCL; O (magenta), Team Orengo; d (navy blue), dcGO; B (green), BLAST; N (brown), Naive. Dashed lines indicate the presence of other terms between the source and destination nodes.
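
The figure captions refer to a maximum F-measure (Fmax), to the prediction threshold that attains it, and to 95% bootstrap confidence intervals over the target set, but they do not spell out the formulas. The following is a minimal Python sketch of one plausible reading, not the assessment code used in the experiment. It assumes protein-centric averaging (precision averaged over targets with at least one prediction at the threshold, recall averaged over all targets), a threshold grid of 0.01 to 1.00, and a percentile bootstrap. The function names and the toy data are hypothetical.

```python
import random


def precision_recall(truth, preds, t):
    """Protein-centric precision and recall at score threshold t.

    truth: list of non-empty sets of true terms, one set per target
    preds: list of dicts {term: score}, one dict per target
    """
    precisions, recalls = [], []
    for true_terms, scored in zip(truth, preds):
        predicted = {term for term, s in scored.items() if s >= t}
        hits = len(predicted & true_terms)
        # Assumption: precision is averaged only over targets that
        # received at least one prediction at this threshold.
        if predicted:
            precisions.append(hits / len(predicted))
        # Assumption: recall is averaged over every target.
        recalls.append(hits / len(true_terms))
    recall = sum(recalls) / len(recalls)
    if not precisions:
        return None, recall
    return sum(precisions) / len(precisions), recall


def f_max(truth, preds, thresholds=None):
    """Return (Fmax, threshold at which Fmax is first reached)."""
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best_f, best_t = 0.0, None
    for t in thresholds:
        pr, rc = precision_recall(truth, preds, t)
        if pr is None or pr + rc == 0:
            continue
        f = 2 * pr * rc / (pr + rc)
        if f > best_f:
            best_f, best_t = f, t
    return best_f, best_t


def bootstrap_ci(truth, preds, n_iter=10000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Fmax, resampling targets."""
    rng = random.Random(seed)
    n = len(truth)
    stats = []
    for _ in range(n_iter):
        # Draw n targets with replacement and recompute Fmax.
        idx = [rng.randrange(n) for _ in range(n)]
        f, _ = f_max([truth[i] for i in idx], [preds[i] for i in idx])
        stats.append(f)
    stats.sort()
    lo = stats[int((alpha / 2) * n_iter)]
    hi = stats[min(n_iter - 1, int((1 - alpha / 2) * n_iter))]
    return lo, hi


# Hypothetical toy data: two targets, three terms.
toy_truth = [{"A", "B"}, {"A"}]
toy_preds = [{"A": 0.9, "B": 0.4, "C": 0.2}, {"A": 0.6, "C": 0.7}]
print(f_max(toy_truth, toy_preds))
print(bootstrap_ci(toy_truth, toy_preds, n_iter=200))
```

The sketch treats each target's true term set as already including any ancestor terms the evaluation should count. With real data, a coarser threshold grid or fewer bootstrap iterations keeps the run time manageable while exploring.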

Author information

Affiliations

  1. School of Informatics and Computing, Indiana University, Bloomington, Indiana, USA.

    • Predrag Radivojac &
    • Wyatt T Clark
  2. Buck Institute for Research on Aging, Novato, California, USA.

    • Tal Ronnen Oron,
    • Tobias Wittkop &
    • Sean D Mooney
  3. Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, California, USA.

    • Alexandra M Schnoes &
    • Patricia C Babbitt
  4. Department of Computer Science, Colorado State University, Fort Collins, Colorado, USA.

    • Artem Sokolov,
    • Kiley Graim &
    • Asa Ben-Hur
  5. Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, California, USA.

    • Artem Sokolov
  6. Computational Bioscience Program, University of Colorado School of Medicine, Aurora, Colorado, USA.

    • Christopher Funk &
    • Karin Verspoor
  7. National ICT Australia, Victoria Research Laboratory, Melbourne, Australia.

    • Karin Verspoor
  8. Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, California, USA.

    • Gaurav Pandey,
    • Susanna Repo &
    • Steven E Brenner
  9. Mount Sinai School of Medicine, New York, New York, USA.

    • Gaurav Pandey
  10. Joint Graduate Group in Bioengineering, University of California, Berkeley, Berkeley, California, USA.

    • Jeffrey M Yunes
  11. Department of Electrical Engineering and Computer Science, University of California, Berkeley, Berkeley, California, USA.

    • Ameet S Talwalkar
  12. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, UK.

    • Susanna Repo
  13. Biophysics Graduate Program, University of California, Berkeley, Berkeley, California, USA.

    • Michael L Souza
  14. Department of Biology, University of Bologna, Bologna, Italy.

    • Damiano Piovesan &
    • Rita Casadio
  15. Department of Computer Science, University of Missouri, Columbia, Missouri, USA.

    • Zheng Wang &
    • Jianlin Cheng
  16. Department of Computer Science, University of Bristol, Bristol, UK.

    • Hai Fang &
    • Julian Gough
  17. Department of Biological and Environmental Sciences & Institute of Biotechnology, Viikki Biocentre, University of Helsinki, Helsinki, Finland.

    • Patrik Koskinen,
    • Petri Törönen,
    • Jussi Nokso-Koivisto &
    • Liisa Holm
  18. Department of Computer Science, University College London, London, UK.

    • Domenico Cozzetto,
    • Daniel W A Buchan,
    • Kevin Bryson &
    • David T Jones
  19. Bioinformatics Group, Centre for Development of Advanced Computing, Pune University Campus, Pune, India.

    • Bhakti Limaye,
    • Harshal Inamdar,
    • Avik Datta,
    • Sunitha K Manjari &
    • Rajendra Joshi
  20. Department of Computer Science, Purdue University, West Lafayette, Indiana, USA.

    • Meghana Chitale &
    • Daisuke Kihara
  21. Department of Biological Sciences, Purdue University, West Lafayette, Indiana, USA.

    • Daisuke Kihara
  22. Department of Molecular and Human Genetics, Computational and Integrative Biomedical Research Center, Baylor College of Medicine, Houston, Texas, USA.

    • Andreas M Lisewski,
    • Serkan Erdin,
    • Eric Venner &
    • Olivier Lichtarge
  23. University College London, Institute for Structural and Molecular Biology, London, UK.

    • Robert Rentzsch &
    • Christine Orengo
  24. Department of Computer Science, Centre for Systems and Synthetic Biology, Royal Holloway, University of London, Egham, UK.

    • Haixuan Yang,
    • Alfonso E Romero,
    • Prajwal Bhat &
    • Alberto Paccanaro
  25. Technische Universität München, Bioinformatik-I12, Informatik, Garching, Germany.

    • Tobias Hamp,
    • Rebecca Kaßner,
    • Stefan Seemayer,
    • Esmeralda Vicedo,
    • Christian Schaefer,
    • Dominik Achten,
    • Florian Auer,
    • Ariane Boehm,
    • Tatjana Braun,
    • Maximilian Hecht,
    • Mark Heron,
    • Peter Hönigschmid,
    • Thomas A Hopf,
    • Stefanie Kaufmann,
    • Michael Kiening,
    • Denis Krompass,
    • Cedric Landerer,
    • Yannick Mahlich,
    • Manfred Roos &
    • Burkhard Rost
  26. Department of Information Technology, University of Turku, Turku Centre for Computer Science, Turku, Finland.

    • Jari Björne &
    • Tapio Salakoski
  27. School of Computing, Queen's University, Kingston, Ontario, Canada.

    • Andrew Wong &
    • Hagit Shatkay
  28. Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, USA.

    • Hagit Shatkay
  29. Max Planck Institute for Informatics, Saarbrücken, Germany.

    • Fanny Gatzmann &
    • Ingolf Sommer
  30. Centre for Bioinformatics, Division of Molecular Biosciences, Imperial College, London, UK.

    • Mark N Wass &
    • Michael J E Sternberg
  31. Structural Computational Biology Group, Spanish National Cancer Research Centre, Madrid, Spain.

    • Mark N Wass
  32. Division of Electronics, Rudjer Boskovic Institute, Zagreb, Croatia.

    • Nives Škunca,
    • Fran Supek,
    • Matko Bošnjak &
    • Tomislav Šmuc
  33. Department of Knowledge Technologies, Jožef Stefan Institute, Ljubljana, Slovenia.

    • Panče Panov &
    • Sašo Džeroski
  34. Biometris, Wageningen University and Research Centre, Wageningen, The Netherlands.

    • Yiannis A I Kourmpetis,
    • Aalt D J van Dijk &
    • Cajo J F ter Braak
  35. Bioinformatics Systems, Nestlé Institute of Health Sciences, Lausanne, Switzerland.

    • Yiannis A I Kourmpetis
  36. Applied Bioinformatics, Plant Research International, Wageningen, The Netherlands.

    • Aalt D J van Dijk
  37. Institute of Biostatistics, School of Life Sciences, Fudan University, Shanghai, China.

    • Yuanpeng Zhou,
    • Qingtian Gong,
    • Xinran Dong &
    • Weidong Tian
  38. Department of Molecular Medicine, University of Padova, Padova, Italy.

    • Marco Falda,
    • Enrico Lavezzo &
    • Stefano Toppo
  39. Istituto Agrario San Michele all'Adige Research and Innovation Centre, Trento, Italy.

    • Paolo Fontana
  40. Department of Information Engineering, University of Padova, Padova, Italy.

    • Barbara Di Camillo
  41. Department of Computer and Information Sciences, Temple University, Philadelphia, Pennsylvania, USA.

    • Liang Lan,
    • Nemanja Djuric,
    • Yuhong Guo &
    • Slobodan Vucetic
  42. Swiss Institute of Bioinformatics, Geneva, Switzerland.

    • Amos Bairoch
  43. Department of Human Protein Sciences, University of Geneva, Geneva, Switzerland.

    • Amos Bairoch
  44. Department of Biological Chemistry, Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel.

    • Michal Linial
  45. Department of Microbiology, Miami University, Oxford, Ohio, USA.

    • Iddo Friedberg
  46. Department of Computer Science and Software Engineering, Miami University, Oxford, Ohio, USA.

    • Iddo Friedberg

Contributions

P.R. and I.F. conceived of the CAFA experiment, supervised the project and wrote most of the manuscript. S.D.M. participated in the design of and supervised the method assessment. W.T.C. performed the analysis of feasibility of the experiment and most of the target and performance analysis and contributed to writing. P.R. and W.T.C. designed and produced figures. T.R.O. developed the web interface, including the portal for submission and the storage of predictions. T.R.O. and T.W. verified the assessment code and participated in analysis. A.M.S. designed and performed the analysis of targets. A. Bairoch, M.L., P.C.B., S.E.B., C.O. and B.R. steered the CAFA experiment, provided critical guidance and participated in writing. The remaining authors participated in the experiment, provided writing and data for their methods and contributed comments on the manuscript.

Competing financial interests

The authors declare no competing financial interests.

Supplementary information

PDF files

  1. Supplementary Text and Figures (3M)

    Supplementary Figures 1–8, Supplementary Table 3 and Supplementary Note

Excel files

  1. Supplementary Table 1 (98K)

    List of all target sequences and their experimentally determined functional terms.

  2. Supplementary Table 2 (25K)

    Area under the ROC curves (AUC) for the functional terms covering at least 15 target sequences.
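
As an illustration of the per-term AUC reported in Supplementary Table 2, the sketch below computes the area under the ROC curve as the probability that a target annotated with a term receives a higher score than a target without it, counting ties as one half. It is a hedged example rather than the procedure used for the table: the function names, the treatment of missing scores as 0.0, and the `min_targets` parameter (mirroring the "at least 15 target sequences" cutoff) are assumptions made here.

```python
def auc(pos_scores, neg_scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties contribute one half. This is the rank-based (Mann-Whitney)
    view of the area under the ROC curve.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def per_term_auc(truth, preds, min_targets=15):
    """Compute AUC for every term annotated on at least min_targets targets.

    truth: list of sets of true terms, one set per target
    preds: list of dicts {term: score}, one dict per target
    """
    terms = set().union(*truth)
    result = {}
    for term in terms:
        # Assumption: a target with no score for a term is scored 0.0.
        pos = [preds[i].get(term, 0.0)
               for i, s in enumerate(truth) if term in s]
        neg = [preds[i].get(term, 0.0)
               for i, s in enumerate(truth) if term not in s]
        if len(pos) < min_targets or not neg:
            continue
        result[term] = auc(pos, neg)
    return result
```

The pairwise loop is quadratic in the number of targets, which is acceptable for a target set of a few hundred proteins; a rank-based formulation would scale better for larger sets.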
