Automated annotation of protein function is challenging. As the number of sequenced genomes rapidly grows, the overwhelming majority of protein products can only be annotated computationally. If computational predictions are to be relied upon, it is crucial that the accuracy of these methods be high. Here we report the results from the first large-scale community-based critical assessment of protein function annotation (CAFA) experiment. Fifty-four methods representing the state of the art for protein function prediction were evaluated on a target set of 866 proteins from 11 organisms. Two findings stand out: (i) today's best protein function prediction algorithms substantially outperform widely used first-generation methods, with large gains on all types of targets; and (ii) although the top methods perform well enough to guide experiments, there is considerable need for improvement of currently available tools.
Supplementary information: Supplementary Figures 1–8, Supplementary Table 3 and Supplementary Note