Abstract
Recent successes in protein function prediction have shown the superiority of approaches that integrate multiple types of experimental evidence over methods that rely solely on homology. However, newly sequenced organisms remain a difficult challenge, because only their protein sequences are available and they lack data derived from large-scale experiments. Here we introduce S2F (Sequence to Function), a network propagation approach for the functional annotation of newly sequenced organisms. Our main idea is to systematically transfer functionally relevant data from model organisms to newly sequenced ones, thus allowing us to use a label propagation approach. S2F introduces a novel label diffusion algorithm that can account for the presence of overlapping communities of proteins with related functions. As most newly sequenced organisms are bacteria, we tested our approach in the context of bacterial genomes. Our extensive evaluation shows a great improvement over existing sequence-based methods, as well as over four state-of-the-art general-purpose protein function prediction methods. Our work demonstrates that employing a diffusion process over networks of transferred functional data is an effective way to improve predictions over simple homology. S2F is applicable to any type of newly sequenced organism as well as to those for which experimental evidence is available. A free, easy-to-run version of S2F is available at https://www.paccanarolab.org/s2f.
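The abstract does not spell out the S2F diffusion algorithm, so the sketch below is not the authors' method. It illustrates the general idea of label propagation over a protein network using the standard "local and global consistency" iteration of Zhou et al. (listed in the References), in which scores are repeatedly updated as F = alpha * S * F + (1 - alpha) * Y. The function name, the parameter values and the toy network are assumptions made purely for illustration.

```python
import math


def propagate_labels(weights, labels, alpha=0.8, iterations=200):
    """Iterate F <- alpha * S * F + (1 - alpha) * Y (Zhou et al. consistency method).

    weights: symmetric n x n list of lists of non-negative edge weights.
    labels:  n x k list of lists; labels[i][j] is 1.0 if protein i is
             known to carry function j, and 0.0 otherwise.
    """
    n = len(weights)
    k = len(labels[0])
    # Symmetric normalisation S = D^(-1/2) W D^(-1/2); isolated nodes get zero rows.
    degree = [sum(row) for row in weights]
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degree]
    s = [[inv_sqrt[i] * weights[i][j] * inv_sqrt[j] for j in range(n)]
         for i in range(n)]

    scores = [row[:] for row in labels]
    for _ in range(iterations):
        # Each protein mixes its neighbours' current scores with its own initial label.
        scores = [
            [alpha * sum(s[i][m] * scores[m][j] for m in range(n))
             + (1.0 - alpha) * labels[i][j]
             for j in range(k)]
            for i in range(n)
        ]
    return scores


# Hypothetical toy network: a path 0 - 1 - 2 - 3 in which only protein 0 is annotated.
toy_weights = [
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 0.0],
]
toy_labels = [[1.0], [0.0], [0.0], [0.0]]
toy_scores = propagate_labels(toy_weights, toy_labels)
```

On this toy path the propagated score is highest at the annotated protein and falls off with network distance from it, which is the behaviour a diffusion-based annotation transfer relies on. In a newly sequenced organism the network itself would have to be built from data transferred from other organisms, as the abstract describes.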
Data availability
The input sequence files (ref. 25) in FASTA format for all the organisms used in this paper are available at https://doi.org/10.5281/zenodo.5514323. The same URL also contains the detailed list of all organisms excluded when testing each specific bacterium.
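For readers who want to inspect the FASTA input files, the following is a minimal sketch of a FASTA reader in plain Python. It is not part of S2F; the function name and the two example records are hypothetical, and it assumes the usual layout of a '>' header line followed by one or more sequence lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines of one record are concatenated.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical two-record example; real files would be opened with open(path).
example = """>protA hypothetical protein
MKT
AYI
>protB
MSE
""".splitlines()
records = list(read_fasta(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed in directly, so large proteome files are read one line at a time.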
Code availability
The code for S2F is freely available and maintained at https://www.paccanarolab.org/s2f. The exact version (ref. 26) used for this publication is available at https://doi.org/10.5281/zenodo.5513071.
References
Cruz, L. M., Trefflich, S., Weiss, V. A. & Castro, M. A. A. Protein function prediction. Methods Mol. Biol. 1654, 55–75 (2017).
Shehu, A., Barbará, D. & Molloy, K. in Big Data Analytics in Genomics (ed. Wong, K.-C.) 225–298 (Springer, 2016); https://doi.org/10.1007/978-3-319-41279-5_7
Jiang, Y. et al. An expanded evaluation of protein function prediction methods shows an improvement in accuracy. Genome Biol. 17, 184 (2016).
Ashburner, M. et al. Gene Ontology: tool for the unification of biology. Nat. Genet. 25, 25–29 (2000).
Cowen, L., Ideker, T., Raphael, B. J. & Sharan, R. Network propagation: a universal amplifier of genetic associations. Nat. Rev. Genet. 18, 551–562 (2017).
Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).
Valentini, G. True path rule hierarchical ensembles for genome-wide gene function prediction. IEEE/ACM Trans. Comput. Biol. Bioinform. 8, 832–847 (2011).
Friedberg, I. & Radivojac, P. in The Gene Ontology Handbook (eds Dessimoz, C. & Škunca, N.) 133–146 (Springer, 2017); https://doi.org/10.1007/978-1-4939-3743-1_10
Obozinski, G., Lanckriet, G., Grant, C., Jordan, M. I. & Noble, W. S. Consistent probabilistic outputs for protein function prediction. Genome Biol. 9, S6 (2008).
Mitchell, A. L. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res. 47, D351–D360 (2019).
Walhout, A. J. et al. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science 287, 116–122 (2000).
Yu, H. et al. Annotation transfer between genomes: protein-protein interologs and protein-DNA regulogs. Genome Res. 14, 1107–1118 (2004).
Ben-Hur, A. & Noble, W. S. Kernel methods for predicting protein-protein interactions. Bioinformatics 21, i38–i46 (2005).
Sharan, R. et al. Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA 102, 1974–1979 (2005).
Szklarczyk, D. et al. STRING v11: protein-protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 47, D607–D613 (2019).
Mostafavi, S., Ray, D., Warde-Farley, D., Grouios, C. & Morris, Q. GeneMANIA: a real-time multiple association network integration algorithm for predicting gene function. Genome Biol. 9, S4 (2008).
Huntley, R. P. et al. The GOA database: gene ontology annotation updates for 2015. Nucleic Acids Res. 43, D1057–D1063 (2015).
Lavezzo, E., Falda, M., Fontana, P., Bianco, L. & Toppo, S. Enhancing protein function prediction with taxonomic constraints—the Argot2.5 web server. Methods 93, 15–23 (2016).
Kulmanov, M. & Hoehndorf, R. DeepGOPlus: improved protein function prediction from sequence. Bioinformatics 36, 422–429 (2020).
You, R. et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465–2473 (2018).
You, R. et al. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res. 47, W379–W387 (2019).
Makrodimitris, S., van Ham, R. C. H. J. & Reinders, M. J. T. Automatic gene function prediction in the 2020s. Genes 11, 1264 (2020).
Cao, M. et al. Going the distance for protein function prediction: a new distance metric for protein interaction networks. PLoS ONE 8, e76339 (2013).
Zhou, D., Bousquet, O., Lal, T. N., Weston, J. & Schölkopf, B. Learning with local and global consistency. In Proc. 16th International Conference on Neural Information Processing Systems (eds Thrun, S. et al.) 321–328 (MIT, 2004).
Torres, M., Yang, H., Romero, A. E. & Paccanaro, A. Input data for 'Protein function prediction for newly sequenced organisms'. Zenodo https://doi.org/10.5281/ZENODO.5514323 (2021).
Torres, M., Yang, H., Romero, A. E. & Paccanaro, A. Source code for 'Protein function prediction for newly sequenced organisms'. Zenodo https://doi.org/10.5281/ZENODO.5513071 (2021).
UniProt Consortium. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Res. 47, D506–D515 (2019).
Acknowledgements
The first idea for this project was conceived in discussions with T. Gianoulis, whom we remember dearly for her intelligence, kindness, enthusiasm and passion for research. We also thank P. Bhat, T. Nepusz, J. Caceres, M. Frasca, G. Valentini, A. Devoto, L. Bögre, R. Sasidharan and M. Gerstein for many important and stimulating discussions. A.P. was supported by Biotechnology and Biological Sciences Research Council (https://bbsrc.ukri.org/) grant numbers BB/K004131/1, BB/F00964X/1 and BB/M025047/1, Medical Research Council (https://mrc.ukri.org) grant number MR/T001070/1, Consejo Nacional de Ciencia y Tecnología Paraguay (https://www.conacyt.gov.py/) grant numbers 14-INV-088 and PINV15–315, National Science Foundation Advances in Bio Informatics (https://www.nsf.gov/) grant number 1660648, Fundação de Amparo à Pesquisa do Estado do Rio de Janeiro grant number E-26/201.079/2021 (260380) and Fundação Getulio Vargas.
Author information
Contributions
A.P. conceived the study. A.P. and H.Y. devised the algorithms, developed the prototype and performed preliminary evaluations. M.T. and A.E.R. implemented and extended the algorithms and evaluation metrics, performed large-scale experiments and analysed the results. A.P., M.T. and A.E.R. wrote the manuscript and evaluated the biological relevance of the results. All authors discussed the results and implications. A.P. supervised the project.
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Peer review information: Nature Machine Intelligence thanks Jiecong Lin and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
Supplementary Figs. 1–58, Notes 1–18 and Table 1.
About this article
Cite this article
Torres, M., Yang, H., Romero, A.E. et al. Protein function prediction for newly sequenced organisms. Nat Mach Intell 3, 1050–1060 (2021). https://doi.org/10.1038/s42256-021-00419-7
DOI: https://doi.org/10.1038/s42256-021-00419-7