Abstract
Understanding the relationship between amino acid sequence and protein function is a long-standing challenge with far-reaching scientific and translational implications. State-of-the-art alignment-based techniques cannot predict function for one-third of microbial protein sequences, hampering our ability to exploit data from diverse organisms. Here, we train deep learning models to accurately predict functional annotations for unaligned amino acid sequences across rigorous benchmark assessments built from the 17,929 families of the protein families database Pfam. The models infer known patterns of evolutionary substitutions and learn representations that accurately cluster sequences from unseen families. Combining deep models with existing methods significantly improves remote homology detection, suggesting that the deep models learn complementary information. This approach extends the coverage of Pfam by >9.5%, exceeding additions made over the last decade, and predicts function for 360 human reference proteome proteins with no previous Pfam annotation. These results suggest that deep learning models will be a core component of future protein annotation tools.
Data availability
The data splits described in this manuscript are available for download at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/random_split and https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/clustered_split, and an interactive notebook for data loading is available at https://www.kaggle.com/googleai/pfam-seed-random-split. Model predictions for Pfam-N are freely available to download as part of the Pfam v.34.0 release from http://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam34.0/.
Code availability
All deep models were implemented and trained with the TensorFlow API (tensorflow-gpu v.1.15.4), using the architectures described in the Methods. Code that documents model training using Python v.3.7 is available on GitHub at https://github.com/google-research/google-research/tree/master/using_dl_to_annotate_protein_universe. The training and validation datasets used for creating each model are available as described in the preceding section. Trained models are available in Google Cloud Storage at https://console.cloud.google.com/storage/browser/brain-genomics-public/research/proteins/pfam/models/single_domain_per_sequence_zipped_models, including the ensembles trained on the Pfam-seed random split, Pfam-seed clustered split, Pfam-full random split (all Pfam v.32.0) and the models used to generate Pfam-N v.34.0. ProtCNN inference was run using a custom Python script that (1) read in FASTA records and (2) ran inference with ProtCNN loaded as a TensorFlow SavedModel. An interactive notebook that demonstrates inference using ProtCNN is available at https://colab.research.google.com/github/google-research/google-research/blob/master/using_dl_to_annotate_protein_universe/neural_network/Neural_network_accuracy_on_random_seed_split.ipynb. An interactive notebook showing how to use the trained models to produce Pfam class predictions as well as embeddings is available via Colab at https://colab.sandbox.google.com/github/google-research/google-research/blob/master/using_dl_to_annotate_protein_universe/Using_Deep_Learning_to_Annotate_the_Protein_Universe.ipynb.
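The custom inference script itself is not reproduced in this section. As a rough illustration of the two steps it is described as performing (reading FASTA records, then running a model on each sequence), the following minimal sketch may help. The names `read_fasta`, `annotate` and `predict_fn` are hypothetical and do not come from the released code; `predict_fn` stands in for whatever callable wraps the loaded TensorFlow SavedModel, which is not shown here.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            tokens = line[1:].split()
            # Keep only the first whitespace-delimited token as the identifier.
            header = tokens[0] if tokens else ""
            chunks = []
        elif header is not None:
            # Sequence data may be wrapped over several lines.
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def annotate(
    records: Iterable[Tuple[str, str]],
    predict_fn: Callable[[str], object],
) -> Dict[str, object]:
    """Apply a prediction callable (assumed, not the real model) to each sequence."""
    return {name: predict_fn(sequence) for name, sequence in records}


# Toy usage: `len` is a placeholder for a real model wrapper.
example = [">seq1 example protein", "mkvl", "AAGT", ">seq2", "GHWK"]
predictions = annotate(read_fasta(example), predict_fn=len)
```

In practice `predict_fn` would encode the amino acid string in whatever input format the trained model expects and return a Pfam family label; the notebooks linked above show the authors' actual approach.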
Acknowledgements
We thank J. Smith for countless conversations and guidance throughout this project; E. Bixby for an implementation of ragged tensor processing that sped up our ProtCNN implementation substantially on GPU; C. McClean, B. Alipanahi and S. Kearnes for extensive proofreading and feedback; and Z. Nado for programming advice. L.J.C. gratefully acknowledges support from the Simons Foundation.
Author information
Contributions
M.L.B., D.B., M.A.D. and L.J.C. conceived the study. All authors designed, implemented and used machine learning models to annotate protein domain sequences, analyzed the data and developed the approach used for Pfam-N. M.L.B., D.B. and L.J.C. wrote the paper, with input from all authors.
Ethics declarations
Competing interests
M.L.B., D.B., D.H.B., T.S., B.C., D.S., M.A.D. and L.J.C. performed research as part of their employment at Google LLC. Google is a technology company that sells machine learning services as part of its business. Portions of this work are covered by US patent WO2020210591A1, filed by Google.
Peer review
Peer review information
Nature Biotechnology thanks Christian Dallago for their contribution to the peer review of this work.
Supplementary information
Supplementary Figs. 1–13, Methods and Tables 1–19.
About this article
Cite this article
Bileschi, M.L., Belanger, D., Bryant, D.H. et al. Using deep learning to annotate the protein universe. Nat Biotechnol 40, 932–937 (2022). https://doi.org/10.1038/s41587-021-01179-w
DOI: https://doi.org/10.1038/s41587-021-01179-w