Abstract
Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study, we conducted a detailed investigation of protein representation learning by first categorizing and explaining each approach, and subsequently benchmarking their performance on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method in light of the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.
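The first benchmark above — checking how well similarity in embedding space tracks ontology-derived semantic similarity — can be sketched as follows. This is an illustrative outline, not the authors' PROBE implementation: the protein embeddings and GO-based similarity scores here are random stand-ins for real data.

```python
# Illustrative sketch (not the authors' PROBE code): scoring a protein
# representation by correlating cosine similarity in embedding space with
# ontology-derived semantic similarity, summarized by Spearman's rho.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman rank correlation (no tie handling; adequate for random floats)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
# Hypothetical fixed-size embeddings for ten proteins (stand-ins for real vectors).
embeddings = {f"P{i}": rng.normal(size=64) for i in range(10)}
pairs = [(f"P{i}", f"P{j}") for i in range(10) for j in range(i + 1, 10)]

predicted = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
# Placeholder for GO-based (for example, Lin-measure) semantic similarity scores.
go_similarity = rng.uniform(size=len(pairs))

rho = spearman(predicted, go_similarity)
print(f"Spearman rho over {len(pairs)} pairs: {rho:.3f}")
```

In the actual benchmark, `go_similarity` would be computed from curated Gene Ontology annotations with a semantic similarity measure such as Lin's; the Spearman coefficient then summarizes how well the representation preserves functional similarity.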
Data availability
All of the datasets and results of this study are available for download at https://github.com/kansil/PROBE. Protein representation and MSA files are available via Zenodo at https://doi.org/10.5281/zenodo.5795850 (ref. 116).
Code availability
The source code of this study is available for download at https://github.com/kansil/PROBE. A ready-to-use web tool containing all models of the four benchmarks, which can be used to reproduce the results and to test new representation methods on the same predictive tasks, is available on the CodeOcean platform and reachable at https://PROBE.kansil.org (ref. 117).
References
Dalkiran, A. et al. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinf. 19, 334 (2018).
Dobson, P. D. & Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. 330, 771–783 (2003).
Latino, D. A. R. S. & Aires-de-Sousa, J. Assignment of EC numbers to enzymatic reactions with MOLMAP reaction descriptors and random forests. J. Chem. Inf. Model. 49, 1839–1846 (2009).
Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10, e0141287 (2015).
Kimothi, D., Soni, A., Biyani, P. & Hogan, J. M. Distributed representations for biological sequence analysis. Preprint at https://arxiv.org/abs/1608.05949 (2016).
Nguyen, S., Li, Z. & Shang, Y. Deep networks and continuous distributed representation of protein sequences for protein quality assessment. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI) 527–534 (IEEE, 2017); https://doi.org/10.1109/ICTAI.2017.00086
Keskin, O., Tuncbag, N. & Gursoy, A. Predicting protein–protein interactions from the molecular to the proteome level. Chem. Rev. 116, 4884–4909 (2016).
Rifaioglu, A. S. et al. Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Briefings Bioinform. 20, 1878–1912 (2019).
Rifaioglu, A. S. et al. DEEPScreen: high performance drug-target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chem. Sci. 11, 2531–2557 (2020).
Rifaioglu, A. S. et al. MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics 37, 693–704 (2021).
Doğan, T. et al. Protein domain-based prediction of compound–target interactions and experimental validation on LIM kinases. PLoS Comput. Biol. 17, e1009171 (2021).
Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins 86, 7–15 (2018).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Rifaioglu, A. S., Doğan, T., Jesus Martin, M., Cetin-Atalay, R. & Atalay, V. DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks. Sci. Rep. 9, 7344 (2019).
You, R. et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465–2473 (2018).
Jain, A. & Kihara, D. Phylo-PFP: improved automated protein function prediction using phylogenetic distance of distantly related sequences. Bioinformatics 35, 753–759 (2019).
The Gene Ontology Consortium. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338 (2019).
Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
Liu, L. et al. Deep learning for generic object detection: a survey. Int. J. Comput. Vision 128, 261–318 (2020).
Zhang, C., Patras, P. & Haddadi, H. Deep learning in mobile and wireless networking: a survey. IEEE Commun. Surv. Tutor. 21, 2224–2287 (2019).
Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).
Weiss, K., Khoshgoftaar, T. M. & Wang, D. A survey of transfer learning. J. Big Data 3, 1817 (2016).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint at https://arxiv.org/abs/1910.10683 (2019).
Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems Vol. 34 (NeurIPS, 2021).
Elnaggar, A. et al. ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. Preprint at https://arxiv.org/abs/2007.06225 (2020).
Yang, K. K., Wu, Z., Bedbrook, C. N. & Arnold, F. H. Learned protein embeddings for machine learning. Bioinformatics 34, 2642–2648 (2018).
Heinzinger, M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20, 723 (2019).
Kim, S., Lee, H., Kim, K. & Kang, J. Mut2Vec: distributed representation of cancerous mutations. BMC Med. Genomics 11, 33 (2018).
Du, J. et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20, 82 (2019).
Choy, C. T., Wong, C. H. & Chan, S. L. Infer related genes from large scale gene expression dataset with embedding. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/362848v2 (2018).
Rao, R. et al. MSA transformer. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.02.12.430858v3 (2021).
Lu, A. X., Zhang, H., Ghassemi, M. & Moses, A. Self-supervised contrastive learning of protein representations by mutual information maximization. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2020.09.04.283929v2 (2020).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Johnson, L. S., Eddy, S. R. & Portugaly, E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinf. 11, 431 (2010).
Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
Gromiha, M. M. Protein Sequence Analysis. In Protein Bioinformatics (ed. Gromiha, M. M.) Ch. 2, 29–62 (Academic, 2010); https://doi.org/10.1016/B978-8-1312-2297-3.50002-3
Chou, K.-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19 (2005).
Wang, J. et al. POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics 33, 2756–2758 (2017).
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213–D221 (2015).
UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2021).
Mirabello, C. & Wallner, B. rawMSA: end-to-end deep learning using raw multiple sequence alignments. PLoS ONE 14, e0220182 (2019).
Xu, Y., Song, J., Wilson, C. & Whisstock, J. C. PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction. Sci. Rep. 8, 8240 (2018).
Lin, D. An information-theoretic definition of similarity. In ICML '98: Proc. 15th International Conference on Machine Learning 296–304 (ACM, 1998).
Pedregosa, F., Varoquaux, G. & Gramfort, A. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1160 (2021).
Villegas-Morcillo, A. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170 (2021).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5998–6008 (Curran Associates, 2017).
Vig, J. et al. BERTology meets biology: interpreting attention in protein language models. Preprint at https://arxiv.org/abs/2006.15222 (2020).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1–21 (2012).
Brysbaert, M., Stevens, M., Mandera, P. & Keuleers, E. How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Front. Psychol. 7, 1116 (2016).
Higgins, I. et al. Towards a definition of disentangled representations. Preprint at https://arxiv.org/abs/1812.02230 (2018).
Tubiana, J., Cocco, S. & Monasson, R. Learning protein constitutive motifs from sequence data. eLife 8, e39397 (2019).
Öztürk, H., Ozkirimli, E. & Özgür, A. WideDTA: prediction of drug-target binding affinity. Preprint at https://arxiv.org/abs/1902.04166 (2019).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Doğan, T. et al. CROssBAR: Comprehensive resource of biomedical relations with knowledge graph representations. Nucleic Acids Res. 49, e96–e96 (2021).
Burk, M. J. & Van Dien, S. Biotechnology for chemical production: challenges and opportunities. Trends Biotechnol. 34, 187–190 (2016).
Gainza, P., Nisonoff, H. M. & Donald, B. R. Algorithms for protein design. Curr. Opin. Struct. Biol. 39, 16–26 (2016).
Baker, D. An exciting but challenging road ahead for computational enzyme design. Protein Sci. 19, 1817–1819 (2010).
Röthlisberger, D. et al. Kemp elimination catalysts by computational enzyme design. Nature 453, 190–195 (2008).
Privett, H. K. et al. Iterative approach to computational enzyme design. Proc. Natl Acad. Sci. USA 109, 3790–3795 (2012).
Chan, H. S., Shimizu, S. & Kaya, H. Cooperativity principles in protein folding. Methods Enzymol. 380, 350–379 (2004).
Lippow, S. M., Wittrup, K. D. & Tidor, B. Computational design of antibody-affinity improvement beyond in vivo maturation. Nat. Biotechnol. 25, 1171–1176 (2007).
Looger, L. L., Dwyer, M. A., Smith, J. J. & Hellinga, H. W. Computational design of receptor and sensor proteins with novel functions. Nature 423, 185–190 (2003).
Duan, Y. et al. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 24, 1999–2012 (2003).
Brunk, E. & Rothlisberger, U. Mixed quantum mechanical/molecular mechanical molecular dynamics simulations of biological systems in ground and electronically excited states. Chem. Rev. 115, 6217–6263 (2015).
Childers, M. C. & Daggett, V. Insights from molecular dynamics simulations for computational protein design. Mol. Syst. Des. Eng. 2, 9–33 (2017).
Hollingsworth, S. A. & Dror, R. O. Molecular dynamics simulation for all. Neuron 99, 1129–1143 (2018).
Camilloni, C. & Vendruscolo, M. Statistical mechanics of the denatured state of a protein using replica-averaged metadynamics. J. Am. Chem. Soc. 136, 8982–8991 (2014).
Huang, S.-Y. & Zou, X. Statistical mechanics-based method to extract atomic distance-dependent potentials from protein structures. Proteins 79, 2648–2661 (2011).
Pierce, N. A. & Winfree, E. Protein design is NP-hard. Protein Eng. 15, 779–782 (2002).
Eguchi, R. R., Anand, N., Choe, C. A. & Huang, P.-S. IG-VAE: Generative modeling of immunoglobulin proteins by direct 3D coordinate generation. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2020.08.07.242347v2 (2020).
Ng, A. Y. & Jordan, M. I. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (eds. Dietterich, T. G., Becker, S. & Ghahramani, Z.) Vol. 14, 841–848 (MIT Press, 2002).
Salakhutdinov, R. Learning deep generative models. Annu. Rev. Stat. Appl. 2, 361–385 (2015).
Madani, A. et al. Deep neural language modeling enables functional protein generation across families. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.07.18.452833v1 (2021).
Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinformatics Advances 1, vbab035 (2021).
Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).
McInnes, B. T. & Pedersen, T. Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text. J. Biomed. Inform. 46, 1116–1124 (2013).
Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).
Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
Moal, I. H. & Fernández-Recio, J. SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28, 2600–2607 (2012).
Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314 (2019).
Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001).
Wan, F. & Zeng, J. Deep learning with feature embedding for compound–protein interaction prediction. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/086033v1 (2016).
Asgari, E., McHardy, A. C. & Mofrad, M. R. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci. Rep. 9, 3577 (2019).
Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 34, i821–i829 (2018).
Oubounyt, M., Louadi, Z., Tayara, H. & To Chong, K. Deep learning models based on distributed feature representations for alternative splicing prediction. IEEE Access 6, 58826–58834 (2018).
Mirabello, C. & Wallner, B. rawMSA: end-to-end deep learning makes protein sequence profiles and feature extraction obsolete. Preprint at bioRxiv (2018).
Dutta, A., Dubey, T., Singh, K. K. & Anand, A. SpliceVec: distributed feature representations for splice junction prediction. Comput. Biol. Chem. 74, 434–441 (2018).
Mejía-Guerra, M. K. & Buckler, E. S. A k-mer grammar analysis to uncover maize regulatory architecture. BMC Plant Biol. 19, 103 (2019).
Cohen, T., Widdows, D., Heiden, J. A. V., Gupta, N. T. & Kleinstein, S. H. Graded vector representations of immunoglobulins produced in response to west Nile virus. In Quantum Interaction (eds de Barros, J. A., Coecke, B. & Pothos, E.) 135–148 (Springer, 2017).
Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. Preprint at https://arxiv.org/abs/1701.06279 (2017).
Jaeger, S., Fulle, S. & Turk, S. Mol2vec: Unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58, 27–35 (2018).
Viehweger, A., Krautwurst, S., Parks, D. H., König, B. & Marz, M. An encoding of genome content for machine learning. Preprint at https://www.biorxiv.org/content/10.1101/524280v3 (2019).
Qi, Y., Oja, M., Weston, J. & Noble, W. S. A unified multitask architecture for predicting local protein properties. PLoS ONE 7, e32235 (2012).
Melvin, I., Weston, J., Noble, W. S. & Leslie, C. Detecting remote evolutionary relationships among proteins by large-scale semantic embedding. PLoS Comput. Biol. 7, e1001047 (2011).
Choi, J., Oh, I., Seo, S. & Ahn, J. G2Vec: distributed gene representations for identification of cancer prognostic genes. Sci. Rep. 8, 13729 (2018).
You, R. & Zhu, S. DeepText2Go: Improving large-scale protein function prediction with deep semantic text representation. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 42–49 (IEEE, 2017); https://doi.org/10.1109/BIBM.2017.8217622
Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. Preprint at https://arxiv.org/abs/1902.08661 (2019).
Schwartz, A. S. et al. Deep semantic protein representation for annotation, discovery, and engineering. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/365965v1 (2018).
Kané, H., Coulibali, M., Abdalla, A. & Ajanoh, P. Augmenting protein network embeddings with sequence information. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/730481v3 (2019).
Faisal, M. R. et al. Improving protein sequence classification performance using adjacent and overlapped segments on existing protein descriptors. JBiSE 11, 126–143 (2018).
Strodthoff, N., Wagner, P., Wenzel, M. & Samek, W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics 36, 2401–2409 (2020).
Asgari, E., Poerner, N., McHardy, A. C. & Mofrad, M. R. K. DeepPrime2Sec: deep learning for protein secondary structure prediction from the primary sequences. Preprint at bioRxiv https://www.biorxiv.org/content/early/2019/07/18/705426 (2019).
Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01179-w (2022).
Unsal, S. et al. Learning Functional Properties of Proteins with Language Models Data Sets (Zenodo, 2020); https://doi.org/10.5281/zenodo.5795850
Unsal, S. et al. PROBE (Protein Representation Benchmark): Function-Centric Evaluation of Protein Representation Methods (Code Ocean, 2021); https://doi.org/10.24433/CO.5123923.v2
Acknowledgements
This work was supported by TUBITAK (project no. 318S218). We thank G. Tatar (faculty member, KTU, Turkey) for reading and commenting on the manuscript, and G.M.Ç. Şılbır (PhD candidate, KTU, Turkey) for contributing to the drawing of figures.
Author information
Contributions
T.D., S.U. and A.C.A. conceived the idea and planned the work. S.U. evaluated the literature, constructed the representation vectors, prepared the datasets and carried out the analyses for the semantic similarity inference and protein–protein binding affinity estimation benchmarks. H.A. and S.U. prepared the datasets and carried out the analysis for the ontology-based protein family classification benchmark. M.A. and S.U. prepared the datasets and carried out the analysis for the drug target protein family classification benchmark. S.U., A.C.A. and T.D. wrote the manuscript. T.D., A.C.A. and K.T. supervised the overall study. All authors approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Christian Dallago, Céline Marquet and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
1. Different approaches for representing proteins. 2. Classical protein representations. 3. Protein representation learning. 4. Protein representation methods benchmarked in this study. 5. Objective-based classification of a comprehensive list of protein representations. 6. Traits of successful protein representations. 7. Performance evaluation metrics. 8. Extended results. 9. Extended discussion. Supplementary Figs. 1–15 and Tables 1–11.
Rights and permissions
About this article
Cite this article
Unsal, S., Atas, H., Albayrak, M. et al. Learning functional properties of proteins with language models. Nat Mach Intell 4, 227–245 (2022). https://doi.org/10.1038/s42256-022-00457-9