
  • Analysis

Learning functional properties of proteins with language models

A preprint version of the article is available at bioRxiv.

Abstract

Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study we conducted a detailed investigation of protein representation learning by first categorizing and explaining each approach, and subsequently benchmarking their performance on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method based on the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.
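The first benchmark correlates similarity in the learned embedding space with ontology-based semantic similarity between proteins (Lin similarity over GO terms, evaluated with Spearman's rank correlation; refs 50 and 87). A minimal sketch of that evaluation idea is shown below; the function names and the toy data are illustrative, not the paper's actual PROBE code, which is at https://github.com/kansil/PROBE.

```python
import numpy as np
from scipy.stats import spearmanr


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarities between row-vector protein embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return normed @ normed.T


def semantic_similarity_correlation(embeddings, go_similarities):
    """Spearman rank correlation between embedding-space similarity and an
    ontology-based semantic similarity matrix (e.g. Lin similarity), computed
    over the unique unordered protein pairs."""
    n = embeddings.shape[0]
    cos = cosine_similarity_matrix(embeddings)
    iu = np.triu_indices(n, k=1)  # indices of unique unordered pairs
    rho, _pval = spearmanr(cos[iu], go_similarities[iu])
    return rho


# Toy usage: random vectors stand in for learned representations, and a
# hypothetical symmetric matrix stands in for GO-based similarities.
rng = np.random.default_rng(0)
emb = rng.normal(size=(10, 64))
gt = rng.uniform(size=(10, 10))
gt = (gt + gt.T) / 2  # make the ground-truth matrix symmetric
rho = semantic_similarity_correlation(emb, gt)
```

A high correlation indicates that nearness in the representation space tracks functional relatedness, which is the property this benchmark probes.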

Fig. 1: Schematic representation of the study.
Fig. 2: Protein semantic similarity inference benchmark results.
Fig. 3: Ontology-based protein function prediction benchmark results.
Fig. 4: Drug target protein family classification benchmark results.
Fig. 5: Protein–protein binding affinity estimation benchmark results.

Data availability

All of the datasets and results of this study are available for download at https://github.com/kansil/PROBE. Protein representation and MSA files are available via Zenodo at https://doi.org/10.5281/zenodo.5795850 (ref. 116).

Code availability

The source code of this study is available for download at https://github.com/kansil/PROBE. A ready-to-use web tool containing all models from the four benchmarks, which can be used to reproduce the results and to test new representation methods on the same predictive tasks, is available on the CodeOcean platform and is reachable from https://PROBE.kansil.org (ref. 117).

References

  1. Dalkiran, A. et al. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinf. 19, 334 (2018).

  2. Dobson, P. D. & Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. 330, 771–783 (2003).

  3. Latino, D. A. R. S. & Aires-de-Sousa, J. Assignment of EC numbers to enzymatic reactions with MOLMAP reaction descriptors and random forests. J. Chem. Inf. Model. 49, 1839–1846 (2009).

  4. Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10, e0141287 (2015).

  5. Kimothi, D., Soni, A., Biyani, P. & Hogan, J. M. Distributed representations for biological sequence analysis. Preprint at https://arxiv.org/abs/1608.05949 (2016).

  6. Nguyen, S., Li, Z. & Shang, Y. Deep networks and continuous distributed representation of protein sequences for protein quality assessment. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI) 527–534 (IEEE, 2017); https://doi.org/10.1109/ICTAI.2017.00086

  7. Keskin, O., Tuncbag, N. & Gursoy, A. Predicting protein–protein interactions from the molecular to the proteome level. Chem. Rev. 116, 4884–4909 (2016).

  8. Rifaioglu, A. S. et al. Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Briefings Bioinform. 20, 1878–1912 (2019).

  9. Rifaioglu, A. S. et al. DEEPScreen: high performance drug-target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chem. Sci. 11, 2531–2557 (2020).

  10. Rifaioglu, A. S. et al. MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics 37, 693–704 (2021).

  11. Doğan, T. et al. Protein domain-based prediction of compound–target interactions and experimental validation on LIM kinases. PLoS Comput. Biol. 17, e1009171 (2021).

  12. Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins 86, 7–15 (2018).

  13. Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).

  14. Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).

  15. Rifaioglu, A. S., Doğan, T., Jesus Martin, M., Cetin-Atalay, R. & Atalay, V. DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks. Sci. Rep. 9, 7344 (2019).

  16. You, R. et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465–2473 (2018).

  17. Jain, A. & Kihara, D. Phylo-PFP: improved automated protein function prediction using phylogenetic distance of distantly related sequences. Bioinformatics 35, 753–759 (2019).

  18. The Gene Ontology Consortium. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338 (2019).

  19. Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).

  20. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).

  21. Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).

  22. Liu, L. et al. Deep learning for generic object detection: a survey. Int. J. Comput. Vision 128, 261–318 (2020).

  23. Zhang, C., Patras, P. & Haddadi, H. Deep learning in mobile and wireless networking: a survey. IEEE Commun. Surv. Tutor. 21, 2224–2287 (2019).

  24. Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).

  25. Weiss, K., Khoshgoftaar, T. M. & Wang, D. A survey of transfer learning. J. Big Data 3, 1817 (2016).

  26. Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint at https://arxiv.org/abs/1910.10683 (2019).

  27. Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).

  28. Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems Vol. 34 (NeurIPS, 2021).

  29. Elnaggar, A. et al. ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. Preprint at https://arxiv.org/abs/2007.06225 (2020).

  30. Yang, K. K., Wu, Z., Bedbrook, C. N. & Arnold, F. H. Learned protein embeddings for machine learning. Bioinformatics 34, 2642–2648 (2018).

  31. Heinzinger, M. et al. Modeling the language of life-deep learning protein sequences. Bioinformatics 360, 540 (2019).

  32. Kim, S., Lee, H., Kim, K. & Kang, J. Mut2Vec: distributed representation of cancerous mutations. BMC Med. Genomics 11, 33 (2018).

  33. Du, J. et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20, 82 (2019).

  34. Choy, C. T., Wong, C. H. & Chan, S. L. Infer related genes from large scale gene expression dataset with embedding. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/362848v2 (2018).

  35. Rao, R. et al. MSA transformer. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.02.12.430858v3 (2021).

  36. Lu, A. X., Zhang, H., Ghassemi, M. & Moses, A. Self-supervised contrastive learning of protein representations by mutual information maximization. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2020.09.04.283929v2 (2020).

  37. Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).

  38. Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).

  39. Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).

  40. Johnson, L. S., Eddy, S. R. & Portugaly, E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinf. 11, 431 (2010).

  41. Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).

  42. Gromiha, M. M. Protein Sequence Analysis. In Protein Bioinformatics (ed. Gromiha, M. M.) Ch. 2, 29–62 (Academic, 2010); https://doi.org/10.1016/B978-8-1312-2297-3.50002-3

  43. Chou, K.-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19 (2005).

  44. Wang, J. et al. POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics 33, 2756–2758 (2017).

  45. Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213–D221 (2015).

  46. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).

  47. Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2021).

  48. Mirabello, C. & Wallner, B. rawMSA: end-to-end deep learning using raw multiple sequence alignments. PLoS ONE 14, e0220182 (2019).

  49. Xu, Y., Song, J., Wilson, C. & Whisstock, J. C. PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction. Sci. Rep. 8, 8240 (2018).

  50. Lin, D. An information-theoretic definition of similarity. In ICML '98: Proc. 15th International Conference on Machine Learning 296–304 (ACM, 1998).

  51. Pedregosa, F., Varoquaux, G. & Gramfort, A. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

  52. Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1160 (2021).

  53. Villegas-Morcillo, A. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170 (2021).

  54. Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).

  55. Vaswani, A. et al. Attention is all you need. in Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5998–6008 (Curran Associates, 2017).

  56. Vig, J. et al. BERTology meets biology: interpreting attention in protein language models. Preprint at https://arxiv.org/abs/2006.15222 (2020).

  57. Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).

  58. Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1–21 (2012).

  59. Brysbaert, M., Stevens, M., Mandera, P. & Keuleers, E. How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Front. Psychol. 7, 1116 (2016).

  60. Higgins, I. et al. Towards a definition of disentangled representations. Preprint at https://arxiv.org/abs/1812.02230 (2018).

  61. Tubiana, J., Cocco, S. & Monasson, R. Learning protein constitutive motifs from sequence data. eLife 8, e39397 (2019).

  62. Öztürk, H., Ozkirimli, E. & Özgür, A. WideDTA: prediction of drug-target binding affinity. Preprint at https://arxiv.org/abs/1902.04166 (2019).

  63. Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).

  64. Doğan, T. et al. CROssBAR: Comprehensive resource of biomedical relations with knowledge graph representations. Nucleic Acids Res. 49, e96–e96 (2021).

  65. Burk, M. J. & Van Dien, S. Biotechnology for chemical production: challenges and opportunities. Trends Biotechnol. 34, 187–190 (2016).

  66. Gainza, P., Nisonoff, H. M. & Donald, B. R. Algorithms for protein design. Curr. Opin. Struct. Biol. 39, 16–26 (2016).

  67. Baker, D. An exciting but challenging road ahead for computational enzyme design. Protein Sci. 19, 1817–1819 (2010).

  68. Röthlisberger, D. et al. Kemp elimination catalysts by computational enzyme design. Nature 453, 190–195 (2008).

  69. Privett, H. K. et al. Iterative approach to computational enzyme design. Proc. Natl Acad. Sci. USA 109, 3790–3795 (2012).

  70. Chan, H. S., Shimizu, S. & Kaya, H. Cooperativity principles in protein folding. Methods Enzymol. 380, 350–379 (2004).

  71. Lippow, S. M., Wittrup, K. D. & Tidor, B. Computational design of antibody-affinity improvement beyond in vivo maturation. Nat. Biotechnol. 25, 1171–1176 (2007).

  72. Looger, L. L., Dwyer, M. A., Smith, J. J. & Hellinga, H. W. Computational design of receptor and sensor proteins with novel functions. Nature 423, 185–190 (2003).

  73. Duan, Y. et al. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 24, 1999–2012 (2003).

  74. Brunk, E. & Rothlisberger, U. Mixed quantum mechanical/molecular mechanical molecular dynamics simulations of biological systems in ground and electronically excited states. Chem. Rev. 115, 6217–6263 (2015).

  75. Childers, M. C. & Daggett, V. Insights from molecular dynamics simulations for computational protein design. Mol. Syst. Des. Eng. 2, 9–33 (2017).

  76. Hollingsworth, S. A. & Dror, R. O. Molecular dynamics simulation for all. Neuron 99, 1129–1143 (2018).

  77. Camilloni, C. & Vendruscolo, M. Statistical mechanics of the denatured state of a protein using replica-averaged metadynamics. J. Am. Chem. Soc. 136, 8982–8991 (2014).

  78. Huang, S.-Y. & Zou, X. Statistical mechanics-based method to extract atomic distance-dependent potentials from protein structures. Proteins 79, 2648–2661 (2011).

  79. Pierce, N. A. & Winfree, E. Protein design is NP-hard. Protein Eng. 15, 779–782 (2002).

  80. Eguchi, R. R., Anand, N., Choe, C. A. & Huang, P.-S. IG-VAE: Generative modeling of immunoglobulin proteins by direct 3D coordinate generation. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2020.08.07.242347v2 (2020).

  81. Ng, A. Y. & Jordan, M. I. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (eds. Dietterich, T. G., Becker, S. & Ghahramani, Z.) Vol. 14, 841–848 (MIT Press, 2002).

  82. Salakhutdinov, R. Learning deep generative models. Annu. Rev. Stat. Appl. 2, 361–385 (2015).

  83. Madani, A. et al. Deep neural language modeling enables functional protein generation across families. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.07.18.452833v1 (2021).

  84. Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinformatics Advances 1, vbab035 (2021).

  85. Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).

  86. McInnes, B. T. & Pedersen, T. Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text. J. Biomed. Inform. 46, 1116–1124 (2013).

  87. Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).

  88. Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).

  89. Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).

  90. Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).

  91. Moal, I. H. & Fernández-Recio, J. SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28, 2600–2607 (2012).

  92. Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314 (2019).

  93. Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001).

  94. Wan, F. & Zeng, J. (M.). Deep learning with feature embedding for compound–protein interaction prediction. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/086033v1 (2016).

  95. Asgari, E., McHardy, A. C. & Mofrad, M. R. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci. Rep. 9, 3577 (2019).

  96. Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 34, i821–i829 (2018).

  97. Oubounyt, M., Louadi, Z., Tayara, H. & To Chong, K. Deep learning models based on distributed feature representations for alternative splicing prediction. IEEE Access 6, 58826–58834 (2018).

  98. Mirabello, C. & Wallner, B. rawMSA: End-to-end deep learning makes protein sequence profiles and feature extraction obsolete. Bioinformatics 228 (2018).

  99. Dutta, A., Dubey, T., Singh, K. K. & Anand, A. SpliceVec: distributed feature representations for splice junction prediction. Comput. Biol. Chem. 74, 434–441 (2018).

  100. Mejía-Guerra, M. K. & Buckler, E. S. A k-mer grammar analysis to uncover maize regulatory architecture. BMC Plant Biol. 19, 103 (2019).

  101. Cohen, T., Widdows, D., Heiden, J. A. V., Gupta, N. T. & Kleinstein, S. H. Graded vector representations of immunoglobulins produced in response to west Nile virus. In Quantum Interaction (eds de Barros, J. A., Coecke, B. & Pothos, E.) 135–148 (Springer, 2017).

  102. Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. Preprint at https://arxiv.org/abs/1701.06279 (2017).

  103. Jaeger, S., Fulle, S. & Turk, S. Mol2vec: Unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58, 27–35 (2018).

  104. Viehweger, A., Krautwurst, S., Parks, D. H., König, B. & Marz, M. An encoding of genome content for machine learning. Preprint at https://www.biorxiv.org/content/10.1101/524280v3 (2019).

  105. Qi, Y., Oja, M., Weston, J. & Noble, W. S. A unified multitask architecture for predicting local protein properties. PLoS ONE 7, e32235 (2012).

  106. Melvin, I., Weston, J., Noble, W. S. & Leslie, C. Detecting remote evolutionary relationships among proteins by large-scale semantic embedding. PLoS Comput. Biol. 7, e1001047 (2011).

  107. Choi, J., Oh, I., Seo, S. & Ahn, J. G2Vec: distributed gene representations for identification of cancer prognostic genes. Sci. Rep. 8, 13729 (2018).

  108. You, R. & Zhu, S. DeepText2Go: Improving large-scale protein function prediction with deep semantic text representation. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 42–49 (IEEE, 2017); https://doi.org/10.1109/BIBM.2017.8217622

  109. Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. Preprint at https://arxiv.org/abs/1902.08661 (2019).

  110. Schwartz, A. S. et al. Deep semantic protein representation for annotation, discovery, and engineering. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/365965v1 (2018).

  111. Kané, H., Coulibali, M., Abdalla, A. & Ajanoh, P. Augmenting protein network embeddings with sequence information. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/730481v3 (2019).

  112. Faisal, M. R. et al. Improving protein sequence classification performance using adjacent and overlapped segments on existing protein descriptors. JBiSE 11, 126–143 (2018).

  113. Strodthoff, N., Wagner, P., Wenzel, M. & Samek, W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics 36, 2401–2409 (2020).

  114. Asgari, E., Poerner, N., McHardy, A. C. & Mofrad, M. R. K. DeepPrime2Sec: deep learning for protein secondary structure prediction from the primary sequences. Preprint at bioRxiv https://www.biorxiv.org/content/early/2019/07/18/705426 (2019)

  115. Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01179-w (2022).

  116. Unsal, S. et al. Learning Functional Properties of Proteins with Language Models Data Sets (Zenodo, 2020); https://doi.org/10.5281/zenodo.5795850

  117. Unsal, S. et al. PROBE (Protein Representation Benchmark): Function-Centric Evaluation of Protein Representation Methods (Code Ocean, 2021); https://doi.org/10.24433/CO.5123923.v2

Acknowledgements

This work was supported by TUBITAK (project no. 318S218). We thank G. Tatar (faculty member, KTU, Turkey) for reading and commenting on the manuscript, and G.M.Ç. Şılbır (PhD candidate, KTU, Turkey) for contributing to the drawing of figures.

Author information

Authors and Affiliations

Authors

Contributions

T.D., S.U. and A.C.A. conceived the idea and planned the work. S.U. evaluated the literature, constructed representation vectors, prepared the datasets and carried out the analysis for the semantic similarity inference and protein–protein binding affinity estimation benchmarks. H.A. and S.U. prepared the datasets and carried out the analysis for the ontology-based protein function prediction benchmark. M.A. and S.U. prepared the datasets and carried out the analysis for the drug target protein family classification benchmark. S.U., A.C.A. and T.D. wrote the manuscript. T.D., A.C.A. and K.T. supervised the overall study. All authors approved the manuscript.

Corresponding author

Correspondence to Tunca Doğan.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Machine Intelligence thanks Christian Dallago, Céline Marquet and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

1. Different approaches for representing proteins. 2. Classical protein representations. 3. Protein representation learning. 4. Protein representation methods benchmarked in this study. 5. Objective-based classification of a comprehensive list of protein representations. 6. Traits of successful protein representations. 7. Performance evaluation metrics. 8. Extended results. 9. Extended discussion. Supplementary Figs. 1–15 and Tables 1–11.

Rights and permissions

Reprints and permissions

About this article


Cite this article

Unsal, S., Atas, H., Albayrak, M. et al. Learning functional properties of proteins with language models. Nat Mach Intell 4, 227–245 (2022). https://doi.org/10.1038/s42256-022-00457-9
