Abstract
Data-centric approaches have been used to develop predictive methods for elucidating uncharacterized properties of proteins; however, studies indicate that these methods should be further improved to effectively solve critical problems in biomedicine and biotechnology, which can be achieved by better representing the data at hand. Novel data representation approaches mostly take inspiration from language models that have yielded ground-breaking improvements in natural language processing. Lately, these approaches have been applied to the field of protein science and have displayed highly promising results in terms of extracting complex sequence–structure–function relationships. In this study, we conducted a detailed investigation of protein representation learning by first categorizing and explaining each approach, and subsequently benchmarking their performance on predicting: (1) semantic similarities between proteins, (2) ontology-based protein functions, (3) drug target protein families and (4) protein–protein binding affinity changes following mutations. We evaluate and discuss the advantages and disadvantages of each method in light of the benchmark results, source datasets and algorithms used, in comparison with classical model-driven approaches. Finally, we discuss current challenges and suggest future directions. We believe that the conclusions of this study will help researchers apply machine/deep learning-based representation techniques to protein data for various predictive tasks, and inspire the development of novel methods.
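The first benchmark above — checking how well similarity in embedding space tracks ontology-derived semantic similarity — can be sketched as follows. This is an illustrative outline, not the authors' PROBE implementation: the protein embeddings and GO-based similarity scores here are random stand-ins for real data.

```python
# Illustrative sketch (not the authors' PROBE code): scoring a protein
# representation by correlating cosine similarity in embedding space with
# ontology-derived semantic similarity, summarized by Spearman's rho.
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def spearman(x, y):
    """Spearman rank correlation (no tie handling; adequate for random floats)."""
    rx = np.argsort(np.argsort(x))
    ry = np.argsort(np.argsort(y))
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
# Hypothetical fixed-size embeddings for ten proteins (stand-ins for real vectors).
embeddings = {f"P{i}": rng.normal(size=64) for i in range(10)}
pairs = [(f"P{i}", f"P{j}") for i in range(10) for j in range(i + 1, 10)]

predicted = [cosine(embeddings[a], embeddings[b]) for a, b in pairs]
# Placeholder for GO-based (for example, Lin-measure) semantic similarity scores.
go_similarity = rng.uniform(size=len(pairs))

rho = spearman(predicted, go_similarity)
print(f"Spearman rho over {len(pairs)} pairs: {rho:.3f}")
```

In the actual benchmark, `go_similarity` would be computed from curated Gene Ontology annotations with a semantic similarity measure such as Lin's; the Spearman coefficient then summarizes how well the representation preserves functional similarity.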
Data availability
All of the datasets and results of this study are available for download at https://github.com/kansil/PROBE. Protein representation and MSA files are available via Zenodo at https://doi.org/10.5281/zenodo.5795850 (ref. 116).
Code availability
The source code of this study is available for download at https://github.com/kansil/PROBE. A ready-to-use web tool containing all models of the four benchmarks, which can be used to reproduce the results and to test new representation methods on the same predictive tasks, is available on the CodeOcean platform and reachable at https://PROBE.kansil.org (ref. 117).
References
Dalkiran, A. et al. ECPred: a tool for the prediction of the enzymatic functions of protein sequences based on the EC nomenclature. BMC Bioinf. 19, 334 (2018).
Dobson, P. D. & Doig, A. J. Distinguishing enzyme structures from non-enzymes without alignments. J. Mol. Biol. 330, 771–783 (2003).
Latino, D. A. R. S. & Aires-de-Sousa, J. Assignment of EC numbers to enzymatic reactions with MOLMAP reaction descriptors and random forests. J. Chem. Inf. Model. 49, 1839–1846 (2009).
Asgari, E. & Mofrad, M. R. K. Continuous distributed representation of biological sequences for deep proteomics and genomics. PLoS ONE 10, e0141287 (2015).
Kimothi, D., Soni, A., Biyani, P. & Hogan, J. M. Distributed representations for biological sequence analysis. Preprint at https://arxiv.org/abs/1608.05949 (2016).
Nguyen, S., Li, Z. & Shang, Y. Deep networks and continuous distributed representation of protein sequences for protein quality assessment. In 2017 IEEE 29th International Conference on Tools with Artificial Intelligence (ICTAI) 527–534 (IEEE, 2017); https://doi.org/10.1109/ICTAI.2017.00086
Keskin, O., Tuncbag, N. & Gursoy, A. Predicting protein–protein interactions from the molecular to the proteome level. Chem. Rev. 116, 4884–4909 (2016).
Rifaioglu, A. S. et al. Recent applications of deep learning and machine intelligence on in silico drug discovery: methods, tools and databases. Briefings Bioinform. 20, 1878–1912 (2019).
Rifaioglu, A. S. et al. DEEPScreen: high performance drug-target interaction prediction with convolutional neural networks using 2-D structural compound representations. Chem. Sci. 11, 2531–2557 (2020).
Rifaioglu, A. S. et al. MDeePred: novel multi-channel protein featurization for deep learning-based binding affinity prediction in drug discovery. Bioinformatics 37, 693–704 (2021).
Doğan, T. et al. Protein domain-based prediction of compound–target interactions and experimental validation on LIM kinases. PLoS Comput. Biol. 17, e1009171 (2021).
Moult, J., Fidelis, K., Kryshtafovych, A., Schwede, T. & Tramontano, A. Critical assessment of methods of protein structure prediction (CASP)-Round XII. Proteins 86, 7–15 (2018).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596 (2021).
Baek, M. et al. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876 (2021).
Rifaioglu, A. S., Doğan, T., Jesus Martin, M., Cetin-Atalay, R. & Atalay, V. DEEPred: automated protein function prediction with multi-task feed-forward deep neural networks. Sci. Rep. 9, 7344 (2019).
You, R. et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 34, 2465–2473 (2018).
Jain, A. & Kihara, D. Phylo-PFP: improved automated protein function prediction using phylogenetic distance of distantly related sequences. Bioinformatics 35, 753–759 (2019).
The Gene Ontology Consortium. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 47, D330–D338 (2019).
Zhou, N. et al. The CAFA challenge reports improved protein function prediction and new functional annotations for hundreds of genes through experimental screens. Genome Biol. 20, 244 (2019).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Esteva, A. et al. A guide to deep learning in healthcare. Nat. Med. 25, 24–29 (2019).
Liu, L. et al. Deep learning for generic object detection: a survey. Int. J. Comput. Vision 128, 261–318 (2020).
Zhang, C., Patras, P. & Haddadi, H. Deep learning in mobile and wireless networking: a survey. IEEE Commun. Surv. Tutor. 21, 2224–2287 (2019).
Zou, J. et al. A primer on deep learning in genomics. Nat. Genet. 51, 12–18 (2019).
Weiss, K., Khoshgoftaar, T. M. & Wang, D. A survey of transfer learning. J. Big Data 3, 1817 (2016).
Raffel, C. et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Preprint at https://arxiv.org/abs/1910.10683 (2019).
Rao, R. et al. Evaluating protein transfer learning with TAPE. Adv. Neural Inf. Process. Syst. 32, 9689–9701 (2019).
Meier, J. et al. Language models enable zero-shot prediction of the effects of mutations on protein function. In Advances in Neural Information Processing Systems Vol. 34 (NeurIPS, 2021).
Elnaggar, A. et al. ProtTrans: towards cracking the language of life’s code through self-supervised deep learning and high performance computing. Preprint at https://arxiv.org/abs/2007.06225 (2020).
Yang, K. K., Wu, Z., Bedbrook, C. N. & Arnold, F. H. Learned protein embeddings for machine learning. Bioinformatics 34, 2642–2648 (2018).
Heinzinger, M. et al. Modeling aspects of the language of life through transfer-learning protein sequences. BMC Bioinformatics 20, 723 (2019).
Kim, S., Lee, H., Kim, K. & Kang, J. Mut2Vec: distributed representation of cancerous mutations. BMC Med. Genomics 11, 33 (2018).
Du, J. et al. Gene2vec: distributed representation of genes based on co-expression. BMC Genomics 20, 82 (2019).
Choy, C. T., Wong, C. H. & Chan, S. L. Infer related genes from large scale gene expression dataset with embedding. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/362848v2 (2018).
Rao, R. et al. MSA transformer. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.02.12.430858v3 (2021).
Lu, A. X., Zhang, H., Ghassemi, M. & Moses, A. Self-supervised contrastive learning of protein representations by mutual information maximization. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2020.09.04.283929v2 (2020).
Alley, E. C., Khimulya, G., Biswas, S., AlQuraishi, M. & Church, G. M. Unified rational protein engineering with sequence-based deep representation learning. Nat. Methods 16, 1315–1322 (2019).
Rives, A. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc. Natl Acad. Sci. USA 118, e2016239118 (2021).
Buchfink, B., Reuter, K. & Drost, H.-G. Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat. Methods 18, 366–368 (2021).
Johnson, L. S., Eddy, S. R. & Portugaly, E. Hidden Markov model speed heuristic and iterative HMM search procedure. BMC Bioinf. 11, 431 (2010).
Mistry, J. et al. Pfam: the protein families database in 2021. Nucleic Acids Res. 49, D412–D419 (2021).
Gromiha, M. M. Protein Sequence Analysis. In Protein Bioinformatics (ed. Gromiha, M. M.) Ch. 2, 29–62 (Academic, 2010); https://doi.org/10.1016/B978-8-1312-2297-3.50002-3
Chou, K.-C. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics 21, 10–19 (2005).
Wang, J. et al. POSSUM: a bioinformatics toolkit for generating numerical sequence feature descriptors based on PSSM profiles. Bioinformatics 33, 2756–2758 (2017).
Mitchell, A. et al. The InterPro protein families database: the classification resource after 15 years. Nucleic Acids Res. 43, D213–D221 (2015).
UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
Howe, K. L. et al. Ensembl 2021. Nucleic Acids Res. 49, D884–D891 (2021).
Mirabello, C. & Wallner, B. rawMSA: end-to-end deep learning using raw multiple sequence alignments. PLoS ONE 14, e0220182 (2019).
Xu, Y., Song, J., Wilson, C. & Whisstock, J. C. PhosContext2vec: a distributed representation of residue-level sequence contexts and its application to general and kinase-specific phosphorylation site prediction. Sci. Rep. 8, 8240 (2018).
Lin, D. An information-theoretic definition of similarity. In ICML '98: Proc. 15th International Conference on Machine Learning 296–304 (ACM, 1998).
Pedregosa, F., Varoquaux, G. & Gramfort, A. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
Littmann, M., Heinzinger, M., Dallago, C., Olenyi, T. & Rost, B. Embeddings from deep learning transfer GO annotations beyond homology. Sci. Rep. 11, 1160 (2021).
Villegas-Morcillo, A. et al. Unsupervised protein embeddings outperform hand-crafted sequence and structure features at predicting molecular function. Bioinformatics 37, 162–170 (2021).
Mendez, D. et al. ChEMBL: towards direct deposition of bioassay data. Nucleic Acids Res. 47, D930–D940 (2019).
Vaswani, A. et al. Attention is all you need. In Advances in Neural Information Processing Systems 30 (eds. Guyon, I. et al.) 5998–6008 (Curran Associates, 2017).
Vig, J. et al. BERTology meets biology: interpreting attention in protein language models. Preprint at https://arxiv.org/abs/2006.15222 (2020).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Kaufman, S., Rosset, S., Perlich, C. & Stitelman, O. Leakage in data mining: formulation, detection, and avoidance. ACM Trans. Knowl. Discov. Data 6, 1–21 (2012).
Brysbaert, M., Stevens, M., Mandera, P. & Keuleers, E. How many words do we know? Practical estimates of vocabulary size dependent on word definition, the degree of language input and the participant’s age. Front. Psychol. 7, 1116 (2016).
Higgins, I. et al. Towards a definition of disentangled representations. Preprint at https://arxiv.org/abs/1812.02230 (2018).
Tubiana, J., Cocco, S. & Monasson, R. Learning protein constitutive motifs from sequence data. eLife 8, e39397 (2019).
Öztürk, H., Ozkirimli, E. & Özgür, A. WideDTA: prediction of drug-target binding affinity. Preprint at https://arxiv.org/abs/1902.04166 (2019).
Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nat. Methods 15, 816–822 (2018).
Doğan, T. et al. CROssBAR: Comprehensive resource of biomedical relations with knowledge graph representations. Nucleic Acids Res. 49, e96–e96 (2021).
Burk, M. J. & Van Dien, S. Biotechnology for chemical production: challenges and opportunities. Trends Biotechnol. 34, 187–190 (2016).
Gainza, P., Nisonoff, H. M. & Donald, B. R. Algorithms for protein design. Curr. Opin. Struct. Biol. 39, 16–26 (2016).
Baker, D. An exciting but challenging road ahead for computational enzyme design. Protein Sci. 19, 1817–1819 (2010).
Röthlisberger, D. et al. Kemp elimination catalysts by computational enzyme design. Nature 453, 190–195 (2008).
Privett, H. K. et al. Iterative approach to computational enzyme design. Proc. Natl Acad. Sci. USA 109, 3790–3795 (2012).
Chan, H. S., Shimizu, S. & Kaya, H. Cooperativity principles in protein folding. Methods Enzymol. 380, 350–379 (2004).
Lippow, S. M., Wittrup, K. D. & Tidor, B. Computational design of antibody-affinity improvement beyond in vivo maturation. Nat. Biotechnol. 25, 1171–1176 (2007).
Looger, L. L., Dwyer, M. A., Smith, J. J. & Hellinga, H. W. Computational design of receptor and sensor proteins with novel functions. Nature 423, 185–190 (2003).
Duan, Y. et al. A point-charge force field for molecular mechanics simulations of proteins based on condensed-phase quantum mechanical calculations. J. Comput. Chem. 24, 1999–2012 (2003).
Brunk, E. & Rothlisberger, U. Mixed quantum mechanical/molecular mechanical molecular dynamics simulations of biological systems in ground and electronically excited states. Chem. Rev. 115, 6217–6263 (2015).
Childers, M. C. & Daggett, V. Insights from molecular dynamics simulations for computational protein design. Mol. Syst. Des. Eng. 2, 9–33 (2017).
Hollingsworth, S. A. & Dror, R. O. Molecular dynamics simulation for all. Neuron 99, 1129–1143 (2018).
Camilloni, C. & Vendruscolo, M. Statistical mechanics of the denatured state of a protein using replica-averaged metadynamics. J. Am. Chem. Soc. 136, 8982–8991 (2014).
Huang, S.-Y. & Zou, X. Statistical mechanics-based method to extract atomic distance-dependent potentials from protein structures. Proteins 79, 2648–2661 (2011).
Pierce, N. A. & Winfree, E. Protein design is NP-hard. Protein Eng. 15, 779–782 (2002).
Eguchi, R. R., Anand, N., Choe, C. A. & Huang, P.-S. IG-VAE: Generative modeling of immunoglobulin proteins by direct 3D coordinate generation. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2020.08.07.242347v2 (2020).
Ng, A. Y. & Jordan, M. I. On discriminative vs. generative classifiers: a comparison of logistic regression and naive Bayes. In Advances in Neural Information Processing Systems (eds. Dietterich, T. G., Becker, S. & Ghahramani, Z.) Vol. 14, 841–848 (MIT Press, 2002).
Salakhutdinov, R. Learning deep generative models. Annu. Rev. Stat. Appl. 2, 361–385 (2015).
Madani, A. et al. Deep neural language modeling enables functional protein generation across families. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/2021.07.18.452833v1 (2021).
Stärk, H., Dallago, C., Heinzinger, M. & Rost, B. Light attention predicts protein location from the language of life. Bioinformatics Advances 1, vbab035 (2021).
Yu, G. et al. GOSemSim: an R package for measuring semantic similarity among GO terms and gene products. Bioinformatics 26, 976–978 (2010).
McInnes, B. T. & Pedersen, T. Evaluating measures of semantic similarity and relatedness to disambiguate terms in biomedical text. J. Biomed. Inform. 46, 1116–1124 (2013).
Spearman, C. The proof and measurement of association between two things. Am. J. Psychol. 15, 72–101 (1904).
Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
Moal, I. H. & Fernández-Recio, J. SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28, 2600–2607 (2012).
Chen, M. et al. Multifaceted protein–protein interaction prediction based on Siamese residual RCNN. Bioinformatics 35, i305–i314 (2019).
Tipping, M. E. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res. 1, 211–244 (2001).
Wan, F. & Zeng, J. Deep learning with feature embedding for compound–protein interaction prediction. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/086033v1 (2016).
Asgari, E., McHardy, A. C. & Mofrad, M. R. Probabilistic variable-length segmentation of protein sequences for discriminative motif discovery (DiMotif) and sequence embedding (ProtVecX). Sci. Rep. 9, 3577 (2019).
Öztürk, H., Özgür, A. & Ozkirimli, E. DeepDTA: deep drug-target binding affinity prediction. Bioinformatics 34, i821–i829 (2018).
Oubounyt, M., Louadi, Z., Tayara, H. & To Chong, K. Deep learning models based on distributed feature representations for alternative splicing prediction. IEEE Access 6, 58826–58834 (2018).
Mirabello, C. & Wallner, B. rawMSA: end-to-end deep learning makes protein sequence profiles and feature extraction obsolete. Preprint at bioRxiv (2018).
Dutta, A., Dubey, T., Singh, K. K. & Anand, A. SpliceVec: distributed feature representations for splice junction prediction. Comput. Biol. Chem. 74, 434–441 (2018).
Mejía-Guerra, M. K. & Buckler, E. S. A k-mer grammar analysis to uncover maize regulatory architecture. BMC Plant Biol. 19, 103 (2019).
Cohen, T., Widdows, D., Heiden, J. A. V., Gupta, N. T. & Kleinstein, S. H. Graded vector representations of immunoglobulins produced in response to west Nile virus. In Quantum Interaction (eds de Barros, J. A., Coecke, B. & Pothos, E.) 135–148 (Springer, 2017).
Ng, P. dna2vec: Consistent vector representations of variable-length k-mers. Preprint at https://arxiv.org/abs/1701.06279 (2017).
Jaeger, S., Fulle, S. & Turk, S. Mol2vec: Unsupervised machine learning approach with chemical intuition. J. Chem. Inf. Model. 58, 27–35 (2018).
Viehweger, A., Krautwurst, S., Parks, D. H., König, B. & Marz, M. An encoding of genome content for machine learning. Preprint at https://www.biorxiv.org/content/10.1101/524280v3 (2019).
Qi, Y., Oja, M., Weston, J. & Noble, W. S. A unified multitask architecture for predicting local protein properties. PLoS ONE 7, e32235 (2012).
Melvin, I., Weston, J., Noble, W. S. & Leslie, C. Detecting remote evolutionary relationships among proteins by large-scale semantic embedding. PLoS Comput. Biol. 7, e1001047 (2011).
Choi, J., Oh, I., Seo, S. & Ahn, J. G2Vec: distributed gene representations for identification of cancer prognostic genes. Sci. Rep. 8, 13729 (2018).
You, R. & Zhu, S. DeepText2Go: Improving large-scale protein function prediction with deep semantic text representation. In 2017 IEEE International Conference on Bioinformatics and Biomedicine (BIBM) 42–49 (IEEE, 2017); https://doi.org/10.1109/BIBM.2017.8217622
Bepler, T. & Berger, B. Learning protein sequence embeddings using information from structure. Preprint at https://arxiv.org/abs/1902.08661 (2019).
Schwartz, A. S. et al. Deep semantic protein representation for annotation, discovery, and engineering. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/365965v1 (2018).
Kané, H., Coulibali, M., Abdalla, A. & Ajanoh, P. Augmenting protein network embeddings with sequence information. Preprint at bioRxiv https://www.biorxiv.org/content/10.1101/730481v3 (2019).
Faisal, M. R. et al. Improving protein sequence classification performance using adjacent and overlapped segments on existing protein descriptors. JBiSE 11, 126–143 (2018).
Strodthoff, N., Wagner, P., Wenzel, M. & Samek, W. UDSMProt: universal deep sequence models for protein classification. Bioinformatics 36, 2401–2409 (2020).
Asgari, E., Poerner, N., McHardy, A. C. & Mofrad, M. R. K. DeepPrime2Sec: deep learning for protein secondary structure prediction from the primary sequences. Preprint at bioRxiv https://www.biorxiv.org/content/early/2019/07/18/705426 (2019).
Bileschi, M. L. et al. Using deep learning to annotate the protein universe. Nat. Biotechnol. https://doi.org/10.1038/s41587-021-01179-w (2022).
Unsal, S. et al. Learning Functional Properties of Proteins with Language Models Data Sets (Zenodo, 2020); https://doi.org/10.5281/zenodo.5795850
Unsal, S. et al. PROBE (Protein Representation Benchmark): Function-Centric Evaluation of Protein Representation Methods (Code Ocean, 2021); https://doi.org/10.24433/CO.5123923.v2
Acknowledgements
This work was supported by TUBITAK (project no. 318S218). We thank G. Tatar (faculty member, KTU, Turkey) for reading and commenting on the manuscript, and G.M.Ç. Şılbır (PhD candidate, KTU, Turkey) for contributing to the drawing of figures.
Author information
Contributions
T.D., S.U. and A.C.A. conceived the idea and planned the work. S.U. evaluated the literature, constructed the representation vectors, prepared the datasets and carried out the analyses for the semantic similarity inference and protein–protein binding affinity estimation benchmarks. H.A. and S.U. prepared the datasets and carried out the analysis for the ontology-based protein family classification benchmark. M.A. and S.U. prepared the datasets and carried out the analysis for the drug target protein family classification benchmark. S.U., A.C.A. and T.D. wrote the manuscript. T.D., A.C.A. and K.T. supervised the overall study. All authors approved the manuscript.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Machine Intelligence thanks Christian Dallago, Céline Marquet and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
Supplementary Information
1. Different approaches for representing proteins. 2. Classical protein representations. 3. Protein representation learning. 4. Protein representation methods benchmarked in this study. 5. Objective-based classification of a comprehensive list of protein representations. 6. Traits of successful protein representations. 7. Performance evaluation metrics. 8. Extended results. 9. Extended discussion. Supplementary Figs. 1–15 and Tables 1–11.
Rights and permissions
About this article
Cite this article
Unsal, S., Atas, H., Albayrak, M. et al. Learning functional properties of proteins with language models. Nat Mach Intell 4, 227–245 (2022). https://doi.org/10.1038/s42256-022-00457-9