Artificial intelligence models of natural language are becoming ever more adept at processing and ‘understanding’ language, and are widely used in applications such as automated speech recognition, translation, smart assistants and text generation. How useful they might be for modeling biological phenomena is a question of longstanding interest among computational biologists. Recently, language models have been applied to such problems as protein function prediction, protein evolution analysis and protein design. A report in Nature Communications by Ferruz et al. advances the application to protein design, providing an unsupervised, autoregressive language model for generating de novo protein sequences.
Because the authors sought to generate new sequences, they constructed an autoregressive model. Autoregressive language models are trained on large natural-language data sets to predict the next word in a text from the words that precede it. Ferruz et al. trained their model on ~50 million unannotated protein sequences representing the entire space of known proteins, holding out ~10% of the data to evaluate the model. They then compared the protein sequences generated by the model with sets of known protein sequences. The generated sequences resembled known sequences in their amino acid propensities and in their percentage of ordered globular domains. A homology analysis found that the generated sequences were, overall, distant from known sequences. Notably, the generated sequences populated ‘dark’ areas of the proteome: regions where protein structures have never been observed, or perhaps never been tried by nature.
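The autoregressive generation loop described above can be sketched in a few lines. The following toy example is illustrative only: `toy_next_residue_probs` is a hypothetical stand-in for a trained model (here it returns a uniform distribution over the 20 standard amino acids), whereas a real model such as the authors' would assign context-dependent probabilities learned from the ~50 million training sequences.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def toy_next_residue_probs(prefix):
    # Hypothetical stand-in for a trained autoregressive model:
    # a real model would score each candidate residue given the prefix.
    # Here we simply return a uniform distribution for illustration.
    return {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

def generate_sequence(length, seed=0):
    # Autoregressive generation: sample one residue at a time,
    # each conditioned on everything generated so far.
    rng = random.Random(seed)
    seq = ""
    for _ in range(length):
        probs = toy_next_residue_probs(seq)
        residues, weights = zip(*probs.items())
        seq += rng.choices(residues, weights=weights, k=1)[0]
    return seq

print(generate_sequence(30))
```

The key point is the loop structure: each new residue is drawn from a distribution conditioned on the growing prefix, which is what distinguishes autoregressive generation from masked (fill-in-the-blank) prediction.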