The prowess of protein language models (PLMs) has been demonstrated across a variety of tasks, such as protein structure prediction, function analysis and engineering, and novel protein design. Transformers, a deep learning architecture that excels at learning relationships in sequence data, have commonly been employed as the backbone of PLMs, first pretrained on huge datasets of protein sequences to become versed in the ‘language’ of the protein universe and then adapted for multiple downstream tasks. However, their remarkable performance comes at the cost of a high computational burden, which limits the length of the protein sequences they can process. Curious whether transformers were the only architecture that would work for protein language models, Kevin Yang and colleagues at Microsoft Research New England explored the potential of another architecture for building PLMs.
The team experimented with convolutional neural networks (CNNs), which were developed earlier than transformers in deep learning research and have also been widely applied to biological data analysis. One of CNNs’ most appealing features is that their cost scales linearly with sequence length, compared with the quadratic scaling of transformers. Yang and colleagues built a series of CNN-based protein language models called CARP (convolutional autoencoding representations of proteins) using the same pretraining strategy and dataset as the popular existing transformer-based PLM ESM. When the team compared these models on both the pretraining task and a number of downstream tasks (for example, prediction of protein structure, mutation effect, fitness, fluorescence and stability), they found, to their surprise, that the overall performance of CARP was on par with, and in some cases even better than, that of ESM. Furthermore, “We were surprised that, for both architectures, downstream performance did not necessarily improve for bigger models with better pretrain performance,” says Yang.
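The scaling difference can be made concrete with a back-of-the-envelope count of per-layer operations. The sketch below is purely illustrative and not taken from the paper: the embedding width `d` and kernel size `k` are hypothetical values, and the formulas count only the dominant multiply–accumulate terms (the L × L attention matrix for a transformer layer, and a width-`k` convolution for a CNN layer).

```python
# Illustrative per-layer cost model: self-attention grows quadratically
# with sequence length L, a 1D convolution grows linearly.

def attention_ops(L: int, d: int) -> int:
    # Dominant terms: the Q·K^T score matrix and the attention-weighted
    # sum over values, each roughly an L x L x d matrix product.
    return 2 * L * L * d

def conv_ops(L: int, k: int, d: int) -> int:
    # A 1D convolution with kernel width k mapping d channels to d channels
    # does about k * d * d multiply-accumulates at each of L positions.
    return L * k * d * d

if __name__ == "__main__":
    d, k = 1280, 9  # hypothetical embedding width and kernel size
    for L in (512, 1024, 2048, 4096):
        print(f"L={L:5d}  attention={attention_ops(L, d):.3e}  "
              f"conv={conv_ops(L, k, d):.3e}")
```

Doubling the sequence length quadruples the attention cost but only doubles the convolution cost, which is why long proteins are far cheaper for a CNN-based model to ingest.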