Proteins are large macromolecules with complex 3D structures that play a role in almost all biological processes at the cellular level. An ingenious biological design enables this wide range of functionality: with 20 amino acids as building blocks, a near-endless number of sequences is possible, producing long chains that form helices and other secondary structures, which in turn fold into 3D conformations. Proteins bind and interact with molecules or other proteins in specific ways determined by their folded structure, which enables them to carry out biological processes. Functions such as neural signalling, the transport of vital molecules, the regulation of the cell cycle and cell death, the transcription of DNA and the production of proteins themselves are all made possible through protein interactions.

Interestingly, 3D protein structure, and hence functionality, can be deduced solely from the underlying one-dimensional sequence of amino acids. In the past two decades, various deep learning approaches and architectures have been applied to this problem, and exciting new research opportunities have emerged, such as in synthetic biology, in identifying drug targets and in studying fundamental biological processes.

The past few years have seen substantial breakthroughs in this area with the development of AlphaFold and RoseTTAFold1. These models build on advances in deep learning and on an abundance of training data: 3D structures painstakingly measured over several decades with a range of imaging modalities. A next step is to go beyond static structures and predict dynamic changes, such as when proteins interact with other proteins or molecules. In an Article in this issue (shown on the cover), Qiao et al. present a state-of-the-art generative diffusion model that predicts binding between a protein and a ligand, taking the amino acid sequence and the ligand molecular graph as input. Moreover, the study provides exciting insights into how binding interactions change the 3D protein structure.
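To make the diffusion idea concrete, the sketch below shows the standard forward noising step on which such generative models are built, applied to ligand atom coordinates. This is a minimal illustration of the general technique, not the authors' implementation; the variance schedule, the tensor shapes and the `noise_coordinates` helper are all assumptions.

```python
import torch

# Toy linear variance schedule for a denoising diffusion process.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def noise_coordinates(x0: torch.Tensor, t: int):
    """Sample x_t ~ q(x_t | x_0) for ligand atom coordinates x0 of shape (n_atoms, 3)."""
    eps = torch.randn_like(x0)
    a = alpha_bars[t]
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps

# A denoising network is trained to predict eps from the noised coordinates,
# conditioned on embeddings of the protein sequence and the ligand graph;
# sampling then runs the process in reverse, from pure noise to a bound pose.
x0 = torch.randn(24, 3)            # placeholder coordinates for a 24-atom ligand
xt, eps = noise_coordinates(x0, t=500)
```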

This issue also features a study on a specific therapeutic application involving protein design. In their Article, Lyu et al. develop a method to generate large virus-inspired proteins for use in gene therapy. In particular, they focus on finding viral vectors with suitable properties that avoid rejection by the human immune system. Lyu et al. take adenoviruses as a starting point and generate modified versions with a variational autoencoder approach. Given the long sequence of the adenovirus and the lack of training data (only 88 types of human adenovirus are known), they also add the pre-trained protein language model ProtBert2 to the encoder. The overall approach can generate modified versions that have a folding structure similar to that of the original virus but are not recognized by human antibodies, and could therefore be deployed as viral vectors in gene therapy.
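A minimal sketch of the general architecture described here — a variational autoencoder whose encoder works on embeddings from a pre-trained protein language model — might look as follows. All dimensions, layer choices and the `PLMConditionedVAE` class are illustrative assumptions, not the authors' model.

```python
import torch
import torch.nn as nn

class PLMConditionedVAE(nn.Module):
    """Toy VAE whose encoder consumes embeddings from a pre-trained protein
    language model (for example, mean-pooled ProtBert hidden states) rather
    than raw sequences, to compensate for scarce training data."""

    def __init__(self, plm_dim: int = 1024, latent_dim: int = 64,
                 seq_len: int = 700, vocab: int = 21):
        super().__init__()
        self.to_mu = nn.Linear(plm_dim, latent_dim)
        self.to_logvar = nn.Linear(plm_dim, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(),
            nn.Linear(512, seq_len * vocab),   # logits over amino acids per position
        )
        self.seq_len, self.vocab = seq_len, vocab

    def forward(self, plm_embedding: torch.Tensor):
        mu, logvar = self.to_mu(plm_embedding), self.to_logvar(plm_embedding)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        logits = self.decoder(z).view(-1, self.seq_len, self.vocab)
        return logits, mu, logvar

# plm_embedding would come from a pre-trained model such as transformers'
# "Rostlab/prot_bert", mean-pooled over residues; a random tensor stands in here.
model = PLMConditionedVAE()
logits, mu, logvar = model(torch.randn(2, 1024))
```

Training would combine a reconstruction loss over the amino acid logits with the usual KL term; the pre-trained embedding is what lets the encoder generalize from only a few dozen known sequences.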

In another Article in this issue, Kulmanov et al. tackle the challenge of predicting and classifying the functionality of proteins whose roles in biological processes are unknown. Their starting point is the Gene Ontology system, a major classification system in bioinformatics. The authors deploy the pre-trained ESM2 protein language model3 in combination with symbolic reasoning to predict a protein’s function from its sequence. By using ESM2 to generate multiple neural network models trained with different Gene Ontology interpretations, and checking that a prediction holds in all trained models, the authors identify connections between a protein’s sequence, its function and the hierarchy of ontology terms.
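The ensemble-agreement idea can be sketched in a few lines: several prediction heads, each standing in for a model trained under a different Gene Ontology interpretation, with a prediction kept only when all of them agree. The dimensions, the linear heads and the `consensus` helper are placeholders assumed for illustration.

```python
import torch
import torch.nn as nn

N_TERMS = 1000   # placeholder number of Gene Ontology terms

# Stand-ins for networks trained under different Gene Ontology interpretations.
# In practice each would be trained on per-protein ESM2 embeddings (for example,
# mean-pooled hidden states from "facebook/esm2_t33_650M_UR50D", dimension 1280).
models = [nn.Linear(1280, N_TERMS) for _ in range(5)]

def consensus(embedding: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Predict a GO term only if every model in the ensemble predicts it,
    mirroring the requirement that a prediction be true in all trained models."""
    with torch.no_grad():
        votes = torch.stack([m(embedding).sigmoid() > threshold for m in models])
    return votes.all(dim=0)   # boolean mask over GO terms

predicted_terms = consensus(torch.randn(1280))
```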

In a fourth Article in this issue, Outeiral and Deane consider using sequences of codons instead of amino acids as input to protein language models. Codons are groups of three DNA nucleotides that encode amino acids. The authors show that models based on codons, which carry richer biological information, outperform other protein language models on benchmarks, including models with many more parameters.
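The change of representation is simple to state in code: a coding DNA sequence is split into three-nucleotide tokens rather than being translated into one token per amino acid, so synonymous codons remain distinguishable. A minimal sketch (the `codon_tokens` helper is an illustrative name, not from the paper):

```python
def codon_tokens(cds: str) -> list[str]:
    """Tokenize a coding DNA sequence into codons (three nucleotides each)."""
    assert len(cds) % 3 == 0, "coding sequence length must be a multiple of 3"
    return [cds[i:i + 3] for i in range(0, len(cds), 3)]

# GCT and GCC both encode alanine; an amino acid tokenizer would merge them,
# whereas a codon tokenizer keeps the synonymous variants apart.
print(codon_tokens("ATGGCTGCCTAA"))   # ['ATG', 'GCT', 'GCC', 'TAA'] -> Met-Ala-Ala-Stop
```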

The past few years have revealed the power of large language models in a surprising range of applications, and it seems likely that protein science will see another wave of breakthroughs with such models. In fact, there are interesting analogies between natural language and protein modelling. As Ferruz and Höcker4 discussed, amino acids arrange in various combinations to form structures that carry function, similar to how letters form words and sentences that carry meaning. However, as the authors also noted, the parallel is not exact; among other things, it is not clear how to identify individual ‘words’ in protein language models. Further exploring the connection between natural language processing and protein models, and identifying linguistic rules5, could lead to an improved understanding of the relationship between sequences and protein function.