In a new study published in Cell Research, Chinese researchers have trained ProMEP, a multimodal protein representation model that enables accurate, zero-shot prediction of mutation effects in proteins. Without relying on multiple sequence alignments, the model uniquely integrates sequence and structural information to predict mutational consequences and to prioritize beneficial mutations that improve protein activity, motivating its use in diverse protein engineering and biotechnology applications.
Mutations in protein sequences can lead to significant changes in protein function, influencing enzyme activity, inducing human disease, and driving viral evolution. Accurately predicting these effects is crucial not only for interpreting variants but also for protein engineering, enhancing directed evolution efforts and de novo protein design. Unfortunately, traditional methods, including alignment-based approaches, stability predictors, and supervised learning models, depend on multiple sequence alignments (MSAs) or annotated datasets, limiting their efficiency and scalability.
Recently, state-of-the-art protein language models (pLMs), such as ESM-2,1 have been trained on over 250 million protein sequences and capture critical properties of proteins, including structure, function, interactions, and biophysical characteristics, from the input sequence alone. These models, inspired by natural-language-processing techniques, learn feature-rich representations of protein sequences that can be applied to downstream tasks such as structure prediction,1 protein function prediction,2 binding site prediction,3 and de novo peptide design,4 and can be further extended to incorporate key additional features, such as post-translational modifications.5
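To make the masked-language-modeling idea behind such pLMs concrete, the following is a minimal PyTorch sketch, not ESM-2 itself: a toy model with random weights and invented dimensions, showing how residues are tokenized, one position is masked, and the model emits per-residue amino-acid logits.

```python
import torch
import torch.nn as nn

AA = "ACDEFGHIKLMNPQRSTVWY"
tok = {a: i for i, a in enumerate(AA)}
MASK = len(AA)  # id of the extra [MASK] token

class TinyPLM(nn.Module):
    """Toy masked protein language model (random weights, illustration only)."""
    def __init__(self, d_model=64, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(len(AA) + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, len(AA))  # per-residue amino-acid logits

    def forward(self, ids):
        return self.head(self.encoder(self.embed(ids)))

seq = "MKTAYIAKQR"                        # ten-residue toy sequence
ids = torch.tensor([[tok[a] for a in seq]])
masked = ids.clone()
masked[0, 3] = MASK                       # hide one residue, as in training
with torch.no_grad():
    logits = TinyPLM()(masked)            # shape: (1, len(seq), 20)
```

During training, the model is asked to recover the identity of the masked residue from its context; the representations learned this way are what downstream tasks reuse.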
However, while traditional pLMs utilize only sequence information, incorporating structural context has been suggested to improve pLM capabilities, as the spatial arrangement of amino acids and their interactions helps determine a protein’s functional properties. Models like SaProt6 and ProstT57 have led the way in integrating structural information into pLMs, capturing the long-range contact information that is vital for accurate function prediction, and have thus performed strongly on standard protein benchmarking tasks. Still, these models have not been optimized for variant effect prediction. Another groundbreaking model, AlphaMissense, explicitly utilizes structural context for mutation effect prediction, but because it relies on MSAs, the accuracy of its predictions often depends on the availability of homologous sequences.8 This reliance makes AlphaMissense less scalable and more resource-intensive, particularly for large protein datasets or when sequence homologs are sparse.
ProMEP9 overcomes many of these challenges by utilizing a multimodal representation model that integrates information from both protein sequence and structural contexts, significantly improving computational speed. Instead of the traditional representations of protein structures as graphs or contact maps, the researchers employ a novel point cloud representation, allowing protein structures to be represented efficiently at atomic resolution. An SE(3)-Transformer-based structure-embedding module, equivariant to 3D rotations and translations, extracts embeddings for each residue from the protein point cloud representation, while a separate sequence-embedding module obtains sequence embeddings. Both embeddings are then combined and forwarded to a Transformer encoder to generate a comprehensive representation of the protein. Trained on ~160 million proteins using a masking paradigm, ProMEP’s combination of sequence- and structure-embedding modules enables the model to capture sequence context, local secondary structure context, and the global protein folding context. Benchmarked on 15 datasets with protein annotations, ProMEP performs favorably compared to existing pLMs and shallow multimodal models and achieves state-of-the-art performance for zero-shot prediction of mutation effects for proteins spanning diverse functions. Notably, the model generalizes strongly, accurately predicting mutation effects for proteins with low sequence homology and de novo designed proteins — areas where traditional MSA-based methods have struggled (Fig. 1).
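The fusion of sequence and structure embeddings described above can be sketched as follows. This is a hedged illustration, not ProMEP’s implementation: a plain MLP on backbone coordinates stands in for the SE(3)-Transformer point-cloud module, and all dimensions are invented.

```python
import torch
import torch.nn as nn

class MultimodalEncoder(nn.Module):
    """Sketch of ProMEP-style fusion: per-residue sequence and structure
    embeddings are combined and passed to a shared Transformer encoder.
    (A simple coordinate MLP replaces the SE(3)-Transformer used in ProMEP.)"""
    def __init__(self, n_tokens=20, d=64):
        super().__init__()
        self.seq_embed = nn.Embedding(n_tokens, d)
        self.struct_embed = nn.Sequential(
            nn.Linear(3, d), nn.ReLU(), nn.Linear(d, d))
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, seq_ids, coords):
        # Combine the two modalities, then jointly contextualize them.
        h = self.seq_embed(seq_ids) + self.struct_embed(coords)
        return self.fusion(h)

L = 12  # residues in a toy protein
with torch.no_grad():
    rep = MultimodalEncoder()(torch.randint(0, 20, (1, L)),
                              torch.randn(1, L, 3))
# rep: per-residue multimodal representation of shape (1, L, 64)
```

The key design choice this mirrors is that structural context enters before the final encoder, so every residue’s representation reflects both its sequence neighborhood and its 3D environment.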
Most importantly, ProMEP learns comprehensive fitness landscapes for input proteins and prioritizes beneficial mutations, facilitating the identification and design of enhanced variants. To demonstrate this, the authors used ProMEP to enhance the activities of the weakly active, transposase-related RNA-guided nuclease TnpB, as well as the TadA-based CRISPR adenine base editor (ABE). By predicting the fitness scores of all X-to-R mutants, ProMEP accurately prioritized beneficial mutations in these enzymes, markedly improving their editing efficiency and reducing their off-target effects in mammalian cells, even surpassing the editing properties of the highly active and widely used ABE8e base editor.
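A common way such zero-shot fitness scores are computed for pLMs (the sketch below is a generic masked-marginal scheme, not ProMEP’s specific scoring) is the log-odds between the mutant and wild-type amino acids under the model’s distribution at the mutated site; the logits here are toy values, not real model outputs.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"

def log_softmax(x):
    """Numerically stable log-softmax over a logit vector."""
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def mutation_scores(logits, wt):
    """Log-odds of every substitution vs. the wild-type residue at one site.
    Positive scores indicate mutations the model favors over wild type."""
    logp = log_softmax(np.asarray(logits, dtype=float))
    wt_lp = logp[AA.index(wt)]
    return {aa: logp[i] - wt_lp for i, aa in enumerate(AA) if aa != wt}

# Toy logits for one masked position (a real model would supply these):
logits = np.zeros(20)
logits[AA.index("R")] = 2.0          # model strongly prefers arginine here
scores = mutation_scores(logits, wt="K")
best = max(scores, key=scores.get)   # → "R", with log-odds score 2.0
```

Ranking all candidate substitutions by such scores is what allows beneficial mutations to be prioritized computationally before any wet-lab screening.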
Overall, ProMEP’s ability to predict mutation effects without relying on MSAs marks a significant advancement in computational protein modeling and experimental protein engineering. The robust experimental data suggest that ProMEP could be applied in an in silico protein evolution scheme, optimizing proteins entirely through computational predictions before experimental testing. However, ProMEP’s dependence on structural representations may limit its effectiveness for intrinsically disordered proteins, whose static structural representations may be inaccurate. The recent release of ESM310 demonstrates the growing popularity of integrating structural and sequence modeling in protein language modeling. Despite this, both ESM3 and ProMEP are likely to face challenges in designing and predicting variants for conformationally flexible proteins, highlighting an area that still requires exploration.
References
Lin, Z. et al. Science 379, 1123–1130 (2023).
Zhou, Z. et al. Nat. Commun. 15, 5566 (2024).
Brixi, G. et al. Commun. Biol. 6, 1081 (2023).
Chen, T. et al. arXiv https://doi.org/10.48550/ARXIV.2310.03842 (2023).
Peng, Z., Schussheim, B. & Chatterjee, P. bioRxiv https://doi.org/10.1101/2024.02.28.581983 (2024).
Su, J. et al. bioRxiv https://doi.org/10.1101/2023.10.01.560349 (2023).
Heinzinger, M. et al. bioRxiv https://doi.org/10.1101/2023.07.23.550085 (2023).
Cheng, J. et al. Science 381, eadg7492 (2023).
Cheng, P. et al. Cell Res. https://doi.org/10.1038/s41422-024-00989-2 (2024).
Hayes, T. et al. bioRxiv https://doi.org/10.1101/2024.07.01.600583 (2024).
Chen, T., Chatterjee, P. Synergizing sequence and structure representations to predict protein variants. Cell Res 34, 597–598 (2024). https://doi.org/10.1038/s41422-024-01010-6