Introduction

Macromolecules (e.g., proteins, RNAs, or DNAs) are essential to biophysical processes. While they can be represented using lower-dimensional representations such as linear sequences (1D) or chemical bond graphs (2D), a more intrinsic and informative form is the three-dimensional geometry1. 3D shapes are critical not only to understanding the physical mechanisms of action but also to answering a number of questions associated with drug discovery and molecular design2. Consequently, tremendous efforts in structural biology have been devoted to deriving insights from their conformations3,4,5.

With the rapid advances of deep learning (DL) techniques, it has become an attractive challenge to represent and reason about macromolecules’ structures in 3D space. In particular, different sorts of 3D information, including bond lengths and dihedral angles, play an essential role. To encode them, a number of 3D geometric graph neural networks (GGNNs) and CNNs6,7,8,9 have been proposed that simultaneously respect crucial properties of Euclidean geometry such as E(3) or SE(3) equivariance and symmetry. Notably, they are essential constituents of geometric deep learning (GDL), an umbrella term for approaches that generalize neural networks to Euclidean and non-Euclidean domains10.

Meanwhile, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. The abundance of 1D amino acid sequences has spurred increasing interest in developing protein language models at the scale of evolution, such as the ESM series11,12,13 and ProtTrans14. These protein language models can capture information about secondary and tertiary structures and generalize across a broad range of downstream applications. Concretely, they have recently demonstrated strong capabilities in uncovering protein structures12, predicting the effect of sequence variation on function11, learning inverse folding15, and serving many other general purposes13.

With the fruitful progress in protein language models, more and more studies have considered enhancing GGNNs by leveraging the knowledge of those protein language models12,16,17. This is nontrivial because, compared to sequences, 3D structures are much harder to obtain and thus less prevalent; learning from protein structures therefore relies on a reduced amount of training data. For example, the SAbDab database18 contains only about 3K non-redundant antibody-antigen structures. The SCOPe database19 has 226K annotated structures, and the SIFTS database20 comprises around 220K annotated enzyme structures. These numbers are orders of magnitude lower than the dataset sizes that have inspired major breakthroughs in the deep learning community. In contrast, while the Protein Data Bank (PDB)21 possesses approximately 182K macromolecule structures, databases like Pfam22 and UniParc23 contain more than 47M and 250M protein sequences, respectively.

In addition to the data size, the benefit of protein sequences to structure learning also has solid evidence and theoretical support. Remarkably, the idea that biological function and structure are documented in the statistics of protein sequences selected through evolution has a long history24. The unobserved variables that determine a protein’s fitness, including structure, function, and stability, leave a record in the distribution of observed natural sequences25. Protein language models use self-supervision to unlock the information encoded in these sequence variations, which is also beneficial for GGNNs. Accordingly, in this paper, we comprehensively investigate how knowledge learned by protein language models can promote GGNNs’ capability (see Fig. 1). The improvements come from two major sources. First, GGNNs can benefit from the information that emerges in the learned representations of protein language models about fundamental properties of proteins, including secondary structure, contacts, and biological activity. This kind of knowledge may be difficult for GGNNs to acquire within a specific downstream task. To confirm this claim, we conduct a toy experiment demonstrating that conventional graph connectivity mechanisms prevent existing GGNNs from being aware of residues’ absolute and relative positions in the protein sequence. Second, and more intuitively, protein language models serve as an alternative way of enriching GGNNs’ training data, exposing GGNNs to many more protein families and thereby greatly strengthening their generalization capability.

Fig. 1: Illustration of our framework to strengthen GGNNs with knowledge of protein language models.

The protein sequence is first forwarded into a pretrained protein language model to extract per-residue representations, which are then used as node features in 3D protein graphs for GGNNs.

We examine our hypothesis across a wide range of benchmarks, including model quality assessment, protein-protein interface prediction, protein-protein rigid-body docking, and ligand binding affinity prediction. Extensive experiments show that incorporating pretrained protein language models’ knowledge significantly improves GGNNs’ performance on these problems, each of which requires distinct domain knowledge. By utilizing the unprecedented view into the language of protein sequences provided by powerful protein language models, GGNNs promise to augment our understanding of a vast collection of poorly understood protein structures. We hope our work sheds more light on how to bridge the gap between thriving geometric deep learning and mature protein language models and how to better leverage the different modalities of proteins.

Results and discussion

Our toy experiments illustrate that existing GGNNs are unaware of the positional order inside protein sequences. Taking a step further, we show in this section that incorporating knowledge learned by large-scale protein language models can robustly enhance GGNNs’ capacity in a wide variety of downstream tasks.

Tasks and datasets

  • Model Quality Assessment (MQA) aims to select the best structural model of a protein from a large pool of candidate structures and is an essential step in structure prediction26. For a number of recently solved but unreleased structures, structure generation programs produce a large number of candidates. MQA approaches are evaluated by their ability to predict the global distance test score (GDT-TS) of a candidate structure compared with the experimentally solved structure of that target. The database is composed of all structural models submitted to the Critical Assessment of Structure Prediction (CASP)27 over the last 18 years, split temporally by competition year. MQA is similar to the Protein Structure Ranking (PSR) task introduced by Townshend et al.2.

  • Protein-protein Rigid-body Docking (PPRD) computationally predicts the 3D structure of a protein-protein complex from the individual unbound structures, under the assumption that no conformational change occurs within the proteins during binding. We use Docking Benchmark 5.5 (DB5.5)28 as the database; it is a gold-standard dataset in terms of data quality and contains 253 structures.

  • Protein-protein Interface (PPI) prediction investigates whether two amino acids will come into contact when their respective proteins bind. It is an important problem for understanding how proteins interact with each other; for example, antibody proteins recognize diseases by binding to antigens. We use the Database of Interacting Protein Structures (DIPS), a comprehensive dataset of protein complexes mined from the PDB29, and randomly select 15K samples for evaluation.

  • Ligand Binding Affinity (LBA) is an essential task for drug discovery applications. It predicts the strength of a candidate drug molecule’s interaction with a target protein. Specifically, we aim to forecast \(pK=-{\log }_{10}K\), where K is the binding affinity in Molar units. We use the PDBbind database30,31, a curated database containing protein-ligand complexes from the PDB and their corresponding binding strengths. The protein-ligand complexes are split such that no protein in the test dataset has more than 30% or 60% sequence identity with any protein in the training dataset.
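As a small worked example of the affinity transform above, the snippet below converts a dissociation constant into a pK value; the numeric value is hypothetical and only illustrates the unit handling (K must be expressed in molar units).

```python
import math

def pk_from_affinity(k_molar: float) -> float:
    """Convert a binding affinity K (in molar units) to pK = -log10(K)."""
    return -math.log10(k_molar)

# Hypothetical example: an affinity of 25 nM corresponds to K = 2.5e-8 M.
print(pk_from_affinity(2.5e-8))  # ~7.60
```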

Experimental setup

We evaluate our proposed framework on several state-of-the-art geometric networks, implemented with PyTorch32 and PyG33, on four standard protein benchmarks. For MQA, PPI, and LBA, we use GVP-GNN, EGNN, and Molformer as backbones. For PPRD, we use a deep learning model, EquiDock34, as the backbone; it approximates the binding pockets and obtains the docking poses using keypoint matching and alignment. For more experimental details, please refer to Supplementary Note 3.

Single-protein representation task

For MQA, we report First Rank Loss, Spearman correlation (RS), Pearson correlation (RP), and Kendall rank correlation (KR) in Table 1. The introduction of protein language models brings a significant average increase of 32.63% and 55.71% in global and mean RS, of 34.66% and 58.75% in global and mean RP, and of 43.21% and 63.20% in global and mean KR, respectively. With the aid of language models, GVP-GNN achieves the best global RS, global RP, and KR of 84.92%, 85.44%, and 67.98%, respectively.
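For reference, the sketch below shows one common way of computing the global (all decoys pooled) and mean (averaged per CASP target) correlations from predicted and true GDT-TS scores; the column names and data layout are assumptions for illustration rather than our exact evaluation code.

```python
import pandas as pd
from scipy.stats import kendalltau, pearsonr, spearmanr

def mqa_correlations(df: pd.DataFrame):
    """df is assumed to hold columns 'target', 'pred', and 'gdt_ts'."""
    def corrs(g):
        return pd.Series({
            "RS": spearmanr(g["pred"], g["gdt_ts"]).correlation,
            "RP": pearsonr(g["pred"], g["gdt_ts"])[0],
            "KR": kendalltau(g["pred"], g["gdt_ts"]).correlation,
        })

    global_corr = corrs(df).to_dict()                                # pool all decoys of all targets
    mean_corr = df.groupby("target").apply(corrs).mean().to_dict()   # average of per-target correlations
    return global_corr, mean_corr
```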

Table 1 Results on MQA.

Apart from that, we provide a full comparison with existing approaches in Table 2. We select RWplus35, ProQ3D36, VoroMQA37, SBROD38, 3DCNN2, 3DGNN2, 3DOCNN39, DimeNet40, GraphQA41, and GBPNet42 as baselines. Performance is recorded in Table 2, where the second-best result is underlined. Even though GVP-GNN is not the strongest architecture on its own, once enhanced by the protein language model it largely outperforms existing methods, including the state-of-the-art no-pretraining method of Aykent and Xia42 (i.e., GBPNet) and the state-of-the-art pretraining results of Jing et al.43.

Table 2 Comparison of performance on MQA.

Protein-protein representation tasks

For PPRD, we report three metrics in Table 3: the complex root mean squared deviation (RMSD), the ligand RMSD, and the interface RMSD. The interface is defined by a distance threshold of 8 Å. It is noteworthy that, unlike the EquiDock paper, we do not apply the Kabsch algorithm to superimpose the receptor and the ligand; instead, the receptor protein is kept fixed during evaluation. All three metrics decrease considerably, with improvements of 11.61%, 12.83%, and 31.01% in the complex, ligand, and interface median RMSD, respectively. Notably, we also report the result of EquiDock first pretrained on DIPS and then fine-tuned on DB5. DIPS-pretrained EquiDock still performs worse than EquiDock equipped with pretrained language models, which strongly suggests that structural pretraining for GGNNs may not benefit them as much as pretrained protein language models do.
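For clarity, a minimal sketch of the ligand and interface RMSD computations is given below; it assumes aligned arrays of Cα coordinates for the predicted and ground-truth complexes (receptor kept fixed, no Kabsch superposition, 8 Å interface cutoff) and only illustrates the metric definitions rather than reproducing the exact evaluation scripts.

```python
import numpy as np

def rmsd(a: np.ndarray, b: np.ndarray) -> float:
    """Root mean squared deviation between two (N, 3) coordinate arrays."""
    return float(np.sqrt(((a - b) ** 2).sum(axis=1).mean()))

def pprd_metrics(pred_lig, true_lig, receptor, cutoff=8.0):
    """pred_lig / true_lig: (N_l, 3) ligand Calpha coordinates (predicted / ground truth).
    receptor: (N_r, 3) receptor Calpha coordinates, kept fixed during evaluation."""
    # Complex RMSD over all residues of the complex (the receptor part is identical here).
    complex_rmsd = rmsd(np.vstack([receptor, pred_lig]),
                        np.vstack([receptor, true_lig]))
    ligand_rmsd = rmsd(pred_lig, true_lig)
    # Interface residues: ground-truth ligand/receptor residues within the cutoff of each other.
    dists = np.linalg.norm(true_lig[:, None, :] - receptor[None, :, :], axis=-1)
    lig_iface = (dists < cutoff).any(axis=1)
    rec_iface = (dists < cutoff).any(axis=0)
    interface_rmsd = rmsd(
        np.vstack([receptor[rec_iface], pred_lig[lig_iface]]),
        np.vstack([receptor[rec_iface], true_lig[lig_iface]]),
    )
    return complex_rmsd, ligand_rmsd, interface_rmsd
```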

Table 3 Performance of PPRD on DB5.5 Test Set.

For PPI, we report AUROC in Fig. 2. AUROC increases by 6.93%, 14.01%, and 22.62% for GVP-GNN, EGNN, and Molformer, respectively. It is worth noting that Molformer originally falls behind EGNN and GVP-GNN on this task, but after injecting knowledge learned by protein language models, it achieves competitive or even better performance than EGNN and GVP-GNN. This indicates that protein language models can realize the full potential of GGNNs and greatly narrow the gap between different geometric deep learning architectures. These results are remarkable because, unlike MQA, PPRD and PPI study the geometric interactions between two proteins. Though existing protein language models are all trained on single protein sequences, our experiments show that the evolutionary information hidden in unpaired sequences can also be valuable for analyzing multi-protein environments.

Fig. 2: Some ablation studies.

a Results of PPI with and without PLMs. b Performance of GGNNs on MQA with ESM-2 at different scales.

Protein-molecules representation task

For LBA, we compare RMSD, RS, RP, and KR in Table 4. The incorporation of protein language models produces a remarkable average decline of 11.26% and 6.15% in RMSD for the 30% and 60% identity splits, an average increase of 51.09% and 9.52% in RP, an average increase of 66.60% and 8.90% in RS, and an average increase of 68.52% and 6.70% in KR for the 30% and 60% identity splits, respectively. The improvements under the 30% sequence-identity split are higher than those under the less restrictive 60% split, which confirms that protein language models benefit GGNNs more when the unseen samples belong to different protein domains. Moreover, in contrast to PPRD and PPI, LBA studies how proteins interact with small molecules. Our results demonstrate that the rich protein representations encoded by protein language models can also contribute to analyzing proteins’ interactions with non-protein, drug-like molecules. The result for a different data split is given in Supplementary Table 1.

Table 4 Results on LBA.

In addition, we compare thoroughly with existing approaches for LBA in Table 5, where the second-best result is underlined. We select a broad range of models, including DeepAffinity44, Cormorant45, LSTM46, TAPE47, ProtTrans14, 3DCNN2, GNN2, MaSIF48, DGAT49, DGIN49, DGAT-GCN49, HoloProt50, and GBPNet42, as baselines. Even though EGNN is a mid-tier architecture on its own, when enhanced by protein language models it achieves the best RMSD and the best Pearson correlation, beating a group of strong baselines including HoloProt50 and GBPNet42.

Table 5 Comparison of performance on LBA.

Scale and type of protein language models

It has been observed that as the size of a language model increases, there are consistent improvements in tasks like structure prediction12. Here we conduct an ablation study to investigate the effect of protein language models’ sizes on GGNNs. Specifically, we explore ESM-2 variants with 8M, 35M, 150M, 650M, and 3B parameters and plot the results in Fig. 2. The results verify that scaling the protein language model is advantageous for GGNNs. Additional results can be found in Supplementary Note 4. We also provide a comparison of the influence of different types of PLMs in Supplementary Table 2 and investigate the difference in PLMs’ effectiveness with and without MSA in Supplementary Table 3.
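For reference, the sketch below loads the differently sized ESM-2 checkpoints through the public fair-esm package; the checkpoint names and representation-layer indices follow the public releases, and the mapping to the scales reported here should be read as an assumption rather than our exact configuration.

```python
import esm

# Public ESM-2 checkpoints at different scales: name -> (loader name, final representation layer).
ESM2_VARIANTS = {
    "8M":   ("esm2_t6_8M_UR50D",    6),
    "35M":  ("esm2_t12_35M_UR50D", 12),
    "150M": ("esm2_t30_150M_UR50D", 30),
    "650M": ("esm2_t33_650M_UR50D", 33),
    "3B":   ("esm2_t36_3B_UR50D",  36),
}

def load_esm2(scale: str):
    """Load an ESM-2 model of the requested scale and return it with its alphabet and layer index."""
    name, repr_layer = ESM2_VARIANTS[scale]
    model, alphabet = getattr(esm.pretrained, name)()
    return model.eval(), alphabet, repr_layer
```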

Limitations

Despite our successful confirmation that PLMs can promote geometric deep learning, several limitations and extensions of our framework are left open for future investigation. For instance, our 3D protein graphs are residue-level. We believe atom-level protein graphs would also benefit from our approach, but the magnitude of the improvement needs further exploration.

Conclusion

In this study, we investigate a problem that has long been ignored by existing geometric deep learning methods for proteins: how to employ the abundant protein sequence data for 3D geometric representation learning. To answer this question, we propose to leverage the knowledge learned by existing advanced pretrained protein language models and use their amino acid representations as the initial features. We conduct a variety of experiments, such as protein-protein docking and model quality assessment, to demonstrate the efficacy of our approach. Our work provides a simple but effective mechanism to bridge the gap between 1D sequential models and 3D geometric neural networks, and we hope it sheds light on how to combine the information encoded in different protein modalities.

Method

Sequence recovery analysis

Preliminary and motivations

It is commonly acknowledged that protein structures carry much more information than their corresponding amino acid sequences, and for decades it has been an open challenge for computational biologists to predict a protein’s structure from its amino acid sequence. Though AlphaFold (AF)51 and RoseTTAFold52 have made a huge step toward alleviating the limitation imposed by the number of available experimentally determined protein structures, neither AF nor its successors such as AlphaFold-Multimer53, IgFold54, and HelixFold55 are a panacea. Their predicted structures can be severely inaccurate when the protein is an orphan and lacks a multiple sequence alignment (MSA) as the template. Consequently, it is hard to conclude that protein sequences can be perfectly transformed into the structure modality by current tools and used as extra training resources for GGNNs.

Moreover, we argue that even though a conformation is a higher-dimensional representation, the prevailing learning paradigm may prevent GGNNs from capturing the knowledge that is uniquely preserved in protein sequences. Recall that GGNNs mainly differ in how they employ 3D geometry; the input features include distances56, angles40, torsions, and terms of other orders57. The position index hidden in protein sequences, however, is usually neglected when constructing 3D graphs for GGNNs. Therefore, in this section, we design a toy experiment to examine whether GGNNs can succeed in recovering this kind of positional information.

Protein graph construction

Here the structure of a protein can be represented as an atom-level or residue-level graph \(\mathcal{G}=(\mathcal{V},\mathcal{E})\), where \(\mathcal{V}\) and \(\mathcal{E}=(e_{ij})\) correspond to the sets of N nodes and M edges, respectively. Nodes have 3D coordinates \(\mathbf{x}\in\mathbb{R}^{N\times 3}\) and initial \(\psi_h\)-dimensional roto-translationally invariant features \(\mathbf{h}\in\mathbb{R}^{N\times\psi_h}\) (e.g., atom types and electronegativity, or residue classes). Normally, there are three options for constructing connectivity for molecules: r-ball graphs, fully-connected (FC) graphs, and K-nearest-neighbor (KNN) graphs. In our setting, nodes are linked to their K = 10 nearest neighbors for KNN graphs, and edges include all atom pairs within a distance cutoff of 8 Å for r-ball graphs.
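A minimal sketch of these connectivity schemes, written with the PyG helpers (torch-cluster backend) for a single unbatched protein graph, is given below; it mirrors the settings stated above (K = 10 for KNN graphs, an 8 Å cutoff for r-ball graphs) but is only an illustration of the construction, not our exact implementation.

```python
import torch
from torch_geometric.nn import knn_graph, radius_graph

def build_edges(pos: torch.Tensor, scheme: str = "knn") -> torch.Tensor:
    """pos: (N, 3) residue (or atom) coordinates for one protein.

    Returns an edge_index of shape (2, M) following the construction in the text:
    KNN graphs link each node to its K = 10 nearest neighbors, and
    r-ball graphs link all pairs within an 8 Angstrom cutoff.
    """
    if scheme == "knn":
        return knn_graph(pos, k=10)
    if scheme == "rball":
        return radius_graph(pos, r=8.0)
    if scheme == "fc":  # fully-connected graph over all distinct node pairs
        n = pos.size(0)
        idx = torch.arange(n)
        row, col = torch.meshgrid(idx, idx, indexing="ij")
        mask = row != col
        return torch.stack([row[mask], col[mask]], dim=0)
    raise ValueError(f"unknown scheme: {scheme}")
```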

Recovery from graphs to sequences

Since most prior studies choose to establish 3D protein graphs based on purely geometric information and ignore their sequential identities, this raises the following question about positional identity:

Can existing GGNNs identify the sequential position order only from geometric structures of proteins?

To answer this question, we formulate two categories of toy tasks (see Fig. 3). The first is absolute position recognition (APR), a classification task in which models are asked to directly predict each residue’s position index, ranging from 1 to N, the number of residues in the protein. This task adopts accuracy as the metric and expects models to discriminate the absolute position of an amino acid within the whole protein sequence. We report the distribution of protein sequence lengths in Supplementary Fig. 1.

Fig. 3: Illustration of the sequence recovery problem.

a Protein residue graph construction. Here we draw graphs in 2D for better visualization but study 3D graphs for GGNNs. b Two sequence recovery tasks. The first requires GGNNs to predict the absolute position index for each residue in the protein sequence. The second aims to forecast the minimum distance of each amino acid to the two sides of the protein sequence.

In addition, we propose a second task named relative position estimation (RPE), which focuses on the relative position of each residue. Models are required to predict the minimum distance of each residue to the two ends of the given protein, and the root mean squared error (RMSE) is used as the metric. This task examines the capability of GGNNs to distinguish which segment an amino acid belongs to (i.e., the central section of the protein or its ends).
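To make the two label definitions concrete, the sketch below builds the per-residue targets for a protein of length N; the 0-based indexing is an implementation convenience, whereas the text counts positions from 1 to N.

```python
import torch

def toy_task_labels(num_residues: int):
    """Per-residue labels for the two sequence recovery tasks.

    APR: the absolute position index of each residue (classification, accuracy metric).
    RPE: the minimum distance of each residue to either end of the sequence
         (regression, RMSE metric).
    """
    idx = torch.arange(num_residues)                          # 0, 1, ..., N-1
    apr = idx                                                 # absolute position index
    rpe = torch.minimum(idx, num_residues - 1 - idx).float()  # distance to the nearer end
    return apr, rpe

# For a protein with 7 residues:
# APR = [0, 1, 2, 3, 4, 5, 6], RPE = [0., 1., 2., 3., 2., 1., 0.]
```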

Experiments

Backbones

We adopt three technically distinct and broadly accepted GGNN architectures for empirical verification. Specifically, GVP-GNN7,43 extends standard dense layers to operate on collections of Euclidean vectors, performing both geometric and relational reasoning on efficient representations of macromolecules. EGNN58 is a translation-, rotation-, reflection-, and permutation-equivariant GNN that avoids expensive spherical harmonics. Molformer9 employs the self-attention mechanism on 3D point clouds while guaranteeing SE(3) equivariance.

Dataset

We use a small non-redundant subset of high-resolution structures from the PDB. Specifically, we use only X-ray structures with resolution < 3.0 Å and enforce a 60% sequence identity threshold. This results in a total of 2643, 330, and 330 PDB structures for the train, validation, and test sets, respectively. Experimental details, a summary of the database, and descriptions of these GGNNs are elaborated in Supplementary Notes 1 and 2.

Empirical results and analysis

Table 6 documents the overall results, where metrics are labeled with ↑/↓ if higher/lower is better, respectively. All GGNNs fail to recognize either the absolute or the relative positional information encoded in the protein sequences, with an accuracy lower than 1% and an extremely high RMSE.

Table 6 Results of two residue position identification tasks.

This phenomenon stems from the conventional ways of building graph connectivity, which usually exclude sequential information. Specifically, unlike common applications of GNNs such as citation networks59, social networks60, and knowledge graphs61, molecules do not have explicitly defined edges or adjacency. On the one hand, r-ball graphs use a cutoff distance, usually set as a hyperparameter, to determine the particle connections, but it is hard to guarantee that a cutoff properly includes all crucial node interactions for complicated and large molecules. On the other hand, FC graphs that consider all pairwise distances introduce severe redundancies, dramatically increasing the computational complexity, especially when proteins consist of thousands of residues; GGNNs also easily get confused by the excessive noise, leading to unsatisfactory performance. As a remedy, KNN has become a more popular choice for establishing graph connectivity for proteins34,62,63. However, none of these schemes take the sequential information into account, and they implicitly require GGNNs to learn the original sequential order during training.

The lack of sequential information can cause several problems. To begin with, residues are unaware of their relative positions in the protein. For instance, two residues can be close in 3D space but distant in the sequence, which can mislead models when recovering the correct backbone chain. Second, according to the characteristics of the message-passing (MP) mechanism, two residues in a protein with the same neighborhood are expected to share similar representations. Nevertheless, the roles of those two residues can differ significantly64 when they are located in different segments of the protein, so GGNNs may be incapable of differentiating two residues with the same 1-hop local structures. This restriction has already been noted by several works6,65, but none of them investigate it rigorously and thoroughly. Admittedly, sequential order may only be necessary for certain tasks, but this toy experiment strongly indicates that knowledge monopolized by amino acid sequences can be lost if GGNNs learn only from protein structures.

Integration of language models into geometric networks

As discussed before, learning about 3D structures cannot directly benefit from large amounts of sequential data. Consequently, the model sizes of GGNNs are limited; otherwise, overfitting may occur66. In contrast, comparing the number of protein sequences in the UniProt database67 to the number of known structures in the PDB, there are over 1700 times more sequences than structures. More importantly, the availability of new protein sequence data continues to far outpace the availability of experimental protein structure data, only increasing the need for accurate protein modeling tools.

Therefore, we introduce a straightforward approach to assist GGNNs with pretrained protein language models. To this end, we feed amino acid sequences into a protein language model, ESM-212 in our case, and extract the per-residue representations, denoted as \(\mathbf{h}' \in \mathbb{R}^{N\times\psi_{PLM}}\), where \(\psi_{PLM} = 1280\). Then \(\mathbf{h}'\) can be added or concatenated to the per-atom features \(\mathbf{h}\). For residue-level graphs, \(\mathbf{h}'\) directly replaces the original \(\mathbf{h}\) as the input node features.
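A minimal sketch of this feature-extraction step with the public fair-esm interface is shown below (here the 650M ESM-2 checkpoint, whose per-residue embeddings are 1280-dimensional); it should be read as an illustration of the pipeline rather than our exact implementation, and the example sequence is hypothetical.

```python
import torch
import esm

# Load ESM-2 (650M) and its tokenizer; representations from layer 33 are 1280-dimensional.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def per_residue_features(sequence: str) -> torch.Tensor:
    """Return the (N, 1280) per-residue representation h' for one protein sequence."""
    _, _, tokens = batch_converter([("protein", sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33], return_contacts=False)
    # Drop the BOS/EOS tokens so the rows align with the N residues.
    return out["representations"][33][0, 1 : len(sequence) + 1]

# h_prime can then replace (residue-level graphs) or be concatenated with
# (atom-level graphs) the original node features h before running the GGNN.
h_prime = per_residue_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")
```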

Notably, incompatibilities exist between experimental structures and their original amino acid sequences. Structures stored in PDB files are usually incomplete, and some stretches of residues are missing due to unavoidable experimental issues68; they therefore do not perfectly match the corresponding sequences (i.e., the FASTA sequences). There are two ways to address this mismatch. On the one hand, we can simply use the fragmentary sequence as a substitute for the complete amino acid sequence and forward it into the protein language model. On the other hand, we can leverage a dynamic programming algorithm provided by Biopython69 to perform pairwise sequence alignment and discard residues that do not exist in the PDB structures. We empirically find little difference between the two, so we adopt the former mechanism for simplicity.
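For the second option, a sketch of the pairwise alignment step with Biopython is shown below; it assumes one-letter residue strings for both the FASTA record and the residues actually resolved in the PDB file, and the scoring parameters are illustrative assumptions rather than the exact settings used in our pipeline.

```python
from Bio import Align

def match_fasta_to_structure(fasta_seq: str, structure_seq: str):
    """Globally align the full FASTA sequence to the (possibly fragmentary) sequence of
    residues resolved in the PDB file, and return the indices of FASTA positions that
    have a structural counterpart (the rest are discarded)."""
    aligner = Align.PairwiseAligner()
    aligner.mode = "global"
    # Mild, illustrative gap penalties for the dynamic-programming alignment.
    aligner.match_score = 1.0
    aligner.mismatch_score = -1.0
    aligner.open_gap_score = -2.0
    aligner.extend_gap_score = -0.5
    alignment = aligner.align(fasta_seq, structure_seq)[0]
    kept = []
    for (f_start, f_end), _ in zip(*alignment.aligned):
        kept.extend(range(f_start, f_end))  # FASTA residues present in the structure
    return kept
```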

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.