Integration of pre-trained protein language models into geometric deep learning networks

Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on the 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained by the limited quantity of structural data. Meanwhile, protein language models trained on substantial 1D sequences have shown burgeoning capabilities with scale across a broad range of applications. Several preceding studies combine these different protein modalities to promote the representation power of geometric neural networks, but fail to present a comprehensive understanding of their benefits. In this work, we integrate the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks and evaluate them on a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction. Our findings show an overall improvement of 20% over baselines. Strong evidence indicates that incorporating protein language models' knowledge enhances geometric networks' capacity by a significant margin and generalizes to complex tasks.


Macromolecules (e.g., proteins, RNAs, or DNAs) are essential to biophysical processes. While they can be represented using lower-dimensional representations such as linear sequences (1D) or chemical bond graphs (2D), a more intrinsic and informative form is the three-dimensional geometry 1. 3D shapes are critical not only to understanding the physical mechanisms of action but also to answering a number of questions associated with drug discovery and molecular design 2. Consequently, tremendous efforts in structural biology have been devoted to deriving insights from their conformations 3-5.
With the rapid advances of deep learning (DL) techniques, representing and reasoning about macromolecules' structures in 3D space has become an attractive challenge. In particular, different sorts of 3D information, including bond lengths and dihedral angles, play an essential role. To encode them, a number of 3D geometric graph neural networks (GGNNs) and CNNs 6-9 have been proposed that simultaneously respect crucial properties of Euclidean geometry, such as E(3)- or SE(3)-equivariance and symmetry. Notably, they are essential constituents of geometric deep learning (GDL), an umbrella term for networks generalized to Euclidean and non-Euclidean domains 10.
Meanwhile, the anticipated growth of sequencing promises unprecedented data on natural sequence diversity. The abundance of 1D amino acid sequences has spurred increasing interest in developing protein language models at the scale of evolution, such as the ESM series 11-13 and ProtTrans 14. These protein language models capture information about secondary and tertiary structure and generalize across a broad range of downstream applications. Explicitly, they have recently demonstrated strong capabilities in uncovering protein structures 12, predicting the effect of sequence variation on function 11, learning inverse folding 15, and many other general purposes 13.
With the fruitful progress in protein language models, more and more studies have considered enhancing GGNNs' ability by leveraging the knowledge of those protein language models 12,16,17. This is nontrivial because, compared to sequences, 3D structures are much harder to obtain and thus less prevalent. Consequently, learning on protein structures must make do with far less training data. For example, the SAbDab database 18 has merely 3K non-duplicate antibody-antigen structures, the SCOPe database 19 has 226K annotated structures, and the SIFTS database 20 comprises around 220K annotated enzyme structures. These numbers are orders of magnitude lower than the dataset sizes that have inspired major breakthroughs in the deep learning community. In contrast, while the Protein Data Bank (PDB) 21 holds approximately 182K macromolecular structures, databases like Pfam 22 and UniParc 23 contain more than 47M and 250M protein sequences, respectively.
In addition to data size, the benefit of protein sequences to structure learning has solid evidence and theoretical support. Remarkably, the idea that biological function and structure are documented in the statistics of protein sequences selected through evolution has a long history 24. The unobserved variables that determine a protein's fitness, including structure, function, and stability, leave a record in the distribution of observed natural sequences 25. Protein language models use self-supervision to unlock the information encoded in protein sequence variation, which is also beneficial for GGNNs. Accordingly, in this paper, we comprehensively investigate how the knowledge learned by protein language models promotes GGNNs' capability (see Fig. 1). The improvements come from two major sources. First, GGNNs can benefit from the information that emerges in the learned representations of protein language models about fundamental properties of proteins, including secondary structure, contacts, and biological activity. This kind of knowledge may be difficult for GGNNs to discover within a specific downstream task. To confirm this claim, we conduct a toy experiment demonstrating that conventional graph connectivity mechanisms prevent existing GGNNs from being cognizant of residues' absolute and relative positions in the protein sequence. Second, and more intuitively, protein language models serve as an alternative way of enriching GGNNs' training data, exposing GGNNs to many more protein families and thereby greatly strengthening their generalization capability.
We examine our hypothesis across a wide range of benchmarks covering model quality assessment, protein-protein interface prediction, protein-protein rigid-body docking, and ligand binding affinity prediction. Extensive experiments show that incorporating pretrained protein language models' knowledge significantly improves GGNNs' performance on these varied problems, each of which requires distinct domain knowledge. By utilizing the unprecedented view into the language of protein sequences provided by powerful protein language models, GGNNs promise to augment our understanding of a vast database of poorly understood protein structures. We hope our work sheds light on how to bridge the gap between thriving geometric deep learning and mature protein language models, and how to better leverage different modalities of proteins.

Results and discussion
Our toy experiments illustrate that existing GGNNs are unaware of the positional order inside protein sequences. Taking a step further, we show in this section that incorporating knowledge learned by large-scale protein language models robustly enhances GGNNs' capacity in a wide variety of downstream tasks.
, and randomly select 15K samples for evaluation.
• Ligand Binding Affinity (LBA) is an essential task for drug discovery applications. It predicts the strength of a candidate drug molecule's interaction with a target protein.
Specifically, we aim to forecast pK = −log10(K), where K is the binding affinity in molar units. We use the PDBbind database 30,31, a curated database of protein-ligand complexes from the PDB and their corresponding binding strengths. The complexes are split such that no protein in the test set has more than 30% or 60% sequence identity with any protein in the training set.
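The conversion from a measured binding constant to the regression target is a one-liner; a minimal sketch (the function name is ours, for illustration):

```python
import math

def binding_affinity_to_pk(k_molar):
    """Convert a binding constant K, in molar units, to the LBA
    regression target pK = -log10(K)."""
    return -math.log10(k_molar)

# A nanomolar binder (K = 1e-9 M) maps to pK = 9
pk = binding_affinity_to_pk(1e-9)  # -> 9.0
```

Stronger binders (smaller K) thus receive larger pK targets, which keeps the regression values in a convenient single-digit range.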
Experimental setup. We evaluate our proposed framework on several state-of-the-art geometric networks, implemented with PyTorch 32 and PyG 33, on four standard protein benchmarks. For MQA, PPI, and LBA, we use GVP-GNN, EGNN, and Molformer as backbones. For PPRD, we utilize a deep learning model, EquiDock 34, as the backbone; it approximates the binding pockets and obtains docking poses via keypoint matching and alignment. For more experimental details, please refer to Supplementary Note 3.
Single-protein representation task. For MQA, we report First Rank Loss, Spearman correlation (R_S), Pearson correlation (R_P), and Kendall rank correlation (K_R) in Table 1. Introducing protein language models brings significant average increases of 32.63% and 55.71% in global and mean R_S, 34.66% and 58.75% in global and mean R_P, and 43.21% and 63.20% in global and mean K_R, respectively. With the aid of language models, GVP-GNN achieves the best global R_S, global R_P, and K_R of 84.92%, 85.44%, and 67.98%, respectively.
Apart from that, we provide a full comparison with existing approaches in Table 2. We select RWplus 35, ProQ3D 36, VoroMQA 37, SBROD 38, 3DCNN 2, 3DGNN 2, 3DOCNN 39, DimeNet 40, GraphQA 41, and GBPNet 42 as baselines. Performance is recorded in Table 2, where the second best is underlined. Even though GVP-GNN is not the best architecture on its own, when enhanced by the protein language model it largely outperforms existing methods, including the state-of-the-art no-pretraining method of Aykent and Xia 42 (i.e., GBPNet) and the state-of-the-art pretraining results of Jing et al. 43.

Protein-protein representation tasks. For PPRD, we report three measurements in Table 3: the complex root mean squared deviation (RMSD), the ligand RMSD, and the interface RMSD. The interface is determined with a distance threshold of 8 Å. It is noteworthy that, unlike the EquiDock paper, we do not apply the Kabsch algorithm to superimpose the receptor and the ligand; instead, the receptor protein is fixed during evaluation. All three metrics decrease considerably, with improvements of 11.61%, 12.83%, and 31.01% in complex, ligand, and interface median RMSD, respectively. Notably, we also report the result of EquiDock first pretrained on DIPS and then fine-tuned on DB5. DIPS-pretrained EquiDock still performs worse than EquiDock equipped with pretrained language models, which strongly suggests that structural pretraining for GGNNs may not benefit them more than pretrained protein language models do.
For PPI, we report AUROC as the metric in Fig. 2. AUROC increases by 6.93%, 14.01%, and 22.62% for GVP-GNN, EGNN, and Molformer, respectively. It is worth noting that Molformer originally falls behind EGNN and GVP-GNN in this task, but after injecting knowledge learned by protein language models, Molformer achieves competitive or even better performance than EGNN or GVP-GNN. This indicates that protein language models can realize the potential of GGNNs to the full extent and greatly narrow the gap between different geometric deep learning architectures. These results are striking because, unlike MQA, PPRD and PPI study the geometric interactions between two proteins. Though existing protein language models are all trained on single protein sequences, our experiments show that the evolutionary information hidden in unpaired sequences can also be valuable for analyzing a multi-protein environment.

(Table 1 notes: The column 'PLM' indicates whether the protein language model is used. First Rank Loss is the average difference between the true scores of the best model and the top-ranked model for each target. Results are reported as mean ± standard deviation over three repeated runs, and the best performance is in bold. Table 2 notes: a, results taken from ref. 42; b, results taken from ref. 2; c, results reproduced.)

Protein-molecule representation task. For LBA, we compare RMSD, R_S, R_P, and K_R in Table 4. Incorporating protein language models produces a remarkable average decline of 11.26% and 6.15% in RMSD for the 30% and 60% identity splits, an average increase of 51.09% and 9.52% in R_P, an average increase of 66.60% and 8.90% in R_S, and an average increase of 68.52% and 6.70% in K_R. The improvements on the 30% sequence identity split are higher than those on the less restrictive 60% split, confirming that protein language models benefit GGNNs more when unseen samples belong to different protein domains. Moreover, in contrast to PPRD and PPI, LBA studies how proteins interact with small molecules. Our results demonstrate that the rich protein representations encoded by protein language models can also contribute to analyzing a protein's reaction to other non-protein, drug-like molecules. Results for a different data split are given in Supplementary Table 1.
In addition, we compare thoroughly with existing approaches for LBA in Table 5, where the second best is underlined. We select a broad range of models, including DeepAffinity 44, Cormorant 45, LSTM 46, TAPE 47, ProtTrans 14, 3DCNN 2, GNN 2, MaSIF 48, DGAT 49, DGIN 49, DGAT-GCN 49, HoloProt 50, and GBPNet 42, as baselines. Even though EGNN is a mid-tier architecture, it achieves the best RMSD and the best Pearson correlation when enhanced by protein language models, beating a group of strong baselines including HoloProt 50 and GBPNet 42.
Scale and type of protein language models. It has been observed that as the size of a language model increases, there are consistent improvements in tasks like structure prediction 12. Here we conduct an ablation study to investigate the effect of protein language models' size on GGNNs. Specifically, we explore ESM-2 variants with 8M, 35M, 150M, 650M, and 3B parameters and plot the results in Fig. 2, which verify that scaling the protein language model is advantageous for GGNNs. Additional results can be found in Supplementary Note 4. We also provide a comparison of the influence of different sorts of PLMs in Supplementary Table 2, and investigate the difference in PLMs' effectiveness with and without MSA in Supplementary Table 3.

(Table 5 notes: Models are sorted by the year they were released. Results are reported as mean ± standard deviation over three repeated runs; the best and second best performance are bolded and underlined, respectively. a, results taken from ref. 2; b, results taken from ref. 42; c, results copied from ref. 50; d, results reproduced.)
Limitations. Despite our successful confirmation that PLMs can promote geometric deep learning, several limitations and extensions of our framework are left open for future investigation. For instance, our 3D protein graphs are residue-level. We believe atom-level protein graphs would also benefit from our approach, but the resulting performance gain needs further exploration.

Conclusion
In this study, we investigate a problem that has long been ignored by existing geometric deep learning methods for proteins: how to employ the abundant protein sequence data for 3D geometric representation learning. To answer this question, we propose to leverage the knowledge learned by existing advanced pre-trained protein language models, using their amino acid representations as initial features. We conduct a variety of experiments, such as protein-protein docking and model quality assessment, to demonstrate the efficacy of our approach. Our work provides a simple but effective mechanism to bridge the gap between 1D sequential models and 3D geometric neural networks, and we hope it throws light on how to combine the information encoded in different protein modalities.

Method
Sequence recovery analysis
Preliminary and motivations. It is commonly acknowledged that protein structures contain much more information than their corresponding amino acid sequences, and for decades it has been an open challenge for computational biologists to predict a protein's structure from its amino acid sequence. Though the advent of AlphaFold (AF) 51 and RoseTTAFold 52 has gone a long way toward alleviating the limitation imposed by the number of experimentally determined protein structures, neither AF nor its successors, such as AlphaFold-Multimer 53, IgFold 54, and HelixFold 55, is a panacea. Their predicted structures can be severely inaccurate when a protein is an orphan and lacks a multiple sequence alignment (MSA) as a template. Consequently, it is hard to conclude that protein sequences can be reliably transformed into the structure modality by current tools and used as extra training resources for GGNNs. Moreover, we argue that even though conformation is a higher-dimensional representation, the prevailing learning paradigm may forbid GGNNs from capturing the knowledge that is uniquely preserved in protein sequences. Recall that GGNNs mainly differ in how they employ 3D geometry; their input features include distances 56, angles 40, torsions, and terms of other orders 57. The position index hidden in protein sequences, however, is usually neglected when constructing 3D graphs for GGNNs. Therefore, in this section, we design a toy trial to examine whether GGNNs can recover this kind of positional information.
Protein graph construction. The structure of a protein can be represented as an atom-level or residue-level graph G = (V, E), where V and E = (e_ij) correspond to the sets of N nodes and M edges, respectively. Nodes have 3D coordinates x ∈ ℝ^(N×3) and initial ψ_h-dimensional roto-translation-invariant features h ∈ ℝ^(N×ψ_h) (e.g., atom types and electronegativity, or residue classes). There are normally three options for constructing molecular connectivity: r-ball graphs, fully-connected (FC) graphs, and K-nearest-neighbor (KNN) graphs. In our setting, nodes are linked to their K = 10 nearest neighbors for KNN graphs, and edges include all atom pairs within a distance cutoff of 8 Å for r-ball graphs.
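The KNN and r-ball connectivity schemes described above can be sketched in a few lines of numpy (function names are ours, not from any released codebase):

```python
import numpy as np

def knn_edges(coords, k=10):
    """KNN graph: each node links to its k closest nodes in 3D
    (self-loops excluded), regardless of absolute distance."""
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                  # forbid self-loops
    nbrs = np.argsort(d, axis=1)[:, :k]          # k nearest per node
    return [(i, int(j)) for i in range(len(coords)) for j in nbrs[i]]

def rball_edges(coords, cutoff=8.0):
    """r-ball graph: link every pair of nodes within `cutoff` Å."""
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    i, j = np.where((d < cutoff) & (d > 0))      # exclude self-pairs
    return list(zip(i.tolist(), j.tolist()))
```

Note that neither construction consults the residue index: two residues adjacent in sequence but far apart in space receive no edge, which is exactly the blind spot the toy tasks below probe.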
Recovery from graphs to sequences. Since most prior studies establish 3D protein graphs from purely geometric information and ignore sequential identities, the following position identity question arises: can existing GGNNs identify the sequential position order only from the geometric structures of proteins?
To answer this question, we formulate two categories of toy tasks (see Fig. 3). The first is absolute position recognition (APR), a classification task: models are asked to directly predict each residue's position index, ranging from 1 to N, the number of residues in the protein. This task adopts accuracy as its metric and expects models to discriminate the absolute position of each amino acid within the whole protein sequence. We plot the distribution of protein sequence lengths in Supplementary Fig. 1.
In addition, we propose a second task, relative position estimation (RPE), that focuses on the relative position of each residue. Models are required to predict the minimum distance of a residue to the two ends of the given protein, and root mean squared error (RMSE) is used as the metric. This task examines the capability of GGNNs to distinguish which segment an amino acid belongs to (i.e., the center of the protein or its ends).
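Concretely, for a chain of N residues the two toy-task targets can be generated as follows (a sketch; whether distance-to-end is counted from 0 or 1 in the paper is our assumption):

```python
def apr_targets(n):
    """Absolute position recognition: each residue's label is its
    position index 1..N, treated as a class."""
    return list(range(1, n + 1))

def rpe_targets(n):
    """Relative position estimation: number of residues between each
    position and the nearer end of the chain (0-based here)."""
    return [min(i, n - 1 - i) for i in range(n)]

# e.g., a 5-residue chain
apr = apr_targets(5)  # [1, 2, 3, 4, 5]
rpe = rpe_targets(5)  # [0, 1, 2, 1, 0]
```

The RPE target is symmetric about the chain center, so a model only needs to know roughly which segment a residue sits in, a strictly easier requirement than APR.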

Experiments
Backbones: We adopt three technically distinct and broadly accepted GGNN architectures for empirical verification. Specifically, GVP-GNN 7,43 extends standard dense layers to operate on collections of Euclidean vectors, performing both geometric and relational reasoning on efficient representations of macromolecules. EGNN 58 is a translation-, rotation-, reflection-, and permutation-equivariant GNN that avoids expensive spherical harmonics. Molformer 9 employs the self-attention mechanism for 3D point clouds while guaranteeing SE(3)-equivariance.
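To make the equivariance property concrete, here is a minimal numpy sketch of one E(n)-equivariant message-passing layer in the spirit of EGNN; single-layer tanh maps stand in for MLPs, the graph is fully connected, and all names are ours, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(w, b, z):
    # one-layer stand-in for an MLP: linear map + tanh
    return np.tanh(z @ w + b)

def egnn_layer(h, x, w_e, b_e, w_x, b_x, w_h, b_h):
    """One E(n)-equivariant update: edge messages depend only on
    invariants (features, squared distances); coordinates move along
    relative position vectors, so rotations/translations commute
    with the layer."""
    n = h.shape[0]
    diff = x[:, None, :] - x[None, :, :]           # relative positions (n, n, 3)
    dist2 = (diff ** 2).sum(-1, keepdims=True)     # invariant squared distances
    edge_in = np.concatenate(
        [np.repeat(h[:, None, :], n, axis=1),      # sender features
         np.repeat(h[None, :, :], n, axis=0),      # receiver features
         dist2], axis=-1)
    m = mlp(w_e, b_e, edge_in) * (1.0 - np.eye(n)[..., None])  # no self-messages
    coef = mlp(w_x, b_x, m)                        # scalar weight per edge
    x_new = x + (diff * coef).sum(axis=1) / (n - 1)
    h_new = mlp(w_h, b_h, np.concatenate([h, m.sum(axis=1)], axis=-1))
    return h_new, x_new

# numerical equivariance check under a random rotation Q and shift t
h = rng.normal(size=(5, 4)); x = rng.normal(size=(5, 3))
w_e = rng.normal(size=(9, 8)); b_e = rng.normal(size=8)
w_x = rng.normal(size=(8, 1)); b_x = rng.normal(size=1)
w_h = rng.normal(size=(12, 4)); b_h = rng.normal(size=4)
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
t = rng.normal(size=3)
h1, x1 = egnn_layer(h, x, w_e, b_e, w_x, b_x, w_h, b_h)
h2, x2 = egnn_layer(h, x @ Q.T + t, w_e, b_e, w_x, b_x, w_h, b_h)
# h is invariant; x transforms with the input frame
```

Because every learned map consumes only invariant quantities, the layer never sees the residue index, which is exactly why the sequence recovery tasks above are hard for such architectures.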
Dataset: We exploit a small non-redundant subset of high-resolution structures from the PDB. Specifically, we use only X-ray structures with resolution < 3.0 Å and enforce a 60% sequence identity threshold, yielding 2643, 330, and 330 PDB structures for the train, validation, and test sets, respectively. Experimental details, a summary of the database, and descriptions of these GGNNs are elaborated in Supplementary Notes 1 and 2.
Empirical results and analysis: Table 6 documents the overall results, where metrics are labeled with ↑/↓ if higher/lower is better, respectively. All GGNNs fail to recognize either the absolute or the relative positional information encoded in the protein sequences, with an accuracy lower than 1% and an extremely high RMSE.
This phenomenon stems from the conventional ways of building graph connectivity, which usually exclude sequential information. Specifically, unlike common applications of GNNs such as citation networks 59, social networks 60, and knowledge graphs 61, molecules do not have explicitly defined edges or adjacency. On the one hand, r-ball graphs use a cutoff distance, usually set as a hyperparameter, to determine particle connections, but it is hard to guarantee that a cutoff properly includes all crucial node interactions for complicated and large molecules. On the other hand, FC graphs that consider all pairwise distances cause severe redundancy, dramatically increasing the computational complexity, especially when proteins consist of thousands of residues; besides, GGNNs are easily confused by the excessive noise, leading to unsatisfactory performance. As a remedy, KNN has become a more popular choice for establishing graph connectivity for proteins 34,62,63. However, none of these schemes takes account of the sequential information, leaving GGNNs to learn the original sequential order during training.
The lack of sequential information can cause several problems. To begin with, residues are unaware of their relative positions in the protein. For instance, two residues can be close in 3D space but distant in the sequence, which can mislead models trying to trace the correct backbone chain. Second, by the nature of the message-passing (MP) mechanism, two residues in a protein with the same neighborhood are expected to share similar representations. Nevertheless, the roles of those two residues can be significantly different 64 when they are located in different segments of the protein. Thus, GGNNs may be incapable of differentiating two residues with the same 1-hop local structures. This restriction has already been identified by several works 6,65, but none of them makes a strict and thorough investigation. Admittedly, sequential order may only be necessary for certain tasks, but this toy experiment strongly indicates that the knowledge monopolized by amino acid sequences can be lost if GGNNs learn only from protein structures.
Integration of language models into geometric networks. As discussed before, learning on 3D structures cannot directly benefit from large amounts of sequence data; consequently, the model sizes of GGNNs are limited, or otherwise overfitting may occur 66. By contrast, comparing the number of protein sequences in the UniProt database 67 to the number of known structures in the PDB, there are over 1700 times more sequences than structures. More importantly, the availability of new protein sequence data continues to far outpace the availability of experimental protein structure data, only increasing the need for accurate protein modeling tools.
Therefore, we introduce a straightforward approach to assist GGNNs with pretrained protein language models. To this end, we feed amino acid sequences into a protein language model, ESM-2 12 in our case, and extract the per-residue representations, denoted as h′ ∈ ℝ^(N×ψ_PLM), where ψ_PLM = 1280. Then h′ can be added or concatenated to the per-atom features h. For residue-level graphs, h′ directly replaces the original h as the input node features.
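The injection itself is a one-line operation per graph; a sketch with illustrative names (in practice, the 'add' variant needs a learned projection whenever ψ_PLM differs from ψ_h):

```python
import numpy as np

def inject_plm_features(h, h_plm, mode="concat"):
    """Fuse per-residue PLM embeddings h_plm (N x psi_PLM) with the
    original invariant node features h (N x psi_h)."""
    if mode == "concat":                 # append PLM channels to h
        return np.concatenate([h, h_plm], axis=-1)
    if mode == "add":                    # shapes must already match
        return h + h_plm
    if mode == "replace":                # residue-level graphs: drop h
        return h_plm
    raise ValueError(f"unknown mode: {mode}")

# e.g., 50 residues, 16 geometric feature channels, 1280-dim ESM-2 embeddings
h, h_plm = np.zeros((50, 16)), np.zeros((50, 1280))
node_feats = inject_plm_features(h, h_plm)   # shape (50, 1296)
```

Because the fused features remain roto-translation invariant, no change to the downstream GGNN architecture is required; only the input feature width grows.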
Notably, an incompatibility exists between an experimental structure and its original amino acid sequence: structures stored in PDB files are usually incomplete, with some stretches of residues missing due to unavoidable experimental issues 68, so they do not perfectly match the corresponding (FASTA) sequences. There are two ways to address this mismatch. On the one hand, we can simply use the fragmentary sequence as a substitute for the integral amino acid sequence and forward it into the protein language model. On the other hand, we can leverage a dynamic programming algorithm provided by Biopython 69 to perform pairwise sequence alignment and discard residues that do not exist in the PDB structure. We empirically find no significant difference between the two, so we adopt the former for simplicity.
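The second option amounts to computing a mask over the full sequence; the idea can be sketched with the standard library alone (difflib stands in for a proper dynamic-programming aligner such as Biopython's, and the sequences are invented):

```python
from difflib import SequenceMatcher

def resolved_mask(fasta_seq, pdb_seq):
    """For each residue of the full FASTA sequence, mark whether it is
    resolved in the (possibly fragmentary) PDB-derived sequence."""
    keep = [False] * len(fasta_seq)
    sm = SequenceMatcher(a=fasta_seq, b=pdb_seq, autojunk=False)
    for block in sm.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            keep[i] = True                # residue present in the structure
    return keep

fasta = "MKTAYIAKQR"          # hypothetical full sequence
pdb = "MKTA" + "AKQR"         # residues 5-6 ("YI") missing from the structure
mask = resolved_mask(fasta, pdb)
# mask marks positions 5-6 as unresolved
```

Per-residue PLM embeddings at masked-out positions would then simply be dropped so the remaining rows line up with the structure's nodes.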
Reporting summary. Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Fig. 1
Fig. 1 Illustration of our framework to strengthen GGNNs with knowledge from protein language models. The protein sequence is first forwarded into a pretrained protein language model to extract per-residue representations, which are then used as node features in 3D protein graphs for GGNNs.

Fig. 2
Fig. 2 Ablation studies. a Results of PPI with and without PLMs. b Performance of GGNNs on MQA with ESM-2 at different scales.

Fig. 3
Fig. 3 Illustration of the sequence recovery problem. a Protein residue graph construction. Here we draw graphs in 2D for better visualization but study 3D graphs for GGNNs. b Two sequence recovery tasks. The first requires GGNNs to predict the absolute position index of each residue in the protein sequence. The second aims to forecast the minimum distance of each amino acid to the two ends of the protein sequence.

Table 1
Results on MQA.

Table 3
Performance of PPRD on DB5.5 Test Set.
Models with ♣ are directly trained and tested on DB5, while EquiDock with ♠ is first pretrained on DIPS and fine-tuned on the DB5 training set. Results are reported as mean ± standard deviation over three repeated runs, and the best performance is in bold.

Table 2
Comparison of performance on MQA. Models are sorted by the year they were released. Results are reported as mean ± standard deviation over three repeated runs; the best and second best performance are bolded and underlined, respectively.

Table 4
Results on LBA. Results are reported as mean ± standard deviation over three repeated runs, and the best performance is in bold.

Table 5
Comparison of performance on LBA.