Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks

Guo, Zhiye; Liu, Jian; Skolnick, Jeffrey; Cheng, Jianlin

doi:10.1038/s41467-022-34600-2

Download PDF

Article
Open access
Published: 15 November 2022

Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks

Nature Communications volume 13, Article number: 6963 (2022) Cite this article

3685 Accesses
15 Citations
4 Altmetric
Metrics details

Subjects

Abstract

Residue-residue distance information is useful for predicting tertiary structures of protein monomers or quaternary structures of protein complexes. Many deep learning methods have been developed to predict intra-chain residue-residue distances of monomers accurately, but few methods can accurately predict inter-chain residue-residue distances of complexes. We develop a deep learning method CDPred (i.e., Complex Distance Prediction) based on the 2D attention-powered residual network to address the gap. Tested on two homodimer datasets, CDPred achieves the precision of 60.94% and 42.93% for top L/5 inter-chain contact predictions (L: length of the monomer in homodimer), respectively, substantially higher than DeepHomo’s 37.40% and 23.08% and GLINTER’s 48.09% and 36.74%. Tested on the two heterodimer datasets, the top Ls/5 inter-chain contact prediction precision (Ls: length of the shorter monomer in heterodimer) of CDPred is 47.59% and 22.87% respectively, surpassing GLINTER’s 23.24% and 13.49%. Moreover, the prediction of CDPred is complementary with that of AlphaFold2-multimer.

DNCON2_Inter: predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning

Article Open access 10 June 2021

Protein–protein contact prediction by geometric triangle-aware protein language models

Article 19 October 2023

Analyzing effect of quadruple multiple sequence alignments on deep learning based protein inter-residue distance prediction

Article Open access 07 April 2021

Introduction

Proteins are a key building block of life. The function of a protein is largely determined by its three-dimensional structure¹. Sometimes single-chain proteins (monomers) can perform certain functions, while the structures of most individual proteins interact to form multi-chain complex structures (multimers) to carry out their biological function². Therefore, modeling the three-dimensional structure of both monomers and protein complexes is crucial for studying protein function.

Deep learning has been applied to advance the prediction of the tertiary structures of monomers since 2012³. Over a decade, many deep learning methods were developed to predict intra-chain residue-residue contact maps or distance maps of monomers^4,5,6,7,8, which were used by contact/distance-based modeling methods such as CONFOLD⁹ and Rosetta¹⁰ to build their tertiary structures. Extensive studies^9,11,12,13 have shown that if a sufficiently accurate intra-chain distance map is predicted, then the protein’s tertiary structure can be accurately constructed. Most recently, AlphaFold2¹⁴ uses an end-to-end deep learning method to predict both tertiary structures and residue-residue distances of monomers, achieved a very high average accuracy (~90 Global Distance Test (GDT-TS) score¹⁵ in the 14th Critical Assessment of Techniques for Protein Structure Prediction (CASP14) in 2020. Recently, AlphaFold2 was extended to AlphaFold-multimer¹⁶ and AF2Complex¹⁷ to improve the prediction of quaternary structures of multimers.

Following the deep learning revolution in the prediction of intra-chain residue-residue distances and tertiary structures, recently some deep learning methods were developed to predict the inter-chain residue-residue contact map of homodimers and/or heterodimers, such as ComplexContact¹⁸, DeepHomo¹⁹, DRcon²⁰, and GLINTER²¹ that predicts the contact map for both homodimers and heterodimers using as input a graph representation of protein monomer structure and the row attention maps generated from multiple sequence alignments (MSAs) by the MSA transformer²². The attention map calculated by the MSA transformer is a kind of residue-residue co-evolutionary feature extracted from MSAs. It has been automatically trained on millions of MSAs to capture the co-evolutionary information across many diverse protein families during its unsupervised pretraining. Despite the significant progress, the accuracy of inter-chain contact prediction is still much lower than that of intra-chain contact/distance prediction, which calls for the development of more methods to tackle this problem.

In this work, we develop a protein complex distance prediction method (CDPred) based on a deep learning architecture combining the strengths of the deep residual network²³, a channel-wise attention mechanism, and a spatial-wise attention mechanism to predict the inter-chain distance maps of both homodimers and heterodimers. As in GLINTER, the attention map of the MSA generated by the MSA transformer is used as one input for CDPred. The predicted distance map for monomers in dimers is used as another input feature. Different from the existing deep learning methods, CDPred predicts inter-chain distances rather than binary inter-chain contacts (contact or no contact) that the current methods, such as DeepHomo and GLINTER predict. We test the CDPred rigorously on two homodimer test datasets and two heterodimer test datasets. For these datasets, CDPred yields much higher accuracy than DeepHomo and GLINTER.

Results

Evaluation of inter-chain contact prediction for homodimers

We compare CDPred with DNCON2_inter²⁴, DeepComplex²⁵, DeepHomo, and GLINTER on the HomoTest1 homodimer test dataset with the results shown in Table 1. The input tertiary structures for all three methods are predicted structures corresponding to the unbound monomer structures. The DNCON2_inter is run with the recommended parameters. The DeepComplex web server is used to get its prediction results. The results of DeepHomo are obtained from its publication. Three versions of CDPred are tested. The first version (CDPred_BFD) uses the MSAs generated from the BFD database as input. The second version (CDPred_Uniclust) uses the MSAs generated from the Uniclust30 database as input. The third version (CDPred) uses the average of the distance maps predicted by CDPred_BFD and CDPred_Uniclust as the prediction. Because DeepHomo and GLINTER predict binary inter-chain contacts at an 8 Å threshold instead of distances, we convert the inter-chain distance predictions of CDPred, CDPred_BFD, and CDPred_Uniclust into binary contact predictions for comparison. The definition of inter-chain contact is the same as GLINTER and DeepHomo, i.e., a pair of inter-chain residues is considered to be in contact if the distance between their two closest heavy atoms is less than 8 Å. This definition is used to evaluate all the inter-chain contact predictions in this work.

Table 1 The precision of top 5, top 10, top L/10, top L/5, and top L contact predictions, accuracy order (AccOrder), accuracy rate (AccRate), and AUC (area under receiver operating characteristic curve) score on the HomoTest1 test dataset for DNCON2_inter, DeepComplex, DeepHomo, GLINTER, and three versions of CDPred

Full size table

CDPred achieves the highest contact prediction precision across the board among all the methods. For instance, CDPred has a top L/5 contact prediction precision of 60.94%, which is 50.34% percentage points higher than DNCON2_inter, 9.64% percentage points higher than DeepComplex, 23.54% percentage points higher than DeepHomo, and 12.85% percentage points higher than GLINTER. CDPred performs better than DNCON2_inter, DeepComplex, DeepHomo, and GLINTER also in terms of Accuracy Rate and AUC score and second best in terms of Accuracy Order. According to almost all the evaluation metrics, CDPred performs better than both CDPred_BFD and CDPred_Uniclust, indicating that averaging the distance predictions made from the two kinds of MSAs can improve the prediction accuracy.

We also compared the methods above on the HomoTest2 homodimer test dataset (Table 2). CDPred performs best in terms of all the evaluation metrics. Combining the predictions of CDPred from two kinds of MSAs improves the prediction accuracy.

Table 2 The precision of top 5, top 10, top L/10, top L/5, and top L contact predictions, accuracy order, accuracy rate, and AUC score on the HomoTest2 test dataset for DeepHomo, GLINTER, and CDPred predictors

Full size table

The impact of MSA depth on the accuracy of inter-chain contact prediction for homodimers

Section 2.1 shows that two different MSAs (BFD and Uniclust) lead to different prediction accuracy for CDPred_BFD and CDPred_Uniclust, and CDPred that averages the two contact maps predicted from the two MSAs yields the best result. Here, we investigate how the depth of MSAs and a direct combination of the two MSAs may affect prediction accuracy. Supplementary Tables 1, 2 reports the number of sequences and the number of effective sequences (Neff)²⁶ for each dimer in HomoTest1 and HomoTest2 as well as the top L/2 contact prediction precision of CDPred_BFD, CDPred_Uniclust, CDPred, and CDPred_ComMSA that uses the simple combination of the BFD MSA and Uniclust MSA as input. Neff weights similar sequences in MSA less in counting the number of sequences and is widely used to measure the depth of MSA.

The Neff and contact prediction precision for CDPred_BFD and CDPred_Uniclust vary from target to target. The Pearson correlation coefficient between the difference of Neff and the difference of the top L/2 precision for CDPred_BFD and CDPred_Uniclust is 0.31 and 0.67 on HomoTest1 and HomoTest2, respectively, indicating that the depth of MSA has some positive impact on the contact prediction precision. CDPred_ComMSA, which combines the two MSAs to generate a deeper MSA as input, performs better than both CDPred_BFD and CDPred_Uniclust on HomoTest1 and better than CDPred_BFD on HomoTest2, suggesting that directly combining two MSAs can be beneficial.

CDPred still performs slightly better than CDPred_ComMSA in terms of top L/2 prediction precision on both datasets (55.19 versus 55.13% on HomoTest1 and 38.14 versus 36.14% on HomoTest2), indicating that averaging the distance maps predicted from the two MSAs is more effective than simply combining the two MSAs as input.

Evaluation of inter-chain contact prediction for heterodimers

We compare CDPred and a state-of-the-art heterodimer contact predictor GLINTER on both HeteroTest1 and HeteroTest2 heterodimer test datasets (see results in Tables 3, 4, respectively). The input tertiary structures of monomers used by both methods are predicted by AlphaFold2. We use two different orders of monomer A and monomer B (AB and BA) in each heterodimer to generate input features for CDPred to make predictions. The average of the outputs of the two orders is used as the final prediction. The process of averaging the two outputs is shown in Supplementary Figure 1. The inter-chain part of the BA prediction map is taken out and transposed to the same shape as its counterpart in the AB prediction map before they are averaged.

Table 3 The evaluation of contact predictions on the HeteroTest1 test dataset for the DeepComplex, GLINTER, and CDPred

Full size table

Table 4 The evaluation of contact predictions on the HeteroTest2 test dataset for the DeepComplex, GLINTER, and CDPred

Full size table

On the HeteroTest1 dataset (Table 3), CDPred achieves much better performance than GLINTER in terms of all the metrics. It is also substantially better than DeepComplex in terms of all the metrics but Accuracy Order. For instance, the top Ls/5 contact prediction precision of CDPred, 47.59% is more than twice 23.24% that of GLINTER, and 40.19% percentage points higher than DeepComplex. On the HeteroTest2 dataset (Table 4), CDPred also substantially outperforms DeepComplex and GLINTER in terms of all the metrics (contact precisions, Accuracy Order, Accurate Rate, and/or AUC).

Supplementary Tables 3, 4 compare the performance of using the two different orders of monomers as input (CDPred(A_B) and CDPred(B_A)) and averaging the outputs of the two different orders (CDPred) on the HeteroTest1 and HeteroTest2 datasets, respectively. The accuracy of CDPred(A_B) and CDPred(B_A) varies from target to target and from dataset to dataset. Sometimes the precision of the two orders can be substantially different (see Supplementary Fig. 2 for a target-by-target comparison of the precision on HeteroTest1 and HeteroTest2). However, a two-sided pairwise t-test shows that there is no significant difference between the two on average. Even though averaging the contact maps predicted in two different orders does not always yield the best accuracy, it makes the performance more stable by reducing the variance and smoothing the prediction. For instance, CDPred often delivers either the best or medium prediction accuracy in comparison with CDPred(A_B) and CDPred(B_A).

Furthermore, we divide the top L/10 contact prediction precisions for the heterodimers in the more challenging HeteroTest2 dataset into four equal intervals and plot the number of heterodimers in each interval (Fig. 1). The precision of the predictions in the four internals is bifurcated, mainly centered on a low precision interval [0–25%] and a high precision interval [75–100%]. Forty heterodimers have low contact prediction precision in the range of 0–25%, indicating there is still a large room for improvement. One reason for the low precision is that most of the 40 heterodimers have shallow MSAs. The Pearson correlation coefficient between the logarithm of the number of effective sequences (Neff) of MSA and the top L/10 complex contact precision is 0.46, indicating a modest correlation between the two.

**Fig. 1: The histogram of the precision of the top L/10 contact predictions for the heterodimers in the HeteroTest2 dataset.**

It is also observed that the inter-chain contact prediction accuracy for heterodimers is lower than for homodimers on average. One reason is that the MSA generation for a homodimer only needs to generate an MSA for a monomer in the homodimer, which is usually much deeper than the MSA generated for a heterodimer that requires the pairing of the related sequences in the MSAs of two different monomers in the heterodimer. Another reason is that homodimers tend to have a larger interaction interface than heterodimers on average, making the prediction easier.

Comparison of the co-evolutionary features generated by the statistical optimization method and deep learning method

To compare the performance of the co-evolutionary feature generated by the statistical optimization tool —CCMPred and the deep learning tool—MSA transformer, we trained two different models on the two different kinds of co-evolutionary features of the same training dataset using the same neural network architecture. One network (CDPred_PLM) is trained on the PLM co-evolutionary features generated by CCMPred. Another one (CDPred_ESM) is trained on the row attention map features generated by the MSA transformer. The precision of the top L/10 contact predictions of the two models on the four different test datasets are plotted in Fig. 2. CDPred_ESM has better performance than CDPred_PLM on all four test datasets, indicating that the co-evolutionary feature extracted automatically by the deep learning method is more informative than by the statistical optimization method of maximizing direct co-evolutionary signals. However, combining the two kinds of co-evolutionary features yields even better results (see the results in Tables 1, 2, 3, and 4). Supplementary Figure 3 plots the top L/10 precision of CDPred_ESM against the top L/10 precision of CDPred_PLM for the homodimers in the two homodimer test datasets and the heterodimers in the two heterodimers test datasets, respectively. For 42 out of 51 homodimers and 55 out of 64 heterodimers, CDPred_ESM has higher precision than CDPred_PLM. Both CDPred_ESM and CDPred_PLM can perform better on some targets, indicate the co-evolutionary features used by the two methods have some complementarity.

**Fig. 2: Comparison between CDPred_PLM (blue) and CDPred_ESM (orange) on four different test datasets.**

The impact of the quality of predicted tertiary structures of monomers on inter-chain distance prediction of dimers

The quaternary structure of a protein complex depends on the tertiary structure of its monomer units. As Alphafold can predict the tertiary structure of monomers very well, we investigated how effectively AlphaFold-predicted tertiary structures can be applied to predict inter-chain distance maps for protein complexes. The TM-scores of the predicted tertiary structure for each monomer unit of each dimer and the contact prediction precision of CDPred on the four datasets (HomoTest1, HomoTest2, HeteroTesst1, and HeteroTest2) are shown in Supplementary Tables 5, 6, 7, 8, respectively. The average TM-scores of the predicted tertiary structures for HomoTest1 and HomoTest2 are 0.95 and 0.90, for Chain A of heterodimers in HeteroTest1 and HeteroTest2 are 0.90 and 0.89, and for Chain B of heterodimers in HeteroTest1 and HeteroTest2 are 0.95 and 0.88, respectively, indicating the AlphaFold-predicted tertiary structures have high quality. The Pearson’s correlation between the TM-score of the predicted tertiary structures and top L/2 contact prediction precision is 0.19. The weak correlation may be partly due to that the quality of predicted tertiary structures is high enough in general for CDPred to leverage most tertiary structure information to predict inter-chain distances.

Moreover, we compared the top L/2 inter-chain contact prediction precision of using AlphaFold-predicted tertiary structures of monomers as input and using true tertiary structures of monomers in the bound state as input on the four datasets (Fig. 3). Using the true tertiary structures yields slightly better performance than using the AlphaFold-predicted structures on three out of four datasets (HomoTest1, HomoTest2, and HeteroTest2), but slightly worse performance on HeteroTest1. The p value of the pairwise t-test of the difference on the four datasets is 0.6802, 0.8892, 0.9083, and 0.9963, respectively, indicating that the difference is not significant. The results show that the AlphaFold-predicted tertiary structures are sufficiently accurate for CDPred to make inter-chain distance prediction, even though using true tertiary structures as input can slightly improve the prediction accuracy overall. This is different from GLINTER whose accuracy of using true tertiary structures as input is substantially higher than using AlphaFold-predicted tertiary structures as input²¹.

Fig. 3: Comparison of using AlphaFold-predicted tertiary structure (blue) and true tertiary structure (yellow) to generate intra-chain distance maps as input for predicting inter-chain distance maps on the four datasets.

High correlation between the precision of inter-chain contact predictions and predicted probability scores

The previous work on the intra-chain distance prediction²⁷ shows that the intra-chain distance prediction accuracy and predicted probability scores have a strong correlation, which can be used to select predicted intra-chain distance maps. Here, we investigate if a similar correlation exists in the inter-chain distance prediction. Figure 4 is a plot of the precision of top L/5 inter-chain contact predictions and the average of their probability scores for each target in the four test datasets. The correlation between the top L/5 inter-chain contact precision and the average predicted probability score is 0.7345. The high correlation suggests that the probability of inter-chain contacts predicted by CDPred can be used to estimate the confidence of the inter-chain prediction.

**Fig. 4: The plot of inter-chain contact prediction precision against average contact probability.**

The comparison between CDPred and AlphaFold2-multimer

AlphaFold2-multimer is currently the state-of-the-art method for predicting quaternary structures of multimers. To investigate if CDPred is complementary with AlphaFold2-multimer, we compare their inter-chain contact prediction accuracy on the four datasets. The comparison is not completely fair because the redundancy between the test datasets and AlphaFold2-multimer’s training dataset is not removed. We ran the latest version (Version 2) of AlphaFold2-multimer without templates to predict the quaternary structures for the dimers in the four test datasets. The inter-chain distance maps are extracted from the predicted quaternary structures. Each distance in the map is inverted to generate a contact probability map to be compared with the inter-chain contact map predicted by CDPred. Supplementary Figure 4 presents a target-by-target comparison of the top L/2 inter-chain contact prediction precision of CDPred and AlphaFold2-multimer for each target in the four test datasets. AlphaFold2-multimer has higher top L/2 precision than CDPred on the majority of the targets. However, for the very hard 44 targets on which the top L/2 precision of AlphaFold2-multimer is less than 10%, CDPred performs better than AlphaFold2-multimer on 15 targets, equally on 25 targets, and worse on 4 targets. On the 19 hard targets that the two methods perform differently, the average precision of CDpred is 14.8%, much higher than 1.79% of AlphaFold2-multimer. The p value of the two-sided pairwise t-test of the difference is 0.0068, indicating it is significant. For instance, for target 7LB6, the top L/2 precision of CDPred is 44.62%, much higher than 0% of AlphaFold2-multimer. The Neff of the MSA of the target is 16.6. The results show that CDPred is complementary with AlphaFold2-multimer and can be particularly useful when the target is very hard and AlphFold2-multimer prediction has very low confidence. One possible application of CDPred is to use its predicted distance map to rank and select diverse quaternary structural models of hard targets predicted by AlphaFold2-multimer.

An interesting inter-chain distance prediction example

Typically, when the MSA is shallow, the precision of inter-chain distance prediction is low due to the lack of information. However, CDPred still can accurately predict inter-chain distance for some targets with shallow MSAs. Figure 5 shows such a CASP13 homodimer target T0991, the distance map is visualized by matplotlib²⁸. Its MSA has only one sequence. The TM-score²⁹ of the tertiary structure of the monomer of T0991 predicted by AlphaFold2 is 0.3104, indicating the predicted tertiary structure fold is not correct. However, the precision of top L/10, top L/5, and top L/2 inter-chain contacts derived from the distance map predicted by CDPred is 72.73, 68.18, and 56.36%, respectively, which is high. Figures 5a, b show the intra-chain distance maps of the AlphaFold-predicted tertiary structure and the true tertiary structure of the monomer, Fig. 5c shows the inter-chain contact map predicted by CDPred, and Fig. 5d the true inter-chain contact map. The predicted inter-chain contact map accurately recalls a large portion of the true inter-chain contacts.

**Fig. 5: The prediction for homodimer T0991 with a shallow MSA.**

Methods

Attention-based neural network architecture

Figure 6 illustrates the overall architecture of CDPred based on the channel-wise and spatial-wise attention mechanisms. CDPred takes the tertiary structures of monomers of a dimer as input and extracts the monomer sequences and intra-chain distance maps. For homodimers, since the sequences of the two monomers of a homodimer are the same, only one monomer tertiary structure is used as input. The monomer sequences are used to search the protein sequence databases to generate MSAs of dimers, which are used to generate residue-residue co-evolutionary scores, row attention maps, and position-specific scoring matrix (PSSM) as input features (see Features Subsection 4.2 for details). The complete input for CDPred is the concatenation of all the input features.

**Fig. 6: Overview of the CDPred architecture.**

The input features stored in 2D tensors of multiple channels are first transformed by a 2D convolutional layer, followed by a Maxout layer³⁰ to reduce the dimensionality. The output of the Maxout layer is used as input for a series of deep residual network blocks empowered by the attention mechanism. The residual network has been widely used in computer vision and protein intra-chain distance and contact prediction^5,7,31. Here, we combine the residual connection with other useful components to construct a residual block, which includes the normalization block (called RCIN) consisting of a row normalization layer (RN), column normalization layer (IN)³², and instance normalization (IN)³³ for normalizing the feature maps, a channel attention squeeze-and-excitation (SE) block³⁴ for capturing the important information of different feature channels, and a spatial attention block³⁵ that captures signals between residues right after the channel attention block. Following the residual blocks, a 2D convolutional layer with the softmax function is used to classify the distance between any two residues from two monomers in a dimer into 42 distance bins (i.e., 40 bins from 2 to 22 Å with a bin size of 0.5 Å, plus a 0–2 Å bin and a >22 Å bin). Two kinds of inter-chain residue-residue distance are predicted at the same time: (1) the distance between the two closest heavy atoms from two residues used by most existing works in the field and (2) the C_b-C_b distance between two residues used by some recent works³⁶, resulting in two kinds of distance maps predicted.

Features

The input features of CDPred contain (1) the tertiary structure information of monomers in the form of an intra-chain distance map, (2) pairwise co-evolutionary features, and (3) sequential amino acid conservation features, which are stored in an L × L × N tensor (L is the length of the sequence of a monomer for a homodimer or the sum of the length of two monomers (L1 + L2) for a heterodimer). N is the number of feature channels for each pair of residues.

Tertiary structure information of monomers

The protein tertiary structure information of a monomer in a dimer is represented as an intra-chain distance map storing the distance between C_b atoms of two residues in the monomer. For a homodimer, an intra-chain distance map (L × L × 1) computed from the tertiary structure of only one monomer is used. For a heterodimer, two intra-chain distance maps (L1 × L1 × 1 and L2 × L2 × 1) of the two monomers in the heterodimer are computed from their tertiary structures and added as the top left submatrix and the bottom right submatrix of the input distance map of the dimer of (L1 + L2) × (L1 + L2) × 1 dimension. The values of the other area of the input distance map of the heterodimer are set to 0. In the training phase, the true tertiary structures of monomers in the dimers are used to compute the intra-chain distance maps above. During the test/prediction phase, the tertiary structures of monomers predicted by AlphaFold are used to generate the intra-chain distance maps as input. Using predicted tertiary structures as input is more challenging but can more objectively evaluate the performance of inter-chain distance prediction because, in most situations, the true tertiary structures of the monomers are not known. A predicted tertiary structure also corresponds to an unbound tertiary structure, a term commonly used in the protein docking field.

Co-evolutionary features

MSAs are generated for homodimers or heterodimers as input for the calculation of their co-evolutionary features. To challenge the deep learning method to effectively predict inter-chain distance maps from noisy inputs, in the training phase, we use less sensitive tools or smaller sequence databases to generate MSAs, but in the test phase, we use state-of-the-art tools and larger databases to generate the requisite MSAs. Specifically, in the training phase, for a homodimer, we use PSI-BLAST³⁷ to search the sequence of a monomer against Uniref90 (2018-04)³⁸ to generate the MSAs, and for a heterodimer, we follow the procedure in FoldDock³⁹ using the HHblits⁴⁰ to search against Uniclust30 (2017-10) to generate the MSA for each of the two monomers and then pair the two MSAs to produce an MSA for the heterodimer according to the organism taxonomy ID of the sequences.

In the test stage, for a homodimer, we use HHblits to search the sequence of a monomer against the Big Fantastic Database (BFD)⁴¹ and Uniclust30 (2017-10), respectively, to generate two MSAs for a single chain of the homodimers, which are used to generate input features separately to make two predictions that are averaged as the final predicted distance map; for a heterodimer, an MSA is generated by the same procedure used in EvComplex2⁴², which applies the jackhammer to search against Uniref90 (2018-04) to generate one MSA for each of the two monomers and then pairs the sequences from the two MSAs to produce an MSA for the heterodimer according to the highest sequence identity with the monomer sequences in each species. The MSA for a homodimer or a heterodimer is used by a statistical optimization tool CCMpred⁴³ to generate a residue-residue co-evolutionary score matrix (L × L × 1) as features and by a deep learning tool MSA transformer²² to generate residue-residue relationship (attention) matrices (L × L × 144) as features. L is the number of columns in MSA.

Sequential features

The sequence profile (i.e., position-specific scoring matrix (PSSM)) of the protein generated by the PSI-BLAST search above contains the residue conservation information. The PSSM of a monomer in a homodimer or the vertical concatenation of two PSSMs of two monomers in a heterodimer in the shape of L × 20 is tiled (i.e., cross-concatenated element by element) to generate sequential features of dimensionality L × L × 40.

Training procedure and hyperparameters

The deep neural network uses the input features above to predict a heavy atom distance map and a C_b distance map of shape L × L × 42. The 42 channels store the probability of a distance between two residues in 42 distance bins. The predicted inter-chain distance maps are compared with their true counterparts to calculate the cross-entropy loss to adjust the weights during training. For a heterodimer (L = L1 + L2), an output distance map of dimension (L1 + L2) × (L1 + L2) × 42 contains both inter-chain distance predictions and intra-chain distance predictions. Only inter-chain distance predictions are used to calculate the cross-entropy loss to train the networks, while the intra-chain distance predictions are ignored. The number of convolutional layers of CDPred is set to 156, and the number of filters of each convolutional layer is set to 64. The batch size in training is set to 1 due to the limitation of GPU memory. We used the Adam optimizer with a 1e-3 learning rate to train the model for the first 30 epochs to achieve fast convergence and used stochastic gradient descent with the 1e-4 starting learning rate and 10-time reduction every 20 epochs for the remaining 50 epochs to further reduce the training loss.

Datasets and evaluation metrics

We use the DeepHomo training dataset¹⁹ to train the homodimer inter-chain distance predictor. The whole dataset includes 4,132 homodimeric proteins with C2 symmetry. And after removing proteins that have > =30% sequence identity with the blind test datasets (HomoTest1 and HomoTest2) consisting of the targets of the CASP/CAPRI experiments using MMseq2⁴⁴, 4129 homodimers are left as training, validation, and internal test data. The same as DeepHomo, we select 300 of them as the validation data and 300 as the internal test data and use the rest as the training data. The test dataset used by DeepHomo contains 28 targets collected from the CASP10-13 experiments is used as one blind homodimer test dataset (HomoTest1). Another test dataset used by GLINTER²¹, which includes 23 homodimer targets collected from the CASP13 and 14, is used as the other blind homodimer test dataset (HomoTest2). The two blind test datasets have six common targets.

For heterodimers, we use the heterodimers in Apoc⁴⁵ to create the training, validation, and internal test datasets. After filtering out similar sequences at the 40% sequence identity threshold and removing the sequences with ≥30% sequence identity with the blind test datasets (HeteroTest1 and HeteroTest2), 3955 heterodimers are left. We randomly select 3576 of them as the training data, 198 as the validation data, and 181 as the internal test data. The test dataset used by GLINTER which contains nine heterodimer targets from the CASP13 and CASP14 experiments in conjunction with the CAPRI experiments is used as a blind test dataset (HeteroTest1). To create a larger blind test dataset, we collect the heterodimer released between 09-2021 and 11-2021 in the PDB. After filtering out similar sequences at a 40% sequence identity threshold and excluding sequences with >1000 residues targets, 55 heterodimers are left to create another blind test dataset (HeteroTest2).

Since the external methods, GLINTER and DeepHomo predict inter-chain contacts instead of inter-chain distances, to fairly compare CDPred with them, we use the precision of contact prediction as the evaluation metric. Specifically, the precision of top 5, 10, L/10 (or Ls/10), L/5 (or Ls/5), L/2 (or Ls/2), and L (or Ls) contact predictions (L: length of a monomer in homodimers, Ls: length of the shorter monomer in heterodimers) is computed and compared. A similar metric is also widely used in evaluating intra-chain contact prediction. Because DeepHomo and GLINTER predict inter-chain contacts at an 8 Å threshold, we use the same threshold to convert the distance maps predicted by CDPred into the binary contact map. A predicted inter-chain contact is correct if the minimal distance between the heavy atoms of the two residues is less than 8 Å. The accuracy order, accuracy rate⁴⁶, and AUC score are also used to evaluate the inter-chain distance prediction of CDPred. The accuracy order is the rank of the first correct contact prediction divided by the total number of residues of a dimer. AccRate is the percentage of dimers for which at least one of the top 10 inter-chain contact predictions is correct.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

The test data generated in this study have been deposited in the Zenodo database under Creative Commons Attribution 4.0 International Public License at https://zenodo.org/record/6647564. The raw protein dimer data used in this study are available under the CC0 1.0 Universal (CC0 1.0) Public Domain Dedication at https://www.rcsb.org/. The source data generated in this study are provided in the Source Data file. The Uniclust30(2017-10) database used in this study is available under the Creative Commons Attribution-ShareAlike 4.0 International License at https://wwwuser.gwdg.de/~compbiol/uniclust/2017_10/. The Uniref90(2018-10) database used in this study is available under the Creative Commons Attribution 4.0 International (CC BY 4.0) License at https://www.uniprot.org/help/uniref. And the Big Fantastic Database (BFD) database used in this study is available at https://bfd.mmseqs.com/. Source data are provided with this paper.

Code availability

The code of CDPred⁴⁷ is available at: https://github.com/BioinfoMachineLearning/CDPred.

References

Utsumi, S. & Matsumura, Y. Structure-function relationships. Food Proteins Appl. 80, 257 (1997).
CAS Google Scholar
Spirin, V. & Mirny, L. A. Protein complexes and functional modules in molecular networks. Proc. Natl Acad. Sci. USA 100, 12123–12128 (2003).
Article ADS PubMed PubMed Central CAS Google Scholar
Eickholt, J. & Cheng, J. Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics 28, 3066–3072 (2012).
Article PubMed PubMed Central CAS Google Scholar
Adhikari, B., Hou, J. & Cheng, J. DNCON2: improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics 34, 1466–1472 (2018).
Article PubMed CAS Google Scholar
Wang, S., Sun, S., Li, Z., Zhang, R. & Xu, J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput. Biol. 13, e1005324 (2017).
Article ADS PubMed PubMed Central Google Scholar
Li, Y., Hu, J., Zhang, C., Yu, D.-J. & Zhang, Y. ResPRE: high-accuracy protein contact prediction by coupling precision matrix with deep residual neural networks. Bioinformatics 35, 4647–4655 (2019).
Article PubMed PubMed Central CAS Google Scholar
Wu, T., Guo, Z., Hou, J. & Cheng, J. DeepDist: real-value inter-residue distance prediction with deep residual convolutional network. BMC bioinformatics 22, 1–17 (2021).
Senior, A. W. et al. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710 (2020).
Adhikari, B. & Cheng, J. CONFOLD2: improved contact-driven ab initio protein structure modeling. BMC Bioinforma. 19, 22 (2018).
Article Google Scholar
Rohl, C. A., Strauss, C. E., Misura, K. M. & Baker, D. Methods in Enzymology (Elsevier, 2004).
Kandathil, S. M., Greener, J. G., Lau, A. M. & Jones, D. T. Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterized proteins. Proc. Natl Acad. Sci. USA 119, e2113348119 (2022).
Yang, J. et al. Improved protein structure prediction using predicted interresidue orientations. Proc. Natl Acad. Sci. USA 117, 1496–1503 (2020).
Article ADS PubMed PubMed Central CAS Google Scholar
Xu, J. & Wang, S. Analysis of distance‐based protein structure prediction by deep learning in CASP13. Proteins 87, 1069–1081 (2019).
Article PubMed CAS Google Scholar
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
Article ADS PubMed PubMed Central CAS Google Scholar
Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31, 3370–3374 (2003).
Article PubMed PubMed Central CAS Google Scholar
Evans, R. et al. Protein complex prediction with AlphaFold-Multimer. BioRxiv, 2021.2010. 2004.463034 (2022).
Gao, M., Nakajima, An,D., Parks, J. M. & Skolnick, J. AF2Complex predicts direct physical interactions in multimeric proteins with deep learning. Nat. Commun. 13, 1–13 (2022).
Google Scholar
Zeng, H. et al. ComplexContact: a web server for inter-protein contact prediction using deep learning. Nucleic Acids Res. 46, W432–W437 (2018).
Article PubMed PubMed Central CAS Google Scholar
Yan, Y. & Huang, S.-Y. Accurate prediction of inter-protein residue–residue contacts for homo-oligomeric protein complexes. Brief. Bioinforma. 22, bbab038 (2021).
Article Google Scholar
Roy, R. S., Quadir, F., Soltanikazemi, E. & Cheng, J. A deep dilated convolutional residual network for predicting interchain contacts of protein homodimers. Bioinformatics 38, 1904–1910 (2022).
Xie, Z. & Xu, J. Deep graph learning of inter-protein contacts. Bioinformatics 38, 947–953 (2022).
Article CAS Google Scholar
Rao, R. M. et al. In International Conference on Machine Learning. 8844-8856 (PMLR, 2021).
He, K., Zhang, X., Ren, S. & Sun, J. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 770–778 (IEEE, 2016).
Quadir, F., Roy, R. S., Halfmann, R. & Cheng, J. DNCON2_Inter: predicting interchain contacts for homodimeric and homomultimeric protein complexes using multiple sequence alignments of monomers and deep learning. Sci. Rep. 11, 1–10 (2021).
Article Google Scholar
Quadir, F., Roy, R. S., Soltanikazemi, E. & Cheng, J. DeepComplex: a web server of predicting protein complex structures by deep learning inter-chain contact prediction and distance-based modelling. Front. Mol. Biosci. 8, 716973 (2021).
Jones, D. T., Singh, T., Kosciolek, T. & Tetchner, S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31, 999–1006 (2015).
Article PubMed CAS Google Scholar
Guo, Z., Wu, T., Liu, J., Hou, J. & Cheng, J. Improving deep learning-based protein distance prediction in CASP14. Bioinformatics 37, 3190–3196 (2021).
Article PubMed Central CAS Google Scholar
Hunter, J. D. Matplotlib: a 2D graphics environment. Comput. Sci. Eng. 9, 90–95 (2007).
Article Google Scholar
Zhang, Y. & Skolnick, J. Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
Article PubMed CAS Google Scholar
Goodfellow, I., Warde-Farley, D., Mirza, M., Courville, A. & Bengio, Y. in International conference on machine learning. 1319–1327 (PMLR).
Li, Y., Zhang, C., Bell, E. W., Yu, D. J. & Zhang, Y. Ensembling multiple raw coevolutionary features with deep residual neural networks for contact‐map prediction in CASP13. Proteins 87, 1082–1091 (2019).
Article PubMed PubMed Central CAS Google Scholar
Mao, W., Ding, W., Xing, Y. & Gong, H. AmoebaContact and GDFold as a pipeline for rapid de novo protein structure prediction. Nat. Mach. Intell. 2, 1–9 (2019).
Ulyanov, D., Vedaldi, A. & Lempitsky, V. Instance normalization: the missing ingredient for fast stylization. Preprint at arXiv:1607.08022 (2016).
Hu, J., Shen, L. & Sun, G. Squeeze-and-excitation networks. In Proc. IEEE Conference on Computer Vision and Pattern Recognition. 7132–7141 (IEEE, 2018).
Woo, S., Park, J., Lee, J.-Y. & Kweon, I. S. Cbam: Convolutional block attention module. In Proc. European Conference on Computer Vision (ECCV) 2018. 3–19.
Soltanikazemi, E., Quadir, F., Roy, R. S., Guo, Z. & Cheng, J. Distance‐based reconstruction of protein quaternary structures from inter‐chain contacts. Proteins 90, 720–731 (2022).
Article PubMed CAS Google Scholar
Bhagwat, M. & Aravind, L. Comparative Genomics (Springer, 2007).
Mirdita, M. et al. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 45, D170–D176 (2017).
Article PubMed CAS Google Scholar
Bryant, P., Pozzati, G. & Elofsson, A. Improved prediction of protein-protein interactions using AlphaFold2. Nat. Commun. 13, 1–11 (2022).
Google Scholar
Remmert, M., Biegert, A., Hauser, A. & Söding, J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173 (2012).
Article CAS Google Scholar
Steinegger, M., Mirdita, M. & Söding, J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat. Methods 16, 603–606 (2019).
Article PubMed CAS Google Scholar
Green, A. G. et al. Large-scale discovery of protein interactions at residue resolution using co-evolution calculated from genomic sequences. Nat. Commun. 12, 1–12 (2021).
Article Google Scholar
Seemayer, S., Gruber, M. & Söding, J. CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics 30, 3128–3130 (2014).
Article PubMed PubMed Central CAS Google Scholar
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
Article PubMed CAS Google Scholar
Gao, M. & Skolnick, J. APoc: large-scale identification of similar protein pockets. Bioinformatics 29, 597–604 (2013).
Article PubMed PubMed Central CAS Google Scholar
Zhao, Z. & Gong, X. Protein-protein interaction interface residue pair prediction based on deep learning architecture. IEEE/ACM Trans. Comput. Biol. Bioinforma. 16, 1753–1759 (2017).
Article Google Scholar
Guo, Z. & Cheng, J. BioinfoMachineLearning/CDPred: the release accompanying the publication of the Nature Communication paper describing CDPred. Zenodo https://doi.org/10.5281/zenodo.7218709 (2022).

Download references

Acknowledgements

The work was partly supported by the Department of Energy, USA (grant #: DE-AR0001213 (J.C.), DE-SC0020400 (J.C.), and DE-SC0021303 (J.S. and J.C.)), National Science Foundation (grant #: DBI1759934 (J.C.) and IIS1763246 (J.C.)), and National Institutes of Health (grant #: R01GM093123 (J.C.), R01GM146340 (J.C.), and R35GM118039 (J.S.)).

Author information

Authors and Affiliations

Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, MO, 65211, USA
Zhiye Guo, Jian Liu & Jianlin Cheng
School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, 30332-200, USA
Jeffrey Skolnick

Authors

Zhiye Guo
View author publications
You can also search for this author in PubMed Google Scholar
Jian Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jeffrey Skolnick
View author publications
You can also search for this author in PubMed Google Scholar
Jianlin Cheng
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

J.C. conceived the project. Z.G. and J.C. designed the experiment. Z.G. performed the experiment and collected the data. Z.G., J.C., J.L., and J.S. analyzed the data. J.L. and J.S. provided some datasets. Z.G. and J.C. wrote the manuscript. Z.G., J.C., and J.S. edited the manuscript.

Corresponding author

Correspondence to Jianlin Cheng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work. Peer reviewer reports are available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Info File #1

Peer Review File

Reporting Summary

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Guo, Z., Liu, J., Skolnick, J. et al. Prediction of inter-chain distance maps of protein complexes with 2D attention-based deep neural networks. Nat Commun 13, 6963 (2022). https://doi.org/10.1038/s41467-022-34600-2

Download citation

Received: 08 June 2022
Accepted: 24 October 2022
Published: 15 November 2022
DOI: https://doi.org/10.1038/s41467-022-34600-2

This article is cited by

Multi-domain and complex protein structure prediction using inter-domain interactions from deep learning
- Yuhao Xia
- Kailong Zhao
- Guijun Zhang
Communications Biology (2023)
Enhancing alphafold-multimer-based protein complex structure prediction with MULTICOM in CASP15
- Jian Liu
- Zhiye Guo
- Jianlin Cheng
Communications Biology (2023)
Improving AlphaFold2-based protein tertiary structure prediction with MULTICOM in CASP15
- Jian Liu
- Zhiye Guo
- Jianlin Cheng
Communications Chemistry (2023)
Protein–protein contact prediction by geometric triangle-aware protein language models
- Peicong Lin
- Huanyu Tao
- Sheng-You Huang
Nature Machine Intelligence (2023)

Comments

By submitting a comment you agree to abide by our Terms and Community Guidelines. If you find something abusive or that does not comply with our terms or guidelines please flag it as inappropriate.

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Evaluation of inter-chain contact prediction for homodimers

The impact of MSA depth on the accuracy of inter-chain contact prediction for homodimers

Evaluation of inter-chain contact prediction for heterodimers

Comparison of the co-evolutionary features generated by the statistical optimization method and deep learning method

The impact of the quality of predicted tertiary structures of monomers on inter-chain distance prediction of dimers

High correlation between the precision of inter-chain contact predictions and predicted probability scores

The comparison between CDPred and AlphaFold2-multimer

An interesting inter-chain distance prediction example

Methods

Attention-based neural network architecture

Features

Tertiary structure information of monomers

Co-evolutionary features

Sequential features

Training procedure and hyperparameters

Datasets and evaluation metrics

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

This article is cited by

Comments

Search

Quick links