Abstract
The ability to predict protein–protein interactions is crucial to our understanding of a wide range of biological activities and functions in the human body, and for guiding drug discovery. Despite considerable efforts to develop suitable computational methods, predicting protein–protein interaction binding affinity changes following mutation (ΔΔG) remains a severe challenge. Algebraic topology, a champion in recent worldwide competitions for protein–ligand binding affinity predictions, is a promising approach to simplifying the complexity of biological structures. Here we introduce element- and site-specific persistent homology (a new branch of algebraic topology) to simplify the structural complexity of protein–protein complexes and embed crucial biological information into topological invariants. We also propose a new deep learning algorithm called NetTree to take advantage of convolutional neural networks and gradient-boosting trees. A topology-based network tree is constructed by integrating the topological representation and NetTree for predicting protein–protein interaction ΔΔG. Tests on major benchmark datasets indicate that the proposed topology-based network tree is an important improvement over the current state of the art in predicting ΔΔG.
Main
Protein–protein interactions (PPIs) are crucial to a wide range of biological activities and functions in the human body, including cell metabolism, signal transduction, muscle contraction and immune systems. The antibody–antigen system is one of the most essential among all PPIs and plays a unique role in the study of PPIs. Antibodies are large proteins that serve important roles in the immune system by counteracting antigens, which are chemicals recognized as alien by the human body. On the tip of an antibody, there is an antigen-binding fragment that contains a paratope for recognizing a unique antigen via its epitope; more specifically, a paratope consists of a set of complementarity-determining regions that have the highest conformational flexibility among sites on an antibody^{1}. The high selectivity of the antibody–antigen recognition mechanism and the flexibility of antibodies as large proteins make antibodies a suitable platform for designing counteractants of target molecules. Antibodies have been widely used as therapeutic agents to treat human diseases. Antibody therapy has several advantages over traditional therapy, including longer serum half-life, higher avidity and selectivity, and the ability to invoke desired immune responses^{2,3,4}. Antibody therapy also brings hope of curing several previously incurable diseases and there are ongoing efforts in the direction of HIV vaccine development^{5} and cancer therapeutic antibodies^{6,7}.
Three-dimensional (3D) structural information and thermodynamic measurements are two essential components for understanding the molecular mechanism of PPIs. Many experimental methods have been developed to determine the structure of protein–protein complexes. Among them, X-ray crystallography, NMR and cryo-electron microscopy are the main workhorses^{8}. The Protein Data Bank^{9}, one of the largest protein structure databases, includes tens of thousands of protein–protein complex structures and is expanding at an unprecedented rate.
Site-directed mutation is a key technology for probing the thermodynamic properties of PPIs, including binding affinities of antibody–antigen interactions. Sirin et al.^{10} collected the AB-Bind database of mutation-induced antibody–antigen complex binding free energy changes. This database contains 1,101 mutation data entries, including 645 single-point mutations on 32 different antibody–antigen complexes. SKEMPI is a more general database for protein–protein binding affinity changes following mutation (ΔΔG)^{11}; it contains 3,047 mutation data entries for protein–protein heterodimeric complexes with experimentally determined structures.
The aforementioned databases have been widely used as benchmark tests for evaluating the predictive power of computational methods, which are indispensable for the investigation of PPIs, especially for the systematic screening of mutations^{12,13}. There are many reliable computational methods that can predict mutant structures from the wild type, such as Rosetta^{14} and Jackal^{15}. Computational methods for generating protein structures from sequences (for example, MODELLER^{16}) and predicting docking poses for protein–protein complexes (for example, BioLuminate^{17}) are also available.
The thermodynamic properties of PPIs are usually interpreted as the binding affinity or binding free energy, ΔG. Given the importance of computational methods, a variety of them have been developed that use structures to predict antibody–antigen binding affinities. DFIRE^{18} relies on an all-atom, distance-scaled, pairwise potential that is derived using a database of high-quality diverse protein structures, whereas STATIUM uses a pairwise statistical potential that scores how well a protein complex can accommodate different pairs of residues in the parent complex geometry. Force fields for proteins can also be used to compute the binding free energy, representing van der Waals interactions, hydrophobic packing, electrostatics and solvation effects. These approaches include FoldX (FOLDEF)^{19}, Discovery Studio (CHARMMPLR)^{20} and Rosetta^{14}. Typically, physics-based methods provide mechanistic interpretations but are not designed for handling large and diverse datasets.
Pires et al. optimized their graph-based cutoff scanning matrix (CSM) method for predicting antibody–antigen affinity changes following mutation given in the AB-Bind database^{21}. This method (named mCSM-AB) was shown to outperform the aforementioned physical methods, yet it achieved a Pearson’s correlation coefficient (R_{P}) of only 0.53 with tenfold cross-validation on a set of 645 single-point mutations. The limited performance of the current methods therefore highlights a pressing need for a new generation of ΔΔG predictors that are constructed with entirely new design principles and/or innovative machine learning algorithms. Whereas the physics-based methods assume potential functions of certain forms and the graph-based method only considers pairwise interactions, we seek an approach that makes fewer assumptions and allows a systemic description of PPIs.
Persistent homology^{22,23,24,25}, a new branch of algebraic topology, is able to bridge geometry and topology, leading to a new efficient approach for the simplification of biological structural complexity^{26,27,28,29,30,31}; however, it neglects critical chemical/biological information when it is directly applied to complex biomolecular structures. Element-specific persistent homology can retain critical biological information during the topological abstraction. Paired with advanced machine learning, such as a convolutional neural network (CNN), this new topological method gives rise to some of the best predictions for protein–ligand binding affinities^{32}, protein folding free energy changes following mutation^{33,34} and drug virtual screening^{35}. This approach has won many contests in the D3R Grand Challenges, a worldwide competition series in computer-aided drug design^{36}; however, the techniques designed for protein–ligand binding analysis could not be directly applied to PPIs due to biological differences and the different characteristics of available datasets.
In the present work we introduce site-specific persistent homology that is tailored for PPI analysis. We explore the utility of site-specific persistent homology and machine learning algorithms for characterizing PPIs that are associated with site-specific mutations. We hypothesize that a topological approach that generates intrinsically low-dimensional representations of PPIs could dramatically reduce the dimensionality of antibody–antigen complexes, leading to reliable high-throughput screening in the search for valuable mutants in protein design. To validate our hypothesis, we integrate topological descriptors with a machine learning algorithm (CNN-assisted gradient-boosting trees (GBTs)) to predict PPI ΔΔG. The resulting topology-based network tree (TopNetTree) method is found to outperform other methods on two major benchmark datasets, AB-Bind^{10} and SKEMPI^{11}, by a large margin. Our TopNetTree offers an accurate and reliable tool for studying PPIs.
TopNetTree model for PPI binding energy change following mutation prediction
This section describes the TopNetTree model and its application to PPI ΔΔG prediction. As shown in Fig. 1, the proposed TopNetTree consists of two major modules: topology-based feature generation and a CNN-assisted GBT model. For the feature generation, we mainly used element- and site-specific persistent homology to capture structural characteristics, enhanced by chemical–physical descriptors, whereas for the learning model we used a GBT fed with inputs from a CNN as the predictor. We demonstrate the performance of the proposed TopNetTree on three commonly used PPI benchmark datasets.
Topological representation of PPIs
The pairwise interactions between atoms are characterized by the zeroth homology group (H_{0}, also known as the size function^{37}). The higher-dimensional homology groups encode higher-order patterns in PPI complexes. The first homology group (H_{1}), which is generated with Euclidean distance (D_{e})-based filtration, characterizes loop or tunnel-like structures, as shown in Fig. 2, whereas the second homology group (H_{2}) describes cavity structures in PPI complexes. We obtain a comprehensive topological description of PPIs by combining various dimensions.
A topological representation should be able to extract patterns of different biological or chemical aspects (for example, hydrogen bonds between oxygen and nitrogen atoms, hydrophobicity, polarizability and so on) from a PPI system that is represented by a set of atomic coordinates (that is, a point cloud). We construct simplicial complexes using selected subsets of atomic coordinates and modified distance matrices to achieve this goal.
For the construction of element- and site-specific persistent homology, we classify the atoms in a PPI complex into various subsets:
 (1) \({{\mathcal{A}}}_{\mathrm{m}}\): atoms of the mutation site.
 (2) \({{\mathcal{A}}}_{\mathrm{mn}}(r)\): atoms in the neighbourhood of the mutation site within a cutoff distance, r.
 (3) \({{\mathcal{A}}}_{\mathrm{Ab}}(r)\): antibody atoms within r of the binding site.
 (4) \({{\mathcal{A}}}_{\mathrm{Ag}}(r)\): antigen atoms within r of the binding site.
 (5) \({{\mathcal{A}}}_{\mathrm{ele}}({\rm{E}})\): atoms in the system that are of element type E.
When characterizing interactions between atoms a_{i} and a_{j} in set \({\mathcal{A}}\) and/or set \({\mathcal{B}}\), we use a modified distance matrix to exclude the interactions between atoms from the same set. With D_{mod} denoting the modified distance and D_{e} the Euclidean distance,
\[{D}_{\mathrm{mod}}({a}_{i},{a}_{j})=\begin{cases}\infty , & \text{if } {a}_{i},{a}_{j}\in {\mathcal{A}} \text{ or } {a}_{i},{a}_{j}\in {\mathcal{B}},\\ {D}_{\mathrm{e}}({a}_{i},{a}_{j}), & \text{if } {a}_{i}\in {\mathcal{A}} \text{ and } {a}_{j}\in {\mathcal{B}}, \text{ or vice versa.}\end{cases}\tag{1}\]
Specific designations for sets \({\mathcal{A}}\) and \({\mathcal{B}}\) are given in Supplementary Table 1, which summarizes various topological barcodes.
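The modified distance can be illustrated with a minimal sketch in plain Python. It assumes each atom carries a set label (here the hypothetical strings "A" and "B") and sets the distance between atoms of the same set to infinity, so that only cross-set interactions can enter a distance-based filtration. The function name and the toy coordinates are illustrative, not taken from the authors' code.

```python
import math

def modified_distance_matrix(coords, labels):
    """Return D_mod: Euclidean distance across sets, infinity within a set."""
    n = len(coords)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if labels[i] == labels[j]:
                dist = math.inf  # same set: interaction excluded
            else:
                dist = math.dist(coords[i], coords[j])  # cross-set: D_e
            d[i][j] = d[j][i] = dist
    return d

# Toy example: two atoms in set A and one atom in set B.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 3.0, 4.0)]
labels = ["A", "A", "B"]
D = modified_distance_matrix(coords, labels)
```

In this toy case the two atoms of set A are infinitely far apart under D_{mod}, whereas each keeps its Euclidean distance to the atom of set B.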
Vectorization of topological barcodes
Using persistent homology, the original 3D point-cloud data are characterized by topological barcodes that are represented as collections of intervals that capture geometric patterns, topological patterns and PPIs while dramatically simplifying complicated structural representations of a PPI complex. The upper bound of the filtration parameter corresponds to the distance cutoff of interactions of interest, which is set to be the same for different samples in the dataset. Instead of having bounding cubes of different sizes around the binding and mutation sites, topological barcodes for different samples are in the same range of filtration values, which improves the scalability in comparison with the direct use of the original 3D data. We construct feature vectors from these sets of intervals for machine learning models.
One method of vectorization is to discretize the range of the filtration parameter into bins and record the behaviour of the barcodes in each bin^{35}. In this work we subdivide a filtration range (for example, [0, 12] Å) into bins of length 0.5 Å; namely, [0, 0.5], (0.5, 1], ⋯ , (11.5, 12] Å. For each bin, we count the numbers of persistence intervals, birth events and death events (see Fig. 3 for an illustration of filtration and persistence). This approach gives us three feature vectors for each topological barcode. Note that this characterization of birth and death might not be stable against different discretizations. As such, only H_{0} barcodes obtained from the Vietoris–Rips filtration are used in our approach.
One advantage of binned barcode vectorization is that it keeps the distance information that reflects the strength of hydrogen bonds, van der Waals interactions and so on. The bin representation of barcode features can be easily incorporated into a CNN, which captures and discriminates local patterns; that is, the impact of mutations.
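The binned vectorization can be sketched as follows. This is one plausible reading of the counting rule, assuming that a bar is counted in every bin it overlaps, that bins are half-open as (lo, hi] with 0 assigned to the first bin, and that bars still alive at the cutoff produce no death event; the exact convention used in the paper may differ, and the function names are hypothetical.

```python
import math

def bin_index(value, bin_width):
    """Index of the bin (lo, hi] containing value; 0 falls in the first bin."""
    return max(0, math.ceil(value / bin_width) - 1)

def vectorize_barcode(bars, r_max=12.0, bin_width=0.5):
    """Count persisting bars, birth events and death events per bin."""
    n_bins = int(round(r_max / bin_width))
    persist = [0] * n_bins
    births = [0] * n_bins
    deaths = [0] * n_bins
    for birth, death in bars:
        if birth > r_max:
            continue  # bar lies entirely beyond the filtration cutoff
        first = bin_index(birth, bin_width)
        last = bin_index(min(death, r_max), bin_width)
        births[first] += 1
        if death <= r_max:
            deaths[last] += 1  # bars alive at the cutoff have no death event
        for k in range(first, last + 1):
            persist[k] += 1
    return persist, births, deaths

# Toy H0 barcode: every bar is born at 0; one bar never dies.
bars = [(0.0, 1.2), (0.0, 3.0), (0.0, math.inf)]
persist, births, deaths = vectorize_barcode(bars)
```

The three returned lists have one entry per bin (24 for the default [0, 12] Å range), which gives a fixed-length, channel-like layout of the kind a one-dimensional CNN can consume.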
Another method of vectorization is to summarize barcode statistics, including the sum, maximum, minimum, mean and standard deviation of bar lengths, birth values and death values. We use this method to vectorize H_{1} and H_{2} barcodes obtained from alpha complex filtration as these higher-dimensional barcodes are sparser than the zero-dimensional ones^{23}.
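A minimal sketch of the statistics-based vectorization is given below. It assumes that bars with an infinite death value are ignored, that an empty barcode maps to zeros, and that the population standard deviation is used; none of these choices is specified in the text, and the function names are illustrative.

```python
import math
import statistics

def summarize(values):
    """Sum, maximum, minimum, mean and standard deviation of a list."""
    if not values:
        return [0.0] * 5  # assumption: empty barcodes contribute zeros
    return [
        sum(values),
        max(values),
        min(values),
        statistics.mean(values),
        statistics.pstdev(values),  # assumption: population standard deviation
    ]

def barcode_statistics(bars):
    """15 features: statistics of bar lengths, birth values and death values."""
    finite = [(b, d) for b, d in bars if math.isfinite(d)]
    lengths = [d - b for b, d in finite]
    births = [b for b, _ in finite]
    deaths = [d for _, d in finite]
    return summarize(lengths) + summarize(births) + summarize(deaths)

# Toy H1 barcode with two finite bars and one infinite bar.
features = barcode_statistics([(1.0, 3.0), (2.0, 6.0), (0.5, math.inf)])
```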
Machine learning models
A major challenge in the prediction of binding affinity changes following mutation for PPIs is that the data are highly complex due to 3D structures, whereas the datasets are relatively small. We designed a hybrid machine learning algorithm that combines a CNN and GBT to overcome this difficulty. The topologically simplified descriptions of the 3D structures are further converted into concise features by the CNN module; the GBT module then builds robust predictors with effective control of overfitting.
TopGBT model
An ensemble method is a class of machine learning algorithms that builds a powerful model from weak learners. It improves on the performance of the weak learners under the assumption that the individual learners are likely to make different mistakes and thus summing the weak learners will reduce the overall error. In this work we use GBTs that add a tree to the ensemble according to the current prediction error on the training data. This method (a topology-based GBT or TopGBT) performs well when there is a moderate number of features and is relatively robust against hyperparameter tuning and overfitting. The implementation provided by the scikit-learn package (v.0.18.1)^{38} is used.
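The idea of adding trees according to the current training error can be illustrated with a toy gradient-boosting regressor built from depth-one trees (stumps) and a squared-error loss. This is only a didactic sketch in plain Python; the paper relies on the scikit-learn implementation, and all names and hyperparameter values below are illustrative.

```python
def fit_stump(xs, residuals):
    """Find the single-feature threshold split that minimises squared error."""
    best = None
    for f in range(len(xs[0])):
        for t in sorted({row[f] for row in xs}):
            left = [r for row, r in zip(xs, residuals) if row[f] <= t]
            right = [r for row, r in zip(xs, residuals) if row[f] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    return best[1:] if best else None

def predict_stump(stump, row):
    f, t, lm, rm = stump
    return lm if row[f] <= t else rm

def fit_gbt(xs, ys, n_trees=50, learning_rate=0.1):
    """Each new stump is fitted to the residuals of the current ensemble."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        if stump is None:
            break
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, row)
                 for p, row in zip(preds, xs)]
    return base, stumps, learning_rate

def predict_gbt(model, row):
    base, stumps, lr = model
    return base + lr * sum(predict_stump(s, row) for s in stumps)

# Toy data: a step function of a single feature.
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [0.0, 0.0, 1.0, 1.0]
model = fit_gbt(xs, ys)
```

The small learning rate means that each stump corrects only a fraction of the remaining error, which is one of the mechanisms by which boosting limits overfitting.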
TopCNN model
CNNs are some of the most successful deep learning architectures. A regular CNN is a special case of a multilayer artificial neural network where only local connections are allowed between convolution layers and the weights are shared across different locations. We use a topology-based CNN (TopCNN) as an intermediate model; specifically, we feed vectorized H_{0} features into CNNs to extract higher-level features for the downstream model (detailed parameters and the preprocessing procedure of our model can be found in the Supplementary Information).
TopNetTree model
CNNs can automatically extract high-level features from H_{0}. These CNN-extracted features are combined with features constructed from higher-dimensional topological barcodes, H_{1} and H_{2}, as the inputs of the GBTs; specifically, we build a supervised CNN model with the PPI ΔΔG as labels. After the model is trained, we feed the flatten-layer neural outputs into a GBT model to rank their importance. Based on the importance, a subset of CNN features is combined with other features, such as the statistics of H_{1} and H_{2} barcodes, for the final GBT model as shown in Fig. 1. The GBT is used for its robustness against overfitting, good performance for moderately small data sizes and its model interpretability (further details on TopNetTree are given in the Supplementary Information).
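The feature-assembly step can be sketched as below, assuming that an importance score per CNN feature is already available (for instance from the `feature_importances_` attribute of a fitted scikit-learn gradient-boosting model). The number of retained features, k, and all function names are hypothetical; the paper's actual selection rule is described in its Supplementary Information.

```python
def select_top_features(importances, k):
    """Indices of the k most important features, returned in column order."""
    ranked = sorted(range(len(importances)),
                    key=lambda i: importances[i], reverse=True)
    return sorted(ranked[:k])

def assemble_inputs(cnn_features, importances, other_features, k):
    """Concatenate the selected CNN features with the other descriptors."""
    keep = select_top_features(importances, k)
    return [
        [row[i] for i in keep] + list(extra)
        for row, extra in zip(cnn_features, other_features)
    ]

# Toy example: 4 CNN features per sample, 2 extra H1/H2 statistics per sample.
cnn_features = [[0.1, 0.9, 0.3, 0.7], [0.2, 0.8, 0.4, 0.6]]
importances = [0.05, 0.50, 0.15, 0.30]
h1_h2_stats = [[1.0, 2.0], [3.0, 4.0]]
final_inputs = assemble_inputs(cnn_features, importances, h1_h2_stats, k=2)
```

The rows of `final_inputs` would then serve as the inputs of the final GBT model.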
Model performance for PPIs
We consider three datasets: the AB-Bind dataset^{10}, the SKEMPI dataset^{11} and the SKEMPI 2.0 dataset^{39} to validate the proposed TopNetTree model. Two evaluation metrics (R_{P} and the root-mean-square error, r.m.s.e.) are used to assess the quality of prediction. Detailed information on the evaluation metrics can be found in the Supplementary Information.
The prediction of AB-Bind free energy changes following mutation
The AB-Bind dataset includes 1,101 mutational data points with experimentally determined binding affinities^{10}. We follow Pires et al.^{21} by considering only 645 single mutations across 29 antibody–antigen complexes. Among them, 87 mutations are on five complexes with homology structures. This dataset, called the AB-Bind S645 set, consists of about 20% stabilizing mutations and 80% destabilizing ones; there are 27 non-binders in the whole dataset, which are variants determined not to bind within the sensitivity of the assay. The binding affinity changes following mutation of these non-binders were set to –8 kcal mol^{–1}. These non-binders could be regarded as outliers in the database and have a strongly negative impact on the prediction model accuracy.
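For reference, the two metrics can be computed as in the sketch below, which uses the standard textbook definitions of Pearson's correlation coefficient and the root-mean-square error; the function names are illustrative, and the toy values are not taken from any of the datasets.

```python
import math

def pearson_r(pred, true):
    """Pearson correlation coefficient between predictions and labels."""
    n = len(pred)
    mp, mt = sum(pred) / n, sum(true) / n
    cov = sum((p - mp) * (t - mt) for p, t in zip(pred, true))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_t = sum((t - mt) ** 2 for t in true)
    return cov / math.sqrt(var_p * var_t)

def rmse(pred, true):
    """Root-mean-square error between predictions and labels."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# Toy values only.
r = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
err = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```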
Our model achieved an R_{P} of 0.65 on the AB-Bind S645 dataset, which is significantly better than those of other existing methods as shown in Table 1. In comparison with non-machine-learning methods such as Rosetta and bASA, our method is over 100% more accurate in terms of R_{P}, indicating that our topology-based machine learning methods have a better predictive power for PPI systems. Our method is about 22% more accurate than the best existing score of R_{P} = 0.53 (given by mCSM-AB), indicating the power of our TopNetTree.
Both GBTs and neural networks are quite sensitive to systematic errors, as the training of a model is based on minimizing a mean-square-error loss function. The ΔΔG of the 27 non-binders (–8 kcal mol^{–1}) did not follow the distribution of the whole dataset. Pires et al.^{21} found that excluding non-binders from the dataset would significantly increase the performance of a prediction model. In our case, the R_{P} increased from 0.65 to 0.68 for the same treatment as shown in Fig. 4. We also applied a blind test on homology structures using the rest of the samples as the training set, achieving an R_{P} of 0.55, as shown in Fig. 4.
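As a quick check of the relative improvement over mCSM-AB, using only the two R_{P} values quoted in the text (the superscript labels are introduced here for readability):
\[\frac{{R}_{\mathrm{P}}^{\mathrm{TopNetTree}}-{R}_{\mathrm{P}}^{\mathrm{mCSM\text{-}AB}}}{{R}_{\mathrm{P}}^{\mathrm{mCSM\text{-}AB}}}=\frac{0.65-0.53}{0.53}\approx 0.226,\]
that is, a relative gain of roughly 22 to 23%.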
The performance on the SKEMPI dataset
The SKEMPI dataset^{11} contains 3,047 binding free energy changes following mutation, which are assembled from the scientific literature for protein–protein heterodimeric complexes with experimentally determined structures; it includes single-point mutations and multi-point mutations. There are 2,317 single-point mutation data entries in the whole database, which are referred to as the SKEMPI S2317 set.
Xiong et al. recently selected a subset of 1,131 non-redundant interface single-point mutations (denoted set S1131) from SKEMPI set S2317^{40}. The same authors applied several methods to the SKEMPI S1131 set^{40}, including BindProfX^{40}, Profile-score^{41,42}, FoldX^{19}, BeAtMuSiC^{43}, SAMMBE^{44} and Dcomplex^{45}.
Table 2 shows the Pearson correlation coefficients for tenfold cross-validations. It is found that the proposed TopNetTree is about 15% more accurate than the best existing method.
The performance on the SKEMPI 2.0 dataset
The SKEMPI 2.0 (ref. ^{39}) database is an updated version of the SKEMPI database, containing new mutations collected after its first version, including data from three other databases: AB-Bind^{10}, PROXiMATE^{46} and dbMPIKT^{47}. This dataset contains 7,085 entries, including single- and multi-point mutations. By selecting only single-point mutations and excluding mutation entries without energy-change values, 4,947 data points were chosen from SKEMPI 2.0 (denoted set S4947). David et al. recently applied their updated mCSM-PPI2 method^{48} to the SKEMPI 2.0 dataset. They filtered only single-point mutations and selected 4,169 variants in 319 different complexes (denoted set S4169). Set S8338 was derived from set S4169 by adding, for each mutation, the reverse mutation with its energy change set to the negative of the original energy change. We applied our TopNetTree model to sets S4947, S4169 and S8338. We tested set S4947 with regular tenfold cross-validation, achieving an average R_{P} of 0.82 and an r.m.s.e. of 1.11 kcal mol^{–1}. We followed the tenfold stratified cross-validation used in the mCSM-PPI2 paper for sets S4169 and S8338^{48}. For set S4169, we obtained an average R_{P} of 0.79 and an r.m.s.e. of 1.13 kcal mol^{–1}, compared with the average R_{P} of 0.76 and r.m.s.e. of 1.19 kcal mol^{–1} achieved by mCSM-PPI2. Finally, for set S8338, our method attained an average R_{P} of 0.85 and an r.m.s.e. of 1.11 kcal mol^{–1}, whereas mCSM-PPI2 reported an average R_{P} of 0.82 and an r.m.s.e. of 1.18 kcal mol^{–1} (ref. ^{48}).
We further validated our method by the blind prediction of another subset of the AB-Bind database. As SKEMPI 2.0 contains entries in the AB-Bind dataset, we chose 24 protein complexes that appear in both the AB-Bind and SKEMPI 2.0 datasets as the test set, comprising 787 mutations (denoted as the S787 set). The S4947 set, excluding the S787 set, was used as the training set. We achieved an average R_{P} of 0.53 and an r.m.s.e. of 1.45 kcal mol^{–1} on this blind test (further details are given in the Supplementary Information).
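The construction of a reverse-mutation set such as S8338 can be sketched as follows. The record layout (dictionary keys such as "wild", "mutant" and "ddg") and the example entry are hypothetical; the sketch only encodes the rule stated above, namely that the reverse mutation swaps the wild-type and mutant residues and flips the sign of the energy change.

```python
def add_reverse_mutations(entries):
    """Return the original entries followed by their reverse mutations."""
    reversed_entries = [
        {
            "complex": e["complex"],
            "chain": e["chain"],
            "position": e["position"],
            "wild": e["mutant"],   # reverse mutation starts from the mutant
            "mutant": e["wild"],
            "ddg": -e["ddg"],      # energy change takes the opposite sign
        }
        for e in entries
    ]
    return list(entries) + reversed_entries

# Hypothetical entry, for illustration only.
forward = [{"complex": "XXXX", "chain": "A", "position": 42,
            "wild": "K", "mutant": "S", "ddg": 1.3}]
augmented = add_reverse_mutations(forward)
```

Applied to 4,169 forward entries, this doubling yields 8,338 entries, which matches the size of set S8338.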
Discussion
The quality of machine learning predictions typically depends on model inputs. In our case, the inputs consist of three crucial components: protein structures, the mutation position and the mutation type. In this section we discuss the influence of each component on the prediction quality.
Prediction result analysis for different protein complexes
For the AB-Bind S645 set, mutations can be separated into 24 different protein–protein complexes (we merged the complex with its homology model as one category). We did intra- and inter-protein cross-validations to further analyse the prediction quality across different protein complexes.
Inter-protein-level cross-validation
To perform inter-protein-level cross-validation for 24 different protein–protein complexes, the samples in one protein complex are taken as the test set, whereas the rest of the dataset is used as the training set (see Supplementary Table 2 for more details). For this test, our model reached an average R_{P} of 0.508 and a median R_{P} of 0.541. This performance is comparable with the result of the blind test on homology models (see Fig. 4); however, the performance of the model varies among different protein families. Models trained on some protein families could extrapolate to other families; for example, the two protein families with the best results, 1KTZ and 2JEL, reach R_{P} values of 0.866 and 0.818, respectively, whereas the two families with the poorest results, 1FFW and 1YY9, have R_{P} values of –0.043 and –0.068, respectively.
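The splitting scheme described above can be sketched as a leave-one-complex-out generator. The record layout is hypothetical, and the complex identifiers in the toy data are reused from the text only as labels; the ΔΔG values are made up for illustration.

```python
def leave_one_complex_out(samples):
    """Yield (held-out complex, training indices, test indices) per complex."""
    complexes = sorted({s["complex"] for s in samples})
    for held_out in complexes:
        test = [i for i, s in enumerate(samples) if s["complex"] == held_out]
        train = [i for i, s in enumerate(samples) if s["complex"] != held_out]
        yield held_out, train, test

# Toy samples; the ddg values are illustrative only.
samples = [
    {"complex": "1KTZ", "ddg": 0.4},
    {"complex": "2JEL", "ddg": 1.1},
    {"complex": "1KTZ", "ddg": -0.2},
    {"complex": "1YY9", "ddg": 0.9},
]
splits = list(leave_one_complex_out(samples))
```

Each split trains on all other complexes and tests on the held-out one, so the per-complex R_{P} values measure how well the model extrapolates to an unseen complex.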
Intra-protein-level leave-one-out cross-validation
Cross-validation was carried out within each protein complex. For this test, our model reached average/median R_{P} values of 0.170/0.215, which are significantly lower than the tenfold cross-validation result over the entire dataset. One possible reason for this behaviour is that the training set for each complex is too small, with an average of only 27 samples per complex. This result also implies that our model needs a diversity of training samples to achieve stable and consistent prediction quality (see Supplementary Table 3 for more details).
Prediction result analysis for different mutation regions
The locations of the site mutations could be categorized into five different regions: interior, surface, rim, support and core (a detailed definition can be found in the Methods). In experimental data, mutations at the core or support region have a higher average energy change of around 1.8 kcal mol^{–1} (1.72 kcal mol^{–1} and 1.91 kcal mol^{–1}, respectively), whereas mutations at the rim or interior region have an average energy change of around 0.8 kcal mol^{–1} (0.82 kcal mol^{–1} and 0.83 kcal mol^{–1}, respectively), as shown in Fig. 4. On the other hand, the surface mutations have an average energy change of less than 0.2 kcal mol^{–1}. Similar patterns regarding mutation sites and energy changes were reported in the literature^{49}. A possible reason for these patterns is that different mutation regions vary in their accessibility to water; in general, surface, interior and rim regions have greater access to water than the core and support regions.
Figure 4 shows our predictions concerning different mutation regions. Average R_{P} values of 0.60, 0.66, 0.66, 0.65 and 0.48 were achieved for the core, rim, support, interior and surface regions, respectively. This result shows that the performance is consistent among different mutation regions except for the surface region. We believe that the relatively inferior performance for surface mutations is due to the small data size and to the fact that the energy disturbance caused by surface mutations is small on average.
Prediction result analysis for different mutation types
The pattern of PPI binding affinity changes over different mutation types is important for protein design. We test how well the model predictions resemble the distribution in experimental data. Here we investigate the behaviour of our model for 20 different amino acid types in the AB-Bind S645 set. A reverse mutation from ‘B’ to ‘A’ is considered to be the same mutation type as from ‘A’ to ‘B’, and the associated energy change takes the opposite sign (the mutation count for each mutation type can be found in Supplementary Fig. 1).
Overall, our predicted patterns are remarkably similar to those of experimental data in terms of both average binding energy changes and variance of binding energy changes, as shown in Fig. 5. It is interesting to note that all the mutations to alanine have a positive energy change; a possible reason is that mutations from a large residue to a small one could lead to a stabilizing effect on the whole system. Aside from the size of amino acids, we also categorized them into charged, polar, hydrophobic and special-case groups. In terms of binding affinity changes, we find that most mutations from polar to hydrophobic residues have a positive free energy change (for example, S to M), which means mutations from polar residues to hydrophobic residues would make the whole PPI system more stable. We also observed that a mutation from charged residues to uncharged polar residues could lead to a negative energy change; for example, lysine to serine (K to S), which means such mutations might have broken some charge–charge interaction pairs.
Although our model shares a similar pattern in the variance of energy changes with experimental data, the variance of the model predictions is generally lower than the experimental data as shown in Fig. 5. It remains a challenging task to come up with predictions with a diversity level the same as that of experimental data.
Conclusion
The importance of PPIs is evident from the intensive efforts to study them from many perspectives, including quantum mechanics, molecular mechanics, biochemistry, biophysics and molecular biology. Nevertheless, predicting mutation-induced binding affinity changes remains difficult; for example, the best previously reported R_{P} value between predicted ΔΔG values and experimental data in cross-validations of a commonly used PPI database, AB-Bind^{10}, is only 0.53.
Topology has recently been shown to be surprisingly effective in simplifying biomolecular structural complexity^{26,27,29}. It has been used to win worldwide competitions in computer-aided drug design^{36}. It is therefore of enormous importance to exploit topology for understanding PPIs. In this work, we propose TopNetTree for ΔΔG predictions; specifically, an element- and site-specific persistent homology is introduced to characterize PPIs. Furthermore, we propose a machine learning algorithm, CNN-assisted GBTs, to pair with the topological method for the prediction of PPI ΔΔG. We demonstrate that the proposed TopNetTree achieves an R_{P} of 0.65, which is about 22% better than the previous best result for the AB-Bind dataset. For another benchmark PPI dataset, SKEMPI, the present method significantly outperforms the state of the art in the literature.
Methods
Simplicial complex and filtration
An abstract simplicial complex is a finite collection of sets of points (that is, atoms) \(K={\{{\sigma }_{i}\}}_{i}\), where the elements in σ_{i} are called vertices and σ_{i} is called a k-simplex if it has k + 1 distinct vertices. If τ ⊆ σ_{i} then τ is called a face of σ_{i}. A simplicial complex, K, is valid if τ ⊆ σ_{i} for σ_{i} ∈ K indicates that τ ∈ K, and that the non-empty intersection of any two simplices σ_{1}, σ_{2} ∈ K is a face of both σ_{1} and σ_{2}.
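The validity condition can be checked mechanically, as in the sketch below (function name illustrative). For an abstract simplicial complex it is enough to test closure under taking faces: once every non-empty subset of a simplex is in K, the non-empty intersection of two simplices is automatically a face of both.

```python
from itertools import combinations

def is_simplicial_complex(simplices):
    """Check that every non-empty proper subset of each simplex is present."""
    K = {frozenset(s) for s in simplices}
    for s in K:
        for r in range(1, len(s)):
            for face in combinations(s, r):
                if frozenset(face) not in K:
                    return False
    return True

# A filled triangle with all of its faces is valid.
triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2), (0, 1, 2)]
# Removing the edge (1, 2) breaks closure under faces.
broken = [s for s in triangle if s != (1, 2)]
valid, invalid = is_simplicial_complex(triangle), is_simplicial_complex(broken)
```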
In practice, it is favourable to characterize point clouds or atomic positions at various spatial scales rather than in a fixed-scale simplicial complex representation. To construct a scale-changing simplicial complex, consider a function \(f:K\to {\mathbb{R}}\) that satisfies f(τ) ≤ f(σ) whenever τ ⊆ σ. Given a real value, x, f induces a subcomplex of K by constructing a sublevel set, K(x) = {σ ∈ K∣f(σ) ≤ x}. As K is finite, the range of f is also finite and the induced subcomplexes, when ordered, form a filtration of K,
\[\varnothing \subseteq K({x}_{1})\subseteq K({x}_{2})\subseteq \cdots \subseteq K({x}_{\ell })=K.\tag{2}\]
There are many constructions of K and one that is widely used for point clouds is the Vietoris–Rips complex. Given K as the collection of all possible simplices from a set of atomic coordinates up to a fixed dimension, the filtration function is defined as f_{Rips}(σ) = max{d(v_{i}, v_{j})∣v_{i}, v_{j} ∈ σ} for σ ∈ K, where d is a predefined distance function between the vertices; for example, D_{e}. In practice, an upper bound of the filtration value is set to avoid an excessively large simplicial complex. Another efficient construction called the alpha complex^{23} is often used to characterize geometry, and we denote the filtration function by \({f}_{\alpha }:{\rm{DT}}(X)\to {\mathbb{R}}\), where DT(X) is the simplicial complex that is induced by the Delaunay triangulation of the set of atomic coordinates, X (ref. ^{23}). The filtration function is defined as \({f}_{\alpha }(\sigma )=\max\{\frac{1}{2}{D}_{\mathrm{e}}({v}_{i},{v}_{j})\mid {v}_{i},{v}_{j}\in \sigma \}\) for σ ∈ DT(X). Returning to molecular structures, the filtration of simplicial complexes describes the topological characteristics of interaction hypergraphs under various interaction range assumptions.
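The Vietoris–Rips filtration function can be sketched by brute force for a small point cloud. The sketch enumerates all simplices up to a chosen dimension, assigns each the largest pairwise distance among its vertices, and drops those above the upper bound; it is meant only to make the definition concrete (dedicated persistent homology software is far more efficient), and the function name and parameters are illustrative.

```python
import math
from itertools import combinations

def rips_filtration(points, max_dim=2, max_value=math.inf, dist=math.dist):
    """Return (filtration value, vertex tuple) pairs in filtration order."""
    simplices = []
    for k in range(1, max_dim + 2):  # a (k-1)-simplex has k vertices
        for verts in combinations(range(len(points)), k):
            # f_Rips: largest pairwise distance (0 for a single vertex)
            value = max(
                (dist(points[i], points[j]) for i, j in combinations(verts, 2)),
                default=0.0,
            )
            if value <= max_value:
                simplices.append((value, verts))
    # Sort by value, then dimension, so faces precede cofaces at equal values.
    simplices.sort(key=lambda s: (s[0], len(s[1]), s[1]))
    return simplices

# A 3-4-5 right triangle.
points = [(0.0, 0.0), (3.0, 0.0), (0.0, 4.0)]
filtration = rips_filtration(points)
```

A different distance, such as a precomputed modified distance, could be supplied through the `dist` argument by passing a function that looks up matrix entries.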
Homology and persistence
A homology group (in simplicial homology) of a simplicial complex topologically depicts hole-like structures of different dimensions. Given a simplicial complex, K, a k-chain is a finite formal sum of k-simplices in K; that is, \(\sum _{i}{a}_{i}{\sigma }_{i}\). There are many choices for coefficients, \(a_i\), and we choose \({a}_{i}\in {{\mathbb{Z}}}_{2}\) for simplicity. The kth chain group (denoted C_{k}(K)) comprises all of the k-chains under the addition that is induced by the addition of coefficients. A boundary operator ∂_{k}: C_{k}(K) → C_{k−1}(K) connects chain groups of different dimensions by mapping a chain to the alternating sum of its codimension-1 faces. It suffices to define the boundary operator on simplices,
\[{\partial }_{k}(\{{v}_{0},\ldots ,{v}_{k}\})=\sum _{i=0}^{k}{(-1)}^{i}\{{v}_{0},\ldots ,{\hat{v}}_{i},\ldots ,{v}_{k}\},\tag{3}\]
where \({\hat{v}}_{i}\) indicates the absence of vertex v_{i}. The kth cycle group (denoted Z_{k}(K)) is defined to be the kernel of ∂_{k}, whose members are called k-cycles. The kth boundary group is the image of ∂_{k+1} and is denoted B_{k}(K). It follows that B_{k}(K) is a subgroup of Z_{k}(K) based on the property of boundary maps, ∂_{k} ∘ ∂_{k+1} = 0. The kth homology group, H_{k}(K), is defined to be the quotient group Z_{k}(K)∕B_{k}(K). The equivalence classes in H_{k}(K) correspond to k-dimensional holes in K that cannot be deformed to each other by adding or subtracting the boundary of a subcomplex.
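With coefficients in Z_{2}, the ranks of the homology groups (Betti numbers) follow from the ranks of the boundary maps: dim H_{k} = dim C_{k} − rank ∂_{k} − rank ∂_{k+1}. The sketch below computes them for a tiny complex, assuming that simplices are given as sorted vertex tuples grouped by dimension and storing each boundary as a bitmask over the codimension-1 faces; the function names are illustrative.

```python
from itertools import combinations

def boundary_rows(k_simplices, faces):
    """One bitmask per k-simplex marking its codimension-1 faces (mod 2)."""
    index = {s: i for i, s in enumerate(faces)}
    rows = []
    for s in k_simplices:
        mask = 0
        for face in combinations(s, len(s) - 1):
            mask |= 1 << index[face]
        rows.append(mask)
    return rows

def rank_gf2(rows):
    """Rank of a binary matrix via Gaussian elimination over Z_2."""
    rows = list(rows)
    rank = 0
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        low = pivot & -pivot  # lowest set bit serves as the pivot column
        rows = [r ^ pivot if r & low else r for r in rows]
    return rank

def betti_numbers(by_dim):
    """by_dim[k] lists the k-simplices of the complex as sorted tuples."""
    ranks = [0]  # the boundary of a vertex is zero
    for k in range(1, len(by_dim)):
        ranks.append(rank_gf2(boundary_rows(by_dim[k], by_dim[k - 1])))
    ranks.append(0)  # no simplices above the top dimension
    return [len(by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(len(by_dim))]

vertices = [(0,), (1,), (2,)]
edges = [(0, 1), (0, 2), (1, 2)]
hollow = betti_numbers([vertices, edges])               # one component, one loop
filled = betti_numbers([vertices, edges, [(0, 1, 2)]])  # the loop is filled in
```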
Given a filtration as in equation (2), in addition to characterizing the homology group at each frame, H_{k}(K(x_{i})), we also want to track how topological features persist along the sequence. Viewing the H_{k}(K(x_{i})) as vector spaces together with the linear transformations induced by the inclusion maps gives a persistence module,
\[\cdots \to {H}_{k}(K({x}_{i}))\to {H}_{k}(K({x}_{i+1}))\to \cdots .\qquad (4)\]
An interval module with respect to [b, d), denoted \({{\mathbb{I}}}_{[b,d)}\), is defined as a collection of vector spaces {V_{i}} that are connected by linear maps, f_{i}: V_{i} → V_{i+1}, where \({V}_{i}={{\mathbb{Z}}}_{2}\) for i ∈ [b, d) and V_{i} = 0 elsewhere, and f_{i} is the identity map when possible and zero otherwise. The persistence module in equation (4) can be decomposed as a direct sum of interval modules, \({\oplus }_{[b,d)\in B}{{\mathbb{I}}}_{[b,d)}\). Each \({{\mathbb{I}}}_{[b,d)}\) corresponds to a homology class that appears at filtration value b and disappears at filtration value d (the values b and d are usually called the birth and death values). The collection of these pairs, B, encodes the evolution of k-dimensional holes as the filtration parameter varies and thus records the topological configuration of the input point cloud under different interaction ranges if a distance-based filtration is used. Figure 3 illustrates filtration and persistence.
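To illustrate how birth and death values can be extracted, the sketch below applies the standard column reduction of the ℤ_2 boundary matrix to a small filtered complex. It is a textbook algorithm written for clarity rather than speed, not the software used in this work; the function name `persistence_pairs`, the hand-built square complex and its filtration values are illustrative assumptions.

```python
def persistence_pairs(filtration):
    """Reduce the Z_2 boundary matrix of a filtered complex.

    `filtration` is a list of (value, simplex) pairs, with each simplex a
    sorted tuple of vertex indices, ordered so that every face precedes its
    cofaces. Returns sorted (dimension, birth, death) triples; death is
    float("inf") for classes that never die.
    """
    index = {simplex: i for i, (_, simplex) in enumerate(filtration)}
    columns = []       # reduced boundary columns, each a set of row indices
    pivot_owner = {}   # largest row index of a reduced column -> that column
    paired = set()
    bars = []
    for j, (value, simplex) in enumerate(filtration):
        col = set()
        if len(simplex) > 1:
            col = {index[simplex[:i] + simplex[i + 1:]] for i in range(len(simplex))}
        # Add earlier columns (over Z_2) until the pivot is new or col is empty.
        while col and max(col) in pivot_owner:
            col ^= columns[pivot_owner[max(col)]]
        columns.append(col)
        if col:
            i = max(col)  # simplex i created the class that simplex j kills
            pivot_owner[i] = j
            paired.update((i, j))
            birth = filtration[i][0]
            if value > birth:  # skip zero-length intervals
                bars.append((len(filtration[i][1]) - 1, birth, value))
    for j, (value, simplex) in enumerate(filtration):
        if j not in paired:
            bars.append((len(simplex) - 1, value, float("inf")))
    return sorted(bars)


# Hypothetical example: a square whose four sides enter at 1.0 and whose
# diagonal and two triangles enter at 2.0.
square = [
    (0.0, (0,)), (0.0, (1,)), (0.0, (2,)), (0.0, (3,)),
    (1.0, (0, 1)), (1.0, (1, 2)), (1.0, (2, 3)), (1.0, (0, 3)),
    (2.0, (0, 2)), (2.0, (0, 1, 2)), (2.0, (0, 2, 3)),
]
bars = persistence_pairs(square)
```

In this example the loop formed by the four sides gives the interval [1.0, 2.0) in dimension 1, three connected components merge at 1.0, and one component persists indefinitely.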
Mutation regions
Mutant residue locations were classified into interface and noninterface regions. Interface residues were further classified as the rim, support or core, and noninterface residues were also further classified as surface or interior, based on the classification approach by Levy^{50}.
Residue classification is mainly based on the change in relative residue accessible surface area (rASA) between the protein–protein complex (rASA_{c}) and the individual protein components of the complex (rASA_{m}), as shown in Table 3. The accessible surface area was calculated with AREAIMOL from the CCP4 suite^{51} and the relative solvent accessibility was obtained by normalizing the absolute value with that of the same amino acid in a G–X–G peptide^{52}.
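The classification logic can be sketched in a few lines of Python. The 25% cutoff and the decision rules below are assumptions taken from the commonly stated form of the Levy scheme and are not copied from Table 3, which remains the authoritative definition; the function name `classify_residue` and the example values are likewise hypothetical.

```python
def classify_residue(rasa_monomer, rasa_complex, cutoff=0.25):
    """Assign a structural region from relative accessible surface areas.

    rasa_monomer: rASA of the residue in the isolated component (rASA_m).
    rasa_complex: rASA of the residue in the complex (rASA_c).
    cutoff: assumed 25% threshold, expressed as a fraction.
    """
    delta = rasa_monomer - rasa_complex  # surface area buried on binding
    if delta <= 0:
        # Non-interface residue: accessibility is unchanged by complexation.
        return "interior" if rasa_complex < cutoff else "surface"
    if rasa_monomer < cutoff:
        return "support"  # already buried before binding
    if rasa_complex > cutoff:
        return "rim"      # still exposed after binding
    return "core"         # exposed in the monomer, buried in the complex


# Hypothetical rASA values for five residues.
regions = [
    classify_residue(0.60, 0.60),
    classify_residue(0.10, 0.10),
    classify_residue(0.20, 0.05),
    classify_residue(0.60, 0.40),
    classify_residue(0.60, 0.10),
]
```

With computed surface areas, a small tolerance on the change in rASA may be preferable to an exact comparison with zero.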
Data availability
All the data are available through the original papers cited or through our Code Ocean capsule (https://doi.org/10.24433/CO.0537487.v1).
Code availability
All source code and models are publicly available through a Code Ocean compute capsule (https://doi.org/10.24433/CO.0537487.v1).
References
 1.
Chothia, C. et al. Conformations of immunoglobulin hypervariable regions. Nature 342, 877–883 (1989).
 2.
Carter, P. J. Potent antibody therapeutics by design. Nat. Rev. Immunol. 6, 343–357 (2006).
 3.
Demarest, S. J. & Glaser, S. M. Antibody therapeutics, antibody engineering, and the merits of protein stability. Curr. Opin. Drug Discov. Dev. 11, 675–687 (2008).
 4.
Shire, S. J., Shahrokh, Z. & Liu, J. Challenges in the development of high protein concentration formulations. J. Pharm. Sci. 93, 1390–1402 (2004).
 5.
Barouch, D. H. et al. Therapeutic efficacy of potent neutralizing HIV-1-specific monoclonal antibodies in SHIV-infected rhesus monkeys. Nature 503, 224–228 (2013).
 6.
Glennie, M. J. & van de Winkel, J. G. Renaissance of cancer therapeutic antibodies. Drug Discov. Today 8, 503–510 (2003).
 7.
Ben-Kasus, T., Schechter, B., Sela, M. & Yarden, Y. Cancer therapeutic antibodies come of age: targeting minimal residual disease. Mol. Oncol. 1, 42–54 (2007).
 8.
Geng, C., Xue, L. C., Roel-Touris, J. & Bonvin, A. M. Finding the ΔΔG spot: are predictors of binding affinity changes upon mutations in protein–protein interactions ready for it? WIREs Comput. Mol. Sci. 9, e1410 (2019).
 9.
Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28, 235–242 (2000).
 10.
Sirin, S., Apgar, J. R., Bennett, E. M. & Keating, A. E. AB-Bind: antibody binding mutational database for computational affinity predictions. Protein Sci. 25, 393–409 (2016).
 11.
Moal, I. H. & Fernández-Recio, J. SKEMPI: a structural kinetic and energetic database of mutant protein interactions and its use in empirical models. Bioinformatics 28, 2600–2607 (2012).
 12.
Patil, S. P., Ballester, P. J. & Kerezsi, C. R. Prospective virtual screening for novel p53–MDM2 inhibitors using ultrafast shape recognition. J. Comput. Aided Mol. Des. 28, 89–97 (2014).
 13.
Demerdash, O. N. A., Daily, M. D. & Mitchell, J. C. Structure-based predictive models for allosteric hot spots. PLoS Comput. Biol. 5, e1000531 (2009).
 14.
Kortemme, T., Morozov, A. V. & Baker, D. An orientation-dependent hydrogen bonding potential improves prediction of specificity and structure for proteins and protein–protein complexes. J. Mol. Biol. 326, 1239–1259 (2003).
 15.
Xiang, J. Z. & Honig, B. Jackal: A Protein Structure Modeling Package (Columbia University and Howard Hughes Medical Institute, 2002).
 16.
Webb, B. & Sali, A. Comparative protein structure modeling using modeller. Curr. Protoc. Bioinformatics 47, 5–6 (2014).
 17.
Zhu, K. et al. Antibody structure determination using a combination of homology modeling, energy-based refinement, and loop prediction. Proteins Struct. Funct. Bioinformatics 82, 1646–1655 (2014).
 18.
Zhang, C., Liu, S. & Zhou, Y. Accurate and efficient loop selections by the DFIRE-based all-atom statistical potential. Protein Sci. 13, 391–399 (2004).
 19.
Schymkowitz, J. et al. The FoldX web server: an online force field. Nucleic Acids Res. 33, W382–W388 (2005).
 20.
Discovery Studio Modeling Environment (Biovia, 2017).
 21.
Pires, D. E. & Ascher, D. B. mCSM-AB: a web server for predicting antibody–antigen affinity changes upon mutation with graph-based signatures. Nucleic Acids Res. 44, W469–W473 (2016).
 22.
Frosini, P. & Landi, C. Size theory as a topological tool for computer vision. Pattern Recognition Image Anal. 9, 596–603 (1999).
 23.
Edelsbrunner, H., Letscher, D. & Zomorodian, A. Topological persistence and simplification. Discrete Comput. Geom. 28, 511–533 (2002).
 24.
Zomorodian, A. & Carlsson, G. Computing persistent homology. Discrete Comput. Geom. 33, 249–274 (2005).
 25.
Zomorodian, A. & Carlsson, G. Localized homology. Comput. Geom. 41, 126–148 (2008).
 26.
Xia, K. L. & Wei, G. W. Persistent homology analysis of protein structure, flexibility and folding. Int. J. Numer. Methods Biomed. Eng. 30, 814–844 (2014).
 27.
Gameiro, M. et al. Topological measurement of protein compressibility via persistence diagrams. Japan J. Industr. Appl. Math. 32, 1–17 (2014).
 28.
Xia, K. L. & Wei, G. W. Persistent topology for cryo-EM data analysis. Int. J. Numer. Methods Biomed. Eng. 31, e02719 (2015).
 29.
Cang, Z. X. et al. A topological approach to protein classification. Mol. Based Math. Biol. 3, 140–162 (2015).
 30.
Yao, Y. et al. Topological methods for exploring low-density states in biomolecular folding pathways. J. Chem. Phys. 130, 04B614 (2009).
 31.
Kovacev-Nikolic, V., Bubenik, P., Nikolić, D. & Heo, G. Using persistent homology and dynamical distances to analyze protein binding. Stat. Appl. Genet. Mol. Biol. 15, 19–38 (2016).
 32.
Cang, Z. & Wei, G.-W. Integration of element specific persistent homology and machine learning for protein–ligand binding affinity prediction. Int. J. Numer. Methods Biomed. Eng. 34, e2914 (2018).
 33.
Cang, Z. X. & Wei, G. W. Analysis and prediction of protein folding energy changes upon mutation by element specific persistent homology. Bioinformatics 33, 3549–3557 (2017).
 34.
Cang, Z. & Wei, G.-W. TopologyNet: topology based deep convolutional and multitask neural networks for biomolecular property predictions. PLoS Comput. Biol. 13, e1005690 (2017).
 35.
Cang, Z., Mu, L. & Wei, G.-W. Representability of algebraic topology for biomolecules in machine learning based scoring and virtual screening. PLoS Comput. Biol. 14, e1005929 (2018).
 36.
Nguyen, D. D. et al. Mathematical deep learning for pose and binding affinity prediction and ranking in D3R grand challenges. J. Comput. Aided Mol. Des. https://doi.org/10.1007/s1082201801466 (2018).
 37.
Frosini, P. A distance for similarity classes of submanifolds of a euclidean space. Bull. Australian Math. Soc. 42, 407–415 (1990).
 38.
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
 39.
Jankauskaitė, J., Jiménez-García, B., Dapkūnas, J., Fernández-Recio, J. & Moal, I. H. SKEMPI 2.0: an updated benchmark of changes in protein–protein binding energy, kinetics and thermodynamics upon mutation. Bioinformatics 35, 462–469 (2018).
 40.
Xiong, P., Zhang, C., Zheng, W. & Zhang, Y. BindProfX: assessing mutation-induced binding affinity change by protein interface profiles with pseudo-counts. J. Mol. Biol. 429, 426–434 (2017).
 41.
Lensink, M. F. & Wodak, S. J. Docking, scoring, and affinity prediction in CAPRI. Proteins Struct. Funct. Bioinformatics 81, 2082–2095 (2013).
 42.
Szilagyi, A. & Zhang, Y. Template-based structure modeling of protein–protein interactions. Curr. Opin. Struct. Biol. 24, 10–23 (2014).
 43.
Dehouck, Y., Kwasigroch, J. M., Rooman, M. & Gilis, D. BeAtMuSiC: prediction of changes in protein–protein binding affinity on mutations. Nucleic Acids Res. 41, W333–W339 (2013).
 44.
Petukh, M., Dai, L. & Alexov, E. SAAMBE: webserver to predict the charge of binding free energy caused by amino acids mutations. Int. J. Mol. Sci. 17, 547 (2016).
 45.
Liu, S., Zhang, C., Zhou, H. & Zhou, Y. A physical reference state unifies the structure-derived potential of mean force for protein folding and binding. Proteins Struct. Funct. Bioinformatics 56, 93–101 (2004).
 46.
Jemimah, S., Yugandhar, K. & Michael Gromiha, M. PROXiMATE: a database of mutant protein–protein complex thermodynamics and kinetics. Bioinformatics 33, 2787–2788 (2017).
 47.
Liu, Q., Chen, P., Wang, B., Zhang, J. & Li, J. dbMPIKT: A database of kinetic and thermodynamic mutant protein interactions. BMC Bioinformatics 19, 455 (2018).
 48.
Rodrigues, C. H. M., Myung, Y., Pires, D. E. V. & Ascher, D. B. mCSM-PPI2: predicting the effects of mutations on protein–protein interactions. Nucleic Acids Res. 47, W338–W344 (2019).
 49.
Petukh, M., Li, M. & Alexov, E. Predicting binding free energy change caused by point mutations with knowledge-modified MM/PBSA method. PLoS Comput. Biol. 11, e1004276 (2015).
 50.
Levy, E. D. A simple definition of structural regions in proteins and its use in analyzing interface evolution. J. Mol. Biol. 403, 660–670 (2010).
 51.
Collaborative, C. P. et al. The CCP4 suite: programs for protein crystallography. Acta Crystallogr. D 50, 760 (1994).
 52.
Miller, S., Janin, J., Lesk, A. M. & Chothia, C. Interior and surface of monomeric proteins. J. Mol. Biol. 196, 641–656 (1987).
Acknowledgements
This work was supported in part by NSF grants DMS1721024, DMS1761320 and IIS1900473, NIH grant R01GM126189, Pfizer and Bristol–Myers Squibb.
Author information
Contributions
G.W.W. was responsible for conceptualization, supervision and funding acquisition. M.W. and Z.C. designed the project. M.W. curated the data. All authors carried out the investigations and wrote the manuscript.
Corresponding author
Correspondence to Guo-Wei Wei.
Ethics declarations
Competing interests
The authors declare no competing interests.
Cite this article
Wang, M., Cang, Z. & Wei, G. A topologybased network tree for the prediction of protein–protein binding affinity changes following mutation. Nat Mach Intell 2, 116–123 (2020). https://doi.org/10.1038/s4225602001496