Main

Interactions between proteins are conceptually described as lock-and-key complexes1, reflected in multiple successful protein–protein interaction (PPI) algorithms, such as PRISM, PSIVER and MaSIF2,3,4. These and other computational packages predict protein complex formation and interaction sites by assessing the pairwise similarity of a potential ‘key’ with many other ‘keys’. A similar concept can be applied to nanoparticle (NP)–protein interactions, but its realization requires a massive library of X-ray diffraction data for NP–protein pairs comparable to the Protein Data Bank (PDB), which is currently unavailable. Other PPI algorithms, such as SPPIDER and Pre-PPI, combine the geometrical description of docking molecules with structural relations at the organism level, exemplified by protein networks from evolutionary homology and genomics5,6. Importantly, these PPI software packages7,8,9,10,11 also assume that the interacting molecules are linear polymers of amino acids (AAs). Such descriptors are natural for proteins but make it impossible to extend these algorithms to bioinspired inorganic NPs, even though they may carry some AAs as surface ligands11,12,13. The simplified molecular-input line-entry system can annotate the structure of nonpeptide biomolecules14 but is, again, inapplicable to biomimetic NPs, even those based on carbon atoms, while many NPs exhibiting strong specific biological activity are entirely inorganic15,16. A unifying structural description of proteins and NPs is possible at the atomistic molecular dynamics (MD) level that represents the state of the art in predictions of NP–protein interactions17,18,19. However, the interaction time probed by typical atomistic MD methods is mainly limited to hundreds of nanoseconds17,18,19,20,21. Even with the dedicated Anton2 supercomputer, the interaction time can only reach up to 2 μs (ref. 22), while the time required for the formation of protein–protein and NP–protein complexes may exceed minutes or sometimes hours23,24. While being significant for complexes between macromolecules, the weak multicentre interactions exemplified by dipole–dipole forces and collective hydrogen bonds are difficult to implement without drastic time restrictions. The complexity of the energy landscape for nanoscale interactions may also lead to entrapment of MD simulations in metastable states before the formation of a fully equilibrated complex.

Here, we analyse the role of different structural features contributing to the formation of protein–protein complexes with the goal of identifying structural descriptors that could be uniformly applicable to complexes between proteins and NPs. Identifying such descriptors would enable one to extend the knowledge gained from the vast PPI datasets and existing algorithms to NP–protein pairs encountered in diverse biomedical contexts, from drug delivery to the environmental effects of NPs.

Results

Distance matrices of protein complexes

A protein complex (Fig. 1a) can be represented as a distance matrix DAB (di,k), where 1 ≤ i ≤ NA and 1 ≤ k ≤ NB, with a set of matrix elements di,k representing the distance in ångströms between pairs of α-carbon (Cα) atoms in AA residues from proteins A and B (Fig. 1b)25. The darkest areas of the matrix (yellow boxes) indicate the AAs in macromolecules A and B that are the closest to each other. The level of proximity of Ai and Bk in DAB (di,k) will be used to distinguish interacting and noninteracting residue pairs in machine-learning (ML) algorithms (Supplementary Fig. 1). The proteins are less likely to form a lock-and-key complex when the predicted probabilities of interacting AA residue pairs within 7 Å from each other are low (<0.5). If this mathematical approach is successful for protein complexes, it can, perhaps, be extended to nanoscale assemblies from abiological nanostructures because it relies on structural coding based on the three-dimensional (3D) geometry of the macromolecules.
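As a minimal sketch of this representation, assuming that the Cα coordinates of each chain are already available as lists of (x, y, z) tuples in ångströms, the inter-chain distance matrix and the 7 Å contact labels could be computed as follows. The function names and the toy coordinates are illustrative assumptions and are not taken from the original work.

```python
import math

def distance_matrix(ca_a, ca_b):
    """Return D_AB as nested lists: D[i][k] is the distance between
    C-alpha i of chain A and C-alpha k of chain B."""
    return [[math.dist(a, b) for b in ca_b] for a in ca_a]

def contact_labels(dmat, cutoff=7.0):
    """Label a residue pair as interacting (1) when its distance is below
    the cutoff, otherwise noninteracting (0)."""
    return [[1 if d < cutoff else 0 for d in row] for row in dmat]

# Toy coordinates (hypothetical, not taken from any PDB entry).
chain_a = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0)]
chain_b = [(0.0, 6.0, 0.0), (0.0, 20.0, 0.0), (3.8, 3.0, 4.0)]

d_ab = distance_matrix(chain_a, chain_b)
labels = contact_labels(d_ab)
```

The resulting binary matrix plays the role of the ML target in this sketch, with one label per residue pair (Ai, Bk).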

Fig. 1: The concept of the distance matrix of a protein complex and the introduction of descriptors.

a, An example of two interacting proteins, chain A and B of PDB ID 1MA9 (vitamin D binding protein and α-actin). b, The distance matrix (in Å) of a protein complex (PDB ID 1MA9), where the yellow box represents the interaction fingerprints between two different proteins. The darkest areas of the matrix (yellow boxes) indicate the AAs in macromolecules A and B that are the closest to each other. c, Feature list of chemical (CH) descriptors. d–f, Example feature visualization of electrostatic charge of the carbon atom (d), hydrophobicity (e) and molecular weight (f). g, Feature list of geometrical (GE) descriptors. h–j, Example feature visualization of minimum inaccessible radius (Rinacc) (h), pocketness (Pocket) (i) and Osipov–Pickup–Dunmur (OPD) chirality index (j). k, Feature list of graph-theoretical (GT) descriptors. l–n, Example feature visualization of Ollivier–Ricci curvature (ORC) (l), Gaussian network models (GNM) modes (m) and multifractal dimension (MFD) (n).


Contributing descriptors

The chemical (CH), geometrical (GE) and graph-theoretical (GT) descriptors are computed and embedded into each of Ai and Bk to form characteristic feature matrices that comprehensively characterize the interacting macromolecules from different physicochemical perspectives. The CH descriptors include the electrostatic charge (C-charges)26, hydrophobicity (Hp), molecular weight (MW), polarity and atomic compositions (C-count) of the biomolecules (Fig. 1c–f and Supplementary Figs. 8–10).

The GE descriptors include Cartesian (local distances and shapes), topological (global organization) and asymmetry (chirality) characteristics of the interacting subunits at the nanoscale. The GE descriptors also include the minimum inaccessible radius (Rinacc), the accessible shell volume (Shell) and the pocketness (Pocket)27 (Fig. 1g–i and Supplementary Fig. 11). Chirality is calculated for the vicinity of each AA residue as the Osipov–Pickup–Dunmur (OPD, Fig. 1j) indices28. These GE measures assess the compatibility of the protein geometries to each other at nanometre and subnanometre scales. We found that the areas of positive OPD values in proteins are distinctly associated with α-helices (Supplementary Fig. 12).

The proteins and their assemblies can also be represented as a graph, G(n, e), constructed by taking individual AA residues as nodes (n) while the edges (e) between the nodes are assigned depending on the distance matrix for a single-folded protein, DA (di,j). GT descriptors enable structural encoding of protein complexes without reliance on the AA sequence in the macromolecule. Furthermore, GT descriptors add classifiers that depict the shape complexity, chemical connectedness and molecular deformability of these structures (Fig. 1k–n). GT descriptors utilize well-developed applied-mathematics methods that enable acceleration of the computations while reducing the computational resources required29,30. For GT descriptors, three parameters were calculated: (1) Gaussian network models (GNMs) representing macromolecules as elastic networks to describe their flexibility31,32 (Fig. 1m and Supplementary Fig. 15), (2) the Ollivier–Ricci curvature (ORC) and Forman–Ricci curvature (FRC) describing the macromolecules in terms of Riemannian geometry to identify the segments subject to conformational changes (Fig. 1l and Supplementary Fig. 14) and (3) the node-based multifractal dimension (MFD) describing the molecules as fractals to account for the hierarchical organization of macromolecules essential for their interactions (Fig. 1n and Supplementary Fig. 13).

Descriptor correlations in protein–protein complexes

We analysed cross-correlations between the CH, GE and GT descriptors to understand their (1) independent inputs into the formation of protein complexes and (2) enumeration of suitability of non-proteinaceous macromolecules for descriptors of similar complexes with NPs. Apart from some a priori expected correlations between FRC and ORC, Shell versus Rinacc and MW and C-count, the correlation between different components of the descriptors is small (Fig. 2a,b). Notably, the correlations between the descriptors for the molecules overall (Fig. 2a) and interfaces are quite different (Fig. 2b). This fact highlights (1) the mutual structural adaptation of the macromolecule at the interface and (2) the significance of descriptors such as Rinacc, Shell, Pocket, OPD, ORC and MFD characterizing the geometry and dynamics of the interfaces. In view of the lock-and-key concept, one can also ask whether a specific value of descriptors in one protein requires a particular value of the same feature on the counterpart forming the complex, which will be informative in predictions of preferred interaction sites33,34. The contour plots in Fig. 2c–e present the distribution of Rinacc, ORC and OPD for AA pairs at different distances from each other. The plots for distances of <7 Å and 7–10 Å represent AAs located in close proximity of PPI. The distinct maxima on these plots, especially for AAs located at distances of 7–10 Å, vividly indicate that all these structural descriptors indeed require specific values when the macromolecules try to fit each other. As such, the local chirality in the neighbourhood of AAs located directly across the interface (<7 Å) tends to be small, indicating that highly mirror-asymmetric ‘holes’ have greater difficulty in finding a fitting ‘key’, which shifts to mutually positive OPD values of about 0.3 × 10⁷ for AAs separated by 7–10 Å. The three maxima observed in the contour plots for Rinacc for distances of 10–20 Å clearly indicate the long-range correlation between GE features required for complex formation (see Supplementary Figs. 16 and 17 for additional descriptors).

Fig. 2: Analysis of descriptors.

a, Descriptor correlations for all AA residues in a protein complex (PDB ID 1MA9). b, Descriptor correlation for interface AA residues (contact distance <7 Å) in a protein complex (PDB ID 1MA9) shown as correlation matrices where the size of the square depicts the correlation strength and its colour indicates its sign. Pocketness (Pocket) features are expected to correlate with accessible shell volume (Shell) and minimum inaccessible radius (Rinacc) because geometries with bumps and protrusions are more accessible than others. There are also positive correlations between ORC, FRC, GNM modes and MFD with Rinacc and Shell, because these GE and GT descriptors tend to have higher values in the convex part of the molecular structures. c–e, The correlation distribution of Rinacc (c), ORC (d) and OPD indices (e) values from each protein forming a complex depending on the distance between AA residues. In each contour plot, the x and y axes indicate the descriptor values of protein A and B, respectively. Four distance classes describe the physical distance between AA residues in the protein–protein complex. The 7 Å and 10 Å classes describe the immediate vicinity of protein–protein interfaces. The values for the OPD indices in e are divided by 1 × 10⁷.


ML algorithms for protein–protein complexes

CH, GE and GT descriptors calculated for each AA residue of the constituents in the protein complexes served as inputs for ML algorithms, while the distance matrix DAB (di,k) was the output (Figs. 1 and 3a and Supplementary Sections 1.1–1.4). We trained different ML algorithms, namely logistic regression, Gaussian naïve Bayes, support vector machine (SVM), random forest (RF), XGBoost (XGB) and deep neural network (DNN), using seven independent datasets: all descriptors, CH only, GE only, GT only, CH + GE, CH + GT and GE + GT. Among the sets with a single descriptor type, the GT-only set performed best versus the CH-only or GE-only sets, with an area under the receiver operating characteristic curve (ROC-AUC) of 87.7 ± 0.7%, accuracy as high as 80.8 ± 1% and an F1 score of 80.0 ± 1.2% with the tenfold cross-validated DNN model. The same characteristics were 83.8 ± 1.3%, 77.0 ± 1.2% and 75.6 ± 1.2% for the GE descriptors and 59.0 ± 1.1%, 57.4 ± 0.8% and 54.8 ± 2.8% for the CH descriptors, respectively. It was unexpected that, when adding the CH descriptors to the GE descriptors for training the high-performance DNN model, the ROC-AUC and accuracy increased by only 1.9% and 2.1%, respectively. With the addition of the CH descriptors to the GT descriptors, the ROC-AUC and accuracy scores decreased by 0.1% and 0.2%, respectively (Fig. 3b,c and Table 1). The other ML algorithms, such as RF and XGB (Table 1 and Supplementary Fig. 18), performed similarly to the DNN and maintained the same trends in terms of ROC-AUC, accuracy and F1 score (Supplementary Section 3.2.1).

Fig. 3: Construction of distance-based feature matrices for prediction of protein complexes.

a, Schematic explanation of feature matrix extraction from protein complexes. The descriptor vectors are embedded per each AA residue, and the pairwise descriptor sets form the final feature matrix to train the ML algorithms. b,c, Comparison of the ROC curve (b) and model performance metrics (c) depending on the descriptor subsets when using the tenfold cross-validated DNN.


Table 1 Comparison of ML algorithms depending on the training of different descriptor subsets

The feature ablation study suggests that the GT descriptors make the most significant contribution to the predictive power of ML algorithms. Despite the absence of a strong direct correlation between the CH and other descriptors, it is apparent that GT and GE descriptors contain adequate information to predict the formation of protein complexes. This finding is quite surprising because electrostatic, van der Waals and hydrogen-bonding interactions are expected to be most relevant to PPI, being dependent on C-charges, MW, C-count and Hp. However, CH descriptors are strongly correlated amongst themselves (Fig. 2a), which indicates that training of ML algorithms using all of them as classifiers increases the laboriousness of the process but not the accuracy of the predictions. The unexpectedly low impact on the ML algorithm performance of adding the CH to the GT or GE descriptors is observed because GE and topological parameters of the macromolecules emerge from the multiplicity of weak and strong chemical interactions, which creates a path for partial embedding of chemical information into GE and GT features. Another important factor is the scale of GE and GT descriptors, which matches the dimensions of the protein–protein interface covering the area of several square nanometres, while the scale of CH features is commensurate with the size of single AA residues.

ML predictions for protein–protein complexes

The predictive capability of the ML algorithm based on different sets of structural descriptors was compared for several proteins that were not included in the training set. Furthermore, they were nonhomologous to those present in the database to test the true ‘learning’ rather than ‘memorization’ capabilities of the algorithms applied. The tested protein complexes included chains A and B of (1) severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) nucleocapsid (PDB ID 6WZO)35, (2) fluorescent protein Dronpa (PDB ID 6NQN)36 and (3) bacterial tryptophan synthase (PDB ID 1C29)37 (Fig. 4, Supplementary Figs. 19–23 and Supplementary Tables 4–6). The formation of a complex was based on the AA residues from each macromolecule being within a distance of 7 Å. These residues defined interfaces between the interacting molecules. For a fair comparison, the ground-truth data were calculated with the same assumptions as the ML algorithms (Fig. 4a,c,e and Supplementary Section 3.3.1). Comparing the outcomes of ML using different descriptor sets, we confirmed that the GE + GT descriptors predicted the interaction sites of each protein complex (Fig. 4b,d,f) with ~80% accuracy with only a few false negatives (Supplementary Tables 4–6). The false positives were predominantly located in the vicinity of the true interface sites (Supplementary Table 3), which reflects the connectivity of the protein globules, perhaps overestimated by GT parameters. When comparing the models trained on the GE + GT versus all the descriptors, both the precision and recall tended to increase for the former, that is, smaller, set of descriptors. Such an effect is related to the ‘curse of dimensionality’ when selection of the most important uncorrelated features from a larger pool of descriptors increases the accuracy of ML algorithms.
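For reference, the precision, recall and F1 score discussed here follow from the counts of true positives, false positives and false negatives among the predicted interface residues. The sketch below assumes that ground-truth and predicted interfaces are given as sets of residue indices; the function name and the example indices are hypothetical and do not reproduce any values from the Supplementary Tables.

```python
def interface_metrics(true_sites, predicted_sites):
    """Return (precision, recall, F1) for predicted interface residues."""
    true_sites, predicted_sites = set(true_sites), set(predicted_sites)
    tp = len(true_sites & predicted_sites)   # correctly predicted residues
    fp = len(predicted_sites - true_sites)   # predicted but not in the interface
    fn = len(true_sites - predicted_sites)   # interface residues that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical residue indices, for illustration only.
truth = {10, 11, 12, 40, 41}
predicted = {10, 11, 12, 13, 40}
precision, recall, f1 = interface_metrics(truth, predicted)
```

In this toy case the single false positive (residue 13) lies next to the true interface, which mirrors the type of error described above.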

Fig. 4: Prediction of protein–protein complexes with the DNN model trained on the GE + GT descriptors.

a–f, The ground-truth interface (a,c,e) and the complex interface predicted by the DNN model trained on GE + GT (b,d,f) of the SARS-CoV-2 dimer protein (PDB ID 6WZO, AB) (a,b), fluorescent protein Dronpa with the β-barrel structures (PDB ID 6NQN, AB) (c,d) and bacterial tryptophan synthase having triose-phosphate isomerase (TIM) barrel structure (PDB ID 1C29, AB) (e,f), with A and B highlighted in red and blue, respectively. The dashed ellipses indicate the main interaction interfaces of protein chains A and B for both the ground-truth and the predicted interfaces.


ML predictions for complexes of proteins with NPs

The exclusion of CH components enables the direct application of the ML algorithms trained on PPI to biomimetic NPs whose structure can be enumerated by the same parameters. The atomic structure of NPs is coarse-grained to match the scale of AA residues in proteins. Considering the number of atoms per AA (10 in glycine, 28 in tryptophan) and the relative abundance of different residues, 13 atoms are grouped into a single ‘residue’ in the coarse-grained representation of carbon nanostructures to obtain DNP (di,j) (Supplementary Section 3.3.5). The G(n, e) of NPs are constructed by taking coarse-grained groups of atoms as nodes and connecting an edge when di,j is less than 7 Å. Utilizing the GE and GT descriptors, we tested the performance of the ML algorithms to predict protein–NP complexes. We focused primarily on carbon-based nanomaterials, namely graphene quantum dots (GQDs), spherical carbon NPs and single-walled carbon nanotubes (SWNTs) (Supplementary Section 3.3.5), but other types of NP can be coarse-grained in a similar way. Our focus on nanocarbons was also based on the rapid development of this diverse class of biocompatible nanomaterials for biomedical applications, such as drug carriers38, antibacterials39, antivirals40 and high-sensitivity bioanalysis41,42. However, research progress in this area is slowed down by difficulties in predicting the enzymatic degradation of nanocarbons43, the protein coronas44,45,46 around them and other types of biochemical processes in the complex milieu of biomolecules.
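The coarse-graining step can be illustrated with a short sketch. The text specifies only the group size (13 atoms per pseudo-residue for carbon nanostructures); how atoms are assigned to groups is described in the Supplementary Information and is not reproduced here. The sketch therefore assumes, purely for illustration, that consecutive atoms in the input order are grouped and that each group is represented by its centroid; the function name and the toy coordinates are hypothetical.

```python
def coarse_grain(atoms, group_size=13):
    """Replace each consecutive group of atoms by its centroid.

    ASSUMPTION: grouping by input order is illustrative only; the
    assignment rule of the original protocol is not given in the text.
    A trailing group with fewer than group_size atoms is kept as is.
    """
    beads = []
    for start in range(0, len(atoms), group_size):
        group = atoms[start:start + group_size]
        n = len(group)
        beads.append(tuple(sum(atom[axis] for atom in group) / n
                           for axis in range(3)))
    return beads

# Toy linear arrangement of 26 atoms spaced 1 Å apart (hypothetical).
atoms = [(float(i), 0.0, 0.0) for i in range(26)]
beads = coarse_grain(atoms)
```

The resulting pseudo-residue coordinates can then be treated like Cα positions when building DNP (di,j) and G(n, e) with the 7 Å cutoff.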

The ML predictions were compared against experimental data and supporting MD simulations from the literature19,47,48. For the simplest example, we tested the docking of carboxylated GQDs to phenol-soluble modulin-α (PSM-α) peptide (PDB ID 5KHB). The prediction results match well with the MD simulation by displaying GQD docking near the N-terminus of the peptides19 (Fig. 5a). Also, we found that predicted interaction sites in the complex between hydroxylated GQDs and a monomer of human islet amyloid polypeptide (hIAPP, PDB ID 2L86) (Fig. 5b) match the experimental observations established independently from (1) comprehensive liquid chromatography with tandem mass spectrometry (LC–MS/MS) evaluation by Faridi et al.47 and (2) quenching of the nanostructure autofluorescence in a dose-dependent manner by Wang et al.48. However, the interaction sites change drastically when hIAPP fibril is formed (PDB ID 6ZRF) (Fig. 5c). Then, GQDs are bound onto the amyloid’s surface, which was again accurately identified in the DNN models using only GT + GE descriptors (Fig. 5c). ML predictions for the complex between short-cut carbon nanotube (CNT, 17 Å length) and human myeloperoxidase (hMPO, PDB ID 1CXP) pointed to two interaction sites that were nearly identical to those established by Kagan et al.43 using MD simulations. Specifically, the tyrosine residues (nos. 293 and 313) and arginine residues (nos. 307 and 294) correctly emerged (Fig. 5d,e) as specific AAs in hMPO closely interacting with CNTs.

Fig. 5: The prediction of complexes between protein and nanocarbons.

a, Carboxylated GQD and PSM-α assembly. b, Hydroxylated GQD and hIAPP assembly. c, Hydroxylated GQD and hIAPP fibril assembly. d,e, The carboxylated CNT and hMPO assembly (d) and a view rotated by 180° (e), showing the second binding sites. The blue and green highlights in a–e indicate the predicted interaction sites in NPs and proteins, respectively. Yellow and grey surfaces are noninteracting sites. f, Probability of protein binding sites on the different carbon macromolecules, spherical carbon crystal and three different SWNTs.


The contributions from specific groups and interactions that are stronger than others for particular pairs of NPs and proteins can also be found. These interactions are determined by both the chemical composition and the nanoscale geometry of the interacting species, which is captured well by the combination of GE and GT descriptors that account for attractive interactions while minimizing the frustration from molecular reconfiguration. Taking a specific example of the GQD and hIAPP complex, the edges and surface of GQD are hydroxylated, which results in local curvature that can provide a specific fit to the geometry of the protein, capable of forming hydrogen bonds with –OH groups. The GE + GT structural descriptors point to the sites on the GQD that form hydrogen bonds with hIAPP. A similar mechanism can also be traced in the interactions between CNT and hMPO.

While localization of potential interaction sites is essential for NP adjuvants, enzyme mimics, amyloid fibrillation inhibitors and antibiotic and antiviral agents15,19,39,40, the assessment of the relative propensity of several proteins to interact with NPs of different shapes is also important for nanoscale contrast agents and drug delivery agents. Thus, we also tested the ability of DNN to predict the relative protein abundance in the protein corona around SWNTs as studied by Pinals et al.44 and spherical carbon NPs as studied by Monopoli et al.45 and Visalakshan et al.46. We found that albumin displays a lower tendency to adsorb on SWNTs than apolipoprotein, histidine-rich glycoprotein or galectin-3-binding protein (Supplementary Figs. 24 and 25). Also, the spherical carbon NPs showed a lower probability of forming a nanoscale complex with all four proteins than three different types of SWNTs. Both findings match recent experimental results for similar nanocarbons established using multiple centrifugation cycles followed by mass spectroscopy, small-angle X-ray scattering and isothermal titration calorimetry (Fig. 5f)44,45,46.

These findings become particularly useful for the analysis of protein interactions with entirely or predominantly inorganic NPs made from gold49, ZnO16 and silica46 that can acquire a variety of shapes46. Thus, we tested ML with pyramidal ZnO NPs (3 nm in the base, 3 nm height) that are known to form a reversible one-on-one complex with β-galactosidase (Supplementary Fig. 27)16. The coarse-graining protocol for these NPs was based on the crystal surface atoms. Taking into account the mean size of AAs (3.5–4 Å) and the ionic bond length of ZnO (1.89 Å), two atoms were grouped to produce distance matrices and G(n, e) (Supplementary Section 3.3.5). As a result, we found that the interaction sites responsible for the formation of the complex between ZnO NPs and β-galactosidase are located at the apex and edge of the nanopyramids (Supplementary Fig. 28), which coincides perfectly with experimental data16.

Discussion

Despite the complexity of intermolecular interactions between nanoscale structures, GE + GT descriptors adequately predict the formation of complexes and interaction sites for proteins. The same descriptors can be applied directly to NPs. The fact that ML algorithms trained on protein–protein complexes accurately predict the structure of protein–NP complexes provides direct and incontrovertible evidence of the biomimetic nature of water-soluble inorganic NPs known to display a variety of biological functions16,19,43,44,45,46,48.

The chemistry of nanocarbons and other inorganic NPs can be very different from that of proteins. While the dynamics of their complexes with proteins tends to be challenging to model, the developed ML algorithms can streamline their molecular design for specific biomedical or biomanufacturing applications. The nanoscale species’ rigidity level can be described by the GNM parameters (Supplementary Fig. 26), which can be calculated rapidly and accurately. Analysis of GNM modes can be instrumental for (1) engineering of molecular rigidity across the spectrum of different nanoscale species and (2) predicting interaction sites at different temperatures. Both tasks can be accomplished by adding physics-based descriptions of thermal motion for various chemical bonds using Boltzmann distributions, which will provide complementary descriptors for biological and abiological nanoscale species.

From a fundamental perspective, these findings extend the boundaries of understanding of the structural requirements for forming lock-and-key interfaces between nanoscale entities and integrate the concepts of topology, Riemannian geometry and multifractality to establish commonalities between them. From a practical perspective, these findings offer a toolbox for the rapid design of abiological nanostructures with specific shapes and surface chemistries for biomedical and other applications.

While the traditional CH descriptors provide limited input to the accuracy of the prediction of protein–protein and protein–NP complexes, we expect that subsequent development of unified CH descriptors inclusive of nonadditivity and collective effects between proteins50,51 and NPs52 using, for instance, MD or density functional theory calculations performed locally will also improve the accuracy of such predictions of interaction sites and affinity constants.

Methods

The atomic coordinates of proteins were acquired from the RCSB protein data bank (https://www.rcsb.org/), and the coordinates for the NPs were modelled by using BIOVIA Materials Studio.

Training database formation

The curated database formed the input of the training dataset, while distance matrices formed the output (Supplementary Fig. 1). The final PPI training set comprised 464 uniquely interacting protein pairs (Supplementary Fig. 4) and 27,859,297 pairs of AA residues in total.

Computation of descriptors

The following descriptors were computed and embedded into each Ai and Bk to form characteristic feature matrices (Supplementary Section 2).

CH descriptors

The probability of AA residues in proteins interacting can be related to their electrostatic charge, hydrophobicity, molecular weight, polarity and atomic composition (Supplementary Figs. 8–10). These chemical parameters determine the repulsive/attractive forces between the macromolecules and are used in nearly all current PPI algorithms4,53. The atomic contribution of continuum electrostatic charges per residue is computed using the Chemistry at Harvard Macromolecular Mechanics (CHARMM) force field26. In addition to the previously used coarse description of residue charge as −1, 0 or 1, we used a more accurate representation of electrostatic interactions wherein the charge contribution per atom in the residue was considered. Hydrophobicity indices of the residues are assigned using the Kyte–Doolittle scale54. We note that CH descriptors do not account for nonadditivity and interdependence of electrostatic, hydrophobic or van der Waals interactions on the surface of biomolecules50,51.
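As an illustration of the simplest CH descriptor, the sketch below assigns a Kyte–Doolittle hydropathy index to each residue of a sequence. The numerical values are those of the published Kyte–Doolittle scale; the one-letter-code input, the function name and the example sequence are assumptions made for this example and do not reflect the authors' code.

```python
# Kyte-Doolittle hydropathy indices, keyed by one-letter residue code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(sequence):
    """Return one hydropathy index per residue; raises KeyError for
    symbols outside the 20 standard amino acids."""
    return [KYTE_DOOLITTLE[residue] for residue in sequence.upper()]

# Hypothetical five-residue fragment, for illustration only.
profile = hydropathy_profile("GAVLI")
```

Each value would form one entry of the per-residue CH descriptor vector, alongside charge, molecular weight, polarity and atomic composition.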

GE descriptors

The Rinacc descriptor measures the shallowness of the protein surface by calculating the minimum inaccessible radius of the circle at AA residue point Ai and Bk, in Cα coordinates. The Shell descriptor characterizes the depth of a specific AA in the folded protein chain, obtained by quantifying an accessible volume of a particular residue point, Ai and Bk. The Pocket descriptor enumerates the depth and size of a concavity on the surface of the protein globule. This value is inversely proportional to Shell but directly proportional to the pocket radius (Supplementary Section 2.2.1). OPD chirality indices28 were calculated per each Ai and Bk, considering different distances from AA residue points. Left/right-handed geometries correspond to negative/positive values of OPD. Being conscious of the computational problems emerging for chiral objects with high dimensionality, we restrict OPD calculation to a limited group of N-neighbour residues around a particular residue (Supplementary Section 2.2.2).

GT descriptors

To produce G(n, e), AA residues were connected with an edge when di,j was less than 7 Å (refs. 55,56). This cutoff value was chosen because it is larger than the average distance between the AA residues in the single protein (3.8 Å)57 and corresponds to the segments interacting by supramolecular interactions58,59 (Supplementary Section 2.3 and Supplementary Table 1). In GNM, the Gaussian modes are decomposed by eigenvalue and eigenvector from the Kirchhoff matrix calculated based on G(n, e) (refs. 31,32). To obtain the dominant modes, only the square sum of the first tenth of the modes is considered2 (Supplementary Section 2.3.3 and Supplementary Fig. 15). The ORC evaluates the transport characteristics of the network and, therefore, the stress transfer and reconfigurability in relation to the centre of the molecule, while FRC describes the same characteristics of the periphery of the graph60,61. Also, we defined the node-based scalar Ricci curvature as the sum of all edge curvature values on that node60 (Supplementary Section 2.3.2 and Supplementary Fig. 14). The MFD values are estimated by investigating the power-law behaviour between the partition function, associated with the qth powers of the node-based probability measure of covering the graph with boxes of a specific radius, and the box sizes employed to cover the graph62,63 (Supplementary Section 2.3.1 and Supplementary Fig. 13).
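The graph construction and the Kirchhoff matrix used by the GNM can be sketched in a few lines, assuming Cα (or coarse-grained) coordinates in ångströms as input. The function names and the toy chain are hypothetical; the eigen-decomposition and the curvature and multifractal calculations are not shown.

```python
import math

def contact_graph(coords, cutoff=7.0):
    """Adjacency matrix of G(n, e): nodes i and j share an edge when
    their distance is below the cutoff."""
    n = len(coords)
    adjacency = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < cutoff:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency

def kirchhoff(adjacency):
    """Kirchhoff (graph Laplacian) matrix: node degree on the diagonal,
    -1 for connected pairs and 0 otherwise."""
    n = len(adjacency)
    return [[sum(adjacency[i]) if i == j else -adjacency[i][j]
             for j in range(n)] for i in range(n)]

# Toy straight chain with 3.8 Å spacing (hypothetical coordinates).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
gamma = kirchhoff(contact_graph(coords))
```

The GNM modes would then follow from the eigenvalues and eigenvectors of this matrix, for example with a symmetric eigensolver such as `numpy.linalg.eigh`; that step is left out here to keep the sketch dependency-free.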

ML algorithms

The pairwise AA residue descriptor data are used to independently train ML algorithms for seven subsets of {CH, GE, GT} to compare the contributions to the prediction scores. The DNN model consists of three fully connected layers with 512 neurons, with the rectified linear unit (ReLU) function used for activation. After these layers, a dropout layer is added with a rate of 0.5 to prevent overfitting. Lastly, the SoftMax layer is implemented to compute the probability of each class and the model is trained by optimization of the categorical cross-entropy loss function. While training, the best model having the minimum loss and highest accuracy is saved by the callback function. The performance of each ML algorithm is evaluated using tenfold cross-validation. The logistic regression, Gaussian naïve Bayes, SVM and RF approaches are implemented using the scikit-learn Python package64, while the XGB approach uses the XGBoost Python package65.
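The seven training subsets correspond to all non-empty combinations of the three descriptor families. The sketch below enumerates them and builds one pairwise feature vector for a residue pair. Concatenating the descriptor vectors of Ai and Bk is an assumption made for illustration, because the text states only that pairwise descriptor sets form the feature matrix; the dictionary layout, function names and numerical values are likewise hypothetical.

```python
from itertools import combinations

FAMILIES = ("CH", "GE", "GT")

def descriptor_subsets(families=FAMILIES):
    """All non-empty combinations of descriptor families (seven for three)."""
    return [combo
            for size in range(1, len(families) + 1)
            for combo in combinations(families, size)]

def pair_features(residue_a, residue_b, subset):
    """Concatenate the selected descriptor families of residue A_i and
    residue B_k (ASSUMPTION: simple concatenation, A first, then B)."""
    features = []
    for residue in (residue_a, residue_b):
        for family in subset:
            features.extend(residue[family])
    return features

# Hypothetical per-residue descriptor values, for illustration only.
res_a = {"CH": [0.1, 1.8], "GE": [4.2], "GT": [0.3, 0.7]}
res_b = {"CH": [-0.5, -3.5], "GE": [2.9], "GT": [0.1, 0.9]}

subsets = descriptor_subsets()
x_ge_gt = pair_features(res_a, res_b, ("GE", "GT"))
```

Each subset yields a feature matrix with one such row per residue pair, on which any of the listed classifiers can be trained and cross-validated.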

When we test unknown protein–protein complexes with any ML algorithms, the probability of forming a close contact (<7 Å) is computed for every molecule’s AA residue pair. We take residue pairs in the top 0.1% by probability as interface residues (Supplementary Section 3 and Supplementary Fig. 19). An identical criterion was applied for protein–NP complexes when assessing DA (di,j) and to obtain DNP (di,j).
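The top-0.1% selection can be sketched as follows, assuming the model output is a matrix of contact probabilities with one entry per residue pair. The rounding rule (ceiling, with at least one pair retained) and the function name are assumptions, since the text specifies only the 0.1% threshold.

```python
import math

def top_fraction_pairs(probabilities, fraction=0.001):
    """Return the (i, k) index pairs with the highest predicted contact
    probability, keeping the given fraction of all pairs.

    ASSUMPTION: the number of retained pairs is rounded up and is never
    smaller than one.
    """
    ranked = sorted(
        ((p, i, k)
         for i, row in enumerate(probabilities)
         for k, p in enumerate(row)),
        reverse=True,
    )
    n_keep = max(1, math.ceil(len(ranked) * fraction))
    return [(i, k) for _, i, k in ranked[:n_keep]]

# Toy 2 x 3 probability matrix (hypothetical values); a large fraction is
# used here only because the matrix is tiny.
probabilities = [[0.10, 0.90, 0.20],
                 [0.80, 0.05, 0.70]]
pairs = top_fraction_pairs(probabilities, fraction=0.5)
interface_a = sorted({i for i, _ in pairs})
interface_b = sorted({k for _, k in pairs})
```

For a realistic complex with tens of thousands of residue pairs, the default fraction of 0.001 retains a few dozen pairs, whose residues define the predicted interface on each partner.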