Machine Learning and Network Analysis of Molecular Dynamics Trajectories Reveal Two Chains of Red/Ox-specific Residue Interactions in Human Protein Disulfide Isomerase

The human protein disulfide isomerase (hPDI), is an essential four-domain multifunctional enzyme. As a result of disulfide shuffling in its terminal domains, hPDI exists in two oxidation states with different conformational preferences which are important for substrate binding and functional activities. Here, we address the redox-dependent conformational dynamics of hPDI through molecular dynamics (MD) simulations. Collective domain motions are identified by the principal component analysis of MD trajectories and redox-dependent opening-closing structure variations are highlighted on projected free energy landscapes. Then, important structural features that exhibit considerable differences in dynamics of redox states are extracted by statistical machine learning methods. Mapping the structural variations to time series of residue interaction networks also provides a holistic representation of the dynamical redox differences. With emphasizing on persistent long-lasting interactions, an approach is proposed that compiled these time series networks to a single dynamic residue interaction network (DRIN). Differential comparison of DRIN in oxidized and reduced states reveals chains of residue interactions that represent potential allosteric paths between catalytic and ligand binding sites of hPDI.


SUPPLEMENTARY INFORMATION
1

Molecular Dynamic Simulations
X-ray crystal structures of ox-hPDI (PDB ID: 4EL1) and red-hPDI (PDB ID: 4EKZ) were used as initial conformations for MD simulations 1 with CHARMM 27 force field in the NAMD 2.10 package 2 . Short missing regions including residues 250 to 254, 321 to 323 and single residue 479 were recovered using Modeller 9.13 3 . Missing hydrogens were added by Reduce 3.23 4 . Each system was then solvated in a rectangular box of around 16491 TIP3P water molecules with at least 10 Å padding around protein and neutralized by addition of 23 Sodium ions. Both systems were minimized for 20000 steps of conjugate gradient method while harmonic restraints with a force constant of 10 kcal mol -1 Å -2 were applied to protein heavy atoms. Gradual heat up from 10 to 300 K were then performed via 500 ps NPT dynamics with a time step of 1 fs while harmonic restraints were gradually relaxed. Nosé-Hoover Langevin piston was used to keep pressure at 1 atm and Langevin thermostat was used to control temperature 5,6 . Periodic boundary conditions were applied and non-bonded interactions were truncated at 12 Å using a smooth switching started at 10 Å. Long-range electrostatic interactions were calculated using the particle mesh Ewald (PME) method with a grid spacing of 1 Å 7 . After heat up and removal of restraints, the equilibration phase was continued at 300 K up to 2 ns. In subsequent MD simulations, the time step was increased to 2 fs and all bonds with hydrogen partners were kept rigid using the SHAKE algorithm. MD trajectories were recorded every 4 ps. To accelerate the conformational sampling, both systems were simulated for 10 ns and recorded structures were clustered based on their backbone RMSD via the quality threshold (QT) algorithm implemented in the VMD 1.9 2 2 package 8,9 . Six representative structures from most populated clusters of each of the ox-hPDI and red-hPDI systems were then used as starting points of 55 ns NPT dynamics and all recorded snapshots excluding those of the first 3 ns were used for subsequent analysis. This consists of 156000 structures from a total of 600 ns NPT simulation.

Cross Correlation and Principal Component Analysis
Backbone RMSD for each domain and whole protein was calculated along each trajectory with respect to the first production frame. Secondary structure character of all residues were calculated over trajectories using the STRIDE program 10 . To identify groups of residues with correlated motions, cross correlation matrix of atomic displacements was calculated according to equation (1): where ! "# is the cross correlation or normalized covariance between displacements of atoms i and j with position vectors i r ! and j r ! , respectively 11 where (PC1, PC2) is the probability distribution function obtained from MD data and max r is its maximum value which is subtracted to put zero of free energy on the most probable 4 5 4 conformation. The kernel density estimation with a Gaussian kernel was used for construction of probability distribution functions. Projected FELs were also obtained for PC1-PC3 and PC2-PC3 subspaces.
To check the performance of adopted multi-trajectory approach in sampling of conformational space of hPDI, the inner products between PCA eigenvectors obtained from all data and those of different halves of the data were compared in Figure S10. Comparison of diagonal and off diagonal elements in panel b of Figure S10 shows that the directions of important collective motions would be the same if one considers only the half-length of all trajectories while this is not true for full-length of all trajectories. Accordingly, starting from multiple configurations is crucial for efficient sampling and for avoiding from trapping regions of landscape. ranges, we used a custom linear discriminator to find an optimum value for sep V from statistical distributions of V values in ox and red states which maximizes the accuracy of ox vs. red discrimination based on the value V of a given snapshot. We also used the SVM with radial-basis kernel to improve ox vs. red discrimination accuracy over the linear models. Although this method could not provide a single threshold sep V , it significantly improved the discrimination accuracy in several cases, particularly when the distribution of V was bimodal or multi-modal in some state. In all cases the classifier was trained with 50% random selection of all available frames (in both ox and red states) and tested with the rest of the frames. (Code available upon request)

Dynamic Residue Interaction Network (DRIN)
The standalone version of RING 2.0 program was used for calculation of residue interaction network (RIN). Five types of non-covalent interactions were considered including hydrogen bond, van der Waals, salt bridge, π-π and cation-π interactions. The RING was used in a mode that reports multiple interactions per residue pair but only one interaction per interaction type. Accordingly, each of the 156000 structures was converted to a graph with residues as its vertices and the pairwise interactions of the residues as its edges. There could 7 6 be different edges between a pair of residues due to different types of non-covalent interactions.
Let ! be the number of residues and ) (t A ij be the !×! adjacency matrix of a RIN graph for each type of non-covalent interaction. The ! "# (%) elements are equal to zero or one for the absence or presence of considered type of interaction between residues i and j in time ! of some MD trajectory. From time series of ! "# (%) , the maximum interaction life time, Γ "# , was then extracted to build the matrix Γ which will be denoted here as the DRIN matrix for considered type of non-covalent interaction. Here we define Γ "# as the maximum length of a continuous period of time in which a persistent interaction type is observed in the RIN graphs between the residues ! and ! : By this definition, for each type of interaction, 156000 RIN graphs were compiled to two single DRIN graphs for each of the ox-and red-hPDI systems. The simpler choices, such as summing ! "# $ for each pair of residues during the whole period of time, results in noisy and transient on-and-off interactions often make a big total which could often dominate the effect of persistent interactions. Accordingly, a DRIN matrix can be considered as a complete !×! weighted graph with edge weights equal to Γ #$ . The DRIN matrices Γ "# $% and Γ "# $%& were calculated for each of the ox-and red-hPDI systems, respectively, and for all five types of non-covalent interactions considered. Values of Γ #$ elements were averaged over six separate trajectories of each system.
To highlight differences between oxidized and reduced states, we computed the differential DRIN graph Δ with the same nodes and edges as DRIN graph, but adjusted weights that represent the fold change of ij G values between ox-and red-hPDI systems. More specifically, the elements of the differential DRIN matrix D were computed for each type of interaction as: In the above equation, e is a small positive value with negligible effect on the outcome, which is considered to prevent the division by zero. In our study, we set ! equal to the time between two consecutive MD snapshots that means we have elongated the maximum duration of each interaction by 1 more frame. This is negligible with respect to the values of ij G for persistent interactions that were in the order of hundreds or thousands of frames.
We used the differential DRIN matrix in a number of ways. By considering a cutoff on the absolute values of ij D we could identify the pairs of residues that had a significant alternated pattern of interactions between ox and red states. We could also consider Δ as the weights of a graph, where positively and negatively weighted edges depict the interactions that are more persistent in ox and red states, respectively. We visualized such a network using the Cytoscape version 3.3.0, by assigning the colors and the thickness of the edges according to the interaction weight in differential DRIN matrix. For visualization and analysis purposes differential DRIN graphs obtained for different types of interactions were merged together in a single graph with different edge styles. Figure S1 backbone RMSD values with respect to the initial structure for single domains and whole protein in oxidized (a and b, respectively) and reduced (c and d, respectively) hPDI conformations.

Figure S2
The cumulative contribution of principal components in total variance.       Table S1 Structural features of representative conformers from FEL. R: inter domain distance; θ: angle; Φ: Dihedral torsion.