Drug binding dynamics of the dimeric SARS-CoV-2 main protease, determined by molecular dynamics simulation

We performed molecular dynamics simulation of the dimeric SARS-CoV-2 (severe acute respiratory syndrome corona virus 2) main protease (Mpro) to examine the binding dynamics of small molecular ligands. Seven HIV inhibitors, darunavir, indinavir, lopinavir, nelfinavir, ritonavir, saquinavir, and tipranavir, were used as the potential lead drugs to investigate access to the drug binding sites in Mpro. The frequently accessed sites on Mpro were classified based on contacts between the ligands and the protein, and the differences in site distributions of the encounter complex were observed among the ligands. All seven ligands showed binding to the active site at least twice in 28 simulations of 200 ns each. We further investigated the variations in the complex structure of the active site with the ligands, using microsecond order simulations. Results revealed a wide variation in the shapes of the binding sites and binding poses of the ligands. Additionally, the C-terminal region of the other chain often interacted with the ligands and the active site. Collectively, these findings indicate the importance of dynamic sampling of protein–ligand complexes and suggest the possibilities of further drug optimisations.


Supplementary information A. Seven HIV inhibitors used in this study.
Supplementary Figure A1. Seven HIV inhibitors used in this study. Numerals indicate molecular weight.

Supplementary information B. Initial locations of ligands.
[TOP] Supplementary Figure B1. Schematics for initial locations of ligands. We employed 4 variations of unbounded locations, A (blue), B (red), C (grey), and D (purple), arbitrary chosen at least 1.5 nm apart from the active sites of M pro . For each initial location, seven variations of randomised initial velocities were employed for the production runs. In total 28 variations of production runs were performed for each ligand.
Supplementary information C. Clustering analysis of the contact data matrix. [TOP] Hierarchical clustering analysis of the contact data matrix: Minimum distances between amino acid residues of M pro dimer and each drug were computed by gmx pairdist command implemented in GROMACS [1]. The contact strength between a residue and the drug was defined as one if the minimum distance of the either residue in the dimer is less than the threshold length d = 0.35 nm and zero otherwise. Here, the distances were measured among all the atoms of the residue and those of the drugs, including hydrogen atoms. Since M pro is a homodimer, we did not discriminate the contact from the either monomer. Then, the contact vector size m becomes 306, the number of amino acid residues of the M pro monomer. To find the common contact patterns among seven drugs, we concatenated overall data points of seven drugs. The total data points n is (7 drugs) × (4 initial drug positions) × (7 initial randomised velocities) × (200 ns / 200 ps) = 196,000. The hierarchical clustering analysis of the contact data matrix with Ward's method [2] was performed with fastcluster package [3] for Python, which only requires the memory usage on the order of n × m. The dendrogram of the clustering analysis is shown in Fig. C1, where cluster identifiers (clsid) at level 9 (at which just top 9 clusters are classified) are shown on the green line.
Supplementary Figure C1. Dendrogram for hierarchical clustering analysis. Top 9 clusters are shown with their cluster identifiers (clsid).

K-means clustering analysis based on the hierarchical clustering result:
Based on the result of the hierarchical clustering analysis described above, we have further refined the classification of the binding sites by using k-means clustering analysis with Hartigan-Wong algorithm [4] implemented in R [5]. As seen in Supplementary information D, the hierarchical clustering analysis at level=9 gives sufficiently robust classification of the binding sites. Based on this observation, the contact data matrix was reanalysed by using k-means clustering analysis of k=9 with the initial cluster centres defined by the hierarchical clustering analysis at level=9. As to the robustness of the k-means clustering result, we have confirmed that the same classification results obtained by using the initial set of centres defined by the hierarchical clustering analysis for more sparsely sampled (every 2ns) total 19600 data. Comparing the hierarchical clustering analysis and the k-means clustering analysis, the classification unchanged for 90% of data points as summarised in Table C1. We have visually inspected major difference between the result obtained by the hierarchical clustering and that by kmeans clustering and observed that k-means clustering gives reasonable classification (especially trajectories of indinavir A3, D6, D7). Table C2 shows the list of the residues belong to each cluster, where residues of mean contact ratio greater than 0.2 are shown with bold. See below for the definition of the mean contact ratio.

Identify cluster as a nearest cluster for a given contact vector:
Above procedures (k-means clustering after Ward hierarchical clustering) give us clsid for each data point (snapshot). This clsid can be also obtained as a cluster id of minimum distance, because k-means clustering analysis is utilised for the final clustering.
For a given snapshot, contact vector ci (i=1,⋯,306) is calculated as described in the above. Then for each contact vector, the distance Ds to the cluster of clsid=s is defined as where Gi(s) is cluster centre of the cluster of clsid=s. The cluster centre is the mean contact ratio which is defined as the averaged contact data over all member of the cluster of clsid=s, where j is an index of contact vector data, Ms is the number of data points classified as the cluster of clsid=s ( = ∑ 1 ∈Ω ), and Ω is the group of j (index of data points) classified as the cluster clsid=s. The cluster id clsid to which a given data point belongs can be determined by finding the cluster giving minimum distance among distances to the 9 clusters. Table C2 shows the list of cluster centres, where residues which have mean contact ratio less than 0.1 (less than 0.05 for clsid=9) were omitted from the list, but this list of centres is sufficient to reproduce original classification for 99.8% of present data points. The standard errors were estimated by Jackknife method [6]

Supplementary information D. Robustness of the clustering analysis.
[TOP] In order to see the results of the hierarchical clustering are sufficiently robust and consistent, the clusters detected at level 9 with four different conditions are shown in the Figs. D1-2, where, for each cluster, residues having mean contact ratio above 0.8(red), 0.6(orange), 0.4(yellow), 0.2(white) are drawn with the surface plot view, except the clusters shown at the bottom row in Fig. D2 which show residues with ratio above 0.08 (red) (because maximum values of the mean contact ratio were below 0.2, i.e., these clusters correspond to "not yet classified" contacts). Here, the mean contact ratio for each cluster was defined as the averaged contact vector data over the data points classified to the cluster (see also Supplementary information C). We examined two variations for the threshold lengths d=0.35nm and d=0.3nm to judge contact between a ligand and a residue.
We also examined two cases of the trajectory lengths: in one case, whole frames  in each trajectory were used, and in another case, only later-half frames (501-1000) in each trajectory used for the clustering analysis (total 98,000 frames were analysed in the latter case). The figures clearly show that these four clustering analyses gave similar results, although the cluster identifier and the classified order were somewhat different.
In this paper, we adopted the classification with d=0.35nm using whole frames (schematic  Table E1). Figure E8 shows the time series of binding free energies averaged over 28 trajectories. The standard deviations were only depicted for ritonavir, but they are at similar levels for all ligands. The deviations were large because they were averaged over different states. Here, Δ bind gas indicates the gas-phase interaction energy between the receptor and the ligand, which is the difference between the gas-phase potential energy of the receptorligand system and the sum of the gas-phase potential energies for the receptor only system and for the ligand only system. The term Δ bind solv is the solvation energy difference upon protein-ligand association and is computed as the sum of polar and nonpolar terms (Δ bind solv,polar + Δ bind solv,nonpolar ). The term Δ bind solv,polar can be calculated numerically by solving the generalized Born model developed by Onufriev et al [8]. The term Δ bind solv,nonpolar can be calculated using the following equation.
Δ bind solv,nonpolar = + Here, SASA represents the solvent-accessible surface area, calculated using the linear combinations of pairwise overlaps method [9], and the values for the γ and β were set to 0.005 kcal/mol·Å 2 and 0 kcal/mol. In this calculation, the entropic term Δ bind gas was not taken into consideration.
Supplementary   Fig. F1 and Fig. 3 in the main text. In order to estimate the standard error of X, let introduce virtual sample removing j-th trajectory ( = 1, ⋯ , ) as Then the standard error of X is estimated [6] as

Supplementary information G. Negative control simulation infrequent binding to the active site. [TOP]
In the binding simulation of 7 HIV drugs shown in the main text, 7 drugs bound to the active site at least twice in 28 simulation runs. One may wonder how different frequency of binding can be observed for any other drugs. We here report additional simulation using lamivudine triphosphate (Fig. G1(a)), which showed infrequent access to the active site in contrast to frequent binding of HIV drugs to the active site. We think this observation can be a negative control simulation and also suggests importance of the binding simulation reported in the main text.
In the same manner with HIV simulations, we prepared four initial locations of the ligand (Fig. G1(b)) and 14-fold simulation runs with differently randomized initial velocity (total 56 simulation runs) were performed. Here twice runs were performed for better statistics. Figure G2 shows contact map for lamivudine triphosphate binding simulation, shown in similar way of Fig. 2 in the main text. Figure G3 shows the time course of occupation ratio in the active site, which clearly shows infrequent access to the active site. Figure G4

Supplementary information H. MM-GB/SA free energy for 1 µs trajectories. [TOP]
To observe the dynamics of ligands in the active site, 23 simulations have been extended for 1 µs. Table H1 summarises  continuous data points. Figure H1 shows the time courses of MM-GB/SA binding free energies with the same averaging. The lines in faint colours indicate shallow binding states. Figure H2 shows the histograms of MM-GB/SA binding free energies for the trajectories of 200-1,000 ns. The filled boxes and solid lines correspond to the histograms for deep binding states and all states, respectively. The deep binding states had larger binding free energy differences; however, we could not separate deep and shallow bindings clearly because of unimodal distributions. Figure H3

Supplementary information J. Principal component analysis for characterising the active site conformations.
[TOP] To investigate the conformational variability of the active site upon ligand binding,  Supplementary Table C2 clsid=2 (shown with bold for mean contact ratio greater than 0.2). PCA was calculated using CPPTRAJ [10] module of AmberTools 18.
From the result of PCA, the contribution ratios of first two principal components (PC1 and PC2) were 36.1% and 12.9%, respectively. The first two eigenvectors were shown in Fig. J2. In the projection of the first two principal components, the representative conformations of active site in MD trajectory of apo-M pro system were shown in Fig. J3.
In order to confirm the robustness of the above PCA result, we performed additional PCA by employing the ligand-unbound active site of the holo-M pro system in addition to the ligand-bound active site of the holo-M pro system and the both active sites of the apo-

Supplementary information K. Representative binding poses for MD simulations of respective ligand-bound M pro systems.
[TOP] Here we investigate how each ligand interacted with the active site residues in the MD simulations. In order to detect the effective interaction between the ligand and each residue, the interaction fingerprint was analysed by using the protein-ligand interaction fingerprint (PLIF) tool of Molecular Operating Environment (MOE) [11]. The fingerprint is defined as a set of bits that identify the existence of the ionic or hydrogen bond between the residue atom and the ligand atom with the threshold energy value of 0.5 kcal/mol. The fingerprints were assigned for all snapshots in 20 trajectories of the 1 μs MD simulations.
For each ligand, the appearance rates of the bits were given by averages of the fingerprints for the snapshots taken every 2 ns over 200-1000 ns. In order to extract key residues in the protein-ligand interaction, a maximal value of the appearance rates was selected among all bits related for each residue. We here call the value of the maximum appearance rate as the representative appearance rate of the interaction fingerprint (RAIF) that are summarised in Table K1 (see also Fig. K8). The value is expected to reflect an interaction strength between each residue to a specific ligand. Further, in order to pick the representative binding poses, we have performed clustering analysis on the fingerprints using the Jarvis-Patrick clustering algorithm [12] with Tanimoto Table K1 for explanations.

Chain B C-terminal region
Active site region (1) Active site region (2) Active site region (3) Active site region (4) Active site region (5) Chain B Supplementary

Supplementary information L. Molecular docking using the X-ray crystal and MD simulation structures.
[TOP] To figure the feature of the active site conformations sampled by the MD simulations, we performed the conventional molecular docking using the four protein structures (one derived from X-ray crystallographic and three from representative MD simulations shown in Figure 4) and the drug-like compound library. All molecular dockings were performed using Glide module [13]- [15], implemented in the Schrödinger Release 2020-1 [16]. The conditions for molecular docking were as follows. For generating receptor grids for the docking, a side length of cubic grids was set to 10 Å and no constraints were applied. The grid center was set at an arbitrary point near two catalytic residues, His41 and Cys145.
To soften the potential of nonpolar parts of ligands, the scaling factor for the ligand van der Waals radii was set in 0.80. The OPLS3 force field was used [17]. All compounds were docked into the active site of M pro using Glide standard precision (SP) mode. For the compound library, about 14,000 drug-like compounds contained in Maybridge HitCreator V2 were used [18]. The compound molecules were prepared using LigPrep module, implemented in the Schrödinger Release 2020-1 [19]. Hydrogen atoms were added and different protonation states and ionization states for each ligand were generated for a pH of 7. All possible stereoisomers and tautomeric states were also generated. The OPLS3 force field [17] was used for energy minimization to generate low energy three dimensional conformers of the ligands.

Supplementary information Z. Computation of Coulomb forces for MDGRAPE-4A.
[TOP] For MD simulations by MDGRAPE-4A, Coulomb forces were computed by tensorstructured multilevel Ewald summation method (TME) (Y. M. Koyama et al., in preparation). TME can be considered as a combination of the smooth particle mesh Ewald method (SPME) [20] and the B-spline multilevel summation method (B-spline MSM) [25] with approximations by Gaussian functions. In the current study, the Coulomb potential with the scaling factor s = 2 and the Ewald splitting parameter α = 2.116203 nm -1 (corresponding to ewald-rtol = 10 -4 in GROMACS [1]). Since the short-range part S ( ) is identical to one in SPME, we computed them by the direct summation with cutoff distance 1.3 nm. The middle and long-range part is evaluated by the combination of long-range part of SPME and B-spline MSM with 32 grids for each axis and 6 order B-spline interpolation. To approximate the middle-range part M ( ) with Gaussian functions, we used 4-points Gauss-Legendre quadrature. The approximation by Gaussian functions enables the computation of the 3D convolution to convert grid charges into the grid potentials by separable convolutions [26].
The long-range part can be computed similarly to the one of SPME with the long-range potential L ( ), which uses the fast Fourier transformation for the computation of grid potentials from grid charges. Due to the scaling factor s = 2, the long-range part was evaluated with 16 grids for each axis. Conversions from 32 × 32 × 32 to 16 × 16 × 16 grid charges (called restriction) and from 16 × 16 × 16 to 32 × 32 × 32 grid potentials (called prolongation) can also be computed by separable convolutions [25].