Artificial intelligence guided conformational mining of intrinsically disordered proteins

Artificial intelligence recently achieved the breakthrough of predicting the three-dimensional structures of proteins. The next frontier is presented by intrinsically disordered proteins (IDPs), which, representing 30% to 50% of proteomes, readily access a vast conformational space. Molecular dynamics (MD) simulations are promising in sampling IDP conformations, but only at extremely high computational cost. Here, we developed generative autoencoders that learn from short MD simulations and generate full conformational ensembles. An encoder represents IDP conformations as vectors in a reduced-dimensional latent space. The mean vector and covariance matrix of the training dataset are calculated to define a multivariate Gaussian distribution, from which vectors are sampled and fed to a decoder to generate new conformations. The ensembles of generated conformations cover those sampled by long MD simulations and are validated by small-angle X-ray scattering profiles and NMR chemical shifts. This work illustrates the vast potential of artificial intelligence in conformational mining of IDPs.
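As a minimal sketch of the generation scheme summarized above (an illustration, not the code released with this work; the encoder and decoder are assumed to be pre-trained Keras-style models and the input coordinates to be already shifted and scaled), the following Python snippet encodes the training conformations, fits a single multivariate Gaussian in the latent space, samples new latent vectors, and decodes them into conformations:

import numpy as np

def generate_conformations(encoder, decoder, training_coords, n_generate):
    # encoder/decoder: pre-trained models with a Keras-like predict() method
    # training_coords: array of shape (n_frames, n_features) holding the
    # processed Cartesian coordinates used for training
    latent = encoder.predict(training_coords)        # latent-space vectors
    mean = latent.mean(axis=0)                       # mean vector
    cov = np.cov(latent, rowvar=False)               # covariance matrix
    rng = np.random.default_rng()
    new_latent = rng.multivariate_normal(mean, cov, size=n_generate)
    return decoder.predict(new_latent)               # generated conformations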


Supplementary Note 1
Instead of Cartesian coordinates, rotationally invariant properties such as dihedral angles or distance matrices could be used as the input to autoencoders. We tested such models. Training on 30% of the MD run1 data for Aβ40, with backbone dihedral angles (three per residue) as input and a latent-space dimension of 200, we obtained a reconstruction RMSD (Cα only) of 11.78 Å (calculated on the 100-fold diluted test set of 980 conformations). In comparison, the reconstruction Cα RMSD using Cartesian coordinates of the same training set and the same latent-space dimension was only 3.28 Å. The reconstruction results (RMSDs calculated on heavy atoms) for the latter Cartesian model are reported in […].

Autoencoders using a distance matrix as input had a similar problem. For example, using Cα-Cα distances of residue i to residues i + 1, i + 2, i + 3, and i + 4 as input, the reconstruction Cα RMSD for Aβ40 was 11.13 Å. An i to i + 3 Cα-Cα distance corresponds to a virtual dihedral angle defined by the Cα atoms of residues i, i + 1, i + 2, and i + 3. Therefore, the high reconstruction RMSD of the distance-matrix model can in particular be attributed to errors in the i to i + 3 Cα-Cα distances. The reconstruction RMSE of the distance-matrix model was 1.1 Å for the i to i + 3 Cα-Cα distance, and the corresponding mean largest error was 3.0 Å.
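For reference, the distance-matrix input described above can be built from Cα coordinates as in the following sketch (our illustration, not the original analysis code; the function name and array layout are assumptions):

import numpy as np

def ca_distance_features(ca_coords):
    # ca_coords: array of shape (Nres, 3) with the Ca coordinates of one frame.
    # Returns the Ca-Ca distances from residue i to i + 1, ..., i + 4,
    # i.e. 4*Nres - 10 features per conformation (150 for Abeta40).
    feats = []
    for offset in (1, 2, 3, 4):
        diffs = ca_coords[offset:] - ca_coords[:-offset]
        feats.append(np.linalg.norm(diffs, axis=1))  # Nres - offset distances
    return np.concatenate(feats)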
Here are some details of the models and data analysis when dihedral angles or distance matrices were used as input. The autoencoder architecture was the same as described for the case where Cartesian coordinates were the input, with a dimension of 200 for the latent space. Again, the loss function was the binary cross-entropy. For dihedral angles, the number of input and output neurons was 3Nres - 3 (117 for Aβ40); for distance matrices, the number of input and output neurons was 4Nres - 10 (150 for Aβ40). The conversion from Cartesian coordinates (backbone heavy atoms only) to dihedral angles was done using the BAT (bond-angle-torsion) module [1] in the MDAnalysis package [2]. The back conversion was also done using the BAT module, with each bond length or bond angle fixed at its average value over the training set. For distance matrices, the back conversion to Cartesian coordinates (Cα atoms only) was done in a Fortran code, with the Cα Cartesian coordinates of residues i, i + 1, i + 2, and i + 3 calculated from the i to i + 1, i + 1 to i + 2, i + 2 to i + 3, i to i + 2, i + 1 to i + 3, and i to i + 3 distances. This procedure generated two possible positions for the i + 3 Cα (mirror images of each other); the i - 1 to i + 3 distance was used to select between them. That is, we selected the i + 3 Cα position whose i - 1 to i + 3 distance was closer to the supplied i - 1 to i + 3 distance.

Figure S4 and Table S1 show that the training data in the latent space from MD run1 of ChiZ are not represented well by a single multivariate Gaussian. We explored the idea of representing the training data by a mixture of multivariate Gaussians. To that end, we first clustered the latent-space positions of the training set, using k-means clustering with Euclidean distances between latent-space positions. The number of clusters was set to 2, 4, or 8. We then used the mean vector and covariance matrix calculated on the points within each cluster to define a cluster-specific multivariate Gaussian. Lastly, each cluster-specific Gaussian was assigned a weight proportional to the number of points in that cluster, and new points were sampled from all the cluster-specific Gaussians, with a particular Gaussian selected each time based on its assigned weight. The new points were fed to the decoder to produce conformations in Cartesian coordinates.
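The mixture-of-Gaussians sampling described above can be sketched as follows (an illustration assuming scikit-learn for k-means; here latent holds the latent-space positions of the training set and decoder is the trained decoder):

import numpy as np
from sklearn.cluster import KMeans

def sample_from_gaussian_mixture(latent, decoder, n_clusters=4, n_generate=10000):
    rng = np.random.default_rng()
    # cluster the latent-space positions with k-means (Euclidean distances)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(latent)
    # weight each cluster-specific Gaussian by its cluster population
    weights = np.bincount(labels, minlength=n_clusters) / len(labels)
    counts = rng.multinomial(n_generate, weights)    # samples drawn per Gaussian
    samples = []
    for k in range(n_clusters):
        members = latent[labels == k]
        mean = members.mean(axis=0)                  # cluster mean vector
        cov = np.cov(members, rowvar=False)          # cluster covariance matrix
        samples.append(rng.multivariate_normal(mean, cov, size=counts[k]))
    # decode the sampled latent vectors into Cartesian-coordinate conformations
    return decoder.predict(np.vstack(samples))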

Supplementary Note 2
We compared a generated set of conformations (at size 1×) against the 10-fold diluted training set and test set. The best-match RMSDs are listed below (a sketch of how this metric can be computed follows the table). Without clustering (i.e., with a single Gaussian), the best-match RMSD with the training set was 6.06 Å; this RMSD decreased as the number of Gaussians increased, reaching 4.65 Å with 8 Gaussians. Thus a mixture of multivariate Gaussians indeed improved the representation of the training data in the latent space. However, the degree of match with the test set went in the opposite direction: the best-match RMSD increased steadily from 7.95 Å for a single Gaussian to 8.50 Å with 8 Gaussians. The mixture of multivariate Gaussians was apparently overfitting the training data, thereby compromising its ability to capture generic features shared by the test data.

# of Gaussians    Best-match RMSD with training set (Å)    Best-match RMSD with test set (Å)
1                 6.06                                     7.95
2
4
8                 4.65                                     8.50
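The best-match RMSD quoted above can be computed as in the following sketch (our illustration using MDAnalysis; the metric is taken here to be, for each reference conformation, the minimum Cα RMSD against the generated set, averaged over the reference set):

import numpy as np
from MDAnalysis.analysis.rms import rmsd

def best_match_rmsd(reference_coords, generated_coords):
    # Both arguments: sequences of (n_atoms, 3) Ca coordinate arrays.
    best = []
    for ref in reference_coords:
        # superposition=True optimally superimposes each pair before the RMSD
        best.append(min(rmsd(gen, ref, center=True, superposition=True)
                        for gen in generated_coords))
    return np.mean(best)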

Supplementary Note 3
We selected 0.75Nres as the latent-space dimension (= 13 for Q15, 30 for Aβ40, and 48 for ChiZ), but also assessed the effect of the latent-space dimension on the prediction accuracy. The results are listed below.
In short, while reducing the latent-space dimension from the selected values by 10 resulted in a deterioration in accuracy, increasing the dimension by 10, 20, or 30 had little effect on the prediction accuracy. For Q15, a very large value, 200, for the latent-space dimension actually led to a slight increase in best-match RMSD (see also Fig. S6).
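The dependence on the latent-space dimension can be explored with a dense autoencoder of the kind sketched below (an illustration only; the notes specify the input size, the latent-space dimension, and the binary cross-entropy loss, whereas the hidden-layer width, activations, and optimizer shown here are placeholder assumptions):

from tensorflow import keras
from tensorflow.keras import layers

def build_autoencoder(n_features, latent_dim, hidden=512):
    # hidden = 512 is a placeholder width, not a value taken from this work
    inputs = keras.Input(shape=(n_features,))
    x = layers.Dense(hidden, activation="relu")(inputs)
    latent = layers.Dense(latent_dim, activation="relu")(x)      # latent-space layer
    x = layers.Dense(hidden, activation="relu")(latent)
    outputs = layers.Dense(n_features, activation="sigmoid")(x)  # inputs scaled to [0, 1]
    autoencoder = keras.Model(inputs, outputs)
    autoencoder.compile(optimizer="adam", loss="binary_crossentropy")
    encoder = keras.Model(inputs, latent)
    return autoencoder, encoder

For example, latent_dim would be 30 (0.75Nres) for Aβ40, or 200 for the Q15 test reported in Fig. S6.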

Supplementary Note 4
The main results that we report for ChiZ were based on MD simulations using the AMBER14SB/TIP4PD force-field combination. To further demonstrate the robustness of conformational generation by autoencoders, we applied this approach to ChiZ conformations sampled from MD simulations using four other protein/water force-field combinations. For AMBER03ws/TIP4P2005, the amount of simulation was the same as for AMBER14SB/TIP4PD, i.e., 12 trajectories of 3 µs each. The autoencoder results for AMBER03ws/TIP4P2005 were very similar to those for AMBER14SB/TIP4PD. The reconstruction RMSDs for AMBER03ws/TIP4P2005 at a 30% training size were 7.1 ± 1.6 Å (mean ± standard deviation among 12 MD runs), comparable to the counterparts, 6.4 ± 1.0 Å, for AMBER14SB/TIP4PD. For generating new conformations, the best-match RMSDs at size 1× were 7.9 ± 1.3 Å for AMBER03ws/TIP4P2005; the corresponding result for AMBER14SB/TIP4PD run1 was 7.95 Å, right in the middle of the range of best-match RMSDs obtained from the 12 AMBER03ws/TIP4P2005 runs.

For each of the other three force-field combinations, we ran four trajectories of 0.5 µs each. Given the shorter trajectories, we used 70% of the conformations (25,000 total) for training and 30% for testing.
Again, these ranges of RMSDs for conformational reconstruction and generation are comparable to those obtained using AMBER14SB/TIP4PD, thereby demonstrating the robustness of the autoencoder approach.

Supplementary Note 5
We tested autoencoders where the loss function was the mean square error (MSE) instead of the binary cross-entropy (BCE). The results, listed below, showed no significant differences between the two loss functions. Likewise, the particular conformation used for structural alignment before shifting and scaling the Cartesian coordinates for autoencoder input had no effect on the accuracy of generated conformations (a sketch of this preprocessing is given after the table footnotes).

a The data for training and testing were from MD run1 of each protein.
b The frame number used for structural alignment before shifting and scaling the coordinates.
c The loss function was binary cross-entropy (BCE).
d The loss function was mean square error (MSE).
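The alignment, shifting, and scaling mentioned above can be sketched as follows (an assumed preprocessing; the exact shift and scale used in this work are not specified in this note, so a min-max scaling into [0, 1], as required by the BCE loss, is shown, and the atom selection is left as a parameter):

import numpy as np
import MDAnalysis as mda
from MDAnalysis.analysis import align

def prepare_input(topology, trajectory, ref_frame=0, selection="name CA"):
    u = mda.Universe(topology, trajectory)
    ref = mda.Universe(topology, trajectory)
    ref.trajectory[ref_frame]                        # conformation used for structural alignment
    align.AlignTraj(u, ref, select=selection, in_memory=True).run()
    atoms = u.select_atoms(selection)
    coords = np.array([atoms.positions.copy() for ts in u.trajectory])
    flat = coords.reshape(len(coords), -1)           # (n_frames, n_features)
    lo, hi = flat.min(), flat.max()
    return (flat - lo) / (hi - lo)                   # shifted and scaled to [0, 1]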
Table S1. Kullback-Leibler divergence between histograms in two-dimensional spaces of the latent space. The combined training set is diluted 10-fold before use, whereas the combined test set is diluted 1,000-fold.

Figure S2. Average reconstruction RMSDs at different sizes of the training sets sampled from replicate MD runs. (a) Q15 at 5%, 10%, and 20% training sizes from two runs. (b) Aβ40 at 10%, 20%, and 30% training sizes from four runs. (c) ChiZ at 10%, 20%, and 30% training sizes from 12 runs.

Figure S3. Histograms of Q15 in the latent space, calculated from training data, test data, and the multivariate Gaussian. Histograms for pairs of encoder nonzero outputs from run1 are shown as heat maps, with yellow representing pixels with the highest counts and dark blue representing pixels with zero count.

Figure S4. Histograms of ChiZ in the latent space, calculated from training data, test data, and the multivariate Gaussian. Histograms for pairs of encoder nonzero outputs from run1 are shown as heat maps, with yellow representing pixels with the highest counts and dark blue representing pixels with zero count.

Figure S5. Results for autoencoder-generated conformations of ChiZ. The average best-match RMSDs of the 100-fold diluted test set of run1 are calculated against generated sets at different sizes. (a) Generated sets from run1, at sizes measured in multiples of the test size (= 101,500). (b) Generated sets pooled from run1 to run i, where i goes from 1 to 12. From each MD run, the generated set is at size 1×.

Figure S6. Results for autoencoder-generated conformations of Q15, with a 200-dimensional latent space. The average best-match RMSDs of the 100-fold diluted test set are calculated against generated sets at different sizes. The latter sizes are measured in multiples of the test size (= 85,500). Run1 results are shown at sizes of the generated set ranging from the training size (9,500, or 0.11×) to 4×. For run2, the result is shown at 1×.