Unraveling low-resolution structural data of large biomolecules by constructing atomic models with experiment-targeted parallel cascade selection simulations

Various low-resolution experimental techniques have gained more and more popularity in obtaining structural information of large biomolecules. In order to interpret the low-resolution structural data properly, one may need to construct an atomic model of the biomolecule by fitting the data using computer simulations. Here we develop, to our knowledge, a new computational tool for such integrative modeling by taking the advantage of an efficient sampling technique called parallel cascade selection (PaCS) simulation. For given low-resolution structural data, this PaCS-Fit method converts it into a scoring function. After an initial simulation starting from a known structure of the biomolecule, the scoring function is used to pick conformations for next cycle of multiple independent simulations. By this iterative screening-after-sampling strategy, the biomolecule may be driven towards a conformation that fits well with the low-resolution data. Our method has been validated using three proteins with small-angle X-ray scattering data and two proteins with electron microscopy data. In all benchmark tests, high-quality atomic models, with generally 1–3 Å from the target structures, are obtained. Since our tool does not need to add any biasing potential in the simulations to deform the structure, any type of low-resolution data can be implemented conveniently.


Criteria to pick the final structural model
Several possible criteria, including the scoring function, RMSD ini and R g , are investigated. In the close-to-open fitting of AKeco targeted by the SAXS data, we plotted all these values at each cycle, along with the RMSD tar values (Fig. S1). If we select the final structural model according to the saturation of the scoring function χ (Fig. S1a, circles), the conformation at the 23 rd cycle would be picked that has a RMSD tar of 2.5 Å. However, the saturation point of RMSD ini is at the 33 rd cycle (Fig. S1b, down-triangles), and the corresponding conformation has a RMSD tar of 3.5 Å. After the 24 th cycle, R g becomes saturated at about 21.2 Å (Fig. S1c, squares) that is consistent to the value (21.2±0.1 Å) estimated from the target SAXS data by Guinier analysis, and the conformation at this point has a RMSD tar of 2.4 Å. It seems that either χ or R g could serve as a criterion here to pick the final structural model. Considering that R g might not work well when integrating other types of low-resolution data in PaCS-Fit, we finally use the saturation of the scoring function as the criterion since it looks like the most straightforward way to pick the final model.

Estimation of model precision
In a real application of PaCS-Fit, the target structure of the protein is unknown, so one cannot be certain of the accuracy of the structural model. However, model precision and accuracy may be estimated based on variability of multiple built models 1 . We have addressed this issue on HEWL and the triple-BRCT-domain of ECT2, respectively.
For HEWL, we ran ten independent PaCS-Fit targeted by the SAXS data, from the same initial conformation. All the χ values are saturated at about 0.4. R g of the ten structural models are from 15.0 to 15.1 Å, which are close to the value (15.3±0.2 Å) estimated from the SAXS data.
Pretending no knowledge of the target crystal structure, we have calculated the pairwise RMSD values among these models, which are from 1.1 to 2.2 Å with the mean value of 1.5 Å. The above data suggest that the structural models of HEWL built by PaCS-Fit are reliable with high precision, and the variability of them may provide a lower bound of the model accuracy 1 .
For the triple-BRCT-domain of ECT2, ten independent SAXS-targeted PaCS-Fit were also carried out, starting from its crystal structure. Although their χ values are all fairly small from 0.4 to 0.6, and R g (from 26.6 to 27.7 Å) are all similar to the value (27.5±0.4 Å) estimated from the experimental SAXS data, the ten structural models demonstrate some different conformations with pairwise RMSD values ranging from 2.1 to 8.3 Å. The results may indicate that the SAXS data alone is insufficient to determine a single model of this protein or there exist multiple conformations 1 . Therefore, it seems unlikely to estimate the model precision here unless more experimental data would be available.

Cross validation of the PaCS-Fit models
One type of analysis to indicate the model accuracy is cross validation, that is, the structural model is built by one set of experimental data and then validated against other data sets not included in the integrative modeling 1 . For AKeco, both the simulated SAXS and EM data are available, which enable us to do cross validation (Table S1). In the close-to-open fitting of the protein, the atomic model constructed by SAXS-targeted PaCS-Fit has a χ value of 1.0, and its CC=0.87. The model obtained by EM-targeted PaCS-Fit has a CC value of 0.94, and its χ equals to 5.3 that is fairly small (Fig. 1a, circles). In the open-to-close fitting of AKeco, the χ value of the structural model built by SAXS-targeted PaCS-Fit is 3.8, and this model has a CC=0.84. The model from EM-targeted PaCS-Fit has a CC value of 0.92, and χ is as small as 2.7. The above cross validation may suggest that these PaCS-Fit models of AKeco are reliable, which are supported by their RMSD tar values (Table S1). Generally the CC values are consistent to the RMSD tar values, that is, the larger CC the smaller RMSD tar . However, this may not be the case between χ and RMSD tar because the SAXS data has a lower resolution than the EM data that may lead to an overfitting problem.  Table S1. Cross validation of the AKeco models generated by PaCS-Fit. These numbers in bold were scores used in PaCS-Fit to pick the final structural model, and those in italic were back-calculated from the selected model.

PaCS-Fit targeted by the SAXS and EM data simultaneously
We

Selection of M and N in PaCS-Fit
In all of the PaCS-Fit, we ran N=10 independent MD simulations at each cycle after the preliminary simulation, as in the original PaCS-MD method 2,3 . For any of these proteins, ten independent MD simulations have already achieved efficient sampling to explore its conformational space. If more computational resources are available, one could try a larger N, with which fewer cycles may be needed to fit the low-resolution structural data. However, we can still set N=10, and allocate more cores to run individual MD simulation. In the latter case, although more cycles may be necessary, each PaCS-Fit cycle would be done faster than that in the former case.
Besides N, PaCS-Fit has an additional parameter M, that is, at each cycle, M conformations that best fit the low-resolution structural data are selected from the trajectory, and N out of M conformations that have the largest RMSD ini are used to start the next cycle. In the original PaCS-MD method 2 , the 'scoring function' is RMSD tar , whereas in PaCS-Fit, the scoring function is from the low-resolution data that may provide a weaker restraint towards the target than RMSD tar . Therefore, RMSD ini is used to encourage the protein to escape from its initial state. M should be larger than N. When M=N, we did observe the protein conformation was sometimes 'trapped' and could not transit to the target. On the other hand, M cannot be too larger than N, otherwise the major driving force in PaCS-Fit would be RMSD ini instead of the low-resolution structural data. We have found that M=20 seems appropriate with N=10.
When using the PaCS-Fit method, we suggest users to try M=20 and N=10 as well, for their own problems, which should work well in most cases.

MD. Standard MD simulations were used in both the SAXS-and EM-targeted PaCS-Fit of
AKeco, which were carried out using the GROMACS-4.5.5 package 4 and the AMBER03 force field 5 . The setup procedure was as follows. The periodic boundary condition (PBC) with a dodecahedron box type was used, with the minimum distance between the solute and the box boundary of 1.2 nm. The box was filled with TIP3P water molecules 6 . The system with the protein and waters was energy-minimized by the steepest descent method, until the maximum force was smaller than 1000 kJ mol -1 nm -1 . 4 Na + were added to compensate the net negative charges on the protein by replacing the same number of water molecules with the most favorable electrostatic potential. The final system was energy-minimized again using the steepest descent and then the conjugate gradient method, until the maximum force was smaller than 100 kJ mol -1 nm -1 . The simulation was conducted by using the leap-frog algorithm 7 with a 2 fs time step.
Before the production run, a 100 ps equilibration simulation with positional restraint was carried out, using a force constant of 1000 kJ mol -1 nm -2 . The initial atomic velocities were generated according to a Maxwell distribution at 300 K. The simulation was performed under the constant NPT condition. The three groups (protein, solvent, and ions) were coupled separately to a temperature bath of 300 K by using an velocity rescaling thermostat 8 , with a relaxation time of 0.1 ps. The pressure was kept at 1 bar with a relaxation time of 0.5 ps and the compressibility of 4.5×10 -5 bar -1 . Covalent bonds in the protein were constrained using the P-LINCS algorithm 9 .
Twin-range cutoff distances for the van der Waals interactions were chosen to be 0.9 and 1.4 nm, respectively, and the neighbor list was updated every 20 fs. The long-range electrostatic interactions were treated by the PME algorithm 10 where Δr ij is the fluctuation of the COM distance between residues i and j, and k ij is the spring constant, for SAXS experiments. The initial velocities of the system were generated at 310 K, and the temperature bath was set to 310 K as well.

GroEL monomer
MD. For each EM-targeted PaCS-Fit of the GroEL monomer, standard MD simulations were used to run these cycles after the preliminary simulation. The simulation parameters were the same as that of AKeco, except for the following. When setting up MD staring from the closed structure, the minimum distance between the solute and the box boundary was 2.0 nm to assure that the box may have enough space to allow the close conformation to transit to the open state during the PaCS-Fit. 19 Na + were added in order to compensate the net negative charges on the protein.
ACM. The enhanced sampling method was used to run the preliminary simulation in each EM-targeted PaCS-Fit of the GroEL monomer. The parameters of ACM were largely the same as those for HEWL, except for the following. The four slowest modes were used to define an essential subspace. At each time step, the component of velocity in the essential subspace was coupled to a high temperature of 900 K while the remaining was coupled to 300 K. During the ACM simulation, collective modes were updated every 500 time steps.

Computational cost of PaCS-Fit
In this work, we used either 80 Intel cores (2.6 GHz) or 160 AMD cores (2.3 GHz) to run PaCS-Fit (Table S2). The computational cost mainly consists of a preliminary MD or ACM simulation, ten independent MD simulations at each cycle, and calculations of scoring functions for all the simulated conformations. It should be noted that, for the same protein, computation of CC is more expensive than that of χ. GroEL monomer-open 98849 160 2+60 5382 Table S2. Computational cost of PaCS-Fit. The total time scale of each PaCS-Fit includes a preliminary simulation (0.1-ns MD or 2-ns ACM) followed by 30 or 60 cycles of ten independent 0.1-ns MD simulations.