Biomolecular simulation based machine learning models accurately predict sites of tolerability to the unnatural amino acid acridonylalanine

The incorporation of unnatural amino acids (Uaas) has provided an avenue for novel chemistries to be explored in biological systems. However, the successful application of Uaas is often hampered by site-specific impacts on protein yield and solubility. Although previous efforts to identify features which accurately capture these site-specific effects have been unsuccessful, we have developed a set of novel Rosetta Custom Score Functions and alternative Empirical Score Functions that accurately predict the effects of acridon-2-yl-alanine (Acd) incorporation on protein yield and solubility. Acd-containing mutants were simulated in PyRosetta, and machine learning (ML) was performed using either the decomposed values of the Rosetta energy function, or changes in residue contacts and bioinformatics. Using these feature sets, which represent Rosetta score function specific and bioinformatics-derived terms, ML models were trained to predict highly abstract experimental parameters such as mutant protein yield and solubility and displayed robust performance on well-balanced holdouts. Model feature importance analyses demonstrated that terms corresponding to hydrophobic interactions, desolvation, and amino acid angle preferences played a pivotal role in predicting tolerance of mutation to Acd. Overall, this work provides evidence that the application of ML to features extracted from simulated structural models allow for the accurate prediction of diverse and abstract biological phenomena, beyond the predictivity of traditional modeling and simulation approaches.

. Experimental data of the holdout dataset. The classification cutoffs were as follows: Soluble Yield (Sol Yield) set at 520 nM, balancing actives and inactives at 53% and 47%, respectively; Total Yield (Tot Yield) set at 1600 nM, balancing actives and inactives at 51% and 49%, respectively; Soluble Fraction (Sol Frac) set at 39%, balancing actives and inactives at 47% and 53%, respectively.

Definitions
The following table represents the abbreviations used within the rest of the document. the entire first monomeric unit as well as the aligned mirrored full monomeric unit. These two files were saved as a joint object producing the complete homodimer.

Supplementary
The RecA structure (PDB: 3CMW 4 ) is originally a polymer which is bound with DNA and other small molecules. For simplicity in modeling, the original RecA structure was trimmed to a single monomeric unit without any DNA present. Both the LexA and RecA structures were relaxed after initial pre-processing to be used in mutational modeling. The structures were relaxed with a Cartesian fast-relax in PyRosetta with the following options: a MoveMap set for both backbone and sidechain minimization, the beta_nov_16_cart score function, lbfgs_armijo_nonmonotone minimizer, and the minimize_bond_angles flag set to True.

Acridon-2-ylalanine (Acd) Modeling
In order to mutate sites of interest to Acd, we used the same params and rotamer library files generated in our previous work. 2 These requisite files were generated according to the protocol in Renfrew et al. 5 To mutate a residue to Acd, PyRosetta must be initialized with the following extra options: RMSD and any of the three dependent variables (soluble yield, total yield, soluble fraction). We also investigated the difference in Rosetta total score between the Acd and WT simulations. We observed yet again that there is no correlation between Rosetta REU and our data.

Feature Analysis
The two feature sets used in this study are the score terms from the Rosetta beta_nov16 score function and the terms found in the following table.
Supplementary t_c_m The temperature at which half of the a-helix is unfolded 9 delta_t_d_m Change in melting temperature values between each peptide analogue and that of the glycine analogue 9 blosum62_trp Score of mutation from an amino acid to tryptophan 10 blosum62_tyr Score of mutation from an amino acid to tyrosine 10 blosum62_phe Score of mutation from an amino acid to phenylalanine 10 blosum62_his Score of mutation from an amino acid to histidine 10 ASA Accessible surface area from DSSP 11 RSA Relative accessible surface area from DSSP 11 Locally relaxed complexes following mutation to Acd were scored with the beta_nov16 score function where PyRosetta was initialized with the pairwise energy term distance changed from 6 Å to 9 Å. Additionally, the score function was passed to the decompose bb_hb_into_pair_energies energy option and values were recorded on a per residue basis. Scores corresponding to the five local relax trials were averaged for both the Acd and WT structures. Score differences (deltas) were computed by subtracting the averaged Acd scores by the averaged WT scores. Features which S12 contain the "_Site" suffix are derived from the residue where Acd was incorporated. Features with the "_8A" suffix are derived from the average of all other residues that were designated to move based on the MoveMap in the local relax. Supplementary Table 3 shows the minimum, maximum, mean, and standard deviation of the score deltas used for ML.

Dimensionality Reduction
Descriptive analysis of the experimental data using RCSF features proved to be effective.
However, we were interested in the predictive capability of these features and, due to the size of the experimental dataset, the total number of features must be reduced to avoid overfitting the dataset. Features for machine learning were selected using the SelectKBest module from sklearn.feature_selection utilizing the mutual_info_classif score function in univariate analysis.
Given the size of the dataset we set an upper limit of 10 features to reduce any overfitting.
Supplementary Tables 5 demonstrates the number and identity of the features that were selected for each model based on the highest accuracy in forward feature selection. S19 Supplementary

Tuning Parameters
Although we limited the number of features for machine learning to avoid overfitting, we sought to demonstrate both the efficacy and practicality of our models through accurate prediction of a holdout dataset which faithfully represented both LexA and RecA, secondary structure, and tolerated and untolerated positions. Prediction of never-before-seen data in the holdout set eliminates the concern of overfitting due to hyperparameter optimization alone. Supplementary Column headings: Nu (regularization parameter), kernel (kernel type), gamma (kernel coefficient), class_weight (parameter which dictates magnitude of loss based on class distribution), tol (tolerance of optimization), and alpha(in BNB alpha is the smoothing parameter, in KRR it is the regularization parameter).
All models following feature selection were tuned using exhaustive grid searching in sklearn. Grid searching was performed with stratified 5-fold cross validation. Supplementary

Feature Importance
In addition to investigating the feature importance of the soluble fraction models, we also computed feature importance values for the soluble yield and total yield models. Interestingly, the soluble yield and soluble fraction models are very similar in number and type of features, while the total yield model is five features, heavily biased towards Lennard Jones attraction. It is not surprising that feature importance is different in the total yield model as it is a completely different experimental observable. Supplementary Tables 7 and 8 display the normalized feature importance values for the soluble and total yield RCSFs.