Aqueous pKa prediction for tautomerizable compounds using equilibrium bond lengths

The accurate prediction of aqueous pKa values for tautomerizable compounds is a formidable task, even for the most established in silico tools. Empirical approaches often fall short due to a lack of pre-existing knowledge of dominant tautomeric forms. In a rigorous first-principles approach, calculations for low-energy tautomers must be performed in protonated and deprotonated forms, often in both gas and solvent phases, thus representing a significant computational task. Here we report an alternative approach, predicting pKa values for herbicide/therapeutic derivatives of 1,3-cyclohexanedione and 1,3-cyclopentanedione to within just 0.24 units. A model using a single ab initio bond length from one protonation state is as accurate as more complex regression approaches that use more input features, and it outperforms the program Marvin. Our approach can be used for other tautomerizable species, to predict trends across congeneric series and to correct experimental pKa values.


Regression Approaches

Gaussian Process Regression (GPR)
We built our GPR model using the Python library George. A Gaussian process is a non-parametric model that defines a distribution over functions, which is updated according to the training data. The process is fully defined by two priors: the mean function m(x) (set to zero) and the covariance function k(x, x′) (also called the kernel), which in our case is a radial basis function with automatic relevance determination (SE-ARD). The hyperparameters θ of this kernel were found by maximising the log-likelihood function using the BFGS algorithm (as implemented in scipy), applied to the negative log-likelihood and its gradient (so that minimisation of the negative corresponds to finding the maximum of the function). As there can be many local maxima, the optimiser was restarted from random initial weights 100 times in an attempt to find the global maximum.
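The recipe above (zero-mean GP, SE-ARD kernel, log-marginal-likelihood maximisation with L-BFGS and random restarts) can be sketched as follows. This is a minimal illustration on synthetic, hypothetical bond-length data; it uses scikit-learn's GaussianProcessRegressor as a stand-in for the George library used in this work, and the number of restarts is reduced to keep the example fast.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

# Synthetic stand-in data: 20 "compounds", 2 bond-length features (angstroms)
rng = np.random.default_rng(0)
X = rng.normal(1.33, 0.01, size=(20, 2))
y = 98.0 * X[:, 0] - 127.0 + rng.normal(0, 0.05, size=20)

# SE-ARD kernel: one length scale per feature (automatic relevance determination)
kernel = ConstantKernel(1.0) * RBF(length_scale=[1.0, 1.0])

# Zero-mean GP; hyperparameters are found by maximising the log-marginal
# likelihood with L-BFGS-B, restarted from random draws to escape local maxima
# (the text describes 100 restarts; 10 are used here for speed)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True,
                              n_restarts_optimizer=10, random_state=0)
gp.fit(X, y)
pred, std = gp.predict(X, return_std=True)
```

The fitted kernel's per-feature length scales indicate how relevant each bond length is to the target: a large length scale means the corresponding feature barely influences the prediction.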

Support Vector Regression (SVR)
We built our SVR model using the Python library scikit-learn. For a linear function, given a set of training data {(x1, y1), …, (xn, yn)}, the goal of SVR is to find a function f(x), i.e.,

f(x) = ⟨w, x⟩ + b

where ⟨w, x⟩ corresponds to the dot product, that has at most ε deviation from all training set values yi, and that is as flat as possible. In order to obtain a function that is as flat as possible, the norm of w is minimised, which is a convex optimisation problem, i.e., the problem becomes

minimise ½‖w‖² + C Σi (ξi + ξi*)
subject to yi − ⟨w, xi⟩ − b ≤ ε + ξi,
⟨w, xi⟩ + b − yi ≤ ε + ξi*,
ξi, ξi* ≥ 0.

The loss function is the default in scikit-learn, which corresponds to the "epsilon-insensitive loss" (the L1 loss).
In eq. 3 we have also introduced ξi and ξi* as slack variables, which measure the deviation of training samples outside the ε-insensitive zone. These slack variables are used in ε-SVR because it can be difficult to find a function that approximates all yi with less than ε deviation for all points. C is the box constraint, which controls the trade-off between the model flatness and the extent to which deviations larger than ε are tolerated in optimisation. C was found via a grid search (GridSearchCV) in scikit-learn using the 7-fold cross-validation RMSEE, along with ε.
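The grid search over C and ε described above can be sketched as follows; the data, grid values, and fold count are illustrative only (the bond-length values and the C-O model coefficients are hypothetical numbers, not the fitted model from this work).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import GridSearchCV

# Hypothetical single-feature training set (e.g. one C-O bond length per compound)
rng = np.random.default_rng(1)
X = rng.normal(1.33, 0.01, size=(28, 1))
y = 98.38 * X[:, 0] - 127.71 + rng.normal(0, 0.05, 28)

# Grid over the box constraint C and the epsilon-insensitive zone width,
# scored by 7-fold cross-validated RMSE (the RMSEE of the text)
grid = GridSearchCV(
    SVR(kernel="rbf", gamma="scale"),
    param_grid={"C": [0.1, 1, 10, 100], "epsilon": [0.01, 0.1, 0.5]},
    scoring="neg_root_mean_squared_error",
    cv=7,
)
grid.fit(X, y)
best = grid.best_estimator_
```

`grid.best_params_` then holds the C and ε pair that minimised the cross-validated RMSEE over the grid.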

Partial Least Squares (PLS)
We built our PLS models using the Python library scikit-learn. PLS was chosen over Multiple Linear Regression due to the possibility of multicollinearity between bond length features. PLS is a way of building a regression model using latent variable (LV) decomposition. The two matrices X and Y, containing the descriptors and the target values of the training set, can be decomposed into a sum of latent variables:

X = TPᵀ + E
Y = UQᵀ + F

where T and U are score matrices, P and Q are loading matrices, and E and F are residual matrices, for X and Y respectively. For a latent variable h,

u_h = b_h t_h (8)

where b_h is the coefficient for the latent variable h. To identify an optimal number of LVs for each feature combination, models were constructed using 1 to ndim LVs (where ndim is the number of dimensions/features), and the model that returned the best RMSEE of the training set was selected.

Random Forest Regression (RFR)
We built our RFR model using the Python library scikit-learn. Random Forest Regression is an ensemble approach to regression. This means that multiple decision trees are used to arrive at a final output value, which is the average of the individual tree predictions. This is achieved by bagging, i.e. randomly sampling from the full set (with replacement) and training a decision tree on each bootstrap sample {(xi, yi)}, i = 1, …, n, where xi is a vector of descriptors and yi is the target value (pKa). For each tree the best split at each node is picked from a randomly selected subset of descriptors. The tree is then grown until further splits are not possible, and is not pruned back. This procedure is repeated until enough trees have been grown. The number of estimators (n_est) and maximum depth were found in each case by applying a grid search (GridSearchCV in scikit-learn). The final hyperparameter values were chosen to minimize a 7-fold cross validation RMSEE.
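The hyperparameter search for the forest can be sketched as below; the feature matrix, grid values, and target model are illustrative stand-ins, not the data of this work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
X = rng.normal(1.35, 0.02, size=(28, 3))
y = 80.0 * X[:, 0] - 100.0 + rng.normal(0, 0.1, 28)

# Tune the number of bagged trees and their maximum depth by 7-fold
# cross-validated RMSE; max_depth=None lets trees grow until no split remains
grid = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [None, 3, 6]},
    scoring="neg_root_mean_squared_error",
    cv=7,
)
grid.fit(X, y)
rf = grid.best_estimator_
```

Each tree in `rf.estimators_` was grown on its own bootstrap sample; the forest prediction is the mean over these trees.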

Validation Metrics.
The r² score that we calculate to assess model predictability is produced via k-fold cross validation, where k = 7. Hence, one seventh of the dataset is removed, and the remaining six sevenths of the input features and observables are used to form the predictive equation. Predictions are then made for the seventh that was removed. The second seventh is then removed, and the first seventh rejoins the remaining five sevenths to make up a new six-sevenths training set.
When all 7 cycles are complete and all compounds have been predicted once, the following equation is used to obtain the r² value:

r² = 1 − Σi (y_obs,i − y_pred,i)² / Σi (y_obs,i − ȳ_obs)²

where y_obs,i and y_pred,i correspond to the observed and predicted values for each of the training set compounds, and ȳ_obs is the mean of the observed values for the training set. The RMSEE values that we quote are derived from the following equation:

RMSEE = √[ Σi (y_obs,i − y_pred,i)² / n ]

where y_obs,i and y_pred,i are defined as above and n is the number of compounds in the training set. The mean absolute error is defined as:

MAE = (1/n_ext) Σj |y_obs,j − y_pred,j|

where y_obs,j − y_pred,j is the residual error, j now indexes the test set compounds, and n_ext denotes the number of compounds in the external test set. These 7-fold CV metrics were calculated using cross_validate. External validation was also performed by calculation of the Mean Absolute Error (MAE) and by employing Roy's MAE evaluation criteria 11. According to Roy, the two criteria that must be met by a "good" model are that: 1) the MAE must be less than 10% of the training set range, and 2) MAE + 3σ must be less than 20% of the training set range.
Here σ denotes the standard deviation of the absolute errors. If the model does not fit the above criteria then it can be deemed "moderate", that is, if for the second criterion 25% is used in place of 20%, or "poor" if it obeys neither criterion. The Root-Mean-Squared-Error of Prediction (RMSEP) calculated for the test set is also used to evaluate model prediction accuracy. *See Supplementary Methods for a methodological overview of experimental pKa determination in this work.

Supplementary Note 1
Many of the sources for the herbicides o1-o8 are not primary sources because the originals cannot be located (it is likely the values were measured internally by the companies that developed the compounds). However, many of the compounds feature in the "Pesticide Manual" produced by the British Crop Council. Details of the experimental measurement of pKa values for all compounds marked * in Supplementary Table 1 (i.e. "Syngenta" or "this work") are given below.
Experimental pKa measurements were collected using a SiriusT3 instrument (Sirius Analytical Instruments, East Sussex, Great Britain), an automatic titration system incorporating in situ UV spectroscopy. The Sirius T3 is equipped with an Ag/AgCl double-junction reference electrode to monitor pH, a dip probe attached to a UV spectrophotometer, a stirrer, and automated volumetric titration capability. The Sirius T3 UV-metric pKa measurement protocol measures the change in multi-wavelength absorbance in the UV region of the absorbance spectrum while the pH is titrated. Measurements were carried out at 25°C and constant ionic strength (0.01 M KCl); UV absorbance data are collected from 160-760 nm, while the 250-450 nm region is typically used for pKa determinations.
Because of the low water solubility of the tested compounds, the titrations were carried out in a co-solvent (methanol): the method involves titrating the compound at three different methanol concentrations and calculating the aqueous pKa by extrapolation using the Yasuda-Shedlovsky equation 12,13. Two Sirius T3 computer programs, Sirius T3 Control v1.1.3.0 and Sirius T3 Refine v1.1.3.0, were used to execute measurement protocols and analyse pH-dependent multi-wavelength spectra, respectively.
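The Yasuda-Shedlovsky extrapolation amounts to fitting psKa + log[H2O] as a linear function of the reciprocal dielectric constant of each methanol/water mixture, then evaluating the line at the values for pure water (ε ≈ 78.3, [H2O] ≈ 55.5 M). The sketch below uses entirely hypothetical psKa, dielectric, and water-concentration values to show the arithmetic only.

```python
import numpy as np

# Hypothetical apparent psKa values at three methanol/water ratios
inv_eps = 1.0 / np.array([65.0, 58.0, 52.0])   # 1/dielectric of each mixture
log_h2o = np.log10([50.0, 45.0, 40.0])         # log molar water concentration
pska = np.array([4.9, 5.1, 5.3])

# Linear fit of (psKa + log[H2O]) vs 1/eps, extrapolated to pure water
slope, intercept = np.polyfit(inv_eps, pska + log_h2o, 1)
pka_aq = slope * (1 / 78.3) + intercept - np.log10(55.5)
```

The extrapolated `pka_aq` is the aqueous value reported; with real data the quality of the linear fit across the three co-solvent ratios serves as a consistency check on the measurement.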

Supplementary Note 2: The origin of bond length variation with pKa for 2-NO2 (tkn) and 2-Cl (tkc) substituted triketones.
Initial analysis of the internal validation statistics for the full set of 10 tkn and tkc compounds reveals, in each state, no r² values above 0.90 for any plot of bond lengths i to viii vs pKa. However, the correlations increase for each bond length when the set is split into two subsets. Intuitively, the full set is split according to the 2-substituent type, into one set consisting of compounds tkn1-tkn4, and another of compounds tkc1-tkc6. Whilst Cl atoms are electron-withdrawing through σ-bonds due to their electronegativity, the electron-withdrawing effect of an NO2 group occurs not only through the electronegativity of the nitrogen atom (i.e. σ-effects) but also through π-withdrawal. However, there are some inconsistencies between the two sets when we now consider the remaining bonds. Firstly, considering the 2-NO2 set (still as tautomer b, i.e. the anti keto-enol form), the more acidic species are found to have shorter C-C iv and vi bonds, and longer C=O v bonds, suggesting that the delocalisation occurs across the whole triketone system. Evidence for this assertion lies in the co-planar orientation of the keto-enol moiety with the exo carbonyl group: tkn1-tkn4 have an average C1=C2-C3=O4 dihedral angle T1 of 15° (Figure 3B of the main text). However, the plot of bond vii vs pKa does not corroborate the above assertion because, rather than showing the expected negative correlation, the two variables are completely uncorrelated (r² = 0.06). For these same four compounds, the exo carbonyl group is almost orthogonal to the phenyl ring: the C6=C5-C3=O4 torsional angle T2 has an average value of 75°. This orthogonality is indicative of negligible conjugation of the 2-Ac-1,3-CHD group with the aromatic ring. It should also be noted that the magnitudes of the T1 and T2 dihedral angles also correlate with pKa, with r² values of 0.84 and 0.92, respectively.
That is to say, the more co-planar the two C=O and single C-O moieties of the triketone fragment are, the lower the pKa.
Now considering the tkc subset, the average T1 and T2 angles are found to be ~50° and ~38°, respectively. This preferred geometry is indicative of an increased degree of conjugation with the aromatic ring. This assertion is corroborated by the fact that the T1 and T2 values also correlate with aqueous pKa values, with r² values of 0.95 (with a positive gradient) and 0.89 (with a negative gradient), respectively, i.e., as the exo carbonyl group of the tkc series becomes more in-plane with the benzene ring, these compounds become more acidic. As is illustrated in Figure 3B of the main text, the C-C iv and C=O v bonds of the 2-Cl subset are also found to show opposing trends with pKa when compared to the same bonds of the 2-NO2 analogues. A longer C-C bond and shorter C=O bond is again indicative of less conjugation with the keto group. Finally, whereas the slope of the line-of-best-fit for both subsets is negative for bond distance viii, i.e., the C-C bond linking to the phenyl group, the bonds of the 2-Cl series are consistently shorter than those of the 2-NO2 set. From this observation, it may again be inferred that there is a greater degree of conjugation into the phenyl ring through the exo carbonyl for tkc1 to tkc6, as the C-C viii bond gains more double bond character.
In order to further explain the above observations, an IQA analysis was performed on the B3LYP/6-311G(d,p) calculated wavefunctions of all 10 derivatives. The images shown in Fig. S1(a)

Supplementary Table 7. Bond lengths (i-v) and pKa values for the most stable conformation identified at B3LYP/6-311G(d,p) in CPCM for all compounds studied in this work, whilst keeping the keto-enol anti state of the 1,3-CHD or 1,3-CPD group.

Supplementary Table 12. 7-fold CV RMSEE, MAE, and R² score for SV regression with an RBF kernel using each combination of the 5 bond lengths of the keto-enol fragment. The optimal model based on RMSEE is marked in red.

Supplementary Table 13. 7-fold CV RMSEE, MAE, and R² score for GP regression with an RBF kernel using each combination of the 5 bond lengths of the keto-enol fragment. The optimal models based on RMSEE are marked in red.

Supplementary Figure 2. For the series o1-o8: Marvin predictions (no consideration of tautomers/resonance), literature experimental values ("Lit Exp"), our newly measured experimental values ("Our Exp") and AIBL predictions made using the C-O bond model, i.e. pKa = 98.38*r(C-O) - 127.71. An asterisk (*) denotes predictions for compounds that feature in the test set, i.e. are external predictions. Those without asterisks feature in the training set used to formulate the model itself.

Supplementary Figure 3. For the series tet1-tet6: Marvin predictions (no consideration of tautomers/resonance), literature experimental values ("Exp") and AIBL predictions made using the C-O bond model, i.e. pKa = 98.38*r(C-O) - 127.71. An asterisk (*) denotes predictions for compounds that feature in the test set, i.e. are external predictions. Those without asterisks feature in the training set used to formulate the model itself.