Aqueous pKa prediction for tautomerizable compounds using equilibrium bond lengths

Caine, Beth A.; Bronzato, Maddalena; Fraser, Torquil; Kidley, Nathan; Dardonville, Christophe; Popelier, Paul L. A.

doi:10.1038/s42004-020-0264-7

Open access
Published: 12 February 2020

Communications Chemistry volume 3, Article number: 21 (2020)

Abstract

The accurate prediction of aqueous pK_a values for tautomerizable compounds is a formidable task, even for the most established in silico tools. Empirical approaches often fall short due to a lack of pre-existing knowledge of dominant tautomeric forms. In a rigorous first-principles approach, calculations for low-energy tautomers must be performed in protonated and deprotonated forms, often both in gas and solvent phases, thus representing a significant computational task. Here we report an alternative approach, predicting pK_a values for herbicide/therapeutic derivatives of 1,3-cyclohexanedione and 1,3-cyclopentanedione to within just 0.24 units. A model, using a single ab initio bond length from one protonation state, is as accurate as other more complex regression approaches using more input features, and outperforms the program Marvin. Our approach can be used for other tautomerizable species, to predict trends across congeneric series and to correct experimental pK_a values.

Introduction

Approximately 21% of the compounds that make up pharmaceutical databases are said to exist in two or more tautomeric forms¹. Tautomerism is a form of structural isomerism that is characterized by a species having two or more structural representations, between which interconversion can be achieved by “proton hopping” from one atom to another. Issues surrounding pK_a prediction for species exhibiting this feature have been noted a number of times in the literature. Recently², Connolly suggested that a lack of experimental information on both relative tautomer stability and the properties of distinct tautomeric forms are the likely causes of such issues. Tautomeric species present a challenge, not just to empirical-based approaches, but also to those that attempt to solve the pK_a prediction problem using first-principles^1,2,3,4,5. For tools implementing the latter approach (e.g. Jaguar, Schrödinger^4,6,7), the most rigorous protocol includes quantum chemical calculations for conformations of each, or a select few low lying tautomer(s), in both gas- and solvent phase, and in both protonated and deprotonated forms. Therefore, without some element of empiricism, first-principles approaches often incur significant computational expense.

For methods of pK_a estimation that generate descriptors starting from 2D fingerprints, each tautomeric form of a species will correspond to a unique representation. Therefore, the user must either (i) possess prior knowledge of tautomeric stability in order to maximize prediction accuracy, or (ii) tautomer enumeration must be performed by the program based on an arbitrary user input, followed by selection of the optimal tautomer for calculation of chemical descriptors^8,9,10. A comparative study¹¹ of 5 empirical pK_a prediction tools (ACD/pK_a DB (http://www.acdlabs.com/home), Epik (http://www.schrodinger.com), VCC (http://vcclab.org), Marvin (http://www.chemaxon.com) and Pallas (www.compudrug.com)) on 248 compounds of the Gold Standard Dataset compiled by Avdeef¹², demonstrated a tendency for prediction errors to be higher for compounds with a larger number of possible tautomeric states. For the tools they tested, the guanidine group of the drug Amiloride and the enolic hydroxyl groups of herbicides Sethoxydim and Tralkoxydim were also identified as common outliers.

Compounds containing a 1,3-diketo group exhibit tautomerism (shown in Fig. 1a(i), (ii)). For cyclic 1,3-diketones, the diketo state (Fig. 1a(i)) can be transformed into two keto-enol forms (Fig. 1a(ii)). Tautomeric states of the same molecule may be non-degenerate, with the ratio being influenced by the solvent environment and temperature¹³. The compounds 1,3-cyclohexanedione (1,3-CHD) and 1,3-cyclopentanedione (1,3-CPD) are known to possess significant keto-enol character in solution, a phenomenon attributed to the formation of hydrogen bonded solute dimers, and additional stabilization from solute-solvent interactions¹⁴.

**Fig. 1: Structures of 1,3-diketone derivatives and schematic of our workflow.**

1,3-CHD is a fragment prevalent in both agrochemically and pharmaceutically active compounds in use today. Alloxydim (Fig. 1b(i)) is currently used as a selective systemic herbicide for post-emergence control of grass weeds in sugar beet, vegetables and broad-leaved crops. Adding a derivatized benzoyl group at the 2-position in place of Alloxydim’s 2-oxime forms what is known as a triketone herbicide (e.g. Mesotrione, Fig. 1b(ii)). Pharmaceutically relevant compounds containing the 1,3-CHD group include the antibiotic Tetracycline and its analogues.

Previous work^{15,16,17,18,19,20,21,22} from our group, as well as the earlier work of others, has highlighted the utility of bond lengths^23,24,25 and other quantum chemically derived descriptors in the context of Quantitative Structure Property Relationship studies²⁶. Most recently, our approach to pK_a prediction, which uses only internuclear distances as descriptors, called AIBL-pK_a (ab initio bond lengths), showed remarkably accurate prediction of acidity variation across congeneric series of guanidine-containing species¹⁹ and sulfonamides²⁰. The current work brings attention to the issue of pK_a prediction for tautomerizable compounds and delivers an intuitive solution to this problem for 1,3-CHD and 1,3-CPD derivatives, which remain important scaffolds in pharmaceutical and agrochemical research.

Results and discussion

Scheme for model construction

Our proposed method of predicting pK_a values (Fig. 1c and Methods) makes use of equilibrium bond lengths from density functional theory calculations (B3LYP/6-311G(d,p) with the conductor-like polarizable continuum model or CPCM) as input features for regression models.

The full dataset of 71 compounds used in this work represents a wide variety of substituent types and patterns (generic structures and examples of dataset compounds are shown in Fig. 2a). After an initial analysis of the linear fit of each individual bond length, we investigated whether the use of multiple bond lengths as input features could provide an advantage in prediction accuracy and model applicability radius. For this task, we considered all subset combinations of the bonding distances of the fragment common to each species. We also compared a number of machine learning methods for their regression onto pK_a values, namely, random forest regression (RFR), support vector regression (SVR), Gaussian process regression (GPR) as well as partial least squares (PLS). PLS²⁷ and SVR^28,29,30 have been implemented in the context of pK_a prediction many times, using many different types of descriptors. A brief overview of the theory and method used for these approaches can be found in the Supplementary Methods section of the Supplementary Information (SI). Further details and formalism for the validation metrics used in this work (r², RMSEE, MAE) can also be found in Supplementary Methods.

**Fig. 2: Exemplar cyclic diketone compounds studied in this work and the performance of Marvin versus Experiment.**

Through our analysis, we demonstrate that a powerful model may be constructed from simple linear regression of a single ab initio bond length, thereby potentially negating the need for the more complex approaches.

Current approaches

To exemplify the issues surrounding prediction for cyclic 1,3-diketones using existing empirical approaches, the commercial program by ChemAxon known as Marvin was used to estimate values for a series of 1,3-CHD and 1,3-CPD derivatives (o1-o8, tk1-tk15 and dk1-dk12 shown in Supplementary Table 1 of the SI). The Marvin program uses Gasteiger partial charges³⁰, polarizabilities and structure specific increments to predict pK_a values using ionizable group specific regression equations¹¹. The results are shown in Fig. 2b, where the orange diamonds denote experimental values, blue squares represent Marvin predictions without the option to “consider tautomers/resonance”, while the magenta triangles are predictions made with this option. For the compounds in Fig. 2b where the blue and magenta points overlap, the program predicts the keto-enol state to be dominant, and delivers predictions that lie 0.8 units away from experimental values on average. However, for 60% of the compounds, the program predicts the diketo state to be dominant. For the series o1-o8, Marvin gives values of ~16 log units for 5 out of 8 species. For the remaining three compounds, o1, o3 and o7, the program identifies the acidic proton (pK_a ~ 17) at the 4 or 6 position on the 1,3-CHD ring.

The above results suggest that if accurate predictions are to be made (i.e. residual errors <1 pK_a unit), then the user must have prior knowledge of the dominant keto-enol tautomeric form (blue squares in Fig. 2b). In the following sections we show that our method, which uses quantum chemically derived geometric descriptors, avoids such problems intrinsically. Despite the increased computation time compared with empirical approaches, AIBL avoids the need to perform calculations for both protonation states. Moreover, descriptor calculations may be carried out only in the solvent phase using an implicit approach (CPCM).

Identifying AIBL-pK_a relationships for triketones

The relationship between the structure and herbicidal activity of triketones (Fig. 3a) was first reported³¹ by Lee and co-workers. One of the primary conclusions of that early work was that the ortho-substituent on the phenyl ring is a requirement for the compound’s herbicidal activity. The authors also noted that compounds with more electron-withdrawing para-substituents required a lower dose to obtain a 50% weed-control rating across 7 variants of broad-leaf plants (the metric known as lethal dose 50, or LD₅₀). It was thereby deduced that a linear relationship exists between Hammett constants of para-substituents, log(LD₅₀) and pK_a. Therefore, a more electron-deficient benzene is associated with enhanced acidity and herbicidal activity³¹. As there is already evidence of a structure–property/activity relationship for these species, we took the set of 10 compounds from the work of Lee et al. as a starting point to assess the prevalence of AIBL-pK_a relationships across available tautomeric states.

**Fig. 3: Tautomeric forms of cyclic triketones and the trends in bond length variation with pK_a for compounds labelled tkn and tkc in this work.**

The identities, pK_a values, equilibrium bond lengths and log(LD₅₀) values of the compounds studied by Lee et al. are shown in Supplementary Tables 2–5, labelled as tkn1-tkn4 and tkc1-tkc6. All tkn species possess one 2-NO₂ group whereas each tkc species has a 2-Cl substituent (Fig. 3b). Across each subset the para- substituent varies. We find that the order of stability of each compound in their four lowest energy tautomer/conformations (Fig. 3a) is c > d > b > a. The triketo form a is ~9 kJ mol⁻¹ less stable than the (endo) keto-enol anti form b, which in turn is ~29 kJ mol⁻¹ less stable than the (exo) keto-enol syn form d. Although both d and c possess a stabilising intramolecular hydrogen bond, the most stable form is c by around 7 kJ mol⁻¹.
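The stability ordering above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the rounded gaps quoted in the text (~9, ~29 and ~7 kJ mol⁻¹) can simply be accumulated relative to form c. The Boltzmann populations at 298.15 K are only a rough illustration: they ignore entropy, dimerization and explicit solvent, and the function names are hypothetical.

```python
import math

# Approximate energy gaps quoted in the text (kJ/mol); the "~" values are rounded.
GAP_A_ABOVE_B = 9.0   # triketo a vs (endo) keto-enol anti b
GAP_B_ABOVE_D = 29.0  # b vs (exo) keto-enol syn d
GAP_D_ABOVE_C = 7.0   # d vs the most stable form c


def relative_energies():
    """Energies relative to c, accumulated from the pairwise gaps."""
    e_c = 0.0
    e_d = e_c + GAP_D_ABOVE_C
    e_b = e_d + GAP_B_ABOVE_D
    e_a = e_b + GAP_A_ABOVE_B
    return {"a": e_a, "b": e_b, "c": e_c, "d": e_d}


def boltzmann_populations(energies_kj_mol, temperature_k=298.15):
    """Normalized Boltzmann weights for a dict of relative energies in kJ/mol."""
    gas_constant = 8.314462618e-3  # kJ/(mol K)
    weights = {
        name: math.exp(-energy / (gas_constant * temperature_k))
        for name, energy in energies_kj_mol.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


energies = relative_energies()
populations = boltzmann_populations(energies)
for name in sorted(energies, key=energies.get):
    print(f"{name}: {energies[name]:5.1f} kJ/mol, population {populations[name]:.2e}")
```

On these numbers alone, forms a and b sit 45 and 36 kJ mol⁻¹ above c, so the implicit-solvent ranking by itself assigns them negligible weight.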

Experimental pK_a values were regressed onto bond lengths i–viii (Fig. 3b) of the triketo or keto-enol fragment of tautomers a–d and the fit was assessed using r². For all tautomers a–d, there is a significant improvement in r² when the set is split into two subsets (r² generally 0.9 or above), with one group containing tkn derivatives and the other containing tkc substituted compounds. The slope for the tkn series is consistently 22% larger (i.e. steeper) than that of the tkc derivatives. We can interpret this steeper gradient as the resonance electron-withdrawing effect of the 2-NO₂ substituent heightening the para-substituent’s electronic effect on dissociation propensity. The heightened acidity of the tkn compounds is also likely to be linked to the marked difference in geometry between the two subsets. For the tkc series, the exo-carbonyl group is almost co-planar with the phenyl ring, whereas for the tkn series, the exo carbonyl is co-planar with the keto-enol moiety. In the latter orientation (of the tkn series), the orbital overlap allowing hydroxyl oxygen lone pair delocalisation across the keto-enol and exo-keto group is possible. It may be asserted that this increased conjugative effect would result in less delocalization between O and H atoms, a longer, weaker O–H bond and greater propensity for dissociation.

The bond lengths of the enol anti-conformer b exhibit the most strongly correlated relationships with pK_a values (see Supplementary Tables 2–5). With the exception of O–H(i) and the exocyclic C=O(vii) bond lengths, all pairs of subsets tkn and tkc exhibit r² values above 0.90 (q² > 0.9 and RMSEE ~ 0.2). This is an interesting result, considering that b is not the most stable tautomer according to the ranking at B3LYP/6-311G(d,p)/CPCM. It may be asserted that the emergence of stronger relationships between geometric features (bond lengths) and pK_a using the anti keto-enol tautomer is indicative of its prevalence in solution. A thorough analysis using explicit solvation to explore this hypothesis is beyond the scope of this work. However, preference for this conformation could be linked to its increased propensity for dimerization and H-bonding to solvent molecules.

For both subsets, the trend in the bond variation of O–H (i), C–O (ii) and C=C (iii) with pK_a is such that more acidic compounds have longer O–H and C=C bonds but shorter C–O distances. These observations therefore fit with the intuition that a longer, weaker O–H bond should exhibit an increased propensity for cleavage. Conversely, bonds C–C (iv) and C=O (v) are found to show opposing trends between each series (Fig. 3b).

The aim of this work is to derive a generally applicable model for compounds containing the diketone fragment. Therefore, we deemed it important to understand this disparity in C–C (iv) and C=O (v) bond length variation. To this end, we performed an interacting quantum atoms (IQA) analysis to partition the interaction energy between pairwise atoms A and B into V_xc(A,B) (exchange-correlation) and V_cl(A,B) (electrostatics). For further methodological and theoretical details of this approach see the Methods section.

By taking V_xc(A,B) as our dependent variable in place of bond distances, we can look at how the extent of delocalization of electrons between two topological atoms A and B changes with pK_a. In doing so, we find analogous relationships between V_xc(A,B) of bonds i–v and pK_a values. Longer bonds exhibit less negative V_xc(A,B) values (i.e. there is less delocalization), and vice versa (Fig. 3b). The trend in V_xc(A,B) for bonds i–v across the keto-enol fragment of the tkn series is consistent with hydroxyl oxygen lone pair delocalization across the whole keto-enol fragment, akin to the resonance forms shown in Fig. 1a(ii). Conversely, for the tkc series this delocalization effect is not reflected in the distance variation of iv and v. Further discussion pertaining to the origin of the difference in bond and V_xc variation with pK_a between subsets can be found in Supplementary Methods. Overall, the discrepancy in AIBL-pK_a trends with substituent type (Supplementary Note 2) suggests that, in the search for a bond that has a relationship with pK_a over a wide variety of substituent patterns/types, it is logical to look to the enolic hydroxyl group, i.e. O–H (i), C–O (ii) and C=C (iii).

Due to the prevalence of well-correlated relationships between bonding distances and pK_a for the keto-enol anti-conformation for tkn1-tkn4 and tkc1-tkc6, this tautomeric form was used for all subsequent analysis on the remaining dataset. The bonds that are under investigation are those of the keto-enol fragment (i–v in Fig. 3b), which are common to all 1,3-CHD and 1,3-CPD compounds of the dataset. Selection of these specific bond lengths therefore allows us to construct one generally applicable model, rather than assembling many models for more specific sub-regions of chemical space.

Single bond length models

Our dataset of 71 compounds (Supplementary Table 1) consists of 46 triketones and diketones from Syngenta, plus an additional 9 diketones and 2 triketones measured for the purpose of this work (experimental details can be found in Supplementary Methods in the SI). A further 8 pK_a values for Alloxydim analogues were also obtained from the literature (Supplementary Table 1). Due to a discrepancy between predicted and literature values, samples were procured and pK_a values were re-measured for 7 of these 8 compounds. Literature values for 6 Tetracycline derivatives were also included. The full set was split into 70% training and 30% test set, i.e. 49:22 training to test set.

Table 1 lists internal, cross-validation and external validation statistics of each single bond length regression model (i.e. the typical AIBL approach). The values listed in Table 1 are found using a reduced training set, due to the removal of two outliers, dk29 and tk3. The reason for the removal of these compounds will be discussed in the next section. The most active bond, i.e. the model exhibiting the highest r² and lowest RMSEE is the C–O (ii) bond (0.72 and 0.57, respectively). We note that these values are somewhat less impressive than the threshold values used to mark the presence of an active bond in our other case studies (~0.90 for r² and ~0.3 for RMSEE). This decrease in goodness of fit can be attributed to the higher structural diversity of the set: the model covers 5- and 6-membered rings, compounds with substitution at the 2, 4 and 6 position of the 1,3-CHD fragment and compounds containing more than one ionizable group.

Table 1 Summary of the results for the typical AIBL ordinary least squares approach.

Nonetheless, the error metrics for the C–O model (pK_a = 93.381*r(CO) −127.71) used on the external test set indicate a high level of prediction accuracy and consistency across a diverse array of analogues; the MAE and standard deviation of absolute errors for the test set are both 0.24. No C–O model errors exceed 1 pK_a unit and only 2 out of 22 exceed 0.5 log units (tk1 = +0.92, dk8 = −0.77). The nature of bond length variation across the 47 training compounds matches that of the tkn/tkc series for O–H (i), C–O (ii) and C=C (iii).

Outliers

Two species were found to have residual errors exceeding 1.5 log units for 4 out of 5 bonds. One outlier is dk29, a 1,3-CPD derivative with a CH₂-2-pyridyl group at the 4-position. The pK_a value of 5.78 listed for this species was identified as the pK_a for dissociation of the 2-pyridyl group, rather than the keto-enol fragment (pyridine itself has a pK_a of 5.23). The other incongruous data point corresponds to tk3, which has a fourth keto group at the 5-position of the 1,3-CHD ring, a feature that is also present in compounds tk1 and tk4. The C–O bond distances of these three compounds sit below the trend line for the rest of the set, with an r² value of 1 for a linear fit, i.e., compounds with the 5-C=O structural motif in common form their own high-correlation subset. More accurate predictions for compounds such as tk1 (error = +0.92) could therefore be made using the equation of this line as a new model, rather than the original C–O model. Both outliers (dk29 and tk3) were removed from subsequent analysis.

Other regression approaches

Table 2 shows the 7-fold CV and external validation statistics for optimal models. These were derived using PLS (4 bonds), RFR (3 bonds), SVR [linear] (2 bonds), SVR [RBF] (3 bonds) and GPR [RBF] (3 bonds) using feature selection based on minimization of the 7-fold RMSEE of the training set. The 7-fold RMSEE for each of the 31 combinations/subsets are compared in Fig. 4a (the Model ID list is shown in Supplementary Table 8, the full list of statistics for each model is shown in Supplementary Tables 9–13 and predictions are shown in Table 3). The optimal model for each method was then used to predict test set pK_a values.
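The figure of 31 candidate models follows from taking every non-empty subset of the five keto-enol bond lengths (2⁵ − 1 = 31). A minimal sketch of that enumeration is given below; the bond labels follow bonds i–v as named in the text, while the function name and the ordering of subsets are illustrative assumptions and need not match the Model IDs of Supplementary Table 8.

```python
from itertools import combinations

# Bonds i-v of the keto-enol fragment, labelled as in the text.
BONDS = ("O-H", "C-O", "C=C", "C-C", "C=O")


def feature_subsets(features):
    """Return every non-empty subset of the features, smallest subsets first."""
    return [
        subset
        for size in range(1, len(features) + 1)
        for subset in combinations(features, size)
    ]


subsets = feature_subsets(BONDS)
with_co = [subset for subset in subsets if "C-O" in subset]
print(len(subsets), "candidate feature sets,", len(with_co), "of which include C-O")
```

Each subset would then be scored by cross-validation for every regression method, and the subset with the lowest error kept as that method's optimal model.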

Table 2 Summary of the results for optimal feature choice using PLS, RFR, SVR with linear and RBF kernels, and GPR with the RBF kernel.

**Fig. 4: Performance of each regression method tested in this work using bond lengths as input features compared with results obtained using Marvin.**

Table 3 Experimental pK_a values and predictions for each method tested in this work, for the test set compounds.

Overall, all optimal models for each method include C–O as an input feature. The lowest 7-fold CV MAE and RMSEE correspond to the GPR model using a radial basis function kernel, which uses C–O, C–C and C=O as input features (MAE = 0.30, RMSEE = 0.39). However, this same GPR model also delivers the least accurate predictions for the 22 compounds of the external test set with an RMSEP of 0.59 and a MAE of 0.43, possibly indicative of overfitting to the training set data. Overall, SVR[RBF] using C–O, C–C and C=O returns the lowest MAE and RMSEP for the test set (0.29 and 0.36, respectively) and is consistent in its accuracy (s.d. = 0.22). However, PLS using C–O, C=C, C–C and C=O also performs similarly well (MAE = 0.31, RMSEP = 0.36) and exhibits the lowest standard deviation of absolute errors (s.d. = 0.19). There is one consistently large error across every model, corresponding to the predicted value for tk1. This compound shows an average error across all models of −1.21, with the lowest error exhibited by the PLS model (−0.72) and the largest for GPR[RBF] (−1.60). This compound was previously identified as belonging to a new subset of 5-C=O containing compounds, along with tk3 and tk4 for the C–O model, and may therefore be considered to be on the edge region of the domain of applicability for the model.

The comparable accuracy of the single bond length C–O model for the test set, with respect to more complex regression methods using more input features is a remarkable result, given the simplicity of the approach. This result also validates our previous work, in which models using multiple input features were deemed unnecessary given the strength of the correlation for individual bond distances.

Marvin

A comparison between error metrics for all models shows significant improvement compared with Marvin (Figs. 4b, c), either with or without consideration of tautomer/resonance. Furthermore, AIBL provides predicted values that correctly suggest the dominant microstate at pH 7 is the enolate, i.e. the ionized form. After tautomer enumeration and selection, Marvin’s pK_a values predict that 15 out of 22 compounds would be >50% unionized at this pH. However, this result is reduced to only two incorrectly assigned microstates when the keto-enol form is used explicitly. All experimental and predicted values can be found in Table 3.
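The link between a predicted pK_a and the dominant microstate at pH 7 can be illustrated with the Henderson–Hasselbalch relation for a single acidic site. This is a minimal sketch, not the procedure used by Marvin or in this work: the function names are hypothetical, and the three pK_a values are illustrative inputs chosen to resemble an enolic value, a mid-range value and a diketo-type estimate of ~16.

```python
def ionized_fraction(pka, ph=7.0):
    """Fraction of a monoprotic acid present as its conjugate base at a given pH."""
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def dominant_microstate(pka, ph=7.0):
    """Label the microstate that makes up more than half of the population."""
    return "ionized" if ionized_fraction(pka, ph) > 0.5 else "unionized"


# Illustrative pK_a values only (not taken from Table 3).
for pka in (3.5, 5.9, 16.0):
    print(f"pKa {pka:4.1f}: {ionized_fraction(pka):.3g} ionized -> {dominant_microstate(pka)}")
```

Any predicted pK_a above 7 therefore flips the assignment to the unionized form, which is why predictions based on the diketo state lead to incorrectly assigned microstates.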

Correction of experimental value for Profoxydim

Experimental pK_a data were initially procured from literature sources for the series of “dim” herbicides used in this work (Supplementary Note 1). Upon performing the fits for the single bond length models, the residual error for Profoxydim (Fig. 4d) using the literature pK_a value of 5.91 was found to be anomalously high, at +1.30 units. Marvin predicts the pK_a of the enolic hydroxy group to be 5.44, i.e. very close to this experimental value.

Due to the excellent accuracy observed for species o1-o7 (residuals < 0.50), we decided to re-measure all pK_a values. Seven of the eight compounds (all except Clethodim) were procured and re-measured using the UV-metric method (see Supplementary Methods for details). Excellent agreement was found between old and new values for all compounds but Profoxydim, for which a value of 4.82 was found. This new value lies only 0.22 units from our original prediction (4.61), yet it lies ~1.10 log units from the literature value. Therefore, we demonstrate the power of the AIBL approach to check internal consistency of pK_a values for a given congeneric series. Structures and predictions for all dim herbicides can be found in Supplementary Fig. 2 of the SI.

Tetracyclines

Aside from tautomerism, one of the more complex issues in the field of pK_a prediction is the estimation of values for multiprotic compounds. Two of the species of our dataset contain a secondary ionizable group (dk26 and dk29, 2-pyridyl, pK_a = ~5). In recent work we have demonstrated that prediction for a specific ionizable group may be performed by using the relevant microstate to the dissociation of interest. Therefore, in the case of dk26 and dk29, we performed all calculations on the cationic form of the 2-pyridyl group. To showcase the applicability of the AIBL model derived here in the context of larger multiprotic compounds, 6 tetracycline derivatives were included. For the correct microstate (the neutral state) of each species the most stable form is analogous to the keto-enol syn c conformation. The anti-conformation was constructed by manual rotation of the C²–C¹–O⁹–H¹⁰ (Fig. 3b) torsional angle from this form. For tet1, tet3, tet5 and tet6 of the training set, residual errors from the C–O model are below 0.1 log unit in all cases. For the test set compounds, predictions for tet2 and tet4 also lie within 0.1 log units. Use of Marvin with consideration of tautomers on this occasion identifies the keto-enol state as the relevant tautomeric form, delivering predictions of 2.83, 2.63, 2.55, 2.92, 2.84 and 2.51, for tet1–tet6, respectively, whereas experimental values are 3.35, 3.48, 3.25, 3.50, 3.53 and 3.30, respectively. Therefore, despite making the prediction using the correct tautomer, there is a distinct prediction bias towards higher acidity for the enolic hydroxy group for these compounds. Structures and predictions for tetracyclines can be found in Supplementary Fig. 3 of the SI.
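The bias in the Marvin tetracycline predictions can be checked directly from the twelve values quoted above. The sketch below computes the signed errors (predicted minus experimental), their mean, the MAE and the RMSE; the variable names are illustrative, and the formal metric definitions used in this work are those given in Supplementary Methods.

```python
from math import sqrt

# Values quoted in the text for tet1-tet6.
marvin = [2.83, 2.63, 2.55, 2.92, 2.84, 2.51]
experiment = [3.35, 3.48, 3.25, 3.50, 3.53, 3.30]

# Signed error: negative means the prediction is too acidic (pK_a too low).
errors = [predicted - measured for predicted, measured in zip(marvin, experiment)]

mean_signed_error = sum(errors) / len(errors)
mae = sum(abs(error) for error in errors) / len(errors)
rmse = sqrt(sum(error * error for error in errors) / len(errors))

print("signed errors:", [round(error, 2) for error in errors])
print(f"mean signed error = {mean_signed_error:.2f}, MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```

All six signed errors are negative, so the mean signed error and the MAE have the same magnitude (about 0.69 units), which is the systematic shift towards higher acidity described in the text.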

Future application of AIBL

The poorer performance of Marvin, as illustrated by Figs. 4b, c, can most likely be partly attributed to a lack of coverage of this type of compound (cyclic 1,3-diketones) in their training dataset. The predicted preference of the diketo state of many test compounds can also likely be attributed to the lack of knowledge on relative tautomeric stability, as previously pointed out by Connolly. The results in Fig. 4 illustrate the excellent performance of the C–O AIBL-pK_a model in predicting the pK_a variation across the series. Furthermore, we show that the accuracy is such that we can correct experimental values. We assert that a powerful future application of the AIBL approach is a method of fleshing out areas of chemical space that are sparse in the experimental pK_a databases of empirical predictors, such as Marvin. Once a model has been set up with existing experimental data, hypothetical compounds with a variety of substituents can be assembled and their pK_a values predicted and added to the training set. Therefore, the empirical approach is calibrated using the highly accurate AIBL approach, whilst still maintaining user-friendly computational speed.

We have shown bonding distances to be an intuitive and powerful descriptor of ionization propensity for much of 1,3-CHD and 1,3-CPD space. Due to the use of quantum chemically derived descriptors, the dominant tautomeric state is easily identified as the keto-enol form, from which chemically meaningful relationships are derived; a longer O–H and a shorter C–O bond are generally indicative of a species with heightened acidity compared with the parent compound. A simple but accurate AIBL-pK_a method is proposed and validated; good results are derived using only simple linear regression of pK_a onto C–O bond distances, which is shown to be applicable to a diverse array of analogues. For the test set, this simple model is found to outperform regression using various approaches and multiple bond lengths relevant to the dissociation at the keto-enol ionizable group. Furthermore, the method is applicable to multiprotic compounds, which along with tautomerizable species, represent one of the most challenging areas of pK_a prediction. All of the models developed showed superior accuracy compared with the industry standard, represented by the program Marvin, for which the user must have prior knowledge of the dominant tautomeric form. At present, there is still a time/cost barrier to feasible use of quantum chemical QSPR methods in large scale screening studies. However, this work suggests that the inclusion of some description of electrons and their distribution (via a highly populated geometric representation of molecules), provides a significant advantage in terms of prediction accuracy over an approach (Marvin) that does not describe a compound quantum mechanically. Thanks to AIBL predictions, we also amend the literature experimental value for Profoxydim, which is corrected from a previous value of 5.91 to a new value of 4.82.

Based on the work shown here, and on previous results, we propose that AIBL-pK_a is applicable to any tautomerizable congeneric series, given that pK_a data exist for model calibration.

Methods

Data

Structures and pK_a values with references are given in Supplementary Table 1 for all compounds studied in this work. Equilibrium bond lengths for the most stable geometries identified are listed in Supplementary Table 7.

The pK_a data for the compounds investigated in this work have been procured from various sources. Sixteen triketones, labelled tk-1 to tk-15, tk18 and tk19, were procured from Syngenta and are analogues of the herbicide Mesotrione. A further 20 diketone compounds, labelled dk-1 to dk-12 and dk22 to dk29, were procured from Syngenta. These values were obtained using the UV-vis metric approach with a Sirius T3 instrument at standard conditions (see Supplementary Methods in the SI for more details). A set of 10 compounds of triketone (tk) type, labelled tkn1-tkn4 and tkc1-tkc6, was taken from the work³² of Lee et al. Samples of 11 compounds, namely the diketones (dk) dk-13 to dk-21 and the triketones tk16 and tk17, were procured and measured for the purpose of this work, using the potentiometric method with a Sirius T3 instrument at standard conditions. Finally, literature values were procured for 8 “dim” herbicides (Alloxydim, Cycloxydim, Butroxydim, Clethodim, Sethoxydim, Tepraloxydim, Tralkoxydim and Profoxydim). Samples were purchased for all except Clethodim (due to unavailability), and pK_a measurements were taken using the same apparatus and experimental procedure as described above and in Supplementary Methods. Values for 6 tetracycline derivatives (tet1–tet6) were obtained from literature sources.

Quantum chemical calculations

An ensemble of 15 conformers was generated for each tautomeric form of each compound tkn1-tkn4 and tkc1-tkc6 using the conformer generator plug-in within the Marvin program. Geometry optimization and frequency calculations were then performed using B3LYP/6-311G(d,p) with CPCM implicit solvation for each conformer of every ensemble using GAUSSIAN09³³. Conformers were ranked according to internal energy and the most stable species was taken as the global minimum. For the anti and syn conformers of the keto-enol state, an input geometry for the higher energy anti-conformation was manually generated by rotating the orientation of the O–H bond of the syn conformer by 180°. This process of generating the keto-enol anti state^{15,16,18,19,20,21,22} was repeated for the remaining 61 species.

IQA calculations

The extent of electronic delocalization between two atoms can be calculated within the context of a topological energy decomposition framework called interacting quantum atoms (IQA). Originating from the quantum theory of atoms in molecules³³ (QTAIM), IQA has been used to analyze a variety of chemical phenomena^{34,35,36,37,38}. By decomposing the total energy of a system into intra- and interatomic terms, we obtain the exchange-correlation potential energy V_xc, which is the sum of the exchange energy V_x and the correlation energy V_c. The former term usually dominates and denotes the Fock-Dirac exchange, which describes the ever-reducing probability of finding two electrons of the same spin close to one another (i.e. the Fermi hole). The latter term is associated with the Coulomb hole and the electrostatic repulsion between electrons. The absolute value of V_xc evaluated between two atoms can be taken as the extent of delocalization of electrons between them and so can be interpreted as a measure of covalency. These values were obtained using the AIMAll program³⁹ (version 14), with DFT-compatible IQA partitioning and default parameters, on wavefunctions obtained at the B3LYP/6-311G(d,p) level using CPCM.
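For orientation, the decomposition described in words above can be written compactly. The notation follows common IQA usage and is an assumption about symbols rather than a transcription from this article: A and B label topological atoms, E_intra^A is the intra-atomic energy of atom A, and V_cl^AB is the classical electrostatic interaction energy between A and B.

$$E = \sum\limits_{A} E_{\mathrm{intra}}^{A} + \frac{1}{2}\sum\limits_{A}\sum\limits_{B \ne A} V_{\mathrm{inter}}^{AB}, \qquad V_{\mathrm{inter}}^{AB} = V_{\mathrm{cl}}^{AB} + V_{\mathrm{xc}}^{AB}, \qquad V_{\mathrm{xc}}^{AB} = V_{\mathrm{x}}^{AB} + V_{\mathrm{c}}^{AB}$$

In this notation, the covalency measure used here is |V_xc^AB| for the atom pair of interest.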

Models

For more details of the regression methods implemented in this work, see Supplementary Methods in the SI. Model training and error evaluation were performed using scikit-learn⁴⁰. Initially, ordinary least squares (OLS) regression of pK_a against single bond distances was performed, and the linear relationships between bond lengths and pK_a were assessed using r² and the 7-fold cross-validation (CV) RMSEE and MAE. A random 70:30 split into training set and external test set was then performed (i.e. training set = 49, test set = 22). We compared the results of using more than one bond length of the keto-enol fragment using support vector regression (SVR) with linear and radial basis function (RBF) kernels, random forest regression (RFR), partial least squares (PLS) and Gaussian process regression (GPR) with an RBF kernel. We also compared our test set prediction errors with those obtained using the program Marvin. Each model was evaluated using error-based metrics: mean absolute error (MAE), standard deviation of absolute errors (s.d.), root-mean-squared error of prediction (RMSEP) and the r² of observed vs predicted values. An overview of the AIBL workflow used in the context of cyclic β-diketones is shown in Fig. 1c.
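A minimal sketch of the single-bond-length workflow described above, written with scikit-learn, is given here for illustration. The data are synthetic stand-ins (the numbers are not from the article), the random seeds are arbitrary, and taking r² as the squared Pearson correlation of observed vs predicted values is an assumption about how that metric was defined.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import KFold, cross_val_predict, train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in data (NOT from the paper): 71 bond lengths in Å and
# pK_a values generated from an arbitrary linear trend plus noise.
n_compounds = 71
bond_length = rng.uniform(1.30, 1.34, size=(n_compounds, 1))
pka = 4.0 + 60.0 * (bond_length[:, 0] - 1.30) + rng.normal(0.0, 0.2, n_compounds)

# 7-fold cross-validated predictions from a one-descriptor OLS model.
cv = KFold(n_splits=7, shuffle=True, random_state=0)
pka_cv = cross_val_predict(LinearRegression(), bond_length, pka, cv=cv)
rmse_cv = np.sqrt(mean_squared_error(pka, pka_cv))
mae_cv = mean_absolute_error(pka, pka_cv)

# Random split that holds out 22 compounds as the external test set (49 remain).
X_train, X_test, y_train, y_test = train_test_split(
    bond_length, pka, test_size=22, random_state=0
)
model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Test-set error metrics named in the text.
abs_err = np.abs(y_test - y_pred)
metrics = {
    "MAE": abs_err.mean(),
    "s.d.": abs_err.std(ddof=1),
    "RMSEP": np.sqrt(np.mean((y_test - y_pred) ** 2)),
    "r2": np.corrcoef(y_test, y_pred)[0, 1] ** 2,
}
```

Passing an integer `test_size` fixes the 49/22 partition directly, which a fractional 0.3 would only approximate for 71 compounds.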

The optimal hyperparameters for the SVR models (C, ε, and γ for the RBF kernel) and for the RFR model (number of estimators n_est and maximum depth) were found in each case by applying a grid search (GridSearchCV in scikit-learn). The final hyperparameter values were chosen to minimize the 7-fold cross-validation RMSEE.
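The grid search can be sketched as follows for the RBF-kernel SVR case. The training data are synthetic, and the grid values are illustrative only; they are not the grid used by the authors. The scoring string `neg_root_mean_squared_error` is assumed to be available, which holds for recent scikit-learn releases.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Synthetic stand-in training set (NOT the paper's data): 49 compounds,
# three bond lengths of a keto-enol fragment in Å, and made-up pK_a values.
X_train = rng.uniform(1.25, 1.45, size=(49, 3))
y_train = 4.0 + 20.0 * (X_train[:, 0] - 1.25) + rng.normal(0.0, 0.2, 49)

# Illustrative grid only (hypothetical values).
param_grid = {
    "C": [1.0, 10.0, 100.0],
    "epsilon": [0.01, 0.1],
    "gamma": [0.1, 1.0, 10.0],
}

# Exhaustive search; each candidate is scored by its 7-fold CV RMSE.
search = GridSearchCV(
    SVR(kernel="rbf"),
    param_grid,
    scoring="neg_root_mean_squared_error",
    cv=KFold(n_splits=7, shuffle=True, random_state=0),
)
search.fit(X_train, y_train)

best_params = search.best_params_
best_cv_rmse = -search.best_score_  # scorer is negated so that larger is better
```

The same pattern applies to a random forest by swapping in `RandomForestRegressor` with a grid over `n_estimators` and `max_depth`.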

The GPR model was implemented in Python using the George package. The squared exponential (SE) kernel, or RBF, was used to set up the GPR models with a separate length-scale hyperparameter for each dimension, a form known as the SE kernel with automatic relevance determination (SE-ARD),

$${\mathrm{SE{\text{-}}ARD}}\left( {x,x^{\prime}} \right) = \exp \left( { - \frac{1}{2}\mathop {\sum}\limits_{d = 1}^N {\frac{{|x_d - x_d^{\prime}|^2}}{{\ell _d^2}}} } \right)$$

(1)
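Equation (1), read with one length scale per input dimension as the text describes, can be evaluated directly with NumPy. This is a standalone sketch for illustration, not the George implementation used in the study; the function name `se_ard_kernel` and the example bond lengths and length scales are hypothetical.

```python
import numpy as np

def se_ard_kernel(X1, X2, length_scales):
    """SE-ARD kernel matrix between the rows of X1 (n, N) and X2 (m, N)."""
    length_scales = np.asarray(length_scales, dtype=float)
    # Scale every dimension d by its own length scale l_d.
    A = np.atleast_2d(X1) / length_scales
    B = np.atleast_2d(X2) / length_scales
    # Pairwise squared distances summed over the N dimensions.
    sq_dist = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq_dist)

# Three hypothetical compounds described by two bond lengths (Å).
X = np.array([[1.30, 1.40], [1.32, 1.38], [1.35, 1.36]])
ell = np.array([0.05, 0.10])
K = se_ard_kernel(X, X, ell)
```

A large length scale in one dimension makes the kernel nearly insensitive to that input, which is why the fitted length scales can be read as a rough indication of feature relevance.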

The hyperparameters for this kernel were found by maximizing the log-likelihood function using the training set. This was done with the gradient-based BFGS algorithm (as implemented in SciPy), applied to the negative of the log-likelihood function so that minimization finds the maximum of the log-likelihood. As there can be many local maxima, the optimizer was restarted 100 times from random initial values in an attempt to find the global maximum.
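The restart strategy can be sketched with SciPy alone. The code below assumes a zero-mean Gaussian process with an SE-ARD kernel and a fixed noise term, optimizes the logarithms of the length scales, and draws random starting points from an arbitrary range; the data are synthetic and none of these choices are taken from the article or from George.

```python
import numpy as np
from scipy.optimize import minimize

def se_ard(X1, X2, ell):
    d = (X1[:, None, :] - X2[None, :, :]) / ell
    return np.exp(-0.5 * np.sum(d ** 2, axis=-1))

def neg_log_likelihood(log_ell, X, y, noise=1e-2):
    """Negative log marginal likelihood of a zero-mean GP with an SE-ARD kernel."""
    ell = np.exp(np.clip(log_ell, -10.0, 10.0))  # clip to keep the kernel finite
    K = se_ard(X, X, ell) + noise * np.eye(len(y))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha
            + np.sum(np.log(np.diag(L)))
            + 0.5 * len(y) * np.log(2.0 * np.pi))

rng = np.random.default_rng(2)
# Synthetic stand-in data (NOT from the paper).
X = rng.uniform(0.0, 1.0, size=(30, 2))
y = np.sin(4.0 * X[:, 0]) + 0.05 * rng.normal(size=30)
y = y - y.mean()

# Restart BFGS from random starting points and keep the best local optimum.
n_restarts = 100
best = None
for _ in range(n_restarts):
    start = rng.uniform(-3.0, 2.0, size=X.shape[1])
    result = minimize(neg_log_likelihood, start, args=(X, y), method="BFGS")
    if best is None or result.fun < best.fun:
        best = result

best_length_scales = np.exp(np.clip(best.x, -10.0, 10.0))
max_log_likelihood = -best.fun
```

Minimizing the negative log-likelihood is equivalent to maximizing the log-likelihood, and keeping the lowest value over all restarts guards against stopping in a poor local optimum, although it cannot guarantee the global one.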

Data availability

All data analysed during this study are included in this published article (and its Supplementary Information).

Code availability

The exact code is not provided, given that it was written using methods from the scikit-learn (v.0.20.1) and George (v.0.3.1) libraries, which are freely available. Optimal hyperparameters for each method are provided in the Supplementary Information; all other settings were left at their defaults.

References

Connolly Martin, Y. Experimental and pK_a prediction aspects of tautomerism of drug-like molecules. Drug Discov. Today: Technol. 27, 59–64 (2018).
Brown, T. N. & Mora-Diez, N. Computational determination of aqueous pK_a values of protonated benzimidazoles (Part 1). J. Phys. Chem. B 110, 9270–9279 (2006).
Connolly Martin, Y. Let’s not forget tautomers. J. Comput. Aided Mol. Des. 23, 693–704 (2009).
Philipp, D. M., Watson, M. A., Yu, H. S., Steinbrecher, T. B. & Bochevarov, A. D. Quantum chemical pK_a prediction for complex organic molecules. Int. J. Quant. Chem. 118, 1–8 (2017).
Fujiki, R. et al. A computational scheme of pK_a values based on the three-dimensional reference interaction site model self-consistent field theory coupled with the linear fitting correction scheme. Phys. Chem. Chem. Phys. 20, 27272–27279 (2018).
Bochevarov, A. D., Watson, M. A. & Greenwood, J. R. Multiconformation, density functional theory‐based pK_a prediction in application to large, flexible organic molecules with diverse functional groups. J. Chem. Theor. Comput. 12, 6001–6019 (2016).
Yu, H. S., Watson, M. A. & Bochevarov, A. D. Weighted averaging scheme and local atomic descriptor for pK_a prediction based on density functional theory. J. Chem. Inf. Model. 58, 271–286 (2018).
Haranczyk, M. & Gutowski, M. Combinatorial–computational–chemoinformatics (C3) approach to finding and analyzing low-energy tautomers. J. Comput. Aided Mol. Des. 24, 627–638 (2010).
Greenwood, J. R., Calkins, D., Sullivan, A. P. & Shelley, J. C. Towards the comprehensive, rapid, and accurate prediction of the favorable tautomeric states of drug-like molecules in aqueous solution. J. Comput. Aided Mol. Des. 24, 591–604 (2010).
Watson, M. A., Yu, H. S. & Bochevarov, A. D. Generation of tautomers using micro-pK_as. J. Chem. Inf. Model. 59, 2672–2689 (2019).
Balogh, G. T., Gyarmati, B., Nagy, B., Molnar, L. & Keseru, G. M. Comparative evaluation of in silico pK_a prediction tools on the gold standard dataset. QSAR Comb. Sci. 28, 1148–1155 (2009).
Avdeef, A. Absorption and Drug Development: Solubility, Permeability and Charge State. (Wiley-Interscience, New Jersey, USA, 2003).
Cyr, N. & Reeves, L. W. A study of tautomerism in cyclic β-diketones by proton magnetic resonance. Can. J. Chem. 43, 3057–3062 (1965).
Junior, V. L., Constantino, M. G., da Silva, G. V. J., Neto, Al. C. & Tormena, C. F. NMR and theoretical investigation of the keto-enol tautomerism in cyclohexane-1,3-diones. J. Mol. Struct. 828, 54–58 (2007).
Alkorta, I., Griffiths, M. Z. & Popelier, P. L. A. Relationship between experimental pK_a values in aqueous solution and a gas phase bond length in bicyclo[2.2.2]octane and cubane carboxylic acids. J. Phys. Org. Chem. 26, 791–796 (2013).
Alkorta, I. & Popelier, P. L. A. Linear free energy relationships between a single gas-phase ab initio equilibrium bond length and experimental pK_a values in aqueous solution. ChemPhysChem 16, 465–469 (2015).
Anstöter, C., Caine, B. A. & Popelier, P. L. A. The AIBLHiCoS method: predicting aqueous pK_a values from gas-phase equilibrium bond lengths. J. Chem. Inf. Model. 56, 471–483 (2016).
Dardonville, C. et al. Substituent effects on the basicity (pK_a) of aryl guanidines and 2-(arylimino)imidazolidines: correlations of pH-metric and UV-metric values with predictions from gas-phase ab initio bond lengths. N. J. Chem. 41, 11016–11028 (2017).
Caine, B. A., Dardonville, C. & Popelier, P. L. A. Prediction of aqueous pK_a values for guanidine-containing compounds using ab initio gas-phase equilibrium bond lengths. ACS Omega 3, 3835–3850 (2018).
Caine, B. A., Bronzato, M. & Popelier, P. L. A. Experiment stands corrected: accurate prediction of the aqueous pK_a values of sulfonamide drugs using equilibrium bond lengths. Chem. Sci. 10, 6368–6381 (2019).
Harding, A. P. & Popelier, P. L. A. pK_a Prediction from an ab initio bond length: Part 2—phenols. Phys. Chem. Chem. Phys. 13, 11264–11282 (2011).
Harding, A. P. & Popelier, P. L. A. pK_a prediction from an ab initio bond length: Part 3—benzoic acids and anilines. Phys. Chem. Chem. Phys. 13, 11283–11293 (2011).
Kirby, A. J. Crystallographic approaches to transition state structures. Adv. Phys. Org. Chem. 29, 87–183 (1994).
Green, A. J., Giordano, J. & White, J. M. Gauging the donor ability of the C–Si bond. Results from low-temperature structural studies of gauche and antiperiplanar β-trimethylsilylcyclohexyl esters and ethers by use of the variable oxygen probe. Aust. J. Chem. 53, 285–292 (2000).
Davies, J. E., Doltsinis, N. L., Kirby, A. J., Roussev, C. D. & Sprik, M. Estimating pK_a values for pentaoxyphosphoranes. J. Am. Chem. Soc. 124, 6594–6599 (2002).
Soriano, E., Cerdan, S. & Ballesteros, P. Computational determination of pK_a values. A comparison of different theoretical approaches and a novel procedure. J. Mol. Struct. 684, 121–128 (2004).
Xing, L., Glen, R. C. & Clark, R. D. Predicting pK_a by molecular tree structure fingerprints and PLS. J. Chem. Inf. Comput. Sci. 43, 870–879 (2003).
Goodarzi, M., Freitas, M. P., Wu, C. H. & Duchowicz, P. R. pK_a modeling and prediction of a series of pH indicators through genetic algorithm-least square support vector regression. Chemometrics Intell. Lab. Syst. 101, 102–109 (2010).
Harding, A. P., Wedge, D. C. & Popelier, P. L. A. pK_a prediction from “quantum chemical topology” descriptors. J. Chem. Inf. Model. 49, 1914–1924 (2009).
Gasteiger, J. & Marsili, M. A new model for calculating atomic charges in molecules. Tetrahedron Lett. 19, 3181–3184 (1978).
Lee, D. L. et al. The structure–activity relationships of the triketone class of HPPD herbicides. Pestic. Sci. 54, 377–384 (1998).
Frisch, M. J. et al. GAUSSIAN09, Revision B.01 (Gaussian, Inc., Wallingford, CT, 2009).
Bader, R. F. W. Atoms in Molecules. A Quantum Theory. (Oxford Univ. Press, Oxford, 1990).
Maxwell, P., Martín Pendás, A. & Popelier, P. L. A. Extension of the interacting quantum atoms (IQA) approach to B3LYP level density functional theory. Phys. Chem. Chem. Phys. 18, 20986–21000 (2016).
Thacker, J. C. R. & Popelier, P. L. A. The ANANKE relative energy gradient (REG) method to automate IQA analysis over configurational change. Theor. Chem. Acc. 136, 86 (2017).
Thacker, J. C. R. & Popelier, P. L. A. Fluorine gauche effect explained by electrostatic polarization instead of hyperconjugation: an interacting quantum atoms (IQA) and relative energy gradient (REG) study. J. Phys. Chem. A 122, 1439–1450 (2018).
Thacker, J. C. R., Vincent, M. A. & Popelier, P. L. A. Using the relative energy gradient method with interacting quantum atoms to determine the reaction mechanism and catalytic effects in the peptide hydrolysis in HIV-1 protease. Chem. Eur. J. 14, 11200–11210 (2018).
Wilson, A. L. & Popelier, P. L. A. Exponential relationships capturing atomistic short-range repulsion from the interacting quantum atoms (IQA) method. J. Phys. Chem. A 120, 9647–9659 (2016).
AIMAll. Todd A. Keith (TK Gristmill Software, Overland Park, KS, USA, 2014) (aim.tkgristmill.com).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

Acknowledgements

P.L.A.P. thanks the EPSRC for Fellowship funding (EP/K005472). P.L.A.P. and B.A.C. thank the BBSRC for funding B.A.C.'s PhD studentship under the “iCASE” award BB/L016788/1 (with a contribution from Syngenta Ltd) and for funding a subsequent postdoctoral position with Impact Acceleration funding (IAA_105) (with a contribution from Lhasa Ltd). C.D. thanks the Ministerio de Ciencia, Innovación y Universidades (MCIU/AEI/FEDER, UE; grant RTI2018-093940-B-I00).

Author information

Authors and Affiliations

Department of Chemistry, University of Manchester, Manchester, UK
Beth A. Caine & Paul L. A. Popelier
Manchester Institute of Biotechnology (MIB), 131 Princess Street, Manchester, UK
Beth A. Caine & Paul L. A. Popelier
Syngenta AG, Jealott’s Hill, Warfield, Bracknell, RG42 6E7, UK
Maddalena Bronzato, Torquil Fraser & Nathan Kidley
Instituto de Química Médica, IQM–CSIC, C/Juan de la Cierva 3, Madrid, 28006, Spain
Christophe Dardonville

Contributions

B.C. performed all computational work and analysis aided by P.L.A.P. who oversaw the entire study. N.K. and T.F. procured experimental pK_a data for tk1-tk15, tk18 and tk19, dk1-dk12, dk22-dk29. M.B. performed experimental pK_a measurements for compounds o1-o8 and C.D. performed experimental pK_a measurements for dk13-dk21, tk16 and tk17. B.C. prepared the paper and SI, which were reviewed by all authors.

Corresponding author

Correspondence to Paul L. A. Popelier.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Peer Review File

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Caine, B.A., Bronzato, M., Fraser, T. et al. Aqueous pK_a prediction for tautomerizable compounds using equilibrium bond lengths. Commun Chem 3, 21 (2020). https://doi.org/10.1038/s42004-020-0264-7

Received: 04 November 2019
Accepted: 16 January 2020
Published: 12 February 2020
DOI: https://doi.org/10.1038/s42004-020-0264-7
