Solubility prediction remains a critical challenge in drug development, synthetic route and chemical process design, extraction and crystallisation. Here we report a successful approach to solubility prediction in organic solvents and water using a combination of machine learning (ANN, SVM, RF, ExtraTrees, Bagging and GP) and computational chemistry. Rational interpretation of dissolution process into a numerical problem led to a small set of selected descriptors and subsequent predictions which are independent of the applied machine learning method. These models gave significantly more accurate predictions compared to benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in training data (LogS ± 0.7). Finally, they reproduced physicochemical relationship between solubility and molecular properties in different solvents, which led to rational approaches to improve the accuracy of each models.
Solubility is a critical physical property of organic compounds in drug development, e.g., availability, distribution, metabolism, excretion and toxicity (ADMET)1,2, protein engineering3,4,5, chemical process design6, synthetic route prediction7,8, extraction and crystallisation9,10. Due to its importance in environmental predictions, biochemistry, and agrochemical and drug design, aqueous solubility prediction has been the subject of intensive research11. Previous solubility prediction approaches include fragment-based semi-empirical methods, e.g. general solubility equation12, UNIFAC13, thermodynamic cycle14, thermodynamic parameters, e.g., UNIQUAC15,16, Hansen solubility parameters and Hildebrandt solubility parameters17,18, different molecular dynamics methods19,20,21, and first principle ab initio calculations (COSMO-RS)22,23. More recent developments focused on quantitative structure-activity/property relationship (QSAR/QSPR)24,25, through statistical analysis and machine learning techniques26,27,28. Despite these advances, accurate prediction of solubility is still a major scientific challenge, as exemplified by the two solubility challenges issued to the research community in 2008 and 201929,30. This is due to the complex nature of dissolution process, which involves lattice/sublimation energy, solvation energy, ionisation of solute and solution phase interactions. Each of these is a challenging property to predict and can be quite computationally expensive31. Statistical and machine learning approaches often employ a large number of descriptors (>100)32, which has led to difficulties in interpreting and rationally improving prediction models33. Finally, prediction is hindered by the poor quality of experimental solubility data34, which are affected by measurement techniques, and purity of solute and solvents.
In this paper, we report our new approach to general solubility prediction in organic solvents, which has been understudied, and water using machine learning. In contrast to previous studies, a small number of descriptors (14 in contrast to the usual >100 descriptors employed in QSPR models) were rationally selected based on their relevance to the physicochemical aspects of dissolution process (Fig. 1a). Consequently, interpretable solubility prediction models, which reproduce physicochemical relationships between solubility and molecular properties in different solvents, with excellent accuracy were developed. Furthermore, these models were successfully validated against industrial targets and those of the solubility challenges29,30. Finally, our results were benchmarked against the AquaSol model26, EPI SuiteTM (the official tool of the EPA)35, and COSMOtherm as the standard ab initio tool for solubility prediction36.
Results and discussion
Solubility data were collected from Open Notebook Science Challenge aqueous solubility dataset and the Reaxys database. For this study, only solubility data of neutral solutes in single component solvents were collected. While aqueous solubility data are numerous, our search of the Reaxys database resulted in a limited amount of data in organic solvents (Fig. 1b). Thus, ethanol, benzene and acetone were chosen as the solvents in this study to maximise the amount of training data and to cover the entire range of solvent polarity. Although benzene is not a commonly employed solvent in modern chemistry, it represents non-polar solvents with sufficient data availability.
Analysis of LogS values for the collected solubility data showed that while the range for LogS (measured in M) in water is −12 to 2, those in organic solvents are typically in the range of −4 to 1. To provide a consistent comparison, a second aqueous solubility dataset (Water_set_narrow, LogS = −4 to 1) was created from the first dataset (Water_set_wide). Although an even distribution of LogS values across the range in each dataset is preferable for model training (Fig. 1b), given the limitation on data availability no trimming based on LogS was carried out for the organic solvent datasets (Ethanol_set, Benzene_set, and Acetone_set).
Molecular weight (MW) was found to be normally distributed for all datasets, centred on MW = 200 with few above MW = 500 (Supplementary Fig. 4). For this study, compounds with MW > 504 were excluded to keep computational costs reasonable while maintaining their relevance to synthetic intermediates in drug discovery/development37. Interestingly, the distributions of organic functional groups are similar between the datasets with the exception of a higher percentage of halides in Water_set_wide and Water_set_narrow and a higher percentage of sulphur containing compounds in Benzene_set (Supplementary Fig. 5). A wide range of functional groups were found including halogen, 3- and 4-membered rings, although B and Si containing compounds, which may be valuable synthetic intermediates, were absent.
Thus, five open-access and curated solubility datasets were created for this study. Three are unique solubility datasets in organic solvents. Each of these was split into a training set and a validation set by LogS binning (Supplementary Note 4.1) and a randomly even distribution of data to ensure the representative nature of the validation set.
Descriptor development: In order to develop interpretable predictive models for solubility in different solvents, a small set of molecular descriptors, which represent solute-solute and solute-solvent interactions, was selected. This small set of descriptors will also benefit the statistical robustness of the models given the relatively small size of the datasets. All 22 descriptors are summarised in Table 1, covering sum of thermal and electronic energies of the solute molecule, solvation energy, orbital interaction between solute and solvent, dipole moment and charge distribution in the solute molecule, molecular volume, Solvent Accessible Surface Area, molecular weight and the number of atoms of the solute. Finally, the experimental melting point was included as a reflection of the sublimation energy of the solid form of the solute. Melting point prediction is still highly inaccurate (RMSE ≈ 38 °C), rendering experimental values a necessity38.
Correlation between the calculated descriptors were analysed and summarised in Fig. 2a. The only observed significant correlations were those expected between E0_gas, E0_solv, DeltaE0_sol, G_gas, G_sol and DeltaG_sol, between gas_dip and solv_dip, and between HOMO, LUMO, LsoluHsolv, and LsolvHsolu. Similarly, the scree plots indicated >10 principal components were needed to capture most of the variations in the descriptors (Fig. 2b). Using an acceptable threshold of correlation R2 ≤ 0.9, the descriptors N_atoms, E0_gas, E0_solv, DeltaE0_sol, G_gas, gas_dip, HOMO and LUMO (Table 1, in bold font) were removed. Consequently, the trimmed down set of 14 descriptors (white background) was taken forward for solubility prediction models.
Metrics for predictive models: In order to build and improve predictive models, reliable metrics to evaluate their accuracy and reliability are essential. Common practice in machine learning relies on R2 and RMSE to evaluate models. Both these values are highly dependent on the LogS range the model is applied to. Furthermore, despite consistency within in-house small datasets29, a typical experimental error of ± 0.5–0.7 for LogS in literature aqueous solubility measurements has been established by Mitchell and Palmer34. These are due to variations in pH, temperature and purity of solvents. Such errors in the training set render R2 and RMSE less reliable in evaluating solubility predictive models. Consequently, two new metrics were created for our evaluation: % of predictions within LogS ± 0.7 and within LogS ± 1.0 of experimental values (%LogS ± 0.7 and %LogS ± 1.0). The former reflects the maximum accuracy of the model based on the available data and the latter the limits of the usefulness of the model as a guiding tool for process/product development.
Evaluation and interpretation of models: Eight machine learning methods, i.e. MLR (Multiple Linear Regression), PLS (Partial Least Square), ANN (Artificial Neural Network), SVM (Support Vector Machine), GP (Gaussian Process), RF (Random Forest), ET (Extra Trees) and Bag (Bagging), were applied to all 5 datasets. Deep Neural Networks were not considered due to the small size of the datasets. Parameters for each model were optimised to maximise accuracy and avoid overfitting (Supplementary Note 4). The optimisation and cross-validation results are included in the Supplementary Notes. The split-test model metrics are summarised in Table 2.
All four metrics (R2, RMSE, %LogS ± 0.7, %LogS ± 1.0) clearly showed that linear regression models (MLR and PLS) perform poorly in solubility prediction compared to non-linear models. Importantly, across the five datasets, the performances of the five non-linear models are quite comparable and consistently good. In most cases, the standard deviations between their metrics are very small. The only exceptions are SVM, which gave notably better %LogS ± 0.7 with Water_set_wide and Acetone_set, and GP with Water_set_narrow. These suggested that the overall accuracy of these predictions is less dependent on the machine learning model and is more dependent on the descriptors and data quality. This is further supported by good agreement (R2 > 0.9) between individual predictions from each of the six non-linear methods (Supplementary Figs. 45–49). Consistent with this hypothesis, the R2 and RMSE metrics for the models for Ethanol_set and Acetone_set are much poorer compared to those of Water_set_wide, Water_set_narrow and Benzene_set, despite little decrease in %LogS ± 0.7 and %LogS ± 1.0. These reflect the quality of experimental solubilities in ethanol and acetone and the poor reliability of R2 and RMSE. Both solvents are often contaminated with water and acetone is a volatile solvent (b.p. 56 °C), leading to larger experimental errors in solubility measurements.
The non-linear models coped well with these datasets, with %LogS ± 0.7 = 60-80 and %LogS ± 1.0 = 74-90 for (LogS = −4–1), maintaining their effectiveness as predictive models for novel compounds. The best models were obtained with Benzene_set, with the highest %LogS ± 0.7 = 79.8 and %LogS ± 1.0 = 90.4. When the predicted errors for each solubility by GP was included, very high values of %LogS ± 0.7 > 91.6 and %LogS ± 1.0 > 93.5 were obtained, further supporting our hypothesis that the accuracy of the predictions was limited by the descriptors and training data themselves. The distributions of predicted errors for each prediction using GP are shown in Fig. 2c–g, confirming the inherent errors LogS ± 0.7 in the training data. Finally, there was an expected deterioration of the R2 metric, although the other three metrics were improved, moving from Water_set_wide (LogS = −12–2, with R2 value comparable to those achieved by other methods)28,32,39,40 to Water_set_narrow (LogS = −4–1). When the predictions for Water_set_wide were narrowed to LogS = −4 –1, the obtained metrics are very similar to those of Water_set_narrow (Table 2, entries 5-8).
Analysis of the outliers in each model using ET algorithm (Supplementary Note 4.9), chosen for its consistent performance with all datasets, showed that they often include acidic and basic functional groups, extended conjugate/aromatic system, azo group, long and flexible carbon chains, or high density of polar functional groups. These are less well presented in the training data. The distribution of LogS of outliers and the BertzCT complexity descriptor for the inliers and outliers (Supplementary Fig. 56)41 also indicated that the outliers are on average more complex than the inliers and their LogS values are more likely at the limits of the LogS range, as expected from the uneven distribution LogS values in the datasets.
The interpretability of the models is one of the key aspects of their validation. As the six non-linear methods produced comparable results, the analysis was again carried out for the ET models. The effect of leaving one descriptor out on the model metrics were evaluated for all five datasets (Fig. 2i and Supplementary Figs. 59–61). Similar trends were observed for all 4 metrics: (i) minor changes for Water_set_wide and Water_set_narrow, and (ii) significant decrease in accuracy with Ethanol_set, Benzene_set, and Acetone_set when melting points are excluded. This decrease is more pronounced with benzene than with the two polar solvents, ethanol and acetone.
Furthermore, feature importance plots of the 5 ET models showed very high dependence of the models for Ethanol_set, Benzene_set, and Acetone_set on melting point (Fig. 2j and Supplementary Fig. 62). The models for Water_set_wide and Water_set_narrow showed a more even distribution of importance across all the descriptors. In solvents other than benzene, MW, molar volume, SASA, charges on heteroatoms, which are linked to solvent-solute interactions, were also given high importance (Supplementary Figs. 59–62). These analyses showed crucial insights into the factors controlling solubility in the four solvents in this study. Aqueous solubility is dominated by solvation energy and solvent-solute interactions, due to the high polarity of water and its capability for hydrogen bonding42. Thus, the importance of melting point as a descriptor is low. In contrast, solubility in organic solvents is dominated by solute-solute interactions in the solid form, i.e. sublimation energy. Consequently, the models showed strong dependence on melting point, which is the only descriptor included to explicitly describe the solid state of the solute. As solvent-solute interaction is weaker in benzene, with only Van der Waal forces being available, the impact of removing melting point from the descriptor is more pronounced. Thus, the prediction models showed strong agreement with our understanding of the physical aspects of the dissolution process.
Finally, the 14 descriptors were recalculated using the semi-empirical method PM6 in order to evaluate the impact of the lower computational cost to the accuracy of these models. The %LogS ± 0.7 and %LogS ± 1.0 metrics for PM6 models are similar to those of the DFT models with a few exceptions for datasets with LogS = –1 to 4 (Supplementary Table 23), with the exception of the models for Water_set_wide. With the highest quality dataset Benzene_set, all metrics for PM6 and DFT models are nearly identical. The total CPU time for PM6 calculation of descriptors of 394 compounds is 219 hours, compared to 5458 hours for DFT descriptors.
Improvement of the models: While Fig. 2h indicated that the accuracy of our predictions is close to that of the training data, the values for %LogS ± 0.7 can still be improved. Based on our hypothesis that the predictions are more dependent on the descriptors than on the machine learning method, those which have the highest impacts were considered for improvement. SASA depends on the size of the probe and the conformer being measured, but the variation is small. MW, molar volume, and m.p. are fixed for each molecule, leaving charge descriptors and solvation energy (for Water_set_wide and Water_set_narrow). Thus, four methods were evaluated to rationally improve the models: (i) by inclusion of conformers; (ii) by inclusion of descriptors for the molecular charge surface; (iii) by using more accurate calculation of the solvation energy (in water only) and (iv) by consensus of predictions.
Inclusion of conformers (PM6, descriptors averaged by population) did not result in any significant improvement to the model metrics (Supplementary Table 25). Boltzmann distribution based on the free energy of conformers indicated that most molecules have one stable conformer which accounts for more than 90% of the population, negating the potential benefit this approach. Descriptors for the charge isosurface (95% of the electron density, Supplementary Note 2.2.3) were included with the original 14. The only strong correlations within this new set of 27 descriptors were between Area2, Area3 and SASA as expected. While some improvements were observed with the metrics of the models for Acetone_set (Supplementary Table 17), the new models generally gave similar results with much larger computational cost.
Jensen and co-workers recently demonstrated that HF/SMD (Solvation Model Density) give more accurate aqueous solvation energy than other methods, e.g. IEFPCM and COSMO43. Thus, we recalculated G_solv and DeltaG_sol using the HF/SMD method and used these new descriptors to rebuild prediction models for Water_set_wide and Water_set_narrow (Table 3). Significant improvements to %LogS ± 0.7 and %LogS ± 1.0 were observed with Water_set_wide for all six machine learning methods. Notably, %LogS ± 0.7 increased 9.5% with ANN and 7.4% with Bag. For most models, approximately 70% of the predictions are within LogS ± 0.7, as accurate as the training data. The improvements obtained with Water_set_narrow were less significant, but the metrics are consistently better than those obtained with DFT/PCM method.
Finally, the similarity between predictions from different models (Supplementary Figs. 45–49) suggests that the few wrong predictions can be compensated through a wisdom-of-crowd approach44. Consequently, the consensus predictions were carried out for each compound in the validation set as the average and median of the predictions using ANN, SVM, GP and ET. The results are summarised in Supplementary Table 20. The predictions for all narrow datasets (LogS = -4 to 1) showed improved metrics compared to those of ET models. The consensus mean predictions are slightly better than the consensus median predictions, consistent with our assessment that the predictions from all four methods are very similar, with few outliers. Furthermore, the wrong predictions are not too different from the experimental LogS values, negating the benefit of median over mean. The best performance was observed with Benzene_set, with 82.0% of the predicted solubilities inside LogS ± 0.7 and 90.4% inside LogS ± 1.0 (Supplementary Figs. 66 and 67).
Benchmarking and external datasets: Our models were compared with standard prediction tools used in academia and industry, employing the same evaluation datasets. For aqueous solubility, AquaSol, which was developed based on undirected graph recursive neural networks26, gave less accurate predictions than our ET model, particularly at lower LogS values. EPI Suite, a fragment-based tool35, performed even more poorly as expected. Similarly poor results were obtained with COSMOtherm by COSMOlogic45,46. For solubilities in ethanol, benzene and acetone, COSMOtherm predictions were compared with our ET models. In all three cases, COSMOtherm produced significantly larger errors in its predictions, with multiple outliers. The results are summarised in Fig. 3.
While our models performed better than the established tools, a more rigorous test should be an application of the models to new unrelated test sets. For this purpose, the solubilities of three sets of compounds from AstraZeneca (in water, ethanol and acetone, without m.p. for a fair comparison against COSMOtherm) were evaluated against their measured values (Fig. 3n–p). As benchmarks, COSMOtherm was employed to predict solubilities for the same compounds and the results are shown in (Fig. 3q–s). The accuracy of water solubility predictions using our ET model decreased compared to those of the validation set, consistent with the increased in complexity and higher frequency of functional groups in these compounds (Fig. 3m). However, predictions made by COSMOtherm are much less accurate than ours in all three solvents. Importantly, all predictions made by ET models in ethanol and acetone (when m.p. is included, see Supplementary Table 32) were within LogS ± 1.0, albeit with small test sets.
In conclusion, we report the development, evaluation and improvement of interpretable solubility prediction models in organic solvents and water based on judicious interpretation of the dissolution phenomenon into numerical representations through physicochemical relationships. This approach, which we named Causal Structure-Property Relationship (CSPR), enabled the use of a small set of carefully selected descriptors and smaller training datasets compared to models which employ deep neural networks. Our models gave significantly more accurate predictions compared to benchmarked open-access and commercial tools, achieving accuracy close to the expected level of noise in training data (LogS ± 0.7). Importantly, they reproduced physicochemical relationship between solubility and molecular properties in different solvents, which led to rational approaches to improve the accuracy of each models.
Solubility data in water and ethanol were taken from Open Notebook Science Challenge aqueous solubility dataset. Further solubility data in ethanol and other solvents were mined from the Reaxys database. Solubilities measured at temperature specified outside the 14-30 °C range were discarded. Each compound was identified by its InChIKey and analysed using SMILES code. Where multiple measurements were acquired for a molecule, obvious outliers (LogS ± 1.0 from 2 or more measurements) and polymorphs were excluded and the median value of the remaining measurements was taken. For this study, only solubility data of neutral solutes in single component solvents were collected. Melting points were collected from Reaxys and ChemSpider databases. Initial 3D coordinates were generated with CIRpy47. Molecules were optimised in gas phase with B3LYP/6-31 + G(d) method using Gaussian 0948. The solution phase optimisation was carried out with an implicit polarisable continuum solvent model (IEFPCM) or solvation model based on electron density (SMD), pre-parametrised for each solvent.
Initial 3D structures were generated with Corina software and then optimised at BP-TZVPD-FINE DFT level in COSMOConf45 to create the requisite input files for COSMOtherm. COSMOtherm was used to calculate the solubility, where the sublimation energy was estimated via the inbuilt QSPR protocol instead of reference solubility data.
For machine learning, data was pre-processed by scaling descriptors to between 0 and 1, using the Python/scikit-learn standard scaler protocol. MLR, PLS, ANN, SVM, RF, ET, and BG were performed using scikit-learn. GP models were built using GPy platform with error bars obtained to 1 standard deviation by finding the upper and lower limits for the predictions which encompassed 68% of the prediction distribution. In all cases, radial basis function (rbf) was the best kernel. For correlation between descriptors, Pearson’s R2 was calculated pairwise for each descriptor combination using scipy python module. These were plotted in 2×2 matrices as heat maps.
The datasets from open literature, including calculated descriptors, in this manuscript can be downloaded from this link: https://doi.org/10.5281/zenodo.3686212 Citations should refer directly to this manuscript.
Relevant Python code are included in Supplementary Note 7: Python code examples.
Bergström, C. A. S. & Larsson, P. Computational prediction of drug solubility in water-based systems: qualitative and quantitative approaches used in the current drug discovery and development setting. Int. J. Pharm. 540, 185–193 (2018).
Bergström, C. A. S., Charman, W. N. & Porter, C. J. H. Computational prediction of formulation strategies for beyond-rule-of-5 compounds. Adv. Drug Deliv. Rev. 101, 6–21 (2016).
Khurana, S. et al. DeepSol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics 34, 2605–2613 (2018).
Sormanni, P., Aprile, F. A. & Vendruscolo, M. The CamSol method of rational design of protein mutants with enhanced solubility. J. Mol. Biol. 427, 478–490 (2015).
Hebditch, M., Carballo-Amador, M. A., Charonis, S., Curtis, R. & Warwicker, J. Protein–Sol: a web tool for predicting protein solubility from sequence. Bioinformatics 33, 3098–3100 (2017).
Diorazio, L. J., Hose, D. R. J. & Adlington, N. K. Toward a more holistic framework for solvent selection. Org. Process Res. Dev. 20, 760–773 (2016).
Carter, H. L. et al. Rapid route design of AZD7594. React. Chem. Eng. 4, 1658–1673 (2019).
Baumann, M. & Baxendale, R. I. An overview of the synthetic routes to the best selling drugs containing 6-membered heterocycles. Beilstein J. Org. Chem. 9, 2265–2319 (2013).
Loschen, C. & Klamt, A. Solubility prediction, solvate and cocrystal screening as tools for rational crystal engineering. J. Pharm. Pharmacol. 67, 803–811 (2015).
Sheikholeslamzadeh, E. & Rohani, S. Solubility prediction of pharmaceutical and chemical compounds in pure and mixed solvents using predictive models. Ind. Eng. Chem. Res. 51, 464–473 (2012).
Delaney, J. S. Predicting aqueous solubility from structure. Drug Discov. Today 10, 289–295 (2005).
Ran, Y. & Yalkowsky, S. H. Prediction of drug solubility by the general solubility equation (GSE). J. Chem. Inf. Comput. Sci. 41, 354–357 (2001).
Fredenslund, A., Jones, R. L. & Prausnitz, J. M. Group-contribution estimation of activity coefficients in nonideal liquid mixtures. AIChE J. 21, 1086–1099 (1975).
Palmer, D. S., McDonagh, J. L., Mitchell, J. B. O., van Mourik, T. & Fedorov, M. V. First-principles calculation of the intrinsic aqueous solubility of crystalline druglike molecules. J. Chem. Theory Comput. 8, 3322–3337 (2012).
Abrams, D. S. & Prausnitz, J. M. Statistical thermodynamics of liquid mixtures: a new expression for the excess Gibbs energy of partly or completely miscible systems. AIChE J. 21, 116–128 (1975).
Maurer, G. & Prausnitz, J. M. On the derivation and extension of the uniquac equation. Fluid Phase Equilib. 2, 91–99 (1978).
Hansen, C. M. Hansen Solubility Parameters: A User’s Handbook, Second Edition. (CRC Press, 2007).
Hildebrand, J. H. Solubility of non-electrolytes. Nature 138, 742 (1936).
Li, L., Totton, T. & Frenkel, D. Computational methodology for solubility prediction: application to the sparingly soluble solutes. J. Chem. Phys. 146, 214110 (2017).
Boothroyd, S. & Anwar, J. Solubility prediction for a soluble organic molecule via chemical potentials from density of states. J. Chem. Phys. 151, 184113 (2019).
Luder, K., Lindfors, L., Westergren, J., Nordholm, S. & Kjellander, R. In silico prediction of drug solubility. 3. Free energy of solvation in pure amorphous matter. J. Chem. Phys. B. 111, 7303 (2007).
Klamt, A. Conductor-like screening model for real solvents: a new approach to the quantitative calculation of solvation phenomena. J. Phys. Chem. 99, 2224–2235 (1995).
Klamt, A., Eckert, F., Hornig, M., Beck, M. E. & Bürger, T. Prediction of aqueous solubility of drugs and pesticides with COSMO-RS. J. Comput. Chem. 23, 275–281 (2002).
Yu, X., Wang, X., Wang, H., Li, X. & Gao, J. Prediction of solubility parameters for polymers by a QSPR model. QSAR Comb. Sci. 25, 156–161 (2006).
Duchowicz, P. R. & Castro, E. A. QSPR studies on aqueous solubilities of drug-like compounds. Int. J. Mol. Sci. 10, 2558–2577 (2009).
Lusci, A., Pollastri, G. & Baldi, P. Deep architectures and deep learning in chemoinformatics: the prediction of aqueous solubility for drug-like molecules. J. Chem. Inf. Model. 53, 1563–1575 (2013).
Huuskonen, J., Salo, M. & Taskinen, J. Aqueous solubility prediction of drugs based on molecular topology and neural network modeling. J. Chem. Inf. Comput. Sci. 38, 450–456 (1998).
Deng, T. & Jia, G. Prediction of aqueous solubility of compounds based on neural network. Mol. Phys. 118:2, https://doi.org/10.1080/00268976.2019.1600754 (2019).
Llinàs, A., Glen, R. C. & Goodman, J. M. Solubility challenge: can you predict solubilities of 32 molecules using a database of 100 reliable measurements? J. Chem. Inf. Model. 48, 1289–1303 (2008).
Llinas, A. & Avdeef, A. Solubility challenge revisited after ten years, with multilab shake-flask data, using tight (SD < 0.17 log) and loose (SD < 0.62 log) test sets. J. Chem. Inf. Model. 59, 3036–3040 (2019).
Skyner, R. E., McDonagh, J. L., Groom, C. R., van Mourik, T. & Mitchell, J. B. O. A review of methods for the calculation of solution free energies and the modelling of systems in solution. Phys. Chem. Chem. Phys. 17, 6174–6191 (2015).
Palmer, D. S., O’Boyle, N. M., Glen, R. C. & Mitchell, J. B. O. Random forest models to predict aqueous solubility. J. Chem. Inf. Model. 47, 150–158 (2007).
Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat. Mach. Intell. 1, 206–215 (2019).
Palmer, D. S. & Mitchell, J. B. O. Is experimental data quality the limiting factor in predicting the aqueous solubility of druglike molecules? Mol. Pharm. 11, 2962–2972 (2014).
Seung Lim, J. EPI Suite: a fascinate predictive tool for estimating the fates of organic contaminants. J. Bioremediat. Biodegrad. 7, e171 (2016).
Klamt, A. & Schüürmann, G. COSMO: a new approach to dielectric screening in solvents with explicit expressions for the screening energy and its gradient. J. Chem. Soc. Perkin Trans. 2, 799–805 (1993).
Lipinski, C. A. Lead- and drug-like compounds: the rule-of-five revolution. Drug Discov. Today Technol. 1, 337–341 (2004).
Tetko, I. V. et al. How accurately can we predict the melting points of drug-like compounds? J. Chem. Inf. Model. 54, 3320–3329 (2014).
Huuskonen, J. Estimation of aqueous solubility for a diverse set of organic compounds based on molecular topology. J. Chem. Inf. Comput. Sci. 40, 773–777 (2000).
Yan, A. & Gasteiger, J. Prediction of aqueous solubility of organic compounds based on a 3D structure representation. J. Chem. Inf. Comput. Sci. 43, 429–434 (2003).
Bertz, S. H. The first general index of molecular complexity. J. Am. Chem. Soc. 103, 3599–3601 (1981).
Thompson, J. D., Cramer, C. J. & Truhlar, D. G. Predicting aqueous solubilities from aqueous free energies of solvation and experimental or calculated vapor pressures of pure substances. J. Chem. Phys. 119, 1661–1670 (2003).
Kromann, J. C., Steinmann, C. & Jensen, J. H. Improving solvation energy predictions using the SMD solvation method and semiempirical electronic structure methods. J. Chem. Phys. 149, 104102 (2018).
Boobier, S., Osbourn, A. & Mitchell, J. B. O. Can human experts predict solubility better than computers? J. Cheminform. 9, 63 (2017).
COSMOtherm, Release 19; COSMOlogic GmbH & Co. KG, http://www.cosmologic.de.
Eckert, F. & Klamt, A. Fast solvent screening via quantum chemistry: COSMO‐RS approach. AIChE J. 48, 369–385 (2002).
CIRpy, Python interface for the Chemical Identifier Resolver (CIR). Available at: http://cactus.nci.nih.gov/chemical/structure. (Accessed: 1st January 2019)
Gaussian 09, Revision D.03, M. J. Frisch et al., Gaussian, Inc., Wallingford CT, 2016 (full citation in Supplementary).
Pedregosa, F. et al. Scikit-learn: machine learning in Python. J. Mach. Learn. Res. 12, 2825–2830 (2011).
GPy: A. Gaussian process framework in python. Available at http://github. com/SheffieldML/GPy (Accessed: 1st January 2019).
This work was undertaken on ARC2, ARC3 and ARC4, part of the High Performance Computing facilities at the University of Leeds, UK. We also thank Simon Yates and Holly Carter for help obtaining experimental solubility measurements from AstraZeneca compounds, and Dr David Buttar and Dr Simone Tomasi for help with the computational work at AstraZeneca. S.B. thanks AstraZeneca and the EPSRC for his iCASE studentship.
The authors declare no competing interests.
Peer review information Nature Communications thanks John Mitchell and Raghunathan Rengaswamy for their contribution to the peer review of this work.
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Boobier, S., Hose, D.R.J., Blacker, A.J. et al. Machine learning with physicochemical relationships: solubility prediction in organic solvents and water. Nat Commun 11, 5753 (2020). https://doi.org/10.1038/s41467-020-19594-z