Predicting the Enthalpy and Gibbs Energy of Sublimation by QSPR Modeling

The enthalpy and Gibbs energy of sublimation are predicted using quantitative structure property relationship (QSPR) models. In this study, we compare several approaches previously reported in the literature for predicting the enthalpy of sublimation. These models, which were reproduced successfully, exhibit high correlation coefficients, in the range 0.82 to 0.97. There are significantly fewer examples of QSPR models currently described in the literature that predict the Gibbs energy of sublimation; here we describe several models that build upon the previous models for predicting the enthalpy of sublimation. The most robust and predictive model constructed using multiple linear regression, with the fewest number of descriptors for estimating this property, was obtained with an R2 of the training set of 0.71, an R2 of the test set of 0.62, and a standard deviation of 9.1 kJ mol−1. This model could be improved by training using a neural network, yielding an R2 of the training and test sets of 0.80 and 0.63, respectively, and a standard deviation of 8.9 kJ mol−1.

where Δ°H f (s) , Δ°H f (l) and Δ°H f (g) are the standard enthalpy of formation of the solid, liquid and gas, respectively, and Δ°H sub and Δ°H vap are the enthalpies of sublimation and vaporization 1 , respectively. Thus, while the gas-phase enthalpies of formation can be predicted to high accuracy with quantum mechanical methods, the prediction of enthalpies and Gibbs energies associated with phase changes is generally the realm of empirical approaches, particularly quantitative structure property relationship (QSPR) methods. There are several examples of QSPR model development for predicting enthalpies and Gibbs energies of sublimation and vaporization reported in the literature.
Modelling of the enthalpy of sublimation is an exemplar of the success of QSPR methods. The earliest attempts focused their attention on very small training sets, such as the CoMFA analysis of 30 polycyclic aromatic hydrocarbons by Welsh et al. 2 , and the study by Politzer et al. 3 of 34 organic compounds. The squared correlation coefficient (R 2 ) of these models was 0.82 and 0.95, respectively. The Politzer model used only two descriptors, the molecular surface area and a charge 'balance parameter' based on the surface electrostatic potential. Gharagheizi developed a model with an R 2 of 0.97 using a larger and more diverse training set of 1079 compounds with five descriptors 4 ; this model, however, has been criticized for not being generalizable and for using highly correlated descriptors 5  topological polar surface area (TPSA) and the number of hydroxyl groups (nROH). More recently, Salahinejad et al. developed a model using a large heterogeneous data set of 1302 compounds with four descriptors including the fractional charged partial surface area (FPSA 3 ), the polar surface area (PSA), the molecular volume (V) and a parameter describing the hydrophilicity (W1), resulting in a model with an R 2 of 0.95 6 .
Here we review the performance of these recently reported methods for the prediction of the enthalpy of sublimation. We use a single training set to re-derive each model and compare these new models with those obtained previously using different training sets. The purpose of this review is to establish whether there is any strong dependence of each of the models on the contents of the original training dataset.
There have been significantly fewer attempts reported in the literature of QSPR models for predicting the Gibbs energy of sublimation. Perlovich and Raevsky developed a model with three descriptors, the molecular polarizability, and hydrogen bond donor and hydrogen bond acceptor factors 7 ; the latter two descriptors are available within the HYBOT software package. Models for both the enthalpy and Gibbs energy of sublimation were generated using the same set of descriptors; the training sets consisted of 1316 and 686 compounds, respectively, yielding models with an R 2 of 0.66 and 0.60, respectively.
In this study, we have applied QSPR techniques for the prediction of the Gibbs energy of sublimation. If the models for predicting the enthalpy of sublimation reliably encode information that depicts this property, it should be possible to extend these models with information describing the entropy of sublimation to estimate the Gibbs energy of sublimation; here we explore how these models perform when extended to predict the Gibbs energy of sublimation. All QSPR models were developed using the BioPPSy package 8 .

Materials and Methods
A single set of 260 compounds with experimental values of enthalpy of sublimation was used to generate QSPR models of the enthalpy of sublimation 9 ; values of Δ sub H at triple point conditions had been compiled from the DIPPR 801 database and range from 30.6 to 224 kJ mol −1 10 . It is worth noting that only 25 of the compounds in this dataset appear in the more recent compendium by Acree and Chickos 11 . For the Gibbs energies of sublimation the compilation by Perlovich and Raevsky was used 7 ; this is a carefully curated dataset compiled from data obtained using different methods and at different temperatures. Notably, this dataset includes a considerable number of compounds that are normally liquids (or even gases) at 298 K and have been included in the dataset by special accounting for temperature dependencies. This set of 278 compounds was randomly divided into a training set of 244 (with Gibbs energy of sublimation ranging from 5.67 to 120.2 kJ mol −1 ) and a test set of 34 (0.92-72.2 kJ mol −1 ).
The structures of all compounds were first optimized using the MS-DOCK program 12 to identify the global minimum conformation; this method uses the DOCK conformational search algorithm 13 with a scoring function based on the AMBER molecular mechanics force field for estimating the energy. The structures from this search were further minimized at the B3LYP/6-31G(d) level using the GAUSSIAN-09 program 14 . The gas-phase translational and rotational entropies, S trans,gas and S rot,gas , were obtained using GAUSSIAN-09, determined using standard statistical mechanics methods 15 .
The BioPPSy program was used to generate all QSPR models; all descriptors used in the analysis presented here are available as part of the BioPPSy package and conform to the specification in the compendium of descriptors by Todeschini and Consonni 16 . The descriptors used here include the hydrophilicity (Hy), molecular volume (V, Å 3 ), first Zagreb index (ZM1), solvation connectivity index (X1sol, 1 χ s ), number of hydroxyl groups (nROH), topological polar surface area (TPSA, Å 2 ) 17 , Randic-type eigenvector-based index from the van der Waals weighted distance matrix (VRv1) 18 , reciprocal distance sum Randic-like index (RDCHI), surface area (SA, Å 2 ), polar surface area (PSA, Å 2 ) and the fractional charged partial surface area (FPSA 3 ). The Politzer electrostatic variance parameters, σ − 2 and σ + 2 , 3 were calculated from the molecular electrostatic potential calculated at the B3LYP/6-31G(d) level calculated on the 0.001 a.u. electron density contour surface. From these parameters the total variance, Artificial Neural Networks and Support Vector Regression. In addition to multilinear regression (MLR) we also considered Artificial Neural Networks (ANNs) and support vector regression (SVR) approaches; these have received much attention in the literature [19][20][21][22][23][24][25][26] and are typically found to give a superior performance to MLR.
We implemented both ANNs and SVRs in BioPPSy by incorporating the machine learning package weka 26 . Our initial attempts, not shown, used a simple acceptance of the default parameter values given by weka. The resulting models often gave substandard fits to the training data and were unstable to testing data. However, it was reasonably straightforward to optimize these models, providing models of comparable performance compared with MLR when validating against the test data.
Optimization of the ANNs resulting in a lowering of the learning rate to 0.003 and momentum to 0.002 from the default values of 0.3 and 0.2, respectively, provided by weka. This effectively slows the learning rate for the ANN, which therefore required a corresponding amount of extra training time, measured in epochs; the number of epochs was increased from the default 500 to 500,000. The slower-learnt ANNs gave good fit and stable performance against non-training data.
The ANNs currently implemented in BioPPSy are all multilayer perceptrons with a single middle layer with half as many nodes as the input layer (the weka default). We have also followed the common practice of including an extra constant input "bias" node.
The ANNs were trained with a standard back propagation algorithm (available in weka), however, more stable networks exist, with neurons based on the radial basis function (RBF) 24,27 , which are more stable since they are guaranteed to reach the global minimum error surface 27 , or Bayesian neural networks 28,29 ; the inclusion of such networks in BioPPSy remains part of the future development of the software.
The SVR models used in this paper use the RBF kernel that is commonly used for regression problems 22,24,25 . Although not presented here, we also investigated SVR models with the polynomial kernel, but found their performance to be consistently slightly inferior to that of MLR, and with minimal sensitivity to parameter changes. With the RBF kernel, however, SVR was capable of good fits to the training data with stable performance under validation with a testing set, but only with suitable adjustment of the gamma parameter from 0.01 to 0.1.

Data availability.
The BioPPSy program and the sublimation datasets (training and test sets) are available from https://sourceforge.net/projects/bioppsy/.

Results and Discussion
Enthalpy of sublimation. One of the earliest QSPR models to predict the enthalpy of sublimation was described by Politzer et al. using a data set composed of 34 compounds 3 . This model contains two descriptors, the molecular surface area (SA) and the product of the total variances (σ tot 2 ) and the balance parameter (υ)equation 5.  The method for calculating the descriptor VRv1 in BioPPSy and that used by Gharagheizi differ 18 . Thus, while the coefficient for these two descriptors differ, the descriptors themselves present the same information regarding the enthalpy of sublimation.
The third approach considered was that by Bagheri et al. 5 . In this model three simple parameters, RDCHI, nROH and TPSA, were used -equation 7.
sub 2 The R 2 of 0.96 and standard deviation of 5.1 kJ mol −1 calculated here compares favorably with the R 2 and RMSE reported with Bagheri's model of 0.93 and 9.8 kJ mol −1 , respectively. This equation matches closely the original model described by Bagheri, equation 7a.
Salahinejad et al. 6 showed that the enthalpy of sublimation could be adequately reproduced by a simple equation involving a single descriptor that describes the molecular volume which is accessible to and interacts with water molecules (W1) -equation 8 sub From this analysis, we understand Hy is not a suitable substitute for W1. Using a Bayesian feature selection approach, Salahinejad et al. identified three additional descriptors, PSA the polar surface area, V the water-excluded volume, and FPSA 3 the fractional polar surface area, whose inclusion led to a significant improvement in their original enthalpy of sublimation model.
Using these 3 additional descriptors and replacing W1 with Hy, our MLR refinement produced the following model -equation 10. where the R 2 and standard deviation are identical as those obtained using equation 10. We conclude that hydrophilicity does not play a significant role in describing the enthalpy of sublimation. Finally, Mathieu generated a model using 35 group contributions yielding an R 2 of 0.99 and an RMSE of 4.1 kJ mol −1 from a training set containing 814 compounds 30 . In the dataset of 260 compounds we used to create these models only 19 of the 35 group fragments were present. Using the 19 remaining groups we obtained a model with an acceptable R 2 of 0.70, but a large standard deviation of 17.1 kJ mol −1 . It is not unusual for models based on group contributions to have limited application beyond the molecule types included in the training set.
A comparison of the 5 models used in the prediction of the enthalpy of sublimation is presented in Table 1; the predicted heats of sublimation for all compounds for each model is provided in supporting information Table S1. The R 2 calculated here using a training set common to the development of each model is in close agreement with the value originally obtained using 5 different datasets. The model originally developed by Bagheri et al. has significant appeal since the R 2 calculated using the common set of 260 compounds matches the R 2 calculated using their own dataset of 1269 compounds, the standard deviation is the smallest of all the models studied, and the model uses only 3 descriptors.
The largest deviations from experiment for each model, both positive and negative, are presented in Table 1. The Politzer and Mathieu models performed particularly poorly in the prediction for bis-2-hydroxyethyl-terephthalate (127.7 and 132.1 kJ mol −1 ) and di-n-butyl-sulfide (−163.9 and −92.9 kJ mol −1 ), respectively. The enthalpy of sublimation of 2,6-di-tert-butyl-4-methylphenol was poorly predicted by all methods, with errors of 35.   There have been two other attempts to develop models for the prediction of the enthalpy of sublimation 2,31 . These studies focused their attention on specific classes of compounds (polyaromatic hydrocarbons and explosives) and are unlikely to be extensible beyond those classes.
Gibbs energy of sublimation. None of the models used to predict the enthalpy of sublimation could be used to train a model suitable to predict the Gibbs energy of sublimation. Each of the models described above were trained against the dataset of 278 Gibbs sublimation energies from Perlovich and Raevsky; the R 2 and standard deviation from each model is presented in Table 2. Using the descriptors from the Politzer et al. enthalpy of sublimation model we produced a very poor model for the prediction of Gibbs energy of sublimation, R 2 of 0.23. Using the group parameters in Mathieu's enthalpy model, the R 2 was 0.25. Using the descriptors from the Gharagheizi, Bagheri et al., and Salahinejad et al. enthalpy of sublimation models created models for predicting the Gibbs energy that were also unsatisfactory, with an R 2 all less than 0.60. Thus, without descriptors that capture information regarding the entropy of sublimation, the models that adequately describe the enthalpy of sublimation cannot be repurposed to describe the Gibbs energy of sublimation without appending terms that encode the entropy.
Considering the Gibbs-Helmholtz equation for the Gibbs energy of sublimation sub s ub sub it should be possible to predict the Gibbs energy of sublimation from knowledge of the enthalpy and entropy of sublimation. The entropy of sublimation depends on the molecular interactions between the molecules, and their influence on the order in the solid. Thus, it should be possible to model the entropy of sublimation with molecular descriptors that reflect the different types of non-covalent interactions in solids, namely ionic, hydrogen bonding and van der Waals. Applying different QSPR models for calculating the enthalpy of sublimation, we generated QSPR models for the prediction of the Gibbs energy of sublimation.
Initially we applied four descriptors from the Salahinejad et al. model for predicting the enthalpy of sublimation model (Hy, V, PSA, FPSA 3 ) -where Hy was used as a substitute for the W1 descriptor -and the gas-phase entropies for translation and rotation, S trans,gas and S rot,gas , to build a QSPR model. The R 2 value of the model based on the training set for this model was 0.60. Outliers in this model were identified to contain two characteristic features, conjugated systems and zwitterionic compounds. Thus, two descriptors, the number of fused rings in the molecule (R fused ) and the zwitterionic nature of the molecule (Zwit) 32  We found the gas-phase entropy descriptors, S trans,gas and S rot,gas , could be discarded to produce a new model using just six descriptors, resulting in a robust model with an R 2 for the training set of 0.71, an R 2 of the test set of 0.66, and a standard deviation of 10.5 kJ mol −1 (equation 13). Notably, inclusion of the gas-phase entropy descriptors, S trans,gas and S rot,gas , did not significantly improve any model developed here. A plot of predicted values of Gibbs energy of sublimation versus experimental for the training and test sets is presented in Fig. 1. The hydrophobicity descriptor, Hy, could be removed from equation 13 with little effect on R 2 or the standard deviation (0.71 and 9.1 kJ mol −1 , respectively), consistent with its lack of influence in the corresponding model for the enthalpy of sublimation (equation 10a). However, without Hy the R 2 and standard deviation for the test set was 0.62 and 12.6 kJ mol −1 , respectively, significantly worse than the model with Hy included (0.66). The improved performance of the model including the Hy descriptor suggests inclusion of a descriptor that encodes hydrophobicity is necessary to avoid overfitting the model, and therefore warrants its inclusion in the final model.
In the second approach, we applied the Politzer et al. enthalpy of sublimation QSPR model as the underlying set of descriptors ((SA) 2 and νσ .
( ) tot 2 0 5 ) supplemented with the two entropy-related descriptors, R fused and Zwit. Using these four descriptors, the following equation (equation 14) was obtained, where the R 2 for the training set was 0.51, the R 2 for the test set was 0.29, and the standard deviation was 12.6 kJ mol −1 . The small value of R 2 indicates this model is neither particularly robust nor predictive.
In the third approach, the multivariate model using seven parameters, RDCHI, nROH, TPSA, R fused , and Zwit was obtained (equation 15) with an R 2 for the training set of 0.66, an R 2 for the test set of 0.54, and a standard deviation of 10.1 kJ mol −1 . In the final approach considered here, the underlying model was the enthalpy of sublimation model of Bagheri et al. 5 Table 3; the predicted Gibbs energy of sublimation for all compounds for each model is provided in supporting information Table S2. A y-randomization test of these four models (equations 13-16) yielded R 2 values of 0.01-0.06 and standard deviations of 19.1-19.5 kJ mol −1 , indicating minimal effect of any chance correlation in the refined models. The model described by equation 13 is a good compromise between performance and number of descriptors.
The initial dataset of 278 compounds was partitioned into ten different training and test sets to explore the dependency of the performance of the model on the separation scheme. For the model described by equation 13, ten different partitioning attempts yielded R 2 for the training set between 0.71 and 0.73, between 0.53 and 0.72 for the test set, and standard deviations in the range 9.9 to 10.4 kJ mol −1 . Similar variation was observed for the other models.
The predicted Gibbs energy of sublimation differed significantly from the experimental value for several compounds in all 4 models; 1-amino-2-methyl-9, 10     Thus, it appears these compounds represent systems that are challenging for QSPR models to describe accurately. The range of values of the descriptors in each of the models described in Table 3 are listed in Table 4. These limits define the domain of the applicability of each method 33 . The model described by equation 13 fulfills the criteria of a useful model, an R 2 of the test set greater than 0.6 and low RMSE (or standard deviation) of the test set predictions 34 .
More recently McDonagh et al. developed models for predicting the enthalpy, entropy and Gibbs energy of sublimation 35 . Experimental data for the enthalpy, entropy and Gibbs energy were available for all 158 compounds used in the training set. Using only 2D descriptors, the partial least squares (PLS) method yielded an R 2 of 0.65 and 0.76 for the enthalpy and Gibbs energy, respectively. Presented in Table 5 is presented the performance of two non-linear regression algorithms, ANN and SVR, using the same descriptors used in equations 13-16; the predicted Gibbs energy of sublimation for all compounds for each model is provided in supporting information Tables S3 and S4. Multivariate regression with ANN using   the descriptors included in equation 13 produced a model with a significantly improved R 2 , 0.80, compared with the MLR R 2 of 0.71. The improvement in the R 2 of the test set, however, was significantly more modest, 0.63 using ANN over 0.62 from MLR, indicating the predictability of the ANN model is not significantly better than the MLR model. The small R 2 for the training set, and a negative R 2 for the test set, using ANN with the descriptors in equation 14 indicates a poor model that is not predictive. While the use of SVR produces a slightly more predictive model (R 2 of the test set of 0.26), the model retains very little value. Application of either ANN or SVR with the descriptors in equations 15 or 16 improves slightly the quality of the models over MLR -improvements in R 2 for both training and test sets are roughly 0.05. Again, the model described by equation 13 fulfills the criteria of a useful model 34 .

Conclusion
In this study, we have reproduced several QSPR models reported previously for the prediction of the enthalpy of sublimation. We have trained each model using a single consistent training set. From this comparison, we observe that all QSPR models based on molecular descriptors perform well. In contrast, the one model we examined using a fragment-based approach, did not perform well. We also developed several QSPR models for estimating the values of the Gibbs energy of sublimation with simple descriptors in the BioPPSy package. Models that performed well in predicting the enthalpy of sublimation could not be trained to predict the Gibbs energy of sublimation with any confidence. Inclusion of two descriptors that describe intermolecular interactions, the number of fused rings and the potential to form a zwitterion, could be used to improve these models. The preferred model based on MLR refinement has six descriptors, hydrophilicity, molecular volume, polar surface area, fractional charged partial surface area, the number of fused rings and the potential to form a zwitterionic species, with a squared correlation coefficient of 0.71 and standard deviation of 10.6 kJ mol −1 . ANN refinement using these same descriptors produced a model with significantly improved R 2 and standard deviation, however, the predictability, as gauged by the calculated R 2 for the test set, was not significantly improved.