Diagnostics of Data-Driven Models: Uncertainty Quantification of PM7 Semi-Empirical Quantum Chemical Method

We report an evaluation of a semi-empirical quantum chemical method PM7 from the perspective of uncertainty quantification. Specifically, we apply Bound-to-Bound Data Collaboration, an uncertainty quantification framework, to characterize (a) variability of PM7 model parameter values consistent with the uncertainty in the training data and (b) uncertainty propagation from the training data to the model predictions. Experimental heats of formation of a homologous series of linear alkanes are used as the property of interest. The training data are chemically accurate, i.e., they have very low uncertainty by the standards of computational chemistry. The analysis does not find evidence of PM7 consistency with the entire data set considered as no single set of parameter values is found that captures the experimental uncertainties of all training data. A set of parameter values for PM7 was able to capture the training data within ±1 kcal/mol, but not to the smaller level of uncertainty in the reported data. Nevertheless, PM7 was found to be consistent for subsets of the training data. In such cases, uncertainty propagation from the chemically accurate training data to the predicted values preserves error within bounds of chemical accuracy if predictions are made for the molecules of comparable size. Otherwise, the error grows linearly with the relative size of the molecules.

e e e e = ∈ ≤ ≤ = … . F H Naturally, one is often interested in the set of parameters which satisfies multiple observations: where I ⊂ {1, 2, …, N} is the index set of the QOIs. For example,  1:3,5 denotes the feasible set associated with QOIs, e = 1, 2, 3, 5. If the set  1:N is nonempty, the models and data are said to be consistent. Consistency is a form of model validation, ensuring that a model parameter vector exists that evaluates within the experimental uncertainty bounds for all N QOIs. When such a parameter vector does not exist the set  1:N is empty, the models and data are said to be inconsistent. If the models can be accurately represented by polynomial surrogates, this notion of consistency can be efficiently and provably quantified through constrained optimization 35 . A polynomial surrogate is a model which mimics the behavior of an underlying simulation model as close as possible, while being computationally cheaper to evaluate 37 . In cases where the models are not well characterized by simple surrogates but are computationally inexpensive to evaluate, direct sampling can provide a means of assessing consistency. To prove consistency, we must find at least one parameter vector x in  1:N . However, this becomes very challenging when the prior parameter space is large and there are many QOI models whose experimental bounds must be satisfied. In B2BDC, the prediction for a particular model, say M p (x), amounts to establishing bounds on the range of a model prediction, subject to model parameters being within a feasible set,  I : x p x p The prediction model need not be a member of the collection of models {M e (x)} e = 1, …, N . In certain cases, Eq. 3 can be efficiently solved via constrained optimization 30 . Note, inner approximations to the prediction interval can be found by evaluating the prediction model for a set of parameter values in ℱ I . The inner approximation can be interpreted as saying the prediction model can be no more predictive than the computed bounds and a wider prediction interval indicates greater uncertainty in the prediction.
As an illustrative example of the B2BDC methodology, pertinent to the present work, we select a case investigated recently 31 . The problem is stated as estimating the ionization potential (IP) of water hexamer and its uncertainty interval, given ionization potentials and their respective uncertainties of water dimer, trimer, tetramer and pentamer. In this problem, there are four QOI models for the training-set of water clusters, consisting of dimer, trimer, tetramer and pentamer. A theoretical model of the ionization potential was constructed based on the double-hybrid form of the Density Functional Theory (DFT). Two model parameters, referred here as x 1 and x 2 , encapsulated the model variability over their allowable prior ranges (each 0 to 1). Initial sampling over the parameter space and building surrogate models produced individual feasible regions (Eq. 1), for each of the four training-set water clusters labeled  1 through  4 . The top-left panel of Fig. 1 shows the individual feasible sets of dimer and trimer water clusters,  1 and  2 respectively and their intersection  1:2 shown in dark red. Every point of this red region guarantees predictions of the dimer and trimer IPs within their respective uncertainty bounds. The top-right panel adds the feasible set for tetramer water clusters,  3 and the bottom-left panel adds further the feasible set for pentamer water clusters,  4 . The intersection of all the individual feasible regions produces a feasible region,  1:4 which is the feasible set of the B2BDC framework described in Eq. 2; shown in dark red in the bottom-left panel of Fig. 1 for the property we seek to predict, i.e., the ionization potential of the water hexamer, which is consistent with the ionization potentials of the training data. Collectively, all points in  1:4 result in an interval of predicted hexamer IP values, shown in the bottom-right panel of Fig. 1. The predicted uncertainty interval encompasses all the uncertainties of the input data and turned out to be significantly smaller, by about a factor of 2, compared to the high-theory calculations.
Selection of QOIs and experimental data for PM7. There are a multitude of approaches available for validating a computational model to experimental data. The two most important factors when validating a model is the choice of validation data and the specific experimental feature to validate the model by.
In this study, we will be considering a single homologous series, namely linear alkanes. We chose this homologous series for its simplicity and accumulated knowledge. There are 27 adjustable model parameters in PM7 which are used to parameterize the interactions of carbon and hydrogen atoms. Given the problem choice, we should expect the "best case scenario" from the results.
For a given parameterization, PM7 as a model provides many output responses, e.g., vibrational frequencies, ionization potential, HOMO-LUMO energies, heat of formation, etc. Due to the sheer number of model responses, the availability of accurate experimental data will be used to guide the choice of a quantity of interest. We also keep in mind that the primary output value of PM7 is standard heat of formation. Ruscic et al. 38 , employing a thermochemical network approach, reported accurate thermochemical values for a set of molecules, including members of linear alkanes, methane to octane (Table 1). Uncertainties listed in Table 1 represents the reported 95% confidence intervals, in conformity to the accepted standard in thermochemistry 39 . Despite the reported values not being purely experimental, the terms "experimental uncertainty" and "experimental bounds" will be used to describe the data in Table 1. In the present study, heats of formation of linear alkanes will be used as our QOIs. The experimental bounds used for the training data are the the reported heat of formation ± the uncertainty. Table 1 also shows the QOI models used in the study, namely the heat of formation of methane, ethane to octane. For sake of simplicity, each QOI model is indexed by the number of carbon atoms. Figure 1. Illustration of the B2BDC methodology, specifically the intersection of feasible sets, for the problem described in Edwards et al. 31 . The shaded regions shown in Panels a, b and c are individual feasible set,  e , which are a set of model parameters that can predict the ionization potential of a particular water cluster within its respective uncertainty. Panel a: individual feasible sets for dimer and trimer configurations of water,  1 and  2 respectively. The intersection of these two regions is the feasible set  1:2 , shown in dark red. Panel b: the addition of the feasible set for the tetramer configuration of water,  3 , shown in teal. The intersection is shown in dark red,  1:3 . Panel c: the feasible set of pentamer configuration of water, shown in dark grey, primarily overlaps with  3 . The intersection of all individual feasible sets forms  1:4 , shown in dark red. Panel d: prediction of the ionization potential of a hexamer water cluster on feasible set  1:4 . Prediction establishes a range of a prediction model as seen in Eq. 3. The resulting prediction interval for the ionization potential is shown on the right-hand side in red.
Scientific REpORtS | (2018) 8:13248 | DOI:10.1038/s41598-018-31677-y Motivation for direct sampling. B2BDC typically employs surrogate models to represent an underlying simulation. The use of surrogates makes tasks like optimization and sensitivity analysis much more efficient due to inexpensive evaluations. When the surrogates are polynomials, consistency and prediction can be addressed efficiently and provably. However, we have found constructing surrogates for PM7 particularly challenging for two reasons, as we will detail below with the H (C H ) f 298 4 10  ∆ serving as our example. First, we do not have a well-defined prior domain, , to confine our search for feasible parameters as no bounds are specified for these parameters. To address this, we constructed an initial prior domain  by centering the domain around the PM7 nominal parameter vector 32 , x nom , which is listed in Supplementary Table S1. The bounds of the prior domain were constructed by selecting perturbations ±δ i to the i-th parameter with all other parameters remained fixed such that the interval defined by x nom,i + δ i produced a 10 kcal/mol change in  ∆ H f 298 . We denote this initial region by , as shown in Supplementary Table S1. Second, our experience has shown that surrogate models are only accurate (to within the experimental uncertainty) on small domains. To illustrate this, we compare the fitting error of a quadratic polynomial and Gaussian process surrogate model on shrinking domains, 1], as shown in Fig. 2. The surrogate model fitting error is the absolute difference between an evaluation of PM7 and an evaluation of the surrogate model over a domain  k . For each k, 7500 Latin Hypercube samples were generated in  k to construct both surrogate models. 10-fold cross validation was used to estimate the surrogate model fitting error over each domain. The Gaussian process was implemented by using MATLAB's fitrgp function 40 . As seen in Fig. 2, the Gaussian process deviated by approximately 0.8 kcal/mol from PM7 on the domain  1 , which is significant given the accuracy of the training data. In order to attain a surrogate model fitting error below our target, the experimental uncertainty, the initial region needed to be reduced by the factor of k = 0. 4 Table 1. List of the QOI models M e , the associated index e (equivalent to the number of carbon atoms in the alkane), the reported heat of formation and uncertainty 38 .  Supplementary  Table S1. As the domain shrinks, by reducing k, the surrogate model fitting error, that is the difference between PM7 and the surrogate model, decreases to a point where the error falls below the experimental uncertainty (shown in grey) of 0.06 kcal/mol. Comparison between a quadratic surrogate (black dots) and a Gaussian process model (red crosses) with a constant mean and a squared exponential covariance function is shown. 10-fold cross-validation was used to estimate the fitting error.
Scientific REpORtS | (2018) 8:13248 | DOI:10.1038/s41598-018-31677-y 0.4 is 5.5 × 10 10 times smaller by volume as compared to the original domain  1 . Based on this result, we concluded that it would require an intractable number of surrogate models to mimic the behavior of PM7 over the prior region  within the desired accuracy. Therefore, PM7 will be evaluated directly and forgo any surrogate modeling.
Direct Sampling for Consistency. In light of the challenges discussed in the preceding section, we approached consistency and prediction by directly sampling . Still, despite PM7's fast evaluation time, particularly for fixed geometries, evaluating consistency for multiple QOIs remains a computational challenge. We break this challenge into two tasks: (1) allocating computational resources to sample ; and (2) determining a prior region  to confine our search for feasible parameters.
Computational burden of direct sampling PM7. One of the limitations encountered in the present study was the number of samples we could evaluate for PM7 with the computational resources available. Evaluating PM7 for millions of parameter combinations is an embarrassingly parallel task, but computational efficiency can be improved by reducing an input/output (I/O) bottleneck occurring in the evaluation process.
Throughout the study, the input molecular geometries were obtained via optimization with the PM7 nominal parameter vector, which remained frozen afterwards, during the subsequent direct sampling. The freezing of the molecules geometries was done to avoid any unphysical reorganization or distortion. The consequences of freezing the molecular geometries with the nominal PM7 parameter vector are two-fold. First, using geometry optimization for a given parameter vector can (and usually does) result in different heats of formation as compared to the fixed geometry. Second, parameter vectors which are feasible with respect to the heats of formation, do not guarantee reaching a physically meaningful geometry. All feasible parameter vectors, which would produce unphysical geometries, should ultimately be removed from the feasible set.
The computational chemistry program MOPAC 41 was used for evaluation of PM7 parameter sets. One of the practical aspects to be solved in working with MOPAC is handling efficiently multiple reading and writing of text files for each execution, which creates a bottleneck in I/O when parallelizing the direct sampling approach. To reduce I/O time, a memory based file system was used, that enabled all files to be read and written into system memory (RAM) instead of a hard disk drive (HDD). For example, using a Seagate 7200 RPM SATA HDD, read and write speeds were 204.0 and 194.9 MB/s, respectively. Using a memory based file system we could read and write at 6020.7 and 5840.6 MB/s, respectively. To relate this to sampling performance, the time to evaluate 10 4 different parameter values for C 4 H 10 on a workstation equipped with an Intel Core i7-6700 K processor took 305.0 seconds using the Seagate hard drive and 215.7 using the memory based file system, a speed up of 41%. This speedup in reading and writing MOPAC text files directly translates into more evaluations of PM7 per second.
The MATLAB Parallel Computing Toolbox 40 was used to write input files, execute PM7 and read output files in parallel. Two workstations, each with an Intel Core-i7 6700 K processor were used to directly sample PM7 and evaluate the heats of formation of a series of alkanes, methane to nonane.
Determination of a prior search region. The choice of the prior parameter domain, , determines all subsequent analysis and hence it needs to be well motivated. Our goal is to assess the predictivity of PM7 for large alkanes while being consistent with experimental data available for small alkanes. Due to limited data, we set aside heptane and octane, the two largest alkanes for which we have experimental bounds, to assess predictivity. Conversely, we also set aside methane and ethane so as not to bias the result. Although the physical properties of methane and ethane may be representative of the homologous series, the chemical properties, i.e., the energetics, are very different as seen through group additivity 42 . Propane and butane were also not included in the initial region search, since a priori it was not known if a parameter vector exist that could simultaneously evaluate the heats of formation of propane, butane, heptane and hexane within the uncertainty in the training data. Based on these considerations, we prioritized our determination of  on feasible samples for pentane and hexane,  5:6 .
A non-linear programming solver, fmincon 40 , was used to minimize the following objective function, M x y ( ) 5  . A volume was taken around x opt in order to collect samples of  5:6 . For each sampled point, the heats of formation of pentane and hexane were computed by PM7. 2 × 10 5 samples were generated by employing a Latin Hypercube design from the volume [x opt ± x opt × 1 × 10 −3 ] and 65 samples were found to be within  5:6 .
Principal component analysis (PCA) was conducted to identify a rotated coordinate system around the samples of  5:6 , where all principal components were preserved. Considering the arbitrary choice of the volume used in the sampling of  5:6 , each PCA direction was extended ten times of that of the  5:6 samples. This allowed for an even larger sample region to be considered. A Latin Hypercube design was used to generate uniform samples in the rotated and extended volume.
In total, 5.76 million samples were generated uniformly within the PCA-rotated volume and were used to evaluate PM7 for the heats of formation of nine alkanes, CH 4 to C 9 H 20 . Shown in Table 2

Results
Feasible Set Search. For each of the eight alkanes, a feasible parameter vector could be found that would predict its heat of formation within the experimental bounds. Feasible parameters were also found for all pairwise combinations of alkanes, i.e., a parameter vector could be found that would predict the heat of formation for any two alkanes within their respective experimental bounds. The percentage of feasible samples found for each pair of alkanes in the training data set is reported in Supplementary Table S3. Methane had the smallest percentage of feasible samples for a single alkane and the largest was for octane. This difference in percentages of feasible samples could be due to the uncertainty bounds associated to octane being nearly an order of magnitude larger than that of methane and the difference in molecular structure of methane compared to all others in the homologous series.
From the generated samples, there were 4020 samples of  5:6 , the alkanes which we had focused our search on. However, from the generated samples, there was no single parameter vector that was able to simultaneously satisfy the uncertainty bounds of all the training data, i.e.,  N 1: = ∅. Thus the PM7 model and the training data are found to be mutually inconsistent.
The inability to find a parameter vector feasible for all alkanes does not prove the non-existence of such a point. In fact, parameter values that are feasible for all alkanes could lead to potential biases in the prediction because methane and ethane are sufficiently different from the rest of the homologous series with respect to their atomic bonding and hence the energetics. Using parameter values feasible for methane and/or ethane in the prediction of larger alkane properties can result in a prediction interval with smaller uncertainty but in disagreement with experimental data, a problem similar to the bias-variance tradeoff 43 .
Direct sampling the search region led to a fraction of samples being consistent with the experimental data for three alkanes (Supplementary Table S4), 4250 samples were found from  6:8 . Prediction of all alkanes by  6:8 is shown in Supplementary Fig. S2. Feasible sets formed by consecutive alkanes, i.e., adjacent members of the homologous series, where only a few or no samples were found are shown in Supplementary Table S5. The feasible set with the largest number of consecutive alkanes was  3:8 . Prediction of all alkanes by  3:8 is shown in Supplementary Fig. S3. No samples were found from  2:8 from direct sampling. Using a genetic algorithm via MATLAB's ga function (and 750 + cpu hours), we were able to obtain a single parameter vector from  2:8 (Supplementary Table S2).
For PM7 to predict the heats of formation of eight alkanes to within the uncertainty in the training data turned out to be a challenging task. There exists a set of parameter vectors, specifically 28,696 such sets, each of them simultaneously predicting the heats of formation of all eight alkanes with 95%-confidence chemical accuracy, i.e., within ±1 kcal/mol. Since there is no justification for expanding the uncertainty of the training data, the feasible sets used throughout this study are defined by the reported experimental uncertainties, those reported in Table 1.
To better understand the shape of a feasible set over the search region, consider two feasible points for C 6 H 14 . The heat of formation of C 6 H 14 was calculated along the line segment connecting the two feasible points. Figure 3(a) shows that points along the line segment do not remain feasible, thus proving the feasible set is non-convex. Interestingly, if the line segment is extended further, beyond the initial sample region, another intersection with the experimental bounds occurs at t = −0.65, shown in Fig. 3(b). This proves that there exist more feasible points for C 6 H 14 which are outside of the sampled search region.

Uncertainty of Predictions.
In this section, we examine the uncertainty of predictions for larger alkanes using samples consistent with experimental data for smaller alkanes. Prediction of the heat of formation of heptane, octane and nonane were investigated. In the case of heptane and octane, experimental data are available for comparison, but for nonane there are not. In each case, samples consistent with the experimental data for smaller alkanes were used for prediction of the larger alkane forming the histograms shown in Fig. 4. The width of the histograms constitute the prediction uncertainty. The predictions are only inner approximations as the samples were only generated from a portion of the parameter space. Thus, the predicted uncertainty can be no smaller than that which is presented below. Figure 4(a) depicts the prediction of the heat of formation of heptane. In blue are samples from  5:6 , i.e, parameter vectors that were feasible for both C 5 H 12 and C 6 H 14 . The width of the histogram is 0.5 kcal/mol, which is larger than that of the experimental uncertainty of 0.2 kcal/mol (shown in grey), but still within the chemical accuracy (1 kcal/mol in terms of a 95% confidence interval). To further reduce the prediction uncertainty, we can impose additional constraints by considering samples that are also feasible with smaller alkanes. Samples of  4:6 and  3:6 , shown in red and yellow, respectively, are indeed capable of reducing the predicted uncertainty and in the case of  3:6 producing a prediction 52.5% smaller than that of  5:6 .
Following a similar analysis, we examined the prediction for octane, the largest alkane that we had experimental data for comparison. To predict the heat of formation of C 8 H 18 , we considered feasible samples from the preceding two, three and four smaller alkanes, i.e., samples from  6:7 ,  5:7 and  4:7 . The results are shown in Fig. 4(b). The heat of formation of octane predicted by  6:7 , shown in blue, yielded uncertainty of 0.62 kcal/mol, again within the chemical accuracy. Using samples which were feasible with the preceding three alkanes,  5:7 , resulted in a prediction uncertainty of 0.42 kcal/mol, a reduction of 31% compared to the prediction by  6:7 .
The computational results for the alkane without any experimental data for comparison, nonane, are shown in Fig. 4(c). Using samples of  7:8 , the prediction uncertainty for nonane is 0.85 kcal/mol. We were able to reduce the prediction uncertainty for nonane by 44%, to 0.48 kcal/mol, by considering feasible samples from  5:8 . Prediction of the heat of formation of nonane using  3:8 , the feasible set with the largest number of consecutive alkanes found, is given in Supplementary Fig. S1.
The predictions for alkanes beyond nonane are shown in Fig. 5. The displayed uncertainty was predicted by using samples of  3:8 . Two interesting observations can be made from these results. First, using samples from  3:8 produces a predicted uncertainty of nearly 5 kcal/mol for methane. This observation is consistent with the fact that methane is sufficiently different from the larger members of the homologous series. Parameter vectors feasible for larger alkanes are not capable of predicting the heat of formation for methane with chemical accuracy. Second, for larger alkanes prediction uncertainty grows linearly with the alkane size. The linearly increasing prediction uncertainty can also be seen in Fig. 6 where C 2 H 6 through n-C 20 H 42 are compared against the available experimental data  Table 1). Panel a: line segment between two feasible points is not entirely feasible, thus the feasible set of C 6 H 14 is non-convex. Panel b: extending the line segment beyond x 1 , to the left, a new intersection with the experimental bounds is found at t = −0.65.   Fig. 3, the feasible set of PM7 is non-convex. For a convex feasible set it suffices to map out its boundary in order to determine if any combination of parameters is feasible or not. In the case of PM7, explicitly mapping of the boundary of the feasible set is computationally expensive due to the dimensionality of the problem and the feasible set being non-convex; a viable alternative is to use the data collected via direct sampling and train a binary classifier that distinguishes a "feasible" class of parameter values from the "infeasible" one. As there was no parameter vector found feasible for all alkanes, it is unclear what parameter vectors should be assigned to the "feasible" class. One approach is to only consider parameter values belonging to a specific feasible set, such as  4:7 , as the "feasible" class. An alternative approach would be to consider the "feasible" class as all parameter vectors that are feasible for at least one alkane. Given the unbalanced nature of the feasible and infeasible classes, i.e., over 5 × 10 6 samples which are infeasible and only a few thousand samples belonging to any specific feasible set, we choose the latter option and constructed the "feasible" class from all points that are feasible for at least one alkane in the training set.

Classification of Feasible Points. As was established in
Binary classification was performed using a Random Forest classifier, as implemented in Scikit-learn library 44 , with equal misclassification costs. A Random Forest classifier is capable of emulating the decision boundary for non-convex and even non-contiguous feasible domains. Both classes were down-sampled to 10 5 to handle the class imbalance. Down-sampling was performed by using a randomly drawn 10 5 samples from each class. Performance of the classifier is evaluated in a 5-fold cross-validation via construction of the Receiver Operating Curve (ROC) and computation of the Area Under Curve (AUC) (see Fig. 7). The ROC allows one to gauge the performance of a binary classifier by comparing the true positive and false positive rate across all possible decision thresholds. The true positive rate is the number of feasible samples correctly identified as the "feasible" class over the total number of feasible samples. The false positive rate is the number of samples incorrectly identified as the "feasible" class over the total number of infeasible samples. A perfect classifier would have an ROC curve at the top-left corner of Fig. 7, with a true positive rate of 1 and a false positive rate of 0. For this to occur, all feasible samples would be correctly identified as part of the "feasible" class without misclassifying any infeasible samples as the "feasible" class.
One figure of merit for binary classification is the area under the ROC curve (AUC). The AUC varies between 0 and 1 where a perfect classifier has an AUC of 1 and randomly guessing (luck) would have an AUC of 0.5. With an AUC value of 0.68, we found that there is indeed a non-random structure in the feasible set recovered via direct sampling. Therefore, it is possible to train a classifier that fulfills the same function as the boundary to a feasible set. Of course, for this assessment to be practical, a better performance metric of the classifier should be achieved. One route toward this goal is to collect a sample in the parameter space that conveys a better representation of the spatial extents of the non-convex feasible set found via sampling on a Latin hyper-cube. The inset of Fig. 3(b) depicts a region where a line segment passes through the experimental bounds. This region (and all other intersections) is a 27 dimensional space. Samples in these regions, near the experimental bounds, can help improve the characterization of the feasible set for classification.

Conclusion
Uncertainty quantification of semi-empirical quantum chemical model PM7 was performed using the B2BDC framework. The obtained results did not show evidence of model consistency even in the "best case" scenario, when we considered heats of formation of a small family of linear alkanes. This finding, however, requires a cautious interpretation.
Lack of evidence of the model consistency with experimental data may imply that either the experimental data has some bias or the model has a deficiency. Given the quality of the experimental data, we tend to think the model inconsistency resides with the parameterization of PM7. To ultimately resolve this question, would require a significantly more engaged investigation of PM7 parameter space whose dimensionality exceeds 27 for organic molecules that include hetero-atoms. The practical lesson of the study is that we could not find a parameter vector feasible for all the training data having searched a finite volume and consumed a certain amount of the computational resources and time. The cost of establishing model consistency is, in itself, a diagnostic that can be used to compare various data-centric models and make an informed selection among them.
Another practical lesson is the source of the encountered lack of consistency. The "problematic" QOIs found were the shortest alkanes, methane and ethane, whose chemical properties are known to be different from the rest of the homologous series. The value of PM7 comes from the low computational cost to evaluate larger molecules. In our study, the heats of formation of larger molecules were found to be consistency with the experimental data. The reported results demonstrate B2BDC's capability for a deeper analysis into PM7 by evaluating the limits of model consistency in a data-centric setting.
One clear challenge in this study was the inability to accurately represent a significant volume of the parameter space with a surrogate model. The encountered difficulties motivated the development of a tool-set for uncertainty quantification of models with high-dimensional parameter spaces that can have non-convex and even non-contiguous feasible sets. Incorporation of machine learning techniques with the B2BDC framework was shown to be a viable strategy to efficiently tackle cases of this type. Our basic attempt to learn the geometric structure of the identified feasible sets showed encouraging results. Possibility of the accurate binary classification of feasible/infeasible classes suggests a route to sampling strategies that are more efficient than brute-force direct sampling and could rely on a semi-supervised model to search for feasible points.
An equally important aspect of our study concerns uncertainty propagation from the training data that have sub-chemical accuracy to PM7 predictions. We observed that predictions remained within bounds of chemical accuracy for the molecules whose size is close to the size of the molecules in the training set. As expected, improvement of the quality of the feasible sets, i.e., attaining consistency with more alkanes, led to the reduction of the prediction uncertainty. Growth of the prediction error with the difference in the size between the test and training molecules agrees with the size-extensive nature of the selected QOI. This result reinforces intuition that there is a price to switching from interpolation to extrapolation regime. It motivates a large-scale uncertainty quantification study of semi-empirical quantum chemical models where both the size and chemical diversity of the dataset are significantly expanded. In order to facilitate such an effort we are including the parameter vectors and current feasibility labels in the Supplementary Information.   Figure 7. ROC curves and their corresponding AUC for binary classification using a Random Forest model. Classification was between the classes, "feasible" and "infeasible" with the samples obtained via direct sampling. "Feasible" class includes samples that are feasible for at least one alkane. "Infeasible" class includes samples that are not feasible for any alkane. Both classes were down-sampled to 10 5 samples. Classification performance was assessed by 5-fold cross-validation using 5 × 10 3 ensemble of random trees.