Identifying models of dielectric breakdown strength from high-throughput data via genetic programming

The identification of models capable of rapidly predicting material properties enables rapid screening of large numbers of materials and facilitates the design of new materials. One of the leading challenges for computational researchers is determining the best ways to analyze large material data sets to identify models that can rapidly predict a given property. In this paper, we demonstrate the use of genetic programming to generate simple models of dielectric breakdown based on 82 representative dielectric materials. We identified the band gap Eg and phonon cut-off frequency ωmax as the two most relevant features, and new classes of models featuring functions of Eg and ωmax were uncovered. The genetic programming approach was found to outperform other approaches for generating models, and we discuss some of the advantages of this approach.

been widely applied in multidisciplinary fields, including but not limited to financial market analysis 20 , biological science 21,22 , software development 23 , and identifying interatomic potential models from calculated energies [24][25][26] . In materials science and engineering, researchers have used genetic programming to develop predictive models for properties of concrete and cement [27][28][29] , asphalt 30 , shape memory alloys 31 , and heterogeneous catalysts 32 . They have also been used to determine the effects of processing parameters on metal alloys [33][34][35][36] , predict the impact roughness of cold formed materials 37 , optimize productivity for the steel industry 38 , and develop models for a variety of problems in structural engineering 39 . Recently genetic programming has been identified as a useful tool for extracting important descriptors of material properties from computational data 19 . Test and training data. Here we evaluate the effectiveness of genetic programming in predicting dielectric breakdown strength. Dielectric breakdown strength is defined as the maximum external electric field strength that the materials can withstand before turning into a conductor. Materials with high dielectric breakdown strength are used as insulators for applications including high voltage power transmission and capacitors [40][41][42] . Fundamentally, dielectric breakdown strength is a complex phenomenon involving physical interaction between materials and an electric field. Here, we focus on intrinsic dielectric breakdown strength, which is defined for a defect-free crystal and theoretically is only influenced by materials chemistry. The calculation of intrinsic dielectric breakdown strength is based on von Hippel 43 and Fröhlich [44][45][46] theory implemented in a DFT framework 47 . Due to the time-consuming nature of such calculations, only 82 representative crystals were calculated. The details of these calculations and the underlying theories can be found in the work by Sun et al. 47 and Kim et al. 48 .
Despite its importance in both academia and industry, dielectric breakdown strength lacks a good predictive model that can be used to rapidly screen new candidate materials. Recently Kim et al. 48 provided a case study of searching for relevant models via three supervised machine-learning algorithms: Kernel Ridge Regression (KRR) [49][50][51] , Random Forest Regression (RFR) 49 and Least Absolute Shrinkage and Selection Operator (LASSO) 49,52 . Of these, they determined that the LASSO method was effective for the identification of analytical models, and based on this method they developed several phenomenological models for crystalline dielectric materials. They highlight the following model as being particularly good in terms of simplicity and accuracy: ω .

or, equivalently
where F b is the dielectric breakdown strength, E g is the electronic band gap, and ω max is the maximum phonon frequency.
The genetic programming method used in this paper is in some ways similar to the LASSO method. The LASSO method used by Kim et al. 48 assesses linear combinations of elementary terms, where each term is a function of some subset of material properties. The disadvantage to this approach is that a list of possible terms must be provided to the algorithm. (Kim et al. 48 combinatorially generated a total of 187,944 terms, each containing functions of up to three properties.). In contrast, genetic programming is capable of dynamically generating such terms by combining properties in non-linear ways, so only the list of known material properties must be provided to the algorithm.
To enable comparison with LASSO and other machine learning algorithms, we have applied the genetic programming approach to the same input dataset generated by Kim et al. 48 . This data set is composed of eight feature properties, listed in Table 1, for 82 representative crystalline dielectric materials. The detailed information for 82 crystalline insulators can be found in the supplementary information of Kim et al. 's paper 48 . Simulation Details. We performed genetic programming calculations using the Eureqa software package 53 .
In genetic programming, an explicit definition of elementary operators is required to define the hypothesis space, and in this study we chose a collection of four algebraic operators (i.e., plus, minus, division, multiplication) and three function operators (square root, exponential and logarithm functions). For each of these operators we used the default complexity value in Eureqa (see Supplementary Table S1). We used three different but representative metrics to measure the fitness of candidate models: the mean absolute error (MAE), the root mean square error (RMSE) and the Pearson correlation coefficient (PCC) (as defined in the SI). The PCC assesses the degree to which two sets of data are linearly related, regardless of how close they are to each other in value. Thus when the PCC is used as the objective function, the breakdown strengths predicted by the models output by Eureqa need to undergo a linear transformation to enable direct comparisons with the DFT-calculated breakdown strengths (see Supplementary Fig. S1). On the other hand the use of MAE or RMSE as the objective function produces output that is directly comparable to the training data. Although this obviates the need for a linear transformation of the To estimate the predictive performance of the evaluated models, we set the parameters in Eureqa so that in each run the 82 materials in the data set were randomly partitioned into two groups of the same size (i.e., each containing 41 materials): the training and the validation dataset. Model optimization is done using the training set, and the validation set is used to construct a Pareto frontier of models, defined as the set of models for which no other models were both simpler and more accurate. We take the output of a single Eureqa run to be the set of models on the Pareto frontier after total number of evaluated models reached 10 12 .

Results and Discussion
Evaluation of individual properties and products of properties. To estimate the degree to which each of the eight feature properties is related to the dielectric breakdown strength, we first simply count the number of times each appears in a model found by Eureqa. Because genetic programming is a stochastic method, with randomness in both the evolutionary algorithm and the way in which the validation set is selected, we performed our analysis over eight independent runs for each objective function. We define a parameter f i by: where N is the total number of models in the Pareto frontier and N i is the total number of models in the Pareto frontier that are functions of the the i th material property. Higher f i values for a particular property suggest that the property is more useful as a descriptor of dielectric breakdown strength.
The calculated values of f i for the eight material properties are shown in Fig. 1. The properties E g (the band gap) and ω max (the maximum phonon frequency) have the highest values of f i . For all three objective functions, more than 60% of the models on the Pareto frontier contain these values. The importance of E g and ω max in predicting dielectric breakdown strength is in agreement with the results obtained by the machine learning models evaluated by Kim et al. 48 , as well as a simple correlation analysis ( Supplementary Fig. S3). These results indicate that genetic programming is an effective tool for rapidly identifying the most relevant properties, consistent with prior results 19 . Our results also revealed that bm (the bulk modulus) and ω mean (the average phonon frequency) have the next-highest values of f i , which is not surprising given the degree to which these properties are correlated with ω max (see Supplementary Fig. S3). These three properties (ω max , ω mean , and bm) are associated with the stiffness of the material 54 , which is consistent with the proposed physical picture described by Kim et al. 48 .
After scrutinizing the raw results from Eureqa, it was evident that the multiplication of two features is the most frequent way for features to be combined. Based on this observation, we computed the number of appearances for all possible products of two features (Fig. 2). For all three objective functions the E g *ω max term appeared most frequently, providing further evidence of the importance of these two properties.
Performance of the genetic programming models. We first consider a direct comparison between the results of genetic programming and the results obtained by Kim et al. 48 . To make this comparison, we have used the logarithm of the dielectric breakdown strength as the output value, to be consistent with their approach 48 . To accelerate the search, we restricted the properties considered to the two that consistently appeared most frequently in models on the Pareto frontier: E g and ω max , based on Figs 1 and 2. All results in this paper are under this premise unless stated otherwise. Because the two-feature Eureqa runs are faster than the eight-feature runs, for each objective function we gathered statistics for 16 two-feature Eureqa runs. In each of these runs it took approximately 14 hours to evaluate 10 12 models on a single core of 3 GHz Intel Core i5-2320 CPU.
Averaged over all 16 Eureqa runs as a function of complexity, the performance of the generated models for all three objective functions (MAE, PCC, RMSE) on the total set of input data (i.e., training plus validation sets) increases with increasing complexity and levels off when the complexity reaches around 10 (Fig. 3). The LASSO solution (equation (2)) has a complexity of 11, determined via the same complexity measure used for the genetic programming runs. At this level of complexity, the average model found by genetic programming has slightly better performance than the LASSO solution on the training and validation data.
To assess the predictive ability of the generated models, i.e. how well the models on the Pareto frontier perform when exposed to data Eureqa has never seen, we evaluated the models' performance (measured by RMSE and  PCC) on an out-of-range test set (Fig. 4). The test set consisted of four cubic crystals 48 , Li 2 S, Na 2 S, SrCl 2 and ZrO 2 , as well as six perovskite crystals 55 , BaSnO 3 , CaGeO 3 , CaSiO 3 , BSiO 2 F, BaBO 2 F and SrBO 2 F. Detailed information about these ten materials is provided in the works by Kim et al. 48,55 and summarized in Supplementary Table S2. The test data was not used by either the genetic programming algorithm or the LASSO algorithm when generating models.
For the models discovered by genetic programming, the average quality of the predictions on the test data improves until about a complexity of about 10, at which point the errors start to increase (Fig. 4). In contrast, the performance of models on the validation data increases with increasing complexity and levels off when the complexity reaches about 10 (Fig. 4). These results indicate that beyond this complexity, the genetic programming algorithm is overfitting the data. Although the LASSO solution has performance similar to the genetic programming models on the validation data, its performance on test data is significantly worse, manifested by higher RMSE and lower PCC values. Its performance on the test data is comparable to the genetic programming models at levels of complexity that overfit the training data, suggesting that the LASSO solution may have also overfit the data. We note that the genetic programming algorithm found the LASSO solution in two of the 48 Eureqa runs, and in both of these runs PCC was the objective function. The PCC runs appear to be better than the RMSE or MAE runs at identifying models with good predictive ability at low levels of complexity, which is understandable given the additional requirements imposed on the generated models when MAE or RMSE is used as an objective function.
Our results suggest that genetic programming is effective at finding models with good predictive ability, provided that the appropriate level of complexity is determined. Models that are too simple are not able to adequately account for the factors that influence dielectric breakdown strength, and models that are too complex overfit the data and have relatively poor predictive ability. The challenge is in determining the appropriate level of complexity to minimize prediction errors. In addition, at a given level of complexity there may be many different models by different genetic programming runs, and it is also necessary to select from these models. One approach to identifying models that are expected to have good predictive accuracy is to simply withhold a set of test data, as we have done here. Approaches in which a set of test data is withheld have the added benefit of allowing for the amount of uncertainty in the predictions for each model to be estimated by evaluating the prediction errors on the withheld data 11 . Similarly, cross-validation could be used to try to identify the optimal level of complexity. Some amount of cross-validation is already included in the genetic programming algorithm, as the data is randomly partitioned into training and validation sets for each Eureqa run. Here we explore two alternative strategies for identifying the best models.
The first strategy we explored is to simply count the number of times a model appears in the different stochastic runs, under the hypothesis that models that appear on the Pareto frontiers more frequently are less likely to have fit the training data well by chance. In each of the 48 Eureqa runs, a different, randomly-selected partition of training and validation data was used. We counted the number of times each model appeared on the 48 Pareto frontiers, considering only the functional relationships between the two feature properties and ignoring differences in the constants (e.g. coefficients), as we found that for the same model the constants identified by Eureqa could differ slightly from run to run. We chose a single set of parameters to plot by using a gradient descent algorithm to identify the locally optimal parameters. (Details are provided in the supporting information.) A plot of the models with complexity less than 18 on the Pareto frontiers is shown in Fig. 5, and detailed values for each entry in this plot are listed in Supplementary Table S3. We visualize each model's performance on the test data in Fig , appears on all 48 Pareto frontiers, but has relatively poor predictive performance on the test set. However, the second-most common model on the Pareto frontiers, ω = .
, is one of the best-performing models. It has a lower root-mean-square error on the 82 training and validation materials than the model discovered by LASSO, and it has roughly half the root-mean-square error on the test set.
The relatively weak performance of the simpler model, ω = . + . + . , shows up nearly as frequently and at a complexity level at which the hypothesis space of possible models is significantly larger. This suggests that one way to search for the models with the best predictive power would be to find the models that show up unusually frequently given their complexity. However this method does not resolve the issue of how to select a single model that is likely to have good predictive ability. In addition, if we repeat this exact same exercise using F b , rather than F ln( ) b as the output variable (see Supplementary Fig. S4 and Table S4), we find that the model that performs best on the test set shows up on Pareto frontiers less frequently than some more complex models. It would have been difficult to identify this model as being particularly promising using this first strategy.
The second strategy we used is to create a "universal" Pareto frontier by combining the best models from 48 Pareto frontiers of all Eureqa runs, as evaluated against the validation data (Fig. 6a). The data used to generate this plot is provided in Table S5. When the models in the universal Pareto frontier are benchmarked against test data, we find that at some complexity values (e.g., 9 and 11), the models on the universal Pareto frontier simultaneously have the lowest (or near-lowest) RMSE for both training and test data. However at most complexity values, the best model on the training data is not the best model for the test data. The RMSE on the test data for models on the universal Pareto frontier is similar to the average RMSE over all sixteen Pareto frontiers generated using RMSE as the fitness metric, suggesting that simply appearing on the universal Pareto frontier is not an indicator of low prediction error.
There does appear to be an advantage to using the universal Pareto frontier. There is a large change in slope at a complexity value of 9, which is roughly the optimal complexity value. Beyond this point, the models get only slightly better even as they get significantly more complex. The entries on the universal Pareto frontier at a complexity of 9 and 11 are also two of the best-performing models on the test data (Fig. 6b). A parity plot of the performance of the model with complexity 9 (i.e., ω = . the LASSO solution for test, training, and validation data is provided in Fig. 7. A similar result was found when F b , rather than F ln( ) b , was used as the objective function (see Supplementary Fig. S5 and Table S6). There is a large change in slope on the universal Pareto frontier at a complexity of 10, and the model at this point on the frontier performs very well on the test data (see Supplementary Fig. S5).
It may be fortuitous that the models at the point where the slope changes on the Pareto frontier also happen to perform very well on the test data, as models at other complexities do not perform as well (Fig. 6b). In addition, it may not always be clear what constitutes a "large change in slope. " However this change in slope may be an indication of an optimal (or near-optimal) level of complexity. The relatively rapid decrease in the error with increasing complexity up to this point may be an indication that the error is decreasing primarily because of improving model skill. Similarly, the relatively slow decrease in error at high levels of complexity may be an indication that the error is decreasingly primarily by chance; i.e. the increasing size of the hypothesis space makes it easier to find a model that happens to do well on the training and validation data but generalizes poorly.
Predicting dielectric breakdown strength. Using the analysis in the previous section, we can now identify models that are likely to be useful as predictive models of dielectric breakdown strength. Here we report results on predicting the dielectric breakdown strength itself, rather than the natural logarithm of the breakdown strength. This choice effectively places greater importance on making accurate predictions for materials that have a high dielectric breakdown strength, and these are the materials that are often of the most technological interest. We have included results for the natural logarithm of the dielectric breakdown strength in the supporting information ( Supplementary Fig. S8 and Supplementary Table S7).
To identify the models that are likely to make accurate predictions, we have used the universal Pareto frontier approach, this time including all data (i.e., training + validation + test data). For comparison, we have also included the LASSO solution, although we note that the LASSO solution was selected without considering the test set and was fit to the natural logarithm of the dielectric breakdown strength, so a direct comparison is not as straightforward as the comparisons in previous section. All models on this Pareto frontier, as well as their performance, are summarized in Table 2. The universal Pareto frontier (Fig. 8) gives a large slope change around complexity 8. Based on the analysis in the previous section, this suggests that models at about this level of complexity may have the greatest predictive power.
We have selected from the universal Pareto frontier three models with complexity around 8, labeled S1, S2, and S3 in Table 2, for further analysis. We have also included the LASSO solution for comparison, and label it S4. Parity plots of the values predicted by the models vs. the values predicted by DFT for these four models are provided in Supplementary Figs S9-S11. We emphasize that because the models S1, S2, and S3 were optimized for F b and the LASSO solution was optimized for F ln( ) b , these plots should not be used to compare the performance of genetic programming vs. LASSO (that comparison was discussed in the previous section). To better understand how each of these four model predicts how F b will change as a function of E g and ω max , we have created plots showing the predicted vs. DFT-calculated values as a function of E g and ω max (Fig. 9). These plots make it clear that although there is a dependence on E g and ω max , these variables are not sufficient for a complete description, as there are several pairs of materials in which both materials have similar values for E g and ω max but very different breakdown strengths.    Of particular interest are models S2 and S3, as these are arguably the best models on the entire set of data in terms of combined simplicity and accuracy. Models S2 and S3 include terms with . − E 14 6 g and E 15 g − in their denominators, respectively. These terms will have singularities for materials with band gaps of 14.6 or 15 eV and will become negative for materials with larger band gaps. Fortunately, apart from condensed noble gasses, there is no known material with a band gap larger than that of LiF 56,57 , which has a calculated band gap of 13.6 eV. Thus the upper bound on the band gaps in these models corresponds well to the known upper limit in nature for band gaps of materials of the type considered here. Models with similar form showed up on the universal Pareto frontier generated using just training and validation data (Supplementary Table S6), and their persistence when test data was included suggests they have true predictive power. If we generate a universal Pareto frontier using only models discovered using the training and validation data, but with error (the y-axis) evaluated against all data, the models with forms similar to S2 and S3 have the best performance ( Supplementary Fig. S12 and Supplementary Table S8).
Models S2 and S3, and similar models, are also of interest because they exist outside of the hypothesis space that was searched by the LASSO algorithm 48 . This highlights a problem with methods that attempt to enumerate all possible solutions in the hypothesis space: the combinatorial space of even relatively simple functions is very large and difficult to comprehensively enumerate. The advantage of an approach such as genetic programming is that it can effectively search this hypothesis space without the need to explore the entire space; it naturally focuses on the regions of the space with the most promising (i.e. "fit") solutions. As discussed in this paper, care must be taken to avoid overfitting the training data, but that will be a problem with any algorithm that searches a similarly-sized hypothesis space, including LASSO.

Conclusion
In this paper, we demonstrated that genetic programming is an effective way to search a large hypothesis space of simple functions of known material properties. For the specific property of the dielectric breakdown strength of materials, we identified a new family of models based on ~E ( 15eV ) g max 1 ω − − that performed well on the training and validation data and then again on the test data. Our results indicate that there is a substantial risk to overfitting the training and validation data, both with genetic programming and with the LASSO approach. We explored different techniques to mitigate this risk and facilitate the use of genetic programming to discover models with good predictive power. The more effective of these appears to be finding the model(s) at or near the point at which the Pareto frontier starts to level off. It can be helpful to consider the number of times a model shows up in repeated genetic programming runs, but this approach appears to be less reliable in identifying models with good predictive power. We believe further exploration of these and related approaches will make genetic programming a more practically useful tool for researchers. There are a number of additional potential areas for improvement, including how to best define "complexity" and how to best partition the known data. Despite the room for further improvement, the relative success of the genetic programming approach in identifying simple models of dielectric breakdown strength provides additional evidence that it is a valuable tool for descriptor identification in materials science and engineering.