Abstract
Materials scientists increasingly employ machine or statistical learning (SL) techniques to accelerate materials discovery and design. Such pursuits benefit from pooling training data across, and thus being able to generalize predictions over, k-nary compounds of diverse chemistries and structures. This work presents a SL framework that addresses challenges in materials science applications, where datasets are diverse but of modest size, and extreme values are often of interest. Our advances include the application of power or Hölder means to construct descriptors that generalize over chemistry and crystal structure, and the incorporation of multivariate local regression within a gradient boosting framework. The approach is demonstrated by developing SL models to predict bulk and shear moduli (K and G, respectively) for polycrystalline inorganic compounds, using 1,940 compounds from a growing database of calculated elastic moduli for metals, semiconductors and insulators. The usefulness of the models is illustrated by screening for superhard materials.
Introduction
In recent years, first-principles methods for calculating properties of inorganic compounds have advanced to the point that it is now possible, for a wide range of chemistries, to predict many properties of a material before it is synthesized in the lab^{1}. This achievement has spurred the use of high-throughput computing techniques^{2,3,4,5} as an engine for the rapid development of extensive databases of calculated material properties^{6,7,8,9,10,11,12}. Such databases create new opportunities for computationally assisted materials discovery and design, providing custom-tailored solutions for a diverse range of engineering applications. But even with current and near-term computing resources, high-throughput techniques can only analyze a fraction of all possible compositions and crystal structures. Thus, statistical learning (SL), or machine learning, offers an express lane to further accelerate materials discovery and inverse design^{2,5,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27}. As statistical learning techniques advance, increasingly general models will allow us to quickly screen materials over broader design spaces and intelligently prioritize the high-throughput analysis of the most promising material candidates.
One encounters several challenges when applying SL to materials science problems. Although many elemental properties are available, we typically do not know how to construct optimal descriptors for each property, over a variable number of constituent elements. For instance, if one believes that some average of atomic radii is an important descriptor, there are many different averages, let alone possible weighting schemes, that one might investigate. This challenge may be reduced by placing restrictions on the number of constituent elements or types of chemistries or structures considered, but such restrictions reduce the generalizability of the learned predictor. Materials science datasets are often also smaller than those available in domains where SL has an established history. This requires that SL be applied with significant care in order to prevent overfitting the model. Overfitting leads to predictions that are less generalizable to new data than anticipated^{28}, such that predictions are less accurate than expected. At the same time, smaller datasets challenge us to use the available data as wisely as possible. This may include leveraging observations related to the smoothness of the underlying physical phenomenon, and the use of an appropriate risk criterion, rather than partitioning the available data into distinct training and test sets. For SL to have the greatest impact on materials discovery and design, we must pursue techniques that make maximal use of the available data. This requires approaches that are capable of systematically pooling training data across, and are thus capable of generalizing predictions over, k-nary compounds of diverse chemistries and structures.
The successful application of SL requires the selection of an appropriate set of descriptor candidates. In materials science problems, the candidates must be capable of both “uniquely characterizing”^{22} a diverse range of compounds, and sufficiently explaining the diversity of the phenomenon being learned. Thus, the selection of descriptor candidates is a crucial and active field of investigation within materials science (e.g. refs 14, 22 and 23), as the field endeavors to develop general models with high predictive accuracy. Previous work in materials science has included both categorical descriptors (e.g. refs 23 and 24) and continuous descriptors (e.g. refs 22 and 23). Although both types of descriptors may be legitimately used in SL, special care should be taken when using categorical descriptors, as each such descriptor essentially (i.e., unless there is sufficient smoothing across cells) partitions the space of compounds into disjoint cells, which quickly increases the degrees of freedom and thus the risk of overfitting the model.
SL applications should always include descriptor candidates suggested by known, scientifically relevant relationships^{22,23}. But in order to construct models that accurately generalize across diverse datasets, such candidates will typically need to be augmented with additional descriptor candidates, capable of bridging across the simplifying assumptions that divide less generalizable models. Without these additional candidates, attempts to learn more general models will be stifled, as it will be impossible to discover new, unexpected relationships. Here we introduce the use of Hölder means, also known as generalized or power means, as an ordered approach to explicitly constructing descriptor candidates from variable length numeric lists. Hölder means describe a family of means that range from the minimum to maximum functions, and include the harmonic, geometric, arithmetic, and quadratic means^{29}. This paper advances previous work by constructing descriptor candidates as Hölder means, which, to the best of our knowledge, has not previously been done in the field of materials science.
Having discussed the construction of descriptor candidates, we now introduce gradient boosting machine local polynomial regression (GBMLocfit), which is a SL technique that we developed to leverage the available data as wisely as possible. Energy minimization problems often enforce smoothness in the functions mapping useful descriptors to outcomes. Statistical learning techniques may exploit such smoothness, when present, in order to produce models that are as accurate as possible for a fixed amount of training data; such considerations are more important when working with smaller training datasets than with larger datasets. GBMLocfit utilizes multivariate local polynomial regression, as implemented in Locfit^{30}, within a gradient boosting machine (GBM) framework^{31}. Local polynomial regression performs a series of weighted regressions within a moving window, with a weight function that gives greatest weight to observations near the center of the window, producing a smooth curve that runs through the middle of the observations^{32,33}. GBM uses a gradient descent algorithm to iteratively assemble a predictor while minimizing an appropriate, typically squared error, loss function^{31}. Our approach enforces more smoothness in the functions mapping descriptors to outcomes than traditional tree-based GBM methods, which we suggest is appropriate for this and many other materials science problems. Additionally, the enforced smoothness helps minimize boundary bias (i.e., when the solution is flat over some peripheral region of the space of descriptors), which can be problematic with tree-based techniques when the data has sparsely populated tails. We believe GBMLocfit will be advantageous for many materials science problems where datasets are of modest size, the underlying physical phenomenon is reasonably smooth and sparse regions have been carefully studied and are of particular interest.
To illustrate both our GBMLocfit approach and the use of descriptor candidates constructed as Hölder means, we predict the elastic bulk and shear moduli (K and G, respectively) of k-nary inorganic polycrystalline compounds. These moduli govern the stress-strain relations of isotropic materials within the linear-elastic range and are central to governing the mechanical behavior of materials in diverse contexts spanning geophysics to structural engineering. In addition, elastic constants are known to correlate with a wide range of other materials properties, including ductility and hardness^{34,35,36,37} and thermal conductivity^{38,39,40}. Further, the single-crystal elastic constants are a direct measure of the bonding strength and directionality in a material, and are thus widely employed in the development of theoretical models for interatomic forces. Due to the importance of these properties, extensive efforts have been devoted to developing theoretical models of elastic moduli, relating their magnitude to structural and electronic properties such as atomic density, coordination, cohesive energy and Fermi energy^{27,41,42,43,44,45,46}. But all of these models consider specific subsets of chemistries or structures, limiting their use for predicting the elastic properties of a diverse range of materials. A recent investigation employing nonparametric regression^{24} and categorical descriptors considered elastic constants for a diverse range of materials, but the results fail to generalize to new data (see Supplementary Figures S6 and S7). In this paper, we demonstrate the application of the SL framework described above to develop broadly applicable models for K and G, expressed in terms of a few descriptors that are either currently tabulated or easily computed.
We demonstrate how such models can be used to enable materials discovery, by screening the hardness of over 30,000 compounds to identify superhard inorganic compounds that are thermodynamically stable or weakly metastable at a temperature of 0 K. Our training dataset consists of 1,940 inorganic compounds from the Materials Project’s growing database of elastic constants constructed using first-principles, quantum mechanical calculations based on Density Functional Theory (DFT)^{12}.
The outline of this paper is as follows. In the methods section, we detail our SL framework, which includes safeguards against overfitting our models. Then the predictive models for the elastic moduli are described in the results section, including an overview of the descriptors and the prediction accuracy. In addition, we present a screening process for superhard materials and present a DFT validation. In the discussion section, we examine known issues with the accuracy of the predictions and conclude with a summary of the main advances presented in this work.
Methods
We begin our methods section with some background on local polynomial regression, which was introduced to the statistics literature by Stone^{32} and Cleveland^{33}. Loader^{30} provides a general, yet thorough discussion of local regression, as well as implementation details of the Locfit software. Simply put, multivariate local regression produces a smooth surface that runs nominally through the middle of the observations. Thus, given response variables, y, and predictor variables, x, Locfit estimates a regression function, η, which relates these quantities, where ε is noise (assumed to be independent and identically distributed with zero mean and finite variance):
$$y_{i} = \eta(x_{i}) + \epsilon_{i} \qquad (1)$$
Globally, no strong assumptions are made concerning the form of η, the underlying function being estimated. Locally, at each fitting point, the underlying function is assumed to be smooth, such that Taylor’s theorem allows the behavior to be described with a low order polynomial. Specifically, a once-differentiable function can be approximated locally by a linear function, and more generally, a k-times differentiable function can be approximated locally by a kth-order polynomial. In order to make local estimates of η, one must select a bandwidth, or smoothing window, and an appropriate smoothing kernel or weight function. Appropriate weight functions, such as Locfit’s default tricubic weight function, give greatest weight to observations near the center of the smoothing window and zero weight to observations outside the window. The local estimate of η, at each fitting point, is the intercept of the local regression centered at the fitting point, and these local estimates combine to produce a smooth estimate of the underlying function. We are interested in estimating smooth functions, because energy minimization problems often enforce smoothness in the functions mapping useful descriptors to outcomes.
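As a worked illustration of the local fit described above, a local linear estimate at a fitting point x can be written as a weighted least-squares problem. The symbols a_0, a_1, h and W are notation introduced here for illustration only, and the fixed bandwidth h is a simplifying assumption; Locfit's own bandwidth selection is more general.

```latex
(\hat{a}_0, \hat{a}_1) = \arg\min_{a_0,\, a_1} \sum_{i} W\!\left( \frac{\lVert x_i - x \rVert}{h} \right) \left( y_i - a_0 - a_1^{\top} (x_i - x) \right)^2 ,
\qquad \hat{\eta}(x) = \hat{a}_0

W(u) = \begin{cases} \left( 1 - |u|^3 \right)^3 , & |u| < 1 \\ 0 , & \text{otherwise} \end{cases}
```

Because the regression is centered at x, the fitted intercept is the estimate of η at that point, and the tricube weight W vanishes for observations outside the smoothing window.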
In this paper, we distinguish between composition and structural descriptors. Composition descriptors, such as average atomic radius and weight, are calculated from elemental properties and only require knowledge of a compound’s composition. Structural descriptors, such as cohesive energy and volume per atom, require knowledge of a compound’s specific structure (in addition to composition), and may be determined experimentally or calculated using DFT. We seek composition descriptors that generalize over knary compounds, but do not have a priori knowledge of how to combine the various elemental properties to construct descriptors that are optimal for our specific, yet very general problem. Thus we construct composition descriptors as a series of weighted Hölder means, rely upon Locfit to capture any necessary nonlinearities, and rely upon model selection techniques and our GBM framework to select which descriptors are most useful at each iteration, and for each specific problem. Because GBM implements a version of the least absolute shrinkage and selection operator (LASSO)^{47}, our approach has similarities to the statistical learning approach advocated by Ghiringhelli et al.^{22}, but is less reliant upon sufficient, a priori, scientific insight and may thus be applied to more general problems.
$$\mu_{p}(x) = \left( \frac{\sum_{i} w_{i} x_{i}^{p}}{\sum_{i} w_{i}} \right)^{1/p} \qquad (2)$$
$$\mu_{0}(x) = \exp\left( \frac{\sum_{i} w_{i} \ln x_{i}}{\sum_{i} w_{i}} \right) \qquad (3)$$
In equation (2), μ_{p}(x) represents the Hölder mean^{29} μ, to the power p, of the property x, taken over the values x_{i}, with associated weights w_{i}. Equation (3) gives the Hölder mean when p equals zero. Hölder means describe a family of generalized means that range from the minimum function (when p = −∞) to the maximum function (p = ∞). An example would be calculating the cubic mean (p = 3) of atomic radii over all constituent elements in a particular composition, where the weights would be the molar quantities of each element.
In this work, we only consider the Hölder means with integer power values between negative and positive four, which include the well known harmonic mean (p = −1), geometric mean (p = 0), arithmetic mean (p = 1), and quadratic or Euclidean mean (p = 2). We construct these Hölder-based composition descriptors for each of eight elemental properties listed in Table 1 (upper), using elemental properties from pymatgen^{48}. We also consider the structural descriptors listed in Table 1 (lower); most of these descriptors are obtained directly or post-processed from a single density functional theory (DFT) calculation per compound. The cohesive energy per atom, E_{c}, is estimated from DFT by subtracting the atom-in-a-box energies of the constituent elements from the formation energy of each compound. Following a Voronoi tessellation^{49} of each unit cell, atomic coordinations, nearest-neighbor bond lengths, and bond angles between adjacent neighbors are calculated for each site. Additional structural descriptors are then constructed as Hölder means of these quantities over all sites; please see Supplementary Table SI for a full list of all investigated descriptors.
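The construction of Hölder-mean descriptors over the integer powers −4 to 4 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `holder_mean` and `holder_descriptors` are hypothetical, the property values are made up rather than taken from pymatgen, and the sketch assumes strictly positive property values so that negative powers and the logarithm are defined.

```python
import math


def holder_mean(values, weights, p):
    """Weighted Hölder (power) mean of positive values for exponent p."""
    if len(values) != len(weights) or not values:
        raise ValueError("values and weights must be non-empty and of equal length")
    if any(v <= 0 for v in values):
        raise ValueError("this sketch assumes strictly positive property values")
    total = sum(weights)
    if p == 0:
        # Limit p -> 0: the weighted geometric mean.
        return math.exp(sum(w * math.log(v) for v, w in zip(values, weights)) / total)
    # General case: weighted mean of x**p, raised to the power 1/p.
    return (sum(w * v ** p for v, w in zip(values, weights)) / total) ** (1.0 / p)


def holder_descriptors(values, weights, powers=range(-4, 5)):
    """Return {p: mu_p} for the integer powers considered in the text (-4 to 4)."""
    return {p: holder_mean(values, weights, p) for p in powers}


# Hypothetical ternary compound A2BC3: made-up elemental property values,
# weighted by the molar amount of each element.
property_values = [1.2, 0.8, 1.5]
molar_amounts = [2, 1, 3]
descriptors = holder_descriptors(property_values, molar_amounts)
```

Each elemental property then contributes nine descriptor candidates, ordered from means that emphasize the smallest values (p = −4) to means that emphasize the largest (p = 4), which leaves the choice among them to the model selection step.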
Our GBMLocfit implementation uses established model selection techniques, including 10-fold cross-validation and a conservative risk criterion, to determine which descriptors are the most useful for predicting K and G, without overfitting the training data. The GBM framework iteratively assembles a predictor, P, as a sum of Locfit smoothed weak learners, η_{i}. At each iteration, GBM selects the smoothed weak learner candidate that leads to the greatest reduction in the size of the residual, y − P, but attenuates each weak learner by the learning rate, λ, as shown in equation (4). Both of our final models use a learning rate of 5% and limit the level of interaction to 3 descriptors, D_{j}, meaning that Locfit is only run with three descriptors at a time, to create each smoothed weak learner candidate.
$$P = \sum_{i} \lambda \, \eta_{i}(D_{j}) \qquad (4)$$
Although it is possible that the number of iterations required to achieve a given prediction error could be reduced by tuning Locfit’s smoothing parameters, we have opted to use Locfit’s default smoothing parameters and rely on the GBM method to provide an appropriate amount of flexibility to the fitted model. The one exception is that Locfit’s degree of local polynomial is set to linear for all models, rather than using the default setting of quadratic.
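The boosting loop described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the GBMLocfit implementation: it swaps in a univariate local linear smoother with a fixed bandwidth `h` in place of Locfit's multivariate fit over up to three descriptors, it evaluates candidates on the training data rather than by cross-validation, and the names `tricube`, `local_linear` and `boost` are hypothetical.

```python
def tricube(u):
    """Tricube kernel: weight 1 at u = 0, falling to 0 for |u| >= 1."""
    a = abs(u)
    return (1.0 - a ** 3) ** 3 if a < 1.0 else 0.0


def local_linear(xs, rs, x0, h):
    """Intercept of a tricube-weighted linear fit of rs on xs, centered at x0.

    x0 is assumed to be one of the training points, so the total weight is >= 1.
    """
    d = [x - x0 for x in xs]
    w = [tricube(di / h) for di in d]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * ri for wi, ri in zip(w, rs))
    t1 = sum(wi * di * ri for wi, di, ri in zip(w, d, rs))
    det = s0 * s2 - s1 * s1
    if abs(det) < 1e-12:
        return t0 / s0  # too few distinct points: fall back to a local average
    return (s2 * t0 - s1 * t1) / det


def boost(columns, y, n_iter, rate=0.05, h=1.0):
    """Boost smoothed weak learners; returns (chosen descriptor indices, fitted values)."""
    fitted = [sum(y) / len(y)] * len(y)  # start from the mean response
    chosen = []
    for _ in range(n_iter):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        best_sse, best_j, best_pred = None, None, None
        for j, col in enumerate(columns):
            # Candidate weak learner: smooth the current residuals on descriptor j.
            pred = [local_linear(col, resid, x0, h) for x0 in col]
            sse = sum((r - p) ** 2 for r, p in zip(resid, pred))
            if best_sse is None or sse < best_sse:
                best_sse, best_j, best_pred = sse, j, pred
        # Add the best candidate, attenuated by the learning rate.
        fitted = [f + rate * p for f, p in zip(fitted, best_pred)]
        chosen.append(best_j)
    return chosen, fitted


# Synthetic data: the response depends on the first descriptor only.
x_a = [0.5 * i for i in range(10)]
x_b = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0]
y = [2.0 * x for x in x_a]
chosen, fitted = boost([x_a, x_b], y, n_iter=50, h=1.2)
```

With a 5% learning rate each iteration removes only a small share of the remaining residual, which is why the number of iterations, rather than the smoothing parameters, controls the flexibility of the fitted model.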
Our GBMLocfit implementation utilizes all of the available data for training and relies on a conservative risk criterion to limit the number of iterations, rather than an explicitly partitioned test dataset, to avoid overfitting the model to the data. As summarized graphically in Fig. 1, we perform 10-fold cross-validation (CV), using 90% of the data to select the weak learner that minimizes the squared error loss function and the remaining 10% of the data to estimate the mean and standard deviation of the out-of-sample squared errors, for each iteration and fold. After this process is completed for each fold and for a large number of iterations, the prediction errors are calculated as the mean (over folds) out-of-sample squared error for each iteration, and the standard errors of the prediction errors are estimated from the standard deviations of the out-of-sample squared errors for each iteration. The risk criterion determines the iteration threshold as the first iteration for which the prediction mean squared error (MSE) is less than the sum of the minimum prediction MSE (over all iterations) and the standard error of the prediction MSE at that minimum^{50}, which has been shown to be conservative^{51}. Please see Supplementary Fig. S1 for example performance curves. A more commonly used, but unconservative risk criterion is to simply establish the iteration threshold as the minimum prediction MSE (without adding one standard error), but this seems overly optimistic, particularly when the sample size is small (relative to the number of descriptor candidates) or the prediction MSE curve lacks a distinct minimum. After the iteration threshold is determined, a final GBMLocfit model is fit using 100% of the data, but limiting the number of iterations to the previously established iteration threshold, to avoid overfitting the model to the data.
By limiting the number of GBM iterations, we inherently limit the number of weak learners, since each iteration adds one weak learner term to the predictor, as in equation (4).
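The conservative risk criterion described above amounts to a one-standard-error rule applied to the cross-validated error curve. A minimal sketch follows, assuming the per-iteration prediction MSE and its standard error have already been computed; the function name `iteration_threshold` and the numbers are hypothetical and are not results from this work.

```python
def iteration_threshold(mse, se):
    """Return the 1-based iteration chosen by the one-standard-error rule.

    mse[i] and se[i] are the cross-validated prediction MSE and its standard
    error after i + 1 boosting iterations.
    """
    if len(mse) != len(se) or not mse:
        raise ValueError("mse and se must be non-empty and of equal length")
    best = min(range(len(mse)), key=lambda i: mse[i])
    cutoff = mse[best] + se[best]
    for i, value in enumerate(mse):
        if value < cutoff:
            return i + 1  # first iteration within one standard error of the minimum
    return best + 1  # only reached when the standard error at the minimum is zero


# Made-up error curve for illustration.
cv_mse = [1.00, 0.60, 0.46, 0.43, 0.40, 0.42]
cv_se = [0.08, 0.06, 0.05, 0.05, 0.05, 0.05]
chosen = iteration_threshold(cv_mse, cv_se)
minimum = cv_mse.index(min(cv_mse)) + 1
```

In this made-up curve the rule stops one iteration earlier than the minimum-MSE criterion would, trading a statistically insignificant gain in estimated accuracy for a model with fewer weak learners.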
Results
We demonstrate GBMLocfit by learning the Voigt-Reuss-Hill (VRH) averages^{52} of the elastic bulk and shear moduli (K and G, respectively) which characterize polycrystalline compounds. More specifically, we learn log(K) and log(G) to avoid having the squared error loss function severely overweight the higher moduli materials. We present our predictions graphically for K and G in Fig. 2, by comparing the VRH moduli from our DFT training set^{12} with those learned by our GBMLocfit method.
Our best four-descriptor models for log(K) and log(G) are summarized in Table 2. None of our models with more than four descriptors have significantly better predictive accuracy than these four-descriptor models, based on comparisons of prediction mean squared error and their associated standard errors. And all of our models with fewer than four descriptors have significantly less predictive accuracy.
For both K and G, the structural descriptors log(V), log of volume per atom, and E_{c}, cohesive energy per atom, are very important, with a combined relative influence of 66.0% for K and 72.9% for G. Notably, these two descriptors are more useful for predicting K and G than any of the Voronoi-based structural descriptors, which were constructed to capture individual attributes of the local environments. Yet the usefulness of these two information-rich descriptors is not surprising, since log(V) and E_{c} incorporate information regarding the local environments, including coordination, bond angles, and bond lengths.
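For modulus X, relative error is defined as:
$$\text{relative error} = \frac{\left| X_{\mathrm{GBM}} - X_{\mathrm{DFT}} \right|}{X_{\mathrm{DFT}}} \qquad (5)$$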
Over half of our predictions have a relative error of less than 10% for K, and less than 20% for G, as shown in Table 3.
Figures 3 and 4 show marginal predictor, or partial dependence, plots for K and G, which provide a one-dimensional summary of the effect of each descriptor on the overall prediction. Marginal predictor summaries account for the effects of the other descriptors^{28}, so any correlation between descriptors reduces the predictive influence of one or more of the correlated descriptors. The marginal predictor plots indicate an inverse, nearly linear relationship between log(V) and log(K), and log(V) and log(G). This agrees with previous findings for K^{53,54,55,56}, but our results support that this relationship generalizes beyond the specific material classes previously studied, and also applies approximately to G. Additionally, the marginal predictor plots indicate an inverse, gently nonlinear relationship between E_{c} and log(K), and E_{c} and log(G). Thus, our models indicate compounds with high K and G are generally densely packed (low log(V)) and strongly bonded (high E_{c}), which agrees with both previous findings^{53,54,55,56} and physical intuition. Furthermore, the strong influence of log(V) and E_{c} on K and G, combined with their similar marginal predictor plots, underscores the strong correlation between the two moduli.
Although E_{c} is an important predictor of K, the composition descriptor μ_{1}(R_{n}), arithmetic mean of elemental row number, ranks as the second most influential descriptor of K. The marginal predictor plot for μ_{1}(R_{n}) shows a roughly quadratic relationship with log(K), indicating that compounds with a higher arithmetic average of row number generally have a higher K. The final descriptor for K, with a comparatively small relative influence in our model, is μ_{−4}(X), the quartic-harmonic mean of elemental electronegativity, whose marginal predictor plot indicates that compounds with low average electronegativity generally have a lower K. Although this electronegativity descriptor has a small relative influence and fairly weak partial dependence, these are both after accounting for the influence of the other descriptors. The Spearman’s rank correlation between μ_{−4}(X) and K is approximately 0.50, which is a moderately strong correlation, as evident in Fig. 3(d). Hölder means of R_{n} and X also complete the set of top four descriptors for G, although for G the most useful Hölder means are μ_{−3}(R_{n}), the cubic-harmonic mean of elemental row number, and μ_{4}(X), the quartic mean of elemental electronegativity.
The influence of log(V) suggests that it may be possible to develop higher moduli materials from existing compounds by filling interstitial sites (to decrease average volume per atom). But the influence of mean elemental row number will at least partially offset the possible improvement, since elements that could be added to interstitial sites (with minimal disruption of the structure) will generally be smaller than the neighboring elements.
Screening for superhard materials
As an example illustration, we use the SL model to screen for superhard materials. More details and analyses resulting from this application will be presented in a forthcoming article. Here we focus on the main results to illustrate the utility of such a SL model.
The SL predictors developed in this work allow the rapid estimation of K and G for thousands of compounds, for which the required descriptors may be easily calculated. Additionally, with appropriate caution, these predictors may be plugged into other relationships, to estimate additional material properties that can be expressed as functions of K and G. As an example, we estimate Vickers hardness, and then screen for superhard materials, defined as those having a hardness exceeding 40 GPa^{57}. Vickers hardness is estimated using a recently published model, that has shown good agreement with experimental measurements for both cubic and noncubic materials^{35,58,59}:
In general, one should exercise caution when plugging results from one or more statistical models into other equations, without an understanding of the error associated with each model and the effect of such errors upon the new outcome. In our case, because the residuals of our SL models for log(K) and log(G) are positively correlated, the plug-in approach should be reasonably accurate when used to generate a relative ranking of hardness. Additionally, we are only using this approach for screening, to identify compounds for more thorough investigation via our DFT workflow.
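The plug-in screening step can be sketched as follows. The hardness expression used in this work is not reproduced in the text above, so this sketch assumes, purely for illustration, the widely used empirical form attributed to Chen and co-workers, H = 2(k²G)^0.585 − 3 with k = G/K and all quantities in GPa. The function names and the moduli below are hypothetical and are not values from this work.

```python
def vickers_hardness(K, G):
    """Assumed Chen-type estimate (GPa): H = 2 * (k**2 * G)**0.585 - 3, k = G / K."""
    k = G / K
    return 2.0 * (k * k * G) ** 0.585 - 3.0


def screen_superhard(moduli, threshold=40.0):
    """moduli maps name -> (K, G) in GPa.

    Returns (name, hardness) pairs above the threshold, hardest first.
    """
    scored = [(name, vickers_hardness(K, G)) for name, (K, G) in moduli.items()]
    hits = [item for item in scored if item[1] > threshold]
    return sorted(hits, key=lambda item: item[1], reverse=True)


# Hypothetical predicted moduli in GPa; not taken from the paper.
predicted = {
    "stiff_candidate": (400.0, 500.0),
    "soft_candidate": (200.0, 100.0),
}
hits = screen_superhard(predicted)
```

Since the SL models predict log(K) and log(G), the predictions would first be transformed back to K and G before being passed to such a function, and the resulting ranking would be used only to prioritize candidates for DFT follow-up.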
The screening process for superhard materials starts by considering approximately 30,000 compounds that form a subset of the 66,000 compounds currently in the Materials Project. This subset contains only elements and compounds for which DFT calculations based on the generalized gradient approximation (GGA) are expected to be accurate. In particular, materials containing f-electrons or d-electrons that require beyond-DFT treatments (such as DFT + U) are not considered in the screening process. This represents a realistic materials-discovery scenario in which screening is carried out on a large number of compounds for which the desired physical property is unknown, either from experiments or calculations.
The SL model is used to estimate the hardness for the 30,000 compounds by employing equation (6). The resulting distribution of hardness values is shown in Supplementary Fig. S8. To further refine the SL predictions of hardness, DFT calculations are performed on the most promising candidates, as identified by the SL model. Consistent with experiments, our SL model identifies diamond as being the hardest compound (among the 30,000 materials considered in the screening), followed by (cubic) boron nitride, a well-known superhard compound^{57}. Both of these findings were confirmed by subsequent DFT calculations. Some other compounds that are predicted to be superhard or near-superhard according to the SL model and subsequent DFT calculations are Be_{2}C and the family of borides of the form XB_{2}, where X = Ti, Hf, Zr, Sc, Re, V, together with B_{4}C. These compounds are all known (near) superhard materials^{60,61}. Some compounds have been identified from the SL model (and confirmed by DFT) in this study as (near) superhard, but are not listed (to the best of our knowledge) in this context in the literature. Such compounds include: Mg(B_{6}C)_{2}, Sc_{2}CrB_{6} and Mg_{2}B_{24}C, all of which are known compounds that have been synthesized as part of previous investigations^{62,63,64}. Such compounds might provide an interesting starting point for future experimental investigations of superhardness. More details on our screening approach, the DFT calculations and the results will be presented in a forthcoming article.
The hardness screening illustrates the power of using SL models to quickly identify potentially interesting novel materials with target properties. However, the potential use of such SL models reaches far beyond the screening of compounds. In particular, inverse design can be performed in which materials that meet a desired requirement are designed computationally by combining a SL model with an optimization routine such as a genetic algorithm. Such methods may be applied not only to search for superhard materials, but also novel thermoelectrics, auxetic materials, photovoltaics or materials with high elastic stiffness, for example. With regard to identifying compounds with high elastic moduli, we note that in the screening process undertaken in this work, several classes of compounds with high K, G and K/G are identified using SL and confirmed by DFT. The ratio K/G is known as Pugh’s ratio and correlates with intrinsic ductility^{34}. As with the hardness, DFT calculations are performed on the compounds with the highest value for the property of interest as predicted by SL. The systems that are subsequently investigated by DFT because of promisingly high elastic moduli or K/G ratio are shown in Supplementary Table SIII. The top-performing candidates in terms of K, G and K/G are shown in Supplementary Tables SIV, SV and SVI, respectively. The elastic moduli predicted by both the SL model and the subsequent DFT calculations are tabulated. The calculated K and G for the systems in Supplementary Table SIII are shown graphically in Supplementary Figures S9 and S10, respectively.
Discussion
Discrepancies between the DFT and GBM moduli may be caused by (i) shortcomings in our K and G predictors or (ii) DFT methods-related errors and approximations, which add noise to the underlying physical phenomenon that we are trying to learn. More specifically, predictor shortcomings may include having insufficient training data in some important regions of the space of descriptors, having overlooked relevant descriptor candidates, and any difficulties our GBMLocfit regressions may have fully capturing the underlying physical behavior. The inorganic polycrystalline compounds in our training set include metallic, ionic, and covalent bonds, but 81% of the compounds qualify as metallic, based on having a DFT-calculated band gap of 0.2 eV or less. Although DFT-calculated band gaps are not entirely reliable, it seems unlikely that a more detailed analysis would result in a meaningfully different characterization of our training set. So while our goal is to predict K and G for a wide variety of inorganic polycrystalline compounds, regardless of bond details, we acknowledge that our sample is skewed towards metallic compounds. Additionally, we have excluded compounds with f-block elements from our training set, since fewer than 60 such compounds with elastic moduli were available from the Materials Project, and these were insufficient to capture the additional complexities associated with the bonding in such compounds. So although our learned models may be used to predict K and G for any k-nary compound, the reported accuracies may not generalize to compounds with f-block elements.
Our GBM prediction errors are larger for G than for K, particularly for compounds with elastically anisotropic crystalline structures. For these compounds, the Voigt and Reuss bounds are further apart, which leads to more uncertainty in the VRH average, so a descriptor of crystal anisotropy would likely improve the model’s predictive accuracy. Other cases where the prediction error is often larger include systems with local magnetic moments, e.g., transition-metal oxides and intermetallic compounds containing Cr, Fe, Co, Mn and Ni. Calculating K and G for such compounds using DFT is a challenge due to the degrees of freedom associated with magnetic ordering^{65}. On the other hand, the GBMLocfit models lack (explicit) descriptors to include magnetism and might therefore not be able to capture these features accurately. DFT is known to sometimes yield inaccurate cohesive energies, volume per atom and bulk moduli for late transition metals such as Ag and Au but also various metals such as Hg, Cd, Ga, Tl, Pb and Bi^{65}. This has been attributed to problems with the description of d-electron correlation^{66,67,68}, dispersion^{69}, relativistic effects^{70} and spin-orbit coupling in DFT. This is especially a problem in high-throughput DFT calculations, where it is impractical to tune all parameters to yield optimum results for each compound.
Summary and Conclusions
We have demonstrated our novel GBM-Locfit SL technique and descriptor candidates constructed as Hölder means by predicting the elastic bulk and shear moduli (K and G, respectively) of k-nary inorganic polycrystalline compounds. Our SL framework combines GBM-Locfit (multivariate local regression within a gradient boosting framework), 10-fold cross-validation with a conservative risk criterion, and a diverse set of composition and structural descriptors, which generalize over k-nary compounds. Thus, following a single DFT run to determine log(V) and E_{c}, predictions for K and G may be made for any inorganic polycrystalline k-nary compound. In fact, the K and G predictors described here are already available on the Materials Project website for all materials for which the full elastic tensors have not yet been calculated via the DFT elastic constants workflow^{12}. Additionally, the Materials Project continues to run the DFT workflow on more compounds, so as the available training set grows, the predictive accuracy of subsequent SL models should improve and the reliable identification of additional predictive descriptors may be possible.
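The idea of placing local regression inside a gradient boosting loop can be sketched in a few lines. The code below is a deliberately simplified illustration, not the GBM-Locfit implementation used in this work: it assumes squared-error loss, uses a univariate local linear smoother with a tricube kernel as the base learner (the actual approach uses multivariate local regression), and picks one descriptor per boosting stage by training error. All function names and default parameter values are hypothetical.

```python
from typing import Dict, List, Sequence


def local_linear_fit(x: Sequence[float], r: Sequence[float], x0: float, k: int) -> float:
    """Tricube-weighted local linear fit of r on x, evaluated at x0."""
    dists = sorted(abs(xi - x0) for xi in x)
    # Bandwidth: distance to the k-th nearest neighbour, padded so it is never zero.
    h = dists[min(k, len(dists)) - 1] * (1.0 + 1e-9) + 1e-12
    w = [max(0.0, 1.0 - (abs(xi - x0) / h) ** 3) ** 3 for xi in x]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    mr = sum(wi * ri for wi, ri in zip(w, r)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    sxr = sum(wi * (xi - mx) * (ri - mr) for wi, xi, ri in zip(w, x, r))
    slope = sxr / sxx if sxx > 0.0 else 0.0
    return mr + slope * (x0 - mx)


def fit_boosted_local(X: List[List[float]], y: List[float],
                      n_iter: int = 50, shrinkage: float = 0.1, k: int = 10) -> Dict:
    """Gradient boosting (squared loss) with a local linear smoother as base learner."""
    cols = [[row[j] for row in X] for j in range(len(X[0]))]
    f0 = sum(y) / len(y)              # initial model: the mean response
    pred = [f0] * len(y)
    stages = []                       # (descriptor index, residuals fitted at that stage)
    for _ in range(n_iter):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        best = None
        for j, col in enumerate(cols):
            fit = [local_linear_fit(col, resid, xi, k) for xi in col]
            sse = sum((ri - fi) ** 2 for ri, fi in zip(resid, fit))
            if best is None or sse < best[0]:
                best = (sse, j, fit)
        _, j, fit = best
        # Shrunken update: add only a fraction of the fitted residual smoother.
        pred = [pi + shrinkage * fi for pi, fi in zip(pred, fit)]
        stages.append((j, resid))
    return {"f0": f0, "shrinkage": shrinkage, "k": k, "cols": cols, "stages": stages}


def predict(model: Dict, x_new: Sequence[float]) -> float:
    """Sum the initial mean and the shrunken contribution of every stage."""
    out = model["f0"]
    for j, resid in model["stages"]:
        out += model["shrinkage"] * local_linear_fit(
            model["cols"][j], resid, x_new[j], model["k"])
    return out
```

In this sketch the number of stages is fixed by the caller; in practice it would be chosen by cross-validation, as in the framework described above. Because each stage is a smooth local fit rather than a piecewise-constant tree, the summed model remains smooth in the descriptors.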
More generally, our SL framework dovetails with the extensive materials properties databases made possible by high-throughput computing techniques. Such databases provide training data for new SL models, which facilitate efficient screening of broader design spaces, which then help focus subsequent high-throughput runs on the most promising candidates.
We believe that our GBM-Locfit approach will be advantageous for many materials science problems in which the functions mapping descriptors to outcomes are smooth, as is common in problems governed by energy minimization. Although we have introduced descriptors constructed as Hölder means and GBM-Locfit together, they are independent advancements; descriptors constructed as Hölder means may be used with any SL technique. Hölder means provide an ordered approach to constructing a set of descriptor candidates from variable-length numeric lists, and should prove useful for a variety of SL problems.
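As a minimal illustration of how Hölder means turn a variable-length list into a fixed-length set of descriptor candidates, the sketch below computes the weighted power mean for a range of exponents. The function names and the default exponent grid are assumptions made for this example, and the inputs are restricted to positive values so that negative exponents and the geometric-mean limit (p = 0) are well defined.

```python
import math
from typing import Dict, Optional, Sequence


def holder_mean(values: Sequence[float], p: float,
                weights: Optional[Sequence[float]] = None) -> float:
    """Weighted power (Hölder) mean of positive values with exponent p."""
    if not values:
        raise ValueError("values must be non-empty")
    if weights is None:
        weights = [1.0] * len(values)
    if len(weights) != len(values):
        raise ValueError("values and weights must have the same length")
    if any(v <= 0.0 for v in values):
        raise ValueError("this sketch assumes strictly positive values")
    total = float(sum(weights))
    w = [wi / total for wi in weights]        # normalise weights to sum to 1
    if p == 0:
        # Limit p -> 0 is the weighted geometric mean.
        return math.exp(sum(wi * math.log(v) for wi, v in zip(w, values)))
    return sum(wi * v ** p for wi, v in zip(w, values)) ** (1.0 / p)


def holder_descriptors(values: Sequence[float],
                       weights: Optional[Sequence[float]] = None,
                       exponents: Sequence[int] = (-4, -3, -2, -1, 0, 1, 2, 3, 4)
                       ) -> Dict[int, float]:
    """One descriptor candidate per exponent, whatever the length of the input list."""
    return {p: holder_mean(values, p, weights) for p in exponents}
```

For the list [1, 4] with equal weights, the harmonic (p = -1), geometric (p = 0) and arithmetic (p = 1) means are 1.6, 2.0 and 2.5. A binary and a quaternary compound both yield the same number of descriptors, which is what allows a single model to generalize over k-nary compounds; weights could, for instance, be taken as element fractions.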
Additional Information
How to cite this article: De Jong, M. et al. A Statistical Learning Framework for Materials Science: Application to Elastic Moduli of k-nary Inorganic Polycrystalline Compounds. Sci. Rep. 6, 34256; doi: 10.1038/srep34256 (2016).
References
 1
Jain, A., Shin, Y. & Persson, K. A. Computational predictions of energy materials using density functional theory. Nature Reviews Materials 1, 15004 (2016).
 2
Morgan, D., Ceder, G. & Curtarolo, S. High-throughput and data mining with ab initio methods. Measurement Science and Technology 16, 296 (2004).
 3
Curtarolo, S. et al. AFLOW: an automatic framework for high-throughput materials discovery. Computational Materials Science 58, 218–226 (2012).
 4
Curtarolo, S. et al. The high-throughput highway to computational materials design. Nature Materials 12, 191–201 (2013).
 5
Carrete, J., Li, W., Mingo, N., Wang, S. & Curtarolo, S. Finding unprecedentedly low-thermal-conductivity half-Heusler semiconductors via high-throughput materials modeling. Physical Review X 4, 011019 (2014).
 6
Curtarolo, S. et al. AFLOWLIB.ORG: A distributed materials properties repository from high-throughput ab initio calculations. Computational Materials Science 58, 227–235 (2012).
 7
Jain, A. et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Materials 1, 011002 (2013).
 8
Ramakrishnan, R., Dral, P. O., Rupp, M. & von Lilienfeld, O. A. Quantum chemistry structures and properties of 134 kilo molecules. Scientific Data 1 (2014).
 9
Saal, J. E., Kirklin, S., Aykol, M., Meredig, B. & Wolverton, C. Materials design and discovery with high-throughput density functional theory: the Open Quantum Materials Database (OQMD). JOM 65, 1501–1509 (2013).
 10
de Jong, M., Chen, W., Geerlings, H., Asta, M. & Persson, K. A. A database to enable discovery and design of piezoelectric materials. Scientific Data 2 (2015).
 11
Pizzi, G., Cepellotti, A., Sabatini, R., Marzari, N. & Kozinsky, B. AiiDA: automated interactive infrastructure and database for computational science. Computational Materials Science 111, 218–230 (2016).
 12
de Jong, M. et al. Charting the complete elastic properties of inorganic crystalline compounds. Scientific Data 2 (2015).
 13
Curtarolo, S., Morgan, D., Persson, K., Rodgers, J. & Ceder, G. Predicting crystal structures with data mining of quantum calculations. Physical Review Letters 91, 135503 (2003).
 14
Mueller, T., Kusne, A. & Ramprasad, R. Machine learning in materials science: Recent progress and emerging applications. Rev. Comput. Chem (2015).
 15
Fischer, C. C., Tibbetts, K. J., Morgan, D. & Ceder, G. Predicting crystal structure by merging data mining with quantum mechanics. Nature Materials 5, 641–646 (2006).
 16
Rupp, M., Tkatchenko, A., Müller, K.-R. & von Lilienfeld, O. A. Fast and accurate modeling of molecular atomization energies with machine learning. Physical Review Letters 108, 058301 (2012).
 17
Snyder, J. C., Rupp, M., Hansen, K., Müller, K.-R. & Burke, K. Finding density functionals with machine learning. Physical Review Letters 108, 253002 (2012).
 18
Hansen, K. et al. Assessment and validation of machine learning methods for predicting molecular atomization energies. Journal of Chemical Theory and Computation 9, 3404–3419 (2013).
 19
Meredig, B. et al. Combinatorial screening for new materials in unconstrained composition space with machine learning. Physical Review B 89, 094104 (2014).
 20
Meredig, B. & Wolverton, C. Dissolving the periodic table in cubic zirconia: Data mining to discover chemical trends. Chemistry of Materials 26, 1985–1991 (2014).
 21
Faber, F., Lindmaa, A., von Lilienfeld, O. A. & Armiento, R. Crystal structure representations for machine learning models of formation energies. International Journal of Quantum Chemistry (2015).
 22
Ghiringhelli, L. M., Vybiral, J., Levchenko, S. V., Draxl, C. & Scheffler, M. Big data of materials science: Critical role of the descriptor. Physical Review Letters 114, 105503 (2015).
 23
Seko, A. et al. Prediction of low-thermal-conductivity compounds with first-principles anharmonic lattice-dynamics calculations and Bayesian optimization. Physical Review Letters 115, 205901 (2015).
 24
Calfa, B. A. & Kitchin, J. R. Property prediction of crystalline solids from composition and crystal structure. AIChE Journal (2016).
 25
Isayev, O. et al. Materials cartography: Representing and mining materials space using structural and electronic fingerprints. Chemistry of Materials 27, 735–743 (2015).
 26
Li, Z., Kermode, J. R. & De Vita, A. Molecular dynamics with on-the-fly machine learning of quantum-mechanical forces. Physical Review Letters 114, 096405 (2015).
 27
Balachandran, P. V., Xue, D., Theiler, J., Hogden, J. & Lookman, T. Adaptive strategies for materials design using uncertainties. Scientific Reports 6, 19660 (2016).
 28
Hastie, T., Tibshirani, R. & Friedman, J. The elements of statistical learning: data mining, inference, and prediction 364–370 (Springer, 2011), second edn.
 29
Ku, H.-T., Ku, M.-C. & Zhang, X.-M. Generalized power means and interpolating inequalities. Proceedings of the American Mathematical Society 127, 145–154 (1999).
 30
Loader, C. Local regression and likelihood, vol. 47 (Springer, New York, 1999).
 31
Friedman, J. H. Greedy function approximation: A gradient boosting machine. Annals of Statistics 29, 1189–1232 (2001).
 32
Stone, C. J. Consistent nonparametric regression. The Annals of Statistics 595–620 (1977).
 33
Cleveland, W. S. Robust locally weighted regression and smoothing scatterplots. Journal of the American Statistical Association 74, 829–836 (1979).
 34
Pugh, S. XCII. Relations between the elastic moduli and the plastic properties of polycrystalline pure metals. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 45, 823–843 (1954).
 35
Niu, H. et al. Extra-electron induced covalent strengthening and generalization of intrinsic ductile-to-brittle criterion. Scientific Reports 2 (2012).
 36
Gschneidner, K. et al. A family of ductile intermetallic compounds. Nature Materials 2, 587–591 (2003).
 37
Greaves, G. N., Greer, A., Lakes, R. & Rouxel, T. Poisson’s ratio and modern materials. Nature Materials 10, 823–837 (2011).
 38
Snyder, G. J. & Toberer, E. S. Complex thermoelectric materials. Nature Materials 7, 105–114 (2008).
 39
Cahill, D. G., Watson, S. K. & Pohl, R. O. Lower limit to the thermal conductivity of disordered crystals. Physical Review B 46, 6131 (1992).
 40
Clarke, D. R. Materials selection guidelines for low thermal conductivity thermal barrier coatings. Surface and Coatings Technology 163, 67–74 (2003).
 41
Cohen, M. L. Calculation of bulk moduli of diamond and zincblende solids. Physical Review B 32, 7988 (1985).
 42
Cohen, M. L. Theory of bulk moduli of hard solids. Materials Science and Engineering: A 105, 11–18 (1988).
 43
Lam, P. K., Cohen, M. L. & Martinez, G. Analytic relation between bulk moduli and lattice constants. Physical Review B 35, 9190 (1987).
 44
Harrison, W. A. Elementary electronic structure (World Scientific Singapore, 2004).
 45
Suh, C. & Rajan, K. Virtual screening and QSAR formulations for crystal chemistry. QSAR & Combinatorial Science 24, 114–119 (2005).
 46
Xu, B., Wang, Q. & Tian, Y. Bulk modulus for polar covalent crystals. Scientific Reports 3 (2013).
 47
Efron, B. et al. Least angle regression. The Annals of Statistics 32, 407–499 (2004).
 48
Ong, S. P. et al. Python Materials Genomics (pymatgen): A robust, open-source Python library for materials analysis. Computational Materials Science 68, 314–319 (2013).
 49
O’Keeffe, M. A proposed rigorous definition of coordination number. Acta Crystallographica Section A: Crystal Physics, Diffraction, Theoretical and General Crystallography 35, 772–775 (1979).
 50
Breiman, L., Friedman, J. H., Olshen, R. A. & Stone, C. J. Classification and regression trees 78–80 (CRC press, 1993).
 51
Giacobino, C., Sardy, S., Rodriguez, J. D. & Hengartner, N. Quantile universal threshold for model selection. arXiv preprint arXiv:1511.05433 (2015).
 52
Hill, R. The elastic behaviour of a crystalline aggregate. Proceedings of the Physical Society. Section A 65, 349 (1952).
 53
Anderson, O. L. & Nafe, J. E. The bulk modulus-volume relationship for oxide compounds and related geophysical problems. Journal of Geophysical Research 70, 3951–3963 (1965).
 54
Anderson, D. L. & Anderson, O. L. The bulk modulus-volume relationship for oxides. Journal of Geophysical Research 75, 3494–3500 (1970).
 55
Jayaraman, A., Batlogg, B., Maines, R. & Bach, H. Effective ionic charge and bulk modulus scaling in rocksalt-structured rare-earth compounds. Physical Review B 26, 3347 (1982).
 56
Cohen, M. L. Calculation of bulk moduli of diamond and zincblende solids. Physical Review B 32, 7988 (1985).
 57
Vepřek, S. The search for novel, superhard materials. Journal of Vacuum Science & Technology A 17, 2401–2420 (1999).
 58
Chen, X.-Q., Niu, H., Li, D. & Li, Y. Modeling hardness of polycrystalline materials and bulk metallic glasses. Intermetallics 19, 1275–1281 (2011).
 59
Chen, X.-Q., Niu, H., Franchini, C., Li, D. & Li, Y. Hardness of T-carbon: Density functional theory calculations. Physical Review B 84, 121405 (2011).
 60
Prikhna, T. In Innovative Superhard Materials and Sustainable Coatings for Advanced Manufacturing (eds Lee, J. & Novikov, N.). NATO Science Series II: Mathematics, Physics and Chemistry 200, 81 (2005).
 61
Gupta, K. Engineering materials: research, applications and advances (CRC Press, 2014).
 62
Adasch, V., Hess, K.-U., Ludwig, T., Vojteer, N. & Hillebrecht, H. Synthesis, crystal structure, and properties of two modifications of MgB12C2. Chemistry–A European Journal 13, 3450–3458 (2007).
 63
Mykhalenko, S., Babizhetskyj, V. & Kuzma, Y. New compound in the system Sc–Cr–B. Journal of Solid State Chemistry 177, 439–443 (2004).
 64
Adasch, V., Hess, K.-U., Ludwig, T., Vojteer, N. & Hillebrecht, H. Synthesis and crystal structure of Mg2B24C, a new boron-rich boride related to tetragonal boron I. Journal of Solid State Chemistry 179, 2150–2157 (2006).
 65
Lejaeghere, K., Van Speybroeck, V., Van Oost, G. & Cottenier, S. Error estimates for solid-state density-functional theory predictions: an overview by means of the ground-state elemental crystals. Critical Reviews in Solid State and Materials Sciences 39, 1–24 (2014).
 66
Gaston, N., Andrae, D., Paulus, B., Wedig, U. & Jansen, M. Understanding the hcp anisotropy in Cd and Zn: the role of electron correlation in determining the potential energy surface. Physical Chemistry Chemical Physics 12, 681–687 (2010).
 67
Gaston, N., Paulus, B., Rosciszewski, K., Schwerdtfeger, P. & Stoll, H. Lattice structure of mercury: Influence of electronic correlation. Physical Review B 74, 094102 (2006).
 68
Wedig, U., Jansen, M., Paulus, B., Rosciszewski, K. & Sony, P. Structural and electronic properties of Mg, Zn, and Cd from Hartree-Fock and density functional calculations including hybrid functionals. Physical Review B 75, 205123 (2007).
 69
Richardson, D. & Mahanty, J. Van der Waals contribution to the binding energy of noble metals. Journal of Physics C: Solid State Physics 10, 3971 (1977).
 70
Philipsen, P. & Baerends, E. Relativistic calculations to assess the ability of the generalized gradient approximation to reproduce trends in cohesive properties of solids. Physical Review B 61, 1773 (2000).
Acknowledgements
This work was intellectually led by the Department of Energy Basic Energy Sciences program (the Materials Project) under Grant No. EDCBEE. Work at Lawrence Berkeley was supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The authors thank Shyue Ping Ong for assistance with pymatgen and his extensive contributions to both pymatgen and the Materials Project.
Author information
Contributions
M.d.J. proposed this statistical learning project, helped produce the original DFT elastic constant data, performed the screening work, and produced some of the tables and figures. W.C. helped produce the original DFT elastic constant data and produced some of the tables and figures. R.N. implemented the GBM-Locfit approach, ran the models, and produced some of the tables and figures. K.P. was involved in planning and supervising the work and its integration into the Materials Project effort. G.C. proposed some descriptor candidates and was involved in planning and supervising the work and its integration into the Materials Project effort. A.J. proposed several descriptor candidates, provided detailed critiques of several drafts of the manuscript, and was involved in planning and supervising the work and its integration into the Materials Project effort. M.A. proposed some descriptor candidates, provided detailed critiques of several drafts of the manuscript, and was involved in planning and supervising the work and its integration into the Materials Project effort. A.G. proposed the GBM-Locfit approach and constructing descriptors as Hölder means, and oversaw the statistical learning work. All authors contributed to the text of the manuscript.
Ethics declarations
Competing interests
The authors declare no competing financial interests.
Rights and permissions
This work is licensed under a Creative Commons Attribution 4.0 International License. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in the credit line; if the material is not included under the Creative Commons license, users will need to obtain permission from the license holder to reproduce the material. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/