Rapidly predicting Kohn–Sham total energy using data-centric AI

Predicting material properties by solving the Kohn-Sham (KS) equation, which is the basis of modern computational approaches to electronic structures, has provided significant improvements in materials sciences. Despite its contributions, both DFT and DFTB calculations are limited by the number of electrons and atoms that translate into increasingly longer run-times. In this work we introduce a novel, data-centric machine learning framework that is used to rapidly and accurately predicate the KS total energy of anatase \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathrm{TiO}}_2$$\end{document}TiO2 nanoparticles (NPs) at different temperatures using only a small amount of theoretical data. The proposed framework that we call co-modeling eliminates the need for experimental data and is general enough to be used over any NPs to determine electronic structure and, consequently, more efficiently study physical and chemical properties. We include a web service to demonstrate the effectiveness of our approach.

. (Left) The DFT algorithm. (Right) The cooperative model framework for TiO 2 as an example. The first pass 1 uses DFT/DFTB to produce the minimal viable data (the smallest data size that will produce effective co-modeling). Once the co-model is constucted, we can directly submit the atomic geometry data 2 for TiO 2 to predict Kohn-Sham Total Energy without having to use DFT/DFTB. Throughout the paper we may abbreviate Kohn-Sham total energy to total energy. We use typical ML formalism describing inputs as a vector x , output as label ℓ , and classifier as f : x → ℓ. www.nature.com/scientificreports/ The remainder of the paper is as follows: "Background and related work" section we provide background on KS and an overview of ML and brief description of the members of M . In "Methodology and experimental results" section, we give a detailed description of co-modelling and show its application to the TiO 2 NPs. "Summary and conclusion" section is the summary can conclusion. Both code and data are publicly accessible. [https:// github. com/ hasan kurban/ Kohn-Sham-Total-Energy-Predi ction. git].

Background and related work
The Kohn-Sham equation. The DFTB formalism contains two major contributions [see Eq. (1)], which are matrix elements (the Hamilton and overlap) and the repulsive potentials 52 . The fundamental idea of the DFTB method is to implement the second order expansion of the Kohn-Sham (KS) DFT where the electronic density (ρ 0 ) is calculated from a reference function, where ρ 0 is the sum of neutral atomic densities. The first term of Eq. (1) represents a Kohn-Sham effective Hamiltonian; and the second term is related to the energy due to charge fluctuations where Ŝ µν = �η µ | η ν � is the overlap matrix elements. In the DFTB formalism, the matrix elements of atomic orbitals are pre-calculated and stored. Besides, the second order self-consistent charge (SCC) extension is used in the DFTB method due to the dependence of the DFTB Hamiltonian on the atomic charge. The third term is the repulsive potential which is approximated as a sum of two-center repulsions, with pair potentials based on the respective atom types and the interatomic distance Typically this problem is solved by a self-consistent approach. Schematic representation of the self-consistency cycle in Kohn-Sham equations is given in Fig. 1(left).
The method of calculations. Temperature dependent structures of anatase phase TiO 2 NPs have been obtained using molecular dynamics (MD) methods implemented in DFTB+ code 53 with the hyb-0-2 54,55 set of Slater Koster parameters. Thermal equilibrium is controlled by the NVT ensemble during whole simulations. The time step of MD was chosen as 1 fs. The temperature of the NPs was increased by 50 K up to 1000 K.
Machine learning. Related work. Ellis et al. 56 introduces an ML based framework, where the feed-forward neural network is used, to speed up DFT calculations over aluminum at ambient density (2.699 g/cc) and temperatures up to the melting point (933 K) . The authors show that their model can accurately predict solid and liquid properties of aluminum. Li et al. 57 shows the relations between density and energy can be iteratively and accurately learned with very small data and ML models, which have better generalizability, by including KS equations into the training data. KS equations are solved while learning exchange-correlation functional with neural networks which improve generalization. Chandraskeran et al. 58 demonstrates ML models can predict the electronic charge density and DOS from atomic configuration information. Instead of learning of a specific representation of the local electronic properties, the authors propose a grid-based learning and predictions scheme which is more memory intensive, but can routinely handle large systems and improve run-times. Brockherde et al. 59 perform the first MD simulation with a machine-learned density functional on malonaldehyde constructing more accurate density functionals for realistic molecular systems. Schleder et al. 60 gives an overview of ML techniques used in modern computational materials science to design novel materials and explain the present challenges and research problems.

The cooperative model framework (co-model)
We first give an informal description of co-modelling to prepare for the detailed description for Algo. 1. www.nature.com/scientificreports/ An overview of co-modelling. To aid discussion we write t ∈ N 0 for temperature Kelvin and O}} for a collection of nanoparticle position and atom type. Using DFT/DFTB we computationally determine total Kohn-Sham energy E ij given temperature t i and nanoparticles NP j (since i = j we use only one subscript). The data set has 21 values: where (t, NP) is called the feature set and E the label, building a "best" function f that, given feature values, gives energy using ML parlance. In this work we are interested in investigating how a linear ensemble of disparate ML models performs. We settled on a linear model, since it is among the simplest, best understood, and most widely used. The types of ML models are quite divers e.g., random forests, support vector machines, neural networks, k-nearest neighbor. Detailed discussion of each model is not feasible here, but links to the implementations used in this work are given in the next section. We train every model M k from a set of candidates using default parameters. Models that either individually perform poorly or are correlated are removed. An linear ensemble of models is further refined yielding f where α, β ∈ R are coefficients and constant. Using an additional tuning-parameter grid each candidate model is either optimized for its parameters or removed via cross-validation. The simplicity of Eq. (6) belies its power-any value t ∈ [0 K, 1000 K] and acceptable NP for TiO 2 can be determined directly bypassing the traditional DFT/ DFTB technique saving time while preserving fidelity discussed in "Methodology and experimental results" section ( Fig. 3).

The cooperative model framework (co-model). Presented here is a new approach to predicting
Kohn-Sham total energy of nanoparticles by combining a set of single runs of DFT/DFTB with a linear ensemble of disparate ML models we call the cooperative modelling framework (co-modelling)-the results are rapid and accurate. A model's individual performance criterion considers either (1) Mean Absolute Error (MAE), (2) Root Mean Square Error (RMSE), or (3) R Squared ( R 2 ). Co-modelling consists of five steps: 1. Compute KS energy for a small number of temperature, nanoparticles pairs. We call this data shown in Eq. (5) and Algo. lines 4-8. 2. Split the data into training data (used to build individual models) and test data (used to assess models quality). We call these data Train , Test , respectively in Algo. lines 9-10. 3. Build a set of candidate models from a large corpus of models M . In this work we are utilizing a possible size of over 230 models. A model is ignored if it is either performing poorly or takes longer than an arbitrary, fixed amount of time to complete ( > 1 hr ) in Algo. lines 10-16. 4. Of pairs of correlated models remaining, remove the poorer performing in Algo. lines 17-18. 5. Construct a linear ensemble, possibly refining the set of models further, by tuning parameters or pruning returning the function shown in Eq. (6) and Algo. lines 19-21. www.nature.com/scientificreports/ The initial model class |M| = 230 considers all regression models in caret 61 which is a popular ML library in R programming language. All models are initially run with default parameter values and become candidates if they complete computation in less than 1 hr . To construct the linear ensemble we leverage a popular R package extant in caret called caretEnsemble 62 . This package relies on caretStack that, from a collection of models, either prunes or optimally tunes parameters. Discussed next are descriptions and some characterizations of the gamut of models used. M consists of: lm: Linear regression; glm: Generalized Linear Model; lmStepAIC: Generalized Linear Model with Stepwise Feature Selection, gcvEarth: Multivariate Adaptive Regression Splines, ppr: Projection Pursuit Regression, rf: Random Forest, xgbDart,XgbTree, xgbLinear: Extreme Gradient Boosting; monmlp:Monotone Multi-Layer Perceptron Neural Network; svmLinear, svmRadial: Support Vector Machines (SVMs) with Linear and Radial Kernel Functions; knn: K-Nearest Neighbor; rpart: Decision Tree (cart); gausspr-Linear: Gaussian Process; icr: Independent Component Regression. We categorize these ML models as (1) ensemble: RF xgbDart,XgbTree, xgbLinear and (2) and non-ensemble models: lm, glm, icr, lmStepAIC, ppr, gaussprLinear, gvcEarth, svmLinear, knn, rpart, monmlp, svmRadial. Unlike non-ensemble models where only a single model is being built, ensemble models 63 consist of collections of the same models constructed randomly. More detailed description is provided next.
Co-model ensemble models. In RF the ensemble is a collection of DTs constructed with bagging (i.e., independently of each other), whereas XGB builds weak learners (with any type model) in an additive manner-one weak learner at a time. In ensembles, the final decision is given by combining the predictions of all models ("voting"). XGB lets weak learners vote along the way while RF lets all DTs vote at the end after building all trees. Each model is built using a different sample of the training data and, thus, the sampling method is among the www.nature.com/scientificreports/ factors which determine the final models. Voting strategy is another factor that can change the final prediction, e.g., weighted or unweighted voting where unweighted voting gives an equal weight to the each DT model. RF is quite popular due to its robustness, e.q., remote sensing 64 , land-cover classification 65 , network intrusion detection system 66 , sleep stage identification 67 . Error correlation among the trees and the strength of the DTs are estimated over the out of bag data which is the data remaining from bootstrap sampling. The trade-off between the margin which shows how well a single DT separates a correct class from an incorrect class and the correlations between the trees determines how well RF will perform. Breiman 68 is the first RF paper and 69 gives a review of RF algorithm. Differing approaches to improve RF; weighted voting and dynamic data reduction 70 , through sampling 71 , improving data 72 , with clustering 73 .
The stochastic gradient algorithm 74 improves 75 , the first gradient boosting algorithms for big data. Extreme Gradient Boosting (XGBoost/XGB) 76 is a scalable gradient tree boosting algorithm proven to work well in many areas, e.g., finance 77 , bioinformatics 78 , energy 79 , music 80 . Unlike RF, the cost function given in Eq. (7) is solved in an additive manner since it is not possible to optimize it in Euclidean space using the traditional optimization methods; it can be solved using second-approximation 81 .
where ŷ i represents the prediction for the i th data point and represents the trees' space. K; additive functions' number, x i ; i th data point, n; data size, m; input variables' number, t; iteration number. q is the structure of the trees and f k is an output of an independent tree structure q with a leaf weight. h denotes a differentiable convex loss function and measures the difference between the true model y and the predicted model ŷ . ω is the penalization parameter which is used to tune the complexity of the model and avoid the overfitting problem.
Co-model non-ensemble models. In this section we briefly explain non-ensemble models under two categories: (1) linear: lm 82 89 . Since the models under the same category learn the data according to different strategies, the final models they create are different from each other. For example, although lm can only learn linear relationships in the data, gvcEartch can learn non-linear relationships as well. icr decomposes training data into linear combinations of components which are independent of each other as much as possible. gvcEarth partitions input data into piece-wise linear segments with differing gradients. Linear segments are connected to obtain basis functions which are used to model linear and non-linear behaviors. svmLinear treats the "best" line problem as the maximal margin problem and solves the 2-norm cost functions (least squares error). svmRadial uses radial basis function as kernel function.
lm is Ordinary Least Squares (OLS) under the assumption that residuals distribution is Gaussian. glm searches for the best line by assuming that the distribution of residuals comes from an exponential family klike a Poisson. lmStepAIC, a type of glm, selects the input variables using an automated algorithm. In ppr in an iterative manner the regression surface is modelled as a sum of general linear combinations of smooth functions. This makes ppr more general than stepwise regression procedures. icr decomposes training data into linear combinations of components which are independent of each other as much as possible. gvcEarth partitions input data into piecewise linear segments with differing gradients. Linear segments are connected to obtain basis functions which are used to model linear and non-linear behaviors. gaussprLinear is a probabilistic regression model and a probability distribution over linear functions that fits training data. rpart is a single tree model trained by selecting an optimal attribute (feature) then partitioning data based on the class data for each value in the active domain. This is akin to Quinlan's C4.5. The optimal attribute is chosen using a variety of metrics such as Gini or information gain where each metric can lead to produce a different DT 91,94-96 . Subsequently the training data points are sorted to the leaf nodes. The algorithm runs until the training data points are classified to an acceptable threshold; otherwise it iterates over new leaf nodes. After building DTs are pruned to prevent overfitting (performing well on the training data, but poorly over test data). Pruning greatly impacts performance 95 . A neural network is a weighted, directed graph where nodes input/output and edges weights. monmlp is a recurrent network with a monotone constraint which is used to monotonically increase the behavior of model output with respect to covariates. monmlp handles overfitting using bootstrap aggregation with early stopping. In knn, the data itself is used to make a prediction (lazy learning). knn first searches for k-similar objects in the training data with some distance metric and then uses voting. The choice of k can drastically change the model's prediction.

Methodology and experimental results
In this section we first explain how we generate the minimal viable data and then present our novel machine learning framework over the TiO 2 NPs and share our findings.
Data set: TiO 2 nanoparticles. In Fig. 4 illustrates the theoretical supercell data (3-D, 2-D raw data) of TiO 2 NPs and some statistical properties of the portion of training data. The process begins by carving TiO 2 NPs from a bulk 60 × 60 × 60 supercell. Next, structures at different temperatures to attain various TiO 2 NPs www.nature.com/scientificreports/ are computed. Figure 4a illustrates three different NPs generated at 300 K , 600 K , and 900 K . Structural data is taken Kurban et al. 97 with a detailed description about the initial geometry of the TiO 2 NP model. The structural, electronic and optical properties of twenty-one TiO 2 NP models were obtained from the density functional tight-binding (DFTB) calculation 98 Fig. 4b shows statistical properties of TiO 2 NPs used in this study, e.g., distribution, correlation. The input variables are atom, three-dimensional geometric locations of Ti and O atoms (x, y, z) and temperature. The Energy variable is the output variable. We observe that the variables are linearly non-correlated. www.nature.com/scientificreports/ Using co-modelling to predict energetic properties. Let AC, TE stand for an initial atomic configuration and Kohn-Sham total energy. Treating DFTB as a function, total energy is a function solely of atomic configuration shown as computation (8): Taking temperature into account, we compute the triple (9): where AC i,t is the new atomic configuration using an initial configuration AC i paired with total energy at temperature t. The co-model is built from 21 different values described in computation (9). Our initial examination of the co-model itself might lead one to believe that temperature is the principle driver of total energy. This is not the case. The co-model is learning the relationship of the triple not simply temperature as a predictor of total energy. Our results show (including an active web service) that the triple is so well characterized that, given AC i , AC i,0 , AC i,50 , . . . , AC i,1000 and temperature t ∈ [0 K, 1000 K] , it can accurately predict TE t . The mechanics of the computation are not easy to describe. Underlying most of AI is the recurring conundrum: the model performs well, but how it works is often impossible to explain. Indeed, an active area of AI is explainability [101][102][103] . Interestingly, work is now taking place that allows AI to completely describe the phenomenon 104 . At this point, the results show an effective technique to drastically reduce run-time. It remains for future work how, if even possible, we can make some human-interpretable sense of the computation. Figure 3(top middle) shows the general stops to constructing a co-model that can efficiently, quickly, and effectively predict structural, electronic and optical properties of NPs. The framework starts with generating a minimal viable theoretical data set using DFT/DFTB (steps 1-3). The data is randomly partitioned (step 4) into training and test boosting ( |� Training | > |� Test |) . Cross-validation is a standard technique to assess the quality of models. Pearson's correlation is measured among the model cohort allowing elimination of models that are either highly correlated with a better model or perform badly. The final co-model is built and quality measured (step 5). A more detail description of these steps are provided next. In this work, total energy is eV/atom.
The application of co-modelling to anatase TiO 2 nanoparticles. Referring to Algo. 1 experimental detail is presented. The construction begins with 21 TiO 2 NPs at temperatures ranging over [0 K, 1000 K] from DFT/DFTB. Training data, Training , represents 75% of original data and test data, Test , the remaining 25%. The ratio Ti/O are the same in both Training and Test . In DFTB, the symmetries of the studied NP models were broken under heat treatment; thus, the co-model ensemble and other ML models were trained over non-symmetric NPs 105,106 . The goodness of the models use three metrics that produce values in [0, 1]: (1) Mean Absolute Error (MAE), (2) Root Mean Square Error (RMSE), (2) R Squared (R 2 ). Metrics (1) and (2) are error metrics; thus lower values are better. MAE and RMSE are distance metrics that, as the model improves, describe the deviation as a magnitude from the actual data (DFT/DFTB) and not the actual accuracy of the model. R 2 , in a regression model, represents the proportion of variance in the input variables (dependent) and that can be explained by the output variable (independent). The higher R 2 , the better model is. tenfold cross-validation is used to assess the quality of the models and the summary is given in Fig. 5.
The best performing traditional ML model is ppr in the training step. The experimental results indicate that the performances of XGBoost algorithms (xgbDart, xgbLinear, xgbTree) and rf are similar to ppr. Although some single ML models such as ppr, xqbDart, xqbLinear, perform well on training and test data, the our framework makes it possible to create more accurate models than any of these single ML models. Any new data may not be well-modelled by a single ML model; the co-model will always perform well. Figure 6 highlights pairwise training model differences: Fig. 6a shows R 2 ; Fig. 6b shows linear correlations; Fig. 6c shows a Pearson heat map. Figures for MAE and RMSE are provided as the supplementary material for space reasons. After finding the best models among various regression models, correlations among the models, created during the training step, were measured. The results show that lm-lmStepAIC, lm-gaussprLinear, glm-lm, glm-lmStepAIC,glm-gaussprLinear pairs are highly correlated. We only use gaussprLinear, since its performance is better than others over the training data. The co-model is constructed using the remainder of the cohort. Note that there was none that performed poorly over Training . The performance comparison of traditional ML models and cooperative model over the training data is presented in Fig. 7. The red-dashed line represents the cooperative model and we observe that our cooperative model performs better than the rest of the ML models. For example, over the training data, the RMSE was 1.5 × 10 −10 where ppr, the best traditional model, had 1.6 × 10 −10 RMSE. Figure 6c demonstrates the relative importance of individual models in the cooperative model. We observe that svmLinear, xgbTree, xgbLinear and xgbDART are the most important models, respectively, in the cooperative model, and svmLinear is the most dominant one. Finally, we present the performance of the cooperative model and other models over the testing data in Table 1. Based on all the metrics the results demonstrate that co-modelling performed superior than all other traditional single models. Our best model is embedded to the web and can be easily tested (https:// hasan. shiny apps. io/ app_1/).

Summary and conclusion
We have demonstrated that by pairing DFT/DFTB with a novel, data-driven ML framework-and surprisingly a modicum of data of 21 NPs-we can bypass traditional run-time bottlenecks scientists face when using DFT/ DFTB alone. Co-modelling builds a linear ensemble of models that accurately predicts the structural, electronic, and optical properties of nanoparticles (NPs). The collective time to build the ML portion is relatively minuscule and can even be done ab initio. Our solution is open-source and only requires a standard hardware-a laptop. www.nature.com/scientificreports/ what the best and minimal data can be and ensemble, whether instances of this model differ for materials or types of properties predicted, generating data for general use. Additional future work includes elucidating how the co-model is understanding the relationship possibly leading to a different, perhaps simpler, description of atomic configuration, temperature, and total energy. Another intriguing path is to reverse the computation-can we determine the most likely atomic configuration or class of configurations that would give rise to the total energy. Examining other metal nanoparticles must be investigated to ensure this result is not simply peculiar to TiO 2 . We are also interested in studying extreme temperatures where experimentation is difficult at best. Finally, code and data are publicly accessible and the cooperative model is available on the web.   www.nature.com/scientificreports/