General machine-learning surrogate models for materials prediction

Surrogate machine-learning models are transforming computational materials science by predicting properties of materials with the accuracy of ab initio methods at a fraction of the computational cost. We demonstrate surrogate models that simultaneously interpolate energies of different materials on a dataset of 10 binary alloys (AgCu, AlFe, AlMg, AlNi, AlTi, CoNi, CuFe, CuNi, FeV, NbNi) with 10 different species and all possible fcc, bcc and hcp structures up to 8 atoms in the unit cell, 15 950 structures in total. We find that prediction errors remain unaffected when increasing the number of simultaneously modeled alloys. Several state-of-the-art materials representations and learning algorithms were found to qualitatively agree, with prediction errors as low as 1 meV/atom.


I. INTRODUCTION
Advances in computational power and electronic structure methods have enabled large materials databases 1-4 . Using high-throughput approaches, 5 these databases have proven a useful tool to predict properties of materials. However, given the combinatorial nature of materials space, 6,7 it is infeasible to compute properties for more than a tiny fraction of all possible materials using electronic structure methods such as density functional theory 8,9 . A potential answer to this challenge lies in a new paradigm: surrogate machine-learning models for accurate materials predictions. [10][11][12] The key idea is to use machine learning to rapidly and accurately interpolate between reference simulations, effectively mapping the problem of numerically solving for the electronic structure of a material onto a statistical regression problem. 13 Such fast surrogate models could be used to filter the most suitable materials from a large pool of possible materials and then validate the found subset by electronic structure calculations. Such an "accelerated high-throughput" (AHT) approach (Figure 1) could potentially increase the number of investigated materials by several orders of magnitude.
Traditionally, empirical interatomic potentials were used to reproduce macroscopic properties of materials faster than DFT. Well-known empirical interatomic potentials for periodic solids include the Lennard-Jones potential, the Stillinger-Weber potential, and the embedded atom method (EAM) for alloys. A problem with empirical interatomic potentials is that they are designed with a fixed number of parameters and cannot be systematically improved. In contrast, surrogate models, which are empirical interatomic models based on machine learning, systematically improve with additional data.
We demonstrate the feasibility of accurate general surrogate machine-learning models, i.e., models that simultaneously interpolate energies of multiple different materials, for enthalpies of formation across compositions, lattices, and structures. We find that five combinations of state-of-the-art representations and regression methods (Table I) all yield consistent predictions with errors in the low meV/atom range.

II. BACKGROUND
A surrogate machine-learning model replaces ab initio simulations by mapping a crystal structure to properties such as enthalpies, elastic constants, or band gaps. Its value lies in the fact that, once the model is trained, properties of new materials can be predicted in time either constant or linear in the size of the material, with a low pre-factor, typically within milliseconds.
The two major parts of a surrogate machine-learning model are the numerical representation of the input data 11,21 and the learning algorithm. We use the term "representation" for a set of features (as opposed to a collection of unrelated or only loosely related descriptors) that satisfies certain physical requirements 12,13,18,22 such as invariance to translation, rotation, and permutation of atoms, uniqueness 23 , differentiability, and computational efficiency. The role of the representation is akin to that of a basis set in that the predicted property is expanded in terms of a set of reference structures.
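The invariance requirements above can be illustrated with a deliberately simple toy representation (not one of the representations used in this work): the sorted list of pairwise distances of a finite atomic cluster, which is unchanged under translation, rotation, and reordering of atoms. Periodic boundary conditions are ignored here for brevity.

```python
import numpy as np

def toy_representation(positions):
    """Sorted list of pairwise distances: a minimal illustration of a
    feature vector that is invariant to translation, rotation, and
    permutation of atoms (not one of the representations used here)."""
    n = len(positions)
    dists = [np.linalg.norm(positions[i] - positions[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.sort(dists)

rng = np.random.default_rng(0)
pos = rng.random((4, 3))                      # 4 atoms at arbitrary positions

shifted = pos + np.array([1.0, -2.0, 0.5])    # translated
Rz = np.array([[0.0, -1.0, 0.0],              # rotated 90 degrees about z
               [1.0,  0.0, 0.0],
               [0.0,  0.0, 1.0]])
rotated = pos @ Rz.T
permuted = pos[[2, 0, 3, 1]]                  # atoms reordered

ref = toy_representation(pos)
for variant in (shifted, rotated, permuted):
    assert np.allclose(toy_representation(variant), ref)
```

Note that this toy vector is not unique (distinct structures can share a distance list), which is exactly the kind of violation that limits accuracy, as discussed below.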
The materials community has proposed several representations [10][11][12]18,21,[24][25][26][27] for crystal structures. Many do not fulfill the above properties or are restricted in practice to materials with a few different chemical elements. Consequently, surrogate models based on these representations are limited in their accuracy, due to the violation of the uniqueness requirement 22,28 .
We explore three state-of-the-art representations that fulfill the above properties for the construction of general surrogate models: the many-body tensor representation 12 (MBTR), smooth overlap of atomic positions 10,18 (SOAP), and moment tensor potentials 11 (MTP). Each representation is employed as proposed and implemented by its authors, including the regression method: kernel ridge regression 13 (KRR) for MBTR, Gaussian process regression 29 (GPR) for SOAP, and polynomial regression 11 for MTP. Since the predictions (but not necessarily other properties) of the kernel-based KRR and GPR are identical, we use the two terms interchangeably here. We also employed cluster expansion 14-17 (CE) and deep neural network 30,31 (DNN) models.
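Since KRR and GPR produce identical predictions, their shared mechanics can be sketched in a few lines of linear algebra. The following minimal sketch fits a Gaussian-kernel ridge model to synthetic one-dimensional data; the features, targets, and hyperparameters (sigma, lam) are illustrative stand-ins, not the values used for MBTR+KRR in this work.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X_train = rng.uniform(0.0, 4.0, size=(40, 1))   # stand-in "features"
y_train = np.sin(X_train[:, 0])                 # stand-in "energies"

sigma, lam = 0.5, 1e-6                          # assumed hyperparameters
K = gaussian_kernel(X_train, X_train, sigma)
alpha = np.linalg.solve(K + lam * np.eye(len(K)), y_train)

X_test = np.array([[1.0], [2.5]])
# The KRR prediction below coincides with the GPR posterior mean
y_pred = gaussian_kernel(X_test, X_train, sigma) @ alpha
```

The identity of the two predictions follows because the GPR posterior mean has exactly the form k(x, X)(K + lam I)^-1 y used here.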
CE models have been used for three decades to efficiently model ground state energies of metal alloys, but require that the atomic structure can be mapped to site occupancies on a fixed lattice. They are therefore less suited to model different materials. In this work, we use them as a baseline and build a separate CE model for each alloy.
DNNs are essentially recursively stacked layers of functions, the large number of layers being the major difference to conventional neural networks. They have been used to predict energies [32][33][34][35][36] and to learn representations. 37,38 While DNNs can learn representations ("end-to-end learning", here from nuclear charges, atom positions and unit cell basis vectors to enthalpy of formation), this requires substantially more data than starting with a representation as input [24][25][26] . We, therefore, provide the DNN with MBTR as input. The idea of using MBTR+DNN is to explore whether a representation-learning technique can improve upon a manually designed representation in conjunction with the standard Gaussian kernel (MBTR+KRR).
III. METHODS

A. Data

Lattice parameters for each crystal structure were set according to Vegard's law. 40,41 Total energies were computed using density functional theory (DFT) with projector-augmented wave (PAW) potentials 42-44 within the generalized gradient approximation (GGA) of Perdew, Burke, and Ernzerhof 45 (PBE), as implemented in the Vienna Ab initio Simulation Package 46,47 (VASP). The k-point meshes for sampling the Brillouin zone were constructed using generalized regular grids. 48,49 The k-point densities for all 10 alloys are given in the appendix (Table IV).
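Vegard's law amounts to linear interpolation of the lattice parameter between the pure elements by concentration. A minimal sketch, using the approximate fcc lattice constants of Ag and Cu as illustrative inputs:

```python
def vegard_lattice_parameter(a_pure_A, a_pure_B, x_B):
    """Vegard's law: linear interpolation of the lattice parameter
    between the pure-element values by the concentration x_B of B."""
    return (1.0 - x_B) * a_pure_A + x_B * a_pure_B

# Illustrative inputs: approximate fcc lattice constants of
# Ag (4.09 Angstrom) and Cu (3.61 Angstrom)
a_50_50 = vegard_lattice_parameter(4.09, 3.61, 0.5)   # equiatomic AgCu
```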

B. Models
All single-alloy surrogate models were trained using 1000 randomly selected crystal structures and validated on the remaining 595 structures. Models trained on multiple alloys use the union of the individual alloy's splits. Parametrization details of all surrogate models used in this work can be found in the appendix.
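The hold-out protocol above can be sketched as follows; the structure names are placeholders and the seed is arbitrary:

```python
import random

def holdout_split(structures, n_train=1000, seed=0):
    """Random hold-out split as described in the text: 1000 training
    structures per alloy, the remaining 595 for validation."""
    idx = list(range(len(structures)))
    random.Random(seed).shuffle(idx)
    train = [structures[i] for i in idx[:n_train]]
    valid = [structures[i] for i in idx[n_train:]]
    return train, valid

# 1595 placeholder structures for one alloy
alloy = [f"structure_{i}" for i in range(1595)]
train, valid = holdout_split(alloy)
# A multi-alloy model would train on the union of the individual splits.
```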

IV. RESULTS AND DISCUSSION

A. Energy predictions for single alloys
Prediction errors for enthalpies of formation of each of the five surrogate models on each binary alloy subset of the data are presented in Figure 2. Prediction errors of all surrogate models agree qualitatively on all subsets of the data. We interpret this consistency to be indicative of the validity of the machine learning approach to surrogate models of materials properties, independently of the parametrization details of the models.
For four binary systems (AgCu, AlMg, CoNi, CuNi), prediction errors are at or below 3 meV/atom. The prediction errors of all surrogate models on the remaining six systems (AlFe, AlNi, AlTi, CuFe, FeV, NbNi) are consistent, and it is not obvious why these systems are harder to learn. When generating the data, the same methodology and parameters were used for all alloys, and similar fitting procedures were employed for each surrogate model.
We point out that prediction errors tend to be low whenever the elements constituting a binary alloy belong to the same column of the periodic table or are close to each other in atomic number, and high otherwise. Together, these two quantities explain 80 % of the variance in prediction errors (Figure 5). A complementary observation is that while absolute errors vary from alloy to alloy, the relative error (δRMSE), expressed as a percentage of the range of energies of an alloy's subset of the data, remains essentially constant at around 1 %.

B. General models trained on all alloys
Surrogate models based on analytic expressions need to be fitted to each alloy system separately. For CE, the representation is naturally tied to a particular lattice (e.g., fcc or bcc), making it difficult to train on multiple alloy systems with different lattices at the same time. Here, we train a cluster expansion on all alloys by constraining all 30 systems to use a single set of hyperparameters for regularization (i.e., all use the same prior probability distribution over ECI values). The machine-learning surrogate models based on MBTR, SOAP, and MTP do not share this limitation: they express the energy as a continuous function of interatomic distances and can be trained on multiple materials simultaneously.
We trained four of the five investigated surrogate models simultaneously on all 10 alloy systems and compared the mean absolute error (MAE) of these combined models with the average MAE when trained on each alloy system separately (Table II; note that RMSE would differ due to its non-linear nature). The quantitative agreement indicates that prediction errors remain unaffected when training on multiple systems. For the cluster expansion, these results suggest that there is a single set of parameters for generating a prior probability distribution over ECI values (provided in the supporting information) that works well across a variety of chemistries and lattice types.
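The parenthetical note about RMSE deserves a concrete illustration: the MAE of a pooled validation set is the (size-weighted) average of the per-subset MAEs, whereas RMSE does not combine linearly. A small numerical demonstration with made-up error values:

```python
import numpy as np

mae = lambda e: np.mean(np.abs(e))
rmse = lambda e: np.sqrt(np.mean(np.square(e)))

# Hypothetical absolute errors for two equally sized alloy subsets
err_a = np.array([1.0, 1.0, 1.0, 5.0])
err_b = np.array([2.0, 2.0, 2.0, 2.0])
pooled = np.concatenate([err_a, err_b])

# MAE of the pooled set equals the average of the per-subset MAEs ...
assert np.isclose(mae(pooled), 0.5 * (mae(err_a) + mae(err_b)))
# ... but RMSE does not combine linearly across subsets
assert not np.isclose(rmse(pooled), 0.5 * (rmse(err_a) + rmse(err_b)))
```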
We investigate simultaneous training of alloys in more detail for the MBTR+KRR model. Figure 3 presents deviations of the MAE of a single model trained on k alloy systems from the average MAE when the model is trained on each alloy system separately. In all Σ_{k=1}^{10} (10 choose k) = 1023 possible cases, the deviation is below 1 meV/atom. These deviations are on the order one would expect from minor differences in hyperparameter values. We conclude that prediction errors remain consistently unaffected when increasing the number of simultaneously modeled alloys.
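The count of 1023 cases is simply the number of non-empty subsets of the 10 alloy systems, which can be checked directly:

```python
from math import comb

# All non-empty subsets of the 10 alloy systems:
# sum_{k=1}^{10} C(10, k) = 2**10 - 1 = 1023
n_cases = sum(comb(10, k) for k in range(1, 11))
assert n_cases == 2 ** 10 - 1 == 1023
```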

C. Caveat emptor
Are reported errors reliable estimates of future performance in applications? It depends. We discuss the role of training and validation set composition as an example of the intricacies of statistical validation of machine learning models.
In the limit of infinite independent and identically distributed data, one would simply sample a large enough validation set and measure prediction errors, with the law of large numbers ensuring the reliability of the estimates. Here, however, data are limited due to the cost of generating them via ab initio simulations, and are neither independent nor identically distributed. In such a setting, part of the available data is used for validation, either in the form of a hold-out set (as in this work) or via cross-validation, which is suited to even smaller datasets.
Prediction errors in machine learning models improve with data (otherwise it would not be machine learning). This implies that if only few training samples exist for a "subclass" of structures, prediction errors for similar structures will be high. For example, consider the number of atoms per unit cell in the DFT-10B dataset used here: There are only 11 structures for each alloy that have 1 or 2 atoms in the unit cell. Consequently, prediction errors are high for those ( Figure 6).
For combinatorial reasons, the number of possible structures increases strongly with the number of atoms in the unit cell (Table III). This biases error statistics in two ways: as discussed, prediction errors will be lower for classes with more samples; at the same time, because these classes have more samples, they contribute more to the measured errors, dominating averages such as the RMSE.

Figure 4. Prediction errors for size-dependent splits of the data into training and validation sets. On the left, all structures with |k| or fewer atoms in the unit cell are excluded from the training set (and therefore included in the validation set). This results in many high-error structures in the validation set, with the effect decreasing for smaller |k|. For k = 0, size does not influence the split. On the right, structures with ≤ k atoms are always included in the training set, resulting in fewer high-error structures in the validation set. The dashed line marks the value of k = 2 recommended in this work (see also Figure 6).
Retrospective errors reported in the literature should, therefore, be critically assessed. The design of such studies should favor "representative" validation sets over those tweaked to yield lowest possible errors to report. For combinatorial datasets, the smallest structures (those that can be considered to be outliers) should be included in the training set. 50
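The recommendation to always place the smallest structures in the training set can be sketched as a constrained hold-out split; the structure sizes below are made up for illustration:

```python
import random

def split_small_into_train(sizes, n_train, k=2, seed=0):
    """Hold-out split that always places structures with <= k atoms per
    unit cell in the training set (k = 2 as recommended in the text).
    `sizes` maps structure index -> atoms per unit cell (illustrative)."""
    small = [i for i, s in enumerate(sizes) if s <= k]
    rest = [i for i, s in enumerate(sizes) if s > k]
    random.Random(seed).shuffle(rest)
    train = set(small) | set(rest[:n_train - len(small)])
    valid = [i for i in range(len(sizes)) if i not in train]
    return sorted(train), valid

rng = random.Random(1)
sizes = [1, 2, 2] + [rng.randint(3, 8) for _ in range(12)]
train, valid = split_small_into_train(sizes, n_train=10)
assert all(sizes[i] > 2 for i in valid)   # no small structures held out
```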

V. CONCLUSION AND OUTLOOK
We showed that machine learning can be used to build general surrogate models that predict the enthalpy of formation of crystal structures across 10 different binary alloy systems, three lattice types (fcc, bcc, hcp), and structures away from their ground state. In doing so, we find that the viability of machine learning for predicting materials properties is independent of the details of the surrogate model: several state-of-the-art materials representations and learning algorithms yield qualitatively consistent predictions. This observation is congruent with recent efforts towards a unifying mathematical framework for some of the used representations. 51 The ability to use a single general surrogate model has the potential to greatly simplify the exploration of materials spaces by avoiding the need to identify "homogeneous" subspaces and then build separate models for each of them. This also avoids problems such as discontinuities at the boundaries between separate models.
Is it possible to do better? Recent results suggest that it might be possible to exploit similarities between chemical element species to further improve learning rates. 52 This requires either explicitly accounting for element similarities in the representations or learning element similarities from the data, for example with a DNN. While such alchemical learning is outside the scope of this work, we do observe an improvement in prediction errors for the general MBTR+DNN model (Table II).

SUPPLEMENTARY MATERIAL
A. Method details

Cluster expansion
To determine the cutoff distances for the cluster expansions and the initial parameters for the prior probability distributions, we used a length scale in which the edge of a bcc unit cell is 1 unit of length and assumed that the hcp, bcc, and fcc crystal structures all have the same nearest-neighbor distance. The cutoff distances used to determine the set of clusters included in the expansion are as follows:

Number of sites in cluster    Maximum distance between sites
2                             8
3                             4
4                             2
5                             1.5
6                             1.5

This resulted in a total of 791, 941, and 2870 distinct orbits of clusters in the bcc, fcc, and hcp expansions, respectively, including the empty cluster. These numbers were reduced after fitting by "trimming" the cluster expansions, in which cluster functions with very small ECI were removed from the expansion. To determine which clusters to remove, we used the fact that when the cluster functions are orthonormal, the expected squared error due to truncation is

E(error²) = Σ_b V_b²,

where V_b is the ECI for the b-th cluster function, the expectation of the squared error on the left is over all possible lattice decorations, and the sum on the right is over the cluster functions excluded from the expansion. Thus, removing an orbit of clusters with multiplicity m_b increases the expected squared error by m_b V_b². To trim clusters from the expansion with little loss of accuracy, we removed all orbits of cluster functions for which m_b V_b² < 10⁻⁵ eV². The trimming procedure changed the final average root-mean-squared prediction errors on the training sets by less than 10⁻⁵ eV/atom and removed on average more than 70 % of the ECIs in the expansions.
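The trimming rule can be sketched directly; the ECI values and multiplicities below are hypothetical:

```python
import numpy as np

def trim_expansion(ecis, multiplicities, threshold=1e-5):
    """Sketch of the trimming rule described in the text: drop orbits
    whose expected squared-error contribution m_b * V_b**2 falls below
    the threshold."""
    contrib = multiplicities * ecis ** 2
    keep = contrib >= threshold
    return ecis[keep], keep

# Hypothetical ECIs (eV) and orbit multiplicities
ecis = np.array([0.50, 0.012, 1e-4, 3e-3, 5e-6])
mult = np.array([1.0, 6.0, 12.0, 8.0, 24.0])

kept, mask = trim_expansion(ecis, mult)
# Expected squared error added by the trimming: sum over removed orbits
added = np.sum((mult * ecis ** 2)[~mask])
```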
The ECIs for the cluster expansions were fit to the training data using a Bayesian approach with a multivariate Gaussian prior distribution 17 . The inverse of the covariance matrix for the prior, Λ, was diagonal, with elements given by

Λ_αα = 0                                for n_α = 0,
Λ_αα = e^(−λ₁)                          for n_α = 1,
Λ_αα = e^(−λ₂) e^(−λ₃ r_α) n_α^(λ₄)     for n_α > 1,

where n_α is the number of sites in cluster function α and r_α is the maximum distance between sites in Ångström. The parameters λ₁, λ₂, λ₃, and λ₄ were initially set to 10, 10, 5, and 5, respectively, and then optimized using a conjugate-gradient algorithm to minimize the root-mean-square leave-one-out cross-validation error, an estimate of prediction error 16 . For the combined fit, in which a single set of regularization parameters was used for all 30 cluster expansions, the optimized values of λ₁, λ₂, λ₃, and λ₄ were 10.0, 20.8, 4.2, and 15.3, respectively.
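This fitting procedure corresponds to generalized ridge regression: with correlation matrix X, energies E, and diagonal inverse prior covariance Λ, the posterior-mean ECIs are V = (XᵀX + Λ)⁻¹XᵀE. Below is a sketch under the stated initial λ values, with a made-up correlation matrix; the empty-cluster diagonal element is assumed here to be unregularized (zero).

```python
import numpy as np

def prior_precision(n_sites, r_max, lams=(10.0, 10.0, 5.0, 5.0)):
    """Diagonal element of the inverse prior covariance, following the
    piecewise form in the text (empty cluster assumed unregularized)."""
    l1, l2, l3, l4 = lams
    if n_sites == 0:
        return 0.0
    if n_sites == 1:
        return np.exp(-l1)
    return np.exp(-l2) * np.exp(-l3 * r_max) * n_sites ** l4

def fit_ecis(X, E, precisions):
    """Posterior-mean ECIs: V = (X^T X + Lambda)^(-1) X^T E."""
    Lam = np.diag(precisions)
    return np.linalg.solve(X.T @ X + Lam, X.T @ E)

# Tiny hypothetical correlation matrix (rows: structures, cols: clusters)
X = np.array([[1.0,  1.0,  0.50],
              [1.0, -1.0,  0.25],
              [1.0,  0.0, -0.50],
              [1.0,  0.5,  0.00]])
E = np.array([-0.10, -0.05, -0.20, -0.12])    # energies (eV/atom)
prec = [prior_precision(n, r) for n, r in [(0, 0.0), (1, 0.0), (2, 2.5)]]
V = fit_ecis(X, E, prec)
```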

MBTR+DNN
The mathematical details of the many-body tensor representation for crystal structures are given in ref. 12 . Each crystal structure is expanded in terms of distributions (k-body terms) of atom counts, (inverse) distances, and angles. A Gaussian kernel with σ = 11.3 was used for fitting. Each MBTR vector is 1450 elements long; its parameters were optimized using a grid search. The weighting functions and smearing parameters for each k-body term are as follows:

k    geometry      weighting      discretization      σ
1    atom count    1/identity     (0.5, 1, 25)        10⁻⁴
2    1/distance    identity²      (0.1, 0.005, 70)    2⁻¹⁷
3    angle         1/dotdotdot    (0.1, 0.05, 140)    2⁻⁸
The MBTR+DNN model uses the same parameters as the MBTR+KRR model for generating the representation. The only difference is that the k-body terms in the MBTR+DNN model are stratified by all 10 elements instead of just two, resulting in a representation vector that is 147 100 elements long. The architecture of the convolutional neural network used in this work is listed in the table below. The DNN code is implemented using the TensorFlow framework 53 . The models were trained with a mini-batch size of 50, and the RMSE was used as the cost function for optimizing the weights of the network.

SOAP+GP
GAP fits were generated for each alloy system using a 2-body + SOAP approach. The standard deviation (SD, parameter δ) of the Gaussian process for the 2-body GAP is set to match the SD of the energies of the training set. After fitting the 2-body potential, a SOAP GAP is fit with its SD set to match the remaining RMSE of the 2-body GAP relative to the DFT energies of the training set. The fits were performed using teach_sparse 54 ; the parameters for the 2-body GAP are listed in the table below. As described above, δ is set from the standard deviation of the training-set energies for the 2-body and SOAP fits, respectively; the following table lists these values for each alloy system. For all alloy fits, the error hyperparameter σ was set to 1 meV for energies; forces and virials were not used in the fits. Because the pure-element energies have a large effect on the predicted formation enthalpies, we reduced the error hyperparameter to 10⁻⁴ eV for training configurations representing pure elements. This ensured accurate reproduction of the pure-element energies, so that enthalpy errors closely match errors in total energy for these configurations.
The parameter e₀ was calculated for each isolated atom by including a padding of 10 Å around a single atom and using the same pseudopotential as in the bulk calculations discussed above. These energies were converged with respect to the basis-set size and used only the Γ k-point.
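The two-stage scheme above (fit a coarse model first, then fit a second model to its residual, with δ set from the residual spread) can be imitated with generic Gaussian-kernel regression. This is a schematic stand-in, not actual GAP/teach_sparse usage, and all data and hyperparameters are made up:

```python
import numpy as np

def rbf(A, B, sigma):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kernel_fit(X, y, sigma, lam):
    return np.linalg.solve(rbf(X, X, sigma) + lam * np.eye(len(X)), y)

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 3.0, size=(60, 1))
y = np.cos(X[:, 0]) + 0.1 * np.sin(5.0 * X[:, 0])   # coarse + fine parts

# Stage 1: heavily regularized coarse fit (plays the role of the 2-body GAP)
alpha1 = kernel_fit(X, y, sigma=1.0, lam=1.0)
pred1 = rbf(X, X, 1.0) @ alpha1
resid = y - pred1

# Stage 2: fit the residual; its RMSE plays the role of delta for stage 2
delta2 = np.sqrt(np.mean(resid ** 2))
alpha2 = kernel_fit(X, resid, sigma=0.2, lam=1e-6)
pred = pred1 + rbf(X, X, 0.2) @ alpha2

assert np.sqrt(np.mean((y - pred) ** 2)) < delta2   # two stages improve fit
```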

MTP
MTP was introduced in Ref. 11 for single-component systems and extended to multicomponent systems in Refs. 19,20 . MTP partitions the predicted energy into contributions from the environment of each atom. Around the central atom of an environment, the neighboring atoms form shells. Within these shells, atoms are assigned fictitious weights depending on their distance to the central atom, their types, and the type of the central atom. These weights are free parameters fitted from data. An environment is described by the moments of inertia of these shells. All possible contractions of one or more moment tensors to a scalar comprise an infinite sequence of basis functions; this sequence is truncated to yield the finite set of basis functions used in a particular MTP model. The contribution of an environment to the energy is thus a linear combination of basis functions with coefficients that are also fitted from data. See Ref. 20 for details.
In this work, we used an MTP with about 300 basis functions for the binary systems. The cutoff for atomic environments was 7 Å. The environments were described by five shells, and the dependence of a neighbor's weight on its distance to the central atom was described by eight basis functions. Thus, the total number of parameters in a binary MTP is 5 × 8 × 2² + 300 ≈ 450 (the factor 2² arises because there are two possible types for the central atom and two for each neighboring atom). For the 10-component MTP, we used six shells and 850 basis functions, totaling about 6 × 8 × 30 + 864 ≈ 2300 parameters, where the factor 30 is the number of interacting pairs of atoms.
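The parameter counts quoted above can be checked arithmetically (the factor 30 is taken from the text as stated):

```python
# Parameter-count check for the MTP models described above
shells, radial_basis = 5, 8
binary_params = shells * radial_basis * 2 ** 2 + 300   # "about 450"
assert binary_params == 460

pairs = 30                                             # as stated in the text
multi_params = 6 * radial_basis * pairs + 864          # "about 2300"
assert multi_params == 2304
```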
B. k-point density

Figure 6. Influence of unit cell size on errors. Shown are the absolute errors (meV/atom) as a function of the number of atoms in the unit cell for a validation set of 595 randomly chosen structures. The number in brackets and the dashed line indicate the root-mean-squared error (RMSE, meV/atom) and the median absolute error (meV/atom) on the same set. If small structures (one or two atoms in the unit cell) are not contained in the training data (that is, if they appear in the plot), they tend to have larger errors, increasing the overall RMSE as well. If all small structures are contained in the training data, the overall RMSE is low (AlMg, CoNi). Retraining models with small structures included in the training set improved the RMSE in all cases, by an amount depending on how many structures were added.