Genetic algorithms for computational materials discovery accelerated by machine learning

Materials discovery is increasingly being impelled by machine learning methods that rely on pre-existing datasets. Where datasets are lacking, unbiased data generation can be achieved with genetic algorithms. Here a machine learning model is trained on-the-fly as a computationally inexpensive energy predictor before analyzing how to augment convergence in genetic algorithm-based approaches by using the model as a surrogate. This leads to a machine learning accelerated genetic algorithm combining robust qualities of the genetic algorithm with rapid machine learning. The approach is used to search for stable, compositionally variant, geometrically similar nanoparticle alloys to illustrate its capability for accelerated materials discovery, e.g., nanoalloy catalysts. The machine learning accelerated approach, in this case, yields a 50-fold reduction in the number of required energy calculations compared to a traditional “brute force” genetic algorithm. This makes searching through the space of all homotops and compositions of a binary alloy particle in a given structure feasible, using density functional theory calculations.

a) Electronic mail: teve@dtu.dk

I. INTRODUCTION
The current rate of discovery of clean energy materials remains a key bottleneck in the transition to renewable energy, and computational tools enabling accelerated prediction of the chemical ordering and structure of such materials, e.g., nanoparticle alloys and catalysts, are in high demand.
Genetic algorithms (GAs) are metaheuristic optimization algorithms inspired by Darwinian evolution. By performing crossover, mutation and selection operations, the algorithm progresses a population of evolving candidate solutions. With well-designed operators and optimal parameters, GAs have exhibited a high degree of robustness in finding ideal solutions to difficult optimization problems.1,2 The robustness results from the evolutionary process being able to advance solutions that would have been very difficult to predict a priori. However, GAs often require a large number of function evaluations, since typical offspring are not very "fit" solutions. Modern machine learning (ML) methods have the capacity to fit complex functions in high-dimensional feature spaces while controlling overfitting.3,4 However, the high-dimensional feature space means that finding an optimum in an ML model is not a simple task. Here, the robustness of the GA is analyzed while accelerating its convergence through integration with an on-the-fly established Gaussian process (GP) regression model of the feature space. Although we have used a GP model, any ML framework, e.g. deep learning, would also be applicable.
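The evolutionary loop described above can be sketched in a few lines. This is a generic illustration, not the implementation used in this work: the roulette selection, elitist replacement, and the parameter names are illustrative assumptions.

```python
import random

def evolve(pop, fitness, crossover, mutate, n_gens, p_cross=0.6):
    """Minimal generational GA sketch: roulette-wheel parent selection,
    then either crossover or mutation (never both), with elitist
    replacement. Assumes strictly positive fitness values."""
    for _ in range(n_gens):
        scores = [fitness(c) for c in pop]
        total = sum(scores)
        weights = [s / total for s in scores]  # roulette-wheel probabilities
        offspring = []
        while len(offspring) < len(pop):
            if random.random() < p_cross:
                a, b = random.choices(pop, weights=weights, k=2)
                offspring.append(crossover(a, b))
            else:
                (a,) = random.choices(pop, weights=weights, k=1)
                offspring.append(mutate(a))
        # elitist replacement: keep the fittest of parents + offspring
        pop = sorted(pop + offspring, key=fitness, reverse=True)[: len(pop)]
    return pop
```

With elitism, the best fitness in the population is non-decreasing, which is the "progression" property referred to throughout this paper.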
For materials applications, GAs have typically employed (semi-)empirical potentials (EPs)5–11 to describe the potential energy surface (PES).12–15 The utilization of more accurate methods to describe the PES, such as density functional theory (DFT), has been limited due to computational cost. To account for the increased computational cost of searching the PES directly with DFT, studies have often been limited in size,16 though these methods have been used successfully in a number of investigations.17–25 This study focuses on utilizing the GA to gain an understanding of chemical ordering within larger particles. Searching across a range of compositions is particularly important in the field of materials discovery, where composition can have a profound effect on the desired property, e.g., catalytic activity.26,27 Further, the optimal composition may vary with the size of the nanoparticle. The accurate description of chemical ordering is therefore important, and for certain motifs the ordering is very complex.28 Focus is placed on expediting a fast, unbiased homotop search2 by reducing the number of energy evaluations needed to explore the PES and locate the putative global minimum, i.e. the full convex hull, for a given template structure.

A. Icosahedral particles
The chemical ordering of atoms is optimized for a 147-atom Mackay icosahedral template structure.29 Searches elucidate the full convex hull of possible Pt_xAu_{147−x} compositions for x ∈ [1, 146]. The convex hull is defined as the line connecting the lowest excess energy of each stable composition. A composition is unstable if its lowest-excess-energy homotop lies above this line. The number of homotops for each particle rises combinatorially toward the 1:1 composition. For a particle of N atoms containing n_A atoms of type A and n_B atoms of type B, the number of possible homotops is given by Equation 1:

N_homotops = N! / (n_A! n_B!)    (1)
Summed over all 146 compositions, there are a total of 1.78 × 10^44 homotops. The number of homotops for each composition is shown in Fig. 1, together with an example of a randomly ordered icosahedral structure under consideration in this study. A small number of PtAu compositions will preferentially distort to form rosette-icosahedral rather than Mackay icosahedral structures.30,31 Other distortions have previously been observed for the AuPt system,32 but not in this study. The GA locates the rosette-distorted structures in a number of cases; however, as structure optimization is not the focus of these benchmarks, such distorted candidates are prevented from entering the population, preserving the template structure.
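The combinatorics above are easy to verify with a short script evaluating Equation 1 for every mixed composition of the 147-atom particle (variable names are illustrative):

```python
from math import comb

N = 147  # atoms in the Mackay icosahedral template

def n_homotops(N, n_A):
    """Equation 1: number of homotops with n_A atoms of type A,
    N! / (n_A! n_B!) = C(N, n_A)."""
    return comb(N, n_A)

# Sum over all mixed compositions x = 1..146
total = sum(n_homotops(N, x) for x in range(1, N))
print(f"{total:.2e}")  # 1.78e+44, matching the value quoted above
```

The sum equals 2^147 − 2 (all orderings minus the two pure particles), which is why the total is ~1.78 × 10^44.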

B. Traditional Genetic Algorithm
We first run a traditional GA (described in detail in the Methods section III B) to baseline our benchmark and then describe the ML extensions and their results. Using the traditional GA, it is possible to locate the hull of local minima with ∼16,000 candidate minimizations. This is significantly lower than the total number of homotops present, and thus the number of energy calculations required by a brute-force method (1.78 × 10^44). However, it is still typically more energy calculations than one would wish to perform if a more expensive energy calculator were being employed.

C. Machine Learning Accelerated Genetic Algorithm

To overcome inefficiencies in this method, the underlying search algorithm is optimized and coupled with machine learning selection. A GP regression model is used to predict the excess energies of nanoparticles before employing electronic structure calculations. A discussion of the GP model is given in the Methods section III C. This approach is well suited to making large steps on the PES without performing expensive energy evaluations. A difficulty when searching with the MLaGA is that the convergence criteria typically used in these studies are no longer suitable. The MLaGA methodology is specifically implemented to limit the number of energy evaluations that are performed. Therefore, every candidate in the generation typically progresses the population. This progression continues until the ML routine is unable to find new candidates that are predicted to be better, essentially stalling the search. For this reason, convergence is considered to be reached once the search stalls in this manner.
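The surrogate-screening idea described above can be sketched schematically. The `predict`/`update` model interface and the optimistic lower-bound acceptance rule below are hypothetical simplifications for illustration, not the exact MLaGA implementation:

```python
def surrogate_screened_generation(candidates, model, evaluate, best_known):
    """One schematic MLaGA generation: only candidates the surrogate
    predicts could beat the current best are passed to the expensive
    energy calculator, and the model is retrained on the new data."""
    promising = []
    for cand in candidates:
        mu, sigma = model.predict(cand)   # surrogate mean and uncertainty
        if mu - sigma < best_known:       # optimistic bound beats the best
            promising.append(cand)
    # expensive electronic-structure calls only for the screened subset
    results = [(cand, evaluate(cand)) for cand in promising]
    model.update(results)                 # on-the-fly retraining
    return results
```

When `promising` comes back empty generation after generation, the search has stalled in the sense described above, which is what serves as the convergence signal.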

D. DFT verification
To ensure that the advantages of the methodologies discussed above were not an artifact of utilizing the less accurate EMT calculator, the MLaGA was tested searching directly on the DFT potential energy surface. The convex hull located for the DFT search is shown in Fig. 4a. The shaded region shows the difference in stability between the distorted structures and the most stable icosahedral structures located. The complete core-shell Au_92Pt_55 structure is located as the most stable for both the EMT and DFT searches. There is good general agreement between the structures obtained elsewhere on the hull, aside from the region of distortion. We conclude this section with a discussion of how to include geometrical optimization in this framework. First, one would add operators that move the atoms and use a fingerprint able to invariantly describe the local geometrical environment.35,36 However, if any appreciable rearrangement takes place during relaxation, the fingerprint is no longer reliable, rendering the ML-predicted energies unreliable. It is possible to limit the relaxation to maintain reliable fingerprints, but this would require the GA to run through more candidates, as the steps taken on the PES would be smaller. Another possibility is to utilize ML schemes that can also take care of relaxation;37–39 some examples already exist for coupling these with a GA.40,41

III. METHODS

A. Computational details
The effective-medium theory (EMT) potential12 is used as the energy calculator for initial benchmarking. The fast inertial relaxation engine (FIRE)42 is used for local relaxations.

B. Traditional Genetic Algorithm
The GA implemented within the Atomic Simulation Environment (ASE) software package46 has been utilized.
The excess energy is used when determining fitness within the GA, as in Equation 2:

E_exc = E_AB − (n_A/N) E_A − (n_B/N) E_B    (2)

For particles containing a total of N atoms, n_A and n_B are the number of atoms of type A and B, respectively. E_AB is the total energy of the mixed particle, while E_A and E_B are the energies of the pure particles. To efficiently search across the full compositional convex hull, we employ a fitness function based on a niching routine.47 The candidates are grouped according to composition, and the fitness assigned within each composition niche is based on the excess energy (Equation 2). The fittest individuals for each composition, i.e. across niches, are given equal fitness. This negates the energy penalties that would otherwise prevent the algorithm from searching for minima of all possible compositions and bias the search towards a narrow composition window comprising the lowest excess energies.
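Equation 2 and the niching fitness can be illustrated with a small sketch. The linear within-niche scaling below is an assumption for illustration (not ASE's niching implementation); the property that the fittest candidate of every composition receives equal, maximal fitness follows the description above.

```python
from collections import defaultdict

def excess_energy(E_AB, n_A, n_B, E_A, E_B):
    """Equation 2: excess energy of a mixed particle relative to the
    composition-weighted energies of the pure particles."""
    N = n_A + n_B
    return E_AB - (n_A / N) * E_A - (n_B / N) * E_B

def niched_fitness(population):
    """Sketch of niching: group candidates by composition (here just n_A)
    and rank by excess energy within each niche. Each candidate is a
    (n_A, E_exc) pair; the best of every niche gets fitness 1.0."""
    niches = defaultdict(list)
    for n_A, e in population:
        niches[n_A].append(e)
    fitness = {}
    for n_A, energies in niches.items():
        best, worst = min(energies), max(energies)
        span = (worst - best) or 1.0   # avoid division by zero
        for e in energies:
            fitness[(n_A, e)] = 1.0 - (e - best) / span
    return fitness
```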
When initializing the traditional GA, the population size is set to 150 candidates, which, due to niching, keeps all compositions in the population. Parent selection is handled by roulette wheel selection, with selection probabilities directly related to the ascribed fitness, which accounts for the stability of the nanoparticle. Offspring are created either by mating two parents or by mutating a single candidate. The mating and mutation routines are mutually exclusive and are not allowed to stack, i.e. a candidate does not undergo both crossover and mutation before evaluation. Cut-and-splice crossover functions, described by Deaven and Ho,5 are used to generate new candidates with a call probability of 0.6.
Random permutation mutations are utilized with a call probability of 0.2, i.e. swapping the positions of two random atoms of different elemental species. A random swap mutation is also employed with a call probability of 0.2, where one atom type is swapped for another.
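The two mutation operators can be sketched as follows, assuming the chemical ordering is represented as a list of element symbols for a mixed Pt/Au particle (a minimal sketch, not ASE's operators):

```python
import random

def permutation_mutation(symbols):
    """Swap the positions of two atoms of different species.
    Composition is preserved. Assumes at least two species are present."""
    s = symbols[:]
    i, j = random.sample(range(len(s)), 2)
    while s[i] == s[j]:
        i, j = random.sample(range(len(s)), 2)
    s[i], s[j] = s[j], s[i]
    return s

def random_swap_mutation(symbols):
    """Swap one atom's type for the other species, so the composition
    changes by one atom."""
    s = symbols[:]
    i = random.randrange(len(s))
    s[i] = "Au" if s[i] == "Pt" else "Pt"
    return s
```

Note that only the random swap mutation moves the search between composition niches; the permutation mutation explores homotops within a niche.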
The convergence criterion is assessed through a lack of progression in the population, i.e. the fitness of the population does not change for a number of generations.6 The GA is run with relatively loose convergence criteria: when there has been no observed change in the population for two generations, the search is concluded.

C. Gaussian Process Regression Model
The squared exponential kernel was utilized for the mapping function, as in Equation 3:

k(x, x′) = exp(−‖x − x′‖² / (2w²))    (3)

The kernel is applied to determine the relationship between the fingerprint vectors (x) of two candidates, where ‖x − x′‖ is the Euclidean L2-norm of their difference and w denotes the kernel width.
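A minimal implementation of the squared exponential kernel of Equation 3 (the default width is an arbitrary placeholder):

```python
import numpy as np

def se_kernel(x1, x2, w=1.0):
    """Equation 3: k(x, x') = exp(-||x - x'||^2 / (2 w^2)),
    where w is the kernel width."""
    d = np.linalg.norm(np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float))
    return np.exp(-d**2 / (2.0 * w**2))
```

The kernel equals 1 for identical fingerprints and decays toward 0 as the chemical orderings become dissimilar, with w setting the decay scale.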
The training dataset comprises unique numerical fingerprint vectors, with features representing the distinct chemical ordering within a particle, based on a simple measure, the averaged number of nearest neighbors, as in Equation 4, where, e.g., #A-A counts the number of homoatomic bonds between atoms of type A.
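One plausible reading of this neighbor-count fingerprint is sketched below. The distance cutoff and the exact normalization (dividing the bond counts by the number of atoms) are assumptions for illustration, not taken from the text:

```python
import numpy as np

def bond_count_fingerprint(positions, symbols, cutoff=1.2):
    """Count A-A, A-B and B-B nearest-neighbor bonds (pairs closer than
    `cutoff`) and average by the number of atoms. Assumes two species
    labelled 'A' and 'B'; cutoff and normalization are illustrative."""
    pos = np.asarray(positions, dtype=float)
    n = len(pos)
    counts = {"A-A": 0, "A-B": 0, "B-B": 0}
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pos[i] - pos[j]) < cutoff:
                pair = "".join(sorted(symbols[i] + symbols[j]))
                counts[{"AA": "A-A", "AB": "A-B", "BB": "B-B"}[pair]] += 1
    return [counts["A-A"] / n, counts["A-B"] / n, counts["B-B"] / n]
```

Because the fingerprint depends only on bond counts, it is invariant to rigid rotations and to permutations of same-species atoms, which is what makes it suitable for a fixed template structure.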
The energy of a new candidate with fingerprint vector x* is predicted by the GP posterior mean, as in Equation 5:

E_pred(x*) = kᵀ(K + λI)⁻¹ y    (5)

Here, k is the covariance vector between the new data point and the training dataset, K is the covariance, or Gram, matrix for the training data, y is the vector of training energies, and λ is the regularization hyperparameter added in order to evaluate the uncertainty of the full model.4 In order to progress the search as efficiently as possible, the cumulative distribution function (cdf), as in Equation 6, is used as the fitness of a candidate.
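The GP prediction can be sketched in its standard form; the `kernel` argument and the variance expression below follow textbook GP regression and are illustrative rather than the exact implementation used here:

```python
import numpy as np

def gp_predict(X_train, y_train, x_star, kernel, lam=1e-6):
    """Standard GP regression posterior for a new fingerprint x_star:
    mean = k^T (K + lam*I)^-1 y, variance = k(x*,x*) - k^T (K + lam*I)^-1 k,
    with lam the regularization added to the diagonal of the Gram matrix."""
    K = np.array([[kernel(a, b) for b in X_train] for a in X_train])
    k = np.array([kernel(a, x_star) for a in X_train])
    K_inv = np.linalg.inv(K + lam * np.eye(len(X_train)))
    mu = k @ K_inv @ np.asarray(y_train, dtype=float)
    var = kernel(x_star, x_star) - k @ K_inv @ k
    return mu, max(var, 0.0)
```

At (or near) a training point the predicted variance collapses toward zero, while far from all training data it approaches the prior variance; this behavior is what the cdf-based fitness below exploits.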
When the fitness function also accounts for the variance, it is possible to utilize the inherent uncertainty within a prediction to either exploit the current known information in the model, or to explore unknown regions of the search space. 49 The cdf is calculated up to the current known fittest candidate in the composition.
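A minimal sketch of this acquisition-style fitness, assuming Equation 6 is the normal cdf of the predicted energy evaluated at the fittest known candidate of the niche (function and argument names are illustrative):

```python
from math import erf, sqrt

def cdf_fitness(mu, sigma, best_e):
    """Probability that a candidate's true excess energy lies below the
    current fittest candidate in its composition niche:
    Phi((best_e - mu) / sigma), the standard normal cdf."""
    if sigma <= 0.0:
        return 1.0 if mu < best_e else 0.0
    z = (best_e - mu) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

A candidate with a poor predicted mean but large uncertainty can still receive appreciable fitness, which is exactly the explore/exploit trade-off described above.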

IV. DATA AVAILABILITY
The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

FIG. The convergence profile for the GA search. The error is the cumulative energy deviation from the correct convex hull, plotted against each energy calculation, i.e. it gives an indication of the energy gain of each calculation.