Learning grain boundary segregation energy spectra in polycrystals

The segregation of solute atoms at grain boundaries (GBs) can profoundly impact the structural properties of metallic alloys, and induce effects that range from strengthening to embrittlement. Although segregation is known to be anisotropic, there is limited understanding of how solute segregation tendencies vary across the full, multidimensional GB space, which is critically important in polycrystals where much of that space is represented. Here we develop a machine learning framework that can accurately predict the segregation tendency of solute atoms at GB sites in polycrystals, quantified by the segregation enthalpy spectrum, based solely on the undecorated (pre-segregation) local atomic environment of such sites. We then use the learning framework to scan across the alloy space, and build an extensive database of segregation energy spectra for more than 250 metal-based binary alloys. The resulting machine learning models and segregation database are key to unlocking the full potential of GB segregation as an alloy design tool, and enable the design of microstructures that maximize the useful impacts of segregation.

In alloys, the segregation of solute atoms at grain boundaries (GBs) induces structural effects 1,2 that include strengthening 3-5, embrittlement 6,7, corrosion resistance 8,9, and GB phase transitions 10,11. As such, controlling GB segregation is an essential tool for many engineering applications 12, including, e.g., the thermodynamic stabilization of nanocrystalline alloys against grain growth 13-15. Yet, although most technically relevant alloys are used in polycrystalline form, there is very limited understanding of GB segregation in polycrystals 16, and a general lack of databases of segregation information relevant to them.
In a polycrystal, the GB network has a variety of site-types that can either promote or inhibit segregation to different degrees, depending on their unique local atomic environments. The drive for a solute atom to segregate to a GB site-type (i) is quantified by the segregation enthalpy ΔH_i^seg, which, in solids 17, is equivalent to the internal energy difference between the solute atom occupying the GB site and a bulk (intra-grain) site, ΔE_i^seg. The spectrum of ΔE_i^seg in a polycrystal determines the extent of equilibrium GB segregation in an alloy 18-20. Recently, we have shown this spectrum to be captured by a skew-normal distribution for Mg solute segregation in an Al polycrystal 20. However, the computation of these segregation spectra is a resource-intensive task. For example, a (50 nm)³ Al polycrystal with an average grain size of 10 nm has roughly one million GB sites, which translates to a million atomistic calculations, where a solute atom is placed substitutionally at each GB site independently and allowed to relax. This makes the task of investigating different microstructures (i.e. multiple polycrystalline samples) cost-prohibitive for a given alloy.
Here, we propose a machine learning (ML) framework that can accurately predict the relaxed segregation energy of a solute atom in a GB site, solely based on its undecorated (pre-segregation) atomic environment. Our approach is tiered and offers two models. The first is a high-fidelity model that is trained to accurately capture the variation of segregation energy across a large swath of the GB space, and thus can be used to study an alloy system in detail and instantaneously evaluate segregation for different microstructures. The second is an accelerated model that uses dimensionality reduction to reproduce the high-fidelity model, with a minimal loss in accuracy, using three orders of magnitude fewer data points for training (only 100 sites). We use the accelerated approach to scan across the alloy space, and build an extensive database giving GB segregation spectra for all aluminum-, magnesium-, and transition metal-based binary alloys for which an interatomic potential exists in the Interatomic Potentials Repository 21,22 of the National Institute of Standards and Technology (NIST), a total of 259 binary alloys. This database allows us to identify alloys of interest with minimal computational cost, for which high-fidelity models can be trained and used. The proposed ML framework and the resulting spectral segregation database should provide a general and broadly applicable alloy design toolbox relevant to all material properties impacted by solute segregation.

Results
High-Fidelity ML model for GB segregation. If a solute atom is substitutionally placed at a GB site and is allowed to relax, its local neighboring atoms will be affected (and possibly displaced) by the introduced elastic and chemical interactions. Hence it follows that the local atomic environment (LAE) of a GB site will influence its favorability for solute segregation, and thus this environment should be accurately captured in any learning model that aims to correlate the undecorated (pre-segregation) GB site to its final decorated (post-segregation) relaxed state. So far, the state-of-the-art learning models in the literature use simple, well-known structural features 23,24, such as atomic volume, coordination, and Voronoi parameters, which mostly limit the description of the LAE to its first nearest-neighbor atoms. Instead, we propose using an atom-centered feature extraction method (a "descriptor") that encodes the local atomic environment around an atom within a cutoff radius 25,26. Such descriptors, also known as "fingerprints", are developed and widely used to construct ML-based interatomic potentials; examples include the atom-centered symmetry functions 27, bispectrum components 28,29, and smooth overlap of atomic positions (SOAP) 30. There are two main advantages to using such atom-centered descriptors. The first is that no a priori knowledge or selection of what constitutes an important structural feature of the LAE (such as volume, coordination, etc.) is required; rather, by using a complete description of the LAE within a cutoff radius, we relegate the decision of learning the most important features to the ML model. The second is that the use of a large cutoff radius ensures that the most dominant interactions between the solute atom and its LAE are captured. As these descriptors are borrowed from the interatomic potential fitting literature, we can think of our approach as fitting a "pseudo interatomic potential" for solute segregation at GBs.
The proposed high-fidelity ML model is summarized in Fig. 1, which shows two main steps: (a) feature extraction and (b) a learning algorithm. For feature extraction, we use the SOAP method 30, as it was recently shown to perform well in describing GB environments (albeit for the different problem of predicting GB energies) 31. For a given GB site and its LAE within a cutoff radius, the SOAP method produces a feature vector (descriptor) that is invariant under all physical symmetries (permutation, translation, rotation, etc.). The size of the feature vector is controlled by the SOAP hyperparameters (detailed in the Methods section), which, in essence, determine the resolution of the vector and its sensitivity to changes in the LAE. In this work, the SOAP feature vector for each GB site has F_SOAP = 1,015 features. For the cutoff radius, we use 6 Å, which is a conservative cutoff used in constructing interatomic potentials, as it captures the most dominant atomic interactions for an atom with its LAE 25,32. We note that, though we opted to use the same F_SOAP and a radial cutoff of 6 Å for all binary alloys (as optimal parameters that require minimal input from the user), this procedure is flexible, and one could, by further optimizing the SOAP hyperparameters to the specific alloy of interest, improve the accuracy of the ML model. (For example, a solute atom that has a large size mismatch with solvent atoms could benefit from a larger radial cutoff.) The product of the first step of the ML framework, feature extraction, is a feature matrix of size (N_GB atoms × F_SOAP features), which is used as the input to the second step, the learning algorithm, which learns to map the input SOAP features to the target property (segregation energy).
For the learning algorithm, we use linear regression for three reasons: first, it is a simple, inexpensive model to train and use for predictions; second, it can be automated as it does not require any hyperparameter optimization; and third, it inherently ensures regularization (i.e. is less prone to overfitting): by simply using a sample size of >10 × F_SOAP GB sites (following the "one in ten" rule of thumb 33, which we further validate in Supplementary Fig. 21) to fit the F + 1 coefficients of the model (F coefficients + intercept), we guard against model overfitting and against selection bias towards a small subset of the population (randomly sampling as few as ~400 points from an infinite population gives a 95% confidence level and 5% margin of error 34). We note that although more elaborate learning algorithms could be used, such as support vector machines 31, Gaussian process regression 28 or neural networks 32, our proposed ML framework prioritizes simplicity and minimal input from the user, so that other researchers can adopt it easily. We use this approach to showcase the utility of using atom-centered descriptors for learning GB site segregation energies, without getting lost in the intricate details of fine-tuning more advanced learning algorithms. We note that though the proposed learning framework focuses on segregation spectra in substitutional alloys, it is extensible in principle to interstitial alloys by defining interstitial sites 35 at the GB and bulk regions.
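The two sample-size figures quoted above can be checked with a short worked calculation. This is a sketch under stated assumptions: the ~400-point figure is assumed to come from the standard sample-size formula for estimating a proportion (with the most conservative proportion p = 0.5 and z = 1.96 for a 95% confidence level), and the "one in ten" rule is applied to the F_SOAP = 1,015 features used in this work.

```latex
% Sample size for a 95% confidence level (z = 1.96) and a 5% margin of error (e = 0.05),
% assuming the most conservative proportion p = 0.5 and an infinite population:
n = \frac{z^{2}\,p\,(1-p)}{e^{2}}
  = \frac{(1.96)^{2}\,(0.5)(0.5)}{(0.05)^{2}}
  = \frac{0.9604}{0.0025}
  \approx 384 \quad (\text{i.e. roughly } 400 \text{ points})

% "One in ten" rule applied to the SOAP feature vector:
N_{\mathrm{train}} > 10 \times F_{\mathrm{SOAP}} = 10 \times 1015 = 10\,150 \approx 10^{4} \text{ GB sites}
```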
Using the high-fidelity approach, we train a model, in Fig. 1c, for Mg solute 36 segregation in a thermally annealed 20 × 20 × 20 nm³ Al polycrystal that has 16 grains and ~10⁵ GB sites, using a randomized 50/50 split for training/testing. This simple holdout method is easy and cheap to train and use, and its conservative test ratio will guard against a high-variance model in most cases. The trained model is highly accurate, with a mean absolute error (MAE) of 2.4 (2.5) kJ/mol for the training (testing) dataset, and a root-mean-square error (RMSE) of 3.8 (4.1) kJ/mol. The model faithfully reproduces the distribution of segregation enthalpies in the polycrystal and has a well-behaved error with normally distributed residuals. This result compares favorably with a more sophisticated ML model by Huber et al. 23, which used 19 structural features (volume, coordination, Voronoi analysis parameters, and Steinhardt bond-order parameters) with gradient boosted decision trees, and had a 9-fold cross-validation RMSE = 7.7 kJ/mol for Mg solute segregation in a database of 38 low- and high-symmetry boundaries in Al. The comparison is not direct, of course, since that work focused on bi-crystals whereas we use polycrystals, but it is encouraging that the present error is also much lower than the reported error of the interatomic potential as compared to DFT GB segregation energies, which has an RMSE of 8.7 kJ/mol 24.
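The holdout procedure described above can be illustrated with a minimal sketch. It uses NumPy least squares rather than the Scikit-learn implementation used in this work, and the feature matrix, coefficients, and noise level are synthetic placeholders (all names here are hypothetical), so the printed errors say nothing about any real alloy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the (N_GB x F) feature matrix and segregation energies.
# Real inputs would be SOAP vectors and molecular statics energies.
n_sites, n_features = 2000, 20
X = rng.normal(size=(n_sites, n_features))
hidden_coef = rng.normal(scale=5.0, size=n_features)
y = X @ hidden_coef - 10.0 + rng.normal(scale=2.0, size=n_sites)  # kJ/mol

# Randomized 50/50 holdout split (1000 training sites > 10 x F, "one in ten" rule).
perm = rng.permutation(n_sites)
train, test = perm[: n_sites // 2], perm[n_sites // 2:]

# Ordinary least squares for the F + 1 coefficients (F weights + intercept).
design_train = np.column_stack([X[train], np.ones(train.size)])
coef, *_ = np.linalg.lstsq(design_train, y[train], rcond=None)


def predict(features):
    """Apply the fitted linear model to a feature matrix."""
    return np.column_stack([features, np.ones(len(features))]) @ coef


def mae(y_true, y_pred):
    """Mean absolute error."""
    return float(np.mean(np.abs(y_true - y_pred)))


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


mae_train, mae_test = mae(y[train], predict(X[train])), mae(y[test], predict(X[test]))
rmse_train, rmse_test = rmse(y[train], predict(X[train])), rmse(y[test], predict(X[test]))
print(f"MAE  train/test: {mae_train:.2f}/{mae_test:.2f} kJ/mol")
print(f"RMSE train/test: {rmse_train:.2f}/{rmse_test:.2f} kJ/mol")
```

Similar train and test errors, as in this toy case, are the signature of a model that is not overfitted.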
We further validate the efficacy of the high-fidelity ML model for GB solute segregation across the alloy space by training on six more 20 × 20 × 20 nm³ polycrystalline volumes for different alloys: Ag(Ni) 37, Cu(Zr) 38, Fe(Al) 39, Ni(Cu) 40, Pt(Au) 41, and Zr(Ni) 42. As shown in Fig. 2, the ML model accurately reproduces the segregation spectra for the six binary alloys, and has a low MAE, typically below ~6 kJ/mol and often below 1 kJ/mol. Alloys with higher absolute values (wider distribution) for the segregation energy will correspondingly have a higher MAE, and the worst of these seen here is MAE = 12.8 kJ/mol for the Zr(Ni) system, but here the segregation spectrum spans about 250 kJ/mol; as a fraction of the total spread of the segregation spectrum, the MAE is uniformly below about 5%. To test how well the high-fidelity framework generalizes, we report the mean (and standard deviation) absolute errors using 5-fold cross-validation in Supplementary Table 1, which shows that the fitted models are able to generalize well to the unseen folds of the dataset (with similar errors as reported in Fig. 2 for the 50/50 holdout method, and low standard deviation across the folds). We note that although most of the surveyed base metals have an fcc lattice structure, the ML framework seems to be insensitive to the lattice structure, as it performs similarly well for bcc (Fe) and hcp (Zr) metals. Therefore, we conclude that the high-fidelity ML model can be used to accurately model GB segregation across the GB and alloy spaces.
Accelerated ML model for GB segregation. In alloy design, it is of interest to be able to quickly scan across the alloy space for interesting combinations. In the context of GB segregation, for example, significant efforts have been made to screen for nanocrystalline stabilizing elemental combinations 15,43, complexion forming combinations 44, or GB embrittling solute additions 45-47. Though the high-fidelity ML model is highly accurate, it still requires ~10⁴ data points for training and fitting its ~10³ coefficients (features). To reduce the training cost and permit a broader scan across the full alloy space, it is appropriate to reduce the dimensions of the input features. We propose the use of unsupervised dimensionality reduction algorithms, which map a high-dimensional feature vector into a low-dimensional embedding that captures its main characteristics; "unsupervised" signifies that such mapping is done without a priori knowledge of the target values (segregation energies). As an illustration, we adopt the simplest of these algorithms, namely principal component analysis (PCA) 48, which we use to transform the F_SOAP = 1,015 features into 10 principal components (P_SOAP) that maximize the captured variance of the feature space. We can think of this process as compressing the 1,015 features into 10 components; such compression captures >99% of the variance of the SOAP feature matrix of the Al polycrystal, as shown in Fig. 3.
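The compression step can be sketched as follows. This is an illustrative PCA written directly with a NumPy singular value decomposition, not the Scikit-learn implementation used in this work, and the feature matrix is a synthetic, strongly correlated stand-in (five hypothetical latent factors), so the captured variance printed here is a property of the toy data only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a SOAP feature matrix: 500 sites, 50 correlated
# features generated from 5 latent factors plus weak noise.
latent = rng.normal(size=(500, 5))
mixing = rng.normal(size=(5, 50))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 50))


def pca_fit(features, n_components):
    """Return the feature mean, principal axes, and explained-variance ratios."""
    mean = features.mean(axis=0)
    # SVD of the centered matrix; the rows of vt are the principal axes,
    # ordered by decreasing singular value.
    _, s, vt = np.linalg.svd(features - mean, full_matrices=False)
    variance = s ** 2
    ratio = variance / variance.sum()
    return mean, vt[:n_components], ratio[:n_components]


mean, axes, ratio = pca_fit(X, n_components=10)
P = (X - mean) @ axes.T  # reduced (N x 10) feature matrix

print("reduced shape:", P.shape)
print("variance captured by 10 components:", float(ratio.sum()))
```

Because the projection uses only the feature matrix, it can be computed for every GB site before any segregation energy is calculated.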
For an accelerated option of the ML framework, we propose using the 10 principal components obtained from PCA as the input for the linear regression algorithm 49,50. As the problem is now reduced to fitting P_SOAP + 1 coefficients (instead of F_SOAP + 1), we conservatively only need ~P × 10 = 100 data points for training; 100 molecular statics computations involving the substitution of a single solute atom at a grain boundary site in a polycrystal give insight into the entire segregation spectrum. As for the selection of the 100 training data points, though random selection can be used, this could be a biased approach, as such a small number of points may access only a prevalent subset of the GB feature space in a given polycrystalline structure. Instead, we propose using k-means clustering 51,52 to partition the reduced feature space into k = 100 clusters that minimize within-cluster variances. We then use the cluster centroids to identify optimal training data points (i.e., those with the shortest Euclidean distance to the centroids), as shown in Fig. 3, for which GB segregation is computed, and use these to train the accelerated model. Such an approach is computationally inexpensive and ensures full coverage of the feature space in our training dataset.
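The centroid-based selection of training sites can be sketched as below. This is a plain Lloyd's-iteration k-means written in NumPy for illustration, not the Scikit-learn routine used in this work; the reduced feature matrix is random placeholder data and the function names are hypothetical.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm: returns k centroids that locally minimize
    the within-cluster variance."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distances between every point and every centroid.
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    return centroids


def nearest_sites(points, centroids):
    """Index of the data point closest to each centroid (duplicates removed)."""
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return np.unique(d2.argmin(axis=0))


rng = np.random.default_rng(1)
P = rng.normal(size=(2000, 10))  # stand-in for 10 principal components per site

centroids = kmeans(P, k=100)
train_idx = nearest_sites(P, centroids)
print("number of selected training sites:", len(train_idx))
```

Only the sites in `train_idx` would then need a molecular statics calculation; a linear model fitted on them can be applied to the reduced features of all remaining sites.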
Similar to the high-fidelity model, the accelerated one can be fully automated and requires minimal input from the user. To compare the performance of both approaches, an accelerated model for Al(Mg), trained with only 100 GB sites, results in an MAE of 4.2 kJ/mol for predictions of the full ~10⁵ GB sites, compared to an MAE of 2.5 kJ/mol from the high-fidelity model trained with 50% of GB sites. This reduction (two orders of magnitude) in the required training data points, with minimal loss of accuracy, is significant, and showcases the power of the accelerated model to quickly and accurately predict the segregation spectra in binary alloys. It also signifies that the full GB space could possibly be reduced to a small number of key GB environments, also known as GB "building blocks" 31, that decipher the features of the full space. We expect this to be a significant direction of future work in the context of grain boundary segregation.
Using the accelerated approach, we build ML models to predict solute segregation spectra in polycrystals for all aluminum-, magnesium-, and transition metal-based binary alloys (Supplementary Figs. 2-20) that have interatomic potentials in the NIST Interatomic Potentials Repository, a total of 259 alloys (see Supplementary Fig. 1). This segregation database not only allows us to screen the alloy space for segregation "hot-spots" or regions of interest, but also to compare the variation of the spectrum with different interatomic potentials (for alloys where more than one potential exists). To illustrate the utility of the database, we plot in Fig. 4 all solute segregation spectra in nickel-based alloys; Ni(Ag) 37 is predicted to be highly segregating, whereas Ni(Al) 53 is predicted to be the opposite.

Discussion
[Figure caption (truncated in source): ... Fig. 1, for solute segregation in six 20 × 20 × 20 nm³ polycrystalline alloys: a-f Ag(Ni) 37, Cu(Zr) 38, Fe(Al) 39, Ni(Cu) 40, Pt(Au) 41, and Zr(Ni) 42.]
There are three key findings to the spectral segregation database (Fig. 4 and Supplementary Figs. 2-20). The first is that all segregation spectra in all binary alloys surveyed, as hypothesized earlier in our study of the Al(Mg) system 20, are captured well by a skew-normal function (the fitted probability density function has an R² > 0.95 in all but one alloy with an R² = 0.80; see Supplementary Figs. 2-20). This function involves three parameters, the characteristic energy μ, width σ and shape α of the distribution:

$$F_i^{\mathrm{gb}}\left(\Delta E_i^{\mathrm{seg}}\right) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{\left(\Delta E_i^{\mathrm{seg}} - \mu\right)^2}{2\sigma^2}\right] \mathrm{erfc}\left[-\frac{\alpha\left(\Delta E_i^{\mathrm{seg}} - \mu\right)}{\sqrt{2}\,\sigma}\right] \qquad (1)$$

These parameters are provided in the corresponding figure for each alloy considered. The second key finding is that using a McLean "average" 54 segregation energy to characterize a binary alloy, which is the segregation literature norm 55,56, misses key information about the accessible segregation states at the GB network. For example, the Ni(Ag) system, which has a reported "average" segregation energy of −50 kJ/mol 55, has approximately 15% of its GB network with segregation energies more than twice that, below −100 kJ/mol, as shown in the first panel of Fig. 4. GB segregation occurs first in the lowest energy states, and before a grain boundary in Ni(Ag) would experience the McLean average segregation energy, it would lie at an extremely high composition of approximately 50 atom% Ag. The knowledge of the full spectrum is thus essential to enable the design of microstructures 57 that maximize the desired tail of the segregation spectrum (i.e. either promote or inhibit segregation). The third key finding is that, for alloys with more than one available interatomic potential, the computed segregation spectra can be sensitive to the choice of the potential.
For example, potentials for the Al(Ni) system produce completely different segregation spectra, as shown in Supplementary Fig. 3, which range from having almost all GB sites being unfavorable to segregation 58 to the complete opposite 59; such variation can result in an order of magnitude difference in predictions for GB solute concentration even at low total solute concentrations in the system (see Supplementary Fig. 23). Further work is needed to quantify the accuracy of such potentials for GB segregation studies 60, and as always with atomistic models, it is important to remember that the present framework will only return physically reasonable results if the potential is specifically suitable for the problem at hand. For now, we report all of them, and leave the selection step to the judgment of the user.
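For readers who wish to work with the fitted (μ, σ, α) values, the skew-normal density can be evaluated as in the sketch below. It assumes the standard skew-normal form written with the complementary error function; the parameter values are illustrative placeholders and are not taken from the database.

```python
import math


def skew_normal_pdf(e_seg, mu, sigma, alpha):
    """Skew-normal probability density with characteristic energy mu,
    width sigma, and shape alpha (standard form, written with erfc)."""
    z = (e_seg - mu) / sigma
    gauss = math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * sigma)
    return gauss * math.erfc(-alpha * z / math.sqrt(2.0))


# Illustrative parameters only (kJ/mol); not fitted values from the database.
mu, sigma, alpha = -10.0, 20.0, -2.0

# Numerical check that the density integrates to one over a wide energy window.
lo, hi, n = mu - 12.0 * sigma, mu + 12.0 * sigma, 20000
step = (hi - lo) / n
area = sum(skew_normal_pdf(lo + i * step, mu, sigma, alpha) for i in range(n + 1)) * step

print("integral of the density:", area)
```

A negative α skews the distribution towards lower (more favorable) segregation energies, and α = 0 recovers a normal distribution.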
Though the analysis in Figs. 1-3 shows that the ML models faithfully reproduce most of the details of the GB segregation spectrum, this is not the most critical test for their practical viability; these models are only useful to the extent that they correctly capture GB segregation in some realistic situation. Thus, the most important metric is the prediction for the equilibrium GB segregation state (i.e. extent of segregation). For a spectrum of segregation energies at the GB network, the equilibrium solute distribution among the different sites follows Fermi-Dirac statistics 18-20. In a closed system with finite grain sizes, the total solute concentration X^tot is fixed and shared by the bulk (intra-grain) and GB solute concentrations, X^c and X^gb, respectively, according to the GB site fraction f^gb:

$$X^{\mathrm{tot}} = \left(1 - f^{\mathrm{gb}}\right) X^{\mathrm{c}} + f^{\mathrm{gb}} X^{\mathrm{gb}} \qquad (2)$$

The equilibrium X^c and X^gb are a function of the temperature T and the distribution of GB segregation energies F_i^gb(ΔE_i^seg), and are obtained by numerically solving for the X^c that satisfies the expanded form of Eq. (2) 20:

$$X^{\mathrm{tot}} = \left(1 - f^{\mathrm{gb}}\right) X^{\mathrm{c}} + f^{\mathrm{gb}} \sum_i F_i^{\mathrm{gb}} \left[1 + \frac{1 - X^{\mathrm{c}}}{X^{\mathrm{c}}} \exp\left(\frac{\Delta E_i^{\mathrm{seg}}}{kT}\right)\right]^{-1} \qquad (3)$$

where k is the Boltzmann constant. In Fig. 5, we compare the equilibrium GB segregation state obtained using the true computed spectrum versus the ML predicted ones with both high-fidelity and accelerated models, for all seven alloys from Figs. 1 and 2, in a polycrystal of average grain size 15 nm (f^gb ≈ 10%) at T = 600 K. The predictions of the ML models closely match those of the true spectrum, indicating that the ML models capture the necessary information to correctly predict the equilibrium segregation state. Also, as briefly discussed earlier, though the value of the MAE differs from one system to another, a higher MAE does not necessarily translate to a worse result when one normalizes to the scale of the segregation energy distribution, e.g. the Zr(Ni) system in Fig. 5. Finally, we note that the difference (deviation) in predictions of the equilibrium segregation state could be even less of an issue if the skew-normal approximation Eq. (1) is used, instead of the full discrete spectra, to quantify GB segregation using the continuous form of the segregation isotherm Eq. (3) 20:

$$X^{\mathrm{tot}} = \left(1 - f^{\mathrm{gb}}\right) X^{\mathrm{c}} + f^{\mathrm{gb}} \int_{-\infty}^{\infty} F_i^{\mathrm{gb}} \left[1 + \frac{1 - X^{\mathrm{c}}}{X^{\mathrm{c}}} \exp\left(\frac{\Delta E_i^{\mathrm{seg}}}{kT}\right)\right]^{-1} \mathrm{d}\left(\Delta E_i^{\mathrm{seg}}\right) \qquad (4)$$

as the three fitted parameters (μ, σ, and α) of the skew-normal function for the true and ML predicted spectra should closely match, even for systems with high MAE, because the residuals are well-behaved and normally distributed (with a zero mean, as shown in Figs. 1 and 3).
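The numerical solution for the equilibrium segregation state can be sketched with a simple bisection. This is a minimal illustration, assuming a Fermi-Dirac site occupation of the form 1 / [1 + ((1 − X^c)/X^c) exp(ΔE_i^seg / kT)], equal weights for all listed sites, and energies in kJ/mol (so the gas constant replaces the Boltzmann constant); the five-site spectrum and the function names are hypothetical.

```python
import math

R_GAS = 8.314462618e-3  # gas constant in kJ/(mol K), for energies given per mole


def site_occupation(x_c, e_seg, temperature):
    """Fermi-Dirac occupation of a GB site with segregation energy e_seg (kJ/mol)."""
    return 1.0 / (1.0 + (1.0 - x_c) / x_c * math.exp(e_seg / (R_GAS * temperature)))


def gb_concentration(x_c, e_segs, temperature):
    """Average occupation over all GB sites (each site weighted equally)."""
    return sum(site_occupation(x_c, e, temperature) for e in e_segs) / len(e_segs)


def solve_isotherm(e_segs, x_tot, f_gb, temperature, tol=1e-12):
    """Bisect for the bulk concentration x_c that closes the solute balance
    x_tot = (1 - f_gb) * x_c + f_gb * x_gb."""
    def residual(x_c):
        return (1.0 - f_gb) * x_c + f_gb * gb_concentration(x_c, e_segs, temperature) - x_tot

    lo, hi = 1e-15, 1.0 - 1e-15
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        # The residual increases monotonically with x_c, so bisection is safe.
        if residual(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    x_c = 0.5 * (lo + hi)
    return x_c, gb_concentration(x_c, e_segs, temperature)


# Hypothetical five-site spectrum (kJ/mol); a real spectrum has ~10^5 entries.
spectrum = [-40.0, -20.0, -10.0, 0.0, 10.0]
x_tot, f_gb, temperature = 0.05, 0.10, 600.0

x_c, x_gb = solve_isotherm(spectrum, x_tot, f_gb, temperature)
print(f"X_c = {x_c:.4f}, X_gb = {x_gb:.4f}")
```

Running the same solver on the true and the ML-predicted spectra of an alloy gives the kind of comparison of equilibrium GB concentrations discussed above.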
To motivate further analysis of the spectral segregation database, and visually summarize the segregation tendency across the alloy space, we plot a two-dimensional Pettifor 61 map in Fig. 6 (for most alloys in the database) using the 25th percentile value (energy) for the segregation spectra (i.e. 25% of GB sites have lower segregation energies). As the lower tail is the most enthalpically favorable, it will disproportionately influence the segregation tendency in any given alloy, especially at low or dilute solute concentration. The choice of the Pettifor chemical scale (which preserves the Mendeleev-type features of the elements 61) is based on its success in pattern clustering (separation) for miscibility 62, ordering tendency 63,64, and crystal structures of intermetallics 61 in binary alloys. Though Fig. 6 shows some clustering, it is not enough to draw concrete conclusions on the segregation tendency across the alloy space; the same finding applies to another two routinely used parameters to characterize the chemical and physical nature of the elements, electronegativity 65 and metallic radius 66 (see Supplementary Figs. 24-27). It is evident that more effort is needed to formulate (or extract from ML) simple phenomenological parameters (preferably derived from atomic features, e.g. Miedema-style parameters 67) that better explain these trends. We hope that this preliminary exploration of the data will promote further work on this front.
[Figure caption: Predictions of equilibrium X^gb using the true and predicted (from both high-fidelity and accelerated ML models) segregation spectra for seven polycrystalline alloys with an average grain size of 15 nm at T = 600 K.]
In summary, our proposed ML framework, inspired by methods developed for fitting ML-based interatomic potentials, aims to fit a "pseudo interatomic potential" for GB segregation energies in polycrystalline alloys. The framework is designed to require minimal input from the user, and as such, is automatable. As the ML literature is constantly evolving, we look forward to new developments and tools that can further improve the framework. We offered two model options. The first is a high-fidelity model that uses a large SOAP vector (>10³ features), a conservative radial cutoff (6 Å), and linear regression. The second is an accelerated model that uses PCA to transform the original features into a few (10) principal components (which are then used as input features to linear regression); this reduces the training requirement to just 100 key GB environments, which are selected by k-means clustering to ensure coverage of the GB space. The accelerated model is used to build an extensive database for segregation spectra in 259 binary alloys, which is included in the Supplementary Information. We look forward to applications of this database in alloy design, and hope it motivates more widespread use of spectral approaches to GB segregation in polycrystalline materials.
Methods
GB segregation enthalpies. The atomistic simulation package LAMMPS 68,69 is used for all molecular statics and dynamics simulations; OVITO 70 is used for visualization and identification of atomic structures.
To generate the base-metal polycrystal, we fill a 20 × 20 × 20 nm³ volume with 16 randomly oriented grains using Voronoi tessellations with Atomsk 71. The polycrystal is thermally annealed at 0.3-0.5 of the melting temperature under a Nosé-Hoover thermostat/barostat for 250 ps using a time step of 1 fs, which relaxes the grain structure and boundaries without permitting exaggerated grain growth; this is followed by slow cooling to 0 K at a cooling rate of 3 K/ps, and a final conjugate gradient energy minimization.
To compute the spectrum of segregation enthalpies in a binary alloy, we follow the procedure in ref. 20. We first relax the base-metal polycrystal using the interatomic potential of that alloy, by applying an external pressure of zero in a conjugate gradient minimization, followed by a second conjugate gradient minimization (with no applied pressure). This is necessary to scale the cell, and correct for minor differences in the equilibrium lattice parameter of the base metal across the different interatomic potentials (for example, the Ni polycrystal is thermally annealed using an interatomic potential 42 that is fitted to a Ni lattice parameter of 3.518 Å, but the Ni(Al) 53 potential is fitted to 3.520 Å). Then, every GB site in the annealed polycrystal is identified using the adaptive common neighbor analysis method 72; all atoms that have a different atomic structure than the base metal are assumed to be GB atoms. For every GB site (i), its ΔE_i^seg is calculated as the relaxed energy difference between the solute atom occupying the GB site, versus a bulk (intra-grain) site: ΔE_i^seg = E_gb,i^solute − E_c^solute; the relaxation of each state is achieved using a conjugate gradient minimization, and the reference bulk site for E_c^solute is chosen as the center of a 6 nm sphere of the pure solvent (in the polycrystal), to avoid any long-range interactions with GB atoms. All calculations are at 0 K, isolating the enthalpic portion of the segregation energy for each site.
Machine Learning. For feature extraction, the LAE of every GB site within a cutoff radius of 6 Å is described using the SOAP method, as implemented in the QUIP/GAP software package 28,30. SOAP fits a set of radial basis functions and spherical harmonics to Gaussian particle density functions placed over all neighboring atoms in the LAE. The maximum number of radial basis functions (n_max), degree of spherical harmonics (l_max), and the width of Gaussian functions (σ_at) control the size and resolution of the SOAP feature vector. We use n_max = l_max = 12 and σ_at = 1 Å for all alloys, which gives a SOAP vector with 1,015 features. As for the other components of the ML framework: linear regression, principal component analysis, and k-means clustering are used as implemented in the Scikit-learn 73 Python package.
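The quoted vector length can be reproduced with a short count. This is a sketch under an assumption not stated in the text: that the single-species SOAP power spectrum has one component per unordered pair of radial basis functions and per angular channel l = 0, ..., l_max, and that the implementation appends one additional constant component to the vector.

```latex
% Assumed component count of a single-species SOAP power spectrum,
% plus one appended constant component (an assumption about the implementation):
F_{\mathrm{SOAP}} = \frac{n_{\max}\,(n_{\max}+1)}{2}\,(l_{\max}+1) + 1
                  = \frac{12 \times 13}{2} \times 13 + 1
                  = 78 \times 13 + 1
                  = 1015
```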

Data availability
The database for segregation spectra of all 250+ binary alloys, in the form of LAMMPS text dump files of solvent polycrystals with predicted GB solute segregation energies, is available at https://doi.org/10.5281/zenodo.4107058. Additional data related to this work are available from the authors upon request.