Decoding defect statistics from diffractograms via machine learning

Diffraction techniques can powerfully and nondestructively probe materials while maintaining high resolution in both space and time. Unfortunately, these characterizations have been limited and sometimes even erroneous due to the difficulty of decoding the desired material information from features of the diffractograms. Currently, these features are identified non-comprehensively via human intuition, so the resulting models can only predict a subset of the available structural information. In the present work we show (i) how to compute machine-identified features that fully summarize a diffractogram and (ii) how to employ machine learning to reliably connect these features to an expanded set of structural statistics. To exemplify this framework, we assessed virtual electron diffractograms generated from atomistic simulations of irradiated copper. When based on machine-identified features rather than human-identified features, our machine-learning model not only predicted one-point statistics (i.e. density) but also a two-point statistic (i.e. spatial distribution) of the defect population. Hence, this work demonstrates that machine-learning models that input machine-identified features significantly advance the state of the art for accurately and robustly decoding diffractograms.


INTRODUCTION
Diffraction techniques can probe large volumes of material while maintaining high resolution in both space and time [1][2][3][4] . Hence, these techniques are widely used to provide structural characterizations across a variety of scientific fields, including biology 5-7 , materials science 1,8,9 , and polymer physics 10,11 . However, the difficulty of decoding diffractograms has greatly limited their utility 12 . Challenges concern both steps of the decoding process: (i) identifying the key features in the diffractogram and then (ii) modeling their relationships to the desired structural characterizations. To identify key features, most studies employ a common set of "human-identified features" (HIFs). Assessments of 2D diffraction patterns, such as from selected area electron diffraction (SAED), have focused on the positions, areas, and shapes of the spots. Likewise, assessments of 1D diffraction line profiles, such as from conventional X-ray diffraction (XRD), have focused on the positions and widths of the peaks. Unfortunately, neither the 1D nor the 2D HIFs was developed systematically, and they cannot comprehensively summarize a diffractogram. Hence, models that input HIFs can only predict a limited set of structural characterizations. As a result, models are numerous, highly empirical, and often conflicting. For example, we recently demonstrated that fitting two popular width models with the same XRD data yielded opposite trends in characteristic size 13 .
To enable complete decoding, the current work systematically demonstrates (i) how to compute machine-identified features (MIFs) that fully summarize a 2D diffraction pattern and (ii) how to employ machine learning to reliably connect these features to an expanded set of structural characterizations. For instructional purposes, we examined virtual SAED patterns produced from atomistic simulations of irradiated copper. This simulation-based approach bypassed experimental complications, such as dynamic diffraction conditions, enabling a 1:1 comparison between diffractogram features and defect statistics. We chose irradiation damage as an exemplar because it provided a "defect laboratory" that offered numerous defect statistics to be related to the diffractogram features. We chose SAED specifically because it exemplified 2D diffraction patterns, which contain more information than 1D diffraction line profiles (as is common in conventional XRD). However, 2D diffraction patterns could also be constructed for XRD 4 , so our findings would apply to that as well.
As summarized in Fig. 1, we first characterized defect statistics of the irradiated microstructures to serve as evaluation metrics (i.e. outputs) for our models. We computed one-point statistics (i.e. point-defect density and dislocation density) as is common in diffraction studies. We also computed a two-point statistic (i.e. the pair-correlation function (PCF) of the point-defect distribution) to demonstrate that diffractogram decoding could be extended. A two-point statistic effectively captures all of the information of the corresponding one-point statistic while adding higher-order information. Our particular two-point statistic captured both the density and the spatial distribution of the point defects. To produce the inputs for our models, we characterized the virtual SAED patterns in two ways. For HIFs, we tracked the area, perimeter, eccentricity, and position for each family of diffracted spots. For MIFs, we simply computed the principal components of the SAED patterns. Finally, we used Gaussian Process Regression to construct a "human model" from the HIFs and a "machine model" from the MIFs. Both machine-learning models captured the one-point statistics, but only the machine model (which used MIFs to characterize the virtual SAED patterns) captured the twopoint statistic. Therefore, our results demonstrate how to decode advanced structural information unavailable to conventional diffractogram decoding.

RESULTS
This study comprised three major parts: (i) simulation of irradiated microstructures to provide defect statistics, (ii) simulation of SAED patterns to provide HIFs and MIFs, and (iii) construction of machine-learning models to decode defect statistics from the HIFs and MIFs. The Methods section provides further technical details on each step.

Choosing defect statistics
To understand the defect statistics generated from our atomistic simulations, we first examined the defect mechanisms as a function of the number of irradiation events, which we iteratively introduced over time. We observed two primary stages in the evolution of point defects (e.g., Fig. 2a, b) and dislocations (e.g., Fig. 2c, d). In the first stage (S1), small clusters of point defects and small dislocation loops rapidly accumulated in a mostly isolated fashion. In the second stage (S2), the microstructures reached a saturation point for both types of defects. Hence, isolated clusters and small dislocation loops tended to coalesce into more complex defect structures. In both stages, the predominant types of dislocations were Shockley dislocations (i.e. 1/6〈112〉) and stairrod dislocations (i.e. 1/6〈100〉 and 1/3〈100〉). The former tended to comprise complex networks of dislocation loops while the latter mainly were present within stacking-fault tetrahedra. Figure 3a, b quantify our one-point statistics: point-defect density and dislocation density, respectively. In the accumulation stage (S1), both densities sharply increased with the number of irradiation events. Due to the sparse distribution of defects, interactions between neighboring defect clusters and between neighboring dislocation loops were infrequent. However, as the system saturated with defects in the saturation stage (S2), the likelihood of defect interaction increased significantly, and defects tended to coalesce or annihilate rather than nucleate. Note that the point-defect density and the dislocation density followed a nearly identical trend, which suggests that these types of defects were physically coupled, as previously observed 43 . Further, Fig. 3b shows that the major constituent dislocations (i.e. Shockley and stair-rod) followed a trend nearly identical to that of the total dislocations. Figure 3c plots our two-point statistic of the irradiation damage: the PCF of the point-defect distribution. Because of the radial averaging, each PCF curve represents the correlation between point defects as a function of their separation distance. For explanation purposes, imagine a random reference point defect. The PCF then reveals the probability of finding another defect at a specified distance away from that reference defect. As the number of irradiation events increased, so did the point-defect density and correspondingly the PCF between 0 (blue) and 38k (red) events. Now consider the general trend in each PCF curve in Fig. 3c: a rapid decrease within region R1, a transition within region R2, and finally a flatline throughout region R3. In R1, the PCF sampled the separation distances of defects within close proximity (i.e. within small defect clusters). Because the curves were nearly identical in this region, most defect clusters must have contained similar core structures. In R3, the PCF sampled defects that were separated by large distances, usually between separate clusters or within very large clusters. Therefore, the PCF successfully captured the characteristic spacing both within defect clusters (largely in R1) and among defect clusters (largely in R3). Overall, the evolution of the one-point statistics ( Fig. 3a, b) and the two-point statistic (Fig. 3c) demonstrated that the accumulation of defects was not a simple matter of density but rather a complex evolution resulting from two competing processes.
To facilitate model construction, we performed principal component analysis (PCA) on the PCF to reduce its dimensionality. Figure 3d shows that the first two principal components (i.e. PCF 1 and PCF 2 ) accounted for 93.5% and 6.3% of the variation in the PCF, respectively. Because these components effectively summarized the point-defect distribution, we selected them as our final evaluation metrics. By definition, these principal components are orthogonal, so each one captures a unique aspect of the PCF. Further, these components resulted from a linear transformation, so we evaluated the linear correlation of each component to the overall PCF to evaluate the utility of each component (Fig. 3e). Because PCF 1 and PCF 2 had significant Pearson linear-correlation coefficients in R1 and R2, both components must be modeled to capture the short-range order within clusters. In contrast, only PCF 1 was significant in R3, so only PCF 1 would be needed for modeling the long-range order among isolated clusters and within large clusters. Figure 3f summarizes the four defect statistics that ultimately served as outputs for our machine-learning models: point-defect density, dislocation density, PCF 1 , and PCF 2 . To compare these statistics, we transformed them into Z-scores by shifting each data point by the corresponding mean and normalizing by the corresponding standard deviation. Hence, each Z-score represents the number of standard deviations that each point deviated from the mean. Interestingly, the total defect density, total dislocation density, and PCF 1 exhibited strong linear correlations (with linearcorrelation coefficients above 0.97). These results confirm that the point defects were physically coupled with the dislocations. As for PCF 1 , recall that this principal component had a significant correlation coefficient with the PCF for a broad range of correlation lengths (Fig. 3e). If PCF 1 can well capture the defect distribution, then it should certainly well capture the defect density. In contrast, PCF 2 correlated with a much smaller range of correlation length and exhibited a different trend in Fig. 3f.
Identifying key SAED features Now consider the SAED patterns simulated from the irradiated microstructures. Figure 4a-d presents the Z-scores for the HIFs of the SAED patterns: area, perimeter, eccentricity, and position for the first two families of diffracted spots. The spot areas (Fig. 4a) followed a nearly identical trend as the point-defect (Fig. 3a) and dislocation densities (Fig. 3b) because defects generally smear a diffraction pattern by interrupting the interferences that would otherwise yield small spots. The spot perimeters (Fig. 4b) followed a similar trend as the spot areas because the spots almost uniformly broadened, as evidenced by the lightly oscillating values of the eccentricity (Fig. 4c). The spot positions (Fig. 4d) also jumped in S1 (i.e. toward the center of the SAED pattern) but then oscillated much more than the area or the perimeter in S2. The position of a spot indicates the length of the diffraction vector in reciprocal space and therefore the corresponding interplanar separation in real space. As defects accumulated and wedged themselves among the crystallographic planes, the planes expanded, so the spots moved inward. However, as the system saturated, defect annihilation and defect coalescence sometimes reduced the strain to allow the planes to approach their original positions. But then new defects would accumulate and induce strain again, hence the oscillation of the position curves.
Given the similarities in the trends of the areas and perimeters, we produced independent inputs for the human model by performing PCA on the set of 8 HIFs. If we had not ensured independent inputs, the human model would have been susceptible to overfitting error. As shown by Fig. 4e, the first principal component (HIF 1 ) accounted for 98.6% of the variability in the HIFs, and the second (HIF 2 ) captured the remaining 1.4%. Predictably, the trend in the dominant HIF 1 resembled that of the point defects, dislocations, and PCF 1 . In contrast, HIF 2 exhibited a fairly unique trend, which could be helpful in modeling the elusive PCF 2 . Regardless, we replaced the 8 individual HIFs with HIF 1 and HIF 2 as inputs for our human model because these principal components captured the same structural information but in an independent fashion.
To contrast the HIFs and produce the inputs for the machine model, we directly computed the 76 principal components of the 76 pre-processed SAED images to serve as MIFs. Interestingly, the distribution of these principal components was far more spread out than in the HIF analyses. For example, the first two principal components accounted for 99.8% of the PCF (Fig. 3d) and 99.95% of the HIFs (Fig. 4e) but only 46% of the SAED patterns (Fig. 4g). In fact, the first 15 components accounted for only ≈70% of the variability of the SAED patterns. Fortunately, we only needed to capture the SAED variability resulting from the salient defects. Further, because MIF 1 (Fig. 4h) alone highly correlated with the point-defect density, total dislocation density, and PCF 1 , there were many more MIFs available to capture the elusive PCF 2 . As explained in the next subsection, we used correlation coefficients to downselect which MIFs to include in the machine model because selecting too many MIFs would risk error via overfitting.

Modeling defect statistics from SAED features
For the human model, we had two obvious inputs: HIF 1 and HIF 2 . These two principal components effectively summarized the evolution in the area, perimeter, eccentricity, and position for the first two families of SAED spots. As shown by Table 1, HIF 1 had a strong, linear correlation with the point-defect density, dislocation density, and PCF 1 . Hence, the human model would certainly capture these defect statistics. In contrast, neither HIF 1 nor HIF 2 had a strong linear correlation with PCF 2 , which related to the close-range order of the point-defect distribution. Surprisingly, several of the individual HIFs fared better. For example, the eccentricity for {200} had a correlation coefficient of 0.28 for PCF 2 in comparison to the 0.02 and 0.06 of HIF 1 and HIF 2 , respectively. Recall that the principal components of the HIFs captured 99% of the variation of the individual HIFs, including the eccentricity for {200}. Therefore, we deduced that the relationship between the HIFs and the PCF 2 was likely non-linear if present. In this case, a non-linear human model (e.g., based on Gaussian Process Regression) could potentially still predict PCF 2 .
In contrast to the two obvious selections for inputs in the human model, we had numerous significant MIFs available for the machine model (recall Fig. 4g). To increase the likelihood of convergence and decrease the likelihood of overfitting error, we downselected to MIFs that had high linear correlation coefficients with the desired defect statistics. This feature-selection technique can obscure non-linear relationships but is a common first Fig. 3 Defect statistics to be used as outputs in the machine-learning models. Raw defect statistics for a point-defect density, b dislocation density, and c pair-correlation function (PCF) of the point-defect distribution. d PCF variation captured by its principal components (PCF i ). e Linear-correlation coefficient for PCF 1 and PCF 2 with the overall PCF. f Total point defects, total dislocation density, PCF 1 , and PCF 2 .
approach within the machine-learning community. For the current work, we simply desired a proof of concept, around which a fully optimized model could later be conceived. As with HIF 1 , MIF 1 had a strong linear correlation with point-defect density, total dislocation density, and PCF 1 . Hence, we chose to model these defect statistics by using only the first three MIFs. Overfitting was unlikely with just three inputs, and the additional 2 MIFs might have been useful in a non-linear fashion. As for PCF 2 , we noted that several higher MIFs had significant linear correlation coefficients, so we used the first 12 MIFs to model this defect statistic. Figure 5 provides parity plots that reveal how effectively both machine-learning models decoded the four desired defect statistics from the chosen SAED features. Predictably, both the human model and the machine model captured the following defect statistics that exhibited strong linear correlations with SAED features: point-defect density, dislocation density, and PCF 1 (Fig. 5a-f). However, only the machine model captured PCF 2 , which must have been non-linearly correlated with MIFs (Fig. 5g, h).
This difference in performance for predicting the two-point statistic (particularly at short-range order) indicated that SAED patterns contain non-trivial fingerprints beyond human recognition.

DISCUSSION
By connecting features of virtual SAED patterns to a variety of defect statistics from irradiated microstructures, this work demonstrates the power of machine learning in decoding 2D diffractograms. Specifically, we constructed a "human model" that inputs HIFs and a "machine model" that inputs MIFs. Both models were based on the same machine-learning algorithm, so performance differences were solely due to the inputs. The human model certainly had the advantage in the interpretability of its inputs and captured the one-point statistics (i.e. point-defect density and total dislocation density). However, the HIFs incompletely assessed the SAED patterns, so the human model failed to capture the two-point statistic (i.e. PCF of the point-defect Fig. 4 SAED features to be used as inputs to the machine-learning models. Human-identified features (HIFs), comprise a area, b perimeter, c eccentricity, and d position for the first two families of SAED spots. e Variation of the HIFs captured by the principal components of the HIFs (i.e. HIF i ). f The first two principal components of the HIFs (i.e. HIF 1 and HIF 2 ). g Variation of the SAED patterns captured by their principal components, which were also called the "machine-identified features" (MIFs). h The first four MIFs. When plotted against irradiation events, the principal components were negated in order to enhance the inspection of trends. distribution). In contrast, by incorporating features beyond human recognition, the machine model captured all of the evaluated defect statistics. The MIFs of the machine model resulted from the comprehensive, direct analysis of the full SAED patterns, so avoiding feature-selection bias was much easier than with the HIFs. Further, our MIFs were independent by definition, so it was easier to avoid overfitting errors with MIFs than with the oftencorrelated HIFs.
Even as proofs of concept, both of our models more accurately and more comprehensively decoded diffractograms than traditional approaches would. However, our framework could be readily enhanced even further in both steps of the decoding process. Regarding the identification of diffractogram features, we could have expanded the types of inputs. For example, we could have incorporated SAED patterns with other zone axes and/or added space-group information 44 . Instead of identifying MIFs via PCA, we could have used texture/shape statistics 45,46 , training autoencoders, generative neural networks, or transfer learning (from pre-trained convolutional neural networks). Regarding the model building, we could have optimized feature selection and/or evaluated other machine-learning algorithms. For example, we could have considered other combinations of principal components for our machine model. As for the machine-learning algorithm itself, we successfully used Gaussian Process Regression to capture both linear relationships and non-linear relationships within a relatively small dataset. Alternatively, we could have evaluated the effectiveness of a variety of other algorithms, including support vector machines, random forests, and k-nearest neighbors.
Going forward, we note that our framework could be extended to other structural statistics, other loading conditions, other materials, and other diffraction techniques. In the current work, we assessed point defects and dislocations in irradiated copper via electron diffraction, but the same methods could also be applied to, for example, disclinations of a fatigued ceramic probed with hard X-rays. Further, the machine-learning models did not have to be built exclusively on simulation-based data. We chose to illustrate our concept via simulation in order to establish comprehensive defect statistics to serve as ground truths for model evaluation. However, experimental data could have been used to explore nuanced effects, such as dynamic diffraction and complex loading conditions within larger systems.

METHODS Accelerated irradiation simulations
We used the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) 47 (which is an open-source, molecular-dynamics code) with the reduced-order atomistic cascade method (ROAC) 48 to simulate radiation damage in copper. Instead of expensively simulating a series of consecutive cascades initiated by primary-knock-on atoms (PKA), the ROAC method simultaneously simulates multiple cascades at once by approximating collision cascades as randomly-located, core-shell regions, provided that the core regions do not overlap. The shells capture the long-range, athermal, recombination-corrected displacements per atom, and the cores capture the short-range, thermal-spike-induced replacements per atom 49 . Depending on the number of atoms and the energy of the recoil spectrum, the ROAC method can model irradiation events 10,000 times faster than consecutive PKA simulations 48 . Hence, we were able to simulate a large number of irradiation events to produce complex defect structures. Specifically, we introduced a total of 38,000 irradiation events of 1 MeV copper ions (500 recoil events were inserted every 20 ps, resulting in 76 unique defect structures) into a fully periodic, 4-million-atom copper sample at room temperature (300 K). We modeled the interatomic interactions with an embedded-atom-method potential that incorporated both nuclear stopping 50 and electronic stopping 51,52 . We calibrated the core-shell structures for a broad range of recoil energies by performing consecutive PKA simulations in bulk copper 48 and then incorporating these results into an extension of the Norgett-Robinson-Torrens DPA model 49 .

Defect characterizations
For each irradiated structure, we identified the point defects via polyhedral template matching 53 . Using this technique, we compared the local neighborhood of each atom with that of a pristine face-centered-cubic crystal. Atoms that deviated by at least 0.1 Å were considered defects. We also identified the dislocations (i.e. line defects) via the dislocation analysis algorithm in OVITO 54 . We considered Shockley partials (i.e. 1/6〈112〉), stairrods (i.e. 1/6〈100〉 and 1/3〈100〉), and others. To characterize the morphology of the irradiation damage, we computed the PCF 55,56 from a voxelized representation of the point-defect distribution. This approach had a similar effect as computing a radial distribution function on the discrete data and then performing a smoothing 57 . For the voxelization, each atom not in its lattice position was assigned a spherical volume with a 3-Å radius (which was slightly larger than the first nearest-neighbor distance), and then overlapping volumes were combined. The autocorrelation function was computed on this continuous domain and radially averaged to produce the PCF. Because of the high dimensionality of this two-point statistic, we performed PCA on the PCF via sci-kit learn 58 .

Electron diffraction simulations
For each irradiated structure, we produced a virtual SAED pattern by employing the LAMMPS user-diffraction package 59,60 . This package computes kinematic electron diffractograms directly from atomistic simulations without prior knowledge of the underlying crystal structure. This package had already been successfully used to detail displacementcascade formation in a prior study 51 . We simulated SAED patterns along the 100 ½ zone axis with an incident electron wavelength of 0.0251 Å,

Diffractogram characterizations
To prepare each SAED pattern for characterization, we performed a series of preprocessing steps. First, we applied both a base-10 logarithmic transformation and a square crop to emphasize the intensities of the spots from the two most dominant families of planes (i.e. {200} and {220}). Then we smoothed the intensity map with a Gaussian blur to ease feature identification. Finally, we removed the direct central spot (which cannot be measured experimentally) and lessened the impact of relrods (which are more prominent in simulation than in experiment) by applying a fixed mask. We constructed this mask by fitting circular regions to the centroids of the diffracted spots in the SAED pattern of the non-irradiated copper. After preprocessing, we characterized the diffractograms. For the HIFs, we identified the diffraction spots (by constructing a binary map via thresholding at 5% of the maximum intensity) to compute the area, perimeter, position (via its centroid), and eccentricity (via sci-kit image 61 ) for each spot. Averaging these values across each of the two families of spots yielded a total of eight HIFs per SAED pattern. Incorporating correlated inputs into a model would have complicated its convergence and risked its accuracy, so we performed PCA on the full set of HIFs. The resulting principal components would ultimately serve as the inputs to the human model. For the MIFs, we simply performed PCA directly on the preprocessed SAED patterns. As compared to the human-based approach, this machine-based approach was more straightforward and therefore more robust. For example, there is no need to determine which features to characterize, which threshold to use, or how to ensure feature independence. However, we did need to downselect MIFs to avoid overfitting. As detailed in the Results section, we used the Pearson linear correlation coefficient to inform our feature selection.

Model construction
We used Gaussian Process Regression as implemented in sci-kit learn 58,62 to link the SAED features to the desired defect statistics. This nonparametric, Bayesian approach was chosen for its ability to capture non-linear relationships in relatively low amounts of data. The dataset comprised Fig. 5 Parity plots for the human model (left column) and machine model (right column). Evaluated metrics are the Z-scores for the following: total point-defect density (a and b), total dislocation density (c and d), and the first two principal components of the pair-correlation function (PCF) of the point-defect distribution (e-h). The black line is the y = x reference. The Pearson correlation coefficient (CORR), rootmean-squared error (ERROR), and standard deviation (STDEV) as compared to that reference line are provided.
SAED features (HIFs and MIFs) and their corresponding defect-statistic labels. To construct both the human model and the machine model and to compute their error distributions, we used a radial basis function kernel, which encodes for smoothness of functions (such that the similarity of inputs corresponds to similarity of outputs). This kernel has two hyperparameters: a signal variance and a density scale. We tuned these hyperparameters by maximizing the log marginal likelihood of the training data while using a gradient-based optimizer for efficiency. Because the log marginal likelihood was not necessarily convex, multiple restarts of the optimizer with different initializations were used. We used k-fold cross validation across 50 data splits (80% training and 20% testing).

DATA AVAILABILITY
The data that support the findings of this study are available from the corresponding author upon reasonable request.