Reinforcing materials modelling by encoding the structures of defects in crystalline solids into distortion scores

This work revises the concept of defects in crystalline solids and proposes a universal strategy for their characterization at the atomic scale using outlier detection based on statistical distances. The proposed strategy provides a generic measure that describes the distortion score of local atomic environments. This score facilitates automatic defect localization and enables a stratified description of defects, which allows to distinguish the zones with different levels of distortion within the structure. This work proposes applications for advanced materials modelling ranging from the surrogate concept for the energy per atom to the relevant information selection for evaluation of energy barriers from the mean force. Moreover, this concept can serve for design of robust interatomic machine learning potentials and high-throughput analysis of their databases. The proposed definition of defects opens up many perspectives for materials design and characterization, promoting thereby the development of novel techniques in materials science.

represents an individual atomic environment. The structural data is taken from the GAP potential database [9] and fully corresponds to that from Figure 2 in the main text of the manuscript.
Previous study by Sharp et al. [31] used d SV M to estimate the so-called softness of atoms, a continuous, signed, scalar measure, which was used to describe the likelihood of atoms to rearrange under thermal fluctuations within the grain boundaries in fcc Al. Similarly to our study, Ref. [31] reported a positive correlation of SVM distances with local atomic energies with the large spread d SV M for a given energy value. Thus, while using completely different formalisms to compute local atomic energies (ML GAP potential [9] vs conventional EAM potential [22]), different atomic descriptors (bispectrum SO4 [3] vs G 2 and G 3 functions [4]) and different training databases to describe defects in different materials (bcc Fe vs fcc Al), our calculations are consistent with those by Sharp et al. [31] and indicate that d SV M is not directly attributable to the local energy.
Using SVM with a non-linear kernel could help improving the correlation, however, it will lead to a drastic increase of numerical cost of the analysis (learning process on the database with M LAEs requires M 3 numerical operations) and will imply a careful choice of the kernel form and parameterization. Moreover, the chosen kernel can not remain universal for different structural types and the parameterization should be adjusted by the user for the given application. This characteristics of the kernel-based methods do not allow to consider them as a universal method of structural analysis, suitable for the Applications 1 and 2 described in the main text of the paper.

Stratification of defect structures
Here we examine the correlation between the SVM distances with those from robust MCD (Fig.   2) for the self-interstitial (SIA) clusters in bcc Fe. For the simple I two techniques perform similarly. The pair of atoms that form a dumbbell (yellow points) are characterised by large SVM and MCD distortion scores and can be unambiguously distinguished from the rest of atomic array. For more complex structures of 3D interstitial clusters (Fig. 2b, c), the two techniques exhibit a different performance for the outliers with large distances. For the atoms that contribute to the defect structure itself (green and yellow points in Figure 2b, c), SVM distances saturate. Thus, for the Gao-triangle I N P 2 (Fig. 2b), SVM provides only one possibility to localise the interstitial atoms, while MCD enables a clear distinction between the two families of atoms (structure inserts A and B) that contribute to the defect structure. In the case of the C15 cluster, it is not possible to isolate the I C15  Under irradiation by high energy ions or neutrons, crystalline solids can be supersaturated with defects, i.e., the defects in irradiated materials are present at concentrations well above the thermodynamic equilibrium. The effects of radiation on the material's properties and ageing depend on a combination of multiple factors, such as the type of induced radiation, the nature of irradiated material and the energy of the radiation particles. Once irradiated, the primary knock-on atom (PKA) induces a chain of collisions, leading to a displacement cascade. During this collision phase, atomic displacements mainly occur in the form of point defects, e.g., self-interstitial atoms (SIAs), vacancies, Frenkel pairs. These defects may eventually recombine and annihilate or regroup and form larger defects like 2D and 3D clusters, dislocation loops, stacking faults, etc. The morphology of primary defects, i.e., those which were formed as a direct consequence of the cascade, defines their mobility and stability with respect to each other. These factors play an an important role in subsequent micro-structural evolution in the material.
Molecular dynamics calculations of displacement cascades provide the essential physical basis for understanding of atomic-scale processes that occur during primary displacement events [7,26,36].
However, interpretation of such calculations often implies a challenge [5,23] due to the complex structural processes that produce a rich variety of defects, as well as due to the large size of the simulation cells, which can reach up to few tens of millions of atoms when computing the damage produced by high energy PKAs (e.g., 200 keV and more).
Here we demonstrate the performance of MCD for analysis of displacement cascades in fcc Cu and, in particular, for detection of targeted defect types in the damaged structure using a multidimensional distortion score.

Construction of the training data set with multimodal distribution of LAEs
The distortion score provided by robust MCD is adapted for training data with unimodal distribution. However, the complexity of structural data selected as a reference for training not always can be represented by one Gaussian. Below we demonstrate how this limitation for construction of training data sets suitable for MCD analysis can be overcome using a modal decomposition of the database. As a possible solution, the multimodal training database D can be decomposed in several unimodal data sets D 1 ⊕ D 2 ⊕ . . . ⊕ D n . After such a decomposition, a robust distance can be computed with respect to each unimodal dataset D i , forming thus a multidimensional distortion score. The multidimensional distortion score is particularly suitable for advanced structural characterization of any and all defects that are simultaneously present in a material. When using a multidimensional score, several measures can be associated to each atomic environment in such a way that each component of the distortion score is related to a specific reference structure with unimodal distribution of LAEs in the descriptor space. Here we aim to construct a training data set for detection of various interstitial-type defects in fcc Cu.

Choosing relevant structures for the training data set
Fcc metals are generally known to have small interstitial defects, like 100 dumbbells, alone or packed into 3D structures, like those described by Ingle et al. [15]. The larger defects are 2D and represented by faulted (Frank) 1 3 111 {111} or unfaulted (prismatic, perfect) 1 2 110 {111} dislocation loops. The fault inside the 1 3 111 {111} loop has the hcp structure. Consequently, the interstitial-type defects of interest can be described here using a multidimensional distortion score where each dimension is spanned by the robust MCD distance associated to the following structural classes: fcc, hcp, 100 dumbbells and 3D SIA clusters.

Training structures and their representation in the descriptor space
The training structures are created using a f cc

Analysis of the multimodal distribution of the training data
Once the relevant structural classes are defined and the training data is represented in the descriptor space, we apply a Gaussian mixture model (GMM) [16,19] to the training data represented by the Supplementary Figure 3: The multimodal Bayesian Gaussian analysis [16,19] of the fcc, hcp, 100 and 3D SIA training atomic environments. For better visualization, the 40-dimensional descriptor space of training data is represented in 2D using linear PCA dimensionality reduction. The subplots (a-d) illustrate the GMM analysis using from n g = 2 up to n g = 5 Gaussians (n g denotes the number of Gaussians).
Each point on the plot represent an atomic environment. The atomic environments belonging to the same Gaussian are depicted with the same colour. The subplot (e) shows the training configurations using the same PCA representation as for the GMM analysis (a-d), whilst the colours correspond to the structural type: the 100 dumbbells, 3D clusters, hcp and fcc are shown in green, blue, pink and yellow, respectively.
four selected structural classes. The mixture model based on Gaussians is particularly convenient for modelling continuous observations in the descriptors space, such as our training data. This data will be denoted as x ∈ R D . Consequently, the data can be written as mixture of n g Gaussians of dimension D of mean m i and covariance matrix S i : where p i is the mixture weight for component i. The optimal set parameters m i , S i , p i can be obtained using standard maximum likelihood optimization. This procedure is applied simultaneously for the four training classes that were identified as necessary for the structural analysis of the displacements cascade.
The GMM analysis using from n g = 2 to n g = 5 Gaussians is presented in the Fig. 3a-d. The 40-dimensional training data is visualised in 2D after a linear PCA decomposition. The GMM analysis indicates that n g = 4 ( Fig. 3c) accurately explains the distribution of the training data. In this case, the number of Gaussians (Fig. 3c) is equal to the number of physical classes of the reference structures (Fig. 3e). This conjecture is supported by the fact that all atomic environments of the reference structures were generated using a Gaussian distribution of atomic displacements.
The GMM analysis with 5 Gaussians (Fig. 3d)  Analysis of the displacement cascade calculations using a multidimensional distortion score The analysis is performed for the fcc Cu simulation box with 5,324,000 atoms, taken from the dataset R093 [23] of the open-source database cascadesdb.org by the International Atomic Energy Agency (IAEA). Similarly to the procedure described in the main text of the article (see maint text, Fig. 1), the structural damage can be isolated from the defect-free bulk structure using the distortion score component related to the fcc Cu structure (Fig. 4a, c, e). In order to identify and localise the defects of interest in the cloud of structural damage, the other three components of the constructed 4D distortion score can be applied, namely, those related to hcp, 100 dumbbells and 3D SIA clusters. In this case, the corresponding component of the distortion score distortion score SIA dumbbells that are concealed in Figure 4a. More interestingly, this method can be applied for detection of rare defects or such defect types that are challenging to identify using standard geometry based methods. Figure 4e illustrates the detected 3D SIA clusters, which were detected using the MCD model trained on LAEs of rare 3D clusters described by Ingle et al. [15].
Thus, using a multidimensional distortion score, it is possible to perform a detailed analysis of the systems with complex damage. Here, we have unambiguously localised the targeted defects, including the rare 3D SIA structures (Fig. 4e), which are difficult to identify with traditional techniques.   Fig. 5c). In order to select the relevant bulk atoms, we consider possibilities for bulk stratification along the NEB path ( Fig. 5b, d, f). Here, we mainly focus on the snapshots with the reaction coordinate ζ = 0, ζ = 1 (Fig. 5b), ζ = 0.26, ζ = 0.74 (Fig. 5d) and ζ = 0.5 (Fig. 5f). These configurations correspond to the minimum-and maximum-energy points along the migration path (main text, Fig. 5c). The stratification levels B with d RB = 1.45 and C with d RB = 1.0 within the bulk structure (shown with dashed black lines in Fig. 5b,d,f) are selected in such a way that they separate distinct distortion score layers in all the selected snapshots. The analyzed simulation cells contain dislocations that are only distant by 17.45Å, which imposes strong elastic interaction between the cores. In such small simulation cells, consideration of at least half of the bulk atoms is indispensable for reliable reconstruction of the migration barrier. In this particular case, integration of the forces over the region with d RB > 1.0 is sufficient for reconstruction of the Peierls barrier with 95% accuracy.
For larger simulation cell with less important interaction between the cores, the proportion of the relevant bulk atoms will be drastically reduced. The example of the Peierls barrier reconstruction for bigger cells is provided in the next section. It is worth emphasizing that an appropriate stratification of the bulk (Fig. 5b,d,f) is ensured by the training database of the MCD model. Bulk atoms represent inliers and their accurate description can be better obtained when using the training bulk structure from the relevant MD calculations. The training bulk structures with the noise generated be random displacements within the Gaussian distribution may not necessarily provide the results with sufficient accuracy to describe the subtle details of the displacement field produced by dislocations in the bulk structure.
It is also worth mentioning that the choice of atomic descriptors and their dimensionality has an impact on the computed distortion scores. Throughout this study, we employ bispectrum SO(4) [3].
This type of descriptors ensures a good radial and angular description of the structure, therefore they are commonly used for design of interatomic ML potentials [12,33,35]. We find that the obtained with different j max is different, but the overall pattern for the structure stratification remains similar (Fig. 6) and the subsets of atoms classified as inliers and outliers are the same.

Reconstruction of the Peierls barrier from mean force calculations in a big simulation cell
As in the case of a small dislocation dipole (main text, Fig. 5 c,d), the local definition of the dislocation core (Fig. 7b) is not sufficient to accurately reconstruct the Peierls barrier even if the dislocation cores are distant by more than 80Å. Figure 8 illustrates the stratification of a big simulation cell with 20,250 atoms. From these stratification we define the three distortion scores: d RB = 3.23 identified by the MCD model, d RB = 0.62 and d RB = 0.55 (Fig. 8b,d,f). The resulting structures and the Peierls barriers are provided in Figure 7. When taking into account only atoms-outliers (d RB > 3.23), the migration barrier is underestimated by more than 20 %. Including the relevant bulk atoms with distortion scores greater than 0.62 (Fig. 7c) and 0.55 (Fig. 7d) and thereby taking into account the elastic interaction of the cores significantly improves the results. IV.

MODELLING RADIATION INDUCED DEFECTS IN BCC IRON
Radiation induced atomic displacements mainly occur in the form of point defects, e.g., selfinterstitial atoms (SIAs), vacancies, Frenkel pairs. These defects may eventually recombine and annihilate or regroup and form larger defects like 2D and 3D clusters, dislocation loops, stacking faults, etc. The defects adopt different morphologies, which further define their size-dependent mobility and stability with respect to each other. In bcc Fe, low energy 111 and 110 loops are characterised by a high mobility (with migration energy barriers in the order of meV and 0.3 eV, respectively), while the higher energy 100 loops are immobile (migration energy barriers in the order of few eV). The C15 SIA clusters have a 3D crystallographic structure, which makes them intrinsically immobile. These complex and highly symmetric defects act as sinks for other small defects. By absorbing neighbouring SIAs, C15 clusters grow larger and, after accumulating more than 50 point defects, they may eventually dissociate into 111 and 100 loops [2]. Overall, the energy landscape of defects in bcc Fe is extremely complex and its accurate description at the atomic scale requires using appropriate force-field models that provide a correct description of atomic systems beyond the equilibrium conditions. Below we report the results of the formation and migration energy calculations using GAP potential [9] and compare them with those from the DFT calculations and from the two EAM potentials commonly used for modelling defects in bcc Fe: AM04 [1] and MA07 [18,21].

Self-interstitial atoms and their clusters
In contrast to the equilibrium conditions where vacancies represent the dominant part of selfdefects, self-interstitial atoms (SIAs) occur in the same amount as vacancies under irradiation.
Mobility and relative stability of SIAs in bcc Fe strongly depend on their geometry.
For single SIAs, resistivity recovery experiments [10,32], DFT calculations [11,34], EAM [18,34] and the tested GAP potential [9] consistently indicate that the dumbbells with 110 orientation are the most stable. These dumbbells can migrate from their initial position to the next site via several different jump mechanisms [11]. Figure 9 illustrates the most important of them. The computed migration energy barriers for each of these mechanisms are summarised in Table I. For all the mechanisms (Fig. 9), the GAP potential performs better than any EAM potential and provides the energy barriers very close (within 10% difference) to the DFT values. Although the  Table I exact trajectories of the jump mechanisms were not explicitly included into the training database of the GAP potential [9], it can accurately compute the migration barriers for all the five mechanisms.
Such a good performance of the potential likely results from the rich variety of single SIA atomic environments in the training data set [9]. for five different jump mechanisms in bcc Fe using GAP [9] and EAM potentials AM04 [1], MA07 [18,21].

Supplementary
The results are compared with the DFT values from Ref. [11]. The listed mechanisms correspond to those illustrated in Fig. 9 Mechanism Energy barrier ∆E m , eV With increasing size, SIA clusters change their relative stability. For 2D loops with more than 4 SIAs, the 111 clusters become more stable than the 110 family [34]. Further we test the performance of GAP to correctly reproduce size-dependent stability of SIA clusters, including 2D loops (Fig. 10a) and more complex 3D C15 clusters (Fig. 10b).
The qualitative predictions of the relative loop stability and the crossover between the 111 and 110 families (Fig. 10a) is slightly overestimated with respect to the DFT calculations [34].
Overall, for the small 2D loops, GAP leads to the results very similar to those from the MA07 potential and predicts the transition between the 111 and 110 loops after accumulation of 5 Supplementary Table II: Migration energies ∆E m of the most stable small SIAs (I 1−3 ) and vacancy (V 1−3 ) clusters (from 1 to 3 defects) with the lowest migration energy computed using GAP [9] and EAM potentials AM04 [1], MA07 [18,21]. The results are compared with the DFT values from Ref. [10].
SIA ∆E m , eV Vacancy ∆E m , eV  SIAs, which is slightly bigger than the cluster sizes obtained by DFT and AM04 potential. The provided by the GAP potential for the SIA loops that are built by more than 3 dumbbells has more than 50% error with respect to the DFT calculations [34].
Thus, for the size-dependent formation energies of 2D SIA clusters, the GAP potential exhibits a lower accuracy compared to its performance for single SIAs.
The relative stability of the C15 clusters with respect to the 2D loops is illustrated in (Fig.   10b). As was previously shown in Refs. [2,6,21], AM04 potential poorly reproduces the energy landscape of C15 compared to DFT calculations. This results from the fact that the small I C15 energy of C15 and 111 loops is very small. The second EAM potential, MA07, performs much closer to the DFT data [20]. In contrast to the 2D loops where the GAP provides qualitatively good results, the tested ML potential exhibits a limited transferability for the C15 clusters and provides results close to those from AM04 in Fig. 10b. For the small size I C15 2,3 clusters, the GAP potential provides the formation energies that are much higher than those of dumbbell configurations. The predicted formation energy of I C15 2 is c.a. 2.5 eV higher than that of 110 dumbbells, yielding an impossible formation of C15 in Fe. Moreover, for larger clusters, the GAP suggests that the C15 clusters are less stable compared to the 111 loops. These results are not consistent with the DFT calculations [2,20].
It is interesting to note that both MA07 and GAP were not fitted to compute C15 (i.e., the C15 clusters were not explicitly included in the potential databases). However, the fitting databases of both potentials contained the so-called nonparallel (NP) di-interstitial configuration I N P 2 . The I N P 2 configuration represents an elementary building block required to create a C15 cluster [2,21].
Thus, in the case of MA07 EAM potential (which is based on a relatively simple physical concept), including such a brick was sufficient to reconstruct more complex structural defect, while in the case of GAP (which relies on statistical methods), the outcome is not the same. As a possible solution, one may consider including more complex defect structures into the training databases of ML potentials in order to enrich the variety of atomic environments known by the model.

Vacancies and their clusters
In the original publication of the GAP potential [9], the authors provide the calculations of the formation and binding energies of mono-di-and tri-vacancies. These properties were directly incorporated into the training dataset. Hence, the performance of the potential on these properties is excellent and closely consistent with the reference DFT calculations [9]. In this work we do not intend to reaffirm and discuss these calculations.
The migration energy barriers of mono-vacancies to the first, second, and third nearest neighbour site were also reported by Dragoni et al. [9]. It is worth emphasizing that the migration energy barriers of mono-vacancies computed with GAP are remarkably accurate compared to those provided by existing EAM potentials (Table II). Moreover, the energy barrier of monovacancy migration from GAP calculations exhibits a clear saddle point (Fig. 11a), while it is not the case for the EAM potentials.
We further test GAP to compute migration energy barriers of di-and tri-vacancy clusters. For the small V 2,3 clusters, GAP also provides a single saddle-point curves, however the computed energy barriers are notably lower than the corresponding DFT values (Table II). Thus, for the divacancies V 2 , the computed energy barrier is 25% lower than the DFT migration energy, while for the tri-vacancies V 3 , the error reaches 60% (Table II, Fig. 11b). Such an error in migration barriers will have a strong impact on the predictions of defect kinetics under irradiation and interpretation of processes during resistivity recovery experiments. The stage IV will be strongly impacted by the fast diffusion of the vacancies clusters V n with n > 2 [10,32] affecting the prediction of the size and the density of vacancy clusters at temperatures higher than 300 K.

Methods
Atomistic calculations with semi-empirical and ML force fields are performed in lammps [24] using the AM04 [1], MA07 [18,21] EAM potentials and the GAP potential [9] for Fe. The calcu- Ab initio calculations of the mono-and tri-vacancy migration barriers (Fig. 11) are performed for the simulation cells with 127 and 125 atoms at the constant volume of 4 × 4 × 4 bcc simulation cells with a 0 =2.834Å. The simulations are carried out in VASP [17] using PAW pseudopotential with GGA-PBE functional that accounts for 8 valence electrons 3d 7 4s 1 . The plane-wave cutoff energy is set to 500 eV. The Brillouin zone is sampled with the Monkhorst-Pack scheme using