Robust recognition and exploratory analysis of crystal structures via Bayesian deep learning

Due to their ability to recognize complex patterns, neural networks can drive a paradigm shift in the analysis of materials science data. Here, we introduce ARISE, a crystal-structure identification method based on Bayesian deep learning. As a major step forward, ARISE is robust to structural noise and can treat more than 100 crystal structures, a number that can be extended on demand. While being trained on ideal structures only, ARISE correctly characterizes strongly perturbed single- and polycrystalline systems, from both synthetic and experimental sources. The probabilistic nature of the Bayesian-deep-learning model allows us to obtain principled uncertainty estimates, which are found to correlate with the crystalline order of metallic nanoparticles in electron-tomography experiments. Applying unsupervised learning to the internal neural-network representations reveals grain boundaries and (unapparent) structural regions sharing easily interpretable geometrical properties. This work enables the hitherto hindered analysis of noisy atomic structural data from computations or experiments.


Supplementary Methods
Isotropic scaling. To reduce the dependency on lattice parameters, we isotropically scale each prototype according to its nearest-neighbor distance d_NN. This way, one degree of freedom is eliminated, implying that all cubic systems are equivalent and thus are correctly classified by construction. To compute d_NN, we first calculate the histogram of all nearest-neighbor distances. Since the area of spherical shells grows with the squared radius, we divide the histogram by the squared radial distance. Then, we use the center of the maximally populated bin as the nearest-neighbor distance d_NN. Dividing the atomic positions by d_NN yields the final isotropically scaled structure, which is used for calculating the SOAP descriptor. Alternatively, one may use the mean of the nearest-neighbor distances as d_NN, which, however, is more prone to defects. In case of multiple chemical species, we consider all possible substructures formed by the constituting species to calculate the SOAP descriptor (see next paragraph). For each substructure, we compute d_NN, while we determine the histogram of neighbor distances only from distances between atoms whose chemical species coincide with those of the substructure. For instance, given the substructure (α, β), i.e., the atomic arrangement of atoms with species β as seen from the perspective of atoms with species α, we consider only α-atoms and determine all distances to β-atoms.
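The scaling step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the production implementation; the function name `estimate_dnn` and the number of bins are our own choices:

```python
import numpy as np

def estimate_dnn(positions, n_bins=100):
    """Estimate the nearest-neighbor distance d_NN from the histogram of all
    nearest-neighbor distances, corrected by the squared radial distance
    (illustrative sketch of the procedure described in the text)."""
    # all pairwise distances; exclude self-distances
    dists = np.linalg.norm(positions[:, None, :] - positions[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nn = dists.min(axis=1)                      # nearest-neighbor distance per atom
    counts, edges = np.histogram(nn, bins=n_bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    weighted = counts / centers**2              # correct for growth of shell area
    return centers[np.argmax(weighted)]         # center of most populated bin

# isotropically scaled structure: positions divided by d_NN
def scale_isotropically(positions):
    return positions / estimate_dnn(positions)
```

For a perfect simple-cubic lattice with lattice constant a, this returns d_NN ≈ a, so all cubic lattices map onto effectively the same scaled structure.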
SOAP descriptor. As discussed in the main text, encoding physical requirements we know to be true is crucial for machine-learning applications. For instance, in crystal classification, two atomic structures that differ only by a rotation must have the same classification label. This is not guaranteed if real-space atomic coordinates are used as descriptor (cf. Fig. 1a). As an attempt to fix this, one might include a discrete subset of orientations in the training set, hoping that the model will generalize to unseen rotations. However, there is no theoretical guarantee that the model will learn the rotational symmetry, and if it does not, it will fail to generalize and return different predictions for symmetrically equivalent structures. In contrast, when a rotationally invariant descriptor is employed, only one crystal orientation needs to be included in the training set and the model will generalize to all rotations by construction. This reasoning readily applies to other physical requirements such as translational invariance or permutation invariance (for atoms with the same chemical species).
In the following, we provide details on adapting the standard SOAP descriptor such that its number of components is independent of the number of atoms and chemical species.
Starting with the simple case of one chemical species, we consider a local atomic environment X, defined by a cutoff region (with radius R_C) around a central atom located at the origin of the reference frame. Each atom within this region is represented by a Gaussian function centered at the atomic position r_i and with width σ. The local atomic density function of X can then be written as 1

ρ_X(r) = Σ_{i∈X} exp(−(r − r_i)² / 2σ²) = Σ_{blm} c_{blm} u_b(r) Y_{lm}(r),   (1)

where in the second step an expansion in terms of spherical harmonics Y_{lm}(r) and a set of radial basis functions {u_b(r)} is performed. One can show 1 that the rotationally invariant power spectrum is given by

p_{b1 b2 l}(X) = π sqrt(8/(2l+1)) Σ_m (c_{b1 lm})* c_{b2 lm}.   (2)

These coefficients can be arranged in a normalized (SOAP) vector p(X), describing the local atomic environment X.
In total, we obtain as many SOAP vectors as atoms in the structure, which one can average to obtain a materials descriptor independent of the number of atoms N_at. Another possibility (the standard setting in the software we use) is to average the coefficients c_{blm} first and then compute Eq. 2 from this average 2. The cutoff radius R_C and σ (cf. Eq. 1) are hyperparameters, i.e., supervised learning cannot be used directly to assign values to these parameters, while their specific choice will affect the results. Typically, one would employ cross-validation; here, we take a different route: First, we assess the similarity between SOAP descriptors using the cosine similarity to identify parameter ranges that provide sufficient contrast between the prototypes. Using this empirical approach, we find that values near σ = 0.1 · d_NN and R_C = 4.0 · d_NN yield good results. Then we augment our dataset with SOAP descriptors calculated for different parameter settings. The extension to several chemical species is achieved by considering all possible substructures formed by the constituting atoms: Considering NaCl, we first inspect the lattice of Cl atoms as seen by the Na atoms, which we denote by (Na, Cl); this means that Na atoms are considered as central atoms in the construction of the local atomic environment while only Cl atoms are considered as neighbors. A similar construction is made for the remaining substructures (Na, Na), (Cl, Na), and (Cl, Cl), which may be quite similar depending on the atomic structure. For each substructure, we compute the SOAP vectors via Eq. 2, obtaining a collection of SOAP vectors. Averaging these gives us four (in the case of NaCl) averaged SOAP vectors. Averaging the latter again yields a materials representation that is independent of the number of atoms and chemical species.
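Assuming the per-atom SOAP vectors have already been computed (e.g., with an external SOAP library), the averaging and the cosine-similarity contrast check used for choosing σ and R_C might look as follows; the function names are ours:

```python
import numpy as np

def average_soap(per_atom_soap):
    """Average per-atom SOAP vectors into one normalized descriptor,
    independent of the number of atoms N_at."""
    p = np.mean(per_atom_soap, axis=0)
    return p / np.linalg.norm(p)

def cosine_similarity(p1, p2):
    """Cosine similarity between two SOAP descriptors; values close to 1
    mean low contrast between the corresponding prototypes."""
    return float(np.dot(p1, p2) / (np.linalg.norm(p1) * np.linalg.norm(p2)))
```

In the hyperparameter scan, one would compute `cosine_similarity` for all prototype pairs at a given (σ, R_C) setting and retain settings for which the off-diagonal similarities stay well below 1.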
Formally, given a structure with S species α_1, ..., α_S, we consider all substructures formed by pairs of species (α_i, α_j), i, j = 1, ..., S, resulting in S² averaged SOAP vectors <p_{α_i α_j}>_{N_at,α_i}, where the bracket denotes the average over the number of atoms N_at of species α_i. These vectors are averaged over again, yielding the final vectorial descriptor <<p_{α_i α_j}>>_{α_i α_j}.
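The two-stage average can be sketched as below, assuming the per-atom SOAP vectors of each substructure pair are already available; the dictionary input layout is a hypothetical choice for illustration:

```python
import numpy as np

def multispecies_descriptor(soap_by_pair):
    """Combine the S^2 species-pair substructures into one descriptor that is
    independent of the number of atoms and of chemical species.
    `soap_by_pair` maps a pair such as ('Na', 'Cl') to an array of per-atom
    SOAP vectors for that substructure (assumed input layout)."""
    # first average: over atoms, giving one vector <p_ab> per species pair
    pair_averages = [np.mean(vectors, axis=0) for vectors in soap_by_pair.values()]
    # second average: over all S^2 pairs, giving <<p>>
    return np.mean(pair_averages, axis=0)
```

For NaCl this combines the four substructures (Na, Na), (Na, Cl), (Cl, Na), and (Cl, Cl) into a single fixed-length vector.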
Note that this construction of SOAP deviates from the previously reported way of treating multiple chemical species in the following way: usually, for each atom, one constructs the power spectra 3

p^{αβ}_{b1 b2 l}(X) = π sqrt(8/(2l+1)) Σ_m (c^α_{b1 lm})* c^β_{b2 lm},   (3)

where the coefficients originate from a basis-set expansion as in Eq. 1, while the density ρ is constructed separately for each species. For specific α and β, the coefficients of Eq. 3 can be collected into vectors p^{αβ}. In the case α ≠ β, cross-correlations, i.e., products of coefficients from different densities, are used to construct the vectors p^{αβ}; these are missing in our version.

Bayesian deep learning. As discussed in the main text, one can think of Bayesian neural networks as standard neural networks with distributions placed over the model parameters. This results in probabilistic outputs from which principled uncertainty estimates can be obtained. The major drawback is that training and obtaining predictions from traditional Bayesian neural networks is generally difficult because it requires solving computationally costly high-dimensional integrals. For classification, expensive calculations are required to determine p(y = c|x, D_train), which is the probability that the classification is assigned to a class c, given input x and training data D_train. Then, for a specific input x (in our case the SOAP descriptor), the most likely class c, i.e., the one with the largest p(y = c|x, D_train), is the predicted class.
Gal and Ghahramani 4 showed that stochastic regularization techniques such as dropout 5,6 can be used to calculate high-quality uncertainty estimates (alongside predictions) at low cost. In dropout, neurons are randomly dropped in each layer before the network is evaluated for a given input. Usually, dropout is only used at training time, with the goal of avoiding overfitting by preventing over-specialization of individual units. Keeping the regularization also at test time allows one to quantify the uncertainty. Practically, given a new input, one collects and subsequently aggregates the predictions while using dropout at prediction time. This gives a collection of probabilities, denoted p(y = c|x, ω_t), which is the probability of predicting class c given the input x at a specific forward pass t, with model parameters ω_t. From this collection of probabilities, one can estimate the actual quantity of interest, p(y = c|x, D_train), by a simple average 4:

p(y = c|x, D_train) ≈ (1/T) Σ_{t=1}^{T} p(y = c|x, ω_t),   (4)

where T is the number of forward passes (see Methods section "Neural network architecture and training procedure" for details on how we choose this parameter). While the average can be used to infer the class label c, additional statistical information, which reflects the predictive uncertainty, is contained in the collected forward passes, i.e., the probabilities p(y = c|x, ω_t), which effectively yield a histogram for each class and define, when varying over all possible c, a (discrete) probability distribution. For instance, mutual information can be used to quantify the uncertainty from the expressions p(y = c|x, ω_t). Specifically, for a given test point x, the mutual information between the predictions and the model posterior p(ω|D_train) (which captures the most probable parameters given the training data) is defined as 4,7

I[y, ω | x, D_train] = H[p(y|x, D_train)] − E_{p(ω|D_train)}[ H[p(y|x, ω)] ],

i.e., the difference between the entropy of the averaged predictive distribution and the average entropy of the individual forward passes.

Hyperparameter optimization. The Tree-structured Parzen estimator (TPE) algorithm 8,9 is an example of a Bayesian optimization technique.
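The Monte Carlo dropout average of Eq. 4 and the mutual-information estimate can be computed directly from the collected forward-pass probabilities. The following is a NumPy sketch with an assumed array layout, not the actual ARISE implementation:

```python
import numpy as np

def mc_dropout_statistics(probs):
    """probs: array of shape (T, C) holding p(y = c | x, omega_t) for T
    stochastic forward passes (dropout kept active) and C classes."""
    mean = probs.mean(axis=0)                  # Eq. 4: averaged predictive probs
    eps = 1e-12                                # avoid log(0)
    # entropy of the averaged predictive distribution
    predictive_entropy = -np.sum(mean * np.log(mean + eps))
    # average entropy of the individual forward passes
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    mutual_information = predictive_entropy - expected_entropy
    return int(mean.argmax()), mean, mutual_information
```

When all forward passes agree, the mutual information vanishes; when the passes disagree (the model is uncertain about its parameters), it grows toward the entropy of the averaged distribution.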
Specifically, one has to define a search space which can comprise a variety of parameters such as the learning rate or model size specifics such as the number of layers or neurons. Then, the goal is to minimize a performance metric (in our case, we maximize the accuracy by minimizing its negative). For large search spaces, iterating through each possible combination, i.e., performing a grid search, will get expensive very quickly. Random search is one alternative, while Bayesian methods such as TPE can be more efficient. Approaches such as grid or random search assign uniform probability to each hyperparameter choice, which implies that a long time is spent at settings with low reward. This becomes particularly troublesome if the performance metric is expensive to calculate. In Bayesian methods such as TPE, the objective is replaced by a computationally cheaper surrogate model (for instance, Gaussian process or random forest regressor). New hyperparameters are selected iteratively in a Bayesian fashion. Specifically, the selection is based on an evaluation function (typically so-called expected improvement) taking into account the history of hyperparameter selections and thus avoiding corners of the search space with low reward.
The search space is chosen in the following way (alongside the chosen hyperopt commands hp.choice or hp.uniform):
• Number of layers (2, 3, 4, 5), hp.choice
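For illustration, the sketch below implements a minimal random-search baseline over a search space of this shape. The actual study uses hyperopt's TPE (via hp.choice and hp.uniform), which replaces the uniform sampler below with guided, history-aware sampling; the learning-rate range here is an assumed placeholder, not a value from the paper:

```python
import random

# illustrative search space mirroring the hp.choice / hp.uniform entries
SPACE = {
    "n_layers": [2, 3, 4, 5],        # categorical choice (hp.choice)
    "learning_rate": (1e-4, 1e-2),   # uniform range (hp.uniform); assumed values
}

def sample(space):
    """Draw one hyperparameter configuration uniformly at random."""
    return {
        "n_layers": random.choice(space["n_layers"]),
        "learning_rate": random.uniform(*space["learning_rate"]),
    }

def random_search(objective, space, n_trials=20, seed=0):
    """Baseline optimizer: minimize the objective (e.g., negative accuracy).
    TPE improves on this by concentrating samples where past trials scored well."""
    random.seed(seed)
    trials = [(objective(cfg), cfg) for cfg in (sample(space) for _ in range(n_trials))]
    return min(trials, key=lambda t: t[0])
```

Swapping this sampler for hyperopt's `fmin(..., algo=tpe.suggest)` keeps the same objective and search-space definition while avoiding regions of low reward.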

Supplementary Notes
Supplementary Note 1 In the following, we provide details on the benchmarking.
For spglib, we only include prototypes from AFLOW. The reason for excluding structures from the computational materials repository (CMR) is that we do not always have correct or meaningful labels for all structures. For instance, some 2D materials are specified as P1 in the database, which cannot be used as a correct label. Furthermore, for quaternary chalcogenides, the expected symmetries (as specified in the corresponding reference 10) cannot be reconstructed, which is most likely due to local optimization effects. Similar observations were made for the ternary perovskites. A more careful choice of precision parameters or additional local optimization may help. Thus, to enable a fair comparison, the benchmarking in the main text only reports results on elemental and binary compounds from AFLOW (where we know the true labels), while the performance on both AFLOW and CMR data is shown in Supplementary Tables 1, 2, and 3. To avoid the impression that spglib is not applicable to ternary, quaternary, and 2D materials, we still provide the label "96/108" behind spglib methods in the benchmarking tables. Note that non-periodic structures are excluded for benchmarking (again only in the main table), in particular carbon nanotubes, since these systems cannot be treated by spglib.
For the other benchmarking methods, which are common neighbor analysis (CNA, a-CNA), bond angle analysis (BAA), and polyhedral template matching (PTM), we use implementations provided in OVITO 11, where for BAA we apply the Ackland-Jones method. As for spglib, only periodic structures were included. BAA, CNA, and a-CNA all include fcc, bcc, and hcp structures, while PTM additionally contains sc, diamond, hexagonal diamond, graphene, graphitic boron nitride, L10, L12, zinc blende, and wurtzite. Each of the frameworks provides one label per atom, i.e., for a structure with N atoms we obtain N labels. To obtain an accuracy score, we compare these N predictions to N true labels, which correspond to the space group associated with the prototype label (e.g., 194 for hcp). For CNA, we select the standard cutoff (depending on its value, one is able to detect bcc but not hcp, and vice versa). Also for BAA (Ackland-Jones) and a-CNA, standard settings are used. For PTM, an RMSD cutoff of 0.1 was used (again the default in OVITO). Note that PTM can also distinguish different sites of the L12 structure. For simplicity, we did not take the site labels of the L12 structure into account, but always assigned the true label as soon as an atom was assigned to the L12 class (even if the site might not be correct).
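The per-atom scoring described above reduces to a simple comparison; the following is a sketch with a function name of our choosing:

```python
import numpy as np

def per_atom_accuracy(predicted_spacegroups, true_spacegroup):
    """Fraction of the N atoms whose predicted label matches the space group
    of the reference prototype (e.g., 194 for hcp)."""
    pred = np.asarray(predicted_spacegroups)
    return float(np.mean(pred == true_spacegroup))
```

For an hcp structure in which three of four atoms are labeled 194 and one is mislabeled 225, the score is 0.75.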
Furthermore, for ARISE both periodic and non-periodic structures are included, while for the benchmarking methods only periodic structures are considered. While for spglib translational symmetry is violated by construction, the other methods can in principle be applied to these systems. However, when calculating the accuracy for a given non-periodic structure, we have to choose a label for the boundary atoms. If we select the same label for these atoms as for the central ones (which have sufficiently many neighbors), these methods will usually predict the class "None", and interpreting this as a "misclassification" would decrease the total classification accuracy. Therefore, for a fair comparison, we exclude non-periodic structures.

Supplementary Note 2
The Bain transformation path describes a structural transition between bcc and fcc symmetries via intermediate tetragonal phases 12 of body-centered (or, equivalently, face-centered) tetragonal symmetry. Originally investigated for iron 12, the Bain path is relevant in thermo-mechanical processing (a central aspect for steel properties 13), as it serves as a model for temperature-induced transitions between fcc (γ) and bcc (α) iron 14. The Bain path is also crucial for understanding properties of epitaxial films 15,16 or metal nanowires 17.
Practically, the structures constituting a Bain path can be obtained by varying the ratio c/a between lattice parameters a and c of a tetragonal structure (cf. Supplementary Fig. 1a); c/a = 1 corresponds to a cubic structure. We generate tetragonal geometries for lattice parameters a, c taking values in [3.0 Å, 6.0 Å] with steps of 0.05 Å, resulting in 3721 crystal structures. These structures are then classified with ARISE, and the results are depicted via classification and uncertainty maps in Supplementary Fig. 1b and c, respectively. Each point in these maps corresponds to a prediction for a specific geometry. We include in the training set fcc, bcc, and tetragonal geometries with experimentally known structural parameters; they are shown as stars in Supplementary Fig. 1b. Specifically, the lattice parameters (a, c, c/a) are (3.155 Å, 3.155 Å, 1.0) for the bcc 18 and (3.615 Å, 5.112 Å, √2) for the fcc prototype 19, while two tetragonal structures (assigned one common label "tetragonal") are included with (3.253 Å, 4.946 Å, 1.521) in the case of In 20 and (3.932 Å, 3.238 Å, 0.824) for α-Pa 21. We isotropically scale every geometry to remove one degree of freedom (see Supplementary Methods section), so that all possible cubic lattices are effectively equivalent; this allows the model to generalize by construction to all cubic lattices regardless of the lattice parameter. The same holds for tetragonal structures (i.e., two degrees of freedom) with constant c/a ratio. As a visual aid, we mark lines of constant c/a in Supplementary Fig. 1b-c starting from the four structures included in the training set. Note that any path connecting the constant c/a ratios corresponding to fcc and bcc structures constitutes a Bain path. To obtain a classification label, we select the class with the highest classification probability through a so-called argmax operation (i.e., the label c maximizing Eq. 4). These predictions are shown in Supplementary Fig. 1.
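The (a, c) parameter grid can be reproduced in a few lines; this is a sketch of the sampling only, and the subsequent per-geometry classification with ARISE is not shown:

```python
import numpy as np

# lattice parameters a and c sampled in [3.0 A, 6.0 A] with 0.05 A steps
values = np.linspace(3.0, 6.0, 61)                 # 61 values, spacing 0.05 A
# one tetragonal geometry per (a, c) pair, with its c/a ratio
grid = [(a, c, c / a) for a in values for c in values]
# 61 x 61 = 3721 geometries, matching the number quoted in the text
```

Points with c/a = 1 lie on the cubic line; after the isotropic scaling described in the Supplementary Methods, all geometries sharing one c/a ratio become equivalent.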
The model is able to detect the bcc and fcc phases in the expected areas, while all prototypes that are not fcc, bcc, or tetragonal are correctly labeled as "Other". We point out that only four structures, corresponding to the points marked by the four stars, are included in the training set, while all other 3717 structures are model (test) predictions. We can also observe that the model correctly predicts the presence of a tetragonal phase between fcc (yellow band) and bcc (green band), even though no tetragonal structures from this region are included in the training set. This transition is smooth, only interrupted by small areas for which other, low-symmetry prototypes are assigned, but with high uncertainty, as quantified by the mutual information, cf. Supplementary Fig. 1c. We provide the classification probabilities of all assigned prototypes in Supplementary Fig. 3. In general, increased uncertainty appears at transitions between the assignments of different prototypes. We also note that there is a smooth transition of classification probabilities at the transitions between prototypes (cf. Supplementary Fig. 3). These results represent a first indication that the network has learned physically meaningful representations. Surprisingly, for large or small c/a ratios, i.e., points far outside the training set, other (low-symmetry) phases appear, such as base-centered orthorhombic molecular iodine or face-centered orthorhombic γ-Pu, with small uncertainty. While it may be desirable to avoid overconfident predictions far away from the training set, the assignments could actually be physically justified given the similarities between tetragonal and orthorhombic lattices, the most evident being that all angles in both crystal systems are 90°. We note that the transition to these prototypes is also surrounded by regions of high uncertainty, in agreement with physical intuition.

Supplementary Note 3
In the following, we investigate scenarios in which the model is forced to fail, i.e., we analyze ARISE out-of-sample predictions.
To assess the physical content learned by the network, we investigate its predictions, and thus its generalization ability, on structures corresponding to prototypes not included in the training. This is of particular relevance if one wants to use predictions of ARISE for applications such as the screening of large databases, or to create low-dimensional maps for a vast collection of materials 22.
Given an unknown structure, the network needs to decide, among the classes it has been trained on, which one is the most suitable. It will assign the most similar prototypes and quantify the similarity via classification probabilities, providing a ranking of candidate prototypes. The uncertainty in the assignment, as quantified by mutual information, measures the reliability of the prediction. Note that assigning the most similar prototype(s) to a given structure among 108 possible classes (and quantifying the similarity) is very complicated even for trained materials scientists, in particular for complex periodic and possibly defective three-dimensional structures.
We consider three examples (cf. Supplementary Fig. 2, left): fluorite and tungsten carbide (from AFLOW), for which the correct labels are known, and one structure from the NOMAD Encyclopedia (see the last paragraph of this section for details on the provenance), for which the assigned space group is 1, i.e., no symmetry can be identified (via spglib). In all three cases, no prototype in the dataset represents a match for any of these structures. This is on purpose: the network will "fail" by construction, since the correct class is not included in the possible classes the network knows (and needs to choose from). Analyzing how the network fails will give us insight into the physical content of the learned model. This test can also be viewed as discovering "unexpected similarities" across materials of different chemical composition and dimensionality. Following the pipeline for single-crystal classification summarized in Fig. 1, we compute classification probabilities and mutual information, yielding the assignments shown in Supplementary Fig. 2, right. To rationalize the predictions shown in Supplementary Fig. 2 from a physical standpoint, we inspect the substructures formed by the chemical species in both the original and the assigned structures. This is motivated by our choice of materials representation as an averaged SOAP descriptor of substructures (see Supplementary Methods for more details). The two most similar prototypes to fluorite (CaF2) are CsCl and NaCl, both consisting of two inter-penetrating lattices of the same type: two sc lattices for CsCl and two fcc lattices for NaCl. Fluorite contains both sc (F atoms) and fcc (Ca atoms) sublattices, which is likely why CsCl and NaCl are assigned, together with a ternary halide tetragonal perovskite, which also contains sc symmetry (via Cs and Sn atoms, respectively).

[Displaced table caption (benchmarking):] This is also true for the other methods, while additional structures have to be removed, for instance for CNA, a-CNA, and BAA, as they cannot classify simple cubic and diamond structures. In starred rows, all 108 classes summarized in Fig. 1e are included, leading to the strong decrease in performance. In contrast, the neural-network approach proposed here can be applied to all classes, and thus the whole dataset was used. Moreover, we compare ARISE to a standard Bayesian approach: Naive Bayes (NB). We consider two different variants of NB: Bernoulli NB (BNB) and Gaussian NB (GNB), where the whole dataset was used; see the Methods section for more details. ARISE is overwhelmingly more accurate than both NB methods, for both pristine and defective structures.
For tungsten carbide (WC), W and C form two hexagonal lattices. In the unit cell of the most similar prototype, FeSi, 60° angles are formed within the substructures of each species (see dashed lines in the unit cell), thus justifying this classification. Furthermore, two quaternary chalcogenides appear as further candidates. This similarity, hard to assess by eye, originates from the presence of angles close to 60° for S atoms (yellow) in both Cu2CdGeS4 and Cu2ZnSiS4 (marked in the figure for Cu2CdGeS4). Also note that these two quaternary prototypes, Cu2ZnSiS4 and Cu2CdGeS4, are a result of substituting Ge and Si with the isoelectronic elements Zn and Cd, which implies that these structures are expected to be similar; this explains why both appear as candidates for structures similar to tungsten carbide. Finally, for the compound Tm23Se32 from the NOMAD Encyclopedia, the model identifies NaCl as the most similar prototype. Looking at the structure from different angles, especially from the top (cf. Supplementary Fig. 2, left part), a similarity to cubic systems can be identified. The robustness of the classification method to missing atoms makes the apparent gaps in the side view negligible, and thus rationalizes the NaCl assignment. Regarding the uncertainty quantification (via mutual information), increased uncertainties appear for fluorite and tungsten carbide, since besides the top prediction with more than 70% classification probability, other prototypes are possible candidates for the most similar prototype. For the NOMAD structure Tm23Se32, the network is quite confident, most likely because no other good candidates are present among the binaries included in the 108-class dataset. These results show that the model, even when forced to fail by construction, returns (highly non-trivial) physically meaningful predictions.
This makes ARISE particularly attractive for screening large and structurally diverse databases, in particular for assessing structures for which no symmetry label can be obtained with any current state-of-the-art method.
In addition to the analysis in Supplementary Fig. 2, we present some results for further out-of-sample structures: The structure taken from the NOMAD Encyclopedia has the ID mp-684691 in Materials Project, where further details can be found, e.g., on the experimental origin.

Supplementary Figure 5. Cosine similarity plots for elemental, binary, ternary, and quaternary compounds as well as 2D materials (for SOAP settings R_C = 4.0 · d_NN, σ = 0.1 · d_NN, and exsf = 1.0, corresponding to the center of the parameter range used in the training set). Each line corresponds to a particular prototype. The x-axis corresponds to three different (non-periodic) supercells, where supercell "0" stands for the smallest isotropic supercell (for instance, 4 × 4 × 4 repetitions) for which at least 100 atoms are obtained. Supercells "1" and "2" correspond to the next two bigger isotropic replicas (e.g., 5 × 5 × 5 and 6 × 6 × 6). The y-axis corresponds to the cosine similarity of the respective supercell to the periodic structure, i.e., the case of infinite replicas. One can see a continuous increase of similarity with larger supercell size; for the largest supercell, the similarity is greater than 0.8 for all prototypes. Thus, it is to be expected that system sizes larger than those included in the training set can be correctly classified by ARISE. For smaller systems, however, the generalization ability will depend on the prototype. Practically, one can include smaller supercells in the training set, which is not a major problem due to the fast convergence time.

Mono-species elemental polycrystal investigation via strided pattern matching using lower resolution (stride of 3.0 Å in both x and y directions, as opposed to 1.0 Å as in Fig. 2). Choosing the stride is a trade-off between computation time and resolution. For instance, at the grain boundary between diamond and hcp, the transition from diamond to hexagonal diamond to hcp (cf. Supplementary Fig. 7) is recognized in Fig. 2b, while being obscured in the presented low-resolution pictures.

Supplementary Figure 11. Quantitative study of the mutual-information distribution for different annealing times.
a Histogram of mutual-information values for each annealing time (the histograms are normalized by dividing each bin by the total number of boxes). Only mutual-information values smaller than 0.2 are shown, which correspond to the "dark", i.e., low-mutual-information spots in Fig. 4c. b Cumulative distribution calculated for the histogram shown in a. From a and b it is apparent that the number of low-uncertainty boxes increases for larger annealing times. c-d Investigation of the radial distribution of the mutual information. c Histograms of uncertainty (mutual information) obtained by spatially binning the SPM maps of Fig. 4c into spherical shells, where the median is computed for each bin. Given a mutual-information value, the associated radius is calculated as the distance of the center of the corresponding box (as obtained via SPM) to the center of the most central box. d Each panel shows the difference between the cumulative distributions of two annealing times, where the cumulative distributions are calculated from the histograms shown in b. In addition, the histograms are normalized in the following way: given times t_1, t_2 with t_1 < t_2, the cumulative sum of t_2 − t_1 is calculated and then divided by the cumulative sum of time t_1, such that the fractional change from t_1 to t_2 is obtained. One can conclude that in c a clear decrease of mutual information can be spotted in specific regions, e.g., for the radial region 15-20 Å. The cumulative sums used in d allow one to quantify the order more globally, in the sense that each bin (of the cumulative sum corresponding to a specific annealing time) is proportional to the spherically averaged integral from radius zero up to the radius corresponding to that bin. Since the particle sizes change over time due to diffusion, the particles have different sizes.
Thus, we single out a radius at which to compare the global order: for instance, comparing the bins corresponding to a radius of r = 25 Å, we see that for all three panels the values are negative, and thus the structure that has been annealed longer shows larger global order.

[Displaced rows from the prototype table:]
105. Carbon nanotube, chiral, (6,2), 13.9°, 5.65 Å (nanotubes, mono-species, ASE)
106. Carbon nanotube, chiral, (7,1), 6.59°, 5.91 Å (nanotubes, mono-species, ASE)
107. Carbon nanotube, zigzag, (6,0), 0.0°, 4.7 Å (nanotubes, mono-species, ASE)
108. Carbon nanotube, zigzag, (7,0), 0.0°, 5.48 Å (nanotubes, mono-species, ASE)

This database 24 can be accessed via Materials Project or http://crystalium.materialsvirtuallab.org/. For each structure, four lines of information are provided:

The first line specifies the information required to uniquely describe a grain-boundary structure 25: first the Σ-parameter is given, followed by the rotation angle, rotation axis, and grain-boundary plane. The relative orientation of two neighboring grains is described by three degrees of freedom (rotation angle and axis). The two degrees of freedom specified via the grain-boundary plane complete the unique characterization of a grain-boundary structure. The second line specifies the element and the entry number of the polymorph in the database (for a given element, multiple grain boundaries can be available). The third line specifies the Materials Project ID and the space group. The last line specifies the grain-boundary type (twist, tilt) alongside the dominating crystal structure. The database entries correspond to periodic cells that contain a grain boundary. We replicate this initial cell isotropically (in the plane parallel to the grain boundary) until at least 1000 atoms are contained in the structure. For all examples, the dominating phase and the grain-boundary regions are correctly detected, as shown via the 3D classification-probability maps of the most popular assignments according to ARISE. These selected structures illustrate the advantages of ARISE in the following way: a shows that ARISE can detect dhcp symmetry in a polycrystal. In particular, the close packing corresponding to dhcp cannot be classified in comparable automatic fashion by any of the available methods. For hcp (b) and fcc (c), the dhcp assignments only appear at the grain boundary. d and e are two different grain-boundary types that not only differ in their defining degrees of freedom but also are of tilt (d) and twist (e) type.
ARISE distinguishes the local structures at the grain boundary, which is indicated by its assignments: while for the twist type (e) hcp is the dominating assignment at the grain boundary, for the tilt type (d) the hcp probability drops to zero at the grain boundary (except for the outer borders). The following SPM parameters are chosen for all examples: a stride of 2 Å suffices to resolve the main characteristics, and for a box size of 16 Å at least 100 atoms are contained in the boxes within the grains.

Supplementary Figure 13. Cross-similarity matrix for a selection of the defect library 26 that is larger than in the main text (Fig. 3d). Specifically, 140 structures as well as the mono-species structures from Fig. 3a (right) and Fig. 3e are considered. For the reconstruction of atomic positions, Atomnet is employed; for the structures from the library, atomic positions are reconstructed using a model that can also classify the chemical species. We employed the model available at https://github.com/pycroscopy/AICrystallographer/tree/master/DefectNet.