Introduction

Distinct crystal structures, surfaces, and interfaces in bulk as well as nanomaterials play a key role in tailoring desirable properties in many applications, e.g., catalysis or energy conversion and storage1,2,3. In particular, exposed surface structures in catalysts determine catalytic performances comprising activity and selectivity4. Furthermore, interfaces such as grain boundaries or stacking faults can largely affect the transport properties in energy storage or conversion devices5,6,7,8,9,10. For example, grain boundaries serve as ion migration paths in batteries6,7, act as scattering sites for phonons in thermoelectric devices5,8, and could degrade electronic conductivity in solar cells9,10. To engineer advanced materials for such applications, it is necessary to characterize their crystalline structure down to the atomic level, including defects or interfaces, local lattice orientations, and distortions11,12,13. Currently, the ultimate tool to probe imperfections in crystalline materials is electron microscopy.

To date, electron microscopy techniques with aberration correction have been developed for investigating microstructures of materials with atomic spatial resolution. In particular, scanning transmission electron microscopy (STEM) images are more readily interpretable than images obtained via high-resolution transmission electron microscopy (HR-TEM), due to the direct correlation between image contrast and the atomic number Z of the observed species14. In STEM, a focused, high-energy electron beam passes through a thin, electron-transparent sample. The electrons interact with the atoms in the sample and are scattered both elastically and inelastically, which makes it possible to image the sample through various detector geometries (e.g., bright field (BF), dark field (DF), annular dark field (ADF), as well as high-angle annular dark field (HAADF)) and to probe it through spectroscopic techniques (e.g., electron energy-loss spectroscopy (EELS) and energy-dispersive X-ray spectroscopy (EDS))15,16,17,18. The most commonly employed technique to image atomic structures and crystalline defects is HAADF-STEM, where electrons scattered to large angles are collected by an annular detector forming an incoherent image. Moreover, a variety of data channels can be collected simultaneously with high-speed detectors, but as of today the wealth of information available in STEM is not fully exploited, due to the lack of versatile, automatic analysis tools19,20.

Big-data analytics and artificial intelligence (AI) have great potential for analyzing large electron-microscopy data, with several applications to various datasets being reported20,21,22,23,24,25,26. Such methods are introduced to uncover overlooked characteristics and this way drive a paradigm shift in image analysis and design of descriptors of atomic-resolution data. To provide a few examples, space-group classification was proposed based on electron imaging and diffraction datasets21. Also, multivariate statistical techniques were employed to extract structural information such as the crystal structure and orientation of a small sample region from complex four-dimensional STEM datasets24. Detection and assignment of microstructural characteristics that differ from the vast majority of crystalline regions and phases in STEM datasets has been performed, e.g., the identification of the local dopant distribution in graphene22,25, or monitoring of electron-beam induced phase transformations23. One can also train AI methods to assign two-dimensional (2D) Bravais lattices to STEM or scanning tunneling microscopy (STM) images23,27. A further approach is to reconstruct the real-space lattice from atomic-resolution images25,28,29, providing real-space information that can be analyzed with structure-identification methods that are based, for instance, on graphs25 or structural descriptors30. Unsupervised learning for defect detection or chemical-species classification is reported, for instance, in31,32. The above approaches rely heavily on recent developments in deep learning33. Properly trained neural networks (NNs) such as convolutional neural networks (CNNs) have been shown to solve image classification problems more accurately than other machine-learning methods and in particular, more efficiently than humans, especially in high-throughput tasks.

Here, we propose AI-STEM, which stands for Artificial-Intelligence Scanning Transmission Electron Microscopy. AI-STEM automatically identifies projected crystal symmetry and lattice orientation as well as the location of defects such as grain boundaries in STEM images. Both synthetic and experimental images can be processed directly and automatically; no reconstruction of real-space lattices is required. We employ a Fourier-space descriptor, termed FFT-HAADF (FFT: Fast Fourier Transform), as input for a CNN. The deep-learning model classifies a given image into a selection of crystalline regions that differ not only by crystal symmetry but also by orientation. This provides additional information compared to, for instance, the classification of a given image into the five Bravais lattices that exist in two dimensions. In particular, we propose an efficient training scheme that enables fast retraining and extension of the method. The model is trained on simulated images only, achieving near-perfect accuracy on both training and test data (in total 31,470 data points, see “Methods”). The training data contains typical noise sources that are encountered in experiment. Notably, we adopt the Bayesian neural-network (BNN) approach, employing the Monte Carlo dropout framework that was originally developed by Gal and Ghahramani34. BNNs not only classify a given input but also provide uncertainty estimates. We exploit this additional information, which is absent in standard deep-learning models, to locate bulk regions as areas of low and interfaces as areas of high model uncertainty. This way, AI-STEM can identify defects without being explicitly informed about them during training. The identification of bulk and interface regions is related to semantic segmentation, a popular computer-vision task in which each image pixel is classified in order to locate individual objects35.
Based on AI-STEM’s bulk-versus-interface segmentation, further analysis can be conducted where the model indicates that it is meaningful: for instance, we demonstrate how the local lattice rotation can be calculated in the detected bulk regions. Finally, we employ unsupervised learning to visualize the high-dimensional NN representations in an interpretable, two-dimensional map. This reveals that the model separates not only crystalline grains with different symmetry but also different types of interfaces, despite never being explicitly instructed to do so. All code and data are made publicly available.

Results

Development of an automated classification procedure

Our goal is to develop an automatic framework for analyzing experimental HAADF-STEM images of bulk materials such as shown in Fig. 1a: in this image, the bulk crystalline regions are separated by a grain boundary (the interface region). The final prediction as shown in Fig. 1f should classify the image into bulk and interface regions, while also obtaining information about the bulk symmetry and lattice orientation. Here, the bulk region should be labeled as “fcc 111”, i.e., face-centered cubic symmetry in [111] orientation, since both grains are viewed along their common [111] zone axis corresponding to the tilt axis of the grain boundary. Finally, AI-STEM’s predictions can be used to automatically identify where to calculate additional properties that provide further characterization, for instance, of the bulk regions and their local lattice rotation (cf. Fig. 1g). In the following, we explain the intermediate steps that are required to map image input (Fig. 1a) to a characterization such as shown in Fig. 1f.

Fig. 1: Schematic overview of the AI-STEM procedure for analyzing experimental STEM images.

The starting point for AI-STEM (Artificial Intelligence Scanning Transmission Electron Microscopy) is a high-angle annular dark field (HAADF) STEM image (a) that here contains two different crystalline regions and one grain boundary (interface). A local window is scanned over the image with a certain stride to fragment the input into local windows (b). Three different local windows are indicated in (a), corresponding to regions in the bulk (red, blue) and the boundary (yellow). The local windows are then represented using a fast Fourier transform (FFT) HAADF descriptor (c, normalized between 0 and 1), where typically, a pronounced central peak can be observed. To enhance the neighboring peaks, the maximum value in the color scale is set to 0.1. This Fourier space descriptor is used as input for a Bayesian convolutional neural network (d) that provides a classification of crystal structure and lattice plane as well as uncertainty estimates (e). The former can be used to detect the bulk regions and the latter reveal the interface and in general, regions with crystal defects (f). On top of this segmentation, additional analysis, e.g. the determination of the local lattice orientation can be performed (g). The scale bar is 1 nm in a.

Fourier-space representation of atomic-resolution images

To achieve sensitivity to the substructure in an image such as shown in Fig. 1a, we divide it into local fragments (cf. Fig. 1b). Specifically, a sliding window of predefined size is scanned over the whole image with a fixed stride, and a local patch is extracted at each window position. This makes it possible to follow structural transitions, e.g., between bulk and interface regions, in a smooth fashion. The selection of stride and window size is discussed in the Methods section. Each of the local patches is then transformed into reciprocal space by computing a Fourier-space descriptor (cf. Fig. 1c). Essentially, the fast Fourier transform (FFT) is calculated with additional pre- and post-processing steps (see Methods). We term this descriptor FFT-HAADF and use it as input for the machine-learning classification model. Calculating the Fourier transform enhances information on the lattice periodicity, thus providing a starting point for a machine-learning model that can be generalized to other imaging modalities that provide atomic-resolution information, such as HR-TEM or STM. In addition, translational invariance is introduced already at the level of the representation. The descriptor is not rotationally invariant, which is why we employ data augmentation, as we will explain in the section “Training data generation”.
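The windowing and Fourier-transform steps described above can be sketched as follows. This is a minimal illustration rather than the FFT-HAADF recipe itself: the exact pre- and post-processing is given in Methods, so the Hann taper and the min-max normalization used here are assumptions, and the function names `extract_windows` and `fft_descriptor` are hypothetical.

```python
import numpy as np

def extract_windows(image, window, stride):
    """Slide a square window over a 2D image; return the patches and their top-left corners."""
    rows = range(0, image.shape[0] - window + 1, stride)
    cols = range(0, image.shape[1] - window + 1, stride)
    corners = [(r, c) for r in rows for c in cols]
    patches = np.stack([image[r:r + window, c:c + window] for r, c in corners])
    return patches, corners

def fft_descriptor(patch):
    """Illustrative Fourier-space descriptor: tapered FFT amplitude scaled to [0, 1]."""
    # Assumed pre-processing: a Hann taper to reduce edge artifacts of the finite window.
    taper = np.outer(np.hanning(patch.shape[0]), np.hanning(patch.shape[1]))
    # Keep only the amplitude and move the zero-frequency peak to the center.
    amplitude = np.abs(np.fft.fftshift(np.fft.fft2(patch * taper)))
    # Assumed post-processing: min-max normalization between 0 and 1.
    span = amplitude.max() - amplitude.min()
    if span == 0:
        return np.zeros_like(amplitude)
    return (amplitude - amplitude.min()) / span

# Toy "lattice" image with a period of 8 pixels in both directions.
yy, xx = np.mgrid[0:64, 0:64]
image = 1.0 + np.cos(2 * np.pi * xx / 8) * np.cos(2 * np.pi * yy / 8)

patches, corners = extract_windows(image, window=16, stride=8)
descriptors = np.stack([fft_descriptor(p) for p in patches])
```

For a 64 × 64 image, a window of 16 pixels and a stride of 8 pixels give 7 × 7 = 49 local windows. Because only the FFT amplitude is kept, a shift of the lattice within the window mainly changes the discarded phase, which is the origin of the translational invariance mentioned above (with a taper, this invariance is approximate).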

The Bayesian classification model

To define the classification task, we need to specify the target labels as well as the model that maps the FFT-HAADF descriptor to the corresponding target labels. As the classification model, we employ a CNN. This machine-learning method is well known for its record-breaking performance in image classification36,37 and is thus well suited to our problem setting. The model receives the FFT-HAADF image descriptor as input and assigns the symmetry (e.g., face-centered cubic) and lattice orientation (e.g., [111], cf. Fig. 1d, e). We select in total 10 different crystalline surface structures into which a given image is classified (cf. Fig. 2a). This includes the most common crystal structures appearing in metals, comprising face-centered cubic (fcc), body-centered cubic (bcc), and hexagonal close-packed (hcp) structures. We focus on low-index crystallographic orientations, which can be resolved at atomic resolution, as the projected interatomic distances are well within the resolution limit. The selected orientations are also based on mono-species metal systems for each of the crystal structures considered here: copper (Cu) for fcc, iron (Fe) for bcc, and titanium (Ti) for the hcp structure. The CNN consists of a sequence of convolutional, pooling, and fully connected layers (cf. Fig. 2b). The last layer is composed of 10 neurons, each corresponding to one of the surface classes. In particular, the output neurons are normalized such that each represents the classification probability for one of the 10 surface structures. For a given image, the most likely class corresponds to the predicted label. In the complete AI-STEM workflow, the CNN is applied to each local window, providing a classification for each local segment (Fig. 1e).

Fig. 2: Image descriptor and convolutional neural network (CNN) model for classification of STEM HAADF images.

a Examples of FFT-HAADF images for all 10 crystalline surfaces included in the training set, which include face-centered cubic (fcc), body-centered cubic (bcc) and hexagonal close-packed (hcp) symmetry. b Schematic CNN architecture. FFT-HAADF images are used as the input, and the assignment to one of the 10 classes is calculated in the final layer.

In general, beyond classification, it is desirable to estimate the model uncertainty. This allows one to assess how much a specific prediction can be trusted, especially in situations that differ from the training set. This can be useful in various scenarios, e.g., for autonomous driving38 or medical diagnosis39. In our case, we train the model only on perfect crystal structures with periodic arrangements of atomic columns and use the uncertainty in the classification to identify the presence of structural defects. Given the large number of degrees of freedom for any defect, creating a library of potentially interesting defects for training is challenging, which is why we take a different approach: we use a Bayesian neural network34,40 which not only classifies a given (local) HAADF-STEM image, but also provides uncertainty estimates of the classification. If the uncertainty is high (low), the image is likely (unlikely) to deviate from the perfect crystal structure (on which the model is trained) and could contain a crystal defect, a secondary phase with different crystal symmetry, or even amorphous regions. This way, we can identify the host crystal structure and orientation at the same time and can locate regions in the image that differ from any of the training classes, where in this work, we consider grain boundaries as an example. One may be tempted to interpret the classification probabilities from the last CNN layer as being informative about model uncertainty. However, high classification probability does not always correlate with low uncertainty. In particular, standard NNs are known for overconfident extrapolations, even for points that are far outside the training set34,40. Modeling of predictive uncertainty can be improved by constructing a probabilistic model that provides a distribution of predictions rather than a single, deterministic one.

To estimate uncertainty in deep-learning models, distributions are placed over the NN weights, which results in probabilistic outputs; the standard approach instead considers a single set of NN parameters, which results in deterministic predictions. More formally, a standard NN is a non-linear function \({f}_{{{{\boldsymbol{\omega }}}}}:{{{\mathcal{X}}}}\to {{{\mathcal{Y}}}}\), i.e., a mapping from input to output space that is parametrized by parameters ω (a set of weights and biases \({{{\boldsymbol{\omega }}}}:= {\{{{{{\bf{W}}}}}_{l},{{{{\bf{b}}}}}_{l}\}}_{l = 1}^{L}\), where L is the number of layers). In the Bayesian setting, after training a model on data \({D}_{{{\mbox{train}}}}\), inference of a target y (here: a class label) for a new point x (here: the FFT-HAADF image descriptor) is calculated via

$$p(y| {{{\bf{x}}}},{D}_{{{\mbox{train}}}})=\int\,p(y| {{{\bf{x}}}},{{{\boldsymbol{\omega }}}})p({{{\boldsymbol{\omega }}}}| {D}_{{{\mbox{train}}}})d{{{\boldsymbol{\omega }}}}.$$
(1)

In this expression, \(p({{{\boldsymbol{\omega }}}}| {D}_{{{\mbox{train}}}})\) denotes the posterior that indicates how likely a set of parameters is given training data \({D}_{{{\mbox{train}}}}\). Moreover, the likelihood \(p(y| {{{\bf{x}}}},{{{\boldsymbol{\omega }}}})\) corresponds to the softmax activation function, a standard approach to normalize the output layer such that its values can be interpreted as classification probabilities:

$$p(y=c| {{{\bf{x}}}},{{{\boldsymbol{\omega }}}})=\frac{\exp \left({[{f}_{{{{\boldsymbol{\omega }}}}}({{{\bf{x}}}})]}_{c}\right)}{\mathop{\sum}\limits_{{c}^{{\prime} }}\exp \left({[{f}_{{{{\boldsymbol{\omega }}}}}({{{\bf{x}}}})]}_{{c}^{{\prime} }}\right)}.$$
(2)

where \({[{f}_{{{{\boldsymbol{\omega }}}}}({{{\bf{x}}}})]}_{c}\) is the output value of the NN for the class c. We see in Eq. (1) that instead of a single hypothesis, all parameter settings weighted by their posterior probabilities are included during inference. The standard approach would correspond to choosing the posterior as a delta distribution over a specific parameter setting, resulting in the above-mentioned overconfident predictions in out-of-distribution scenarios. Evaluating integrals over the whole parameter space, as appearing in Eq. (1), is practically impossible, especially for large deep-learning models. Fortunately, approximation techniques for evaluating Eq. (1) are available.
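As a concrete illustration of Eq. (2), the softmax normalization can be written in a few lines. This is a generic sketch, not code from this work: the logits are hypothetical values standing in for the outputs \({[{f}_{{{{\boldsymbol{\omega }}}}}({{{\bf{x}}}})]}_{c}\), and subtracting the maximum before exponentiating is a common numerical safeguard that leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    """Eq. (2): map raw network outputs to class probabilities (last axis = classes)."""
    # Subtracting the maximum avoids overflow in exp() without changing the ratio.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

# Hypothetical raw outputs of the last layer for four classes.
logits = np.array([2.0, 1.0, 0.1, -1.0])
probs = softmax(logits)

# The predicted label is the class with the highest probability.
predicted_class = int(np.argmax(probs))
```

Note that the probabilities always sum to one, so a deterministic network must distribute its full confidence over the known classes even for inputs far from the training data. This is one way to see why a single softmax output is a poor proxy for model uncertainty.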

One way to approximate Bayesian inference in deep-learning models (i.e., Eq. (1)) is Monte Carlo (MC) dropout34,40. This approach is principled in the sense that the uncertainty estimates from MC dropout approximate those of a Gaussian process40. In more detail, the approach relies on dropout41,42, a regularization technique that is usually used to avoid overfitting by dropping individual neurons during training. This way, the model has to compensate for the loss of individual neurons, which prevents the neural activation from concentrating in local parts of the network. It has been shown that powerful uncertainty estimates can be obtained by using dropout not only during training but also at test time34. Specifically, for a given input, the output layer is sampled for a certain number of iterations T, where each sample is calculated from a different network that is perturbed according to the dropout algorithm. To obtain a Bayesian CNN, dropout is applied after each convolutional and fully connected layer (see the yellow blocks in Fig. 2b). Classification can then be performed by calculating a simple average, i.e., the probability of class c given input x and training data \({D}_{{{\mbox{train}}}}\) (whose general expression is shown in Eq. (1)) can be approximated as

$$p(y=c| {{{\bf{x}}}},{D}_{{{\mbox{train}}}})\approx \frac{1}{T}\mathop{\sum }\limits_{t=1}^{T}p(y=c| {{{\bf{x}}}},{{{{\boldsymbol{\omega }}}}}_{t}).$$
(3)

Here, \(p(y=c| {{{\bf{x}}}},{{{{\boldsymbol{\omega }}}}}_{t})\) (defined in Eq. (2)) denotes the classification probability of class c given input x and parameter configuration \({{{{\boldsymbol{\omega }}}}}_{t}\) that is obtained by random removal of neurons (defined according to the dropout algorithm). A modest number of samples typically suffices40; in this work, we employ T = 100 samples. Notably, this process is in principle trivial to parallelize. We discuss more details on computation time and choice of T in the Supplementary Information (cf. Supplementary Fig. 4). Beyond the simple average in Eq. (3), additional information about the model confidence is contained in the collection of samples \(p(y=c| {{{\bf{x}}}},{{{{\boldsymbol{\omega }}}}}_{t})\). To extract it, we invoke information theory, specifically mutual information. This (scalar) quantity provides a means to quantify the uncertainty and has been employed in different settings including self-driving cars43 as well as crystal-structure identification30. The mutual information is defined between predictive and posterior distribution and is denoted as \(I({{{\boldsymbol{\omega }}}},y| {D}_{{{\mbox{train}}}},{{{\bf{x}}}})\) (see “Methods” for the exact definition). Intuitively, it can be understood as the information gained about the model parameters ω if one were to receive the label y for a new point x. Thus, if the mutual information is high for a given data point, one would gain information once the label is specified, which corresponds to high predictive uncertainty. Similar to Eq. (1), integrals over the whole parameter space appear, which are computationally intractable. However, using MC dropout, one can find a tractable expression that only involves summations over all classes and samples34 (“Methods”).
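The sampling procedure and the two quantities derived from it can be sketched as follows. This is a toy illustration under stated assumptions, not the network used in this work: a small one-hidden-layer model with random weights replaces the CNN, the function names are hypothetical, and the mutual information is computed with the commonly used MC-dropout estimator (entropy of the averaged prediction minus the average entropy of the individual samples), which we assume corresponds to the expression given in “Methods”.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mc_dropout_samples(x, W1, W2, drop_rate, T, rng):
    """Draw T stochastic forward passes of a toy network with dropout kept active."""
    samples = []
    for _ in range(T):
        hidden = np.maximum(x @ W1, 0.0)               # ReLU activation
        keep = rng.random(hidden.shape) >= drop_rate   # fresh random dropout mask
        hidden = hidden * keep / (1.0 - drop_rate)     # inverted-dropout rescaling
        samples.append(softmax(hidden @ W2))
    return np.array(samples)                           # shape (T, n_classes)

def mean_and_mutual_information(samples, eps=1e-12):
    """Return the Eq. (3) average and the MC estimate of the mutual information."""
    mean = samples.mean(axis=0)
    entropy_of_mean = -np.sum(mean * np.log(mean + eps))
    mean_of_entropies = -np.mean(np.sum(samples * np.log(samples + eps), axis=1))
    return mean, entropy_of_mean - mean_of_entropies

rng = np.random.default_rng(0)
W1 = rng.normal(size=(8, 16))     # hypothetical weights, 8 inputs -> 16 hidden units
W2 = rng.normal(size=(16, 10))    # 16 hidden units -> 10 classes
x = rng.normal(size=8)            # hypothetical flattened input

samples = mc_dropout_samples(x, W1, W2, drop_rate=0.5, T=100, rng=rng)
mean_probs, mutual_info = mean_and_mutual_information(samples)
```

Two limiting cases help to interpret the estimator: if all T samples agree, the mutual information vanishes regardless of how peaked or flat the shared distribution is; if the samples are individually confident but disagree about the class, it becomes large (ln 2 for an even split between two classes).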

Training data generation

To train the classification model for crystal-structure identification in atomic-resolution images, a suitable training dataset has to be generated. Notably, we refrain from training on experimental images, which may contain unknown artefacts such as noise, distortions, or defects. Furthermore, acquiring and curating an experimental database of images of pristine crystal structures imaged at different orientations with atomic resolution is an elaborate task. Instead, we train only on simulated images, where we have exact control over imaging conditions and noise sources, allowing us to create a dataset with known labels. Obtaining such reliable training data is essential to achieve trustworthy labeling output of the CNN. One may criticize simulations for potentially missing crucial features that are present in experiment. However, with the advent of aberration correction in STEM14, the direct comparison of experimental and simulated images at atomic resolution has become possible also on a quantitative basis44. It has even been shown that it is possible to determine the number of atoms in an atomic column or to retrieve the 3D atomic structure of nano-objects by combining experimental and simulated images45,46. Recently developed efficient implementations of the multislice algorithm make it possible to simulate STEM images under conditions similar to experiment47,48. Using high-performance computing, realistic simulations of images can be conducted, achieving computation times of a few hours to days for 10–100 images. In this work, we provide additional speed-up by using a convolution-based approach, reducing the computation time from days to minutes for an entire training dataset (see “Methods”). Using this efficient simulation scheme, we obtain images for each of the 10 classes for different lattice constants. Additionally, we include data augmentation steps to consider a range of lattice rotations and noise sources that resemble typical experimental conditions (“Methods”).
In this work, we include lattice shear, blurring, as well as Gaussian and Poisson noise—resulting in 31,470 data points. We want to emphasize that even though the model is trained on synthetic data, we apply it to classify experimental atomic resolution STEM images as shown in the Results section (see Fig. 4).
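To make the idea of a convolution-based simulation with noise augmentation more tangible, the following sketch generates a toy image. It is an illustration under explicit assumptions and not the scheme described in “Methods”: atomic columns are modeled as unit peaks on a 2D lattice, the probe is assumed to be Gaussian, the convolution is periodic (FFT-based), and the parameters `sigma`, `dose`, and `gaussian_sigma` are hypothetical. Lattice rotation and shear are omitted.

```python
import numpy as np

def simulate_lattice_image(size, a1, a2, sigma):
    """Place unit peaks on a 2D lattice (pixel units) and blur them with a Gaussian probe."""
    a1, a2 = np.asarray(a1, dtype=float), np.asarray(a2, dtype=float)
    peaks = np.zeros((size, size))
    # Index range chosen generously; strongly skewed lattices may need a larger range.
    n = int(size / min(np.linalg.norm(a1), np.linalg.norm(a2))) + 2
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            r, c = np.rint(i * a1 + j * a2).astype(int)
            if 0 <= r < size and 0 <= c < size:
                peaks[r, c] = 1.0
    # Fourier transform of a unit-area Gaussian with standard deviation sigma (pixels).
    freq = np.fft.fftfreq(size)
    fy, fx = np.meshgrid(freq, freq, indexing="ij")
    probe_ft = np.exp(-2.0 * (np.pi * sigma) ** 2 * (fx ** 2 + fy ** 2))
    # Convolution theorem: multiply in Fourier space (periodic boundary conditions).
    return np.real(np.fft.ifft2(np.fft.fft2(peaks) * probe_ft))

def add_noise(image, rng, dose=200.0, gaussian_sigma=0.02):
    """Apply Poisson (shot) noise at a hypothetical dose, then additive Gaussian noise."""
    scaled = np.clip(image, 0.0, None) / image.max()
    shot = rng.poisson(scaled * dose) / dose
    return shot + rng.normal(0.0, gaussian_sigma, size=image.shape)

rng = np.random.default_rng(0)
# Roughly hexagonal toy lattice with a spacing of about 8 pixels.
clean = simulate_lattice_image(64, a1=(8, 0), a2=(4, 7), sigma=1.5)
noisy = add_noise(clean, rng)
```

Because a single FFT-based convolution replaces the slice-by-slice propagation of a multislice calculation, such an image is generated in a fraction of a second, which illustrates where a speed-up of this kind can come from.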

Neural-network training procedure

For training, the 31,470 data points are split, with 80% used for training and 20% for validation. Based on the performance on the validation set, we optimize hyperparameters such as the filter size in the convolutional layers and the dropout ratio (the fraction of neurons dropped). Specifically, we employ Bayesian optimization, which is a general approach for global optimization of black-box functions that are computationally expensive to evaluate49. This makes Bayesian optimization well suited for optimizing NNs, where exploring different architectures and optimization parameters is typically accompanied by high computational cost. Here, the black-box function to be optimized is the validation loss, and the optimization protocol we invoke50 provides us with a list of candidate models, all with near-perfect accuracy (see “Methods”). Their uncertainty estimates, however, are different, as we will highlight via the following model selection procedure.

To find the model that shows the strongest performance in both classification and detection of out-of-training-distribution regions, we analyze the simulated test image in Fig. 3a. It contains both crystalline and amorphous regions, providing a test bed for identifying models with high uncertainty at the transition between grains and in the amorphous region, neither of which is shown to the models during training. The four regions in the image are simulated separately (using full multislice simulations) and then stitched together. Three of these regions are crystalline, each representing one of the three symmetries in the training set: Fe (bcc, [100]), Cu (fcc, [100]), and Ti (hcp, [0001]). Here, we expect low uncertainty and correct assignment of the respective symmetry. The amorphous region is simulated based on a three-dimensional structure obtained via realistic molecular-dynamics simulations of amorphous silicon51. All models obtained via Bayesian optimization are applied to this image. Given their near-perfect accuracy during training, they all can recognize the crystalline parts of the image, while their assignments in the amorphous region differ. We can now also analyze the corresponding uncertainties, which provide an estimate of the reliability of the classifications (cf. Fig. 3b). We select the model with the highest uncertainty, as quantified by the mutual information (cf. Eq. (6)), in the amorphous region. For this model, the classification results are shown in Fig. 3c, where one can see that the correct crystal symmetries are assigned in the expected regions, while in the amorphous part, several different phases are assigned. The mutual information shown in Fig. 3d increases at the interfaces between the four regions, as well as in the amorphous part. The detailed architecture is specified in Table 1.

Fig. 3: Application of AI-STEM to synthetic, polycrystalline data.

a The simulated image has four regions with different structural order: three crystalline grains (Cu fcc [100], Fe bcc [100], Ti hcp [0001]) and one amorphous grain. Each grain is rectangular with an edge length of 40 Å. The sliding window is 1.2 × 1.2 nm (100 pixels) and is visualized in the top left corner. b The Bayesian CNN employed in the AI-STEM workflow (cf. Fig. 1) provides a distribution and not only point estimates in the final output layer. The averaged classification probabilities can be used to identify the most likely class (c). An uncertainty estimate (a scalar value, cf. (b) and Eq. (6)) can be obtained via the mutual information (d), revealing the grain boundaries as well as the amorphous region. The scale bar is 1 nm in a.

Table 1 Convolutional neural network architecture employed in this work.

Application to experimental STEM data

Now we turn to applying AI-STEM to experimental data. In the following, we challenge the model with several HAADF-STEM images, demonstrating the practical applicability of AI-STEM. In particular, we show that the model can classify crystalline regions in experimental images and how the bulk-versus-interface segmentation can be inferred and employed for further analysis—here, for determining the local lattice orientation in the bulk regions.

First, a HAADF image of elemental Cu shown in Fig. 4a is analyzed. The image contains a horizontally aligned grain boundary separating two misoriented single crystals, both viewed along [111], in the upper and lower part of the image. As shown in Fig. 4b, the model classifies the grain regions correctly as fcc [111]. At the interface, the same label is assigned, but with increased uncertainty (as quantified by mutual information), which allows the interface region to be detected (cf. Fig. 4c).

Fig. 4: Application of AI-STEM to experimental STEM images with grain boundaries.

Three experimental images are investigated: a Σ 19b(178)[111] tilt grain boundary (GB) in Cu with a misorientation angle of ~48° (class: fcc 111, 9 × 9 nm), a Σ 5(013)[001] tilt GB in Fe with a misorientation angle of ~38° (class: bcc 100, 6.4 × 6.4 nm), and a low-angle [0001] tilt GB in Ti with a misorientation angle of ~13° (class: hcp 0001, 12.8 × 12.8 nm), which are shown in a, e, and i. One can see from the classification maps (b, f, j) that the expected bulk symmetries are correctly assigned. In the color scale, only the most frequent assignments are labeled; the full color scale (indicating the other assignments, e.g., in (f) at the interface) is shown in Fig. 3c. The uncertainty, as quantified by the mutual information (c, g, k), indicates the grain-boundary regions. Combining these two pieces of information makes it possible to identify the bulk and boundary regions. For the bulk regions, one can conduct further analysis: as an example, in (d, h, l), we determine for each local window the local lattice mismatch, which is defined as the mismatch angle between the real-space lattices reconstructed from the local window and from the reference image (shown below the heatmaps in d, h, l). This analysis is only conducted where it is meaningful, i.e., in bulk regions, in particular excluding high-uncertainty regions that are indicated by gray areas in (d, h, l). The scale bar is 1 nm in (a, e) and 2 nm in (i).

The segmentation obtained via AI-STEM’s predictions can now be used to conduct further analysis of the local lattice structure. In practice, to separate the image into bulk and interface regions, we fix a mutual-information threshold of 0.1, interpreting all local windows above this value as interface and the remaining ones as bulk. Depending on the type of region, i.e., interface or bulk, different quantities are suitable for analysis. As an example, we calculate here the local lattice orientation, a quantity that is only reasonable to compute in the bulk regions. Specifically, for each local window, we reconstruct52 the real-space lattice from the atomic columns and determine53,54 the angle of misalignment with respect to a reference training image, or more precisely its reconstructed atomic columns (cf. Supplementary Methods for more details). Note that this way, information from the training data enters this analysis. Also note that the reference lattice is not required as input but is determined based on the NN assignments, which makes this procedure fully automatic and extendable (in case of retraining and new classes being added to the training set). The calculated angle is termed lattice mismatch and the results for the Cu grain boundary are shown in Fig. 4d. The reference images are shown below the heatmaps. For the interface region, depicted in gray in Fig. 4d, no calculation is performed. The expected misorientations are indicated by way of example in Fig. 4a and closely match the calculated values in Fig. 4d.
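The thresholding step can be sketched in a few lines. The threshold of 0.1 is the value stated above; everything else is illustrative: the mutual-information map and class map are toy arrays with one entry per local window, the function name `segment_bulk_interface` is hypothetical, and the label -1 for interface windows is an arbitrary choice.

```python
import numpy as np

MI_THRESHOLD = 0.1  # mutual-information threshold stated in the text

def segment_bulk_interface(mutual_information, class_map, threshold=MI_THRESHOLD):
    """Mark windows above the threshold as interface and keep class labels only in the bulk."""
    interface = mutual_information > threshold
    # Interface windows get the placeholder label -1; bulk windows keep their class.
    bulk_classes = np.where(interface, -1, class_map)
    return interface, bulk_classes

# Toy maps on a 5 x 4 grid of local windows: a horizontal "grain boundary" in row 2.
mi_map = np.full((5, 4), 0.02)
mi_map[2, :] = 0.35
class_map = np.zeros((5, 4), dtype=int)   # every window assigned to class 0

interface, bulk_classes = segment_bulk_interface(mi_map, class_map)
```

Follow-up analysis such as the lattice-mismatch calculation would then be restricted to windows where `interface` is False, mirroring the gray (excluded) areas in Fig. 4d.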

Next, we consider a HAADF image of Fe55 containing a grain boundary that is horizontally aligned and separates two crystalline grains with [100] orientation (cf. Fig. 4e). Compared to the previous example, this image contains intensity variations of both the background and the atomic columns, which are more pronounced, for instance, in the upper left part than in the lower part of the image. Such variations are common in experimental images and may stem from surface damage induced during sample preparation or from surface oxide formation. Nevertheless, AI-STEM correctly classifies the bulk regions as bcc [100] (cf. Fig. 4f), while the assignment changes at the grain boundary, but with increased uncertainty (cf. Fig. 4g). The uncertainty also increases in the noisier parts of the image in the upper left. The obtained mismatch angles are shown in Fig. 4h; here, too, the calculated and expected angles (again indicated in the original image in Fig. 4e) are in agreement.

Finally, we investigate a low-angle [0001] tilt grain boundary in Ti (cf. Fig. 4i), which consists of a periodic array of dislocations with a line direction perpendicular to [0001]. Hence, the interface structure is qualitatively different from the previously shown high-angle grain boundaries for Cu and Fe. In particular, the smaller misorientation angle between both grains in the Ti image leads to regions within the interface where the atomic lattices of the two grains are still connected with each other. AI-STEM correctly assigns hcp [0001] (cf. Fig. 4j), with only a few outliers in the classification at the grain boundary, which is again revealed via the mutual information (cf. Fig. 4k). One can observe that the mutual information decreases in the regions between the grain-boundary dislocations, where the lattice resembles that of undisturbed Ti [0001], and increases at the locations of the dislocation cores at the interface. This shows that the uncertainty estimate of the predictions can even be used to locate more confined lattice defects such as individual dislocations. Similar to the previous two examples, we obtain the local lattice mismatch (cf. Fig. 4l), which matches the expectations (cf. Fig. 4i) within a margin of a few degrees.

Analyzing AI-STEM’s internal representations via unsupervised learning

So far, we have demonstrated how AI-STEM can be used to classify lattice symmetry and orientation and that it is capable of detecting interfaces and even individual dislocations within an interface. To understand how the model interprets crystalline grains and interface regions, we apply unsupervised learning to the internal NN representations. Specifically, we employ manifold learning to embed the high-dimensional NN representations into two-dimensional, readily interpretable maps. We employ Uniform Manifold Approximation and Projection (UMAP)56, which approximates the manifold that underlies a given dataset and allows the construction of low-dimensional embeddings that can capture both global and local relationships among the original, high-dimensional data points. We consider the experimental images shown in Fig. 4a, e, i, and compute the NN representations for each of the local windows, as determined within the AI-STEM workflow (cf. Fig. 1b). In particular, we inspect the last fully connected layer before the output layer, i.e., before the classification is conducted (cf. Fig. 2b). The two-dimensional UMAP embedding is shown in Fig. 5a, where the color scale corresponds to the NN assignments. Despite the high level of compression, from 128 to 2 dimensions, all three images are well separated. For each image, two sub-clusters can be observed that correspond to the two bulk grains (cf. Fig. 5a). These are joined by contiguous strings that correspond to the respective interface regions. This is also visualized by using the mutual information as a color scale (cf. Fig. 5b), where increased uncertainty can be observed along the strings (which indicates the presence of the defects). Notably, the different grain boundary types (e.g., high angle vs. low angle) are also mapped to different regions in the map.
This demonstrates the capability of AI-STEM to not only recognize bulk symmetry and orientation but also to distinguish different interface types – even though it has never been provided with explicit examples for such a task during training.

Fig. 5: Visualizing neural-network representations of local crystalline and defective atomic structure in experimental images.

For each of the three experimental images in Fig. 4, we apply the fragmentation procedure of AI-STEM (cf. Fig. 1b) and extract the neural-network representations of these local windows (for the last fully connected layer before the classification, cf. Fig. 2b). The dimension reduction (via Uniform Manifold Approximation and Projection, UMAP for short) of these high-dimensional NN representations is shown in (a, b), where in (a) the color scale corresponds to the AI-STEM assignments, and in (b) the color scale corresponds to the mutual information that quantifies model uncertainty. The three images are mapped to three separate, connected regions. In each of these, two clusters can be seen that correspond to the crystalline grains, while the connections between them indicate the grain boundary region. Notably, the boundary regions, which correspond to distinct interface types and are of critical importance for the material properties, do not intersect and are thus not confused by AI-STEM.

Discussion

In this work, we propose AI-STEM, which automatically characterizes crystal structure and interfaces in simulated and experimental atomic-resolution STEM datasets. This is enabled by adapting several techniques: we employ signal-processing tools to represent imaging data, deep learning to identify crystal symmetry and orientation, and Bayesian modeling in combination with information theory to estimate model uncertainty as well as to optimize NN hyperparameters. At the core of AI-STEM is a Bayesian convolutional neural network, which goes beyond standard NN models by providing both classifications and principled uncertainty estimates. The former allow identification of lattice symmetry and crystal orientation, while the latter are used to segment an image into bulk and interface regions. Despite being trained only on simulated STEM images of perfect lattice structures, AI-STEM generalizes to experimental images, as demonstrated by several challenging examples. The training data can be obtained by discrete multislice image simulations that consider dynamical scattering effects; in this work, however, we show that a fast convolution approach can be employed instead. In order to verify the applicability of the labeling procedure, a diverse set of simulated images of typical monocrystalline structures is generated, serving as reliable ground truth. Based on the segmentation provided by AI-STEM’s prediction, one can conduct additional analyses that reveal further characteristics of the identified regions. Here, we determine the local lattice rotation in the crystalline grains. Using unsupervised learning, we demonstrate that different types of interfaces appear separated in the internal NN space, even though no explicit information on any interface pattern is provided during training. This analysis also shows how unsupervised learning can be used to explain a black-box model in post-hoc fashion57,58,59.
Moreover, on-line data processing is feasible with the proposed approach, since the method is easy to parallelize and, even with a single GPU, the processing time is within the range of typical acquisition times (cf. Supplementary Fig. 4).

Furthermore, note that the experimental images presented in Fig. 4 are oriented close to a perfect zone axis within the experimental limit, since the aim is to resolve the atomic structure of the interfaces with the highest possible precision. While this might be considered an idealized scenario, the results of Fig. 4 at least constitute a test for small deviations in crystal tilt, which are typically present in experimental images. Since no information on such tilts was included in the training dataset, the model already proves robust at this level. We have conducted further tests for larger tilt deviations of the adjoining crystals in Supplementary Fig. 5. The model provides the expected assignments even if only lattice fringes are resolved, although the accompanying high uncertainty values require a careful interpretation of the prediction. The model robustness may be further improved by including crystal tilt variations as additional parameters in the training set.

Since various types of noise, such as scan and detector read-out noise, are typically present in STEM images, we further tested the applicability of AI-STEM for different noise levels, as shown in Supplementary Fig. 6. Here, we applied AI-STEM to images containing different degrees of primarily fast scan noise. Specifically, we consider the experimental image in Fig. 4a as a reference example, in which fast scan noise is reduced by frame averaging, and show that the number of averaged frames does not influence AI-STEM’s performance. Even a single-shot frame with high noise contributions is correctly classified, with low uncertainty values in the prediction of the bulk crystal regions (see Supplementary Fig. 6).

In the future, it would be interesting to provide not only a bulk-versus-interface segmentation but also predict additional details automatically, e.g., how the crystalline grains differ. Currently, this can only be done by additional analysis, e.g., based on reconstructing the (projected) real-space lattice. However, one can see from the latent space visualization in Fig. 5 that grains with different orientation are separated. In principle, a clustering algorithm may be employed to separate the grains, while this can be challenging to automate as clustering typically involves several parameter choices that are not guaranteed to generalize well. Alternatively, one may consider a multi-label classification problem or construct a separate machine-learning model to predict the (local) lattice rotation automatically.

In conclusion, our method shows great potential to automatically analyze and classify crystallographic attributes in STEM datasets without human intervention. In electron-microscopy research, the development of a “self-driving” microscope appears on the horizon due to rapid advances in artificial intelligence60,61. While we focus on mono-species systems as a proof of concept, this work paves the way to autonomous investigations of complex nanostructures at the atomic level.

Methods

AI-STEM parameters

Besides the classification model, the two most important parameters are the stride and the box size. For the box size, we recommend a value of 12 Å, on which the model is trained. If significantly larger window sizes are necessary for a desired application, the practical approach is to augment the dataset using our efficient training procedure and retrain the model. Also note that the model is trained for a specific resolution, in which 1 pixel corresponds to 0.12 Å. For different resolutions, one may simply rescale the image or, as we proceeded here, adjust the window size. For instance, the Cu image in Fig. 4a is measured at a resolution of about 0.0880 Å per pixel, while the other images in Fig. 4 are measured at 0.1245 Å per pixel. To match both resolutions to the training range, we increase the box size to 136 pixels for Cu (as it is measured at higher resolution, i.e., we need to increase the box size to obtain a number of atomic columns that is comparable to the training set), and use 96 pixels for the other two images (as they are recorded at lower resolution, i.e., smaller windows are required to obtain a number of atomic columns that is comparable to the training set). In principle, our data-generation method also allows the resolution to be varied, such that retraining with various resolutions could be done as well. For the stride, we use values on the order of 1 Å to demonstrate the high-resolution capabilities of the approach. Larger strides can suffice to reveal the main characteristics, cf. Supplementary Fig. 3 (in particular, it is possible to separate an image into bulk and interface regions). For the synthetic image in Fig. 3, we employ a stride of 12 pixels, corresponding to ~1.4 Å. The same settings were used for the experimental images of Ti and Cu (Fig. 4a, i). For Fe (Fig. 4e), the stride was halved as this image is smaller (about half the size of Ti, and two thirds of Cu), enabling a comparable number of local fragments.
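The relation between resolution, box size and stride described above can be illustrated with a minimal sketch. It assumes that the box size in pixels is obtained by rounding the ratio of the 12 Å training box to the pixel size, and that fragments are cut on a regular grid; the function names `box_size_in_pixels` and `extract_windows` and the random test image are hypothetical and not part of the AI-STEM code.

```python
import numpy as np

TRAIN_BOX_ANGSTROM = 12.0  # box size on which the model is trained (from the text)


def box_size_in_pixels(angstrom_per_pixel, box_angstrom=TRAIN_BOX_ANGSTROM):
    """Number of pixels spanning the training box size at a given resolution.

    Assumption: simple rounding of box size / pixel size.
    """
    return int(round(box_angstrom / angstrom_per_pixel))


def extract_windows(image, box_px, stride_px):
    """Slide a square window over a 2D image.

    Returns the stacked fragments and the (row, col) position of the
    top-left corner of each fragment.
    """
    rows = range(0, image.shape[0] - box_px + 1, stride_px)
    cols = range(0, image.shape[1] - box_px + 1, stride_px)
    positions = [(r, c) for r in rows for c in cols]
    windows = np.stack([image[r:r + box_px, c:c + box_px] for r, c in positions])
    return windows, positions


# Resolutions quoted in the text for the experimental images.
cu_box = box_size_in_pixels(0.0880)     # Cu image (Fig. 4a)
fe_ti_box = box_size_in_pixels(0.1245)  # Fe and Ti images

# Hypothetical 400 x 400 pixel image, used only to demonstrate the fragmentation.
image = np.random.default_rng(0).random((400, 400))
windows, positions = extract_windows(image, cu_box, stride_px=12)
print(cu_box, fe_ti_box, windows.shape)
```

With these assumptions the computed box sizes reproduce the values of 136 and 96 pixels given above, and a stride of 12 pixels at 0.12 Å per pixel corresponds to 1.44 Å.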

FFT-HAADF descriptor

We start from the periodic arrangement of atomic columns in HAADF-STEM images. These are acquired in low-index crystallographic orientations, which directly represent the underlying projected crystal symmetry. In the AI-STEM workflow, an input image corresponds to a local fragment or window, extracted from a larger image. The cutting procedure may lead to boundary effects, e.g., truncated atomic columns. These can cause spurious patterns in the FFT, which is why we apply a window function to the STEM HAADF image before calculating the FFT, a standard practice in signal processing62. Here, we use the Hann window, which provides a smooth decay at the image boundaries. Then, the FFT is calculated, resulting in spectra with a dominant central peak that suppresses possibly valuable information at higher frequencies. Thus, we apply a thresholding scheme: the FFTs are normalized to the range [0, 1] and then all values above 0.1 are set to 1.0. This visibly enhances the peak patterns around the central peak, as visualized for all classes in this work in Supplementary Fig. 1.
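The descriptor steps listed above (Hann window, FFT, normalization, thresholding) can be sketched with NumPy as follows. This is an illustration under stated assumptions rather than the AI-STEM implementation: the function name `fft_haadf_descriptor` is hypothetical, the FFT amplitude is used as the spectrum, min-max scaling is assumed for the normalization, and any resizing to the 64 × 64 pixel network input is omitted because it is not specified in this section.

```python
import numpy as np


def fft_haadf_descriptor(window, threshold=0.1):
    """Sketch of the descriptor steps for a single 2D image fragment."""
    window = np.asarray(window, dtype=float)

    # 2D Hann window (outer product of two 1D Hann windows) to damp
    # truncated atomic columns at the fragment boundaries.
    hann = np.outer(np.hanning(window.shape[0]), np.hanning(window.shape[1]))

    # FFT amplitude with the zero-frequency component shifted to the centre
    # (assumption: the amplitude spectrum is the quantity of interest).
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(window * hann)))

    # Normalize to [0, 1] (assumption: min-max scaling).
    span = spectrum.max() - spectrum.min()
    if span > 0:
        spectrum = (spectrum - spectrum.min()) / span
    else:
        spectrum = np.zeros_like(spectrum)

    # Thresholding as described in the text: values above 0.1 are set to 1.0.
    spectrum[spectrum > threshold] = 1.0
    return spectrum


# Toy square lattice with a period of 10 pixels (hypothetical test input).
yy, xx = np.indices((100, 100))
lattice = 2.0 + np.cos(2 * np.pi * xx / 10) + np.cos(2 * np.pi * yy / 10)
descriptor = fft_haadf_descriptor(lattice)
```

For the toy lattice, the central peak and the first-order peaks at ±10 frequency bins are saturated to 1.0, while the remaining background stays at or below 0.1.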

Neural network training

The CNN is trained on 31,470 images of 64 × 64 pixels (the FFT-HAADF descriptors of the STEM HAADF images). The dataset is split into training and test sets in stratified fashion (via scikit-learn, using a random state of 42; see Data availability for the dataset link). Adam optimization is employed for training63. The CNN is implemented using TensorFlow64. Hyperparameters are optimized using Bayesian optimization, specifically the Tree-structured Parzen estimator (TPE) algorithm as provided by the Python library hyperopt50. We experimented with both the validation loss and the accuracy as the objective and found no significant difference in terms of classification accuracy. We chose the validation loss as the objective function to be minimized. We tested different configuration spaces for the network architecture and optimization parameters, including number of layers, number of filters, filter size, dropout ratio as well as batch sizes (example notebooks are provided, cf. “Data availability”). The models typically converge to near-perfect accuracy in a few epochs, and we find that we can restrict the search to smaller configuration spaces, reducing the computational cost. We fix the architecture to 6 layers (number of filters: 32, 32, 16, 16, 8, 8) and focus on the search for the right kernel size (3 × 3, 5 × 5, 7 × 7) as well as the dropout ratio (values between 2 and 10 percent, step size 1 percent). In particular, the choice of dropout ratio is known to be important for the quality of the uncertainty estimates40. We run the TPE algorithm for 25 iterations. Each model is optimized for 25 epochs, saving only the model with the best validation accuracy. These models all achieve near-perfect accuracy (99.9% classification accuracy on both training and validation set), but their uncertainty estimates differ. We thus select the model that has the highest median uncertainty in the amorphous region in the synthetic polycrystal example (Fig. 3), where we expect a low degree of crystallinity. The model chosen in this fashion is reported in Table 1.

Uncertainty quantification

Given the test point x, the mutual information between the predictions and the model posterior p(ω | Dtrain) is defined as34,40,65

$${\mathbb{I}}\left[y,{{{\boldsymbol{\omega}}}}| {{{\bf{x}}}},{D}_{{{\text{train}}}}\right]:= {\mathbb{H}}[y| {{{\bf{x}}}},{D}_{{{\text{train}}}}]-{{\mathbb{E}}}_{p({{{\boldsymbol{\omega}}}}| {D}_{{{\text{train}}}})}\left[{\mathbb{H}}[y| {{{\bf{x}}}},{{{\boldsymbol{\omega}}}}]\right].$$
(4)

The first term on the r.h.s. is termed predictive entropy40. It quantifies the (average) information in the distribution of predictions and is defined by

$${\mathbb{H}}[y| {{{\bf{x}}}},{D}_{{{\mbox{train}}}}]:= -\mathop{\sum}\limits_{c}p(y=c| {{{\bf{x}}}},{D}_{{{\mbox{train}}}})\log p(y=c| {{{\bf{x}}}},{D}_{{{\mbox{train}}}}).$$
(5)

The second term on the r.h.s. of Eq. (4) is defined as

$$\begin{array}{l}{{\mathbb{E}}}_{p({{{\boldsymbol{\omega}}}}| {D}_{{{\text{train}}}})}\left[{\mathbb{H}}[y| {{{\bf{x}}}},{{{\boldsymbol{\omega}}}}]\right]:= -{{\mathbb{E}}}_{p({{{\boldsymbol{\omega }}}}| {D}_{{{\text{train}}}})}\left[\mathop{\sum}\limits_{c}p(y=c| {{{\bf{x}}}},{{{\boldsymbol{\omega }}}})\log p(y=c| {{{\bf{x}}}},{{{\boldsymbol{\omega }}}})\right].\end{array}$$

One may refer to this as expected entropy as it averages the entropy of the predictions given the parameters ω that are distributed according to the posterior distribution66. Using Monte Carlo dropout, one can approximate the mutual information as34

$$\begin{array}{l}{\mathbb{I}}\left[y,{{{\boldsymbol{\omega }}}}| {{{\bf{x}}}},{D}_{{{\mbox{train}}}}\right]\approx \\ -\mathop{\sum}\limits_{c}\left(\frac{1}{T}\mathop{\sum}\limits_{t}p\left(y=c| {{{\bf{x}}}},{{{{\boldsymbol{\omega }}}}}_{t}\right)\right)\log \left(\frac{1}{T}\mathop{\sum}\limits_{t}p\left(y=c| {{{\bf{x}}}},{{{{\boldsymbol{\omega }}}}}_{t}\right)\right)\\ +\frac{1}{T}\mathop{\sum}\limits_{c}\mathop{\sum}\limits_{t}p\left(y=c| {{{\bf{x}}}},{{{{\boldsymbol{\omega }}}}}_{t}\right)\log p\left(y=c| {{{\bf{x}}}},{{{{\boldsymbol{\omega }}}}}_{t}\right).\end{array}$$
(6)
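The Monte Carlo estimate in Eq. (6) can be written as a short NumPy function. This is a minimal sketch: it assumes that the T stochastic forward passes (dropout kept active at prediction time) have already been collected into an array of class probabilities of shape (T, C); the function name `mutual_information` and the small constant `eps`, added inside the logarithm for numerical stability, are not part of the original formulation.

```python
import numpy as np


def mutual_information(probs, eps=1e-12):
    """Approximate mutual information following Eq. (6).

    probs: array of shape (T, C) holding p(y = c | x, omega_t) for
           T stochastic forward passes and C classes.
    eps:   hypothetical stabilizer that avoids log(0).
    """
    probs = np.asarray(probs, dtype=float)

    # Predictive entropy: entropy of the prediction averaged over passes.
    mean_probs = probs.mean(axis=0)
    predictive_entropy = -np.sum(mean_probs * np.log(mean_probs + eps))

    # Expected entropy: average over passes of the per-pass entropy.
    expected_entropy = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))

    return predictive_entropy - expected_entropy


# All passes agree: the model is consistent, so the mutual information vanishes.
agree = np.tile([0.9, 0.1], (5, 1))

# Passes are individually confident but contradict each other: the mutual
# information approaches log(2) for two classes.
disagree = np.array([[1.0, 0.0], [0.0, 1.0]])

mi_agree = mutual_information(agree)
mi_disagree = mutual_information(disagree)
```

The two toy inputs illustrate why this quantity is suited to flag interface regions: it is large only when the sampled models disagree, not merely when a single prediction is spread over several classes.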

Details on training data generation

For each of the 10 surface classes, we consider a small interval of ±0.1 Å around their respective experimental lattice parameters. This is because some of the classes can be similar (a consequence of the 2D projection provided by STEM images), for instance fcc 100 and bcc 100 (cf. Fig. 2a). The lattice parameters are the following: for all Cu fcc single crystals, the lattice constant a is 3.63 Å; for all Fe bcc single crystals, the lattice constant a is 2.87 Å; for all Ti hcp single crystals, the lattice constants a and c are 2.95 Å and 4.68 Å, respectively (c/a ~ 1.587). For each of these classes, we include a range of rotations (0–90 degrees, step size 5 degrees, using the Python package scipy67). Then, different noise sources are applied, as implemented in the Python package scikit-image68. First, shear is applied to all images for all rotations (an affine transformation using only shear, without scaling or translation). We then apply additional noise sources to a subset of the data points (only every second rotated and sheared image, keeping the dataset size below 100k): Gaussian blurring (scanning a Gaussian filter of a certain width over the image) and, finally, the addition of random noise (Gaussian or Poisson). Visual examples are provided in Supplementary Fig. 2.

Simulation of STEM dataset

To generate the ground truth STEM datasets consisting of HAADF images, STEM image simulations were performed with the abTEM software package47. The crystal orientations shown in Fig. 2 for the ten classes are generated with the atomic simulation environment (ASE) Python module69. The thickness of all simulation cells was set to 8 nm (z-direction) with a slice thickness of 0.2 nm. The x- and y-dimensions of the simulation cells were each chosen to be ~8 nm. An acceleration voltage of 300 kV, a probe semi-convergence angle of 24 mrad and semi-collection angles of the HAADF detector ranging from 78 to 200 mrad were used for the simulations. The pixel size was fixed to 12 pm resulting in images with ~64 × 64 pixels. Thermal diffuse scattering was considered by using 12 frozen phonon configurations with a root-mean-squared thermal displacement according to the Debye-Waller factors obtained from Peng et al.70 for Cu, Fe and Ti at 280 K. The image simulations were performed on a Windows 11 Pro based workstation with an Intel Xeon CPU, 32 GB of RAM and an NVIDIA Quadro K1200 GPU. The total simulation time was ~11 h for the Cu-fcc class, ~26 h for the Fe-bcc class and ~13 h for the Ti-hcp class.

To speed up the training dataset generation, we also employed a simple convolution approach where the probe wave function generated in abTEM47 was convolved with the summed projected potentials for each cell. This reduces the total calculation time for each class to several minutes. We employ this approach for training the CNN model, demonstrating that this computationally efficient approach can yield strong performance on experimental images (cf. Fig. 4).
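The idea behind the convolution approach can be sketched in a few lines of NumPy. This is a toy illustration and not the procedure used with abTEM: a Gaussian profile stands in for the simulated probe, the projected potential is reduced to point-like atomic columns on a square grid, and the column spacing, probe width and the function name `convolve_probe_with_potential` are all hypothetical. The convolution is evaluated via FFTs and is therefore periodic.

```python
import numpy as np


def convolve_probe_with_potential(projected_potential, probe):
    """Periodic 2D convolution of a projected potential with a probe profile.

    Both inputs are 2D arrays of equal shape; the probe is assumed to be
    centred in its array.
    """
    kernel = np.fft.ifftshift(probe)  # move the probe centre to index (0, 0)
    image = np.fft.ifft2(np.fft.fft2(projected_potential) * np.fft.fft2(kernel))
    return image.real


n = 64           # image size in pixels
spacing_px = 16  # hypothetical distance between atomic columns
sigma_px = 3.0   # hypothetical probe width (Gaussian stand-in)

# Point-like atomic columns on a square grid (toy projected potential).
potential = np.zeros((n, n))
potential[8::spacing_px, 8::spacing_px] = 1.0

# Normalized Gaussian stand-in for the probe, centred in the array.
yy, xx = np.indices((n, n))
probe = np.exp(-((xx - n // 2) ** 2 + (yy - n // 2) ** 2) / (2 * sigma_px ** 2))
probe /= probe.sum()

image = convolve_probe_with_potential(potential, probe)
```

Because only a pair of FFTs is needed per image, such a scheme is far cheaper than a multislice calculation with frozen phonon configurations, which is consistent with the reduction from hours to minutes reported above; dynamical scattering is not captured.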

Scanning transmission electron microscopy experiment

All experimental STEM data were acquired using a probe-corrected Titan Themis 60-300 (Thermo Fisher Scientific). The TEM is equipped with a high-brightness field emission gun and a gun monochromator. The electrons were accelerated to 300 kV and images were recorded at a probe current of 80 pA with a high-angle annular dark field (HAADF) detector (Fischione Instruments Model 3000). The collection angles for the HAADF images were set to 73–200 mrad using semi-convergence angles of 17 mrad and 23.8 mrad. Image series with 20–40 images and a dwell time of 1–2 μs were acquired, registered and averaged in order to minimize the effect of instrumental instabilities and noise in the images.

Experimental HAADF-STEM images of a Σ19b (178)[111] tilt grain boundary in Cu with a misorientation angle of ~48°, a Σ5 (013)[001] tilt grain boundary in Fe with a misorientation angle of ~38° and a low-angle [0001] tilt grain boundary in Ti with a misorientation angle of ~13° are used to test the AI-STEM approach. Details on the sample fabrication and preparation for the Cu, Fe and Ti grain boundary images can be found in refs. 13, 55 and 71, respectively.