Experimental methodologies for three dimensional tomography of the internal microstructure of materials has been refined considerably in the past few decades1,2,3,4,5,6,7,8, growing to include a variety of modalities such as X-ray computed tomography (CT), optical imaging, electron imaging, energy dispersive X-ray spectroscopy (EDS), and electron back-scattered diffraction (EBSD) data, among others. In all of these cases, a volume of material is interrogated slice by slice, and these slices are then stacked and reconstructed using a variety of software tools. Three dimensional microstructure data, while often laborious to generate compared to collecting a single 2D section, can provide unique insight, including things like the topology of microstructural features like grains, pores, precipitates and the true shape and size distributions of such features. Properties such as fatigue and oxide transport are sensitive to the three dimensional arrangement of these features9,10,11. And novel manufacturing processes like additive manufacturing also demand three dimensional characterization of microstructure to fully link the processing to the structure3,12.

While advancements have been made in robustness and closed loop collection of data2, corruptions in data can still happen. This may not be as significant of a problem in non-destructive techniques where the data can be recollected, but in destructive techniques like serial sectioning, it may often be the case that a few slices out of a thousand have no data or are of very poor quality, lowering the accuracy of the reconstruction. The corruptions can happen for a number of reasons including removing more material (via mechanical polishing or laser/ion abalation) than planned, the electron source nearing the end of its life resulting in low signal images, the magnification on the microscope being incorrect for a slice, or the brightness/contrast settings result in an over/under saturated image. While adjustments to the control software can be made to prevent these and other issues, it is hard to a-priori imagine every reason a slice or subset of data can be corrupted.

However, if most of the data was collected properly, we hypothesize it should be possible to infer a substantial amount of the missing data. The current common practice is to fill in missing slices is to just copy the layer above or below the missing slice2. This is reasonable in most serial sectioning cases where the slices are being collected at a high enough frequency that most of the data stays the same from slice to slice. Still, there is some room for improvement on nearest neighbor replacement type approaches, and the transformer model, which has been used to fill in text data, is a tantalizing framework for also filling in missing data in sequential image data.

Since its introduction13, transformers have been the backbone of many impressive breakthroughs in natural language processing (NLP). In their general form, transformers are excellent at processing sequential information in parallel. Consequently, its use has spread across multiple domains, such as in computer vision14, speech processing15, and bioinformatics16. In our case, transformers are appealing to EBSD data because the readings are inherently sequential as it represents a real physical structure.

Early large transformer models such as BERT17 and GPT18 were pretrained on copious amounts of text data with the intention of learning features to model the language through associations between words and phrases with a collection of self-supervised tasks19. Unlike supervised tasks, such as classification, which requires labeled datasets, self-supervised tasks involve automatically generating labels from unlabeled data, such as unscrambling a sequence of shuffled sentences, which force the model to vaguely learn the structure of the data. This technique remained fairly intact even when scaling to billion-parameter large language models (LLMs), enormous models that process much longer sequences over the course of hundreds of billions of tokens (language units comprising of a sequence of characters)20,21,22. Taking inspiration from a similar popular self-supervised task in NLP, our training procedure involves randomly masking a slice in an EBSD volume and having the model predict the masked slice.

Our slice recovery task is closely related to that of masked image modeling, masked autoencoders, and inpainting in computer vision23,24,25,26,27. Even though these may be applicable and there exists plenty of transformers in computer vision14,28, we want our model to also be computationally efficient as we scale our method and leverage the fact that EBSD volumes have sparse structures29, the dynamics of which operate differently from typical images or videos. Regarding efficiency, the fact that EBSD produces high dimensional data means the final model’s computational footprint cannot be ignored.

Taking these into consideration, our contributions are the following:

  1. 1.

    We propose a novel method to recover missing EBSD data consisting of a scalable transformer model and straightforward projection algorithm that produces superior results compared to existing methods. This is accentuated when zoning in on the recovery accuracy of grain boundaries where other methods appear to perform poorly.

  2. 2.

    We demonstrate that despite being trained solely on synthetic data, our transformer can generalize to real EBSD data without additional training while still outperforming all the baselines. This robustness to out-of-distribution EBSD data overcomes a major limitation of relatively low amount of available real EBSD data that deep learning requires.

  3. 3.

    Our results suggest that serial sectioning experiments using similar experimental parameters to those in our test datasets, collection time can effectively be slashed by up to 25% with very little error in predicted voxel orientation, as the collection of every fourth slice can be bypassed. With the current lengthy procedure of EBSD, this is a significant improvement in efficiency.


Terminology and notation

For this paper, we use bold capital letters (e.g. \(\varvec{A}\)) for matrices and capital calligraphic letters (e.g. \(\varvec{\mathscr {A}}\)) for tensors. Scalars are represented by plain uppercase and lowercase letters (e.g. A and a). Tuples are used to describe the shape of a structure, so if \(\varvec{\mathscr {A}}\in \mathbb {R}^{N_1 \times N_2 \times N_3}\), then \(\varvec{\mathscr {A}}\) is of shape \((N_1, N_2, N_3)\) and vice versa. A dimension is an index (starting at 1) in the shape tuple, and the size of a dimension i is the value of the shape tuple at that index i. When we refer to multiple volumes that share the same first k dimensions (e.g. \(\varvec{\mathscr {A}}\in \mathbb {R}^{N_1 \times N_2}\), \(\varvec{\mathscr {B}}\in \mathbb {R}^{N_1 \times N_2 \times N_3}\), and \(\varvec{\mathscr {C}}\in \mathbb {R}^{N_1 \times N_2 \times N_3'}\)), we use \(*\) to capture all remaining and possibly different dimensions (e.g. \(\varvec{\mathscr {A}}\), \(\varvec{\mathscr {B}}\), and \(\varvec{\mathscr {C}}\) have shape \((N_1, N_2, *)\)).


In this section, we outline the technical details of the encoder-only transformer based on its first introduction13. The same source describes encoder-decoder and decoder-only transformers for interested readers.

Input processing

The first layer of a transformer is the embedding layer. For a single length N sequence of \(D_{in}\)-dimensional vectors, \(\varvec{X}\in \mathbb {R}^{N \times D_{in}}\), each vector is transformed into a vector of size \(D_{out}\). If the input is a sequence of tokens such as in language tasks, then \(D_{in} = 1\). The transformation could be a linear one, or a more complex nonlinear one such as another neural network. The result is an embedded sequence of shape \((N, D_{out})\). Next, positional encoding is added to the embedding to inject positional information to each vector in the sequence. Positional encodings can either be fixed or learned.

Multi-head self-attention

The idea of self-attention is to find varying degrees of associations between elements in a sequence. Transformers typically implement multi-headed self-attention within each encoder layer, which allows the model to learn different types of associations. Let \(\varvec{X}\in \mathbb {R}^{N \times D}\) be the input sequence into the attention mechanism and \(\varvec{W}_h^Q, \varvec{W}_h^K, \varvec{W}_h^V \in \mathbb {R}^{D \times \frac{D}{H}}\) for head \(h = 1, 2, \dots , H\). Additionally, let \(\sigma\) be the row-wise softmax operator. Then, the attention for head h, \(\varvec{A}_h\) is defined as

$$\begin{aligned} \varvec{A}_h = \sigma \left( \frac{\varvec{Q}_h \varvec{K}_h^\top }{\sqrt{D / H}} \right) \varvec{V}_h, \end{aligned}$$

where \(\varvec{Q}_h = \varvec{X}\varvec{W}_h^Q, \varvec{K}_h = \varvec{X}\varvec{W}_h^K, \varvec{V}_h = \varvec{X}\varvec{W}_h^V\) are the query, key, and value matrices, respectively. Next, the results are concatenated and linearly transformed to produce the output \(\varvec{Y}= [\varvec{A}_1 \cdots \varvec{A}_H] \varvec{W}_O\) for \(\varvec{W}_O \in \mathbb {R}^{D \times D}\). Note that \(\varvec{A}_1, \dots , \varvec{A}_H\) can be computed in parallel.

The main source of inefficiency is the computation of \(\varvec{Q}_h \varvec{K}_h^\top\) which levies a computational and memory cost of \(\mathscr {O}(N^2)\). Consequently, the utility of transformers can be limited for long sequences without proper hardware. This becomes a dire issue as the dimensionality of the input grows, such as in images and videos which can be flattened into a long sequence. Thankfully, there has been substantial work in reducing the complexity with more efficient attention approximation mechanisms, such as Linformer30, Reformer31, Big Bird32, and many others33.

Axial self-attention

When it comes to high-dimensional data, such as the EBSD volumes, the sequence length explodes if we flatten (or patchify in the case of many transformers in computer vision14) the input and apply vanilla self-attention, making attention computation a serious bottleneck. Furthermore, since we observe a lot of structure with EBSD data, it is reasonable to utilize some simplified attention mechanism. Our model uses axial attention34,35 which runs the self-attention mechanism along the dimensions of the input tensor. For instance, in a cube, each voxel only attends to voxels in the same row, column, or depth. This greatly reduces the amount of computation and memory, especially for higher order tensors. Intuitively, axial attention is appropriate because we hypothesize that since local information can be quite uniform (nearby voxels are likely to be in the same grain), long range information should also be included. With axial attention, it is highly likely to also obtain information in other grains and its boundaries.

In terms of implementation, the formula for multi-headed attention (1) can be reused. Let \(\varvec{\mathscr {X}}\in \mathbb {R}^{N_1, \dots , N_K, D}\) be a single K-dimensional input where \((N_1, \dots , N_K)\) defines the shape of the volume and D is the embedding size. As an example, suppose we are interested in finding axial attention along the k-th dimension for \(1<k<K\). With proper dimension permutation and flattening, we can reshape \(\varvec{\mathscr {X}}\) to a tensor of shape \((\prod _{i\ne k} N_i, N_k, D)\) and compute multi-headed attention by treating the first dimension as the batches. In our model, we repeatedly use the outputs of axial attention to compute axial attention along the next dimension.

Feedforward blocks

The next major component to create a transformer is the feedforward block, a multilayer perceptron with a single hidden layer. The activation function is often either a Gaussian Error Linear Unit (GELU) or Rectified Linear Unit (ReLU) function, and although the input and output sizes are identical, the hidden size can be much larger. Consequently, although self-attention is a computational bottleneck, the feedforward blocks are the dominant source of parameters in the model.

Putting it all together

With all these components, we are ready to define the general architecture of a transformer. First, an input is processed by an embedding layer and positional encoding layers at the start. Then, the input progresses into the encoder, a sequence of alternating self-attention and feedforward blocks. For training stability, residual connections and normalization layers are inserted in between each attention block and feedforward block. Finally, the features from the encoder are linearly projected to be the desired shape of the output. A visualization of a vanilla and axial transformer, which mainly differ in their attention implementation, is depicted in Fig. 1.

Figure 1
figure 1

Transformer architecture (left) with either vanilla attention (center) or axial attention (right). The axial transformer is produced by simply substituting full attention with axial attention. Although axial attention has significantly more layers, it scales much more favorably for high-dimensional data. Attention layers take three inputs for the query, key, and value.


We propose a transformer model to learn missing slices of EBSD data, followed by a projection step to smooth out the voxel values. Described in more detail in the following sections, our methodology is summarized in Fig. 2. Due to limited real EBSD data, our goal is to train this model on a large and diverse synthetic dataset and demonstrate that it can generalize to real EBSD data, which we evaluate on two nickel superalloy EBSD volumes, one for alloy IN62536,37 and one for alloy IN71838,39. The following repository contains the code for our method:

Figure 2
figure 2

Method overview. The transformer takes in a sequence of cubochoric EBSD slices with one unobserved slice as input to produce an output of identical shape where the index of the unobserved slice contains the orientation predictions along the grain boundaries which are the sole contributors to the loss function. The output is then processed by a projection step that assigns each voxel to a grain based on its predicted orientation and its neighboring voxels.

Data description and preparation

Each dataset (both synthetic and experimental) includes orientation information at every voxel in a 3D image. For the experimental data, a substatial amount of preprocessing was done to handle the alignment of the data and clean up noise; a complete description of the preprocessing may be found in Chapman et al.2 and Stinville et al.38 In particular, we remove any grains smaller than 27 voxels (\(3^3\)) and average orienations per grain. While operating on grain-average orientations simplifies the orientation prediction problem, it does not represent the real complexity of orientation fields, which often exhibit subtle local variations. We chose to focus on grain-averaged orientations in this first implementation primarily to simplify our initial interpretation of the transformer predictions. Additionally, the publicly available version of the IN625 dataset only contained grain-averaged orientations. The original volume containing Euler angles is of shape \((N_1, N_2, N_3, 3)\), where \(N_1, N_2, N_3\) are the physical dimensions, and the final dimension represents the 3 Euler angles needed to define an orientation. Additional arrays are then computed from the input data:

  • Cubochorics: A volume of shape \((N_1, N_2, N_3, 3)\) where the last dimension contains the cubochoric coordinates40 converted from the original Euler angles at each voxel. Cubochoric coordinates are chosesn since the Euclidean metric is used for regressing the transfomer model. The Euclidean distance between points in Euler angle space does not necessarily relate to the angular distance between those points. While this is also true for points in cubochoric space, as the cubochoric representation is an equal-volume mapping of SO(3) onto a grid as opposed to an equal-angle mapping, the Euclidean distance in cubochoric space approximates the Euclidean distance between unit quaternions for small misorientations41. Since Euclidean distance is a valid metric in SO(3)42, we operate completely in cubochoric space, under the assumption that the Euclidean metric is a reasonable approximation for similarity between points in this space.

  • IDs: A volume of shape \((N_1, N_2, N_3)\) which assigns identification numbers (IDs) to denote which grain each voxel belongs to. Each ID number has a unique vector of cubochoric coordinates associated with it. While not needed during training, these ID numbers will be used to smooth our model outputs and evaluate model accuracy. For experimental data, the IDs are found by segmenting the grains using a misorientation tolerance2,38, and for synthetic data, the IDs are generated alongside the orientation data.

  • Boundaries: A volume of shape \((N_1, N_2, N_3)\) containing Boolean values indicating if a voxel is on the grain boundary. More specifically, a voxel is a boundary voxel is at least one of its neighbors is of a different ID number than itself, where two voxels are neighbors if they share a face.

To illustrate, Fig. 3 contains example slices of (scaled and shifted) Cubochorics and Boundaries of IN625 and IN718.

Figure 3
figure 3

Three consecutive crops of real EBSD slices from IN6252 (left) and IN71838 (right). The top row shows the cubochoric values that are scaled and shifted for better visualization, and the bottom row shows the same region in Boundaries. Note that the height and width of each slice here is 64 voxels and not the original shape.

Synthetic volumes are generated via DREAM.3D43. Each synthetic training and validation volume is of shape \((192, 192, 192, *)\) and \((64, 192, 64, *)\), respectively. Within DREAM.3D, we generate 9 training volumes while independently varying the mean grain size and mean transformations per grain. In particular, we generate volumes with mean grain sizes 2, 2.5, and 3 with no twins; we also generate volumes with mean twins frequencies 0, 1, 2, 3, 4, and 5 while fixing mean grain size to be 2.3. Due to the nature of the software, these parameters are unitless, as we can always synthetically increase or decrease the granularity of the volume. For example slices of these synthetic volumes, see Fig. 4. The grain size distributions for each dataset are shown in Fig. 4 as probability plots. The natural logarithm of the normalized grain sizes are shown; ideal lognormally distributed data would lie on straight lines in such plots. In general, the grain sizes are primarily lognormal near their means, with noted deviations from lognormality in their tails, which is a known phenomenon44.

Figure 4
figure 4

Example synthetic slices of cubochoric values that are scaled and shifted for visualization. Keeping mean grain size fixed, (ac) show generated slices when we specify the mean frequency of twins per grain to be 0, 2, and 4, respectively. With no twins, (d,e) show generated slices when we specify the mean grain size to be 2 and 3, respectively. All slices have a height and width of 64 voxels.

Figure 5
figure 5

Probablity plots showing the grain size distributions for each dataset. (a) Compares the real test datasets to the synthetic training datasets without twins, while (b) compares the real test datasets to the synthetic training datasets with twins. Grain sizes are represented as sphere equivalent diameters, D, normalized by the distribution mean.

Training details

We first describe a few data augmentation steps. The number of unique cubochoric coordinates is quite limited, so color shift transformations, along with other augmentations, are critical. Note that use of the word “color” is loose here because we treat the cubochoric coordinates the same as color channels in computer vision. In particular, our augmentations include random linear color shifts, rotations, and flips.

Using the 9 synthetic volumes generated by DREAM.3D, we train our axial transformer model in a self-supervised fashion similar to masked language modeling tasks17. These volumes are first normalized along each of the three cubochoric indices. Each training input is sampled from a randomly chosen volume with physical dimensions randomly permuted. Not only is cropping necessary due to computational limits, it also acts as another form of augmentation. For a sample \(\varvec{\mathscr {X}}_\star \in \mathbb {R}^{64\times 7 \times 64 \times 3}\), one of the central 5 slices (along the second dimension) is randomly masked. If m is the index of the masked or unobserved slice, define \(\varvec{\mathscr {M}}\in {[0,1]}^{64\times 7 \times 64 \times 3}\) to be a mask such that \([\varvec{\mathscr {M}}]_{\cdot , m} = 0\) and 1 elsewhere. Then, the model input and output will be \(\varvec{\mathscr {X}}_\star \odot \varvec{\mathscr {M}}\) and \(\widehat{\varvec{\mathscr {X}}} \in \mathbb {R}^{64\times 7 \times 64 \times 3}\), respectively.

Figure 6
figure 6

Changes between slices are captured in boundary voxels. The top row shows two consecutive slices. The bottom row shows their respective slices in Boundaries (red) with the ID difference between the two slices overlaid (white). Note that differences between these slices lie fully in the boundary voxels of both slices.

Since EBSD produces very structured data, we can leverage this to design a more effective loss function. A simple mean-squared-error (MSE) loss function across all voxels would not be sufficient because most voxels in the input are observed, so we would risk learning an identity function which would produce a fairly low loss value. A better approach would be to find the MSE of only the missing slice, analogous to masked language modeling objective functions17. However, based on Fig. 3, we see that many of voxels remain unchanged across multiple slices, and the slice-to-slice changes all lie along the grain boundaries (see Fig. 6). Since voxels within grains are easy to predict the values of, the source of difficulty is recovering the boundary voxels. Therefore, we opt for a MSE loss function that only considers boundary voxels in the missing slice. More formally, letting k be the index of the missing slice and \(\varvec{\mathscr {E}}\in {[0,1]}^{64 \times 64 \times 1}\) be the missing slice’s boundaries map in Boundaries, the loss function we use is

$$\begin{aligned} \mathscr {L}(\widehat{\varvec{\mathscr {X}}}, \varvec{\mathscr {X}}_\star ) = \frac{\left\| [\widehat{\varvec{\mathscr {X}}} - \varvec{\mathscr {X}}_\star ]_{\cdot , m} \odot \varvec{\mathscr {E}}\right\| _{{{\,\mathrm{\textsf{F}}\,}}}^2}{\left\| \varvec{\mathscr {E}}\right\| _0}, \end{aligned}$$

following broadcasting rules. By the way we defined \(\varvec{\mathscr {E}}\), \(\left\| \varvec{\mathscr {E}}\right\| _0\) is the number of unobserved boundary voxels. In summary, this loss function essentially averages MSE error across all unobserved boundary voxels.

Using this scheme, we train our 8-headed 8-layer model using stochastic gradient descent with a momentum parameter of 0.9 and weight decay of \(1\textrm{e}{-5}\). Using cosine schedules, we warm-up the learning rate up to 0.01 for 8000 steps. While we decay the learning rate until 160000 total gradient steps are taken with batches of 1 sample, performance plateaus around halfway. The model uses learnable positional encoding, 4 attention heads, an embedding size of 128, and a feedforward size of 512. The feedforward block is consists of two 3D convolutions with a window size of 3 along each dimension separated by a GELU. See Fig. 1 for a visualization of the architecture. Notably, our model is compact for a transformer, only consisting of slightly under 30 million parameters. We also apply 10% dropout. Recall that the transformer returns predicted cubochoric coordinates at each voxel of the same shape as the input, but only the masked slice contributes to the loss function (2).

Nearest neighbor projection

The final step is to use outputs of our transformer, which are continuous values, to assign each voxel to a grain and produce a smoother slice with fewer intra-grain variations. To do this, we prepare a dictionary relating observed grain IDs to cubochoric coordinates. First, we assign voxels whose neighbors (voxels that share a face) that reside in the previous and next slice have the same ID to be that ID. We observe these voxels act like anchors that provide more neighbors for voxels that are more difficult to classify, which empirically improves the recovery. Next, we begin projecting voxels, prioritizing the ones with the most neighbors that have been assigned an ID, which include observed voxels and previously projected voxels. This way, we first project voxels with the most known information in their neighborhoods first to gradually build up information in more obscure neighborhoods. Projections are determined by the minimum \(\ell _2\) distance to relevant cubochoric coordinates pulled from the dictionary based on neighboring IDs. Summarized in Fig. 7, the projection algorithm essentially turns the transformer outputs of the missing slice into IDs which can then be converted into cubochoric coordinates or Euler angles. For an example, see the bottom row of Fig. 7.

Figure 7
figure 7

The nearest neighbors projection processes the outputs of a transformer to assign voxels to grains. A two dimensional toy example is shown in (a). White boxes denote which voxels have not been projected yet. The input contains the transformer output sandwiched between observed adjacent slices. First, the anchoring step assigns voxels whose neighbors in adjacent slices are from the same grain. Then, voxels are sequentially projected to their neighbors, as indicated by the small arrows, starting with the voxels with the most observed and previously projected neighboring voxels. The bottom row showcases a real projection example from the IN625 dataset. (bd) Show the target slice, the transformer prediction, and the processed output via projection, respectively. (e) Captures the voxel-wise \(\ell _2\) difference between before and after projection.


With our method defined, we now evaluate its performance on synthetic and real EBSD data with no additional training or fine tuning. Even though our method overwhelmingly outperforms the baselines in terms of recovery, we also point out some weaknesses and avenues for improvement.

There are three baselines we compare to. One is k-nearest neighbors (KNN) where a voxel’s ID is determined by vote based only on observed or previously assigned IDs among its neighboring voxels. This is the exact process of the projection step (including the anchoring procedure), except instead of using a distance metric, voxels are assigned IDs by a vote system. Ties are broken randomly. Another method, which is currently employed as a simple solution to missing slices, is to copy an adjacent slice to replace the missing slice. This usually maintains fairly decent accuracy since the changes from slice to slice are minuscule compared to the number of voxels. As such, copying the previous and next slice are our other two baselines.


Because the changes from slice to slice are captured by boundary voxels, we define two different accuracy metrics using the IDs. We denote the accuracy of each voxel in the recovered slice as the overall accuracy and the accuracy only considering boundary voxels in the recovered slice as the boundary accuracy. The latter poses a greater challenge for all models, as it is consistently lower than the overall accuracy where the influence of non-boundary voxels often dominates.

While we report the mean accuracies and standard deviation across validation samples, the performance of one method is heavily correlated with the performance of others. For instance, if a particular slice is difficult to recover for one method, it is likely to be difficult for all other methods. To better compare the performances between our transformer and each baseline, we obtain differences in accuracy for each sample. Namely, we find \(d(m, \varvec{\mathscr {X}}_\star , \widehat{\varvec{\mathscr {X}}}, \widehat{\varvec{\mathscr {X}}}_{b}) := m(\varvec{\mathscr {X}}_\star , \widehat{\varvec{\mathscr {X}}}) - m(\varvec{\mathscr {X}}_\star , \widehat{\varvec{\mathscr {X}}}_{b})\) for an accuracy metric m, ground truth \(\varvec{\mathscr {X}}_\star\), and outputs, \(\widehat{\varvec{\mathscr {X}}}\) and \(\widehat{\varvec{\mathscr {X}}}_b\), from our transformer and baseline b, respectively. Computing this across all samples, the result is a distribution of accuracy improvements.

Synthetic volumes

First, we generate 4 independent synthetic volumes of shape (64, 192, 64) for each setting. Each volume is divided into 27 nonoverlapping segments of shape (64, 7, 64) each with one of the five interior slices masked. This results in 108 validation samples for each setting. These validation metrics are displayed in Fig. 8 when varying the mean twins frequency and in Fig. 9 when varying mean grain size.

Figure 8
figure 8

Overall accuracy (a), boundary accuracy (b), overall accuracy improvements (c), and boundary accuracy improvements (d) of synthetic volumes with varying twin frequencies. Error bars in (a,b) represent one standard deviation.

Figure 9
figure 9

Overall accuracy (a), boundary accuracy (b), overall accuracy improvements (c), and boundary accuracy improvements (d) of synthetic volumes with varying mean grain sizes. Error bars in (a,b) represent one standard deviation.

We observe that our method more accurately recovers missing slices than all other baselines for every synthetic validation sample since each slice observed a positive improvement in overall and boundary accuracy. The performance gain is much more apparent for boundary voxels. Furthermore, our transformer performance is much lower variance. Among the baselines, KNN achieves the closest accuracy to our method, but it still underperforms in comparison. As expected, the difference between our method and the baselines diminishes as the scenario get simpler (larger grain sizes and fewer twins) since the baselines are already producing very accurate results.

Real EBSD datasets

Now, we seek to understand how well our model transfers to real data by running our trained model on IN62536,37 and IN71838,39. For both volumes, we subdivide each volume into nonoverlapping subvolumes such that all slices are of height and width 64 voxels. Then, each subvolume is then further partitioned into nonoverlapping segments of shape \((64, 7, 64, *)\), each representing a single test sample. Again, each sample contains one masked slice among the five central slices. In the end, there are 800 samples for IN625 and 3298 samples for IN718 after discarding a small set of samples whose missing slice did not have any boundary voxels (these samples would trivially result in exact recovery regardless of the model we use). Average accuracies and accuracy improvements for both volumes are shown in Table 1 and Fig. 10, respectively. For recovery examples, see Fig. 11.

Table 1 Average overall accuracy and boundary accuracy across samples of real world datasets, including their standard deviations.
Figure 10
figure 10

Overall and boundary accuracy improvements of IN625 and IN718 test sets. Note that the y-axes are on different scales.

Surprisingly, even though our model is trained exclusively on synthetic data, we observe it transfers well to real EBSD data, as it still outperforms every baseline on nearly every sample, and the interquartile range is positive. Again, the difference is accentuated for boundary voxels. Test results for the IN718 dataset had much higher variance, likely due to each slice having proportionally fewer boundary voxels than IN625 slices. Qualitatively, our model has a stronger capability to recovery thin features than KNN, which visually tends to ignore more subtle structures.

Figure 11
figure 11

Four random example predictions of missing slices from IN625 (top two rows) and IN718 (bottom two rows) test sets. Each row is a separate example. (a) Contains the target; (be) are predictions made by our transformer, KNN, the previous slice, and the following slice, respectively.


While our method provides superior results compared to the baselines, we also seek to understand the circumstance in which it may underperform. One challenge is rapid changes between slices which make up a small minority of test inputs. For instance, if the k-th slice is missing and the set of grains present in the \((k-1)\)-th is different from the set of grains in \((k+1)\)-th slice, the model has a lot of freedom to decide on which ones are present and to what degree in the missing slice. A case in which this can arise is when the faces of grains or twins are perpendicular to the slicing direction. However, we observe this is an issue for all methods. Using the second row of Fig. 11 as an example, the bright green grain is a large crescent shape on the slice before the missing slice but is hardly present in the following slice. Our model and KNN strives to find some middleground, but ultimately, both misclassify many of the voxels. A similar scene plays out in the fourth row of Fig. 11. Thus, future work includes designing a better loss function to mitigate errors for this or emphasizing these scenarios during training.

Another limitation arises from our projection method. Recall our local projection method projects a voxel to have the ID as one of its observed or previously assigned voxels in order to encourage grain connectedness. This means long range dependencies may be ignored in favor of more smooth structures. Furthermore, connectedness is not guaranteed for very thin features at an angle. Examples of both edge cases for matrices are illustrated in Fig. 12. Though these scenarios are fairly rare in practice, further work could be done to improve our projection algorithm. In particular, our use of grain-averaged orientations directly impacts the design of the projection step. While our transformer predicts real-valued orientations, we utilize a projection step that “snaps” these predicted values to the nearest grain-averaged value to compare with the original training data. Moving to orientation values that are not grain-averaged will require adaptation of the projection step, and potentially require modifications of the underlying transformer architecture.

Figure 12
figure 12

Matrix examples where projection to observed and previously projected neighbors is suboptimal, even when giving it the ground truth. Projection involves using the neighboring slices to predict the center slice. White squares indicate which pixels have not been projected yet, and the black arrows point to the possible neighbors that the pixel will choose to project to. The top row shows an example where the projection operation smooths out very thin features that lie completely in the missing slice. The bottom row shows an example where the projection operation may disconnect a grain.

Future directions can also better integrate the dynamics of orientation data into our method. For example, our use of MSE loss operating on cubochoric values is not a mathematically rigorous metric representation. We initially attempted to utilize an angular metric for loss, but encountered instability during training that we believe related to the trigonometric functions used during the loss computation. An immediate improvement to the current approach would be operating on an orientation representation for which we can compute a stable loss that represents the true angular difference in that representation space. In addition, our current data augmentation method borrowed from computer vision techniques, particularly color shifting, cannot guarantee preservation of misorientation relationships between twins and their parent grains. Taking such additional special considerations may enhance the performance of our method.


We have presented a novel method using transformers followed by a projection algorithm to recover missing 3D EBSD data which vastly outperforms all baselines. Notably, even though our model is trained on synthetic data, it still recovers more accurate slices for real 3D EBSD data than the baselines by a wide margin, making it a powerful data processing tool for faulty 3D EBSD readings. Furthermore, our model opens the possibility to skip every fourth slice during data collection (by using every skipped slice and the three collected slices on each side as the input to our model), potentially reducing the collection time by 25%. Future work involves addressing the limitations, scaling our method, and considering more general cases, such as altering the projection algorithm to apply to consecutive missing slices within a sample. In terms of scaling, we hope to observe emergent behavior similar to scaling LLMs and their training data, which have seen vast improvements to performance and versatility45. Beyond EBSD, it would be interesting to investigate our method’s applicability for other high dimensional material science datasets.