A lightweight transformer for faster and robust EBSD data collection

Three dimensional electron back-scattered diffraction (EBSD) microscopy is a critical tool in many applications in materials science, yet its data quality can fluctuate greatly during the arduous collection process, particularly via serial-sectioning. Fortunately, 3D EBSD data is inherently sequential, opening up the opportunity to use transformers, state-of-the-art deep learning architectures that have made breakthroughs in a plethora of domains, for data processing and recovery. To be more robust to errors and accelerate this 3D EBSD data collection, we introduce a two step method that recovers missing slices in an 3D EBSD volume, using an efficient transformer model and a projection algorithm to process the transformer’s outputs. Overcoming the computational and practical hurdles of deep learning with scarce high dimensional data, we train this model using only synthetic 3D EBSD data with self-supervision and obtain superior recovery accuracy on real 3D EBSD data, compared to existing methods.


Introduction
Experimental methodologies for three dimensional tomography of the internal microstructure of materials has been refined considerably in the past few decades [1][2][3][4][5][6][7][8] , growing to include a variety of modalities such as X-ray computed tomography (CT), optical imaging, electron imaging, energy dispersive X-ray spectroscopy (EDS), and electron back-scattered diffraction (EBSD) data, among others.In all of these cases, a volume of material is interrogated slice by slice, and these slices are then stacked and reconstructed using a variety of software tools.Three dimensional microstructure data, while often laborious to generate compared to collecting a single 2D section, can provide unique insight, including things like the topology of microstructural features like grains, pores, precipitates and the true shape and size distributions of such features.Properties such as fatigue and oxide transport are sensitive to the three dimensional arrangement of these features [9][10][11] .And novel manufacturing processes like additive manufacturing also demand three dimensional characterization of microstructure to fully link the processing to the structure 3,12 .
While advancements have been made in robustness and closed loop collection of data 2 , corruptions in data can still happen.This may not be as significant of a problem in non-destructive techniques where the data can be recollected, but in destructive techniques like serial sectioning, it may often be the case that a few slices out of a thousand have no data or are of very poor quality, lowering the accuracy of the reconstruction.The corruptions can happen for a number of reasons including removing more material (via mechanical polishing or laser/ion abalation) than planned, the electron source nearing the end of its life resulting in low signal images, the magnification on the microscope being incorrect for a slice, or the brightness/contrast settings result in an over/under saturated image.While adjustments to the control software can be made to prevent these and other issues, it is hard to a-priori imagine every reason a slice or subset of data can be corrupted.
However, if most of the data was collected properly, we hypothesize it should be possible to infer a substantial amount of the missing data.The current common practice is to fill in missing slices is to just copy the layer above or below the missing slice 2 .This is reasonable in most serial sectioning cases where the slices are being collected at a high enough frequency that most of the data stays the same from slice to slice.Still, there is some room for improvement on nearest neighbor replacement type approaches, and the transformer model, which has been used to fill in text data, is a tantalizing framework for also filling in missing data in sequential image data.
Since its introduction 13 , transformers have been the backbone of many impressive breakthroughs in natural language processing (NLP).In their general form, transformers are excellent at processing sequential information in parallel.
Consequently, its use has spread across multiple domains, such as in computer vision 14 , speech processing 15 , and bioinformatics 16 .In our case, transformers are appealing to EBSD data because the readings are inherently sequential as it represents a real physical structure.
Early large transformer models such as BERT 17 and GPT 18 were pretrained on copious amounts of text data with the intention of learning features to model the language through associations between words and phrases with a collection of self-supervised tasks 19 .Unlike supervised tasks, such as classification, which requires labeled datasets, self-supervised tasks involve automatically generating labels from unlabeled data, such as unscrambling a sequence of shuffled sentences, which force the model to vaguely learn the structure of the data.This technique remained fairly intact even when scaling to billion-parameter large language models (LLMs), enormous models that process much longer sequences over the course of hundreds of billions of tokens (language units comprising of a sequence of characters) [20][21][22] .Taking inspiration from a similar popular self-supervised task in NLP, our training procedure involves randomly masking a slice in an EBSD volume and having the model predict the masked slice.
Our slice recovery task is closely related to that of masked image modeling, masked autoencoders, and inpainting in computer vision [23][24][25][26][27] .Even though these may be applicable and there exists plenty of transformers in computer vision 14,28 , we want our model to also be computationally efficient as we scale our method and leverage the fact that EBSD volumes have sparse structures 29 , the dynamics of which operate differently from typical images or videos.Regarding efficiency, the fact that EBSD produces high dimensional data means the final model's computational footprint cannot be ignored.
Taking these into consideration, our contributions are the following: 1. We propose a novel method to recover missing EBSD data consisting of a scalable transformer model and straightforward projection algorithm that produces superior results compared to existing methods.This is accentuated when zoning in on the recovery accuracy of grain boundaries where other methods appear to perform poorly.
2. We demonstrate that despite being trained solely on synthetic data, our transformer can generalize to real EBSD data without additional training while still outperforming all the baselines.This robustness to out-of-distribution EBSD data overcomes a major limitation of relatively low amount of available real EBSD data that deep learning requires.
3. Our results suggest that serial sectioning experiments using similar experimental parameters to those in our test datasets, collection time can effectively be slashed by up to 25% with very little error in predicted voxel orientation, as the collection of every fourth slice can be bypassed.With the current lengthy procedure of EBSD, this is a significant improvement in efficiency.

Background Terminology and Notation
For this paper, we use bold capital letters (e.g.A A A) for matrices and capital calligraphic letters (e.g.A ) for tensors.Scalars are represented by plain uppercase and lowercase letters (e.g.A and a).Tuples are used to describe the shape of a structure, so if ) and vice versa.A dimension is an index (starting at 1) in the shape tuple, and the size of a dimension i is the value of the shape tuple at that index i.When we refer to multiple volumes that share the same first k dimensions (e.g. ), we use * to capture all remaining and possibly different dimensions (e.g.A , B, and C have shape (N 1 , N 2 , * )).

Transformers
In this section, we outline the technical details of the encoder-only transformer based on its first introduction 13 .The same source describes encoder-decoder and decoder-only transformers for interested readers.

Input Processing
The first layer of a transformer is the embedding layer.For a single length N sequence of D in -dimensional vectors, X X X ∈ R N×D in , each vector is transformed into a vector of size D out .If the input is a sequence of tokens such as in language tasks, then D in = 1.The transformation could be a linear one, or a more complex nonlinear one such as another neural network.The result is an embedded sequence of shape (N, D out ). Next, positional encoding is added to the embedding to inject positional information to each vector in the sequence.Positional encodings can either be fixed or learned.

Multi-head Self-attention
The idea of self-attention is to find varying degrees of associations between elements in a sequence.Transformers typically implement multi-headed self-attention within each encoder layer, which allows the model to learn different types of associations.Let X X X ∈ R N×D be the input sequence into the attention mechanism and for head h = 1, 2, . . ., H. Additionally, let σ be the row-wise softmax operator.Then, the attention for head h, A A A h is defined as where h are the query, key, and value matrices, respectively.Next, the results are concatenated and linearly transformed to produce the output The main source of inefficiency is the computation of Q Q Q h K K K ⊤ h which levies a computational and memory cost of O(N 2 ).Consequently, the utility of transformers can be limited for long sequences without proper hardware.This becomes a dire issue as the dimensionality of the input grows, such as in images and videos which can be flattened into a long sequence.Thankfully, there has been substantial work in reducing the complexity with more efficient attention approximation mechanisms, such as Linformer 30 , Reformer 31 , Big Bird 32 , and many others 33 .

Axial Self-attention
When it comes to high-dimensional data, such as the EBSD volumes, the sequence length explodes if we flatten (or patchify in the case of many transformers in computer vision 14 ) the input and apply vanilla self-attention, making attention computation a serious bottleneck.Furthermore, since we observe a lot of structure with EBSD data, it is reasonable to utilize some simplified attention mechanism.Our model uses axial attention 34,35 which runs the self-attention mechanism along the dimensions of the input tensor.For instance, in a cube, each voxel only attends to voxels in the same row, column, or depth.This greatly reduces the amount of computation and memory, especially for higher order tensors.Intuitively, axial attention is appropriate because we hypothesize that since local information can be quite uniform (nearby voxels are likely to be in the same grain), long range information should also be included.With axial attention, it is highly likely to also obtain information in other grains and its boundaries.
In terms of implementation, the formula for multi-headed attention (1) can be reused.Let X ∈ R N 1 ,...,N K ,D be a single K-dimensional input where (N 1 , . . ., N K ) defines the shape of the volume and D is the embedding size.As an example, suppose we are interested in finding axial attention along the k-th dimension for 1 < k < K.With proper dimension permutation and flattening, we can reshape X to a tensor of shape (∏ i̸ =k N i , N k , D) and compute multiheaded attention by treating the first dimension as the batches.In our model, we repeatedly use the outputs of axial attention to compute axial attention along the next dimension.

Feedforward Blocks
The next major component to create a transformer is the feedforward block, a multilayer perceptron with a single hidden layer.The activation function is often either a Gaussian Error Linear Unit (GELU) or Rectified Linear Unit (ReLU) function, and although the input and output sizes are identical, the hidden size can be much larger.Consequently, although self-attention is a computational bottleneck, the feedforward blocks are the dominant source of parameters in the model.

Putting it All Together
With all these components, we are ready to define the general architecture of a transformer.First, an input is processed by an embedding layer and positional encoding layers at the start.Then, the input progresses into the encoder, a sequence of alternating self-attention and feedforward blocks.For training stability, residual connections and normalization layers are inserted in between each attention block and feedforward block.Finally, the features from the encoder are linearly projected to be the desired shape of the output.A visualization of a vanilla and axial transformer, which mainly differ in their attention implementation, is depicted in Figure 1.

Method
We propose a transformer model to learn missing slices of EBSD data, followed by a projection step to smooth out the voxel values.Due to limited real EBSD data, our goal is to train this model on a large and diverse synthetic dataset and demonstrate that it can generalize to real EBSD data, which we evaluate on two nickel superalloy EBSD volumes, one for alloy IN625 36,37 and one for alloy IN718 38,39 .The following repository contains the code for our method: https://github.com/hdong920/ebsd_slice_recovery.

Data Description and Preparation
Each dataset (both synthetic and experimental) includes orientation information at every voxel in a 3D image.For the experimental data, a substatial amount of preprocessing was done to handle the alignment of the data and clean up noise; a complete description of the preprocessing may be found in Chapman et al. 2 and Stinville et al. 38 In particular, we remove any grains smaller than 27 voxels (3 3 ) and average orienations per grain.The original volume containing Euler angles is of shape (N 1 , N 2 , N 3 , 3), where N 1 , N 2 , N 3 are the physical dimensions, and the final dimension represents the 3 Euler angles needed to define an orientation.Additional arrays are then computed from the input data: • Cubochorics: A volume of shape (N 1 , N 2 , N 3 , 3) where the last dimension contains the cubochoric coordinates 40 converted from the original Euler angles at each voxel.Cubochoric coordinates are chosesn since the Euclidean metric is used for regressing the transfomer model.The Euclidean distance between points in Euler angle space does not necessarily relate to the angular distance between those points.While this is also true for points in cubochoric space, as the cubochoric representation is an equal-volume mapping of SO(3) onto a grid as opposed to an equal-angle mapping, in practice the angular discretization between Euclidean neighbors in cubochoric space is relatively consistent across the grid.Thus, for all of our experiments, we operate completely in cubochoric space, under the assumption that the Euclidean metric is a reasonable approximation for similarity between points in this space.
• IDs: A volume of shape (N 1 , N 2 , N 3 ) which assigns identification numbers (IDs) to denote which grain each voxel belongs to.Each ID number has a unique vector of cubochoric coordinates associated with it.While not needed during training, these ID numbers will be used to smooth our model outputs and evaluate model accuracy.
For experimental data, the IDs are found by segmenting the grains using a misorientation tolerance 2,38 , and for synthetic data, the IDs are generated alongside the orientation data.
• Boundaries: A volume of shape (N 1 , N 2 , N 3 ) containing Boolean values indicating if a voxel is on the grain boundary.More specifically, a voxel is a boundary voxel is at least one of its neighbors is of a different ID number than itself, where two voxels are neighbors if they share a face.

4/15
To illustrate, Figure 2 2a) and IN718 38 (Figure 2b).The top row shows the cubochoric values that are scaled and shifted for better visualization, and the bottom row shows the same region in Boundaries.Note that the height and width of each slice here is 64 voxels and not the original shape.
Synthetic volumes are generated via DREAM.3D 41.Each synthetic training and validation volume is of shape (192, 192, 192, * ) and (64, 192, 64, * ), respectively.Within DREAM.3D,we generate 9 training volumes while independently varying the mean grain size and mean transformations per grain.In particular, we generate volumes with mean grain sizes 2, 2.5, and 3 with no twins; we also generate volumes with mean twins frequencies 0, 1, 2, 3, 4, and 5 while fixing mean grain size to be 2.3.Due to the nature of the software, these parameters are unitless, as we can always synthetically increase or decrease the granularity of the volume.For example slices of these synthetic volumes, see Figure 3.The grain size distributions for each dataset are shown in Figure 3 as probability plots.The natural logarithm of the normalized grain sizes are shown; ideal lognormally distributed data would lie on straight lines in such plots.In general, the grain sizes are primarily lognormal near their means, with noted deviations from lognormality in their tails, which is a known phenomenon 42 .

Training Details
We first describe a few data augmentation steps.The number of unique cubochoric coordinates is quite limited, so color shift transformations, along with other augmentations, are critical.Note that use of the word "color" is loose here because we treat the cubochoric coordinates the same as color channels in computer vision.In particular, our augmentations include random linear color shifts, rotations, and flips.
Using the 9 synthetic volumes generated by DREAM.3D,we train our axial transformer model in a self-supervised fashion similar to masked language modeling tasks 17 .These volumes are first normalized along each of the three cubochoric indices.Each training input is sampled from a randomly chosen volume with physical dimensions randomly permuted.Not only is cropping necessary due to computational limits, it also acts as another form of augmentation.is the index of the masked or unobserved slice, define M ∈ [0, 1] 64×7×64×3 to be a mask such that [M ] •,m = 0 and 1 elsewhere.Then, the model input and output will be X ⋆ ⊙ M and X ∈ R 64×7×64×3 , respectively.Since EBSD produces very structured data, we can leverage this to design a more effective loss function.A simple mean-squared-error (MSE) loss function across all voxels would not be sufficient because most voxels in the input are observed, so we would risk learning an identity function which would produce a fairly low loss value.A better approach would be to find the MSE of only the missing slice, analogous to masked language modeling objective functions 17 .However, based on Figure 2, we see that many of voxels remain unchanged across multiple slices, and the slice-to-slice changes all lie along the grain boundaries (see Figure 5).Since voxels within grains are easy to predict the values of, the source of difficulty is recovering the boundary voxels.Therefore, we opt for a MSE loss function that only considers boundary voxels in the missing slice.More formally, letting k be the index of the missing slice and E ∈ [0, 1] 64×64×1 be the missing slice's boundaries map in Boundaries, the loss function we use is following broadcasting rules.By the way we defined E , ∥E ∥ 0 is the number of unobserved boundary voxels.In summary, this loss function essentially averages MSE error across all unobserved boundary voxels.Projection algorithm to take transformer outputs and assign voxels to grains.White boxes denote which voxels have not been projected yet.The input contains the transformer output and observed adjacent slices.First, the anchoring step assigns voxels whose neighbors in adjacent slices are from the same grain.Then, voxels are sequentially projected to their neighbors, as indicated by the small arrows, starting with the voxels with the most observed and previously projected neighboring voxels.
Using this scheme, we train our 8-headed 8-layer model using stochastic gradient descent with a momentum parameter of 0.9 and weight decay of 1e−5.Using cosine schedules, we warm-up the learning rate up to 0.01 for 8000 steps.While we decay the learning rate until 160000 total gradient steps are taken with batches of 1 sample, performance plateaus around halfway.The model uses learnable positional encoding, 4 attention heads, an embedding size of 128, and a feedforward size of 512.The feedforward block is consists of two 3D convolutions with a window size of 3 along each dimension separated by a GELU.See Figure 1 for a visualization of the architecture.Notably, our model is compact for a transformer, only consisting of slightly under 30 million parameters.We also apply 10% dropout.Recall that the transformer returns predicted cubochoric coordinates at each voxel of the same shape as the input, but only the masked slice contributes to the loss function (2).

Nearest Neighbor Projection
The final step is to use outputs of our transformer, which are continuous values, to assign each voxel to a grain and produce a smoother slice with fewer intra-grain variations.To do this, we first construct a dictionary relating observed grain IDs to cubochoric coordinates.Then, we assign voxels whose neighbors (voxels that share a face) that reside in the previous and next slice have the same ID to be that ID.We observe these voxels act like anchors that provide more neighbors for voxels that are more difficult to classify, which empirically improves the recovery.Next, we begin projecting voxels with the most observed or previously projected neighbors and incrementally decrease the requirement on the number of neighbors when further projections are impossible until all voxels have been assigned an ID.Projections are determined by the minimum ℓ 2 distance to relevant cubochoric coordinates pulled from the dictionary based on neighboring IDs.Summarized in Figure 6, the projection algorithm essentially turns the transformer outputs of the missing slice into IDs which can then be converted into cubochoric coordinates.

Experiments
With our method defined, we now evaluate its performance on synthetic and real EBSD data with no additional training or fine tuning.Even though our method overwhelmingly outperforms the baselines in terms of recovery, we also point out some weaknesses and avenues for improvement.
There are three baselines we compare to.One is k-nearest neighbors (KNN) where a voxel's ID is determined by vote based only on observed or previously assigned IDs among its neighboring voxels.This is the exact process of the projection step (including the anchoring procedure), except instead of using a distance metric, voxels are assigned IDs by a vote system.Ties are broken randomly.Another method, which is currently employed as a simple solution to missing slices, is to copy an adjacent slice to replace the missing slice.This usually maintains fairly decent accuracy since the changes from slice to slice are minuscule compared to the number of voxels.As such, copying the previous and next slice are our other two baselines.

Performance
Because the changes from slice to slice are captured by boundary voxels, we define two different accuracy metrics using the IDs.We denote the accuracy of each voxel in the recovered slice as the overall accuracy and the accuracy only considering boundary voxels in the recovered slice as the boundary accuracy.The latter poses a greater challenge for all models, as it is consistently lower than the overall accuracy where the influence of non-boundary voxels often dominates.
While we report the mean accuracies and standard deviation across validation samples, the performance of one method is heavily correlated with the performance of others.For instance, if a particular slice is difficult to recover for one method, it is likely to be difficult for all other methods.To better compare the performances between our transformer and each baseline, we obtain differences in accuracy for each sample.Namely, we find ) for an accuracy metric m, ground truth X ⋆ , and outputs, X and X b , from our transformer and baseline b, respectively.Computing this across all samples, the result is a distribution of accuracy improvements.

Synthetic Volumes
First, we generate 4 independent synthetic volumes of shape (64, 192, 64) for each setting.Each volume is divided into 27 nonoverlapping segments of shape (64, 7, 64) each with one of the five interior slices masked.This results in 108 validation samples for each setting.These validation metrics are displayed in Figure 7 when varying the mean twins frequency and in Figure 8 when varying mean grain size.We observe that our method more accurately recovers missing slices than all other baselines for every synthetic validation sample since each slice observed a positive improvement in overall and boundary accuracy.The performance gain is much more apparent for boundary voxels.Furthermore, our transformer performance is much lower variance.Among the baselines, KNN achieves the closest accuracy to our method, but it still underperforms in comparison.As expected, the difference between our method and the baselines diminishes as the scenario get simpler (larger grain sizes and fewer twins) since the baselines are already producing very accurate results.

Real EBSD Datasets
Now, we seek to understand how well our model transfers to real data by running our trained model on IN625 36,37 and IN718 38,39 .For both volumes, we subdivide each volume into nonoverlapping subvolumes such that all slices are of height and width 64 voxels.Then, each subvolume is then further partitioned into nonoverlapping segments of shape (64, 7, 64, * ), each representing a single test sample.Again, each sample contains one masked slice among the five central slices.In the end, there are 800 samples for IN625 and 3298 samples for IN718 after discarding a small set of samples whose missing slice did not have any boundary voxels (these samples would trivially result in exact recovery regardless of the model we use).Average accuracies and accuracy improvements for both volumes are shown in Table 1 and Figure 9, respectively.For recovery examples, see Figure 10.
Surprisingly, even though our model is trained exclusively on synthetic data, we observe it transfers well to real EBSD data, as it still outperforms every baseline on nearly every sample, and the interquartile range is positive.Again, the difference is accentuated for boundary voxels.Test results for the IN718 dataset had much higher variance, likely due to each slice having proportionally fewer boundary voxels than IN625 slices.Qualitatively, our model has a stronger capability to recovery thin features than KNN, which visually tends to ignore more subtle structures.

Limitations
While our method provides superior results compared to the baselines, we also seek to understand the circumstance in which it may underperform.One challenge is rapid changes between slices which make up a small minority of test inputs.For instance, if the k-th slice is missing and the set of grains present in the (k − 1)-th is different from the set of grains in (k + 1)-th slice, the model has a lot of freedom to decide on which ones are present and to what degree in  the missing slice.A case in which this can arise is when the faces of grains or twins are perpendicular to the slicing direction.However, we observe this is an issue for all methods.Using the second row of Figure 9 as an example, the bright green grain is a large crescent shape on the slice before the missing slice but is hardly present in the following slice.Our model and KNN strives to find some middleground, but ultimately, both misclassifies many of the voxels.A similar scene plays out in the fourth row of Figure 9. Thus, future work includes designing a better loss function to mitigate errors for this or emphasizing these scenarios during training.Another limitation arises from the locality of our projection method.Recall our local projection method projects a voxel to have the ID as one of its observed or previously assigned voxels in order to encourage grain connectedness.This means long range dependencies may be ignored in favor of more smooth structures.Furthermore, connectedness is not guaranteed for very thin features at an angle.Examples of both edge cases for matrices are illustrated in Figure 11.Though these scenarios are fairly rare in practice, further work could be done to improve our projection algorithm.

Conclusion
We have presented a novel method using transformers followed by a projection algorithm to recover missing 3D EBSD data which vastly outperforms all baselines.Notably, even though our model is trained on synthetic data, it still recovers more accurate slices for real 3D EBSD data than the baselines by a wide margin, making it a powerful data processing tool for faulty 3D EBSD readings.Furthermore, our model opens the possibility to skip every fourth slice during data collection (by using every skipped slice and the three collected slices on each side as the input to our model), potentially reducing the collection time by 25%.Future work involves addressing the limitations, scaling our method, and considering more general cases, such as altering the projection algorithm to apply to consecutive missing slices within a sample.In terms of scaling, we hope to observe emergent behavior similar to scaling LLMs and their training data, which have seen vast improvements to performance and versatility 43 .Beyond EBSD, it would be interesting to investigate our method's applicability for other high dimensional material science datasets.

Figure 1 .
Figure 1.Transformer architecture (left) with either vanilla attention (center) or axial attention (right).The axial transformer is produced by simply substituting full attention with axial attention.Although axial attention has significantly more layers, it scales much more favorably for high-dimensional data.Attention layers take three inputs for the query, key, and value.

Figure 2 .
Figure 2. Three consecutive crops of real EBSD slices from IN625 2 (Figure2a) and IN71838 (Figure2b).The top row shows the cubochoric values that are scaled and shifted for better visualization, and the bottom row shows the same region in Boundaries.Note that the height and width of each slice here is 64 voxels and not the original shape.

Figure 3 .Figure 4 .
Figure 3. Example synthetic slices of cubochoric values that are scaled and shifted for visualization.Keeping mean grain size fixed, Figures3a, 3b, and 3cshow generated slices when we specify the mean frequency of twins per grain to be 0, 2, and 4, respectively.With no twins, Figures3d and 3eshow generated slices when we specify the mean grain size to be 2 and 3, respectively.All slices have a height and width of 64 voxels.

Figure 5 .
Figure 5. Changes between slices are captured in boundary voxels.The top row shows two consecutive slices.The bottom row shows their respective slices in Boundaries (red) with the ID difference between the two slices overlaid (white).Note that differences between these slices lie fully in the boundary voxels of both slices.

Figure 6 .
Figure 6.Projection algorithm to take transformer outputs and assign voxels to grains.White boxes denote which voxels have not been projected yet.The input contains the transformer output and observed adjacent slices.First, the anchoring step assigns voxels whose neighbors in adjacent slices are from the same grain.Then, voxels are sequentially projected to their neighbors, as indicated by the small arrows, starting with the voxels with the most observed and previously projected neighboring voxels.

Figure 9 .
Figure 9. and boundary accuracy improvements of IN625 and IN718 test sets.Note that the y-axes are on different scales.

Figure 10 .
Figure 10.Four random example predictions of missing slices IN625 (top two rows) and IN718 (bottom two rows) test sets .Each row is a separate example.Figure 10a contains the target; Figures10b, 10c, 10d, and 10eare predictions made by our transformer, KNN, the previous slice, and the following slice, respectively.

Figure 11 .
Figure11.Matrix examples where projection to observed and previously projected neighbors is suboptimal, even when giving it the ground truth.Projection involves using the neighboring slices to predict the center slice.White squares indicate which pixels have not been projected yet, and the black arrows point to the possible neighbors that the pixel will choose to project to.The top row shows an example where the projection operation smooths out very thin features that lie completely in the missing slice.The bottom row shows an example where the projection operation may disconnect a grain.

Table 1 .
Average overall accuracy and boundary accuracy across samples of real world datasets, including their standard deviations.