Addressing materials’ microstructure diversity using transfer learning

Materials’ microstructures are signatures of their alloying composition and processing history. Automated, quantitative analyses of microstructural constituents have recently been accomplished through deep learning approaches. However, their shortcomings are poor data efficiency and poor domain generalizability across data sets, which conflict with the expense of expert data annotation and with the extensive diversity of materials. To tackle both, we propose to apply a sub-class of transfer learning methods called unsupervised domain adaptation (UDA). UDA addresses the task of finding domain-invariant features when supplied with annotated source data and unannotated target data, such that performance on the latter is optimized. As an example, this study is conducted on a lath-shaped bainite segmentation task in complex phase steel micrographs. The domains to bridge are selected to be different metallographic specimen preparations and distinct imaging modalities. We show that a state-of-the-art UDA approach substantially fosters the transfer between the investigated domains, underlining this technique’s potential to cope with materials variance.


Motivation
The inner structure of a material, the so-called microstructure, determines most of its properties and shows substantial variance depending on its composition and process history. Hence, the quantification and digitization of microstructures play an important role in virtual material design as well as in objective and automated quality control. Currently, the quantification of microstructural constituents is performed on material sections, which undergo mechanical polishing and chemical etching routines to expose these inner features in, e.g., light optical microscopy (LOM) or scanning electron microscopy (SEM). Aside from the material itself, this process introduces substantial scatter in the micrographs' appearance. With evolving materials complexity, not only does the quantitative micrograph assessment by metallography experts become increasingly subjective and costly, but conventional computer vision algorithms have also reached their limits.
Therefore, the materials science field recently adopted deep learning models for this task but is hampered by the limited amount of available and specifically labeled data. While annotated data is scarce, input micrographs are acquired of various materials, with different settings and optics, by different experts, and with varying processing. This motivates the need for models that can generalize across such data domains without additional labeled data in the target domains. However, deep learning approaches have been shown to exhibit poor generalization capability across miscellaneous data sets [1][2][3] and materials science data sets [4,5]. Furthermore, in terms of annotated data, materials parameter spaces are often sparsely (i.e., disjointly) populated, contributing to poor model generalization. Advanced generalization techniques could enable model sharing between institutes and the models' applicability to diverse data sets. Microstructure characterization is one of many tasks that could benefit from such models.
arXiv:2107.13841v1 [cond-mat.mtrl-sci] 29 Jul 2021
This work addresses deep learning (DL)-based microstructure quantification and its generalization, focusing on a binary segmentation (pixel-wise classification) task introduced in our last paper [5]. It encompasses the segmentation of the lath-shaped bainite phase in topography-contrast SEM images of an etched complex phase steel surface. In this particular case, the annotation process is not only expensive but also complex, to the extent that different experts frequently disagree on the exact nature of some phases [6]. This impairs the development of DL models. In our previous work [5], an extensive effort was made to combine the SEM micrographs with orientation contrast from electron back-scatter diffraction (EBSD) images, facilitating a more repeatable and precise labeling process. However, this time-consuming procedure emphasizes the demand for models that are frugal in terms of labeled training data.
Moreover, this work focuses on model transferability to a target domain in the context of low source data availability and unavailability of target domain annotations. These are typical boundary conditions in materials science. As a first step, having trained a model on a source dataset, we want to give some insight into the possibility of adapting it to a target domain (domain generalization experiments). In this case, we investigate whether pre-training with domain-extrinsic datasets and subsequent fine-tuning to the source domain can improve generalization to target domains in the low-data regime. To this end, we compare models pre-trained on different datasets with models trained from random initialization. Subsequently, we apply a state-of-the-art unsupervised domain adaptation (UDA) model introduced by Tsai et al. [7] to our phase segmentation task across different domains. With the aid of additional unlabeled target data, this technique attempts to learn domain-invariant features to facilitate the domain transfer. Given little data, and especially in the materials science field, it is unresolved whether such advanced deep learning techniques can facilitate the transfer across different processing routes or even materials. As example studies, distinct metallographic surface etchings and different imaging modalities were investigated as the domains to bridge.

Related work
Instances where deep learning was applied to solve materials science tasks are fairly scarce. For instance, Azimi et al. [8] perform a microstructure segmentation task on SEM images of dual-phase steels using a fully convolutional deep neural network coupled with a super-pixel voting approach. Holm et al. investigate various tasks, among others microstructure segmentation, on their ultra-high carbon steel (UHCS) dataset using an end-to-end deep learning approach [9]. Recently, Thomas et al. [4] published a study implementing a U-Net architecture for damage segmentation, detecting fatigue-induced extrusions and cracks on steel micrographs.

Fine-tuning pre-trained models
When training with little data, the first prevalent practice in DL is fine-tuning a pre-trained model [10]. This procedure is often used synonymously with transfer learning even though the latter now encompasses many further methods. Pre-training a neural network is based on the assumption that its first layers learn similar features regardless of the trained task. Indeed, these layers act as feature extractors, detecting edges, corners, colors, or blobs. Thus, carrying over the weights of a model pre-trained on a large-scale dataset and fine-tuning them on another task reduces the demand for training data in the latter task while accelerating convergence. Different strategies exist, ranging from full model fine-tuning with a small learning rate to freezing the initial layers of the model and fine-tuning only the last ones.
Utilizing readily available model weights from pre-training with ImageNet, owing to its size and richness, has become the status quo for weight initialization in deep learning. Recently, He et al. [11] questioned the undifferentiated use of ImageNet weights, showing that it only speeds up convergence but does not improve performance. However, the low-data regime, i.e., when little data is available for the final task, was shown to be exempt from this finding. Specifically, pre-training on ImageNet culminated in improved COCO object detection [12] only when fewer than 10k images were used for fine-tuning.
In materials science, however, this low-data regime is a lasting consequence of both materials diversity and annotation complexity, making pre-training a suitable strategy to confer robustness and to raise the performance of the trained models [13]. Nevertheless, pre-training dataset selection remains an unsettled issue.
In other domains, this has been the subject of recent research [14,15]. The overarching trend seems logical: using a large pre-training dataset as close as possible to the target task dataset proves beneficial. However, the trade-off between data quantity and domain gap is undefined. Cheplygina et al. [14] condensed 12 studies from the medical field in a detailed review, comparing ImageNet with smaller in-field datasets. The conclusions are not consistent, and no clear trend could be identified. Romero et al. [15] studied the effect of different pre-training datasets for chest X-ray radiograph classification, varying the number of target training images from 50 to 2,000. Pre-training was conducted using ImageNet or one of two X-ray datasets (220k chest images and 40k images of various body parts). They show that pre-training always helps in their low-data regime and emphasize that pre-training on the chest dataset yields the best results. However, their miscellaneous body parts radiography dataset only performed comparably to ImageNet, showing the trade-off between data amount and domain gap when choosing a pre-training dataset.
Gonthier et al. studied similar aspects for artwork classification, where pre-training on ImageNet outperformed another artwork pre-training dataset containing around 80k images, presumably owing to the small size of the artwork pre-training dataset [16]. Interestingly, a gradual two-step pre-training, using first ImageNet and then the intermediate artwork dataset, led to further improvement. This suggests that successive pre-trainings could help to bridge domain gaps gradually.

Unsupervised domain adaptation
Unsupervised domain adaptation (UDA) describes training a model with labeled data from a source domain and unlabeled data from a target domain to perform well on the latter. This is of major interest when the target domain labeling process is costly or when source domain annotations are readily available. A well-known example is the GTA5 [17] or SYNTHIA [18] (source) to Cityscapes [19] (target) task. In both cases, the source datasets provide synthetic and inherently annotated urban landscapes.
A large branch of UDA methods now builds on an adversarial learning framework, which was initially proposed by Ganin et al. [20]. The underlying idea is to force the model to adopt a domain-independent feature representation. This is achieved by passing images of both domains into the main model and feeding intermediate layer feature representations into a discriminator. This discriminator then guesses whether the initial data comes from the source or the target domain. Thereby, the discriminator penalizes internal feature representations when they differ substantially between the two domains. At the same time, the source data is used in a supervised fashion to train the main model for the task. This adversarial approach has been adapted to semantic segmentation by Tsai et al., where matching of internal features and segmentation masks was performed [7].
For the most part, recent articles about UDA report results on the previously mentioned GTA5 to Cityscapes task, and only a few applications can be found in other fields, such as the adversarial training approach in the medical domain [21,22]. To the best of our knowledge, this technique's potential for the materials science community has never been showcased. By doing so, we hope that this work will spark interest in UDA techniques in this field.

Contributions
The contributions of this work are the following:
• We study the impact of different data augmentation policies, models, and pre-trainings on the segmentation performance.
• We show the impact of these different training strategies on the domain generalization (i.e., applicability) of a model to an alternate target domain. Specifically, the domain generalization across SEM micrographs of differently contrasted complex phase steel microstructures and across different imaging modalities is addressed.
• We implement a state-of-the-art UDA approach [7] and show its merit in materials science despite data limitations by bridging the aforementioned domain gaps. The UDA framework's suitability for applications beyond the ones presented here is discussed.

Specimen fabrication and image acquisition methodology
This work is based on SEM images of a low-carbon complex-phase steel, which was the subject of our last work [5]. The specimens were taken from thermomechanically rolled heavy industrial plates. The resulting microstructure is composed of lath-shaped bainite (foreground) as well as polygonal and irregular ferrite with dispersed granular carbon-rich second phase and martensite-austenite (MA) islands (collectively referred to as background). Micrographs were taken in the plate's transverse direction (TD), between quarter- and mid-thickness of the plate. Rolling-induced stress and cooling rate gradients result in a small microstructure variance, where, in some images taken from regions comparatively near the surface, the polygonal ferrite grains are elongated in the rolling direction (RD). During imaging, segregation zones in the plate core were avoided. The specimens were ground using 80-1200 grit SiC papers and then polished with 6, 3, and finally 1 µm diamond grain sizes. Using different etching and imaging conditions as well as imaging modalities, four image sets were drawn from these specimens. The configurations are presented in Table 1.
Etching. The duration was controlled by a metallographer waiting for a macroscopic contrast to become visible to the naked eye.
Etching reveals grain boundaries since the reaction kinetics depend on the local chemical composition and crystallographic orientation. Therefore, carbide films and a few MA constituents are exposed. An example input image, the same with superimposed label, and a detail view of each data set (i.e., domain) are presented in Figure 1.

Imaging conditions. SEM setup 1 utilized a Zeiss Merlin FEG-SEM with secondary electron contrast (in-lens) at a magnification of 2000× and an image size of 2048×1433 pixels (annotation bar cropped), which represents 56.7×42.5 µm² (pixel size = 27.7 nm). The SEM was set at an acceleration voltage of 5 kV, a probe current of 300 pA, and a working distance of 5 mm. Small acceleration voltages reduce the interaction volume and increase surface sensitivity. In SEM setup 2, the micrographs were recorded in a Zeiss Supra FEG-SEM with another in-lens detector. The acceleration voltage, the probe current, and the working distance were the same as in the first setup. The magnification was set at 1000× with a higher image resolution, giving a virtually identical physical pixel size compared to the first setup. Due to subsequent stitching and cropping, images of SEM setup 2 ultimately had the same size as the others. Contrast and brightness settings differed from the first imaging setup, see Figure 1. Within each domain, micrographs were acquired with the same image contrast and brightness settings in the SEM. Lastly, the LOM images were recorded with an Olympus LEXT OLS 4100. Micrographs were taken at a magnification of 1000× with an image resolution of 1024×1024 pixels, corresponding to an area of 129.6×129.6 µm² (pixel size = 126.6 nm). All LOM images were acquired with the same exposure settings.
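The quoted pixel sizes follow from dividing the field of view by the pixel count. The short sketch below reproduces this arithmetic from the image widths given above; the helper name `pixel_size_nm` and the derived scale factor are illustrative only and not part of the original processing code.

```python
def pixel_size_nm(field_of_view_um: float, n_pixels: int) -> float:
    """Physical size of one pixel in nanometres."""
    return field_of_view_um * 1000.0 / n_pixels


sem_px = pixel_size_nm(56.7, 2048)    # SEM setup 1, image width
lom_px = pixel_size_nm(129.6, 1024)   # LOM, image width

# Upscaling factor a LOM image would need to reach the native SEM pixel size
# (illustrative; the exact rescaling procedure is not specified in the text).
lom_to_sem_scale = lom_px / sem_px

print(f"SEM: {sem_px:.1f} nm/px, LOM: {lom_px:.1f} nm/px, scale: {lom_to_sem_scale:.2f}")
```

The two pixel sizes agree with the 27.7 nm and 126.6 nm stated above.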
Annotations. Segmentation masks were drawn manually by human experts on a digital drawing tablet (Wacom). Correlative EBSD maps on the source domain micrographs rendered the annotation more reproducible and accurate. For more details on the acquisition, multi-modal registration, and annotation process, we refer to [5]. Note that for the target domains, with the exception of target 3 (T3), annotations were only available for a small portion, i.e., the test images.

Dataset processing and descriptions
Processing. The LOM images were matched to the native SEM datasets in terms of physical pixel size and field of view by scaling and cropping operations, cf. Figure 1a and 1j. Subsequent processing was identical for all datasets. Images were mirror-padded to make them square so that four tiles can be extracted per image. Following the results of our last paper [5], the images were downscaled by a factor of 0.5 before tiling, effectively increasing the context passed to the models' receptive field [23]. The obtained training tiles are then of size 636×636 (512×512 plus an additional 62-pixel overlap extending into the adjacent tiles at each tile border). This overlap-tile strategy was introduced by Ronneberger et al. [24] to cope with memory restrictions and helps the model to circumvent tile border effects. For segmentation loss computation, the overlap regions extending into neighboring tiles were discarded (i.e., only the 512×512 center region was used). To align the domains' phase fractions as much as possible, a few training tiles of the target domains were discarded, ultimately resulting in the tile numbers and phase fractions listed in Table 1. The source dataset was split into five folds for cross-validation. In contrast, the annotated portion in the target domains was too small to perform cross-validation. Therefore, only a single train and test set was built for the target datasets. While training was performed with tiles, the model evaluation was conducted on full images. Pixel values of the bright-field LOM images (T3) were inverted to give them a dark background similar to SEM.
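The tile geometry described above can be checked with a small sketch. It assumes that the image is padded to a square with the side of its longer edge, that downscaling happens before tiling, and that tiles at the image border reach into a mirror-padded margin, as in the overlap-tile strategy. The function name `tile_boxes` and these details are assumptions, not the authors' implementation.

```python
from typing import List, Tuple

CORE = 512                   # center region used for the segmentation loss
OVERLAP = 62                 # context margin on every side of the core
TILE = CORE + 2 * OVERLAP    # 636


def tile_boxes(width: int, height: int, scale: float = 0.5) -> List[Tuple[int, int, int, int]]:
    """Return (left, top, right, bottom) boxes of the overlapping tiles.

    Coordinates refer to the square, downscaled image. Negative values, or
    values beyond the image side, mean that the tile reaches into a
    mirror-padded margin (assumed here).
    """
    side = int(max(width, height) * scale)   # square side after padding and downscaling
    n = side // CORE                         # tiles per row and per column
    boxes = []
    for row in range(n):
        for col in range(n):
            left = col * CORE - OVERLAP
            top = row * CORE - OVERLAP
            boxes.append((left, top, left + TILE, top + TILE))
    return boxes


boxes = tile_boxes(2048, 1433)
print(len(boxes), boxes[0], boxes[-1])
```

For a 2048×1433 micrograph this gives four 636×636 tiles whose 512×512 cores cover the 1024×1024 downscaled image without gaps.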
Descriptions. The microscopic differences between the four datasets become evident in the detail views in Figure 1c, f, i, l. The electrolytic Struers A2 etching (source) does not emphasize the sub-grain boundaries and results in a comparatively slender carbide film appearance overall. In contrast, target 1 (T1) and, to a lesser extent, target 2 (T2) highlight the hierarchical structure in the lath-bainite regions (see arrow annotations in Figure 1f, i). However, these two target domains are also accompanied by etching artifacts (see ellipse annotations). In T2 and T3, as highlighted by the rectangle annotations, some grain boundaries appear faded due to their grain boundary inclination and the progressed etching state. Along with the pronounced contrast and wider carbide films, this makes it evident that T2 was over-etched. In the bright-field LOM images (T3), carbide film morphology cannot be resolved and image features differ substantially due to the modality change.
In conclusion, these target domain datasets represent typical but distinct degrees of domain gap, where the shift gradually increases with respect to the source domain. In addition, statistical differences are present, particularly in the T2 domain, where the lath-bainite phase fraction deviates significantly from the source.

Pre-training datasets
The first pre-training dataset used in this study is ImageNet [25]. Further, the apparent domain gap between ImageNet and our target datasets motivated us to test an additional pre-training dataset with a smaller domain gap. Therefore, we selected an SEM dataset of nanoscientific objects [26], which comprises approximately 22k images, non-uniformly distributed in 10 classes: biological, fibers, films and coated surfaces, microelectromechanical systems (MEMS) and electrodes, nanowires, particles, porous sponges, patterned surfaces, powders, and tips. In the following, we refer to this pre-training dataset as NanoSEM. Before using the NanoSEM dataset, a pre-processing cleaning step was performed. Images with many burned-in measurement annotations were discarded and SEM annotation bars were cropped to avoid spurious correlations between annotation bars and class predictions, which are known to occur otherwise. Finally, the pre-training dataset amounted to 18,750 images. More details about the pre-training methodology and results are supplied in Section 2.4.2 and the Supplemental.

Segmentation architectures
As part of this work, two main segmentation architectures are implemented. The well-established U-Net [24] is used first to investigate different pre-training strategies. This fully convolutional architecture is, as its name implies, composed of an encoder-decoder structure with skip connections between the corresponding levels of the encoder and decoder. It gave outstanding results on medical segmentation tasks even with very little training data.
Among the many segmentation models that were proposed after the U-Net, one series of models marked a turning point in this field. Chen et al. published the first version under the name DeepLab [27]. In this work, the second DeepLab version is implemented (DeepLabv2). This architecture uses so-called dilated convolutions (or atrous convolutions), which help the model to enlarge its field of view (receptive field) and take patterns at larger scales into account appropriately. The main idea of DeepLabv2 is to learn and aggregate patterns at different scales with dilated convolutions having different dilation rates. This aggregation of dilated convolutions effectively causes a more uniform distribution in the effective receptive field [23].
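To illustrate how dilation enlarges the field of view without adding weights, the sketch below computes the spatial extent covered by a dilated kernel, k_eff = k + (k - 1)(d - 1), and applies a one-dimensional dilated filter in plain Python. The dilation rates in the loop are illustrative values and are not taken from the configuration used in this work.

```python
def effective_kernel_size(kernel_size: int, dilation: int) -> int:
    """Extent (in pixels) spanned by a kernel with the given dilation rate."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def dilated_conv1d(signal, kernel, dilation=1):
    """'Valid' 1-D dilated correlation (no padding, stride 1)."""
    span = effective_kernel_size(len(kernel), dilation)
    return [
        sum(w * signal[start + i * dilation] for i, w in enumerate(kernel))
        for start in range(len(signal) - span + 1)
    ]


# A 3-tap kernel keeps its three weights but covers a growing extent.
for rate in (1, 6, 12, 18, 24):
    print(rate, effective_kernel_size(3, rate))

# With dilation 2, the taps sample positions start, start + 2, start + 4.
print(dilated_conv1d([0, 1, 2, 3, 4, 5, 6], [1, 0, -1], dilation=2))
```

Aggregating several such branches with different rates is what lets one layer respond to patterns at multiple scales.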
The encoder used for the U-Net is a portion of the VGG16 classification network [28], while DeepLabv2 was built with a ResNet-101 [29] encoder. The exact architectures for both cases are given in the Supplemental and in [7], respectively.
For segmentation training, a binary cross-entropy loss and an Adam optimizer were employed. Learning rates, batch sizes, and training times vary throughout this study. Thus, these parameters will be specified in Section 3 for the different experiments. The models were trained on a GPU cluster node consisting of four parallel NVIDIA Ampere A100 GPUs.

Pre-training & fine-tuning procedure for domain generalization experiments
For the ImageNet pre-training, we used pre-trained weights provided by the Python package Segmentation Models PyTorch [30].
Concerning our self-performed NanoSEM pre-training, we passed the U-Net encoder output to an auxiliary classification head [30]. This head consists of a global average pooling layer, followed by 50% dropout and a linear layer with a sigmoid activation. The auxiliary classification head facilitates encoder training on classification datasets.
For NanoSEM pre-training, ImageNet weights were used as an initialization, making this process a two-step pre-training (from ImageNet to NanoSEM to the final task). The 18,750 NanoSEM images were split into 15k images for training and 3,750 images for testing purposes. Pre-training used an Adam optimizer with a constant learning rate that was identical for all encoder layers and was run for 100 epochs. The obtained pre-trained models were transferred to the segmentation task by copying the weights of the model performing best on the pre-training task. The full model was then fine-tuned on the source dataset (without frozen layers) with a reduced learning rate for the pre-trained encoder (10× lower than the decoder learning rate). The two-stage pre-training process along with fine-tuning is summarized in Figure 2. Please note that the individual learning rates applied at the pre-training and fine-tuning stages are of major importance. An optimization of the learning rate used for pre-training on NanoSEM has been carried out and is given in the Supplemental. In the case of the VGG16 U-Net model, aside from random initialization, either the two-stage pre-training or ImageNet pre-trained weights were used as initial conditions before fine-tuning to the source domain. This allowed us to investigate the impact of pre-training on the generalization capability to the target domains (see Table 1). In contrast, for the DeepLabv2 models, solely ImageNet pre-trained weights were used. The described procedures for both architectures did not utilize any target domain data. Instead, source domain fine-tuned models were directly tested on target domain data (domain generalization).
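The reduced encoder learning rate during fine-tuning can be expressed as two optimizer parameter groups. The sketch below is framework-free: it only builds the list of dictionaries in the layout that PyTorch optimizers such as `torch.optim.Adam` accept as their first argument. The function name, the placeholder parameter names, and the example learning rate are hypothetical.

```python
def fine_tuning_param_groups(encoder_params, decoder_params, decoder_lr, encoder_lr_factor=0.1):
    """Build per-module optimizer settings with a reduced encoder learning rate.

    Nothing is frozen: both groups are trained, but the pre-trained encoder
    moves 10x more slowly than the decoder by default.
    """
    return [
        {"params": list(encoder_params), "lr": decoder_lr * encoder_lr_factor},
        {"params": list(decoder_params), "lr": decoder_lr},
    ]


# Placeholder names stand in for real parameter tensors (hypothetical).
groups = fine_tuning_param_groups(["enc.w1", "enc.w2"], ["dec.w1"], decoder_lr=5e-3)
for group in groups:
    print(len(group["params"]), group["lr"])
```

With real modules, the two lists would typically come from the encoder's and decoder's parameter iterators.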

Unsupervised domain adaptation - An adversarial framework
To additionally take advantage of unlabeled target domain data, which can often be created in abundance with little effort, the UDA method of Tsai et al. [7] has been implemented. This adversarial framework can be used to train a semantic segmentation model in an unsupervised domain adaptation setting. It is based on the original idea proposed by Ganin et al. [20]. The code of [7] was adapted to make it compatible with our data. Figure 3 depicts the training process in a simplified fashion.
As proposed in the original framework [7], we utilize a DeepLabv2 with a ResNet-101 as the segmentation architecture, which facilitates comparability with the corresponding domain generalization experiments described at the end of Section 2.4.2. In the UDA framework, annotated source and unannotated target domain data are fed into this segmentation model (shared weights). The source domain prediction is used for training the segmentation model in a supervised manner, given that labels are available in this domain. This gives the first part of the loss function ($L_{\mathrm{seg}}$), evaluated as a binary cross-entropy. Furthermore, source and target domain predictions are passed to a discriminator model, which attempts to classify from which domain the prediction comes. The second part of the loss, the so-called adversarial loss ($L_{\mathrm{adv}}$), quantifies the ability of the segmentation model to fool the discriminator. It is also computed as a binary cross-entropy for the domain classification. In addition to the segmentation outputs, network-internal feature representations of both domains are extracted from an auxiliary segmentation head, reshaped to segmentation mask size (auxiliary segmentation), and passed to the discriminator ($L_{\mathrm{adv}}^{\mathrm{aux}}$). This is not represented in Figure 3 for the sake of simplicity. Moreover, the source domain auxiliary segmentation is compared to the annotation mask ($L_{\mathrm{seg}}^{\mathrm{aux}}$). The different loss parts are weighted so that emphasis can be put on either the segmentation or the adversarial losses, introducing three further hyperparameters. Note that the loss portions related to the auxiliary feature output are typically weighted less, making Figure 3 a reasonable simplification of the framework. This complex architecture forces the segmentation model to learn features that are domain-independent, rendering the transfer from the source to the target domain possible.
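The interplay of the four loss parts can be sketched in plain Python. The example assumes that the segmentation network's objective is a weighted sum with the main segmentation loss as the unweighted reference term, and that the adversarial terms are binary cross-entropies of the discriminator output on target-domain predictions against the "source" label. The three weights are arbitrary placeholders, not the values used in this study, and all function names are illustrative.

```python
import math


def binary_cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy between probabilities and 0/1 targets."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)   # clamp for numerical safety
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)


def segmentation_network_loss(l_seg, l_seg_aux, l_adv, l_adv_aux,
                              w_seg_aux=0.1, w_adv=0.001, w_adv_aux=0.0002):
    """Weighted sum of the four loss parts (the three weights are placeholders)."""
    return l_seg + w_seg_aux * l_seg_aux + w_adv * l_adv + w_adv_aux * l_adv_aux


# Toy per-pixel foreground probabilities and labels on a source tile.
l_seg = binary_cross_entropy([0.9, 0.2, 0.7], [1, 0, 1])
l_seg_aux = binary_cross_entropy([0.8, 0.3, 0.6], [1, 0, 1])

# Discriminator outputs P(domain = source) for target-domain predictions.
# The segmentation network is rewarded when these approach 1 ("fooled").
l_adv = binary_cross_entropy([0.3, 0.4], [1, 1])
l_adv_aux = binary_cross_entropy([0.5, 0.5], [1, 1])

total = segmentation_network_loss(l_seg, l_seg_aux, l_adv, l_adv_aux)
print(round(total, 4))
```

The discriminator itself is trained separately with the true domain labels; only the segmentation network's side of the objective is shown here.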

Data augmentation
Data augmentation is a common practice in ML for increasing the amount of labeled data without additional annotation cost. It consists of applying transformations to the data before passing it to the model. The main objective is to render the network invariant to specific transformations. As part of this work, a simple flip and 90° rotation pipeline (with probability 0.5) was first tested (marked as basic). Moreover, an extended pipeline, which makes use of, among others, elastic transformations and was optimized for the source domain in our last publication [5], was implemented (marked as extended). Both pipelines were built with the Albumentations package [31]. Full details about the pipelines are given in the Supplemental. Data augmentation was applied to train all our models except for the NanoSEM pre-training. In the UDA framework, both source and target datasets were augmented using the optimized pipeline.
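A plain-Python stand-in for the basic pipeline is sketched below. It is not the Albumentations implementation used in this work. It assumes that horizontal flip, vertical flip, and a 90° rotation are each applied with probability 0.5, and it highlights the point that matters for segmentation: the image and its mask must receive the same transform.

```python
import random


def hflip(img):
    """Mirror a 2-D nested list left to right."""
    return [row[::-1] for row in img]


def vflip(img):
    """Mirror a 2-D nested list top to bottom."""
    return img[::-1]


def rot90(img):
    """Rotate a 2-D nested list by 90 degrees counter-clockwise."""
    return [list(col) for col in zip(*img)][::-1]


def basic_augment(image, mask, p=0.5, rng=random):
    """Apply each transform with probability p, identically to image and mask."""
    for transform in (hflip, vflip, rot90):
        if rng.random() < p:
            image, mask = transform(image), transform(mask)
    return image, mask


image = [[1, 2], [3, 4]]
mask = [[0, 1], [1, 0]]
aug_image, aug_mask = basic_augment(image, mask, rng=random.Random(0))
print(aug_image, aug_mask)
```

Drawing one random decision per transform and applying it to both arrays keeps the labels aligned with the pixels they describe.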

Evaluation metrics
For the evaluation of the segmentation models, we use an intersection over union (IoU) metric, see Equation 1, averaged over background and foreground classes (mIoU). In order to evaluate the trained models' generalizability, we used a relative mIoU deviation between the source domain performance with sole supervised learning and the target domain performance with the concerned generalization method (either domain generalization or UDA). We refer to this metric as relative domain transferability (RDT; see Equation 2). Using a relative deviation avoids overrating models that perform better on the target domains because of their inherent advantage on the source domain. For instance, a hypothetical model with 70% and 65% mIoU on the source and target (RDT = -0.07) generalizes better than one yielding 80% and 70% mIoU (RDT = -0.13), despite the latter's better target performance.

$$\mathrm{mIoU} = \frac{1}{2}\left(\frac{TP}{TP + FP + FN} + \frac{TN}{TN + FN + FP}\right) \quad (1)$$

$$\mathrm{RDT} = \frac{\mathrm{mIoU}_T - \mathrm{mIoU}_S}{\mathrm{mIoU}_S} \quad (2)$$

TP, TN, FP, and FN represent true positive, true negative, false positive, and false negative pixels, respectively. mIoU_T and mIoU_S are the class-averaged model performance on the target domain and reference source domain, respectively.
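Both metrics can be computed from pixel counts in a few lines. The sketch assumes the usual binary form in which the background IoU is obtained by swapping the roles of positives and negatives, and it defines RDT as the target-minus-source mIoU difference divided by the source mIoU, which reproduces the two worked examples above. The function names are illustrative.

```python
def iou(tp, fp, fn):
    """Intersection over union of one class from pixel counts."""
    return tp / (tp + fp + fn)


def mean_iou(tp, tn, fp, fn):
    """Average of foreground IoU and background IoU for a binary task."""
    foreground = iou(tp, fp, fn)
    background = iou(tn, fn, fp)   # roles of positives and negatives swapped
    return 0.5 * (foreground + background)


def relative_domain_transferability(miou_target, miou_source):
    """Relative mIoU deviation of the target result from the source reference."""
    return (miou_target - miou_source) / miou_source


# Worked examples from the text: (source, target) = (70%, 65%) and (80%, 70%).
print(f"{relative_domain_transferability(0.65, 0.70):.3f}")
print(f"{relative_domain_transferability(0.70, 0.80):.3f}")
```

A value closer to zero (or positive) indicates that less performance is lost when moving to the target domain.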

Results
In the first part (Section 3.1), the fully supervised results on the source domain are presented, serving as a reference for the segmentation task. Subsequently, performance on the target domains (T1, T2, T3, see Table 1) is tested in Section 3.2 by directly applying the former models (domain generalization, DG) and by unsupervised domain adaptation (UDA), showing the latter method's benefit.

Supervised learning on the source domain
Models were trained on the source dataset in a supervised fashion. Five-fold cross-validation was implemented to investigate the influence of different training data configurations. The VGG16 U-Net architecture was used to test the aforementioned pre-training and data augmentation settings in order to maximize the phase segmentation task performance. A batch size of 12 with a constant learning rate of λ = 5E-3 was utilized. The models were trained for 200 epochs, except for experiment S.3, which required more iterations due to the random initialization coupled with the extended augmentation pipeline (increased data variance). This experiment was thus extended to 400 epochs.
Additionally, a ResNet-101 DeepLabv2 model was trained in a supervised manner. This segmentation architecture corresponds to the one used in the UDA framework, rendering the results comparable. For this architecture, only the pre-training on ImageNet was implemented. The models were trained for 550 epochs with an initial learning rate of λ = 1E-3 and a polynomial decay with a decay factor of 0.9. Note that a larger epoch number was required for this architecture because of ResNet-101's higher number of model parameters. The mentioned training epoch numbers were chosen to ensure proper convergence of the validation mIoU. The results are given in Table 2.
The results in Table 2 demonstrate that pre-training helps the model in this low-data regime. Indeed, we observe the systematic trend that the two-stage NanoSEM pre-training outperforms the ImageNet one, which in turn surpasses random initialization, regardless of the augmentation pipeline used. Similarly, the extended augmentation pipeline consistently yields better results than the basic one, which in turn compares favorably to the unaugmented result. Thus, the best performance is observed with the NanoSEM pre-training and the extended data augmentation pipeline, reaching 80.2% mIoU as an average over the five folds (S.9).
Experiment S.10 consists of the supervised training of the DeepLabv2 architecture on the source domain. With ImageNet pre-training, DeepLabv2 slightly outperforms the VGG16 architecture on this task (cf. S.10 to S.6).
In the following sections, the mIoU values in Table 2 are used as reference values to compute the RDT metrics of the equivalent DG experiments. Analogously, model S.10 serves as the reference for all the UDA-based models.
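The polynomial learning rate decay mentioned for the DeepLabv2 training can be sketched as follows. The sketch assumes that the stated "decay factor of 0.9" is the exponent of the common "poly" schedule, lr = lr_0 · (1 − step / max_steps)^0.9, and that the schedule is stepped once per epoch. Neither detail is stated explicitly in the text.

```python
def poly_lr(initial_lr: float, step: int, max_steps: int, power: float = 0.9) -> float:
    """Polynomial ("poly") learning rate schedule decaying to zero at max_steps."""
    return initial_lr * (1.0 - step / max_steps) ** power


# Learning rate at the start, midpoint, and last epoch of a 550-epoch run.
for epoch in (0, 275, 549):
    print(epoch, f"{poly_lr(1e-3, epoch, 550):.2e}")
```

With an exponent slightly below 1, the rate falls almost linearly for most of the training and drops more steeply near the end.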

Model generalization and domain adaptation to target domains
In this section, the main objective is to achieve good segmentation results on the target datasets introduced in Table 1 even though no labeled training data is available in these domains. The following three subsections are dedicated to the three target datasets with progressively increasing domain shifts.

Etching type variation - A small domain shift to start with (T1)
The models trained in Section 3.1 were tested on this target domain, and the aforementioned UDA framework was used to improve the results. These UDA models were trained with a polynomial learning rate schedule (λ = 1E-3, decay factor of 0.9) for 3,000 epochs and a batch size of 8 for both source and target tiles. Once again, the number of epochs was chosen to reach satisfactory convergence of the validation mIoU. The results are given in Table 3, and some visualizations are provided in Figure 4 to show the advantage of the UDA method over domain generalization.

Table 3. Performance of the models trained on the source domain evaluated on the target 1 domain (cf. Table 1) along with the performance reached by UDA. The given mIoU is always averaged over the five folds used for cross-validation. NanoSEM refers to the two-step pre-training introduced in Section 2.4.2. The augmentation pipelines are detailed in the Supplemental.

Looking at the results from Table 3, it first appears surprising that most models trained on the source domain perform excellently on T1, sometimes even exceeding the source performances, resulting in positive RDT values. This will be discussed in Section 4.2. Secondly, the general tendency that pre-training helps domain generalizability is evident, as random-initialized models yield the lowest RDT values. Despite the aforementioned good model transferability between source and T1, UDA clearly surpasses DG (compare T1.11 to T1.10). Aside from the 2.5% mIoU increase in favor of the UDA approach, the obtained model is more balanced in terms of class mispredictions. Figure 4 illustrates this phenomenon with two examples. Without UDA, the models transferred from scratch exhibit a behavior skewed towards the background class, thus giving substantially more false negatives than false positives. In Figure 4b, d, the number of false negatives is reduced while false positives increase slightly, giving an overall better segmentation and an improved phase fraction estimation in the UDA case. Additionally, it can be observed in Figure 4 that both models (DG and UDA) appropriately detect the lath-shaped areas as bainite, whereas the DG model particularly struggles to find the proper boundaries between the foreground and background classes. The main improvement offered by UDA lies in the better boundary localization (Figure 4b, d).

Etching type and imaging setup variation - A larger domain shift to address (T2)
Analogous to the previous subsection, this one addresses the T2 dataset (cf. Table 1). Results are given in Table 4 and visualizations in Figure 5.

Table 4. Performance of the models trained on the source domain evaluated on the target 2 domain (cf. Table 1) along with the performance reached with UDA. The given mIoU is always averaged over the five folds used for cross-validation. NanoSEM refers to the two-step pre-training introduced in Section 2.4.2. The augmentation pipelines are detailed in the Supplemental.

Regarding domain generalization (DG), two major observations can be made. First, exactly as for T1, it appears that pre-training helps. Second, as opposed to T1, the basic augmentation pipeline improved over the extended one. Moreover, there is a significant RDT drop in T2, rendering this a more challenging task for the UDA framework.

In this case, UDA gives a pronounced advantage, exceeding the DG DeepLabv2 model by 6.3% mIoU. Figure 5 displays how UDA corrects the skew towards the background class, leading to a segmentation that is better and more balanced in terms of misclassifications. While difficult regions at the top of Figure 5a, b remain challenging for the UDA network, larger lath-shaped regions are segmented more comprehensively (cf. bottom of Figures 5a and 5b or 5c and 5d). The classification of these difficult regions at the top of Figure 5a, b is equally complicated for human experts.

Furthermore, some checkerboard patterns appear clearly in Figures 5c and 5d. These periodic patterns originate from the bilinear interpolation used in the DeepLabv2 architecture to restore the input image resolution after the encoding stage, and they occur in regions of model uncertainty. In such uncertain areas, the segmentation could be improved by combining models in a voting scheme (i.e., bagging) to obtain better final predictions. Such a bagging strategy will be briefly discussed in Section 4.2.

Bridging the gap between modalities (T3)
Table 5. Performance of the DeepLabv2 model trained on the source domain evaluated on the target 3 domain (cf. Table 1) along with the performance reached with UDA. The given mIoU is always averaged over the five folds used for cross-validation.

Lastly, the T3 (LOM) dataset has been investigated. Along with the UDA performance, the domain generalization DeepLabv2 model is reported as the sole baseline in Table 5. The VGG16 U-Net models are omitted, considering their poor performance on this target domain. In this case, the large scatter between the five folds' results made it impossible to draw any conclusion about the relative domain generalizability of individual models on T3.
While domain generalization seems compromised on this target dataset, a tremendous improvement of 23.6% mIoU is obtained when using the UDA method. Moreover, the scatter over the five folds is reduced substantially. Considering the few prediction examples in Figure 6, it is apparent that UDA turns a completely unusable model into a convincing one without requiring any labeled data in the target domain.

Supervised-learning on the source dataset
Achieving the results above with this source data set (see Table 2) contrasts with the common preconception that DL techniques require a large quantity of training data. Specifically, satisfactory results on this complex microstructure segmentation task were attained despite using barely more than 100 tiles (27 native SEM images) for training. This can be explained by two factors. First, the data has been acquired in a very repeatable manner. Images of each dataset (i.e., domain) were drawn from an individual etched specimen, and reproducible imaging conditions were applied among images of the same domain. It has to be underlined that this results in comparatively low intra-domain variance, which might not be representative of large-scale datasets acquired by multiple operators or even different institutions. Second, since the native micrographs exhibit a high resolution and rich feature density, 27 such images still represent an appropriate learning foundation for our binary segmentation task.
The results in Table 2 emphasize that pre-training improves the performance of the trained models, giving up to 1.5% mIoU improvement between the best random-initialized model (S.3) and the best NanoSEM pre-trained model (S.9). Moreover, pre-trained weight initialization led to faster model convergence. Even with further training, it was observed that random-initialized models did not catch up to the pre-trained ones. Therefore, the dataset is situated in the low-data regime mentioned by He et al. 11 , where pre-training elevates the performance irrespective of training iterations. In addition, the two-step NanoSEM pre-training shows better performance compared to ImageNet in all cases. In contrast, pre-training solely on NanoSEM resulted in poor model performance (not reported here). These observations are in line with Gonthier et al. 16 , who also performed a two-step pre-training process to gradually bridge the domain gap between real-world image datasets and artwork datasets. Please note that this gradual pre-training procedure, compared to conventional pre-training, introduces a further learning rate hyperparameter, which is known to affect the final task performance sensitively. Therefore, relatively more learning rate optimization is required for the pre-training and fine-tuning steps. More details on the learning rate variation are given in the Supplemental.
This two-stage pre-training on ImageNet and NanoSEM might be called into question considering the slight performance increase over sole ImageNet pre-training (1% improvement from experiment S.6 to S.9). However, it has to be emphasized that NanoSEM is far from being the optimal pre-training dataset for our target task. Indeed, it entails the following limitations:
• Structures in certain classes such as MEMS, patterned surfaces, and tips contain shape-related features but barely any apparent microstructural ones, making the learned weights possibly sub-optimal for the final task.
• The image formation is complicated in SEM and depends on a multitude of settings. Most images in the NanoSEM dataset, depending on the class, were acquired with either an Everhart-Thornley (SE2) detector or an in-lens detector. Therefore, concerning the detector class, only the latter portion of the pre-training data matches the acquisition of both source and target datasets. Generally, the SE2 detector exhibits a more pronounced topography sensitivity due to its location and orientation, while the in-lens detector combines surface topography and, to a lesser extent, material contrast.
• NanoSEM represents a classification task. Hence, only the encoder of our segmentation model could be pre-trained. While it can be assumed to be domain gap dependent, there is no quantitative understanding of the extent to which, and of how many layers of, a segmentation model would benefit from decoder pre-training. Conventional pre-training was reported to primarily help the models' first layers to learn general features 10 .
• For pre-training standards, NanoSEM is comparatively small.
Despite these inadequacies, the underlying rationale of utilizing this NanoSEM dataset for pre-training was that high-level characteristics such as noise levels and typical image textures can be learned. However, we assume that a more extensive dataset involving a micrograph segmentation task of an arbitrary alloy would prove beneficial over NanoSEM. For instance, the ultra-high carbon steel micrographs collection subset introduced in 9 would have been appropriate if not for its small size. A more promising candidate could be the recently published Aachen-Heerlen annotated steel microstructure dataset 32 containing annotated martensite-austenite islands. While this dataset's annotations exhibit a systematic offset at instance boundaries, potentially causing adverse effects during learning, such tendencies can presumably be unlearned during fine-tuning.
With respect to data augmentation, a systematic increase in performance is observed when applying the two pipelines, which is not surprising considering the low amount of data used for training the models.
Lastly, it appears that the DeepLabv2 architecture achieves better results compared to the U-Net one (compare experiments S.6 and S.10). However, the improvement is relatively small considering the model size difference. A possible explanation is that our segmentation task does not exploit the full representation power of the ResNet-101 DeepLabv2 architecture. Another potential cause might be that the tile size used for training is too small. Indeed, the DeepLabv2 architecture is built to learn large receptive fields thanks to its dilated convolutions. Using 636×636 tiles may hinder the learning process by forcing the model to learn smaller receptive fields than it was designed for in order not to suffer from border effects.

Transferability to other domains
Pre-training not only helps improve performance on the source domain but also brings generalizability to the trained models. Table 6 provides the obtained mIoU on the T2 domain when applying the best source domain trained models after 100 and 200 epochs of fine-tuning. The 200-epoch results were already contained in Table 4. Generally, pre-trained models perform better, especially with NanoSEM. Interestingly, when pre-training with ImageNet, the models' generalization power decreases between 100 and 200 epochs. It can be assumed that, during prolonged training, weights are tweaked such that very dataset-specific features progressively replace general ones. While NanoSEM pre-training achieves the best results after 100 epochs, the difference between ImageNet and the two-stage NanoSEM pre-training after 200 epochs is marginal. However, quantifying and visualizing this phenomenon of increasing model specialization through prolonged training remains an open question.
Contrary to T1, the basic augmentation pipeline consistently outperforms the extended one for T2. This poor domain generalizability (Exp. #T2.3, T2.6, T2.9) suggests that models were rendered invariant to some task-relevant features of T2 when trained with the extended pipeline. It should be emphasized again that this pipeline was optimized for the source domain, which exhibits a substantially wider domain gap with T2 compared to T1. Extended data augmentation causing a drop of generalizability has previously been observed in 4 .
Concerning the UDA framework, the obtained results are very encouraging. For T1, it appeared that due to the minimal domain shift with respect to the source, transferring source-trained models already performed satisfactorily (cf. experiments T1.6, T1.9). Hence, the problem posed to UDA here is not overly challenging. The mIoUs achieved by the DG models on this target domain even exceed the source mIoUs. Presumably, this can be attributed to the additional parallel features introduced by the subgrain boundaries (see Figure 1), rendering the prediction easier. This was verified using the GradCAM network visualization technique, which computes the network gradients at a specific layer with respect to a target class and thereby estimates pixel-wise activation. For more information, we refer to 33 .
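The GradCAM computation referred to above can be sketched as follows, assuming that the feature maps of the chosen layer and the gradients of the target-class score with respect to them have already been extracted from the network. The function name is hypothetical, and how the class score is aggregated over pixels for a segmentation model is left open here since the text does not state it.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Grad-CAM heat map from layer activations and gradients.

    activations: array of shape (K, H, W), feature maps of one layer.
    gradients:   array of shape (K, H, W), d(class score) / d(activations).
    Returns an (H, W) map scaled to [0, 1] (all zeros if nothing is positive).
    """
    activations = np.asarray(activations, dtype=float)
    gradients = np.asarray(gradients, dtype=float)
    if activations.ndim != 3 or activations.shape != gradients.shape:
        raise ValueError("expected two arrays of identical shape (K, H, W)")
    # Channel importance: spatial average of the gradients.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of the feature maps, keeping positive evidence only.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


# Toy example with two channels of opposite importance.
toy_acts = np.array([[[1.0, 0.0], [0.0, 0.0]],
                     [[0.0, 2.0], [0.0, 0.0]]])
toy_grads = np.stack([np.ones((2, 2)), -np.ones((2, 2))])
toy_cam = grad_cam(toy_acts, toy_grads)
```

The resulting low-resolution map is usually upsampled to the input size before being overlaid on the micrograph.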
In Figure 7, this technique has been applied to the source and T1 domain to determine regions that were deemed important by the (same) ResNet-101 model (trained on source). It is clear that the activation is more extensive in the T1 domain and additionally involves the subgrain boundaries inside the laths (red arrow annotations). This supports the theory that these additional features induced by Nital etching are beneficial for the model. In our prior study, we discovered that image downscaling for the source domain culminates in a performance increase since the pixel gap between carbides at lath boundaries is reduced, and information loss is minimal 5 . Therefore, the parallelism of these features can be assessed at earlier network layers. The GradCAM results on layer 3_16 indicate that the hierarchical microstructure and internal subgrain boundary features revealed in T1 can help to bridge the otherwise feature-sparse bainitic ferrite regions in the source domain to improve learning. Note that the activation in Figure 7c is high where parallel carbide films are in close vicinity. The fact that these parallel carbide films are decisive features indicates that the performance of these trained models could be compromised when evaluated on cross-sections in rolling or normal direction due to their distinct microstructural patterns 34 . Moreover, considering the small test sets, it cannot be excluded that the T1 test images are easier to predict on average, which would contribute to the better T1 performance. Overall, the UDA framework gave a 2.5% mIoU boost on this target domain (T1.11 compared to T1.10). Furthermore, training the UDA framework with T1 as the target domain gave models that perform better on the source domain with 79.7% mIoU, granting a 0.3% boost compared to experiment S.10. Similarly, such small domain gaps led to the same observation in the context of urban images 35 .
On the other hand, T2 and T3 have broader gaps with respect to the source domain due to stronger etching and different imaging modalities, respectively. Despite the large domain gap between the T3 dataset and the source one, UDA performed substantially better on this dataset than on T2, culminating in a 23.6% mIoU improvement over the DG experiments, which is mirrored in Figure 6. As a reference, fully supervised training on T3 presented in our last paper 5 achieved 79% mIoU, showing that employing UDA results in a drop of only 6%, despite not relying on target labels. This result with respect to bridging modalities is promising and in line with literature where domain adaptation in the medical field was successfully applied to transition between computed tomography and magnetic resonance imaging 21 . Note that UDA models trained with the T2 and T3 datasets scored 78.8% and 75.8% mIoU on the source domain, respectively, falling short compared to the fully supervised training (Exp. #S.10: mIoU = 79.4%). Indeed, UDA reaches a compromise between source and target, which is detrimental to the source domain when large domain gaps with the target are involved. This observation confirms that T1, T2, and T3 exhibit a gradually increasing domain shift with respect to the source domain.
The difference in UDA performance on the target domain between T2 and T3 could be attributed to the 5× larger data amount available for the latter set (see Table 1), where 48 unannotated training tiles for T2 might be insufficient. Another reason could be the phase fraction of the T2 dataset, which is substantially lower than that of the source dataset. This assumption will be discussed in the following Section 4.3. Lastly, we consider it unlikely that the SEMs' different distortion and noise level characteristics (see Section 2.1) are causing this difference since the UDA framework can cope with different modalities and corresponding data augmentations were applied.
Additionally, to improve over the individually trained UDA models, we also implemented a bagging strategy. This consists of averaging the predictions of multiple models to give an improved segmentation. In our case, we use the models trained with the five different folds and achieve 85.9%, 70.2%, and 75.0% mIoU, which results in 1.2%, 2.9%, and 1.7% increases over the best results presented in the T1, T2, and T3 result tables, respectively. The larger improvement for T2 can be attributed to its comparatively weak individual classifiers, making the bagging paradigm relatively more profitable.
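The averaging step of such a bagging strategy can be sketched as follows. The sketch assumes that the per-class probability maps of the fold models are averaged before taking the arg-max; whether the study averages probabilities or hard labels is not stated, and the function name is hypothetical.

```python
import numpy as np


def bag_predictions(prob_maps):
    """Average class-probability maps of several models and take the arg-max.

    prob_maps: sequence of arrays with shape (C, H, W), one per model.
    Returns an (H, W) integer label mask.
    """
    stacked = np.stack([np.asarray(p, dtype=float) for p in prob_maps], axis=0)
    mean_prob = stacked.mean(axis=0)   # (C, H, W)
    return mean_prob.argmax(axis=0)    # (H, W)


# Three toy models, two classes, an image of 1 x 2 pixels.
model_a = np.array([[[0.1, 0.8]], [[0.9, 0.2]]])
model_b = np.array([[[0.6, 0.7]], [[0.4, 0.3]]])
model_c = np.array([[[0.6, 0.9]], [[0.4, 0.1]]])
bagged = bag_predictions([model_a, model_b, model_c])
```

In the toy example, the first pixel is assigned to the foreground because one confident model outweighs two hesitant ones, which a hard majority vote would not reproduce.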
Lastly, as opposed to DG models, which are frequently biased towards a class (cf. Figures 4, 5), UDA leads to more balanced models. Consequently, this characteristic improves the estimation of phase fractions or other metrics that do not require full details of the segmentation mask. As an example, predicted phase fractions on the different target datasets with DG and UDA are reported in Table 7. It is evident that UDA improves phase fraction estimation in all cases and additionally reduces the scatter substantially (e.g., by a factor of 10 for T3). The large observed scatter of DG models for T2 and T3 predictions can be explained by the different training folds, each leading to skewed predictions in favor of either background or foreground class. Using UDA systematically reduced the skewed behavior and gave models that are only slightly biased towards the background class. This suggests that UDA-based models, when applied to the target domains, misinterpret some foreground class features. One potential cause could be incompletely bridged gaps with the source domain. Another possibility for this consistent lath-bainite underestimation could be the labeling process. Unlike the target domains, the source was labeled with supporting EBSD images, which could conceivably result in different annotation patterns. Lastly, one should recall that the fairly small test set size renders the results very sensitive to any labeling inconsistencies. For these reasons, the impact of intra-rater reliability is presumably elevated. Another constraint was witnessed when trying to bridge the gap between SEM and LOM data (T3). Indeed, the learning process failed when feeding the model with bright-field LOM tiles, motivating us to invert their pixel values. As the segmentation architecture shares weights between source and target data, it might be difficult to learn filters that perform well on both modalities while keeping internal features independent of the originating domain.
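The phase fraction comparison discussed above (Table 7) can be illustrated with a minimal sketch. It assumes that the phase fraction is the share of foreground pixels over a whole test set and that the scatter is the population standard deviation over the fold models; both choices, as well as the function names, are assumptions made for illustration.

```python
import numpy as np


def phase_fraction(masks, foreground: int = 1) -> float:
    """Share of foreground pixels over a list of label masks."""
    fg = sum(int((np.asarray(m) == foreground).sum()) for m in masks)
    total = sum(int(np.asarray(m).size) for m in masks)
    return fg / total


def summarize_folds(masks_per_fold, foreground: int = 1):
    """Mean and standard deviation of the phase fraction across fold models.

    masks_per_fold: one list of predicted masks per fold model.
    """
    fractions = [phase_fraction(masks, foreground) for masks in masks_per_fold]
    return float(np.mean(fractions)), float(np.std(fractions))


# Two toy fold models predicting on the same single 2 x 2 test tile.
fold_1 = [np.array([[1, 1], [0, 0]])]
fold_2 = [np.array([[1, 0], [0, 0]])]
mean_fraction, std_fraction = summarize_folds([fold_1, fold_2])
```

Comparing `mean_fraction` with the fraction computed from the annotated masks indicates the direction of a model's bias, while `std_fraction` reflects the scatter between folds.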
Moreover, this AdaptSegNet 7 framework is built on the strong prior assumption that the source and target datasets share the same label space distribution. This poses a boundary condition for the segmentation model to give good predictions and fool the discriminator simultaneously. In case of pronounced label space deviations between the source and target datasets, the discriminator should hypothetically learn quickly how to differentiate the segmentation masks, hampering the transfer learning process. Such a label distribution shift could be due to different phase morphology or phase fractions. For this purpose, the phase fractions are provided in Table 1. Aside from the generally low data quantity in the T2 domain, the lath bainite phase was not oversampled during image acquisition as opposed to the other domains. Therefore, selecting a suitable training tile subset to match the source domain's 53% mean lath-bainite phase fraction was unfeasible. Taking the phase fraction histograms of the training tiles (based on expert-reviewed pseudo-labels) into consideration, the discriminator should seemingly learn the tendency that the images from the T2 domain generally show a smaller lath-bainite content. Initially, we considered this to be the primary reason for UDA being more beneficial for T3 compared to T2. However, an additional experiment where we varied the lath-bainite phase fraction of the LOM training dataset provided to the UDA framework invalidated this hypothesis. Specifically, we sampled another target 3 training set, T3v, with a lower mean phase fraction (φ_train,3v = 0.28, similar to T2) and trained a model with it in the UDA framework. A common test set disjoint from both LOM training sets was created for testing purposes, consisting of 12 test images with a mean phase fraction of 40%. This LOM test set phase fraction was chosen to be the average between the training phase fractions such that the influence of target train-test shifts could be excluded and domain phase fraction shifts during training could be investigated. The results showed that models trained with T3 reached 78.0 ± 1.0% mIoU, whereas those trained with T3v scored 77.4 ± 1.8%. Note that these results deviate from the results provided in Table 5 due to the distinct test set. This suggests that differing phase fractions in the training data of the different domains do not hamper the UDA training. One thing to underline is that the discriminator receives predictions rather than actual annotations, which complicates the distinction based on phase fraction histogram separability. Nevertheless, the robustness of this adversarial process with respect to label space non-conformity is auspicious for materials science tasks. For instance, this appears promising for generalizing to different alloys or processing routes, as the phase topology and morphology can then be altered significantly between the source and target sets. Alternatively, if problems due to overly different phase distributions were to arise, these could be overcome by feeding the discriminator with tile sub-patches selected based on pseudo-labels to artificially balance both sets' apparent tile phase fractions.
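The sub-patch selection suggested at the end of the paragraph above was not implemented in this study. One conceivable variant is sketched here: it scans a pseudo-label mask on a non-overlapping grid and keeps patches whose phase fraction lies close to a desired value. The grid layout, the tolerance, and the function name are hypothetical choices for illustration only.

```python
import numpy as np


def select_balanced_patches(pseudo_label, patch: int, target_fraction: float,
                            tolerance: float = 0.1, foreground: int = 1):
    """Top-left corners (row, col) of grid patches with a suitable phase fraction.

    A patch is kept if the foreground share of its pseudo-label lies within
    `tolerance` of `target_fraction` (e.g., the source mean phase fraction).
    """
    pseudo_label = np.asarray(pseudo_label)
    height, width = pseudo_label.shape
    corners = []
    for r in range(0, height - patch + 1, patch):
        for c in range(0, width - patch + 1, patch):
            window = pseudo_label[r:r + patch, c:c + patch]
            fraction = float((window == foreground).mean())
            if abs(fraction - target_fraction) <= tolerance:
                corners.append((r, c))
    return corners


# Toy 4 x 4 pseudo-label split into four 2 x 2 patches.
toy_label = np.array([[1, 1, 1, 0],
                      [1, 1, 1, 0],
                      [0, 0, 1, 1],
                      [0, 0, 0, 0]])
kept = select_balanced_patches(toy_label, patch=2, target_fraction=0.5)
```

Only the patches selected this way would then be passed to the discriminator, so that the apparent phase fractions of both domains become comparable.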
We consider the employed UDA model a good trade-off between complexity and reached performance, and therefore an excellent introductory framework for the materials science community. However, in view of the fast-paced ML research, it has been outperformed on the GTA5-to-Cityscapes task. Several improvements have been published over the past three years, most of the time using the work of Tsai et al. 7 as a reference and starting point [36][37][38][39][40] . All these studies rely on the GTA5-to-Cityscapes reference task, and some of them exploit specific characteristics of these datasets 37 . Therefore, the approaches in these works are not directly applicable to our binary segmentation task but potentially relevant for other materials science tasks. Nevertheless, some models could potentially improve over AdaptSegNet in our setting. One example is the ADVENT model of Vu et al. 38 , which makes use of entropy maps instead of segmentation maps as input for the discriminator. It encourages entropy minimization in the target domain by matching the source and target entropy distributions. This entropy minimization paradigm is borrowed from semi-supervised learning. Also inspired by semi-supervised techniques, Pan et al. 36 implement the AdaptSegNet framework with an extra pseudo-labeling step. The easiest-to-predict half of the target data is pseudo-labeled in a first training iteration and then utilized as "source" domain data for a second training, using the rest of the target data as the target domain. This is motivated by the intra-domain variance in the target domain. In our case, this approach would probably not be overly beneficial as our intra-domain variance has been reduced to the minimum by repeatable data acquisition. Potentially, such an approach can take into account the minor intra-domain variance emerging from grain morphology differences due to imaging at different locations on the rolled sheet cross-sections. Recently, Yu et al. published an improvement of AdaptSegNet 7 including an attention mechanism in order to focus domain adaptation on the parts of the images that are the most difficult to transfer from the source to the target. While the work at hand has focused on adversarial UDA techniques only, style transfer GAN approaches are also promising candidates for UDA and are currently an active research topic.
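The entropy maps mentioned above in connection with ADVENT can be illustrated with a minimal sketch that computes a pixel-wise Shannon entropy from softmax outputs, normalized by the logarithm of the number of classes. This is a simplified illustration with a hypothetical function name; the exact quantity fed to the discriminator in ADVENT may differ in detail.

```python
import numpy as np


def entropy_map(prob, eps: float = 1e-12):
    """Pixel-wise normalized Shannon entropy of class probabilities.

    prob: array of shape (C, H, W) with probabilities summing to 1 over C.
    Returns an (H, W) map in [0, 1]; 0 = fully confident, 1 = uniform.
    """
    prob = np.asarray(prob, dtype=float)
    num_classes = prob.shape[0]
    # Clip only inside the logarithm so that p = 0 contributes exactly zero.
    safe = np.clip(prob, eps, 1.0)
    entropy = -(prob * np.log(safe)).sum(axis=0)
    return entropy / np.log(num_classes)


# Toy binary prediction: one uncertain pixel and one confident pixel.
toy_prob = np.array([[[0.5, 1.0]],
                     [[0.5, 0.0]]])
toy_entropy = entropy_map(toy_prob)
```

High values of such a map mark uncertain regions, for example near phase boundaries.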
Whether to employ UDA for training a model depends on three criteria. First, the effort associated with labeling source and target domain data needs to be considered since UDA avoids this cost for the target. For instance, UDA is especially favorable when synthetic source data (e.g., simulation data) with inherent labels and expensive target annotations are concerned. Second, the features contained in the source and target input data determine the attainable annotation accuracy. A setting where significantly more precise labels can be obtained in a source domain compared to the target proves beneficial to UDA. Thus, assuming comparatively poor-quality target labels, the performance gap between a UDA training and the direct supervised training on the target domain diminishes. The transfer from SEM to LOM provides a good example, as SEM image features not only render the phase annotation process easier but also make the bainite sub-class differentiation possible in the first place. Therefore, considering that a SEM acquires data substantially more slowly and is not affordable for every research laboratory, training a UDA model with external annotated SEM data to transfer to LOM can increase accessibility to high-quality models and potentially even enable specific tasks. Lastly, the domain gap to bridge has to remain manageable. Source and target domains need to share enough common image features for the model to learn a domain-independent representation of the data. While this work gives first insights into the UDA scope of application, its precise limits still need to be explored.
UDA could alleviate the demand for expensive annotations, considering the variety of characterization methodologies and materials utilized in materials science. For instance, can surface-sensitive SE2 SEM images be used in a UDA setting to transfer to topography mapping techniques such as atomic force microscopy? Or, accordingly, can an alloy quenched with cooling rate A be used to transfer to a cooling rate B? These are essential tasks to experimentally confirm computationally optimized microstructures, given the acceleration of materials design we are currently undergoing.

Conclusion and Outlook
Among others 15,16 , this study validates the benefit of pre-training when a low-data scenario is encountered. In this context, we show that a two-stage pre-training using ImageNet and an in-field SEM dataset also improves the generalizability of the trained models across domains. From the results, it is evident that models learn and forget relations, and generalizability is sacrificed for specificity upon prolonged training. The success of pre-training and fine-tuning motivates the demand for publicly available datasets in the materials science field.
Concerning model transferability, this work's significant contribution lies in the successful application of Tsai et al.'s UDA approach 7 with small datasets, making this technique promising for many applications. Substantial improvements in performance on the target datasets were observed despite only providing a few tens of unlabeled micrographs. Even modality transfers from SEM to LOM could be facilitated successfully with such data. The UDA framework's insensitivity with respect to different phase fractions in source and target domains gives hope of enabling generalization across different alloys and heat treatments. Considering the increasing image acquisition rates and automation, the discrepancy between labeled and unlabeled data quantities is very likely to increase dramatically in the future, raising interest in unsupervised learning methods (e.g., UDA). Indeed, the availability and expense of annotation processes in materials science are impediments that UDA can evidently help to overcome. This applies especially when source domain data is substantially cheaper to annotate, shows relevant features for labeling precisely, and when domain gaps to bridge are narrow enough (comparable features between the domains).
Lastly, efforts have been made over the past years to infuse image classifiers with semantics and structured knowledge 41,42 , to alleviate the restrictions with respect to data efficiency and generalizability across datasets.These models utilize knowledge modeling in the form of ontologies and derived knowledge graphs as well as graph neural networks to introduce reasoning capability to models.As materials science currently undergoes the digital transformation in numerous large scale projects [43][44][45] , resulting knowledge graphs will not only increase the findability of potential source training data but can, along with the techniques presented here, potentially shape the future of generalizable learning to account for materials diversity.

Figure 2. Summary of the pre-training and fine-tuning procedure. Starting from ImageNet pre-trained weights (1), we optionally use the NanoSEM dataset for pre-training the encoder of the model on this classification task (2). The weights of the encoder are then directly transferred for the final fine-tuning on the source domain (3), while the decoder is random-initialized.
representation a good first approximation of the model. The four aforementioned loss parts ($L_{seg}$, $L_{adv}$, $L_{adv}^{aux}$, $L_{seg}^{aux}$) compose the training loss of the segmentation model, whereas the discriminator is optimized based on a domain-classification cross-entropy loss. When back-propagating the combined loss of the segmentation model, the weights of the discriminator are temporarily frozen. Both the segmentation and discriminator models are trained in an end-to-end fashion. Full technical details are given in the Supplemental.
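How the four loss parts named above combine into a single segmentation objective can be sketched as follows. The weighting factors are placeholder values chosen for illustration, not the settings of this study (which are given in the Supplemental), and the function names are hypothetical.

```python
import numpy as np


def cross_entropy(prob, target, eps: float = 1e-12) -> float:
    """Mean pixel-wise cross-entropy; prob is (C, H, W), target is (H, W) integers."""
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0)
    target = np.asarray(target)
    picked = np.take_along_axis(prob, target[None, ...], axis=0)[0]
    return float(-np.log(picked).mean())


def adversarial_loss(disc_prob_source, eps: float = 1e-12) -> float:
    """Loss pushing target predictions to be classified as 'source' by the discriminator.

    disc_prob_source: discriminator outputs (probability of 'source') on target predictions.
    """
    p = np.clip(np.asarray(disc_prob_source, dtype=float), eps, 1.0)
    return float(-np.log(p).mean())


def segmentation_objective(l_seg, l_adv, l_seg_aux, l_adv_aux,
                           lam_adv=1e-3, lam_seg_aux=0.1, lam_adv_aux=2e-4) -> float:
    """Weighted sum of the four loss parts (weights are assumed placeholders)."""
    return l_seg + lam_adv * l_adv + lam_seg_aux * l_seg_aux + lam_adv_aux * l_adv_aux


# Toy source prediction for a 1 x 2 tile with two classes.
toy_source_prob = np.array([[[0.5, 0.25]], [[0.5, 0.75]]])
toy_source_label = np.array([[0, 1]])
toy_l_seg = cross_entropy(toy_source_prob, toy_source_label)
```

The discriminator itself would be updated in a separate step with its own domain-classification loss, which is not shown here.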

Figure 3. Simplified schematic representation of AdaptSegNet 7 . Tile examples from the source and target 2 domains are given on the left. The source prediction is used in a supervised manner with cross-entropy loss. Both domains' predictions are fed into the discriminator.

Figure 4. Predictions on the target 1 test images. The left column (a), (c) gives the predictions of a DeepLabv2 model trained on the source dataset (cf. experiment T1.10). The right column (b), (d) gives the predictions of a DeepLabv2 model trained with the UDA framework on the source and target 1 datasets (cf. experiment T1.11).

Figure 5. Predictions on the target 2 test images. The left column (a), (c) gives the predictions of a DeepLabv2 model trained on the source dataset (cf. experiment T2.10). The right column (b), (d) gives the predictions of a DeepLabv2 model trained with the UDA framework on the source and target 2 datasets (cf. experiment T2.11).

Figure 6. Predictions on the inverted target 3 test images. The left column (a), (c), (e) gives the predictions of a DeepLabv2 model trained on the source dataset (cf. experiment T3.1). The right column (b), (d), (f) gives the predictions of a DeepLabv2 model trained with the UDA framework on the source and target 3 datasets (cf. experiment T3.2).

Figure 7. Network visualizations of the ResNet-101 DeepLabv2 model (supervised) on the source (a-c) and target 1 (d-f) domain with lath-shaped bainite labels (b) and (e) given as reference. The heat maps indicate regions that were taken into consideration at layer 3_16 of the ResNet-101 29 to classify the lath-shaped bainite regions. In the detail views (c), (f), which are the same as in Figure 1, heat maps in the target 1 domain are more extensive and additionally incorporate subgrain boundaries (red arrow annotations).

Table 1. Micrograph set descriptions. Mean lath-bainite phase fractions $\varphi_{\mathrm{train}}$ of the train tiles are given. Note that no annotation was available for the target 1 and target 2 training sets. Hence, their phase fractions were estimated based on pseudo-labels and confirmed by a metallography expert. Symbols "o" and "+" refer to normal and deliberately prolonged etching duration, respectively.

Figure 1. Input images, the same images superimposed with the annotation, and a detail view are each shown for the source (a-c), target 1 (d-f), target 2 (g-i), and target 3 (j-l) domains. Please refer to Table 1 for the technical details about the different sets. The red frames in the first column highlight the region of interest shown in the third column. Yellow annotations are used in Section 2.2 for discussing major differences between the domains.

Table 2. Performance of the models trained on the source dataset (cf. Table 1). The given mIoU is always averaged over the five folds used for cross-validation. NanoSEM refers to the two-step pre-training introduced in Section 2.4.2. The augmentation pipelines are detailed in the Supplemental.

Table 6. Source domain-trained models' performance on the target 2 domain (cf. Table 1) when fine-tuning for 100 or 200 epochs. Indices refer to training epochs. Provided mIoUs are averaged over the five folds used for cross-validation. NanoSEM refers to the two-step pre-training introduced in Section 2.4.2. Augmentation pipelines are detailed in the Supplemental.

Table 7. Estimated phase fractions on the test sets of the different target domains with DG and UDA models. Values are averaged over the five models trained on the different folds and standard deviations are provided.
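The aggregation behind such phase-fraction estimates can be illustrated with a minimal sketch. It assumes that predictions are binary masks (1 for lath-shaped bainite), that the phase fraction is the share of positive pixels, that fractions are first averaged per image within a fold, and that the sample standard deviation is reported; the function names `phase_fraction` and `summarize_folds` are hypothetical, and the study's actual procedure may differ.

```python
from statistics import mean, stdev


def phase_fraction(mask):
    """Share of pixels labelled 1 (lath-shaped bainite) in a binary mask."""
    total = sum(len(row) for row in mask)
    positive = sum(sum(row) for row in mask)
    return positive / total


def summarize_folds(masks_per_fold):
    """Mean and sample standard deviation of the phase fraction over fold models.

    masks_per_fold holds one list of predicted test-set masks per fold model.
    Averaging per image (rather than pooling pixels) is an assumption.
    """
    per_fold = [mean([phase_fraction(m) for m in masks]) for masks in masks_per_fold]
    return mean(per_fold), stdev(per_fold)


# Toy example: two fold models, each predicting one 2x2 test mask
toy_predictions = [
    [[[1, 0], [0, 0]]],  # fold model A
    [[[1, 1], [0, 0]]],  # fold model B
]
avg, std = summarize_folds(toy_predictions)
print(f"phase fraction: {avg:.3f} +/- {std:.3f}")
```

With five fold models, `masks_per_fold` would simply contain five lists of predicted masks instead of two.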

Limitations and potentials of the implemented unsupervised domain adaptation framework
Firstly, it has to be emphasized that adversarial-based frameworks such as AdaptSegNet suffer from training instability, which renders them rather laborious to tune and impairs training repeatability. Facing these training pitfalls while working with small amounts of data hampers the deduction of relationships from ablation and hyperparameter studies. Furthermore, given the implementations and hardware at hand, a typical UDA training run takes six hours, whereas a supervised DeepLabv2 fine-tuning takes only 45 minutes.