Materials swelling revealed through automated semantic segmentation of cavities in electron microscopy images

Accurately quantifying swelling of alloys that have undergone irradiation is essential for understanding alloy performance in a nuclear reactor and critical for the safe and reliable operation of reactor facilities. However, typical practice is for radiation-induced defects in electron microscopy images of alloys to be manually quantified by domain-expert researchers. Here, we employ an end-to-end deep learning approach using the Mask Regional Convolutional Neural Network (Mask R-CNN) model to detect and quantify nanoscale cavities in irradiated alloys. We have assembled a database of labeled cavity images which includes 400 images, > 34 k discrete cavities, and numerous alloy compositions and irradiation conditions. We have evaluated both statistical (precision, recall, and F1 scores) and materials property-centric (cavity size, density, and swelling) metrics of model performance, and performed targeted analysis of materials swelling assessments. We find our model gives assessments of material swelling with an average (standard deviation) swelling mean absolute error based on random leave-out cross-validation of 0.30 (0.03) percent swelling. This result demonstrates our approach can accurately provide swelling metrics on a per-image and per-condition basis, which can provide helpful insight into material design (e.g., alloy refinement) and impact of service conditions (e.g., temperature, irradiation dose) on swelling. Finally, we find there are cases of test images with poor statistical metrics, but small errors in swelling, pointing to the need for moving beyond traditional classification-based metrics to evaluate object detection models in the context of materials domain applications.


Introduction
Metal alloys used in nuclear reactor cores and surrounding structures undergo irradiation, causing damage to the material which can result in the production of extended defects such as dislocation loops, precipitates, and cavities (sometimes called voids when they do not contain gas or bubbles when they do contain gas) that, in turn, have a deleterious impact on the mechanical properties via hardening, embrittlement and swelling.[1][2][3][4][5] Bias-driven growth of cavities leading to unconstrained swelling under neutron irradiation generally occurs via the presence of helium (produced from nuclear transmutation) that stabilizes the cavities.[3,6] Significant swelling can result in material degradation and failure; hence, understanding the interplay of alloy composition, microstructure, and reactor conditions such as operating temperature and irradiation dose is important for informing safe and reliable reactor operation.[7] Bulk measurement methods of reactor components, such as the Archimedes method, are typically the easiest way to obtain information on the total volumetric swelling response of a material.[8] However, Transmission and Scanning Transmission Electron Microscopy (S/TEM) methods are also commonly employed in materials research and development evaluations for ex situ characterization of alloy microstructure and swelling quantification. TEM methods have an advantage over bulk measurement methods as they enable one to isolate the swelling response arising strictly from the presence of cavities, eliminating swelling contributions from other factors such as creep, secondary phase formation, and phase densification at high temperature. TEM analysis can also be used to identify swelling responses locally, e.g., as is seen during ion irradiations or in complex microstructures due to localized microstructural effects on the helium and defect formation energetics and kinetics. Finally, TEM analysis can be used to help understand early stage irradiation response, e.g., the nucleation and growth process of cavities, which initiates before significant macroscopic swelling has occurred.
Such microscale characterization thus enables detailed mechanistic understanding important for the design of swelling resistant alloys, and enables researchers to understand linkages between material microstructure, composition, and swelling response as a function of key operational variables such as temperature, irradiation type (e.g., neutron vs. ion), dose rate, and total dose.[9] This information is in turn useful for informing materials modeling of swelling in different regimes (i.e., incubation, transient, and steady state swelling) and can help inform operational limits of a material in a nuclear reactor.[5] At present, swelling quantification from TEM samples is typically performed by considering a handful of TEM images and manually counting and measuring individual cavities in each image, for example using image analysis programs such as ImageJ.[10] This approach typically treats relatively small sample sizes due to (1) the time and resource-intensive nature of TEM sample preparation and (2) the cavity labeling and counting analysis. Regarding the first issue, recent advances in TEM sample preparation, including high-throughput focused ion beam (FIB) methods (e.g., plasma FIB) and flash polishing, can be used to generate an extensive library of TEM samples.[11,12] Therefore, sample preparation limitations are rapidly being overcome.
We note also that modern TEM instruments have undergone exponential growth in data acquisition rates with the development of new detector technologies, resulting in higher resolution images and larger overall data sizes.[13][14][15][16] Therefore, it is clear that manual labeling and measurement of cavities will not be able to keep pace with the scaling of TEM dataset sizes.
Thus, the second issue above is rapidly becoming the bottleneck in scaling up image-based analysis capabilities. An automated method that can quickly analyze large TEM datasets, automatically detect and quantify cavities, and then assess material swelling would enable researchers to evaluate many more areas of interest on a given sample, providing more robust statistics, quantification of effects of heterogeneity, and in-depth evaluations of cavity properties and material swelling.
In the past decade, deep learning methods have advanced significantly, resulting in revolutionary changes to the field of computer vision. Specifically, in the context of object detection, deep convolutional neural networks (CNNs) such as ResNet50, ResNet101 and VGG16 are used to extract detailed underlying feature sets from tens of thousands of images in canonical databases such as ImageNet [17] and Common Objects in Context (COCO).[18] These so-called "backbone" networks are implemented in CNN-based object detection frameworks such as the Faster Regional Convolutional Neural Network [19] (R-CNN) and Mask R-CNN models, [20] which contain additional neural networks that suggest regions of interest in the image and classify and segment individual objects within each region of interest.[21,22] There has been a growing body of work applying object detection methods to electron microscopy images in materials science, [23] with applications ranging from detecting various defects (e.g., dislocations, precipitates, black dot defects) in irradiated metal alloys [24][25][26] to quantifying micro and nanoparticles [27,28] and finding individual atoms in high-resolution STEM images.[29,30]

While at first glance the MAE value of 0.66 percent swelling on the NOME test set does not appear much worse than the MAE of 0.40 percent swelling on the CNL test set, the range of swelling values for the NOME data is much smaller, and the higher error, in this case, is better exemplified by inspecting the MAPE value of about 215% for testing on NOME vs. just under 20% for testing on CNL, as well as the reduced RMSE value, which is much higher (lower) than unity for the NOME (CNL) test set.
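The contrast between absolute and relative error measures drawn above can be illustrated with a minimal Python sketch. It assumes that the reduced RMSE is the RMSE divided by the standard deviation of the true values, a definition the text does not spell out; the function name and the choice to skip images with zero true swelling in the MAPE are likewise illustrative assumptions.

```python
import math

def swelling_error_metrics(true_vals, pred_vals):
    """Return MAE, MAPE (%), RMSE and reduced RMSE for paired per-image swelling values.

    Assumption: reduced RMSE = RMSE / (population standard deviation of the true values).
    """
    if len(true_vals) != len(pred_vals) or len(true_vals) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(true_vals)
    errors = [p - t for t, p in zip(true_vals, pred_vals)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # MAPE is undefined where the true swelling is zero, so those images are skipped.
    pct = [abs(p - t) / abs(t) for t, p in zip(true_vals, pred_vals) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")
    mean_true = sum(true_vals) / n
    sigma = math.sqrt(sum((t - mean_true) ** 2 for t in true_vals) / n)
    reduced_rmse = rmse / sigma if sigma > 0 else float("nan")
    return {"mae": mae, "mape": mape, "rmse": rmse, "reduced_rmse": reduced_rmse}

# Hypothetical values (not from the paper): the same absolute error on a
# narrow-range test set gives a much larger MAPE and reduced RMSE.
wide = swelling_error_metrics([1.0, 3.0, 5.0, 7.0], [1.4, 2.6, 5.4, 6.6])
narrow = swelling_error_metrics([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
```

A reduced RMSE above unity means the typical error exceeds the spread of the true values, which is why a modest MAE can still indicate poor predictive ability on a test set with a narrow swelling range.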

Understanding model errors of swelling assessment
Here, we seek to better understand the source of error in the model swelling assessments. Based on the equation to calculate material swelling (Eq. 1, see Section 4), it is intuitive that cavity size (cubic scaling) has a larger impact on the swelling than cavity density (linear scaling) (see SI Note 3 for a visualization of this fact using our present database).
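The scaling argument can be made explicit with a short worked derivation. This is a sketch under two simplifying assumptions that are not taken from the source: all N cavities in an image share a single diameter d, and the total cavity volume is small compared with the imaged foil volume Aδ. Here ρ = N/A denotes the per-image areal cavity density.

```latex
\frac{\Delta V}{V} \;\approx\; \frac{100}{A\,\delta}\,\frac{\pi}{6}\sum_{i=1}^{N} d_i^{3}
\;=\; 100\,\frac{\pi}{6}\,\frac{\rho}{\delta}\,d^{3},
\qquad
\frac{\partial \ln(\Delta V/V)}{\partial \ln d} = 3,
\qquad
\frac{\partial \ln(\Delta V/V)}{\partial \ln \rho} = 1 .
```

Under these assumptions, a 10% overestimate of the cavity diameter inflates the swelling by a factor of 1.1³ ≈ 1.33 (about 33%), whereas a 10% overestimate of the cavity density inflates it by only 10%.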
Given the detailed data obtained from the Mask R-CNN model output, we show this effect in practice and quantify potential problematic areas of model use more precisely. Figure 4A shows the relationship between the true per-image cavity size and the model error in the cavity density. In Figure 4A the sizes of the data points scale with the model error in the swelling. What we learn from Figure 4A is that the images with the highest density errors are those with small cavities, at least on average. The small sizes of the points with high density errors indicate that these images with poor density assessments also have minor swelling errors. From the standpoint of desiring a model which produces accurate swelling assessments, the fact that at times the model shows poor assessments of cavity density is not necessarily concerning, as the poor density assessments coincide with small swelling errors, at least for the images analyzed in our present database. It is worth noting that our model is largely unbiased with regard to cavity size predictions (see Figure S1A in SI Note 1) but biased to underpredict cavity densities (see Figure S1B in SI Note 1), resulting in essentially no bias in the swelling errors (see Figure 2). This is because small cavities have a small impact on the swelling values, and are the cavities that are undercounted in the density predictions.

when the true swelling is large (e.g., average swelling error of 0.60% and percentage error of 16.0% for true swelling >2%). Overall, across all test images in our database, our model shows average absolute swelling errors (percentage swelling errors) of about 0.3% (25%).

and our model can have a poor swelling assessment that is the result of a combination of errors (Figure 5B: the model predicts too many small cavities and misses some large cavities). However, we reiterate that when evaluating the numerous images comprising our complete test set, our model shows good assessments of material swelling on average.

Summary and Outlook
In this work, we used an end-to-end deep learning approach based on the Mask R-CNN model to detect and characterize nanoscale cavities in irradiated metal alloy TEM micrographs.
We have assembled the largest database of labeled cavity images to date, which includes 400 images and >34k cavities, with a domain encompassing an array of alloy compositions and irradiation conditions. We evaluated the performance of our Mask R-CNN models using a set of canonical classification-based metrics (overall and per-image precision, recall, and F1 scores) and materials domain-specific metrics of cavity size, cavity density, and swelling assessments. Given the importance of accurately characterizing swelling in irradiated alloys for their use as materials in nuclear reactor components, we particularly emphasized assessments of material swelling.
Our model provides material swelling assessments with an average (standard deviation) swelling mean absolute error based on random leave-out cross validation of 0.30 (0.03) percent swelling, demonstrating good assessment ability of swelling with sufficiently small error to provide useful insight for new alloy design. We investigated the source of our swelling errors in greater detail, with three related findings of interest: (1) The model can occasionally have poor assessments of cavity density, but these poor density assessments always coincided (at least for the images evaluated here) with small swelling errors as the missed cavities were all small (e.g., cavities which span about 2% or less of the image size), indicating that poor cavity density assessments are not necessarily a worrisome sign for model performance.
(2) Canonical classification-based metrics can sometimes paint a misleading picture of how well a model may perform for a specific materials-domain application. For example, we analyzed two extreme cases of test images with low (high) F1 scores which, in turn, ended up displaying very low (high) swelling errors, indicating that, as with point (1) above, missing many cavities is not necessarily an issue, assuming they are small.
(3) Directly related to the above points, and given that swelling scales with the cube of cavity size, it is essential to capture the sizes of large cavities accurately. While this is obvious from inspection of Eq. 1, we showed how this effect can manifest in practice: even test images with small average cavity size errors may show larger-than-desired errors in swelling, and in some cases errors in the full cavity size distribution, at least as they relate to accurately assessing swelling, are mainly the result of errors in cavity sizes of about 15 nm or larger.
Although the present results are very promising, the inability to reliably assess new types of cavity data, the errors on small cavity detection, and the swelling errors introduced for some large cavities are all still concerns. Some or all of these issues may be overcome with more data, but obtaining and annotating new TEM images of irradiated samples is very time-consuming, particularly if one also needs to conduct the irradiation experiments before imaging. We believe a potentially fruitful area of future research is to include synthetic training data, which can augment existing experimental databases to expand the model training domain to include different size distributions, focusing and imaging conditions, and noise levels to improve model training. One avenue for creating synthetic data is to use generative models such as Generative Adversarial Networks (GANs). However, the main downside of using GANs is their reliance on an initial set of training images of cavities. A different method that doesn't rely on an initial seed of training data is a physics-based simulation of cavities. Our initial work in this space combined simulated cavities onto experimental images containing real cavities to improve object detection model training, [32] and work is ongoing to address challenges of how to best integrate synthetic cavities with background TEM images and comprehensively evaluate object detection model performance with the addition of synthetic cavity data.
To encourage future studies of object detection and quantification in this space, we have made our full database of images and their associated ground truth annotations publicly available (see Data and Code Availability section). In addition, we have provided a Python notebook tailored for running on the free GPU resources provided on Google Colab, to easily provide inference and basic analysis of material swelling on user-provided test images. Finally, our model is also hosted on DLHub, [33] which is part of the Foundry for data, models and science.[34] This infrastructure enables inference on new images using only two lines of Python code. We have also included a notebook which can be used to call our model from Foundry (see Data and Code Availability section). The Mask R-CNN model used for this tool was trained on the complete CNL+NOME database of 400 images to create the most accurate present model for detecting cavities on new images. Provided a new test image, the notebook saves the image with the model-specified cavity segmentations overlaid, together with a spreadsheet containing the bounding box, segmentation, and calculated size of each cavity in the image, along with the computed cavity density and swelling. We hope that tools such as these assist researchers and new users alike in the short term by lowering the barrier to using object detection tools. In the longer term, we hope to facilitate the generation of a broader community base of standardized (experimental and synthetic) image data and associated object detection models, with the goal of creating state-of-the-art models able to accurately detect cavities and quantify vital materials properties such as swelling for a range of alloy compositions, irradiation doses, and imaging conditions.

We use the Mask R-CNN object detection model to detect and quantify cavities in this work, as implemented in the Detectron2 package (PyTorch backend). The Detectron2 package was developed by the Facebook AI Research (FAIR) team.[37] Detectron2 is freely available and enables the implementation of many object detection models, such as Faster R-CNN, [19] Mask R-CNN, [20] and Cascade R-CNN.[38] These object detection models have been pre-trained on either the ImageNet [17] or Microsoft COCO [18] (Common Objects in Context) image databases, enabling the use of the transfer learning technique. When using transfer learning, the model backbone weights are frozen to those obtained from the previous ImageNet or Microsoft COCO image training, save for a small number of terminal layers (2 throughout this work). The Mask R-CNN input configuration was the same as that used in our previous work of detecting and quantifying dislocation loops and black dot defects in FeCrAl alloys, [25] except here we adjusted the candidate anchor box sizes to be 4, 8, 16, 32, 64, 128, and 256 pixels to enable the model to better detect small cavities. We note here that input files in the Detectron2 package typically use candidate anchor box sizes that are powers of 2, so we follow that practice and also include the small anchor box sizes of 4 and 8 pixels in an effort to better detect small cavities, as some of the images examined in this work contain cavities that are on this length scale.
This work evaluates our model using both classification-centric and materials property-centric metrics. For our classification metrics, we focus on the model precision (P), recall (R), and F1 (harmonic mean of precision and recall) scores. Since we have only a single prediction category (i.e., cavities), the precision is calculated by dividing the number of found defects by the number of predicted defects, and the recall is calculated by dividing the number of found defects by the number of true defects. We evaluate P, R and F1 scores both on a per-image basis, from which we obtain average per-image P, R and F1 scores, and as so-called overall P, R and F1 scores, each of which is a single calculation using the total numbers of true, predicted and found cavities for the entire test set. For the materials property metrics, we calculate size distributions of predicted cavities for every test image, but focus our evaluation on comparing the true vs. predicted per-image average cavity size, true vs. predicted per-image cavity density (obtained by counting the number of true and predicted cavities in an image and dividing by the image area), and true vs. predicted per-image swelling. The swelling ΔV/V of an image (expressed as percent swelling) is calculated following the work of Jiao et al. [9]:

$$\frac{\Delta V}{V} = \frac{\frac{\pi}{6}\sum_{i=1}^{N} d_i^{3}}{A\,\delta - \frac{\pi}{6}\sum_{i=1}^{N} d_i^{3}} \times 100 \qquad (1)$$

where A is the area of the image, δ is the image thickness, d_i is the diameter of cavity i, and N is the number of cavities in the image. Due to the lack of per-image thickness data, we have assumed that every image has a thickness of 100 nm. The cavity diameter is calculated as twice its radius, where the cavity radius is defined as the square root of the product of the minimum and maximum distances from the center of the cavity mask.
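The per-image quantities described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the swelling expression assumes the common form in which the total cavity volume is divided by the foil volume (image area times thickness) minus the cavity volume. The 100 nm default thickness follows the assumption stated in the text.

```python
import math

def cavity_diameter(r_min_nm, r_max_nm):
    """Diameter as twice the geometric mean of the minimum and maximum
    center-to-edge distances of the cavity mask."""
    return 2.0 * math.sqrt(r_min_nm * r_max_nm)

def cavity_density(n_cavities, image_area_nm2):
    """Areal cavity density: number of cavities divided by the image area."""
    return n_cavities / image_area_nm2

def percent_swelling(diameters_nm, image_area_nm2, thickness_nm=100.0):
    """Percent swelling from cavity diameters (assumed form of Eq. 1)."""
    cavity_volume = (math.pi / 6.0) * sum(d ** 3 for d in diameters_nm)
    foil_volume = image_area_nm2 * thickness_nm
    if cavity_volume >= foil_volume:
        raise ValueError("cavity volume must be smaller than the foil volume")
    return 100.0 * cavity_volume / (foil_volume - cavity_volume)

# Hypothetical image: 100 cavities of 10 nm diameter in a 500 nm x 500 nm field of view.
area = 500.0 * 500.0
base = percent_swelling([10.0] * 100, area)
doubled_size = percent_swelling([20.0] * 100, area)   # every diameter doubled
doubled_count = percent_swelling([10.0] * 200, area)  # cavity count doubled
```

In this example, doubling every diameter multiplies the total cavity volume by eight, whereas doubling the cavity count only doubles it, which is the cubic versus linear sensitivity discussed in the text.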
When evaluating the performance of object detection models like Mask R-CNN, there are two key hyperparameters to choose, namely the intersection-over-union (IoU) threshold value and the objectness score. The IoU threshold sets the minimum overlap between ground truth and predicted bounding boxes for a cavity to be considered found in the correct position, and the objectness score is a measure of the model confidence that a predicted region corresponds to a cavity, and thus impacts the total number of predicted cavities. The method to match the true and predicted cavities based on IoU is the same as that employed in our previous work.[25] We provide a brief summary of this approach here. When evaluating an image, there is a list of true defect masks and predicted defect masks. To decide whether a defect has been found in the correct location, the IoU of every predicted defect is calculated for each true defect, and the predicted defect with the highest IoU score is considered the best possible match. The IoU values are calculated using the bounding boxes obtained from the region proposal network. If this computed IoU score is above the designated threshold, this predicted defect is considered to be found. Each true defect can only be found one time, so if multiple predicted defects pass the IoU threshold with a particular true defect, the predicted defect with the highest IoU score is considered the found defect, and the other defect(s) are considered false positives. The hyperparameters are determined using the CNL+NOME initial split by evaluating the overall F1 score as a function of IoU threshold and objectness score, and by evaluating the error in predicted swelling as a function of objectness score (see Figure S5 in SI Note 5). This data split was chosen for hyperparameter optimization as it contains a representative and random subset of the full CNL+NOME image dataset examined in this work.
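The matching procedure and the resulting classification metrics can be sketched as follows. This is a simplified illustration, not the implementation used in this work: boxes are assumed to be (x_min, y_min, x_max, y_max) tuples, true cavities are processed in list order, each prediction is assumed to match at most one true cavity, and the function names are hypothetical.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def count_found(true_boxes, pred_boxes, iou_threshold):
    """Count true cavities whose best-overlapping prediction exceeds the IoU threshold.

    Each true cavity can be found only once; predictions left unmatched
    count as false positives when precision is computed.
    """
    used = set()
    found = 0
    for t in true_boxes:
        best_j, best_iou = None, 0.0
        for j, p in enumerate(pred_boxes):
            if j in used:
                continue  # assumption: a prediction matches at most one true cavity
            iou = box_iou(t, p)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j is not None and best_iou > iou_threshold:
            used.add(best_j)
            found += 1
    return found

def precision_recall_f1(n_found, n_pred, n_true):
    """Precision = found/predicted, recall = found/true, F1 = their harmonic mean."""
    p = n_found / n_pred if n_pred else 0.0
    r = n_found / n_true if n_true else 0.0
    f1 = 2.0 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

# Hypothetical image with two true cavities and three predictions.
true_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 11, 11), (0, 0, 10, 10), (50, 50, 60, 60)]
n_found = count_found(true_boxes, pred_boxes, iou_threshold=0.5)  # threshold is illustrative
p, r, f1 = precision_recall_f1(n_found, len(pred_boxes), len(true_boxes))
```

Per-image scores follow from calling `precision_recall_f1` on each image's counts, while the overall scores follow from summing the found, predicted, and true counts over the whole test set before a single call.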
predictions on new images and save the associated data is also available on Figshare (https://doi.org/10.6084/m9.figshare.20063117). In addition, we have hosted the final trained model on DLHub, which is part of the Foundry for data, models and science. A notebook to use the hosted model on Foundry is also provided in the above Figshare repository. A small subset of the images (≈3%) is omitted from the public database due to protected rights of these images. Access to the omitted images and corresponding labels can be obtained upon request from the corresponding author.
classification metrics and materials property metrics for each split, together with the average and standard deviation across all five splits. Regarding model performance on classification-centric metrics, all models perform better at detecting underfocused cavities than overfocused cavities, where, for example, the CNL+NOME model shows overall (average per-image) F1 scores of 0.72 (0.73) for underfocused images and 0.54 (0.60) for overfocused images, respectively. In addition, when considering underfocused and overfocused images together, the CNL+NOME model shows overall (average per-image) F1 scores of 0.68 (0.69), which are nearly identical to scores of 0.68 (0.73) for the model trained and tested solely on CNL data and to scores of 0.68 (0.66) for the model trained and tested solely on NOME data. Further, we speculate that the model trained solely on NOME data may have a slightly larger domain of applicability than the model trained solely on CNL data.
The NOME database contains images of materials from more alloy types and irradiation conditions than the CNL database. From the data in Table S3, the model trained on NOME and tested on CNL displays a better overall F1 score of 0.46 than the model trained on CNL and tested on NOME, which has an overall F1 score of 0.39. Overall, the results of Figure 3 in the main text, Figure S7 and Table S3 demonstrate that it is preferable to simply train one model with training images from both datasets, as the model domain is widened without significant loss in classification or materials property metric performance. Finally, it is worth noting that comparisons can be made for the CNL initial split model that is trained and tested on the CNL data with previous findings from the work of Anderson et al.

cavities/cm³, assuming a thickness of 100 nm), and that the material swelling is much more sensitive to the cavity size than the density, consistent with intuition.

above is one such split as it had effectively a random group of images pulled out for testing. We constructed an additional 4 random splits (for a total of 5 random splits), which we refer to as CNL+NOME CV split N (N=1-4).
imaging conditions where the different conditions invert the contrast modulation of the cavities present in the material. In Figure 3A, we see that the model trained on CNL images demonstrates good assessment of material swelling on the CNL image test set with an MAE of 0.40 percent swelling. The model performs better on underfocused images compared to overfocused images from the standpoint of MAE, where the swelling MAE values on underfocused (overfocused) images are 0.33 (0.53) percent swelling, respectively (see SI Note 2). The improved performance on underfocused images is likely due to there being more underfocused than overfocused cavities in the CNL database. A similar response was observed in our previous work using Mask R-CNN to detect dislocation loops in FeCrAl alloys, where our learning curves showed best model performance on the defect types present in highest quantity in the training data.[25] In Figure 3A, we can also see that the model trained on CNL data performs poorly on the NOME test set.

Figure 3: Parity plots assessing Mask R-CNN per-image performance of predicting materials swelling. (A) CNL initial split, with model trained on CNL and tested on CNL (blue data) and trained on CNL and tested on NOME (red data). (B) NOME initial split, with model trained on NOME and tested on CNL (blue data), and trained on NOME and tested on NOME (red data). In both plots, the circle and triangle points denote overfocused and underfocused images, respectively, and the color-coded fit statistics coincide with the corresponding set of points of like color.

In Figure 3B, we perform the test case where the model is trained only on the NOME data and separately tested on the CNL and NOME test sets. The model trained and tested on NOME data shows an excellent overall ability to assess swelling, with an MAE of just 0.15 percent swelling (MAPE = 37.97%). In contrast, the model performs poorly on assessments of the CNL test set, with large swelling MAE (MAPE) values of 1.98 percent swelling (76.25%), respectively, and essentially no ability to assess swelling of samples with true swelling values greater than about 1.5 percent swelling. This result makes sense through the lens of model applicability domain. While the NOME dataset constitutes a more diverse set of alloy compositions and irradiation conditions, the swelling present in the NOME images has a maximum of about 2.5 percent swelling (with all but one test image having less than 1.5 percent swelling), in contrast to the large swellings of some CNL images of up to nearly 7 percent swelling. We reiterate that by training a model which uses both the CNL and NOME data (Figure 2 and Figure S2 in SI Note 2), the model provides an accurate assessment of material swelling both on the separate CNL (MAE = 0.44 percent swelling) and NOME (MAE = 0.15 percent swelling) test sets, and collectively shows an MAE of 0.26 percent swelling. The model trained on CNL and NOME data shows virtually unchanged performance on each test subset compared to individually

In Figure 4B, we plot the average absolute swelling error as a function of the true per-image cavity size, binned based on ranges of cavity sizes. The sizes of the points in Figure 4B correspond to the number of test images contained in each cavity size bin. The error bars denote the standard error in the mean of the average absolute swelling error in each cavity size bin. As an example, to obtain the first data point of the 0-5 nm binned NOME data, the sizes of the red square points in Figure 4A that are between 0-5 nm on the y-axis are averaged to obtain the average absolute swelling error in Figure 4B, the error bar is the standard error in the mean of those same points, and the size of the point in Figure 4B scales with the number of data points in the 0-5 nm size bin (note this is why larger points tend to have smaller error bars). In Figure 4B, we can see that the CNL (NOME) images with average cavity sizes greater than 10 nm (15 nm) have higher average swelling errors than the overall MAE of 0.3 percent swelling from random cross validation. Taken together, the analysis shown in Figure 4 points to images with large cavities being the most susceptible to high swelling errors, with errors potentially twice as high as that obtained from our random cross validation test. As a further piece of analysis, in Figure S4 in SI Note 3 we have additional plots like that shown in Figure 4B, except we plot the average absolute swelling error as a function of the (binned) true swelling, for the cases of all test images together as well as split out by CNL and NOME subsets. This analysis indicates we have smaller (larger) absolute swelling errors (percentage swelling errors) when the true swelling is small (e.g., average swelling error of 0.13% and percentage error of 33.0% for true swelling <1%) and larger (smaller) absolute swelling errors (percentage swelling errors)

Figure 4: (A) Relationship between the true per-image cavity size and the model error in assessing the corresponding cavity density. Each data point represents one test image, where the blue circles and red squares denote CNL and NOME test images, respectively. The size of the data points scales with model error of the percent swelling. (B) The trend of model predicted absolute error in material swelling as a function of true average cavity size. Here, the x-axis represents binned values of true cavity size (i.e., groups of test images based on their range in true cavity sizes from the y-axis of plot (A)). The blue circles and red squares denote groups of CNL and NOME test images, respectively. The size of the points scales with the number of test images comprising the true average cavity size bin. The size legends denote the minimum, average, and maximum for the respective data trace. The error bars are the standard error in the mean of the absolute swelling error.
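The binning behind panel (B) can be sketched in Python as below. This is an illustrative reconstruction under stated assumptions rather than the authors' plotting code: the 5 nm bin width is inferred from the 0-5 nm bin mentioned in the text, the standard error uses the sample standard deviation, and the function name is hypothetical.

```python
import math
from collections import defaultdict

def binned_mean_abs_error(sizes_nm, abs_errors, bin_width_nm=5.0):
    """Group per-image absolute swelling errors by true average cavity size.

    Returns one tuple per occupied bin:
    (bin_start_nm, bin_end_nm, n_images, mean_abs_error, standard_error_of_mean).
    """
    bins = defaultdict(list)
    for size, err in zip(sizes_nm, abs_errors):
        bins[int(size // bin_width_nm)].append(err)
    rows = []
    for k in sorted(bins):
        errs = bins[k]
        n = len(errs)
        mean = sum(errs) / n
        if n > 1:
            sample_var = sum((e - mean) ** 2 for e in errs) / (n - 1)
            sem = math.sqrt(sample_var / n)
        else:
            sem = float("nan")  # standard error is undefined for a single image
        rows.append((k * bin_width_nm, (k + 1) * bin_width_nm, n, mean, sem))
    return rows

# Hypothetical per-image values (not from the paper).
rows = binned_mean_abs_error([1.0, 2.0, 4.0, 7.0, 12.0], [0.1, 0.2, 0.3, 0.5, 1.0])
```

Because the standard error shrinks as the square root of the number of images in a bin, bins with more images (larger plotted points) tend to show smaller error bars, as noted in the text.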

Data and Methods
In this work, two datasets were used to train and test the performance of our Mask R-CNN object detection model. Both datasets consist of TEM images of irradiated metal alloys. Objects of interest for detection and quantification are cavities, which generally appear as spherical and faceted shapes in the microstructure with contrast consistent with a region devoid of matrix material. The first dataset consists of bright-field TEM micrographs obtained and labeled by the Canadian Nuclear Laboratories (CNL), which we refer to as the CNL dataset throughout this work. The images were obtained from reactor spacer springs of commercial nuclear reactors in the Canada Deuterium Uranium (CANDU) reactor fleet, [35] and consist of both overfocused and underfocused images of cavities in Inconel X-750 Ni alloys which have undergone neutron irradiation. The reactor spacer springs used to obtain the CNL images were in reactor service for 14 years, with a damage dose of 30 displacements per atom (dpa). Additional details of the sample preparation, TEM imaging, and cavity annotation are described in the work of Anderson et al.[31] The numbers of overfocused and underfocused images, and the corresponding numbers of overfocused and underfocused cavities, for the CNL dataset are summarized in SI Note 4.
We note here that in the work of Anderson et al., it is stated that a total of 253 images comprise the database, where 230 images were used for training and 23 were reserved for testing their Faster R-CNN model. However, from the publicly available data linked in their paper, the available training set consists of 224 images and the testing set contains 19 images (243 total images). Further, when inspecting the provided annotations for all images, we found that for 5 images the annotations did not coincide with the cavities present in the image. Rather than re-annotating these images, we simply removed them from the CNL database used in this work, yielding a total of 238 images. (Note the While 68% of the present CNL database consists of underfocused images, a large majority (about 83%) of the cavities are underfocused, resulting in a class imbalance where the database is significantly biased toward underfocused cavities.

The second dataset consists of TEM micrographs obtained and labeled by us as part of the Nuclear Oriented Materials & Examination (NOME) Laboratory at the University of Michigan, which we refer to as the NOME dataset throughout this work. These images were obtained through a wide variety of collaborations and professional contacts within the field. They consist of both overfocused and underfocused images. The materials compositions covered by these images are highly varied, including samples of CW-316, T91, HT9, and 800H steel alloys. The irradiation conditions were also highly diverse and include damage from light- and heavy-ion as well as neutron bombardment, with total doses of up to 100 dpa. To annotate these images, a team of undergraduate student researchers was first trained by a domain expert to label images by practicing on several pre-labeled images not part of the NOME database. Feedback on their labeling was provided until their results approximated those obtained by expert researchers. Once trained, the undergraduate team labeled the entire NOME database. The labels of each NOME image were corrected by a graduate student researcher (Matthew Lynch) and checked by a post-doctoral researcher (Priyam Patki) to form the final set of annotations. All labeling was done using the VGG Image Annotator (VIA) web tool.[36] The labeled NOME database comprises 162 images, as detailed in SI Note 4. Like the CNL database, the NOME database is significantly biased toward underfocused cavities, with about 75% of the total cavities coming from underfocused images. In order to assess different aspects of the model, 7 different splits of our combined CNL+NOME dataset were used to train and test the ability of our Mask R-CNN models to detect and quantify cavities, as detailed in SI Note 4. We note here that all of the images and annotations for the CNL and NOME datasets have been made publicly available on Figshare (see Data and Code Availability section).
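The random leave-out cross-validation used for the combined CNL+NOME data can be sketched as follows. This is a minimal illustration, not the procedure used to generate the published splits: the image identifiers, the 10% test fraction, and the random seed are all assumptions made for the example, and only the total image counts (238 CNL and 162 NOME) and the use of five random splits come from the text.

```python
import random


def random_splits(image_ids, n_splits=5, test_fraction=0.1, seed=0):
    """Return n_splits independent random (train, test) partitions of image_ids.

    test_fraction and seed are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    n_test = max(1, round(len(image_ids) * test_fraction))
    splits = []
    for _ in range(n_splits):
        shuffled = list(image_ids)
        rng.shuffle(shuffled)
        test = sorted(shuffled[:n_test])
        train = sorted(shuffled[n_test:])
        splits.append((train, test))
    return splits


# Hypothetical identifiers standing in for the 238 CNL and 162 NOME images.
image_ids = [f"CNL_{i:03d}" for i in range(238)] + [f"NOME_{i:03d}" for i in range(162)]
splits = random_splits(image_ids)

for k, (train, test) in enumerate(splits):
    n_cnl = sum(name.startswith("CNL") for name in test)
    print(f"split {k}: {len(train)} train, {len(test)} test ({n_cnl} CNL test images)")
```

Because each split is drawn independently, a given image may appear in the test set of more than one split; a strict k-fold partition would be an alternative design.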

Figure S6: Parity plot of true and predicted (A) average cavity size, and (B) cavity density. Each data point represents one test image. The different symbols correspond to different cross validation train/test splits. The fit statistics in black text denote the average +/- standard deviation across all five splits for each metric. In (B), the fit statistics in blue text denote the average +/- standard deviation across all five splits for test images with true cavity density less than or equal to 20 × 10⁴ nm⁻².

Figure S7: Additional parity plot assessing Mask R-CNN per-image performance of predicting materials swelling. CNL+NOME initial split, with the model trained on CNL+NOME and tested on CNL+NOME, where CNL images are shown as blue points and NOME images as red points.

Our model shows overall and average per-image F1 scores of 0.68 and 0.73, respectively, which are lower than the highest F1 score of 0.78 reported in Anderson et al. While it is not clear whether the F1 score of 0.78 reported by Anderson et al. represents an overall or average per-image F1 score, it is nonetheless 0.05-0.1 higher than the F1 scores we obtain here. We attribute this difference to the different number of training images used here compared to Anderson et al. (219 vs. 230 in their work), and to the different test set used to evaluate the model performance (19 images vs. 23 images in their work). In addition, different codebases and model types were used: Anderson et al. used a TensorFlow-based implementation of the Faster R-CNN model, while we use the Mask R-CNN model in Detectron2/PyTorch.
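The distinction between overall and average per-image F1 scores can be made concrete with a short sketch. The per-image counts below are invented for illustration and are not taken from either dataset. The overall score pools true positives, false positives, and false negatives across all test images before computing F1, while the per-image score averages the F1 of each image, so images containing many cavities dominate the former but not the latter.

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from true positive, false positive, and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def overall_and_per_image_f1(counts):
    """counts: list of (tp, fp, fn) tuples, one per test image."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    overall_f1 = prf1(tp, fp, fn)[2]
    per_image_f1 = sum(prf1(*c)[2] for c in counts) / len(counts)
    return overall_f1, per_image_f1


# Invented example: one dense image with many missed cavities, one sparse image.
counts = [(90, 30, 60), (8, 1, 1)]
overall_f1, per_image_f1 = overall_and_per_image_f1(counts)
print(f"overall F1 = {overall_f1:.2f}, average per-image F1 = {per_image_f1:.2f}")
```

In this invented case the dense, harder image pulls the overall F1 below the per-image average, which is the same direction as the gap between the two scores reported above.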

Figure S8 contains a scatter plot of the true per-image average cavity size vs. the true per-image cavity density for all data splits considered. In Figure S8, the sizes of the points scale with the true material swelling. It is immediately evident in Figure S8 that the images with the largest

Figure S8: Relationship between the true per-image cavity size and the true per-image cavity density. Each data point represents one test image. The blue circles and red squares denote CNL and NOME test images, respectively. The size of the data points scales with the true percent swelling.

Figure S9: (A, C and E) Trend of model absolute error in material swelling as a function of true swelling. (B, D and F) Trend of model absolute percentage error in material swelling as a function of true swelling. In A-D, the x-axis represents binned values of true swelling. In C-F, the blue circles and red squares denote groups of CNL and NOME test images, respectively. The size of the points scales with the number of test images comprising the true swelling bin. The size legends denote the minimum, average, and maximum for the respective data trace. The error bars are the standard error in the mean.
Figure S10: (A) Heatmap showing hyperparameter selection to optimize model prediction of material swelling based on the choice of IoU threshold and model objectness score. The heat values correspond to the overall F1 score of the model, where a value of IoU=0.1 and 0.1 objectness score corresponds to the highest overall F1 of 0.672. (B) The mean absolute error (MAE, blue points) and root mean squared error (RMSE, red points) of material swelling as a function of objectness score, with an IoU=0.1. Here, an objectness score of 0.1 results in the

Most relevant to the present work, Anderson et al. used the Faster R-CNN model to detect cavities in Ni-based X-750 alloys.[31] Their Faster R-CNN model effectively found cavities, with reported F1 scores in the range of 0.7-0.8. Because the Faster R-CNN model does not provide pixel-level segmentation information, additional post-processing methods separate from the deep learning model were used to extract the cavity size information from the predicted bounding boxes. The present work employs the Mask R-CNN model to realize a fully end-to-end deep learning cavity detector. We include the publicly available data used in the work of Anderson et al. from the Canadian Nuclear Laboratory (CNL), which we refer to as the CNL dataset in this work. We significantly expand the previously available cavity image database to cover a greater range of alloy compositions and irradiation conditions by including new images from the Nuclear Oriented Materials & Examination (NOME) Laboratory at the University of Michigan, which we refer to as the NOME dataset in this work (see Section 4 for more information). Two examples of images from each of the CNL and NOME datasets are shown in Figure 1.

There are many possible ways of assessing a segmentation machine learning model for defects. One level is how the model performs as a classification algorithm, which can be done for any object classified by the model. A typical model provides classification for pixels (in or out of the defect), defects (found or not found), and defect types (for cases with multiple defect types). Such classification performance is generally characterized by metrics such as precision, recall, accuracy, and F1 scores. A second level of assessment is how the model performs for defect properties, which might include basic properties (e.g., size distribution, mean size, density, shape, position, etc.) and evolutions or correlations associated with those basic properties (e.g., growth rate, diffusivity, pair distribution function, etc.). A third level of assessment is materials properties, which for irradiated alloys are generally swelling or hardening predictions based on physical models and properties of the observed defects. Assessments like those just listed can generally be done with different groupings of the data, e.g., for a fixed area, on a per-image basis, or for a specific set of images. Also, since assessments are generally done on left-out test data, those test data sets can be generated by different methods, the most common being choosing them at random (e.g., k-fold cross-validation) or removing specific groups of data with select properties to represent likely use cases for the model. In this work, we focus assessment on classification scores for finding defects, defect size distribution and density, and material swelling. We do this on both a per-image basis and averaged over multiple images. Together, these assessments explore the accuracy of the model for the information typically utilized by the radiation effects community.
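Defect-level classification (found or not found) requires a rule for matching predicted cavities to labeled ones. The sketch below shows one common way to do this: predictions below an objectness score threshold are discarded, and the rest are greedily matched to ground-truth cavities by intersection-over-union (IoU). The greedy matching rule, the function names, and the use of bounding boxes rather than Mask R-CNN masks are simplifying assumptions for illustration; only the default thresholds of 0.1 are taken from the Figure S10 caption.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_matches(true_boxes, pred_boxes, pred_scores, iou_thresh=0.1, score_thresh=0.1):
    """Return (tp, fp, fn) using greedy one-to-one matching, highest score first."""
    kept = sorted(
        (p for p in zip(pred_scores, pred_boxes) if p[0] >= score_thresh),
        key=lambda p: p[0],
        reverse=True,
    )
    unmatched = list(range(len(true_boxes)))
    tp = 0
    for _, box in kept:
        # Pick the still-unmatched true cavity that overlaps this prediction most.
        best = max(unmatched, key=lambda i: box_iou(true_boxes[i], box), default=None)
        if best is not None and box_iou(true_boxes[best], box) >= iou_thresh:
            unmatched.remove(best)
            tp += 1
    fp = len(kept) - tp
    fn = len(unmatched)
    return tp, fp, fn


# Invented example: two labeled cavities, three predictions (one low-confidence).
true_boxes = [(0, 0, 10, 10), (20, 20, 30, 30)]
pred_boxes = [(1, 1, 11, 11), (50, 50, 60, 60), (20, 20, 30, 30)]
pred_scores = [0.9, 0.8, 0.05]
print(count_matches(true_boxes, pred_boxes, pred_scores))
```

Sweeping `iou_thresh` and `score_thresh` over a grid and recording the resulting F1 score is one way a heatmap like the one described for Figure S10(A) could be produced.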

which is provided in Figure S1 of SI Note 1. The
Overall, the Mask R-CNN model assesses material swelling well, with a typical mean absolute error of about 0.30 percent swelling. This error is small enough for the model to discern changes in swelling responses based on material design (e.g., alloy refinement) and service conditions (e.g., temperature, dpa), and the model thus readily provides an accelerated means to assess these factors in TEM-based swelling quantification workflows.
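To make the swelling metric concrete, the sketch below computes percent swelling for a single image from a list of cavity diameters and then the mean absolute error between true and predicted swelling. It assumes the widely used TEM swelling expression for spherical cavities, in which the total cavity volume is divided by the imaged foil volume minus the cavity volume. The expression, the foil thickness, and all numerical inputs are assumptions made for illustration and are not taken from this work.

```python
import math


def percent_swelling(diameters_nm, image_area_nm2, thickness_nm):
    """Percent swelling assuming spherical cavities in a foil of uniform thickness.

    swelling = 100 * V_cav / (A * t - V_cav), with V_cav = (pi / 6) * sum(d_i^3).
    """
    cavity_volume = sum(math.pi / 6.0 * d**3 for d in diameters_nm)
    foil_volume = image_area_nm2 * thickness_nm
    if cavity_volume >= foil_volume:
        raise ValueError("cavity volume must be smaller than the imaged foil volume")
    return 100.0 * cavity_volume / (foil_volume - cavity_volume)


def mean_absolute_error(true_vals, pred_vals):
    """Mean absolute error between paired true and predicted values."""
    return sum(abs(t - p) for t, p in zip(true_vals, pred_vals)) / len(true_vals)


# Invented example: a 1000 nm x 1000 nm field of view and a 100 nm thick foil.
area_nm2, thickness_nm = 1000.0 * 1000.0, 100.0
true_swelling = percent_swelling([10.0] * 6, area_nm2, thickness_nm)
pred_swelling = percent_swelling([10.0] * 5, area_nm2, thickness_nm)
print(mean_absolute_error([true_swelling], [pred_swelling]))
```

Because the cavity volume scales with the cube of the diameter, a few large cavities can dominate the swelling of an image, so a model can miss many small cavities and still give a small swelling error.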

Figure 2: Parity plot of true and predicted material swelling. The different symbols correspond to different cross validation train/test splits. The fit statistics in black text denote the average +/- standard deviation across all five splits for each metric.
In addition to the materials property statistics summarized here, we have collected the classification statistics of overall P, R, and F1 scores and average per-image P, R, and F1 scores for the tests discussed above. We find that the conclusions regarding model performance in the context of material swelling generally persist when considering the overall and average per-image F1 scores (see Table S2 in SI Note 2). Additional discussion of the classification metrics and a table of their values can be found in SI Note 2. Overall, our results demonstrate that it is preferable to simply train one model with training images from both datasets, as the model domain is widened without loss in classification or materials property metric performance within any given single dataset.

Table S1: Summary of classification and material property metrics for five splits of random cross validation using the combined CNL+NOME dataset.

Split | Overall statistic | Average per-image statistic | Defect size error (nm) (percent error) | Defect density error (×10⁴ nm⁻²) (percent error) | Defect swelling error (%) (percent error)
accurate cavity densities. This interplay of model errors of cavity size, density and swelling is discussed more in Section 2.3 of the main text.

Table S3: Summary of classification metrics of per-image P, R, and F1 scores and overall P, R, and F1 scores for Mask R-CNN models fit to the data splits shown in Figure 3 of the main text and Figure S7.

Table S5 and throughout this work. The purpose of evaluating models with these different random splits of CNL+NOME data was to quantify an expected average and standard deviation in model predictive performance for the scenario where the test images are drawn approximately from the same domain as the training images.

Table S5: Summary of data splits used to train and test Mask R-CNN models in this work.