Introduction

The disposal of nuclear waste has far-reaching implications for both the environment and human well-being. To ensure safe storage, transportation, and disposal, vitrification of radioactive materials into waste forms is a crucial step in the waste management process1,2. Among the critical attributes for successful waste disposal, the chemical durability of waste-bearing glasses stands out as a paramount consideration, given the potential exposure of the disposal to aggressive environments3. Over the past decades, scientific research has demonstrated that the durability of waste glasses is fundamentally influenced, though not entirely determined, by their chemical compositions. Therefore, understanding the relationship between glass composition and durability performance holds special significance in the context of proper waste disposal.

At the Hanford Tank Waste Treatment and Immobilization Plant (WTP), the immobilized low-activity waste (ILAW) is slated for on-site disposal at the near-surface Integrated Disposal Facility4. As the ILAW glasses act as the primary barrier against the release of radionuclides to the biosphere, the Department of Energy (DOE) has mandated two chemical durability-related measurements to characterize the corrosion behavior of candidate ILAW glasses—the product consistency test (PCT)5 and the Vapor Hydration Test (VHT)6. Different from the PCT which focuses on the steady-state leaching rate, the VHT is specifically designed to measure the glass corrosion rate under hydrothermal conditions, a process that leads to partial conversion of the glass into amorphous and crystalline alteration products. This hydrothermal corrosion test offers critical insights into the long-term durability of ILAW glasses in on-site disposal scenarios, contributing significantly to our understanding of the complex processes influencing glass performance in nuclear waste immobilization. The DOE has established specific requirements for measuring the VHT response limit as follows7:

The glass corrosion rate shall be measured using at least a seven (7)-day vapor hydration test run at 200°C as defined in the DOE-concurred upon ILAW Product Compliance Plan. The measured glass alteration rate shall be less than 50g/(m2·day). Qualification testing shall include glass samples subjected to representative waste form cooling curves. The vapor hydration test shall be conducted on waste form samples that are representative of the production glass.

In VHT, the rate of conversion is influenced by the coupled rates of glass corrosion and the formation of primarily crystalline zeolite alteration products. The nucleation of zeolite phases in this process is sporadic, leading to a high degree of response variability, particularly during the early stages of conversion (Fig. 1)8. Within the typical 24-day VHT duration, glasses with high alteration rates have usually progressed past the early sporadic phase, while glasses with low alteration rates may still be within the early sporadic phase. Consequently, simple functions used to describe the VHT responses of waste glasses are inaccurate. The high degree of nonlinearity and the variability in zeolite nucleation, coupled with variability in experimental factors, contribute to high bias and uncertainty in VHT response prediction. Jiricka et al. highlighted the significant impacts of surface finish, volume of water, and cutting procedures on the single-time VHT responses, all of which are generally not strictly controlled across different VHT tests8. As a result of these experimental variations, a relative standard deviation of 63% was observed for the pooled replicated single-time VHT response9.

Fig. 1: Impact of time on the amount of glass converted to alteration products in typical VHT tests8,58,59.
figure 1

The dashed lines represent linear fittings for the data measured from two studies, excluding the initial days.

In real production scenarios, glass property-composition models are utilized to predict the VHT response of glasses during the processing of each waste batch at Hanford’s LAW vitrification facility. The use of these models, along with their uncertainty descriptions, is necessary to ensure with sufficient confidence that glasses satisfy the VHT specifications listed in the WTP contract10. This approach is also used for other glass processing and product quality-related properties11. Existing VHT response models have met this goal with limited success. In terms of the previous modeling efforts, Vienna et al. modeled the logarithm VHT response of 58 glasses as a first-order function of composition12:

$${\mathrm{ln}}\left(r,\frac{{\rm{g}}}{{{\rm{m}}}^{2}\cdot {\rm{d}}}\right)=\mathop{\sum }\limits_{i=1}^{q}{r}_{i}{g}_{i}$$
(1)

where r is the VHT rate at 200 °C (based on multi-time VHT responses), ri is the ith component coefficient, and gi is the ith component mass fraction in a glass. More recently, partial quadratic mixture (PQM) models were applied to advance the prediction of the VHT response in a similar manner13:

$${\mathrm{ln}}\left(d,{\rm{\mu }}{\rm{m}}\right)=\mathop{\sum }\limits_{i=1}^{q}{v}_{i}{g}_{i}+\text{Selected}\left\{\mathop{\sum }\limits_{i=1}^{q}{v}_{{ii}}{g}_{i}^{2}+\mathop{\sum }\limits_{i=1}^{q-1}\mathop{\sum }\limits_{j=i+1}^{q}{v}_{{ij}}{g}_{i}{g}_{j}...\right\}$$
(2)

where d is the VHT alteration thickness in µm after a 24-day test duration at 200 °C, vi is the ith component coefficient, vii is the ith component squared coefficient and vij is the ith and jth component cross-product coefficient. These models demonstrated some success in predicting VHT responses within relatively narrow composition regions but faced high relative uncertainty. For instance, to satisfy the WTP constraint of the VHT rate being 50 g/m2/d with 90% confidence, the predicted rate needed to be below 7.9 g/m2/d10. Also, the validation performance of the model is unsatisfactory, as shown in Supplementary Fig. 1. As the composition regions expanded, particularly for the relatively large region of enhanced LAW glasses14,15,16, the prediction uncertainty increased, and distinct biases emerged. The PQM model by Vienna et al., for instance, over-predicted low VHT responses while under-predicting higher VHT responses (Fig. 2). Vienna et al. also attempted to address this issue using an artificial neural network (ANN) to model VHT response as a function of composition15. While this model exhibited reduced uncertainty, it did not perform well on validation data, indicating overfitting. Consequently, Vienna et al. did not obtain a suitable model to quantify VHT responses in enhanced ILAW glass data and so developed a logistic regression model to predict pass/fail state of glasses relative to the 50 g m−2 d−1 contract limit16. Therefore, for both research and the design of LAW glasses with enhanced performance, a more robust model is needed to overcome experiential noise and capture the true compositional effects on VHT response.

Fig. 2: Predicted vs. measured logarithm VHT responses using PQM, aligned to an equivalent 24-day alteration depth14.
figure 2

For measured VHT response, the pooled standard deviation is 0.9034 ln(µm)9, and, for predicted VHT response, the uncertainty (95% prediction interval) is illustrated in Supplementary Fig. 1.

This study leverages advanced machine learning methods to unveil the distinct effect of different oxides on the durability performance of nuclear waste immobilization glasses. We utilize a dataset comprising VHT alteration rates of more than 600 Hanford LAW glasses, encompassing a wide compositional envelope. Four machine learning methods, including PQM and ANN from previous studies, and two nonparametric approaches, namely, local linear regression (LLR) and Gaussian process regression (GPR), are employed to predict their VHT responses. The nonparametric methods, being more capable of fitting complex relationships without presuming the target function’s form, typically offer superior performance17. Our test results demonstrate that GPR achieves the best overall accuracy and robustness in predicting the VHT response of unseen glass compositions, significantly outperforming the benchmark models. Through detailed interrogation of the GPR model, we decompose the complex relationship between glass composition and VHT response, effectively unveiling the individual effects of the different major oxides. Based on the observations from this study, we further discuss an interesting phenomenon that is generic to machine-learning-based material studies, where a slightly-overfitted model is found to be beneficial for extrapolating a small dataset with a heterogeneous sample distribution.

Results

Prediction accuracy of the machine learning models

Based on the data reported by PNNL-30932 report16, a dataset consisting of 639 single-time VHT responses was curated. This dataset considers the influence of ten major oxides (as the model inputs) on the VHT response (as the model output). The distributions of those inputs and output are summarized in Fig. 3. As indicated by the orange whiskers in Fig. 3, the glass compositions involved in this dataset range over a broad envelope as relevant to diverse glass waste forms. Additional details of the data curation are provided in the “Methods” section.

Fig. 3: Composition ranges of the ten oxides in the 585 considered nuclear waste immobilization glasses for modeling the VHT responses, along with the distribution of the VHT responses in the dataset.
figure 3

Note that the unit of the scaled VHT alteration rate, *, represents g/(m2 d). The red, two purple, and two orange whiskers indicate the median, lower and upper quartiles, and lower and upper extremes of the feature values, respectively.

Based on the procedures described in the “Methods” section, the four models—PQM, ANN, LLR, and GPR—were optimized. The optimal prediction accuracies of these models were compared using ten different train-test splits, and the results are presented in Table 1. Additionally, Fig. 4 showcases the predicted and true VHT alteration rates of the test samples from one single train-test split. Overall, all the models achieve decent prediction accuracy in mapping the glass composition to the VHT alteration rate. While the averaged RMSE values on the test set ~1.0 ln[g/(m2·d)] for all four models, it is worth noting that the pooled standard deviation (SD) is found to be 0.9034 ln[g/(m2·d)]) from analyzing duplicate glasses (i.e., same glass composition with multiple measured values), representing the uncertainties caused by fabricating glasses, and performing VHT analyses9. The small difference between RMSE and pooled SD indicates that the models are properly fitted (i.e., not over or underfitting).

Table 1 Comparison of the predictive accuracies and lack-of-fit statistics for four models—PQM, ANN, LLR, and GPR
Fig. 4: Comparison of the accuracy of the machine learning models.
figure 4

Predicted vs. measured VHT alteration rate for (a) PQM, (b) ANN, (c) LLR, and (d) GPR models, based on the same test samples from a single train-test split. The points are color-coded based on the density of the scatters on the plot (the concentration increases from blue to red), and * represents the unit of the scaled VHT alteration rate, i.e., g/(m2·d). The dashed line is added to visualize the perfect agreement.

Another noteworthy observation from Fig. 4 is that all four models exhibit decreased prediction accuracy as the VHT alteration rates decrease. This trend can be attributed to the reduced number of samples with very low VHT alteration rates in the dataset, as illustrated in Fig. 3. Such scarcity of samples in this range implies that corresponding predictions entail more extrapolation. In that regard, an intriguing phenomenon observed during model optimization is the delicate balance between the degree of overfitting and extrapolability, influenced by varying degrees of model regularization. This dynamic arises due to the uneven distribution of samples in the dataset. Specifically, a slightly-overfitted model extrapolates more accurately for glass compositions with very high or very low VHT alteration rates. This point is discussed in detail in the “Discussion” section.

Among the four models, the PQM model, despite achieving a higher accuracy than reported in the previous study14, exhibits the lowest performance. The relatively lower accuracy of the PQM model can be attributed to its finite analytical terms, which struggle to adequately capture the intricate dependence between the glass composition and its VHT response18. For further insights into the analytical terms of this updated PQM model, please refer to Supplementary Note 3. Conversely, the ANN model demonstrates considerable improvements in both test R2 and RMSE, owing to its higher flexibility in fitting non-linear patterns. However, a significant gap is observed between the training and test accuracies of the ANN model, suggesting a tendency to overfit. This observation is consistent with previous findings and raises concerns about the ANN model’s ability to generalize well on new samples14, particularly for new glasses located near the edge of the compositional envelope.

In comparison, both the nonparametric models exhibit superior prediction accuracies. The LLR model shows the best performance on test RMSE among the four models, but the high prediction variations of LLR (indicated by the standard deviation values in Table 1) raise questions about its robustness. The high variations of LLR can be attributed to the fact that some samples in the VHT dataset are sparsely distributed in the compositional envelope, causing the local fitting of the LLR model to vary significantly across the different train-test splits. In contrast, the GPR model achieves a good prediction accuracy, with the highest R2 accuracy and the second-lowest RMSE on the test set. Importantly, the GPR model does not involve the instability associated with LLR or the overfitting issue of ANN. Furthermore, the lack of fit test indicates that GPR has the smallest F score (i.e., least likely to involve the lack-of-fit issue) compared to PQM and LLR models. Considering all the above-discussed factors, GPR emerges as the most suitable modeling method for the logarithm VHT response dataset19. Consequently, results and discussion in the subsequent sections focus only on GPR.

One-dimensional effect of individual oxides on VHT response

To interpret the effect of individual oxides on the VHT alteration rate as learned by the optimized GPR model, we conducted a one-dimensional feature effect analysis. We first assumed a hypothetical glass composition based on the median value of each oxide feature (as shown in Fig. 3), representing a generic composition of nuclear waste immobilization glasses. Subsequently, we analyzed the effect of each oxide by sequentially jittering its content within its 15-to-85 percentiles in the dataset, while rescaling other oxides such that the total oxide content remained unchanged. The impact of each oxide was then evaluated based on the variation in the predicted VHT alteration rate resulting from the jittered oxide content.

The results of the individual oxide effect analysis are summarized in Fig. 5. Notably, the different oxides exhibit distinct effects on the VHT response, characterized by varying slopes and degrees of nonlinearity, which broadly corroborate the observations reported in a prior analysis utilizing the PQM model16. Based on the slopes of the feature lines, the ten oxides considered in this study can be categorized into two groups (ranked by the feature slope and the quantity of each oxide in the dataset)—those that increase the VHT alteration rate of the glass (Na2O, Li2O, K2O, CaO, and Al2O3), and those that decrease the rate (ZrO2, SiO2, TiO2, and SnO2). Note that B2O3 has minimal impact on VHT response. Meanwhile, we stress that the displayed feature effects can be influenced by the choice of the reference composition (i.e., the centroid in Fig. 5). The effect of the same oxide can also vary depending on the reference composition, which, in turn, highlights the challenge of interpreting the effect of each individual oxide with a conventional model.

Fig. 5: One-dimensional composition effect on VHT alteration rate based on the optimized GPR model.
figure 5

The centroid corresponds to the median value of each oxide feature in the dataset. Note that* represents the unit of the scaled VHT alteration rate, i.e., g/(m2·d).

Model interrogation with SHAP analysis

To provide a more comprehensive understanding of the prediction of VHT alteration rate by the optimized GPR model, we employed the SHAP analysis, a powerful model interpretation technique20,21, as briefly reviewed in the “Methods” section. Unlike the one-dimensional composition analysis, the SHAP interpretation is based on the actual compositions in the VHT dataset, without relying on assumptions about a reference glass composition. For this analysis, all the samples in the dataset were considered, ensuring a robust and detailed examination of the model’s predictions.

In Fig. 6, we present a summary of the marginal contribution of each oxide component to the VHT alteration rate. The oxides are here ranked based on their absolute influence on the VHT alteration rate as captured by the Shapley values. The horizontal axis represents the Shapley value (i.e., relative impact), where positive (negative) values indicate an increase (decrease) in the VHT rate. For each oxide, the horizontal distribution of the scatters shows the variation of its marginal contribution in response to the standardized feature values such that all oxides can be compared with their mean value aligned to the centerline in the plot. This enables easy comparison of all oxides with their mean values. Taking Na2O as an example, this plot suggests that, as the content of this oxide increases (as the scatter color shifts from blue to pink), its marginal contribution changes from negative to positive. This suggests that a higher amount of Na2O tends to raise the VHT alteration rate while low concentrations show the opposite effect.

Fig. 6: Summary of the SHAP analysis.
figure 6

This plot shows the relative impact (i.e., marginal contribution) of the oxide components on the VHT alteration rate, as ranked by importance from top to bottom. Note that the feature values in this plot have been standardized to ensure a fair comparison. * represents the unit of the scaled VHT alteration rate, i.e., g/(m2·d).

We next display the feature-wise SHAP results in Fig. 7, grouping the oxides based on the slopes of their feature effects from Fig. 5. In each plot, the actual content of each oxide is plotted against the marginal contribution, with each point representing a specific sample in the VHT dataset. We observe that most of the oxides exhibit a monotonous influence on the VHT rate. Among all, Na2O, Li2O, ZrO2, and K2O, show clear effects in affecting the VHT rate, with ZrO2 showing a negative influence opposite to the other three oxides. In contrast, the data patterns extracted from CaO and Al2O3 are less clear in both Figs. 6 and 7. This point is further discussed with additional evidence based on the SHAP analysis in the “Discussion” section.

Fig. 7: Feature-wise results of the SHAP analysis.
figure 7

Those plots show the correlation between the oxide content (on the actual scale) and marginal contribution: a Na2O, b Li2O, c K2O, d Al2O3, e CaO, f B2O3, g SiO2, h SnO2, i TiO2, and j ZrO2. Here, the feature values are based on the actual scale. To assist comparison, the vertical and horizontal dash lines indicate the mean feature value and zero marginal contribution, respectively.

Discussion

We first discuss the trade-off between model overfitting and extrapolability associated with the uneven sample distribution in the dataset, as noted in the “Results” section. This is a common challenge faced in many machine-learning-based material studies due to the limited scale of common lab testing22,23,24,25. Conventionally, it is expected that a model’s extrapolation capability is maximized when overfitting is minimized. However, our findings based on the VHT dataset reveal a somewhat surprising outcome: a slightly-overfitted model appears to exhibit more accurate extrapolation at very low or very high VHT rates.

This trade-off issue is collectively illustrated with Figs. 8, 9, and 10, as explained in the following. We first employ progressive regularization of the GPR model, transitioning from overfitting to underfitting, by gradually increasing the noise level parameter. Using the RMSE difference between the training and test set samples as a measure, we then track the model deviation at different ranges of the VHT alteration rate. As a comparison, Fig. 8 presents the prediction deviations between a non-overfitted model (achieved by minimizing the difference between the train and test R2 accuracies) and a slightly-overfitted model. For statistical representativeness, the single sample with a VHT rate larger than 6 ln[g/(m2·d)] was excluded from this analysis.

Fig. 8: Trade-off between model overfitting and extrapolability observed in this study.
figure 8

Variation of the prediction deviation (i.e., the RMSE difference between the training and test sets) at different VHT alteration rates, based on the (a) non-overfitted and (b) slight-overfitted GPR models. The interpolation and extrapolation ranges are roughly divided according to the data distribution in the VHT dataset. Note that * represents the unit of the scaled VHT alteration rate, i.e., g/(m2·d).

Fig. 9: Variations of the prediction deviation (blue squares) and the degree of overfitting under different levels of model regularization (green triangles).
figure 9

Here, the two metrics are both calculated based on the RMSE difference between the training and test set samples, but the former takes the average from the different VHT intervals (see Fig. 8) while the latter is on the overall difference.

Fig. 10: Illustration of the trade-off between overfit and model extrapolability.
figure 10

Predictions made by (a) non-overfit and slight-overfit and (b) strong-overfit models trained based a simplified heterogeneous dataset. In both cases, the arrows indicate the major prediction deviations from the true function.

Ideally, a model with robust extrapolability should exhibit a consistent level of deviation across the entire range of the VHT rate. In Fig. 8a, the non-overfitted model demonstrates relatively low deviations within the interpolation region, but its deviation significantly increases in the extrapolation regions. Conversely, the slightly-overfitted model in Fig. 8b shows marginally inferior interpolation performance, yet it achieves substantially lower deviations during extrapolation. Consequently, compared to the non-overfitted model, the slightly-overfitted model appears to be better suited for predicting glass compositions with extreme VHT responses.

To further investigate the trade-off issue illustrated in Fig. 8, the average deviation (mean deviation averaged from the different VHT intervals), and the degree of overfitting (i.e., the RMSE difference between the training and test set samples) at different levels of model regularization are plotted in Fig. 9. Noted that the zero regularization here indicates a very low level of regularization that causes the model to overfit, as the noise level is not the only parameter controlling the regularization in GPR. Our analysis reveals that, as the model regularization increases, the degree of overfitting gradually diminishes as anticipated, but this is accompanied by a continuous increase of the average deviation, which indicates a decline in the model extrapolability. As seen in Fig. 8, the rise of the average deviation is primarily driven by the higher deviations at low and high VHT rates. These trends, depicted in Figs. 8 and 9, are not commonly expected, as the general understanding is that overfitting leads to larger deviations, undermining model performance.

The superior performance of the slightly-overfitted model observed in this study is likely related to the uneven distribution of the VHT alteration rate in the adopted dataset. Specifically, most of the VHT samples are concentrated within the mid-range of the alteration rate between 0 and 4 ln[g/(m2·d)] (see Fig. 3), while fewer samples are outside this range. The observed issue is illustrated with a simplified case of the heterogeneous dataset in Fig. 10. In this figure, the majority of the training samples are concentrated at 4 to 8, and the major prediction deviations from different models are indicated with arrows. The model’s extrapolability is reflected by the magnitude of deviations at the two ends, where both the non-overfitted and overfitted models exhibit large deviations when predicting samples outside the middle range. In contrast, despite the slightly higher variance compared to the non-overfitted model, the slightly-overfitted model achieves more reasonable predictions when extrapolating beyond the populated region.

Returning to the VHT dataset, it is apparent that the model optimization is dominated by the model accuracy in predicting the most populated samples within the mid-range of VHT rate. Consequently, much lower weights are assigned to extrapolating glass compositions with high and low VHT rates. With slightly lower regularization, however, more weight of the model optimization is allocated to the less populated regions. As a result, the slightly-overfitted model extrapolates the VHT rate substantially better at the extremities.

Drawing from the observations made from the VHT dataset as discussed above, different strategies can be employed to select the best model based on the target of the model prediction. If the model is primarily intended for interpolating material performance within the populated region of the dataset, the non-overfitted model is expected to provide the best predictability. However, in cases where there is an interest in extrapolating material performance, adopting a slightly-overfitted model could yield more robust predictions for extrapolating material behavior. Indeed, the ability of the slightly-overfitted model to accurately extrapolate at very low or very high VHT rates is particularly relevant and significant for material research. This is because the slightly-overfitted model may result in more robust predictions in extrapolating the material behavior, which facilitates the exploration of unexplored regions in the compositional space, potentially leading to the discovery of new materials with unconventional compositions.

Next, we discuss the effect of individual oxides on the VHT alteration rate, as conducted in both the one-dimensional feature effect analysis and the SHAP analysis, yields consistent and insightful results. Based on the results in Figs. 5, 6, and 7, the influence of the ten oxides can be ranked descendingly as follows: Na2O, Li2O, ZrO2, K2O, TiO2, SnO2, SiO2, CaO, Al2O3, and B2O3. Among all, Na2O has the most substantial impact on the VHT alteration rate, with higher concentrations leading to a faster rate of alteration, which becomes more pronounced when its molar fraction exceeds 20%14,16. Meanwhile, only CaO and Al2O3 exhibit unclear effects on affecting the VHT rate. In general, the interpretation of these oxide effects aligns well with the roles of the corresponding atoms in altering the glass’s formation ability and chemical durability, which are related through the oxide composition and atomic structure. In terms of the VHT test, the response of a glass is determined by two primary factors: 1) the influence of the components on glass network stability and 2) the influence of the components on the stability of the solution leading to the driving force to dissolution.

Regarding the glass network stability, it is associated with the role of different atoms in altering the topology of the glass atomic network (see topological constraint theory)26,27,28, which influences the strength of the chemical bonds within the glass29,30. Here, Na, Li, and K atoms that form alkali oxides play a role in depolymerizing the glass network, thus reducing the corrosion resistance of the nuclear waste immobilization glass and therefore increasing the VHT rate. In contrast, Si and B atoms, as typical network-forming elements, enhance the glass formation ability and, consequently, reduce the VHT response. Additionally, the weaker network-modifying elements, Zr, Ti, and Sn, also contribute to enhancing glass network formation31. Aluminum has a mixed effect on the glass structure depending on the overall glass bonding environment30,32.

As for the effect of the solution on glass dissolution, alkali ions released into the solution increase its pH, driving a higher dissolution rate while boron buffers the pH of the corroding solution. Zirconium, Ti, and Sn are only sparingly soluble in solution and therefore have little effect on this mechanism. Orthosilicic acid and aluminate ion concentrations in solutions most strongly influence the driving force for dissolution33,34. High Al concentrations in a caustic solution cause a precipitation of zeolites which dramatically reduce the concentrations of both aqueous Si and Al, then increase the dissolution rate35. Notably, the effects of Ca and Al atoms on the VHT response appear to be mixed, with their marginal contributions being either positive or negative, even at fixed oxide content levels (see Fig. 7e, d). By further taking advantage of the SHAP analysis, we revisit the effect of those oxides while focusing on identifying their most interactive features21, as displayed in Fig. 11. We find that the total content of alkaline oxides (Na2O, K2O, and Li2O) has the strongest interactive effect with CaO and Al2O3. Under low alkaline oxide content, the marginal contributions of CaO and Al2O3 are positive, while this trend reverses with the presence of a high total alkaline content. This dual pattern may be linked to the significant impact of alkaline oxides on solution pH, which is known to accelerate glass corrosion32. This observation points to a potential feedback effect involving the solution chemistry that influences the VHT response in complex ways. Thus, CaO and Al2O3 warrant further investigation to understand the underlying solution feedback mechanisms that govern their influence on the VHT response. To systematically analyze the interactive effects among other oxides that also exhibit potential significance in glass alteration processes, future investigations may benefit from employing more sophisticated criteria and validating the results through experimental characterizations.

Fig. 11: Interactive effects between oxides as revealed by the SHAP analysis.
figure 11

Variations on the marginal contribution of (a) CaO and (b) Al2O3 to VHT response under the influence of the total content of alkaline oxides (Na2O + K2O + Li2O). Here, the arrows are added to highlight the dual effects of these two oxides, and the feature values are based on the actual scale. To assist comparison, the vertical and horizontal dash lines indicate the mean feature value and zero marginal contribution, respectively.

Overall, machine learning proves to be a valuable tool for understanding the relationship between the composition of nuclear waste immobilization glasses and their VHT responses. Through proper model interpretations, data-driven machine learning offers meaningful insights into the durability of these glasses and the effects of individual oxides, which are not easily accessible using conventional approaches.The findings presented herein contribute to the knowledge and development of nuclear waste immobilization glasses, paving the way for enhanced material design and performance optimization in this critical field of research. More importantly, the generic analyses presented in this study can serve as a paradigm for other machine-learning-based material investigations.

Methods

Curation of the VHT dataset

A comprehensive database consisting of 654 single-time VHT responses, measured for 595 individual glasses, was initially compiled, including 59 replicates of 44 distinct glass compositions8,14,15,36. The raw data is documented in the Appendix A of the PNNL-30932 report16. The majority of the VHT results were obtained at the standard 24-day measurement period. In cases where some glasses were completely corroded during the 24-day duration, shorter measurement times such as a 7-day test were employed. For the sake of uniform analysis, those shorter testing points were extrapolated to an equivalent 24-day specimen thickness, assuming a constant rate of alteration from the test’s initiation. It is worth noting that this approach, while ensuring consistency, may introduce some uncertainty to the measured value.

Following an initial screening process, a total of 15 glasses were excluded from the database, as summarized in Table 2. The removal criteria were as follows (i) extreme compositions for 13 glasses, (ii) consistently gross model fit outliers for one glass, and (iii) abnormal VHT measurement for one glass. For the first type of outliers, the removal of the extreme compositions was based on a systematic analysis previously reported by Vienna et al. (see Section 4.1.1.1 of the cited report)16. For the second type, a single glass consistently appeared as an isolated data point, residing far from all other data points in the fitting conducted by different models. Regarding the third type, the 7-day corrosion result exceeded the 24-day result, possibly indicating an experimental issue such as water loss during the test or reflux.

Table 2 Outlying data removed from the VHT model fitting dataset16

After removing the outliers, a dataset was prepared for the subsequent machine learning analysis comprising 639 VHT responses from 585 individual glass compositions, with an additional 54 replicates of 42 glass compositions. Given the limited size of this dataset, some oxides in the glass compositions that are less influential to the VHT alternation rate were omitted to reduce data dimensionality, thereby improving the efficiency of the machine learning model training and mitigating the “curse of dimensionality” phenomenon37. The identification of uninfluential oxides was previously documented by Vienna et al. in a comprehensive study (refer to Section 3.3 of the cited report)14. This analysis initially pinpointed 15 oxides with reasonably distributed concentrations for potential inclusion as model inputs. These included Al2O3, B2O3, CaO, Fe2O3, K2O, Li2O, MgO, Na2O, P2O5, SiO2, SnO2, TiO2, V2O5, ZnO, and ZrO2. Through a systematic elimination process based on the step-by-step removal of oxides with negligible effects, four oxides— MgO, P2O5, V2O5, and ZnO—were excluded from further consideration. Herein, Fe2O3 was further excluded due to its strong correlation with multiple oxides known to exert a significant influence on the VHT alteration rate, such as Al2O3, Na2O, SiO2, and ZrO2. Consequently, a total of ten major oxide components (see Fig. 3) were selected based on their impact on VHT response14. These components constituted a significant portion of the total molar content, ranging from 82.2% to 96.5% with a median of 92.5%. For the model inputs, the molar fractions of the selected oxide were further normalized to sum to one.

Regarding the model’s output, the raw VHT response of the selected samples ranged from 0.7 to 13848 µm in depth. To standardize the influence of different sample sizes and testing periods, the VHT response was scaled as the glass mass change per square meter per day. Subsequently, the model’s output was taken as the natural logarithm of the VHT alteration rate, reducing the skewness of the output distribution. Following these transformations, the VHT alteration rate now ranges from −2.30 to 7.33 ln[g/(m2·d)] in the dataset. Note that the ranges of both the glass composition and VHT alteration rate considered herein are broader than those in previous investigations8,14,15,36, which presents additional challenges in modeling the relationship between glass composition and VHT alteration rate.

Machine learning analyses

Based on the dataset, we employed various machine learning approaches to predict the VHT alteration rate based on the glass composition as input. In detail, we considered four regression methods, namely, PQM, ANN, LLR, and GPR. The first two have been used in previous studies and they are considered as baselines15,18, while the latter two are nonparametric models that may exhibit superior performance in fitting the waste glass dataset. To ensure a fair comparison, all four models were provided with the same ten input features for model training (see Fig. 3). The modeling work was conducted in Spyder IDE (5.4.3) with Python (3.10.10), and the specific program packages (if applicable) used are provided below. A brief overview of these four models is provided as follows.

The PQM model employs a combination of relatively simple terms, including linear, squared, and quadratic cross-products, to generate predictions38. By incorporating squared and quadratic terms, PQM can capture certain non-linear patterns in the data. Previous studies have demonstrated its feasibility for modeling the VHT response of nuclear waste immobilization glasses13. The PQM model in this study was built using the statsmodels package39.

The ANN model has been widely popular in machine learning due to its flexibility in processing diverse datasets and performing various machine learning tasks40,41,42. In this study, the ANN model was built using the PyTorch package43. To optimize the model’s architecture, considering the limited size of the VHT dataset and experience established from previous studies22,44,45, we established a network with eight and four artificial neurons in two hidden layers, respectively. ReLU activation and Batch Normalization were implemented between each layer to enhance model accuracy46. Furthermore, Adam optimizer and mean square error (MSE) loss were adopted for guiding the model optimization.

The LLR model performs linear regression locally in the data space, providing non-linear and smooth predictions on a global scale47. LLR employs a distance-based weight function to determine the influence of each data point on local fitting, making it nonparametric and highly expressive for modeling complex relationships. LLR also exhibits good resistance to noise in the dataset. The LLR model in this study was directly coded according to the algorithm.

The GPR model is a probabilistic-based prediction method established on Bayes’ Rule48. GPR assumes that both the input features and output target follow Gaussian processes, enabling the modeling of the probabilistic distribution of the output target (i.e., posterior distribution) based on the input features (i.e., prior distribution)49. This allows GPR to provide not only predictions (mean of the probabilistic distribution) but also estimates of prediction confidence (variation of the probabilistic distribution). Being nonparametric, GPR can fit completely unknown relationships without assuming a specific mapping function. GPR requires selecting an optimal kernel function, which should ideally echo the generic trend of the predicted output distribution. The GPR model in this study was built using the GpyTorch package43,50. Based on the kernel selection pipeline established in a previous GPR study17, we used a combination of Linear and Radial Basis Function (RBF) kernels for modeling the VHT dataset. The former ensures good model extrapolability, while the latter allows the model to fit complex data patterns51. Another hyperparameter, noise level, was tuned for model regularization to prevent overfitting. Further details on the noise level choice are provided in the “Results” section.

In terms of the optimization of the four models, we primarily employed brute-force searches to identify their respective optimal hyperparameters, based on ten independent random training repetitions to evaluate model performance. Further details about the model tunning and the selected optimal hyperparameters are provided in Supplementary Note 2. The search was performed on 85% of the samples (i.e., training set), which were randomly stratified from the dataset to ensure a statistically consistent distribution within the raw dataset22,44,52. The remaining 15% samples (i.e., test set) were held hidden in order to test the true performance of the models with optimized hyperparameters since those samples were never exposed to model training. In each repetition, 15% of the training set samples were held out as a validation set to determine the optimal hyperparameters (i.e., those yielding the lowest averaged MSE loss). Model performance comparison was based on two statistical metrics computed on the test samples: coefficient of determination, R2, and root mean square error, RMSE46. Additionally, a lack-of-fit F-test was conducted to assess whether prediction errors were due to lack of model fit53, where a high F score (>1) indicates a lack-of-fit model.

SHAP analysis

Advanced machine learning algorithms like ANN can effectively interpolate complex, non-linear datasets, but they often act as “black boxes,” making their predictions challenging to interpret. In order to gain insight into the relationship between glass composition and VHT response, we employed the SHapley Additive exPlanations (SHAP) analysis. This method is based on the coalitional game theory, originally proposed by Lloyd Shapley54, and it is intended to address a classic math problem—how to attribute an outcome to the presence of multiple contributing factors. Under the ideal conditions assumed by Shapley’s theory, the “pure” contribution of each factor (namely, marginal contribution, which is presented by the Shapley value in SHAP analysis) can be determined by analyzing a large group of instances that record variations in these factors and the corresponding changes in the outcome. Importantly, this method, although also permutation-based, differs from conventional permutation feature importance analysis, which focuses on the reduction in prediction accuracy, as the Shapley value directly reflects the actual influence on the same scale of the outcome.

More recently, Shapley’s theory has been incorporated into the SHAP analysis for interpreting machine learning models. In brief, this analysis estimates the marginal contributions of the model’s input features to the output20,21. A significant advantage of the SHAP analysis is its ability to reliably approximate the Shapley value without requiring high computational demands for calculating the exact value. This makes it feasible to apply this approach to large datasets and sophisticated machine learning models. Over the past years, SHAP analysis has successfully been utilized to interrogate various machine learning models in several glass-related studies44,55,56,57. The SHAP analysis involves calculating outcome differences by permuting the factor of interest (between its actual value and its mean value) across all instances under different conditions. By averaging these differences over all instances, we can assess the marginal contribution of the factor of interest (i.e., Shapley value). In our analysis, each glass sample in the VHT dataset corresponds to an “instance”, the “factors” are the oxide molar fractions of the glass, and the “outcome” corresponds to the logarithm VHT alteration rate predicted by the model. Note that, depending on the actual influence of an oxide, the Shapley value can be either positive or negative, wherein a positive value signifies an increase in the VHT alteration rate, and vice versa.