Uncertainty quantification in multivariable regression for material property prediction with Bayesian neural networks

With the increased use of data-driven approaches and machine learning-based methods in material science, the importance of reliable uncertainty quantification (UQ) of the predicted variables for informed decision-making cannot be overstated. UQ in material property prediction poses unique challenges, including the multi-scale and multi-physics nature of materials, intricate interactions between numerous factors, and the limited availability of large curated datasets. In this work, we introduce a physics-informed Bayesian Neural Networks (BNNs) approach for UQ, which integrates knowledge from governing laws in materials to guide the models toward physically consistent predictions. To evaluate the approach, we present case studies for predicting the creep rupture life of steel alloys. Experimental validation with three datasets of creep tests demonstrates that this method produces point predictions and uncertainty estimations that are competitive with or exceed the performance of conventional UQ methods such as Gaussian Process Regression. Additionally, we evaluate the suitability of employing UQ in an active learning scenario and report competitive performance. The most promising framework for creep life prediction is BNNs based on Markov Chain Monte Carlo approximation of the posterior distribution of network parameters, as it provided more reliable results in comparison to BNNs based on variational inference approximation or related NNs with probabilistic outputs.


Introduction
Uncertainty Quantification (UQ) plays a crucial role in various science and engineering disciplines. In the field of material science, the application of computational modeling has significantly accelerated the discovery of novel materials with enhanced properties. Determining the level of confidence in the predictions made by computational models is of high importance, as high levels of uncertainty can result in unreliable predictions and poorly informed decisions. In this work, evaluations with a set of predictive single-point and uncertainty metrics demonstrate that MCMC BNNs are the most promising UQ method for creep rupture life prediction, with performance that is competitive with or exceeds the performance of GPR. The results also demonstrate that physics-informed knowledge enhances the models' capacity for improved creep life prediction.
Although prior works have explored the application of ML approaches for predicting material properties [20][21][22][23][24][25][26] and uncertainty estimates in the predictions [16][27][28][29], our proposed approach introduces novel concepts related to BNNs with incorporated physics priors for UQ in material property prediction. Specifically, our work was inspired by and has similarities to the recent article by Mamun et al. [9], who proposed an approach for predicting the creep rupture life of ferritic steels using conventional ML methods. The authors employed GPR to calculate point estimates and uncertainty estimates in the predicted rupture life. Differently from the work by Mamun et al. [9], we developed approaches for point regression and UQ in predicting creep rupture life based on BNNs, and we demonstrated that Bayesian deep learning models consistently achieved improved performance in comparison to GPR for this task. Similarly, a body of work in the literature utilized Physics-Informed ML (PIML) to integrate knowledge from governing physics laws and data-driven methods for obtaining more consistent predictions [30][31][32][33][34]. In our proposed approach, we drew inspiration from the work by Zhang et al. [31], where the authors introduced physics-informed features and a physics-informed NN loss for predicting creep rupture life. On the other hand, the authors did not consider UQ in their work, and their focus was on designing standard NNs with deterministic parameters for creep life prediction. In another related paper, Oliver et al. [16] investigated the use of BNNs for UQ in the field of material science, and they developed VI-based ensemble methods for predicting the properties of composite materials. Differently from our work, the authors in [16] did not consider the integration of physics-informed knowledge in BNNs, and they used simulated data as a proof-of-concept for the proposed approaches, whereas we used collected data from creep tests for experimental validation. Similarly, researchers have proposed incorporating physics-informed priors into BNNs in prior works [35][36]; however, these works focus on other tasks, and to the best of our knowledge, this is the first work to apply such a framework for material property prediction. Lastly, although many previous works have studied AL to prioritize the most informative samples for model training [9][35][36][37][38], in this paper we showed that physics-informed BNNs have the potential to accelerate the model training in AL for material property prediction.

The contributions of this paper include:
• Introduced a physics-informed BNNs approach for material property prediction that incorporates physics knowledge for guiding the solutions of BNNs.
• Performed a comprehensive evaluation of the performance of conventional ML models, traditional NNs with probabilistic outputs, and BNNs for creep rupture life prediction and uncertainty estimation.
• Applied the UQ frameworks to an active learning task to iteratively select data points with the highest epistemic uncertainty and diversity for faster model training with fewer data points.
The paper is organized as follows. Section 2 provides a problem definition and presents an overview of ML methods for UQ in multivariable regression. Section 3 introduces the proposed approach of physics-informed BNNs and describes the active learning application. Experimental validation and comparative analysis are provided in Section 4. Section 5 provides a discussion of the work, and the last section concludes the paper.

Preliminaries: Frameworks for Uncertainty Quantification
The considered problem is a multivariable regression task where, based on a set of $N$ observed data points $X = \{x_1, x_2, \dots, x_N\}$, $x_i \in \mathbb{R}^d$, the goal is to estimate the values of a target variable $y = \{y_1, y_2, \dots, y_N\}$, $y_i \in \mathbb{R}$. The target variable $y$ is a material property of interest; in this paper, we consider creep rupture life. The observed data points $X$ provide relevant material information, such as composition, known physical, mechanical, or other properties, and experimental conditions (such as temperature and stress level) that are important for estimating the target variable.
For a training dataset $D = \{(x_i, y_i)\}_{i=1}^{N}$ and a new data point $x_*$ that does not belong to the previously observed data $X$, the objective is to find a mapping function $f$ that estimates the target value, i.e., $y_* = f(x_*)$. The value $y_*$ is referred to as a single-point prediction or point estimate. In addition, the focus of this paper is on methods that provide uncertainty quantification for the predicted value $y_*$, either through a quantified measure of the variance of $y_*$, via confidence intervals, or by other means. The next sections provide an overview of conventional ML and DL frameworks for UQ in regression tasks.

Conventional Machine Learning Methods
Whereas many ML models for classification tasks inherently output the probabilities of the model's prediction for each class, most ML models for regression tasks typically provide only point estimates of the target value. Consequently, several approaches have been developed that employ or modify standard regression models in order to provide uncertainty estimates.

Quantile Regression
Quantile Regression (QR) [6] is a non-parametric approach for estimating the conditional quantiles, and therefore the uncertainties, in prediction variables. For a quantile $\tau \in (0, 1)$, the conditional quantile function of a target variable $y$ given observed data $x$ in QR is defined as:

$Q_{y|x}(\tau) = \inf\{y : F_{y|x}(y) \geq \tau\}, \quad (1)$

where $F_{y|x}$ is the conditional cumulative distribution function. The QR loss function for quantile $\tau$ is as follows:

$\mathcal{L}_\tau(y, \hat{y}) = \max\big(\tau(y - \hat{y}),\, (\tau - 1)(y - \hat{y})\big). \quad (2)$

For different quantiles, QR fits a separate model to minimize the loss function in (2). For instance, for a new data point $x_*$, by estimating the quantile functions $Q_{y_*|x_*}(\tau = 0.025)$ and $Q_{y_*|x_*}(\tau = 0.975)$, we can obtain the 95% prediction interval of the target variable around the point estimate $y_*$. Similarly, setting the quantile to $\tau = 0.5$, QR solves median regression and outputs the median values.
Compared to traditional regression methods that only estimate the conditional mean of $y_*$ given $x_*$, QR estimates the conditional distribution $Q_{y_*|x_*}(\tau)$, inherently providing uncertainty estimation. Other advantages of QR include robustness to outliers, distribution-free estimation and inference (no distributional assumptions are required), and compatibility with any base model by replacing the original loss function with the quantile loss function.
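As an illustration of QR in practice, the following sketch trains one model per quantile with scikit-learn's gradient boosting and the quantile loss of equation (2); the paper itself uses CatBoost as the base model, and the synthetic data here is a placeholder for the preprocessed creep features.

```python
# Minimal QR sketch: one model per quantile yields a 95% prediction interval.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                              # placeholder features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)     # placeholder target

models = {}
for tau in (0.025, 0.5, 0.975):
    m = GradientBoostingRegressor(loss="quantile", alpha=tau, n_estimators=200)
    models[tau] = m.fit(X, y)

x_star = X[:5]
lower = models[0.025].predict(x_star)    # lower bound of the 95% interval
median = models[0.5].predict(x_star)     # point estimate via median regression
upper = models[0.975].predict(x_star)    # upper bound of the 95% interval
```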

Natural Gradient Boosting Regression
Natural Gradient Boosting (NGBoost) Regression [8] is a probabilistic variant of the traditional Gradient Boosting method [39]. Gradient Boosting is an ensemble learning technique that combines a collection of sequentially-trained base learners (such as Decision Trees or other ML models) into a strong learner. In each iteration of the training process, the base learners are trained to fit the residuals of the errors from the previous iteration, and the ensemble iteratively minimizes the prediction error.
The NGBoost method comprises three key components: a collection of base learners, a parametric form of the conditional distribution $P(y|x, \theta)$, and a scoring mechanism. The scoring mechanism ensures that the predicted distribution closely aligns with the actual distribution; to quantify the agreement, the negative log-likelihood is typically employed. For a new data point $x_*$, the point estimate of the target variable $y_*$ and the uncertainty quantified as the standard deviation $\sigma_*$ are obtained from the conditional distribution $P(y_*|x_*, \theta)$. Advantages of NGBoost include the inherent property of uncertainty quantification, as well as the flexibility to be used with any base learners and any distributions with continuous parameters and scoring rules.
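A minimal sketch with the ngboost library (used in the Implementation Details section) is shown below; the Gaussian predictive distribution and the synthetic data are illustrative choices, not the paper's exact configuration.

```python
# NGBoost sketch: fit a Gaussian conditional distribution and read off
# the predicted mean (point estimate) and standard deviation (uncertainty).
import numpy as np
from ngboost import NGBRegressor
from ngboost.distns import Normal

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                              # placeholder features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)     # placeholder target

ngb = NGBRegressor(Dist=Normal, n_estimators=500).fit(X, y)
y_dist = ngb.pred_dist(X[:5])    # conditional distribution P(y | x, theta)
y_point = y_dist.loc             # point estimate: predicted mean
y_std = y_dist.scale             # uncertainty: predicted standard deviation
```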

Gaussian Process Regression
Gaussian Process Regression (GPR) [7][40] is a non-parametric Bayesian approach suitable for performing both function approximation and uncertainty estimation. Accordingly, GPR represents a collection of random variables as a multivariate Gaussian distribution over a set of data points.
For a Gaussian Process $\mathcal{N}(f|\mu, K)$, $f$ denotes a vector of function values estimated at $N$ data points, $[f(x_1), \dots, f(x_N)]$, $\mu$ is the mean of the Gaussian Process, which by default is assigned to be zero, and $K$ is a positive definite covariance matrix. The smoothness of the distribution across functions is determined by the covariance kernel $K_{i,j} = k(x_i, x_j)$, which defines the covariance between the function values $f(x_i)$ and $f(x_j)$.
Several kernel functions are used in practice, parameterized by a set of hyperparameters. For instance, the Radial Basis Function (RBF) is among the most common kernel functions, and it is defined as:

$k(x_i, x_j) = \sigma_f^2 \exp\!\left(-\frac{\|x_i - x_j\|^2}{2\ell^2}\right), \quad (3)$

where $\sigma_f^2$ is a hyperparameter that controls the vertical span of the function and $\ell$ is a length-scale hyperparameter that determines the rate at which the correlation between two data points changes as the distance between them increases.
For a new data point $x_*$, the predictive distribution is Gaussian,

$p(y_*|x_*, D) = \mathcal{N}(\mu_*, \sigma_*^2), \quad (4)$

where the mean $\mu_* = K_*^\top (K + \sigma_n^2 I)^{-1} y$ is employed as the point estimate of $y_*$, and the standard deviation $\sigma_*$ of the predicted distribution is used to quantify the uncertainty in the predicted value. Here, $K_*$ is the covariance matrix between $x_*$ and the data points in the training dataset $D$, and $\sigma_n^2$ is the variance of independent and identically distributed (i.i.d.) Gaussian noise representing the uncertainty in the training data. The covariance of the predicted distribution is given by $\sigma_*^2 = k(x_*, x_*) - K_*^\top (K + \sigma_n^2 I)^{-1} K_*$. GPR is among the most powerful and flexible methods for uncertainty quantification in regression tasks. By using different kernel functions and hyperparameters, GPR allows introducing domain knowledge and adapting the predictive distribution to the specific patterns and trends in a dataset.
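The sketch below shows GPR point and uncertainty estimates with scikit-learn (the library used in this paper), combining the RBF kernel of equation (3) with a white-noise term for $\sigma_n^2$; the kernel hyperparameters and data are illustrative, not the kernels adopted from Mamun et al. [9].

```python
# GPR sketch: predictive mean mu_* and standard deviation sigma_* per equation (4).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 5))                              # placeholder features
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)     # placeholder target

# sigma_f^2 * RBF(l) models the smooth signal; WhiteKernel adds the noise sigma_n^2
kernel = ConstantKernel(1.0) * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

mu_star, sigma_star = gpr.predict(X[:5], return_std=True)
lower = mu_star - 1.96 * sigma_star     # 95% predictive interval
upper = mu_star + 1.96 * sigma_star
```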

Neural Networks with Deterministic Parameters
Neural Networks have been increasingly employed for uncertainty quantification of predicted values. Approaches such as Deep Ensembles and MC Dropout employ standard NNs with deterministic values of the parameters (weights and biases) to obtain probabilistic outputs, allowing for uncertainty quantification. Other works use Bayesian NNs with stochastic parameters for uncertainty quantification.

Deep Ensemble
Deep Ensembles (DE) [13] is a conceptually simple approach for generating probabilistic outputs with a collection of standard single-point prediction NNs. DE involves first training multiple NNs for a regression task and afterward aggregating their outputs to estimate the prediction uncertainties. The inherent randomness in the initialization of NN parameters and the associated training process drives the NNs to converge to different solutions in the hypothesis space. As a result, the DE approach produces samples of different network parameters that yield stochastic outputs.
Let's assume an ensemble of $M$ NNs trained on the dataset $(x_i, y_i) \in D$ and parameterized with parameters $\theta_1, \theta_2, \dots, \theta_M$. For a new data point $x_*$, the DE predictions are treated as a Gaussian distribution, where the predicted mean and standard deviation obtained by averaging the predictions of the ensemble are used as the target value and uncertainty estimates:

$\mu_* = \frac{1}{M} \sum_{m=1}^{M} f_{\theta_m}(x_*), \quad (5)$

$\sigma_*^2 = \frac{1}{M} \sum_{m=1}^{M} \big(f_{\theta_m}(x_*) - \mu_*\big)^2. \quad (6)$

In general, the individual NNs in the DE can have different architectures, although in most prior works, a NN with the same architecture is trained multiple times. Several related works have also applied bagging (i.e., bootstrap aggregation), where the networks in the DE are trained on random subsets of the training data, to add an additional source of randomness to the training process [41]. The DE approach offers ease of implementation and is suitable for parallelization and scaling across NNs and datasets. In comparison to other uncertainty quantification methods, a disadvantage is the increased training time and requirement for computational resources, since DE requires training a group of NNs.
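A compact PyTorch sketch of the DE aggregation in equations (5) and (6) follows; the five-network ensemble and layer sizes mirror the Implementation Details section, while the training loop and data are simplified placeholders.

```python
# Deep Ensemble sketch: train M independently initialized networks and
# aggregate their predictions into a mean (5) and standard deviation (6).
import torch
import torch.nn as nn

def make_net(d_in):
    return nn.Sequential(nn.Linear(d_in, 10), nn.ReLU(),
                         nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))

X = torch.rand(200, 5)
y = X.sum(dim=1, keepdim=True)                  # placeholder regression target

ensemble = [make_net(5) for _ in range(5)]
for net in ensemble:                            # each net converges to a different solution
    opt = torch.optim.Adam(net.parameters(), lr=0.01)
    for _ in range(200):
        opt.zero_grad()
        nn.functional.mse_loss(net(X), y).backward()
        opt.step()

with torch.no_grad():
    preds = torch.stack([net(X[:5]) for net in ensemble])   # shape (M, n, 1)
mu_star = preds.mean(dim=0)      # point estimate, equation (5)
sigma_star = preds.std(dim=0)    # uncertainty estimate, equation (6)
```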

Monte Carlo Dropout
In standard NNs, the dropout technique is commonly applied during training to reduce model complexity and prevent overfitting. It is implemented with a dropout layer that multiplies the output of each neuron by a binary mask sampled from a Bernoulli distribution, randomly turning off (i.e., dropping out) a portion of the neurons. Dropout is not applied during the inference step with standard NNs. Monte Carlo (MC) Dropout [14] is a simple extension of the standard dropout technique, which applies dropout during inference. Consequently, the output from MC Dropout forms a distribution of the predictions by a trained model.
For a new data point $x_*$, MC Dropout performs $T$ forward passes through the trained network with dropout enabled to obtain Monte Carlo samples, resulting in $T$ different predictions $f_{\theta_t}(x_*)$. Similarly to equations (5) and (6) in the DE approach, the uncertainty estimate is computed from the resulting distribution of predicted target values. The advantages of MC Dropout are that it can be used with any network architecture where a dropout layer is applicable, and it provides a computationally inexpensive way to obtain uncertainty estimates with NNs.
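The following PyTorch sketch shows the key implementation detail, keeping only the dropout layers in training mode at inference time; the architecture echoes the one in the Implementation Details section, and the network is assumed to be already trained.

```python
# MC Dropout sketch: T stochastic forward passes with dropout active at test time.
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(5, 100), nn.ReLU(), nn.Dropout(0.5),
                    nn.Linear(100, 100), nn.ReLU(), nn.Dropout(0.5),
                    nn.Linear(100, 1))
# ... assume net has been trained with the standard MSE loss ...

net.eval()
for m in net.modules():          # re-enable only the dropout layers for sampling
    if isinstance(m, nn.Dropout):
        m.train()

x_star = torch.rand(1, 5)        # placeholder input
with torch.no_grad():
    samples = torch.stack([net(x_star) for _ in range(1000)])
mu_star = samples.mean(dim=0)    # point estimate, as in equation (5)
sigma_star = samples.std(dim=0)  # uncertainty estimate, as in equation (6)
```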

Neural Networks with Probabilistic Parameters
Differently from standard NNs with deterministic parameters, Bayesian Neural Networks (BNNs) represent the network parameters with probability distributions instead of fixed values. BNNs are probabilistic models that allow incorporating prior knowledge into the model and updating the parameters based on observed data. For a BNN model parameterized with parameters $\theta$ that form probability distributions, inference for a new data point $x_*$ is made by using the posterior predictive distribution $p(y_*|x_*, D) = \int p(y_*|x_*, \theta)\, p(\theta|D)\, d\theta$. Direct calculation of the posterior distribution of the parameters given observed data, $p(\theta|D)$, is intractable (and hence, the same applies to the predictive distribution $p(y_*|x_*, D)$). Various approximations of $p(\theta|D)$ have been used in practice, among which the most popular methods are Variational Inference (VI) and Markov Chain Monte Carlo (MCMC).

Variational Inference BNN
Variational Inference (VI) BNNs [17] employ an optimization technique to approximate the intractable posterior distribution $p(\theta|D)$ with a simpler parameterized distribution $q_\phi(\theta)$ (referred to as the variational distribution) from a family of distributions $\mathcal{Q}$. The VI optimization is as follows:

$q_\phi^*(\theta) = \arg\min_{q_\phi \in \mathcal{Q}} \mathrm{KL}\big[q_\phi(\theta)\,\|\,p(\theta|D)\big], \quad (7)$

where the goal is to calculate the parameters of the variational distribution $q_\phi(\theta)$ that approximates the posterior distribution $p(\theta|D)$. The Kullback-Leibler (KL) divergence is used as a measure of closeness between the two distributions. Directly calculating the KL divergence is also challenging, since it involves calculating the evidence $p(D)$. To address this issue, an alternative approach has been developed, which utilizes the following Evidence Lower Bound (ELBO):

$\mathrm{ELBO} = \mathbb{E}_{q_\phi(\theta)}\big[\log p(D|\theta)\big] - \mathrm{KL}\big[q_\phi(\theta)\,\|\,p(\theta)\big], \quad (8)$

$\log p(D) = \mathrm{ELBO} + \mathrm{KL}\big[q_\phi(\theta)\,\|\,p(\theta|D)\big]. \quad (9)$

Since the KL divergence term $\mathrm{KL}[q_\phi(\theta)\|p(\theta|D)]$ in (9) is always non-negative, the log evidence $\log p(D)$ is always at least as large as the ELBO. Therefore, using a loss function that maximizes the ELBO in (8) minimizes the KL divergence between the variational distribution $q_\phi(\theta)$ and the posterior distribution $p(\theta|D)$.
The predictive uncertainty for a new data point $x_*$ is estimated by sampling from the variational distribution $q_\phi(\theta)$. Using equations (5) and (6), the point estimate and uncertainty are calculated as the mean and standard deviation of the drawn samples from $q_\phi(\theta)$.
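A minimal sketch with the torchbnn library [61] (the library used for BNN-VI in this paper) follows. It mirrors the reported setup (prior sigma 0.06, KL weight 0.01, SGD with Nesterov momentum), but the layer arguments, training loop, and data are illustrative rather than the paper's exact code.

```python
# VI BNN sketch: the loss is the negative-ELBO surrogate of equation (8),
# an MSE likelihood term plus a weighted KL term between q(theta) and the prior.
import torch
import torch.nn as nn
import torchbnn as bnn

model = nn.Sequential(
    bnn.BayesLinear(prior_mu=0.0, prior_sigma=0.06, in_features=5, out_features=100),
    nn.ReLU(),
    bnn.BayesLinear(prior_mu=0.0, prior_sigma=0.06, in_features=100, out_features=100),
    nn.ReLU(),
    bnn.BayesLinear(prior_mu=0.0, prior_sigma=0.06, in_features=100, out_features=1),
)
mse = nn.MSELoss()
kl = bnn.BKLLoss(reduction="mean")       # KL divergence of q(theta) from the prior
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.95, nesterov=True)

X = torch.rand(200, 5)
y = X.sum(dim=1, keepdim=True)           # placeholder regression target
for _ in range(500):
    opt.zero_grad()
    loss = mse(model(X), y) + 0.01 * kl(model)
    loss.backward()
    opt.step()

with torch.no_grad():                    # each forward pass samples fresh weights
    samples = torch.stack([model(X[:5]) for _ in range(1000)])
mu_star, sigma_star = samples.mean(dim=0), samples.std(dim=0)
```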

Markov Chain Monte Carlo (MCMC) BNN
MCMC BNNs [42] approximate the posterior distribution of NN parameters $\theta$ given observational data, $p(\theta|D)$, through Monte Carlo sampling. The approach employs a Markov Chain of model parameters, where each set of parameters $\theta_t$ is a sample from the posterior distribution. To approximate the posterior distribution, the chain iteratively explores the space of possible parameters $\theta$. This exploration is guided by comparing the posterior probabilities, and as the Markov chain evolves, it effectively samples across the entire distribution space, allowing convergence to the target posterior distribution. After reaching a stationary distribution, for a new input data point $x_*$, the set of $T$ generated samples $\{\theta_1, \theta_2, \dots, \theta_T\}$ from $p(\theta|D)$ is used to generate $T$ predictions $f_{\theta_t}(x_*)$. Point estimates and uncertainty estimates are calculated by averaging the predictions, as in equations (5) and (6).
Several MCMC sampling methods are used for approximating the posterior distribution with BNNs for regression tasks. The Metropolis-Hastings algorithm [43] is often employed since it does not require exact knowledge of the probability distribution $p(\theta)$ to sample from; a function that is proportional to the distribution is sufficient. The Hamiltonian Monte Carlo (HMC) algorithm [44] is a version of Metropolis-Hastings that introduces a momentum term for proposing new states, similar to simulating a physical system with Hamiltonian dynamics. Likewise, the No-U-Turn Sampling (NUTS) algorithm [45] is a variant of HMC that offers an approach for automatic selection of the hyperparameters.
The MCMC method for BNNs is computationally expensive, because it requires generating a large number of samples to obtain independent samples from the distribution, and it also requires a large number of iterations for convergence. On the other hand, MCMC BNNs are considered among the most effective methods for sampling from the posterior distribution and offer improved results in capturing uncertainty in comparison to VI BNNs, Deep Ensembles, or MC Dropout.
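The sketch below shows a small MCMC BNN in Pyro [60] with the NUTS sampler, in the spirit of the BNN-MCMC configuration in the Implementation Details section; the single-hidden-layer model body, priors, and data are simplified illustrations, not the paper's exact network.

```python
# MCMC BNN sketch: NUTS draws posterior samples of the weights; the posterior
# predictive gives point and uncertainty estimates per equations (5)-(6).
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS, Predictive

def model(X, y=None):
    d_in, h = X.shape[1], 10
    # N(0, 1) priors over all weights and biases, as in the paper's setup
    w1 = pyro.sample("w1", dist.Normal(torch.zeros(d_in, h), 1.0).to_event(2))
    b1 = pyro.sample("b1", dist.Normal(torch.zeros(h), 1.0).to_event(1))
    w2 = pyro.sample("w2", dist.Normal(torch.zeros(h, 1), 1.0).to_event(2))
    b2 = pyro.sample("b2", dist.Normal(torch.zeros(1), 1.0).to_event(1))
    sigma = pyro.sample("sigma", dist.HalfNormal(1.0))   # observation noise
    mean = (torch.relu(X @ w1 + b1) @ w2 + b2).squeeze(-1)
    with pyro.plate("data", X.shape[0]):
        pyro.sample("obs", dist.Normal(mean, sigma), obs=y)

X = torch.rand(100, 5)
y = X.sum(dim=1)                                          # placeholder target
mcmc = MCMC(NUTS(model), num_samples=100, warmup_steps=100)
mcmc.run(X, y)

predictive = Predictive(model, posterior_samples=mcmc.get_samples())
preds = predictive(X[:5])["obs"]                          # (num_samples, 5)
mu_star, sigma_star = preds.mean(dim=0), preds.std(dim=0)
```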

Physics-Informed Machine Learning
Physics-Informed Machine Learning (PIML) is an approach designed to integrate insights from the fundamental physics laws governing a process into ML models, in order to enhance the consistency of the predictions [29-33][46][47]. This approach differs from the broadly used supervised ML methods: instead of relying solely on observational data points to model a process, PIML leverages existing physics knowledge of a specific task to guide the model solutions. Specifically, prior knowledge about a task is employed to restrict the hypothesis space of ML models through architectural and other choices, introducing inductive biases. PIML models incorporate additional learning biases based on the fundamental physical rules guiding the process, commonly implemented as mathematical constraints in the learning algorithm. Such constraints are either encoded into the loss function of a PIML model (referred to as a physics-informed loss function), can be enforced by designing custom physics-informed layers in the network architecture, or can be applied via physics-informed feature selection or feature engineering from the original high-dimensional input data. Similarly, prior physics knowledge can be introduced into PIML models in the form of differential or integral equations. These flexibilities in designing PIML models, provided by combining different types of mathematical formulations, prior task knowledge, and physical constraints, are beneficial in guiding the models toward solutions that are more accurate and adhere to the governing physics laws of the process.
PIML can potentially address challenges associated with modeling material properties, as it leverages the demonstrated capability of ML methods, especially deep neural networks, to capture intricate relationships within high-dimensional multi-scale and multi-physics data. Namely, accurately representing the dynamics and deformation mechanisms of materials with physics-based models (e.g., via partial differential equations) poses insurmountable difficulties, since it is exceptionally challenging to mathematically define all the different underlying processes that change over time. Indeed, existing physics-based models capture only the most important factors that influence material properties, and are missing fine details of the underlying physics. PIML can also address the challenges posed by the limited availability of large curated datasets in materials science. Therefore, the fusion of historical experimentally collected material property data and physics-based models within a PIML framework holds the potential to enhance long-term predictions of material behavior under different conditions and to exceed the capabilities of traditional physics-based models.
In our proposed approach, we developed a PIML framework for predicting creep rupture life in metal alloys by introducing physics-informed feature engineering to augment the set of input features to the regression models and by applying a physics-informed loss function that introduces physics constraints into the learning algorithm.

Physics-Informed Feature Engineering
We introduce two categories of physics-informed features based on estimations of the creep rupture life and stacking fault energy by using physics-based models.

Creep Rupture Life Estimation
Creep is a slow, irreversible deformation process that occurs under the influence of stresses below the yield stress of a material. The prediction of creep rupture life, i.e., the time duration that a material can sustain before undergoing rupture, is essential for guiding design strategies, maintenance regimens, and safety protocols for systems and structures. Existing physics-based creep models are broadly classified into two major categories: time-temperature parametric (TTP) models and creep constitutive (CC) models.

TTP models relate the creep rupture life $t_r$ and the absolute temperature $T$ to a stress-dependent parameter, such as the Larson-Miller parameter $P_{LM}(\sigma) = T\,(C_{LM} + \log t_r)$, the Orr-Sherby-Dorn parameter $P_{OSD}(\sigma) = \log t_r - Q_C/(2.303\,R\,T)$, or the Manson-Haferd parameter $P_{MH}(\sigma) = (\log t_r - \log t_a)/(T - T_a)$. In these formulations, $C_{LM}$ is a constant, $Q_C$ is the creep activation energy, $R$ denotes the universal gas constant, and $t_a$, $T_a$ are constants that represent the point of intersection of the iso-stress lines in $\log t_r$ versus $T$ plots. The stress function $f(\sigma)$ is typically represented as a cubic polynomial logarithmic function $f(\sigma) = b_0 + b_1 \log\sigma + b_2 \log^2\sigma + b_3 \log^3\sigma$, where the coefficients $b_0, b_1, b_2, b_3$ are obtained via a least-squares regression fit to short-term experimental data. For a given material, based on establishing the relationship between the stress function $f(\sigma)$ and the time-temperature parameter $P(t_r, T)$ from available short-term creep measurements, TTP methods extrapolate the plots to longer times to estimate the creep rupture life $t_r$.
Creep constitutive (CC) models are based on the Monkman-Grant (MG) conjecture [51], which postulates that the product of the creep rupture life $t_r$ and the exponentiated minimum creep strain rate $\dot{\varepsilon}_{\min}$ for isothermal conditions is constant. That is, $t_r \cdot \dot{\varepsilon}_{\min}^{\,m} = C_{MG}$, where $m$ is the creep strain rate exponent and $C_{MG}$ is a constant. Researchers have proposed several CC models to enhance creep life predictions.

For instance, an alternative formulation of the creep strain rate is

$\dot{\varepsilon}_{\min} = A\,\sigma^{n}\, e^{-Q_{\mathrm{in}}/RT} + B\,\sinh(\alpha\sigma)\, e^{-Q_{\mathrm{pw}}/RT},$

where the first term describes the power-law creep mechanism in the high-stress range, and the hyperbolic-sine term describes the viscous creep under the diffusion mechanism in the moderate- and low-stress ranges. The coefficients $A$ and $B$ and the activation energies $Q_{\mathrm{in}}$ and $Q_{\mathrm{pw}}$ are obtained by fitting to experimental data of the creep strain rate $\dot{\varepsilon}$ for given temperature $T$ and stress $\sigma$. With the progress in material science, we can expect the development of novel models that more accurately predict creep behavior.
In this work, we employed the Manson-Haferd method for estimating the creep rupture life. The motivation is that the Manson-Haferd method was used for modeling the creep rupture life of SS316 alloys in the NIMS database, and the values of the coefficients $b_0, b_1, b_2, b_3$ for the least-squares regression fit are provided in the database. Accordingly, we also used the Manson-Haferd method for estimating the creep rupture life for the Nickel-based superalloys and Titanium alloys datasets.
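To make the feature computation concrete, the sketch below derives a Manson-Haferd rupture-life estimate from the relations above by setting $P_{MH}(\sigma) = f(\sigma)$ and solving for $\log t_r$; the coefficient values are hypothetical placeholders, not the NIMS-fitted values.

```python
# Manson-Haferd sketch: log10 t_r = log10 t_a + f(sigma) * (T - T_a),
# with f(sigma) the cubic polynomial in log10(stress).
import numpy as np

def manson_haferd_log_life(stress_mpa, temp_k, b, log_ta, Ta):
    log_s = np.log10(stress_mpa)
    f_sigma = b[0] + b[1] * log_s + b[2] * log_s**2 + b[3] * log_s**3
    return log_ta + f_sigma * (temp_k - Ta)      # estimated log10 rupture life

# hypothetical coefficients for illustration only
log_tr = manson_haferd_log_life(stress_mpa=120.0, temp_k=923.0,
                                b=(-0.05, 0.02, -0.01, 0.001),
                                log_ta=12.0, Ta=540.0)
```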

Stacking Fault Energy
Stacking fault energy (SFE) represents the energy difference between atoms within the regular lattice structure and those located in the stacking fault region. It defines how resistant a material is to deformation occurring along specific crystallographic planes. SFE is an important parameter that impacts the strength and deformation behavior of steel materials. To calculate the SFE of stainless steels, $\gamma_{SFE}$, we used the following equation [53]:

$\gamma_{SFE} = \gamma_0 + 1.59\,\mathrm{Ni} - 1.34\,\mathrm{Mn} + 0.06\,\mathrm{Mn}^2 - 1.75\,\mathrm{Cr} + 0.01\,\mathrm{Cr}^2 + 15.21\,\mathrm{Mo} - 5.59\,\mathrm{Si} - 60.69\,(\mathrm{C} + 1.2\mathrm{N})^{1/2} + 26.27\,(\mathrm{C} + 1.2\mathrm{N})(\mathrm{Cr} + \mathrm{Mn} + \mathrm{Mo})^{1/2},$

where SFE is approximated as a function of the chemical composition of the material, Ni, Mn, Cr, Mo, Si, C, and N denote the weight percentages of the alloying elements, and $\gamma_0$ is a constant equal to 39 mJ/m² representing the SFE of pure austenitic iron at room temperature.
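A small sketch of this feature computation follows; it encodes the reconstructed equation above, so it should be checked against [53] before use, and the function name and composition values are hypothetical.

```python
# SFE feature sketch: composition is a mapping from element symbol to
# weight percent; gamma0 = 39 mJ/m^2 for pure austenitic iron.
def stacking_fault_energy(comp, gamma0=39.0):
    Ni, Mn, Cr = comp.get("Ni", 0.0), comp.get("Mn", 0.0), comp.get("Cr", 0.0)
    Mo, Si = comp.get("Mo", 0.0), comp.get("Si", 0.0)
    C, N = comp.get("C", 0.0), comp.get("N", 0.0)
    return (gamma0 + 1.59 * Ni - 1.34 * Mn + 0.06 * Mn**2
            - 1.75 * Cr + 0.01 * Cr**2 + 15.21 * Mo - 5.59 * Si
            - 60.69 * (C + 1.2 * N) ** 0.5
            + 26.27 * (C + 1.2 * N) * (Cr + Mn + Mo) ** 0.5)

# illustrative composition, roughly in the SS316 range
sfe = stacking_fault_energy({"Ni": 12.0, "Mn": 1.6, "Cr": 17.0, "Mo": 2.2,
                             "Si": 0.5, "C": 0.05, "N": 0.04})
```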

Physics-Informed Loss Function
The PIML paradigm allows introducing initial conditions, boundary conditions, and other types of physics constraints into the loss function of a learning algorithm. Motivated by the work of Zhang et al. [31], we introduced two physics-informed boundary constraints regarding the predicted creep rupture life into the loss of the NNs.
The first introduced loss term, $\mathcal{L}_{PI\text{-}1} = \mathrm{ReLU}(-y_*)$, imposes that the creep rupture life $y_*$ predicted by the model is non-negative. The second introduced loss term, $\mathcal{L}_{PI\text{-}2} = \mathrm{ReLU}(y_* - E)$, enforces that the predicted creep rupture life is upper bounded by a constant $E$. For the constant $E$ we adopted the value of 100,000 hours, because it is the greatest creep rupture life value in the three datasets. In these loss terms, ReLU denotes the Rectified Linear Unit activation function, defined as $\mathrm{ReLU}(x) = \max(0, x)$. The two physics-informed terms regard negative and excessively large creep life values as physical violations that should be prevented from being output by the model. The two terms are added to the standard mean-squared error loss for regression tasks, $\mathcal{L}_{MSE}$, resulting in a composite physics-informed loss function: $\mathcal{L} = \mathcal{L}_{MSE} + \lambda_1 \mathcal{L}_{PI\text{-}1} + \lambda_2 \mathcal{L}_{PI\text{-}2}$, where $\lambda_1$ and $\lambda_2$ are weighting coefficients, empirically determined to quantify the contributions of the terms $\mathcal{L}_{PI\text{-}1}$ and $\mathcal{L}_{PI\text{-}2}$, respectively.
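A PyTorch sketch of this composite loss is given below. The weighting coefficients are placeholders (the paper determines them empirically), and note that if the target is the base-10 logarithm of the rupture life, as in the Implementation Details section, the upper bound should be transformed to the same scale (log10(100000) = 5).

```python
# Composite physics-informed loss: MSE plus two ReLU penalty terms that
# discourage negative and excessively large predicted rupture lives.
import torch
import torch.nn.functional as F

def physics_informed_loss(y_pred, y_true, lambda1=1.0, lambda2=1.0,
                          upper_bound=100_000.0):
    loss_mse = F.mse_loss(y_pred, y_true)
    loss_pi1 = F.relu(-y_pred).mean()                # penalizes y_* < 0
    loss_pi2 = F.relu(y_pred - upper_bound).mean()   # penalizes y_* > E
    return loss_mse + lambda1 * loss_pi1 + lambda2 * loss_pi2
```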

Application Case: Active Learning
Supervised ML requires that for each input data point there is an associated target data point. Generally, for a supervised ML model to perform well, it often needs to be trained with a large number of labeled data points. In reality, labeled data may be scarce and expensive to obtain, since the labeling or annotation process is time-consuming and costly, whereas unlabeled data can often be accessed easily. Additionally, among the labeled data points, some may carry similar information and thus contribute less value compared to data points that carry dissimilar information. The Active Learning (AL) method was developed to select the most informative data points that speed up the training process [19].
In pool-based AL, the training dataset $D$ comprises a pool of unlabeled data $D_U$ and labeled data $D_L$. An initial ML model $f$ is first trained with a small, randomly selected labeled dataset $L \subset D_L$. Next, a query strategy is applied to select the most informative data $x_q \in D_U$ by employing an acquisition function to measure and rank the informativeness of the data. An annotator is asked to label the data $x_q$ selected by the query strategy, which is added to the dataset $L$, and the model is re-trained with the updated dataset. These steps are iteratively repeated until the model $f$ converges [36].
Two main types of query strategies are: uncertainty/informativeness-based strategies that ensure the informativeness of the unlabeled data, and representative-based/diversity-based strategies that measure the similarity of the instances and deal with issues such as sampling bias and inclusion of outliers. In addition, a hybrid query strategy combining these two strategies can also be applied [36].
In this paper, we apply a hybrid query strategy that includes a Variance Reduction (VR) uncertainty-based method, and a k-means clustering diversity-based batch mode method to guide the selection of the newly added data [9].
Variance reduction (VR) has been proven to be an effective AL method for regression tasks [54], where the goal is to minimize the variance of the model. For a hypothesis $f^{(D)}$ learned on a dataset $D$ and a true hypothesis $f^*$, the expected error at a data point $x$ is $\mathbb{E}_D\big[(f^{(D)}(x) - f^*(x))^2\big]$, and the average estimate of the learned hypotheses is:

$\bar{f}(x) = \mathbb{E}_D\big[f^{(D)}(x)\big], \quad (10)$

where $x$ is a data point in $D$. The expectation of the generalization error over the entire dataset then decomposes into bias and variance terms:

$\mathbb{E}_x\Big[\mathbb{E}_D\big[(f^{(D)}(x) - f^*(x))^2\big]\Big] = \mathbb{E}_x\Big[\big(\bar{f}(x) - f^*(x)\big)^2 + \mathbb{E}_D\big[(f^{(D)}(x) - \bar{f}(x))^2\big]\Big]. \quad (11)$

Equation (11) indicates that, for a given model, if the bias of the model is fixed, minimizing the variance of the model results in minimal generalization error [55]. Therefore, VR selects and annotates the samples with the highest prediction variances, i.e., those for which the model is most uncertain. Adding these samples to the training dataset reduces the overall generalization error of the model [36].
Conventional AL queries a single data point at each iteration, which is inefficient and leads to iterative re-training with small changes to the training dataset. To avoid these issues, we implemented Batch Mode Active Learning (BMAL) [56]. BMAL selects and annotates multiple samples $B = \{x^{(1)}, x^{(2)}, \dots, x^{(k)}\} \subset D_U$ in each iteration and adds these samples to the training dataset to re-train the model. On the other hand, an uncertainty-based BMAL query strategy may not be ideal, since the most uncertain samples may be similar to each other. Therefore, a diversity-based query strategy is typically preferred in BMAL. Clustering methods are commonly used to group the unlabeled dataset by similarity, and the most dissimilar samples are added to the training dataset. In this paper, we used k-means clustering to group the unlabeled samples into $k$ clusters, so that the groups have the least feature correlation with each other, and in each cluster the sample with the highest variance is selected. Therefore, $k$ samples are selected and annotated for training in each iteration.
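One hybrid BMAL query step can be sketched as follows: k-means provides the diversity grouping, and the per-sample predictive standard deviation from any of the UQ models above provides the uncertainty ranking. The function name and the `predict_std` callback are placeholders for illustration.

```python
# Hybrid BMAL query sketch: cluster the unlabeled pool for diversity, then
# pick the most uncertain sample from each cluster (k samples per iteration).
import numpy as np
from sklearn.cluster import KMeans

def bmal_query(X_pool, predict_std, k=10, seed=0):
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(X_pool)
    stds = predict_std(X_pool)          # predictive std per pool sample (UQ model)
    picked = []
    for c in range(k):
        idx = np.where(labels == c)[0]
        picked.append(idx[np.argmax(stds[idx])])   # most uncertain in cluster c
    return np.array(picked)             # pool indices to annotate and add
```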

Datasets
We used three datasets to conduct experimental validation of the UQ frameworks for creep rupture life prediction.
The first creep dataset of Stainless Steel (SS) 316 alloys was obtained from the National Institute for Materials Science (NIMS) database [57]. The dataset contains 617 test samples with 20 features per sample. Specifically, the features provide information about the material composition related to the mass percent of the elements C, Si, Mn, P, S, Ni, Cr, Mo, Cu, Ti, Al, B, N, and Nb+Ta and the material group of the alloy (there are 20 material groups in total), testing conditions including the applied stress (MPa) and temperature (°C), test measurements related to the percentage of elongation and percentage of area reduction, and the recorded creep rupture life (hours).
The second creep dataset is for Nickel-based superalloys and was adopted from the work by Han et al. [58]. The dataset includes 153 test samples with 15 features per sample. The features include material composition related to the weight percentage of the elements Ni, Al, Co, Cr, Mo, Re, Ru, Ta, W, Ti, Nb, and T, testing conditions including applied stress (MPa) and temperature (°C), and the recorded creep rupture life (hours).
The third creep dataset pertains to Titanium alloys and was adopted from Swetlana et al. [59]. It consists of 177 test samples with 24 features per sample. The features provide information about the weight percentage of the elements Ti, Al, V, Fe, C, H, O, Sn, Nb, Mo, Zr, Si, B, and Cr, testing conditions including applied stress (MPa) and temperature (°C), finishing conditions related to the solution-treated temperature (°C), solution-treated time (hours), annealing temperature (°C), and annealing time (hours), test measurements of the steady-state strain rate (1/seconds) and strain to rupture (%), and the recorded creep rupture life (hours).

Evaluation Metrics
The next sections describe the set of metrics used to evaluate the predictive accuracy of the models on unseen samples and the quality of the uncertainty quantification.

Predictive Accuracy Metrics
Pearson Correlation Coefficient (PCC) quantifies the magnitude and direction of the linear relationship between two continuous variables. For a target variable $y$ and the values $y_*$ predicted by a model, PCC is calculated as

$\mathrm{PCC} = \frac{\mathrm{cov}(y, y_*)}{\sigma_y\, \sigma_{y_*}},$

where $\mathrm{cov}(y, y_*)$ denotes the covariance and $\sigma_y$, $\sigma_{y_*}$ are the standard deviations of $y$ and $y_*$. PCC ranges from -1 to +1, where a larger PCC implies a higher degree of linear correlation between the target and predicted variables.
The coefficient of determination ($R^2$) quantifies the proportion of the variance in the target variable $y$ that is explained by the values $y_*$ predicted by the regression model. It is defined as

$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - y_{*,i})^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2},$

where the numerator represents the sum of squared residuals, and the denominator calculates the sum of squares between the target values and the mean of $y$, denoted $\bar{y}$. The coefficient of determination measures the goodness of fit of a regression model and ranges from 0 to 1. Higher $R^2$ values indicate that a larger proportion of the variance in the target variable samples is explained by the values predicted by the model.
Root-mean-square error (RMSE) measures the average difference between the target values and the values predicted by a regression model. It is calculated as the standard deviation of the residuals:

$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - y_{*,i})^2}.$

Mean absolute error (MAE) measures the average magnitude of the errors in a set of predicted values, obtained as

$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - y_{*,i}|.$

Smaller values of RMSE and MAE indicate better predictive accuracy of a regression model.
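For reference, the four accuracy metrics can be computed compactly with NumPy and scikit-learn, as in the sketch below; the function name is illustrative.

```python
# Predictive accuracy metrics: PCC, R^2, RMSE, and MAE.
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

def accuracy_metrics(y_true, y_pred):
    pcc = np.corrcoef(y_true, y_pred)[0, 1]               # Pearson correlation
    r2 = r2_score(y_true, y_pred)                         # coefficient of determination
    rmse = float(np.sqrt(mean_squared_error(y_true, y_pred)))
    mae = mean_absolute_error(y_true, y_pred)
    return {"PCC": pcc, "R2": r2, "RMSE": rmse, "MAE": mae}
```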

Uncertainty Quantification Metrics
Coverage is a metric that quantifies the proportion of target values $y$ that fall within the uncertainty interval predicted by a regression model. High coverage implies accurate uncertainty quantification by the model. Values of the coverage metric that are close to the nominal confidence interval of 95% are preferred for reliable uncertainty quantification. In some statistical works, the coverage metric is referred to as validity, since it assesses whether the predicted uncertainty intervals are valid.
Mean interval width calculates the average size of the predicted interval around the point estimates, i.e., the distance between the upper and lower uncertainty bounds. This metric assesses how tight the uncertainty bounds are across all predicted values. Smaller values of the interval width are preferred, as they indicate more precise uncertainty estimates by a regression model. In some statistical works, this metric is also referred to as sharpness.
A composite metric was introduced here in order to capture the trade-offs between the coverage and mean interval width metrics. Specifically, since the coverage and mean interval width metrics alone do not quantify well the accuracy of uncertainty estimation, we introduced the composite metric as: $0.75 \cdot \text{Coverage} + 0.25 \cdot \text{Mean Interval Width}$. We selected the values 0.75 and 0.25 for the weight coefficients of the coverage and mean interval width based on empirical evaluation. Our aim is for this metric to combine the impact of the coverage and mean interval width on the uncertainty estimates, where greater values of the composite metric indicate more reliable uncertainty quantification.
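A sketch of the three uncertainty metrics follows; the composite weighting mirrors the text, and it assumes coverage and mean interval width have been scaled to comparable ranges before combining, which the paper does not specify.

```python
# Uncertainty quantification metrics: coverage, mean interval width, composite.
import numpy as np

def uq_metrics(y_true, lower, upper, w_cov=0.75, w_width=0.25):
    coverage = np.mean((y_true >= lower) & (y_true <= upper))  # fraction inside
    mean_width = np.mean(upper - lower)                        # sharpness
    composite = w_cov * coverage + w_width * mean_width
    return {"coverage": coverage,
            "mean_interval_width": mean_width,
            "composite": composite}
```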

Implementation Details
For the regression models, we used the creep rupture life as the target variable, and the remaining features in the datasets were used as inputs to the models. Preprocessing of the input features involved normalization of the values to the range between 0 and 1. Similarly to related works, we applied a base-10 logarithm transformation to the values of the creep rupture life. To evaluate the performance of the different models, we used 5-fold cross-validation.
For the implementation of QR, NGBoost, and GPR, we adopted the same hyperparameters as in the work by Mamun et al. [9]. Following this work, QR used the CatBoost model for calculating the quantiles, and we applied the same kernel functions for GPR as in Mamun et al. [9]. We used the scikit-learn, ngboost, and catboost libraries for model training and evaluation.
For implementing the BNN-VI method, the architecture consists of two fully-connected layers with 100 neurons in each layer. We applied the Rectified Linear Unit (ReLU) activation function to the output of each layer. For the prior distribution of the network parameters, we adopted a normal distribution with a mean of 0 and a standard deviation of 0.06. The loss function is based on equation (8), with the weight coefficient for the KL divergence term set to 0.01. We used a Stochastic Gradient Descent (SGD) optimizer with Nesterov momentum set to 0.95, and the learning rate was set to 0.001. After training the model, for inference we generated 1,000 samples from the variational distribution. The point estimates and uncertainty estimates are calculated as the mean and ±3 standard deviations of the drawn samples, according to (5) and (6).
For the BNN-MCMC approach, we selected an architecture with three fully-connected layers with 10 neurons in each layer. For the prior distribution of the network parameters, we adopted a normal distribution with a mean of 0 and a standard deviation of 1. For approximating the posterior distribution, we used the No-U-Turn Sampling (NUTS) algorithm for training the model. For inference, we drew 100 samples, and the point estimates and uncertainty estimates were calculated similarly to the BNN-VI method, by taking the arithmetic mean and ±3 standard deviations of the generated samples.
For the Deep Ensemble method, we used an ensemble of 5 base learners, each of which is a standard NN with three fully-connected layers having 10 neurons in each layer. Each hidden layer is followed by a dropout layer with a rate of 0.5 and a ReLU activation layer. The mean-squared error (MSE) loss function was used, and the Adaptive Moment Estimation (Adam) optimizer with a learning rate of 0.01 was selected for training the models. The final predictions were calculated by taking the arithmetic mean and ±3 standard deviations of the outputs of the base learners.
For the MC Dropout approach, we used an NN with three fully-connected layers having 100 neurons in each layer. Similarly to the Deep Ensemble, we used a dropout rate of 0.5, the ReLU activation function, the MSE loss, and the Adaptive Gradient Algorithm (Adagrad) optimizer with a learning rate of 0.01. For inference, we generated 1,000 predictions, which were used to calculate the point estimates and uncertainty estimates.
For comparison, we implemented standard NNs with deterministic parameters, consisting of three fully-connected layers with 1000, 200, and 40 neurons, respectively. We used the ReLU activation function after each hidden layer. The loss function was MSE, and the Root Mean Squared Propagation (RMSprop) optimizer was used with a learning rate of 0.01. BNN-MCMC, BNN-VI, Deep Ensemble, and MC Dropout were implemented using the PyTorch library. For building BNN-MCMC we used the Pyro library [60], and BNN-VI was implemented with the torchbnn library [61]. For the above models, we chose a batch size of 16 data points. The link to our codes is provided in the Abstract section.

Experimental Results
The experimental validation includes a comparative analysis of UQ on the three datasets using the eight methods described above. The results are presented in Section 4.4.1. For the top four performing approaches for UQ, we evaluated the proposed PIML framework, with the results presented in Section 4.4.2. In Section 4.4.3, we present the results from the AL study for the GPR, BNN-VI, and BNN-MCMC models.

Uncertainty Quantification
For the SS316 alloys dataset, Table 1 presents the average and the standard deviation (in the subscript) of the metrics for predictive accuracy and uncertainty estimation for the eight considered models based on five-fold cross-validation. High values of PCC indicate high correlations between the model predictions and the experimentally measured creep rupture life, and high values of $R^2$ point to a more consistent fit of the predicted values to the experimental creep rupture life. Similarly, low values of RMSE and MAE imply that the predicted creep rupture life more closely aligns with the experimentally measured creep rupture life. The results show that the NN model and the Bayesian NNs performed better than the traditional ML models.
The BNN-MCMC approach achieved the best performance for all point accuracy metrics, including PCC, $R^2$, RMSE, and MAE. In addition, BNN-MCMC produced the best results for the mean interval width and the composite metric, whereas BNN-VI had the highest coverage. As expected, GPR achieved comparable performance and was the second best-performing method.
Figure 1 shows the target creep life, predicted creep life, and uncertainty estimates produced by the eight methods on one fold of the test dataset. One can note that the uncertainty estimates by QR and NGBoost are overestimated. On the other hand, GPR provides accurate uncertainty estimates, and the deterministic NNs and BNNs have generally good performance for point predictions and uncertainty estimates. The corresponding results for the Nickel-based superalloys dataset are shown in Table 2. Since this dataset is of smaller size and has only 153 data points, there is a significant performance drop for most models. We can still note that BNN-MCMC, BNN-VI, and GPR have comparable predictions, with BNN-MCMC having a small advantage in point estimate predictions. The values of the uncertainty estimates indicate greater variability due to the challenges associated with smaller datasets. The results for the predicted creep rupture life and uncertainty estimates are shown in Figure 2 only for the three best-performing models: GPR, BNN-VI, and BNN-MCMC.

Figure 2. Experimental data points, predicted data points, and uncertainty estimates for Gaussian Process Regression (GPR), BNN-Variational Inference, and BNN-Markov Chain Monte Carlo (MCMC) for the Nickel-based superalloys dataset. The logarithm values for the creep rupture life are shown on a logarithmic axis. The green shaded area represents the 95% confidence interval for the predictions by the models.
Next, we evaluated the models on the Titanium alloys dataset, which, like the Nickel-based superalloys dataset, is much smaller than the SS316 dataset. The overall performance is consistent with the results in Tables 1 and 2, with the top performers being BNN-MCMC and GPR. The mean interval widths are wider for the top performers, indicating that the models are less confident in the uncertainty predictions. Figure 3 presents the predicted creep rupture life and uncertainty estimates for the three best-performing models: GPR, BNN-VI, and BNN-MCMC.

Physics-Informed Machine Learning
The experimental validation encompasses the NNs and the three best-performing models from the previous section: GPR, BNN-VI, and BNN-MCMC. We introduced physics-informed features and a physics-informed loss function for the NN methods. For the SS316 alloys dataset, the results are presented in Table 4. We can notice that the PIML framework improves the point estimates and uncertainty estimates for most regression models. Only for PI-BNN-MCMC did the performance slightly decrease in comparison to the BNN-MCMC model without physics knowledge.

Table 5 presents the results for the Nickel-based superalloys dataset, and Table 6 shows the results for the Titanium alloys dataset. The incorporation of physics knowledge led to significant improvements in the predictions of all models for these two datasets. The best-performing approach is PI-BNN-MCMC for both datasets. As expected, PIML imparts greater benefits to tasks with smaller datasets, and we can see higher gains for these two datasets compared to the physics-informed models with the larger SS316 alloys dataset.

Active Learning
For the AL case study, we used Batch Mode AL with a batch size $k$ of 10 for the SS316 alloys dataset and a batch size $k$ of 8 for the smaller Nickel-based superalloys and Titanium alloys datasets. Figure 4 presents the plots of the PCC and $R^2$ scores of the GPR, BNN-VI, and BNN-MCMC models. For the SS316 dataset with PI features, GPR achieves high PCC and $R^2$ scores with fewer data points. For the Nickel and Titanium alloys, BNN-MCMC achieves the best result, and it converges the fastest with a small number of data samples. Overall, BNN-MCMC performed the best in 2 of the 3 tests and GPR performed the best in 1 of the tests, whereas BNN-VI had the lowest performance overall.

Discussion
Our experimental results in Tables 1 to 3 indicate that the BNN models and GPR outperform the traditional ML models, and therefore they are more suitable for UQ in material property prediction. Overall, BNN-MCMC achieved the best performance on most metrics. The results in Tables 4 to 6 demonstrate that the PIML paradigm can improve the models' performance for the smaller Nickel-based superalloys and Titanium alloys datasets, whereas it obtained smaller improvements for the larger SS316 alloys dataset. For the application of UQ in AL, BNN-MCMC and GPR performed comparably well.
The experimental results demonstrate the capability of BNNs to provide accurate and reliable UQ. Bayesian NNs exhibit more consistent uncertainty estimates that align better with the observed deviations, reducing the likelihood of overconfidence or underconfidence. Additionally, BNNs allow for the separation of epistemic uncertainty and aleatoric uncertainty. This property enhances the data efficiency of BNNs, enabling effective learning from smaller datasets. Accordingly, the priors in BNNs can be regarded as soft constraints that act similarly to the regularization techniques in traditional NNs. Likewise, physics-informed BNNs leverage physical knowledge to constrain and guide the learning process, and such knowledge can also be used to generate additional features when dealing with insufficient data. AL actively selects the most informative samples to ensure the models learn faster with fewer samples.
One major limitation of BNNs is their high computational cost, since both BNN-MCMC and BNN-VI require sampling from the posterior distribution over the network parameters. Furthermore, MCMC requires a sufficient number of iterations to obtain accurate samples from the posterior distribution, which can take significantly longer than GPR and require increased computational resources. Also, hyperparameter tuning of BNNs can be challenging and requires a deeper understanding of the internal workings of the models.
It is also worth noting that GPR primarily focuses on modeling structural uncertainty reflecting the inherent variability within the model structure, as the function space is modeled as a distribution over functions, and the predictions are made by considering the possible functions that are consistent with the observed data.On the other hand, BNNs are aimed at quantifying parametric uncertainty that arises from the variability in the model parameters, rather than uncertainty in the functional form of the model.
A limitation of this work is that the collected data from creep rupture tests do not comprise data from repeated experimental measurements that capture the variability in the measured creep life values.Consequently, the used datasets do not provide ground truth values to allow for truthful evaluation and comparison of the used methods for uncertainty quantification.
In future work, we will focus on the development of physics-informed loss functions and physics-informed layers in BNNs, and implementation of AL approaches based on hybrid query strategies, such as Query by Committee.

Conclusion
This work provides a study of uncertainty quantification in multivariable regression for material property prediction with Bayesian Neural Networks. In our proposed approach, we developed a PIML framework for predicting creep rupture life in metal alloys by introducing physics-informed feature engineering to augment the set of input features to the regression models and by designing a physics-informed loss function that introduces physics constraints into the learning algorithm. The hypothesis we test is that the synergistic combination of historical experimental data from creep tests and prior knowledge and constraints from the governing physics laws into a PIML framework will yield improved predictions and uncertainty quantifications of creep deformation that satisfy physics constraints and are consistent with the underlying physics laws. Our findings demonstrate the potential of BNNs to advance the field of materials science and engineering by enabling more accurate and reliable predictions with quantified uncertainties. The experimental validation indicates that the most promising approach for material property prediction is BNN-MCMC, which achieved performance that is either competitive with or exceeds the performance of GPR as the state-of-the-art method for UQ in multivariable regression. The case study on applying uncertainty estimates in an active learning scenario confirms that BNNs are a promising approach for overcoming the challenges in modeling material properties related to sparse and noisy data.

Figure 1 .
Figure 1. Experimental data points, predicted data points, and uncertainty estimates for Quantile Regression (QR), Natural Gradient Boosting (NGBoost), Gaussian Process Regression (GPR), Deep Ensemble, MC Dropout, BNN-Variational Inference, and BNN-Markov Chain Monte Carlo (MCMC) for the SS316 alloys dataset. The logarithm values for the creep rupture life are shown on a logarithmic axis. The green shaded area represents the 95% confidence interval for the predictions by the models.

Figure 3 .
Figure 3. Experimental data points, predicted data points, and uncertainty estimates for Gaussian Process Regression (GPR), BNN-Variational Inference, and BNN-Markov Chain Monte Carlo (MCMC) for the Ti-based alloys dataset. The logarithm values for the creep rupture life are shown on a logarithmic axis. The green shaded area represents the 95% confidence interval for the predictions by the models.

Table 2 .
Experimental results for the Nickel-based superalloys dataset, including predictive accuracy metrics (PCC, R², RMSE, MAE) and uncertainty quantification metrics (coverage, mean interval width, and composite metric).