Predicting California bearing ratio of HARHA-treated expansive soils using Gaussian process regression

The California bearing ratio (CBR) is one of the basic subgrade strength characterization properties in road pavement design and is used to evaluate the bearing capacity of pavement subgrade materials. In this research, a new model based on the Gaussian process regression (GPR) computing technique was trained and developed to predict the CBR value of soil treated with hydrated lime-activated rice husk ash (HARHA). An experimental database containing 121 data points was used. The input parameters are HARHA (a hybrid geometrical binder), liquid limit, plastic limit, plasticity index, optimum moisture content, clay activity, and maximum dry density, while the output parameter of the model is CBR. The performance of the GPR model is assessed using statistical parameters, including the coefficient of determination (R2), mean absolute error (MAE), root mean square error (RMSE), relative root mean square error (RRMSE), and performance indicator (ρ). The GPR model yields higher accuracy than recently established artificial neural network (ANN) and gene expression programming (GEP) models in the literature. In the training phase, the GPR model achieved the best prediction performance (R2 = 0.9999, MAE = 0.0920, RMSE = 0.13907, RRMSE = 0.0078, and ρ = 0.00391), followed by the ANN model (R2 = 0.9998, MAE = 0.0962, RMSE = 4.98, RRMSE = 0.20, and ρ = 0.100) and the GEP model (R2 = 0.9972, MAE = 0.5, RMSE = 4.94, RRMSE = 0.202, and ρ = 0.101). Furthermore, the sensitivity analysis shows that HARHA was the key parameter affecting the CBR.

The mechanical index of geomaterials must be accurately predicted for robust pavement design [1]. The subgrade soil's strength is commonly measured by its California Bearing Ratio (CBR). CBR is a static strength and bearing capacity index that can be measured in the laboratory or in situ [2,3]. The CBR is an important input parameter for predicting the stiffness modulus of the subgrade soil, which is an essential pavement design index when cyclic loading is considered [4,5]. The CBR value is used to indirectly estimate the thickness of subgrade materials in large infrastructure projects. Consequently, precise and timely estimation of this parameter is extremely important to the design process and construction schedule.
The CBR test is a simple strength test that compares the bearing capacity of a material to that of well-graded crushed stone (a high-quality crushed stone material should have a CBR of 100%). It is intended for, but not limited to, evaluating the cohesiveness of materials with particle sizes of less than 19 mm (0.75 in). In accordance with current American Association of State Highway and Transportation Officials (AASHTO) 2003 requirements, the laboratory CBR test entails soil mass penetration using a circular 50 mm plunger applied at a rate of 1.25 mm/min [6] into a soil specimen compacted at the optimum moisture content. The CBR test is an indirect measure of soil strength based on the resistance to penetration by a standardized piston moving at a standardized rate over a specified distance. CBR values are frequently used for highway, airport, parking lot, and other pavement designs based on empirical local or agency-specific methods. Additionally, CBR has been empirically correlated with resilient modulus and a number of other engineering soil properties.
Several studies were conducted to assess the performance of various materials, including fly ash, coarse sand, river bed material, and stone dust, that could be used to improve soft subgrades in highway construction [7][8][9][10][11]. For example, the use of fly ash in soil stabilization decreased the liquid limit and plasticity index and increased the CBR [12]. Similarly, the interaction between soil and waste plastic strips increases the resistance to penetration of the plunger, resulting in higher CBR values [13].
Developing machine learning (ML) models for CBR prediction may be a viable option in this context [14], as obtaining representative CBR values for design purposes is difficult due to insufficient soil investigations and limited budgets. In addition, the laboratory CBR test is time-consuming and laborious. Artificial intelligence models can simulate highly nonlinear relationships between numerous input and output parameters, resulting in more precise predictions than simple and multiple regression analysis [15][16][17]. Several artificial intelligence techniques have been used in engineering [18][19][20][21][22][23][24] and many other disciplines [25][26][27][28], including CBR value prediction using artificial neural networks (ANN) [29] and gene and multi expression programming [30]. As a result, this field is still being researched and investigated.
[32][33][34][35][36][37][38][39][40][41]. A critical review of the existing literature, however, indicates that, despite the successful implementation of GPR in various domains, its application to predicting the CBR value has not been thoroughly investigated. The purpose of this paper is to develop a new model for predicting the CBR value of expansive soil treated with hydrated lime-activated rice husk ash using the GPR computing technique. The viability and acceptability of CBR prediction using the GPR computing method are also addressed in this paper. The dataset for this study includes seven input parameters for predicting the CBR value: hydrated lime-activated rice husk ash (HARHA), liquid limit (LL), plastic limit (PL), plasticity index (PI), optimum moisture content (OMC), clay activity (CA), and maximum dry density (MDD). To compare the accuracy of the current model with that of previously developed models, several performance indexes were used, including the coefficient of determination (R²), mean absolute error (MAE), root mean square error (RMSE), relative root mean square error (RRMSE), and performance indicator (ρ), as well as an objective function (OF) to determine whether the model is overfitted.
The rest of the paper is structured as follows. Section "Materials and methods" presents information about the dataset, Pearson's correlation analysis, a brief review of Gaussian process regression for estimating the CBR, and the performance measures. Section "Results and discussion" presents and discusses the results of the developed model, and Section "Limitations and future works" discusses the limitations and prospects for the future. The last section presents the conclusions of this study.

Materials and methods
Dataset. In this study, the dataset was obtained from Onyelowe et al. [29] and consists of 121 observations (see Appendix A, Table A1 in the supplementary information file). Researchers have used different percentages of the available data as the training set for different problems. For instance, Ahmad et al. [34] used 70% for training, and the remaining 30% was equally divided into testing and validation sets. In this study, the training dataset contains 85 (70%) observations, while the testing and validation datasets comprise 18 (15%) observations each. The CBR is a function of hydrated lime-activated rice husk ash (HARHA), liquid limit (LL), plastic limit (PL), plasticity index (PI), optimum moisture content (OMC), clay activity (CA), and maximum dry density (MDD) [29]. HARHA, a hybrid geometrical binder, was made by mixing rice husk ash with 5% hydrated lime and leaving it for 24 h to activate. Hydrated lime acts as the alkaline activator, and rice husk, an agro-industrial waste, comes from rice mills. Direct combustion of rice husk produces rice husk ash (RHA) [42]. Therefore, these input parameters were utilized in this study to develop the desired model. The subsets were chosen so that the maximum (Max), minimum (Min), mean, standard deviation (SD), and coefficient of variation (COV) of each parameter were consistent across the training, testing, and validation data sets (Table 1). Figure 1 illustrates the cumulative percentage and frequency distributions for all input and output parameters utilized in the CBR modeling from the aforementioned database. The cumulative percentage distribution can be used to determine what proportion of the data falls below or equals a given value. For example, if the cumulative percentage at the LL interval (50.4-58.2%) is 60%, then 60% of the data points have an LL less than or equal to the upper bound of that interval. The frequency distribution explains how data are spread across several categories or intervals. It aids in the identification of the most common or frequent values, as well as any patterns or trends. For example, if the frequency of a specific category, such as OMC (17.8-18.4%), is higher than that of the others, the data are concentrated in that particular region. Furthermore, readers can refer to Onyelowe et al. [29] for additional information on carrying out the tests.
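The 70/15/15 partition described above can be sketched as follows. This is a minimal illustration that assumes a plain random shuffle with a fixed seed; the source only states that the subsets were chosen to have consistent summary statistics, not how observations were assigned. The function name `split_indices` and the seed are hypothetical.

```python
import numpy as np

def split_indices(n_samples, n_train, n_test, seed=0):
    """Randomly partition sample indices into train / test / validation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)       # shuffled indices 0..n_samples-1
    train = order[:n_train]
    test = order[n_train:n_train + n_test]
    valid = order[n_train + n_test:]         # whatever remains
    return train, test, valid

# Sizes reported in the text: 121 observations -> 85 / 18 / 18.
train_idx, test_idx, valid_idx = split_indices(121, 85, 18)
```

In practice, one would compare the Max, Min, mean, SD, and COV of each subset after splitting and reshuffle if they differ markedly.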

Pearson's correlation analysis.
To determine the relationship between each pair of variables, the Pearson correlation coefficient (ξ) [43] was utilized. Table 2 details the relationships among all the variables based on ξ. A value of |ξ| > 0.8 indicates a strong association between a pair of variables, values from 0.3 to 0.8 indicate a medium relationship, and |ξ| < 0.30 indicates a weak relationship [44]. The absolute value of the correlation coefficient (|ξ|) was used to rate the association between each pair of variables based on the distribution of the data. The parameters were determined to have a generally acceptable degree of correlation. It is evident from Table 2 that the PI is strongly correlated with CBR (|ξ| = 0.99514), whereas the OMC is weakly correlated with CBR (|ξ| = 0.09768); the same is reported by Onyelowe et al. [29]. Variables that show a considerable amount of deviation have the potential to affect prediction models [45].
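As an illustration of the correlation screening described above, the following sketch computes the Pearson coefficient for one pair of variables and rates its strength using the thresholds quoted in the text. The helper names `pearson` and `strength` are hypothetical, and the example values passed to `strength` are the |ξ| values reported for PI and OMC against CBR.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation coefficient between two 1-D samples."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float(np.sum(xc * yc) / np.sqrt(np.sum(xc**2) * np.sum(yc**2)))

def strength(xi):
    """Rate |xi| with the thresholds given in the text."""
    a = abs(xi)
    if a > 0.8:
        return "strong"
    if a >= 0.3:
        return "medium"
    return "weak"

pi_vs_cbr = strength(0.99514)    # PI vs CBR, as reported in Table 2
omc_vs_cbr = strength(0.09768)   # OMC vs CBR, as reported in Table 2
```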

Gaussian process regression (GPR).
According to Rasmussen [46], a GPR model associates each input a with a random variable f(a) that corresponds to the value of the stochastic function f at that location. In this study, it is assumed that the observational error is normal, independent, and identically distributed, with a mean of zero (µ(a) = 0) and a variance of σ², and that f(a) is drawn from the Gaussian process on a specified by k.
In the performance measures, e_i and m_i are the measured and predicted outputs of the i-th sample, respectively, and ē and m̄ are the average values of the measured and predicted outputs, respectively. The total number of data points is denoted by n, while the training and validation datasets are denoted by the subscripts T and V, respectively. A model is considered effective if its R² value is higher than 0.8 and close to 1 [31]. The RMSE criterion measures the root of the mean squared difference between predicted and actual outputs, whereas the MAE criterion measures the mean magnitude of the error. RRMSE is calculated by dividing RMSE by the mean value of the measured data.
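The performance measures named in this section can be sketched as below. The exact formulas are not reproduced in the text, so the forms used here are assumptions: R² is taken as the squared Pearson correlation between measured and predicted values, ρ as RRMSE / (1 + R) (which agrees with the reported pairs, for example RRMSE = 0.0155 and ρ = 0.00775 for the test set), and OF as a weighted combination of the training and validation ρ values that is common in the related literature. The function names are hypothetical.

```python
import numpy as np

def metrics(e, m):
    """Error metrics between measured (e) and predicted (m) outputs."""
    e = np.asarray(e, dtype=float)
    m = np.asarray(m, dtype=float)
    err = e - m
    mae = float(np.mean(np.abs(err)))
    rmse = float(np.sqrt(np.mean(err**2)))
    rrmse = rmse / abs(float(np.mean(e)))     # RMSE divided by measured mean
    r = float(np.corrcoef(e, m)[0, 1])        # coefficient of correlation R
    rho = rrmse / (1.0 + r)                   # assumed form of the indicator
    return {"R2": r**2, "MAE": mae, "RMSE": rmse, "RRMSE": rrmse, "rho": rho}

def objective_function(rho_T, rho_V, n_T, n_V):
    """Assumed OF: weights training and validation rho by subset size."""
    n = n_T + n_V
    return ((n_T - n_V) / n) * rho_T + 2.0 * (n_V / n) * rho_V

# Toy measured / predicted values, for illustration only.
result = metrics([10, 20, 30, 40], [11, 19, 31, 39])
```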

Results and discussion
To increase the accuracy and capability of the trained model, the data are divided into three parts with similar statistical characteristics, such as the mean value and coefficient of variation (COV). Model overfitting was controlled using the validation set. The Pearson VII universal kernel, known as the PUK kernel function, was selected after multiple trial-and-error iterations among different kernel functions. In the GPR model, the hyperparameters were fixed according to the best results obtained. Hyperparameters such as the noise, omega, and sigma values were iterated by trial and error until the desired results were achieved. The noise value was fixed at 0.3, while omega and sigma were fixed at 0.4 each. Figure 2 represents the flow chart of the proposed methodology in this study.
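To make the role of the three hyperparameters concrete, the following is a minimal NumPy sketch of GPR prediction with a PUK kernel. It is not the implementation used in the study, which does not name its software. Several choices are assumptions: the noise value is treated as a standard deviation (so its square is added to the kernel diagonal), the targets are centered before fitting, the inputs are used without normalization, and the data are synthetic stand-ins. The function names are hypothetical.

```python
import numpy as np

def puk_kernel(A, B, omega=0.4, sigma=0.4):
    """Pearson VII universal kernel (PUK) between the rows of A and B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    # Pairwise squared Euclidean distances, clipped to avoid tiny negatives.
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    dist = np.sqrt(np.maximum(sq, 0.0))
    scale = 2.0 * np.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + (scale * dist) ** 2) ** omega

def gpr_predict(A_train, b_train, A_new, noise=0.3, omega=0.4, sigma=0.4):
    """Posterior mean of a GP with PUK kernel and Gaussian noise."""
    b_train = np.asarray(b_train, dtype=float)
    b_mean = b_train.mean()                   # assumption: center the targets
    K = puk_kernel(A_train, A_train, omega, sigma)
    K_new = puk_kernel(A_new, A_train, omega, sigma)
    weights = np.linalg.solve(K + noise**2 * np.eye(len(b_train)), b_train - b_mean)
    return b_mean + K_new @ weights

# Synthetic stand-in data (NOT the study's dataset): 20 samples, 7 inputs.
rng = np.random.default_rng(0)
A_train = rng.random((20, 7))
b_train = A_train.sum(axis=1)
pred = gpr_predict(A_train, b_train, A_train[:3])
```

A trial-and-error search of the kind described above would loop over candidate noise, omega, and sigma values and keep the combination with the lowest validation error.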
To verify the effectiveness of learned models in the field of ML, the models need to be assessed. Different evaluation methodologies are used with various types of models. Once the machine-learning model for CBR prediction has been developed, its predictive performance is analyzed. This study verified the GPR model's CBR prediction by comparing the predicted and actual values. Figure 3 shows that there is a strong correlation between the predicted and actual values of the training set. Although some of the data points in the test and validation sets have high errors compared to the actual CBR value, e.g., sample 9 (see Fig. 3b) and samples 1, 2, and 3 (see Fig. 3c), respectively, the predicted values are accurate overall. The findings demonstrate how well the GPR model predicts the CBR.
Figure 4, a scatter diagram of the predicted and actual values of the training, test, and validation sets, illustrates the quality of the fit. A few points in the test and validation sets have large errors; for example, in the test set, a measured CBR value of about 8.5% was predicted to be as high as 10.6%. However, such deviations at individual data points have little effect on the overall performance of the GPR model. In addition, the CBR value is in the range of 8.2-44.5%, and the predicted and actual values of the training, test, and validation sets fit well. For the training set, R² = 0.9999, MAE = 0.0920, RMSE = 0.13907, RRMSE = 0.0078, and ρ = 0.00391. For the test set, R² = 0.9997, MAE = 0.2099, RMSE = 0.51819, RRMSE = 0.0155, and ρ = 0.00775. For the validation set, R² = 0.9996, MAE = 0.0719, RMSE = 0.1070, RRMSE = 0.0025, and ρ = 0.00125. Consequently, the training, test, and validation sets share common characteristics: their R² values are high, and their MAE, RMSE, RRMSE, and ρ values are low. This demonstrates that the GPR model accurately predicts the CBR value and that there is no overfitting.
In this study, the GPR model was compared to artificial neural network (ANN) and gene expression programming (GEP) models from the literature. Table 3 displays the performance indexes. The summary of statistical performance in the training, testing, and validation phases shows that the MAE, RMSE, RRMSE, ρ, and OF values of the GPR model are considerably lower, while the R² value is larger, for the CBR value. For example, in the validation stage, the GPR model achieved better prediction results (R² = 0.9996, MAE = 0.0719, RMSE = 0.1070, RRMSE = 0.0025, and ρ = 0.00125) than the ANN model (R² = 0.9994, MAE = 0.1649, RMSE = 1.19, RRMSE = 0.05, and ρ = 0.028) and the GEP model (R² = 0.9932, MAE = 0.5, RMSE = 5.49, RRMSE = 0.167, and ρ = 0.084) proposed in the literature. The results indicate that the proposed GPR model predicts the CBR value more reliably and is better suited to practical applications.
Sensitivity analysis is used to analyze the individual effect of each input factor on the CBR value. In the present study, the cosine amplitude method was used to perform the sensitivity analysis [59,60]. This method has been utilized in numerous studies [61,62]. To construct the data array X, data pairs are used, as follows:

$$X = \{x_1, x_2, x_3, \ldots, x_n\}$$

where x_i is a vector of length m, a variable in the X array, which may be expressed as:

$$x_i = \{x_{i1}, x_{i2}, x_{i3}, \ldots, x_{im}\}$$

The strength of the relation r_ij between the datasets x_i and x_j is expressed as follows:

$$r_{ij} = \frac{\sum_{m=1}^{n} x_{im}\, x_{om}}{\sqrt{\sum_{m=1}^{n} x_{im}^{2} \, \sum_{m=1}^{n} x_{om}^{2}}} \qquad (11)$$

where n is the number of values (in this case, 85), and x_im and x_om are the input and output variables, respectively. The strength of the relationship (r_ij) varies from zero to one for each input parameter. The higher the value of r_ij, the stronger the effect of that specific input variable on the CBR value. The r_ij scores for all input parameters are shown in Fig. 5. Figure 5 shows that HARHA (r_ij = 0.988) has the largest influence in predicting the CBR value, whereas PI (r_ij = 0.847) has the least influence.
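The cosine amplitude calculation described above can be sketched in a few lines. This is an illustrative implementation of the standard method, not the authors' code; the function name `cosine_amplitude` and the toy arrays are hypothetical.

```python
import numpy as np

def cosine_amplitude(X, y):
    """Strength of relation between each input column of X and the output y.

    X has shape (n_values, n_inputs); y has shape (n_values,).
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    numerator = X.T @ y                                    # sum of x_im * x_om
    denominator = np.sqrt(np.sum(X**2, axis=0) * np.sum(y**2))
    return numerator / denominator

# Toy data: the first input is proportional to the output, the second is
# orthogonal to it.
X_toy = np.array([[2.0, 2.0],
                  [4.0, -1.0]])
y_toy = np.array([1.0, 2.0])
r = cosine_amplitude(X_toy, y_toy)
```

For non-negative inputs and outputs, as is the case for the soil properties considered here, each score lies between zero and one.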

Limitations and future works
ML studies commonly involve several limitations and difficulties. One limitation of this study is the number of data samples used in the analysis, which is 121. The proposed model is expected to achieve the reported accuracy if the same input parameters are used in the future. However, if the same inputs are used but their values fall outside the range of the present dataset, the predictions may be in error. In the future, more experimental data should be collected to improve the generalization capability of the proposed model. The prediction of the CBR value using more sophisticated ML algorithms, such as deep learning, is left as a topic for future study.
The GPR model operates under the assumption that nearby observations should exchange information. Any finite number of the random variables in a Gaussian process has a joint multivariate Gaussian distribution. Let a × b represent the input and output domains, respectively, from which n pairs (a_i, b_i) are drawn independently and identically distributed. For regression, let b ⊆ ℜ; then, a Gaussian process on a is defined by the mean function µ : a → ℜ and a covariance function k : a × a → ℜ. The main assumption of GPR is that the output is given as b = f(a) + ζ, where ζ ∼ N(0, σ²).
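For reference, the standard consequences of these assumptions for a zero-mean Gaussian process can be written as below. This is the textbook result rather than a reproduction of the paper's own equations; the symbols K, I, a_*, and k_* are introduced here for illustration and do not appear in the source.

```latex
% Joint distribution of the n observed outputs under b = f(a) + \zeta:
\mathbf{b} \sim \mathcal{N}\!\left(\mathbf{0},\; K + \sigma^{2} I\right),
\qquad K_{ij} = k(a_i, a_j)

% Predictive mean and variance at a new input a_*,
% with (\mathbf{k}_*)_i = k(a_i, a_*):
\bar{f}(a_*) = \mathbf{k}_*^{\top}\left(K + \sigma^{2} I\right)^{-1}\mathbf{b},
\qquad
\operatorname{Var}\!\left[f(a_*)\right]
  = k(a_*, a_*) - \mathbf{k}_*^{\top}\left(K + \sigma^{2} I\right)^{-1}\mathbf{k}_*
```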

Figure 1. Distribution histograms for inputs (in blue) and outputs (in green).

Figure 3. The accuracy of the GPR model in predicting CBR value in (a) training, (b) testing, and (c) validation sets.

Table 1. Statistical parameters for data sets used for training, testing, and validation.

Table 2. Pearson's correlation matrix.

For good model performance, RMSE, RRMSE, and MAE should be relatively close to zero. These values cannot be 0 in practice, but the smaller they are, the more accurate the model. The performance indicator (ρ) is a function of RRMSE and the coefficient of correlation (R) [58]. The closeness of OF to zero indicates that the model is not overfitted.

Table 3. Comparison of statistical metrics for evaluating the performance of the GPR, ANN, and GEP models.

Figure 5. Sensitivity analysis of input variables.