Empirical models for compressive and tensile strength of basalt fiber reinforced concrete

Basalt fiber (BF) is obtained as a byproduct when molten magma solidifies. Because of residual pollutants that can affect the environment, it is often regarded as a waste product. To determine the compressive strength (CS) and tensile strength (TS) of basalt fiber reinforced concrete (BFRC), this study develops empirical models using gene expression programming (GEP), artificial neural networks (ANN), and extreme gradient boosting (XG Boost). A thorough search of the literature was carried out to compile data on the CS and TS of BFRC; 153 CS findings and 127 TS outcomes were included in the review. The water-to-cement ratio, BF content, fiber length (FL), and coarse aggregate content were identified as the influential characteristics. The outcomes showed that GEP can forecast the CS and TS of BFRC more accurately than ANN and XG Boost; the efficiency of GEP was validated by comparing the coefficient of determination (R²) of all three models. It was shown that the CS and TS of BFRC increase initially up to a certain limit and then start decreasing as the BF percentage and FL increase. The ideal BF content for industrial-scale BF reinforcement of concrete was investigated in this study, which could provide an economical solution for the production of BFRC on an industrial scale.


Overview of gene expression programming
GEP models are more functional and yield correct results by using optimized parameters. GEP is a more advanced and expanded form of genetic programming (GP), a type of machine learning that generates models which rely on genetic evaluation 31,32. GEP is a technique for the advancement of computer programs that relies on linear chromosomes consisting of genes structurally organized in a head and a tail. Through mutation, transposition, root transposition, gene transposition, gene recombination, and one- and two-point recombination, the chromosomes play the role of genomes that are subject to alteration. The targets to be selected are expression trees, which are encoded by the chromosomes [33][34][35]. The method can run with high efficiency, significantly better than current adaptive algorithms, thanks to the development of these different entities (genome and expression trees) with distinct purposes. Genetic programming's constant linear width makes GEP an efficient method 36. The simple primitives used in GEP also allow for the development of complicated and non-linear programs, owing to multi-genic behavior and the genetic processes occurring at the chromosomal level. The entire GEP comprises five sets 37: the function set, the terminal set, the fitness measure set, the parameters set, and the criteria set. In GEP, each specimen is set as a genome, which is a fixed-size linear string. Furthermore, during the reproduction stage, the genetic operators are used for the modification of chromosomes.
The process begins by initially selecting data randomly; the best combination of the population is then selected based on an error criterion, and outliers are separated. Next, the most suitable combination is produced by mutation and crossover 38. This process is also called "learning". After running several cycles, the most suitable model is created once the maximum number of iterations is reached. GEP models are less time-consuming and more efficient than the traditional experimental procedures previously used for predicting the strength of concrete composites.
The GEP modelling process can be summarized as follows.
1. Based on recorded data (population) number of chromosomes are produced randomly.
2. The chromosomes formed in the first step then generate mathematical equations.
3. Each chromosome is then checked for fitness against the target function. This is an iterative process; if the stopping criterion is not met, the best individuals of the current generation are selected using the roulette-wheel method.
4. The genetic operators of the GEP algorithm are applied to create modified individuals from the selected chromosomes.
5. New chromosomes are created over several iterations for a set number of generations, and the model producing the most efficient results is retained.
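The five steps above can be sketched as a minimal evolutionary loop. The chromosome encoding, operators, and fitness function below are illustrative stand-ins (a toy linear "expression" fitted to synthetic data), not the actual GEP implementation used in the study:

```python
import random

random.seed(0)

# Toy dataset: target is y = 2*x0 + 3*x1 (a stand-in for a strength response)
data = [((x0, x1), 2 * x0 + 3 * x1) for x0 in range(5) for x1 in range(5)]

def fitness(chrom):
    # Mean squared error of the linear "expression" encoded by the chromosome
    return sum((chrom[0] * x0 + chrom[1] * x1 - y) ** 2
               for (x0, x1), y in data) / len(data)

def roulette_select(pop):
    # Step 3: roulette-wheel selection; lower error -> larger slice of the wheel
    weights = [1.0 / (1e-9 + fitness(c)) for c in pop]
    return random.choices(pop, weights=weights, k=len(pop))

def mutate(chrom, rate=0.3):
    # Step 4: genetic operators (here, point mutation only)
    return [g + random.gauss(0, 0.5) if random.random() < rate else g for g in chrom]

def crossover(a, b):
    # Step 4: one-point recombination of two fixed-size linear chromosomes
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

# Step 1: random initial population of fixed-size linear chromosomes
pop = [[random.uniform(-5, 5) for _ in range(2)] for _ in range(30)]
initial_best = min(fitness(c) for c in pop)
best = min(pop, key=fitness)

# Step 5: iterate for a fixed number of generations, keeping the best model
for _ in range(60):
    selected = roulette_select(pop)
    pop = [mutate(crossover(random.choice(selected), random.choice(selected)))
           for _ in range(len(pop))]
    pop.append(best)              # elitism: the best-so-far survives
    best = min(pop, key=fitness)
```

With elitism, the best fitness is non-increasing across generations, mirroring the "most suitable model" retained at the end of the GEP run.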

Artificial neural networks (ANNs)
Artificial neural networks (ANNs) are a class of machine learning algorithms inspired by the structure and functioning of biological neural networks, such as the human brain 1. ANNs consist of interconnected nodes, or artificial neurons, organized into layers: an input layer, hidden layers (if any), and an output layer. These networks are used for a wide range of tasks, including pattern recognition, regression, and classification. An overview of ANNs and their potential applications in civil engineering follows. Basic structure of artificial neural networks: • Input Layer Receives the raw data or features for the task.
• Hidden Layers One or more layers of neurons that process and transform the input data.
• Output Layer Produces the final output, which could be a prediction, classification, or any other relevant result.
ANNs are trained using labeled data, where the correct outputs are known.
During training, the network adjusts its internal parameters (weights and biases) to minimize the error between predicted and actual outputs; popular optimization algorithms such as backpropagation are used for this purpose. In civil engineering, ANNs can analyze sensor data from bridges, buildings, or other infrastructure to detect signs of damage or wear and can predict the remaining useful life of structural components, helping with maintenance planning 2. They can model soil behavior to predict settlement, slope stability, and bearing capacity, and can analyze geological data to identify potential hazards such as landslides or earthquakes. They can predict traffic flow and congestion patterns, optimize traffic signal timings, and support predictive maintenance of road infrastructure based on traffic data. ANNs can model complex hydrological systems to predict river flow, rainfall, and flood events, assisting in flood risk assessment and early warning systems 3. They can help optimize construction schedules and resource allocation and can predict construction project delays and cost overruns from historical data. They can analyze material-testing data to predict material properties and behavior under various conditions, model the environmental impact of civil engineering projects and support mitigation strategies, assist in urban planning by predicting population growth, traffic patterns, and land-use changes, and monitor the quality of construction materials, such as concrete or asphalt, based on inspection data 4. In all these applications, ANNs excel at handling complex, nonlinear relationships in the data, which are often challenging to capture with traditional engineering models. However, it is essential to have enough high-quality data for training and to validate the ANN's performance for reliable results in civil engineering applications 5.
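As an illustration of the training loop described above (forward pass, backpropagation, weight and bias updates), the following sketch fits a single-hidden-layer network to synthetic regression data. The architecture, learning rate, and data are illustrative assumptions, not the network used in this study:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in data: 4 input features -> 1 strength-like target
X = rng.uniform(0, 1, size=(200, 4))
y = (X @ np.array([2.0, -1.0, 0.5, 1.5]) + 0.3)[:, None]

# One hidden layer (8 tanh units) and a linear output layer
W1 = rng.normal(0, 0.5, (4, 8)); b1 = np.zeros(8)
W2 = rng.normal(0, 0.5, (8, 1)); b2 = np.zeros(1)

lr = 0.05
for epoch in range(3000):
    # Forward pass: input layer -> hidden layer -> output layer
    h = np.tanh(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y
    loss = float((err ** 2).mean())

    # Backpropagation: chain rule through the two layers
    g_pred = 2 * err / len(X)
    gW2 = h.T @ g_pred; gb2 = g_pred.sum(0)
    g_h = g_pred @ W2.T * (1 - h ** 2)   # tanh'(z) = 1 - tanh(z)^2
    gW1 = X.T @ g_h; gb1 = g_h.sum(0)

    # Gradient-descent update of weights and biases
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2
```

The loop minimizes the mean squared error between predicted and actual outputs, which is the essence of the training procedure the text describes.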

Extreme gradient boost (XG Boost)
Extreme gradient boosting (XG Boost) is a powerful and popular machine learning algorithm used for both classification and regression tasks. It belongs to the gradient boosting family of algorithms and is known for its high performance, efficiency, and versatility. XG Boost has gained popularity in various fields, including finance, healthcare, and natural language processing 6. While it may not be a common tool in traditional civil engineering, it can still be applied to certain civil engineering problems. XG Boost is an ensemble learning algorithm that combines the predictions of multiple decision trees to create a robust and accurate model.
It is called "Extreme" Gradient Boosting because it emphasizes the use of gradient boosting techniques, which iteratively optimize the model's performance. XG Boost is highly efficient and parallelizable, making it suitable for large datasets and distributed computing environments. Civil engineering structures, such as bridges and roads, require regular maintenance to ensure their safety and longevity 7. XG Boost can be used to predict maintenance needs by analyzing data related to factors like structural wear and environmental conditions. In construction projects, XG Boost can help identify defects and quality issues by analyzing data from sensors, cameras, or other monitoring equipment. It can flag anomalies and potential problems in real time, enabling timely interventions 8.
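The core idea of gradient boosting named here — iteratively adding trees that fit the residuals of the current ensemble — can be sketched with regression stumps. This is a simplified stand-in for XG Boost (shrinkage only, no regularization or parallelism), with an illustrative synthetic target:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 3))
y = 3 * X[:, 0] + np.sin(4 * X[:, 1])   # nonlinear stand-in target

def fit_stump(X, residual):
    # Exhaustive search for the best single-split regression stump
    best = None
    for j in range(X.shape[1]):
        for thr in np.quantile(X[:, j], np.linspace(0.1, 0.9, 9)):
            left = X[:, j] <= thr
            if left.all() or (~left).all():
                continue
            lv, rv = residual[left].mean(), residual[~left].mean()
            sse = ((residual[left] - lv) ** 2).sum() \
                + ((residual[~left] - rv) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, j, thr, lv, rv)
    return best[1:]

def predict_stump(stump, X):
    j, thr, lv, rv = stump
    return np.where(X[:, j] <= thr, lv, rv)

# Boosting loop: each new stump fits the residuals of the current ensemble
pred = np.full(len(y), y.mean())
stumps, lr = [], 0.3
for _ in range(100):
    stump = fit_stump(X, y - pred)
    stumps.append(stump)
    pred += lr * predict_stump(stump, X)   # shrinkage via the learning rate

mse = float(((y - pred) ** 2).mean())
```

Each round reduces the training error by correcting what the ensemble so far gets wrong, which is the iterative optimization the "gradient boosting" name refers to.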
XG Boost can be used to forecast traffic patterns, optimize traffic signal timings, and even predict congestion or accidents; civil engineers can utilize these predictions to plan better transportation systems. XG Boost can also model and predict the environmental impact of construction projects, aiding decision-making and regulatory compliance 9. It can further help analyze soil properties and geological data for construction site evaluation.

The study of past research led to the collection of data on the CS of BFRC. This produced 153 datasets for CS and 127 datasets for TS, which were used to create the corresponding empirical models. The training, validation, and testing sets of the database were randomly chosen for this investigation. The model was trained using the training data, and the validation data was used to confirm the model's generalizability. Throughout the testing process, many expressions were assessed on the collected data.
In Table 1, the descriptive statistics are displayed. It is recommended to employ the provided formulas within the ranges of this dataset to make accurate forecasts of the CS and TS.
It should be noted that numerous tests were conducted to evaluate the database's consistency and validity. The datasets that diverged considerably (by more than 20%) from the overall trend were treated as outliers and excluded while developing or assessing the performance of the models.
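A minimal sketch of this screening step, assuming "diverging from the overall trend" means relative deviation from a fitted least-squares line; the 20% tolerance follows the text, but the records below are hypothetical:

```python
# Hypothetical strength records (w/c ratio, measured CS in MPa); values illustrative
records = [(0.3, 60.0), (0.35, 55.0), (0.4, 50.0), (0.45, 46.0),
           (0.5, 42.0), (0.4, 80.0)]   # last record deviates far from the trend

xs = [r[0] for r in records]
ys = [r[1] for r in records]
n = len(records)

# Ordinary least-squares line as the "overall trend"
mx, my = sum(xs) / n, sum(ys) / n
slope = sum((x - mx) * (y - my) for x, y in records) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx

def keep(x, y, tol=0.20):
    # Discard records deviating more than 20% from the fitted trend
    trend = slope * x + intercept
    return abs(y - trend) / abs(trend) <= tol

cleaned = [r for r in records if keep(*r)]
```

Only the record far from the trend line is dropped; the remaining datasets are kept for model development.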
The contribution of the different input parameters to the CS and TS of BFRC can be seen in Figs. 1 and 2. These input parameters played a part in the assessment of the optimum CS and TS of BFRC.

Model development
Prior to developing the model, the first step is to decide which input parameters will have an impact on the BFRC's properties. To assess the most influential parameters on the characteristics of the BFRC and establish a generalized relationship, all the relevant factors in the studied data were carefully analyzed, and the efficiency of multiple preliminary runs was recorded. As a result, the variables in Eq. 1 are taken to be the functions governing the CS of BFRC. It should be noted that the robustness and generalizability of the resulting model depend heavily on the fitting parameters. Based on recommendations from the literature and numerous test runs, the fitting parameters for the GEP method were selected. The run time of the model is governed by the population size (number of chromosomes); depending on the intricacy of the prediction model, population sizes of 50, 100, or 150 were used. The architecture of the models created by the software is determined by the head size and gene count, where the former controls the intricacy of every term and the latter the number of sub-ETs in the model. Head sizes of 5, 8, or 10 and gene counts of 3 or 5 were used in this experiment. A list of the precise parameters used in the GEP algorithm for the two models can be found in Tables 2 and 3.
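The fitting-parameter combinations described above (population sizes of 50, 100, or 150; head sizes of 5, 8, or 10; 3 or 5 genes) can be enumerated directly; their Cartesian product gives the 18 combinations simulated:

```python
from itertools import product

# Fitting-parameter levels reported in the text for the GEP runs
population_sizes = [50, 100, 150]   # number of chromosomes
head_sizes = [5, 8, 10]             # intricacy of each term
gene_counts = [3, 5]                # number of sub-ETs per model

# 3 * 3 * 2 = 18 candidate settings, matching the combinations simulated
grid = list(product(population_sizes, head_sizes, gene_counts))
```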
CS and TS = f (w/c, CA, FA, BF, cement)  (1)

The correlation coefficient (R) is an often-employed performance indicator. However, R cannot be used as the primary indicator of the model's prediction accuracy because it is insensitive to the division and multiplication of output values. The root mean square error (RMSE) and the coefficient of determination (R²) are therefore also considered in this study. The model's performance is further assessed using a performance index (β), which is a function of RMSE, R², and R. The mathematical expressions for these error functions are provided in Eqs. (2)–(7).
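The error functions named here (R, RMSE, R²) can be written out explicitly; these are the standard definitions, which may differ in detail from the paper's Eqs. (2)–(7):

```python
import math

def rmse(obs, pred):
    # Root mean square error between observed and predicted values
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def pearson_r(obs, pred):
    # Pearson correlation coefficient R
    n = len(obs)
    mo, mp = sum(obs) / n, sum(pred) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(obs, pred))
    so = math.sqrt(sum((o - mo) ** 2 for o in obs))
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred))
    return cov / (so * sp)

def r_squared(obs, pred):
    # Coefficient of determination relative to the mean of the observations
    mo = sum(obs) / len(obs)
    ss_res = sum((o - p) ** 2 for o, p in zip(obs, pred))
    ss_tot = sum((o - mo) ** 2 for o in obs)
    return 1 - ss_res / ss_tot
```

Note the point made in the text: a constant offset leaves R at 1 while RMSE and R² degrade, which is why R alone is not a sufficient accuracy indicator.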
Overfitting of models caused by extensive data training is a concern in machine learning: the testing error may grow extensively while the training error continues to decrease. To avoid overfitting, the optimal model is chosen by minimizing an objective function (OBF), represented below, which serves as the fitness function here.
where the subscripts T and V denote the training and validation (or testing) data, respectively, and n is the total number of data points. The OBF combines the relative proportions of the training and validation entries of the dataset with the effects of R and RMSE. As a result, minimizing the OBF can be considered a precise indicator of the models' overall effectiveness, with a value close to zero representing the ideal model. Note that, after simulating 18 different fitting-parameter combinations, the model with the lowest OBF was selected.
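The OBF itself is not printed in this excerpt, so the sketch below assumes a form commonly used with GEP: a performance index β = RMSE/(1 + R) for each subset, weighted by the relative sizes of the training and validation sets. Treat both formulas as assumptions rather than the paper's exact Eqs.:

```python
def perf_index(rmse, r):
    # Assumed performance index beta = RMSE / (1 + R); lower is better
    return rmse / (1 + r)

def obf(rmse_t, r_t, n_t, rmse_v, r_v, n_v):
    # Assumed objective function weighting training (T) and validation (V) subsets
    n = n_t + n_v
    return ((n_t - n_v) / n) * perf_index(rmse_t, r_t) \
         + (2 * n_v / n) * perf_index(rmse_v, r_v)
```

Under this form, a perfect fit (RMSE = 0, R = 1) on both subsets gives OBF = 0, matching the statement that a value close to zero represents the ideal model.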
The implementation of artificial intelligence (AI) techniques mitigates the problem of collinearity, and the values of the correlation coefficients are within the prescribed limits, as can be viewed in Tables 4 and 5. Interdependency among the input variables is a common problem that arises during modelling, and the efficiency of the developed model is reduced when the correlation between inputs becomes strong. To overcome this issue, the correlation coefficient R between input variables is calculated; it should have a value of less than 0.8. It can easily be seen from Tables 4 and 5 that every R value is less than 0.8, and hence there is no risk of multicollinearity among the input parameters during modelling.

Table 3. Selection of the best GEP model for TS of BFRC.
a The operations employed included +, −, *, /, sqrt, x³.
b The operations employed included +, −, *, /, sqrt, x³, x².
c The operations employed included +, −, *, /, sqrt, x³, x², 3Rt.
d The operations employed included +, −, *, /, sqrt, x³, exp, sin, cos, atan, ln.
e The operations employed included +, −, *, /, sqrt, x³, x², 3Rt.
f The operations employed included +, −, *, /, sqrt, x³, x², pow.
g The operations employed included +, −, *, /, sqrt, x³, exp, sin, cos.
h The operations employed included +, −, *, /, sqrt, x³, x², 3Rt, 4Rt, exp, ln.
i The operations employed included +, −, *, /, sqrt.
j The operations employed included +, −, *, /, sqrt, x³, exp, sin, cos, atan, ln.
k The operations employed included +, −, *, /, sqrt, x³, x².
l The operations employed included +, −, *, /, sqrt, x³, x², 3Rt, 4Rt, exp, ln.
m The weight of the "+, −, *" operations was four times that of the others.
n The weight of the "+, −, *" operations was seven times that of the others.
o The weight of the "*" operation was four times that of the others.
p The weight of the "+, −, *" operations was three times that of the others.

Results and discussion
As previously noted in Tables 2 and 3, the empirical relationships for CS and TS are determined using the four fundamental mathematical operations +, −, ×, and ÷. A visual comparison between the model predictions and the data for CS is shown in Fig. 3. As shown in Fig. 3, the constructed model clearly accounts for the effect of each input parameter on the CS of BFRC. The outcomes displayed in Fig. 3 are statistically significant: the training, testing, and validation data points fit the trend line closely, demonstrating the accuracy of the model.
The number of datasets has a significant impact on the proposed model's reliability. For this study, 153 samples of CS were gathered from various sources to help improve the outcomes. The model for formulating CS is created by selecting 3 and 5 for the number of genes and 5, 8, and 10 for the head size, as shown in Table 2. The 28-day CS of the BFRC, up to 123 MPa, is proposed to be predicted using the simplified equations, Eqs. (9)–(12). The highlighted value is considered the most suitable for the formulation of CS, where

Formula development of TS of BFRC
The model for formulating TS is made by selecting 3 and 5 for the number of genes and 5, 8, and 10 for the head size, as shown in Table 3. The 28-day TS of the BFRC, up to 7.99 MPa, is proposed to be predicted using the simplified expressions, Eqs. (13)–(17). The highlighted value is considered ideal for the formulation of TS.
where Figure 4 provides a visual comparison between the model predictions and the actual results for TS. The developed model clearly accounts for the effects of all input parameters on the TS of BFRC, as can be seen in Fig. 4. The slopes of the regression lines for training, validation, and testing were 0.89, 0.77, and 0.91, respectively, and the findings presented in Fig. 4 therefore indicate a good correlation. The quantity of datasets has a significant impact on the suggested model's reliability 36. In this research, 127 specimens for TS were gathered from the available literature to achieve better results.

SHAP analysis
SHAP (SHapley Additive exPlanations) is a popular framework for explaining the output of machine learning models. It provides a way to understand the contribution of each feature to the model's predictions 37. SHAP values are based on the concept of Shapley values, which originate from cooperative game theory; they give a clear picture of how much influence an individual parameter has on the targeted output. In cooperative game theory, Shapley values allocate a fair share of the total payoff to each player in a game based on their marginal contributions. In the context of machine learning, each feature in a dataset can be considered a "player" in a cooperative game, and the "payoff" is the model's prediction 38.
SHAP values help explain why a particular prediction was made by breaking down the prediction into contributions from each feature.
Positive SHAP values indicate a feature's positive contribution to the prediction, while negative values indicate a negative contribution. The sum of SHAP values for all features equals the difference between the model's prediction for a specific instance and the expected (average) prediction. SHAP values can be calculated using various methods, depending on the model and the specific use case; common methods include the SHAP Kernel Explainer, Tree SHAP for tree-based models, and Deep SHAP for neural networks.
For tree-based models such as decision trees or random forests, Tree SHAP is often used to compute SHAP values efficiently. SHAP values can be visualized using various techniques, such as SHAP summary plots, SHAP dependence plots, and force plots.
Summary plots provide an overview of feature contributions across a dataset, while dependence plots show how a single feature's value impacts predictions. Force plots display the individual feature contributions for a single prediction. SHAP analysis is valuable for understanding black-box models, such as complex neural networks, and for building trust and accountability in machine learning systems. It is useful in various applications, including credit scoring, healthcare, image analysis, and natural language processing. SHAP analysis can be performed in Python using libraries such as shap, which provides tools for calculating and visualizing SHAP values. In summary, SHAP analysis is a powerful technique for explaining the predictions of machine learning models: it helps users gain insight into the model's decision-making process and provides the interpretability and transparency that are essential in many real-world applications. The SHAP analyses for CS and TS can be clearly viewed in Figs. 5 and 6. In Fig. 5, FA is the most influential parameter affecting the compressive strength of BFRC; the other input parameters are listed in descending order of their contribution. A larger positive value indicates greater significance of a parameter.
As explained earlier, a higher positive value of a parameter indicates its dominance in imparting strength to the concrete. In Fig. 6, FA shows the highest mean SHAP value. The values, starting from a higher range closer to 1 for FA and ending at 0.304 for cement, rank the significance of the parameters for the TS of BFRC in descending order. This does not mean that parameters with lower values should be discarded: they remain necessary for strong binding properties despite their smaller contribution to strength.
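The additivity property described above (SHAP values summing to the difference between a prediction and the average prediction) can be verified exactly on a tiny model by enumerating feature coalitions. The model, weights, and baseline below are hypothetical stand-ins for the BFRC predictor, not the fitted models of this study:

```python
from itertools import combinations
from math import factorial

# Hypothetical additive model standing in for a strength predictor
baseline = {"w/c": 0.45, "BF": 0.1, "FL": 12.0}   # average feature values
weights = {"w/c": -40.0, "BF": 25.0, "FL": 0.8}

def model(features):
    return 30.0 + sum(weights[name] * features[name] for name in features)

def shap_values(instance):
    # Exact Shapley values: average marginal contribution over all coalitions
    feats = list(instance)
    n = len(feats)
    phi = {}
    for f in feats:
        others = [g for g in feats if g != f]
        val = 0.0
        for size in range(n):
            for coalition in combinations(others, size):
                w = factorial(size) * factorial(n - size - 1) / factorial(n)
                # Features outside the coalition stay at their baseline values
                x_with = {g: (instance[g] if g in coalition or g == f
                              else baseline[g]) for g in feats}
                x_without = {g: (instance[g] if g in coalition
                                 else baseline[g]) for g in feats}
                val += w * (model(x_with) - model(x_without))
        phi[f] = val
    return phi

x = {"w/c": 0.35, "BF": 0.2, "FL": 18.0}
phi = shap_values(x)
# Additivity: SHAP values sum to prediction minus the baseline prediction
```

For an additive model each SHAP value reduces to weight × (feature − baseline), which makes the ranking-by-magnitude reading of Figs. 5 and 6 easy to see.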

Performance assessment of machine learning models
The criterion used in this study to assess model efficiency is to compare a common performance indicator across all the machine learning models employed. For this purpose, the regression value (R²) is chosen to validate the performance of GEP more clearly. In Fig. 7, bar 1 represents the R² for GEP, while bars 2 and 3 represent those for ANN and XG Boost. The figure illustrates why GEP was chosen for this study: its accuracy and robustness are reflected in its larger R² value.

Relative study of GEP and machine learning models
To the best of the authors' knowledge, no empirical methods have been created for calculating the CS of BFRC that account for the crucial factors considered in this investigation. As a result, using comparable datasets, linear and non-linear regression models along with ANN and XG Boost models have also been developed to calculate the CS of BFRC.
The comparison of the CS and TS findings produced by the three models is shown in Figs. 8 and 9. For all three datasets, the GEP model yields the lowest values of the statistical error parameters (RMSE). In comparison with ANN and XG Boost, the training RMSE for the CS outcomes predicted by GEP is about 33.7% lower, and in the testing phase the difference reaches about 50%, showing that the GEP model performs better than the ANN and XG Boost models. Nevertheless, all three models are applicable for forecasting the CS and TS of BFRC, as the predicted outputs in Figs. 8 and 9 lie close to one another, meaning that no widespread outliers were produced by these models.
Equations 18 and 19, respectively, provide the empirical equations to forecast CS. Golbraikh et al. (2002) suggested that at least one of the slopes of the regression line through the origin (k or k′) should be close to 1. As can be seen, the slope of the regression line for CS is 0.99 and that for TS is 0.89, which suggests high precision and correlation. According to several scholars, the squared correlation coefficient (through the origin) between experimental and predicted values, or the coefficient between predicted and experimental values, should also be close to 1. Table 6 shows that the model complies with the requirements for external verification, demonstrating that the GEP models are highly valid, possess predictive ability, and go beyond simple correlations between input and output characteristics.
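The through-origin slopes k and k′ referred to here can be computed directly; the 0.85–1.15 acceptance band used in the sketch is a commonly quoted tolerance for the Golbraikh-style check, assumed rather than taken from the paper:

```python
def origin_slopes(experimental, predicted):
    # Through-origin regression slopes: k (experimental on predicted)
    # and k' (predicted on experimental)
    sxy = sum(e * p for e, p in zip(experimental, predicted))
    k = sxy / sum(p * p for p in predicted)
    k_prime = sxy / sum(e * e for e in experimental)
    return k, k_prime

def passes_external_check(experimental, predicted, tol=0.15):
    # Accept if at least one slope lies within the assumed 0.85-1.15 band
    k, kp = origin_slopes(experimental, predicted)
    return (1 - tol) <= k <= (1 + tol) or (1 - tol) <= kp <= (1 + tol)
```

A model that systematically over- or under-predicts by a large factor fails this check even if its correlation coefficient is perfect, which is the point of external verification.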
It can be observed that the GEP model not only correlates the input and output parameters but is also efficient in the prediction, validation, and verification of the data. Soft computing techniques [39][40][41], deep learning algorithms [42][43][44], and machine learning [45][46][47][48][49] can be utilized for further analysis. Moreover, artificial neural networks 50,51, support vector machines 52,53, random forests 54,55, deep learning neural networks 56,57, neuro-fuzzy systems 58,59, extreme learning 60,61, support vector machines [62][63][64], and hybrid machine learning models based on genetic algorithms [65][66][67] can be utilized to predict the response using the existing experimental data. This will save cost and human effort and open new directions for future research. Zhou et al. 68 and Wu et al. 69 conducted separate studies on the moisture diffusion coefficient of concrete under various conditions and the three-dimensional simulation of seismic wave propagation, taking into account source-path-site effects. Xu et al. 70 and He et al. 71 investigated mine water inflow from roof sandstone aquifers using upscaling techniques and the use of N-doped graphene quantum dots to enhance the chloride binding of cement, respectively. Zhan et al. 72 and Zhou et al. 73 performed data-worth analysis for identifying and characterizing heterogeneous subsurface structures and developed a high-strength geopolymer based on the BH-1 lunar soil simulant, respectively. Tian et al. 74 and Ren et al. 75 studied the collapse resistance of steel frame structures in column-loss scenarios and developed a damage model for porous rock suitable for different stress paths, respectively. Cheng et al. 76 and Yu et al. 77 investigated the effects of methane and oxygen on heat and mass transfer in reactive porous media and the stress relaxation behavior of marble under cyclic weak disturbance and confining pressures, respectively. Xu et al. 78, Ren et al. 79, and Yao et al. 80 have recently examined the properties of source rocks and the genetic origins of natural gas, as well as the damage caused by compaction and cracking and the combined disturbance-induced damage to rocks.

Conclusions
This research presented the application of the gene expression programming (GEP) strategy to predict the CS and TS of BFRC. The proposed models are empirical and rely on a widely dispersed catalogue gathered from the experimental datasets studied in the literature. The results obtained from the predictive models validate the experimental findings. The parametric analysis shows that the suggested models agree with the contributions of the input parameters, confirming the accuracy of the predicted trends of CS and TS of BFRC, as can be seen in the SHAP analysis and in Figs. 8 and 9.
The assessment and comparison of the fitness functions (β and OBF) and statistical parameters (RMSE and R) for all three sets (training, validation, and testing) revealed the precision of the suggested models, as shown in Table 6.
Additionally, the model clearly satisfies the various criteria considered for external validation. When the derived GEP and regression models are compared, it becomes clear that the GEP models outperform the ANN and XG Boost models in terms of generalization and prediction, making them ideal for use in the design of BFRC, as indicated by the R² value of 0.99 in Fig. 7.
It is recommended to perform a detailed assessment before using BF as reinforcement in concrete, and to carefully consider the optimum content of BF needed to achieve the desired CS and TS of BFRC (Figs. 1 and 2).
Instead of dumping basalt fibers as waste, the empirical models can supply a precise and powerful foundation for enhancing their use in construction projects. This may help create high-performance concrete, which will be more useful in the construction sector. Further study on BFRC can be carried out for other parameters, such as elasticity.


Table 1 .
Descriptive statistics of input parameters.

Table 2 .
Summary of best GEP models for CS of BFRC.

Table 4 .
Correlation coefficient among input variables used in modelling of CS.

Table 5 .
Correlation coefficient among input variables used in modelling of TS.

Table 6 .
Formulating TS by GEP, ANN, and XG Boost. Statistical parameters of the GEP model for external verification.