Evaluation of machine learning models for predicting TiO2 photocatalytic degradation of air contaminants

The escalation of global urbanization and industrial expansion has increased the emission of harmful substances into the atmosphere. Evaluating the effectiveness of titanium dioxide (TiO2) in photocatalytic degradation through traditional methods is resource-intensive and complex because of the detailed photocatalyst structures and the wide range of contaminants. Therefore, in this study, recent advancements in machine learning (ML) are used to offer a data-driven approach using thirteen ML techniques, namely XGBoost (XGB), decision tree (DT), lasso regression (LR2), support vector regression (SVR), AdaBoost (AB), voting regressor (VR), CatBoost (CB), K-nearest neighbors (KNN), gradient boost (GB), random forest (RF), artificial neural network (ANN), ridge regression (RR), and linear regression (LR1), to estimate the TiO2 photocatalytic degradation rate of air contaminants. The models are developed using literature data, and different methodical tools are used to evaluate them. The XGB, DT, and LR2 models have high R2 values of 0.93, 0.926, and 0.926 in the training phase and 0.936, 0.924, and 0.924 in the test phase, while the ANN, RR, and LR1 models have the lowest R2 values of 0.70, 0.56, and 0.40 in training and 0.62, 0.63, and 0.31 in testing, respectively. In the test stage, XGB, DT, and LR2 have low RMSE values of 0.450, 0.494, and 0.49 min-1/cm2 and low MAE values of 0.263, 0.285, and 0.29 min-1/cm2. XGB, DT, and LR2 have 93% of prediction errors within the 20% error range in the training phase; in the test phase, XGB has 92% and DT and LR2 have 94% of errors within the 20% range. XGB, DT, and LR2 remained the highest-performing models, with XGB the most robust and effective in predictions. Feature importances reveal the role of input parameters in the predictions made by the developed ML models; dosage, humidity, and UV light intensity remain important experimental factors.
This study will have a positive impact by providing efficient models to estimate the photocatalytic degradation rate of air contaminants using TiO2.

urban smog, and the greenhouse effect. Airborne contaminants pose a significant risk to human health and the environment, necessitating the development of innovative strategies for mitigating their effects. Titanium dioxide (TiO2) is a popular choice for pollutant degradation due to its capability for heterogeneous photocatalytic degradation [2][3][4]. It is very effective at breaking down pollutants with a variety of functional groups and structures owing to its non-selective photocatalytic characteristics. Previous research has demonstrated that TiO2 shows encouraging results in the removal of air and water contaminants 5. TiO2 offers several advantages, including cost-effectiveness, low energy consumption, exceptional efficiency, strong oxidizing capabilities, chemical stability, resistance to acids, ease of production, and insolubility in water. The results demonstrate the effectiveness of TiO2 in eliminating pollutants, and the photocatalytic process can be carried out at standard ambient temperature and pressure conditions 1,[6][7][8]. Figure 1 illustrates the process of semiconductor photocatalytic degradation.
Moreover, TiO2 finds application in a wide range of areas, including air purification, water treatment, water splitting, renewable energy production, and the conversion of carbon dioxide into hydrocarbons [10][11][12][13][14]. In recent decades, extensive efforts have been dedicated to the development of photocatalysts and the assessment of their effectiveness in municipal water treatment processes [15][16][17][18][19]. Nevertheless, quantifying the efficiency of photocatalysts across a spectrum of contaminants presents a difficult challenge. The photocatalytic degradation performance of contaminants is intricately linked to the characteristics of photocatalysts, encompassing factors such as crystalline structure, grain size and shape, specific surface area, pore structure, and more [20][21][22][23]. Additionally, a number of experimental factors, including the dosage of the photocatalyst, the medium's pH, the concentration of contaminants, the wavelength and intensity of the light, and others, have an important impact on the photocatalytic activity [24][25][26]. The conventional experimental methods for assessment of photocatalytic activity are UV-Vis spectroscopy [27][28][29], gas chromatography analysis 30,31, chemical oxygen demand 29,32, fluorescence spectroscopy 33, and electrochemical methods 34. These methods of experimentation require longer reaction times to obtain measurable contaminant degradation, which causes delays in the collection of results. Moreover, these experimental techniques need complex sample preparation procedures and manual data processing, which require skilled workers and significant time commitments. The utilization of traditional methods frequently provides limited mechanistic understanding of the photocatalytic degradation process, creating difficulties in catalyst design and the optimization of reaction conditions. Experimental processes are important for obtaining quality data for training and testing of ML models, but performing repeated
experiments for each project is not feasible. An effective way to overcome these restrictions and improve estimation of photocatalytic activity is to use machine learning models. ML models are capable of accurately and quickly predicting experimental results and reaction kinetics, outperforming more conventional approaches; therefore, artificial intelligence and ML modelling have been used extensively [35][36][37][38][39]. This facilitates a quicker evaluation of photocatalytic performance. ML algorithms can be highly effective in identifying the ideal reaction conditions to maximize photocatalytic activity and selectivity by utilizing datasets of experimental parameters and reaction results. More sophisticated algorithms are excellent at deciphering complex correlations between catalyst characteristics, reaction parameters, and photocatalytic efficiency, providing crucial mechanistic insights into the degradation process.
Data-driven machine learning offers a novel method for assessing photocatalyst performance that is quick, affordable, and flexible, outperforming traditional testing. Artificial neural networks (ANNs) are an established machine learning technique used to forecast properties of many different types of materials, including metals, polymers, composites, and ceramics [40][41][42][43][44][45][46]. ANNs have been used to forecast the photocatalytic activity of novel catalysts [47][48][49][50][51] and to speed up the design and discovery of new catalysts [52][53][54]. Prior research has employed the Gaussian process regression model to forecast the band gaps of anatase TiO2 photocatalysts by considering their surface area and lattice properties 55. It is important to note that these model predictions may not fully account for all the factors influencing photocatalytic degradation, which could leave gaps in our understanding of TiO2's overall performance. Furthermore, it is crucial to conduct a comprehensive comparison to determine which model performs optimally for a given dataset. One of the many crucial steps in developing a successful ML model is evaluating and contrasting many models in order to choose the optimal one for deployment 56. Existing research indicates that there is no universally superior method, as no single approach consistently outperforms others 57,58. Many studies tend to introduce new models and only compare them with similar ones, making it challenging to assess their relative and overall performance 59. Although certain methods may exhibit superior average performance, there exists substantial variability when considering different problems and metrics 58. Hence, the comparative analysis of diverse algorithms becomes indispensable in the quest to pinpoint the most appropriate model for a particular problem. Distinct models demonstrate excellence under varying circumstances, and the utilization of a spectrum of models enhances the effectiveness of research endeavors. Through the comparison of multiple algorithms, researchers can mitigate potential biases and gain a more comprehensive understanding of the problem landscape, revealing patterns and correlations within the data.

Figure 1. Mechanism of photocatalytic degradation 9.
In this study, thirteen machine learning techniques are employed to develop models for prediction of TiO2 photocatalytic degradation of air contaminants. The algorithms encompass linear approaches (linear, ridge, lasso, and support vector regression), decision trees, random forests, K-nearest neighbors, and boosting-based methods. The crucial issues of hyperparameter optimization and overfitting are addressed using the grid search technique and K-fold cross-validation. The dataset for ML model training and testing is acquired from the literature, and various methods, including regression analysis, statistical metrics, and visualization techniques, are used to evaluate and compare the performance of the proposed models. Furthermore, feature importances are used to analyze the role of input parameters in predicting the output. This study will improve the prediction process for the TiO2 photocatalytic rate as well as the knowledge and understanding of the functionality of different constituents in photocatalytic degradation.

Overview of employed ML techniques
Thirteen machine learning techniques were assessed to identify the ideal model for predicting TiO2 photocatalytic degradation of air contaminants. The algorithms used in this study are discussed in the following sections. A detailed description of the employed ML techniques is beyond the scope of this study and can be found extensively in the literature [60][61][62][63][64].

XGBoost (XGB)
XGBoost is an open-source machine-learning library introduced by Chen and Guestrin in 2016 65. It stands out as a powerful, versatile, and portable tool. It excels in addressing regression, classification, and ranking problems within supervised learning. Data scientists and machine learning experts widely regard XGB as a dependable and efficient algorithm, particularly when dealing with extensive datasets rich in features. The core principle of XGB involves constructing decision trees sequentially. What sets it apart is its ability to assign greater importance to the samples that previous decision trees may have mispredicted. By combining these numerous classifiers and predictors, XGB produces robust and dependable models. Figure 2 represents the mechanism of XGB: its sequential ensemble learning is illustrated by building decision trees iteratively and training each one to correct the residuals of the previous trees to produce a strong prediction model.
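The residual-correcting idea described above can be sketched in a few lines. This is an illustrative toy version of gradient boosting using scikit-learn trees on synthetic data, not the full regularized XGBoost algorithm or the study's pipeline:

```python
# Toy sketch of sequential boosting: each small tree is fit to the
# residuals of the current ensemble, and its (scaled) prediction is
# added to the running prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 200)

learning_rate = 0.3
pred = np.full_like(y, y.mean())        # start from the mean prediction
trees = []
for _ in range(50):
    residual = y - pred                 # errors the ensemble still makes
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    pred += learning_rate * tree.predict(X)
    trees.append(tree)

mse = np.mean((y - pred) ** 2)
print(f"training MSE after boosting: {mse:.4f}")
```

After 50 rounds the training error drops close to the noise level, illustrating why boosted ensembles of shallow trees are strong predictors.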

Decision tree (DT)
A decision tree analyzes the data to identify patterns and establish predictive rules. Figure 3 illustrates a three-layer construction of a decision tree. The primary objective of a regression tree is to divide the predictors into segments, enabling the prediction of target variables based on these partitions of the input variables. Furthermore, the regression tree implicitly selects variables and highlights variables with greater significance in predicting target variables based on the previous nodes in the tree. One of its notable advantages is its versatility in handling both numeric and categorical data. As a result, this approach is considered relatively straightforward, although it requires careful consideration to prevent data overfitting. The disadvantage of regression trees is that their models are inherently unstable. Slight modifications to the dataset might result in completely different partitions, making optimal model selection difficult 66,67. Consequently, decision tree models are susceptible to overfitting. To mitigate this issue, more complex tree-based methods like random forests and boosted trees are often preferred for their greater reliability.
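A minimal regression-tree sketch (scikit-learn on synthetic data; the depth cap illustrates the overfitting control discussed above):

```python
# A shallow regression tree partitions the input space and predicts the
# mean target value within each partition.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 2))            # two input features
y = 2.0 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(0, 0.5, 300)

tree = DecisionTreeRegressor(max_depth=3, random_state=1).fit(X, y)
r2 = tree.score(X, y)
print(f"depth-3 tree, training R2: {r2:.3f}")
```

Removing `max_depth` would let the tree memorize the training data, which is exactly the instability the random-forest and boosting sections address.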

Lasso regression (LR2)
Lasso regression (LR2) is employed for feature selection and regularization to enhance model accuracy 68. Similar to ridge regression (RR), LR2 minimizes the residual sum of squares, subject to a penalty on the sum of the absolute values of the regression coefficients. As shown in Eq. (1), the objective is to minimize the following expression:

\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{m}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{m}|\beta_j| \quad (1)

Here, the samples in the dataset are indexed by i, while the number of input features is denoted by m. The shrinkage parameter is denoted by λ, and as λ grows, the coefficients (βj) tend toward zero. Figure 4 illustrates the shrinkage effect and sparsity induction in feature selection under the L1 constraint.
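The sparsity induced by the L1 penalty can be demonstrated directly. A sketch on synthetic standardized data (scikit-learn's `alpha` plays the role of λ):

```python
# Lasso's L1 penalty drives uninformative coefficients to exactly zero,
# performing feature selection.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))                    # 5 standardized features
y = 3.0 * X[:, 0] + 1.5 * X[:, 1] + rng.normal(0, 0.5, 200)  # only 2 matter

model = Lasso(alpha=0.5).fit(X, y)
print("coefficients:", np.round(model.coef_, 2))
n_zero = int(np.sum(model.coef_ == 0.0))
print(f"{n_zero} of 5 coefficients shrunk to exactly zero")
```

The three irrelevant coefficients collapse to exactly zero while the informative ones survive (shrunk toward zero), which is what Fig. 4 depicts geometrically.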

Support vector regression (SVR)
Support vector machines (SVMs) are commonly used for support vector regression (SVR), which was introduced by Vapnik et al. 69. To reduce generalization error, the SVR technique builds a model based on the training data and uses a hyperplane that maximizes the distance between labeled classes to categorize new data points. The sum of the distances from the hyperplane to the closest labeled data points is used to calculate the margin in SVR. SVR minimizes both observable training errors and generalized errors to achieve generalized performance. The SV method is illustrated in Fig. 5, where 'x' stands for the input vector, 'y' for the target value, and the model utilizes the linear regression function.
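A brief SVR sketch on synthetic data. The ε-insensitive tube ignores small errors, so only points outside the tube become support vectors (here an RBF kernel is used for illustration; the kernel choice is an assumption, not taken from the study):

```python
# SVR with an epsilon-insensitive loss: only samples outside the
# epsilon tube contribute as support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sinc(X[:, 0]) + rng.normal(0, 0.05, 200)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.1).fit(X, y)
print(f"support vectors used: {len(svr.support_)} of {len(X)}")
print(f"training R2: {svr.score(X, y):.3f}")
```

Widening `epsilon` shrinks the set of support vectors (a sparser model) at the cost of some accuracy.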

AdaBoost (AB)
Boosting is an ensemble technique that combines multiple weak learners to create a strong learner. It involves adjusting the sample weights of the training data for each model based on the results of the previous model. This means that the outcomes of previous learning iterations influence subsequent ones, increasing the importance of difficult data points over the boosting rounds. AdaBoost, a specific boosting algorithm, focuses on increasing the weight of training samples that were poorly predicted by the previous model. Initially, AdaBoost creates the first weak learner and assigns an output weight (C) and a model weight (w) to each data point based on the learning results. The model is then updated with the data weights. This process is repeated for a specified number of iterations (N). Data points that were incorrectly predicted by the previous models are given higher weight, while those accurately predicted have their weight reduced. As the iterations continue, challenging examples receive increasing attention. After N iterations, each of the N weak learners is assigned a model weight (w) to contribute to the final model 66,70. The boosting procedure is shown in Fig. 6, wherein AdaBoost increases the weight of training data that the prior model underfits. The final AdaBoost prediction is the weighted sum of this combination of weak learners.
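The reweighting loop above is implemented for regression by scikit-learn's `AdaBoostRegressor` (AdaBoost.R2 variant with shallow trees as the default weak learner). A minimal sketch on synthetic data:

```python
# AdaBoost regression: each round reweights the training samples that
# the previous weak learners predicted poorly.
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

rng = np.random.default_rng(4)
X = rng.uniform(0, 2 * np.pi, size=(300, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, 300)

ab = AdaBoostRegressor(n_estimators=100, random_state=4).fit(X, y)
r2 = ab.score(X, y)
print(f"ensemble of {len(ab.estimators_)} weak learners, training R2: {r2:.3f}")
```

Each fitted weak learner in `ab.estimators_` carries its own model weight, matching the weighted-sum combination described above.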

Voting regression (VR)
A voting regressor falls under the category of ensemble meta-estimators; it fits multiple base regressors to the entire dataset 71. It generates the final prediction by averaging the outputs of these individual estimators. The key advantage of using a voting regressor is its resilience to significant errors or mispredictions from any single model. Since it relies on the collective performance of multiple models, the negative impact of one model's poor performance is mitigated by the strong performance of the others. By combining several models, the chances of a single model making an incorrect prediction are minimized. This approach enhances the robustness of the estimator and reduces the risk of overfitting. The voting regressor workflow illustrated in Fig. 7 shows that many estimators work in parallel to produce predictions, which are then aggregated using a weighted technique to obtain a final prediction that is less likely to be overfit and more reliable.
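A sketch of the averaging described above using scikit-learn's `VotingRegressor` (the particular base models here are illustrative, not the study's configuration):

```python
# VotingRegressor averages the predictions of several base regressors,
# so no single model's errors dominate the final output.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)
X = rng.uniform(0, 10, size=(250, 2))
y = X[:, 0] * X[:, 1] + rng.normal(0, 1.0, 250)

vr = VotingRegressor([
    ("lr", LinearRegression()),
    ("rf", RandomForestRegressor(n_estimators=50, random_state=5)),
    ("knn", KNeighborsRegressor(n_neighbors=5)),
]).fit(X, y)
r2 = vr.score(X, y)
print(f"voting ensemble training R2: {r2:.3f}")
```

The linear model alone cannot capture the product term, but its error is diluted by averaging with the two nonlinear learners.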

CatBoost (CB)
CatBoost (categorical boosting) is an open-source gradient-boosting method for decision trees that was first presented by Prokhorenkova et al. in 2018 72. Its unique, efficient, and greedy gradient-boosting methodology makes it distinctive. CatBoost is utilized in a variety of domains, including recommendation systems, ranking, prediction, and even personal assistants, in addition to regression and classification challenges. One of its notable advantages is that it may reduce the need for extensive hyperparameter adjustments, which minimizes the risk of overfitting and thus gives models wider applicability. Due to its use of symmetric decision trees, this method can rapidly run inference with pre-trained weak learners. It is particularly well-suited for handling noisy data with diverse attributes and complex relationships. Figure 8 is a schematic of the CatBoost regressor illustrating the training of several predictors using different feature combinations from the training set, with an ensemble of the judgments made by each predictor serving as the final prediction.

K-nearest neighbors (KNN)
The K-nearest neighbors (KNN) technique involves making predictions for new records by comparing them to the most similar records in a dataset. Figure 9 provides a visual representation of the KNN algorithm. It can be applied to both regression and classification tasks in machine learning. KNN operates on the assumption that observations close to each other in the attribute space are also close in terms of output values. Predictions for output values are made by using a predefined function of the response values of the nearest neighbors, with a focus on the closest neighbors in the data space. In standard KNN, the average function is commonly used. Some key properties of standard KNN include:
1. Assigning equal importance to all neighbors and using the average function to calculate the response value for unknown observations.
2. Treating all normalized attributes as equally important by assigning the same weight to them.
3. Using Euclidean distances to calculate distances between data points.
An advantage of KNN is its resilience to noise in the training data, making it an effective algorithm for handling large datasets 66,67.
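The three properties above map directly onto scikit-learn's defaults (uniform weights, averaged response, Euclidean distance). A minimal sketch on synthetic standardized data:

```python
# KNN regression: predict a new point as the average response of its
# k nearest (Euclidean-distance) neighbors in standardized feature space.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)
X = rng.uniform(0, 5, size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] - X[:, 2] + rng.normal(0, 0.3, 200)

Xs = StandardScaler().fit_transform(X)   # give all attributes equal weight
knn = KNeighborsRegressor(n_neighbors=5).fit(Xs, y)
r2 = knn.score(Xs, y)
print(f"5-NN training R2: {r2:.3f}")
```

Standardizing first matters: without it, a feature with a large numeric range would dominate the Euclidean distance.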

Gradient boost (GB)
Gradient boost (GB) is a machine learning boosting strategy that uses decision trees as weak learners in order to minimize the total error of the model and produce a robust learner 61. This strategy employs an iterative ensemble methodology in which a weak learner is introduced and trained by the algorithm to minimize the overall training error. Then, until the model's overall error reaches the target level, another weak learner is added and trained similarly. The basic method of GB is to perform regression on a function obtained from the gradient vector of the loss function computed in the preceding iteration 73. Figure 10 illustrates the gradient boosting mechanism, in which decision trees are iteratively trained on the successive residual errors until an improved ensemble model is produced.
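A sketch of this iterative error reduction with scikit-learn's `GradientBoostingRegressor`; `staged_predict` exposes the ensemble's prediction after each boosting round, making the shrinking training error visible (synthetic data, illustrative hyperparameters):

```python
# Gradient boosting fits each new tree to the gradient of the loss
# (the residuals, for squared error); staged_predict shows the error
# shrinking round by round.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(300, 2))
y = X[:, 0] ** 3 + 2 * X[:, 1] + rng.normal(0, 0.2, 300)

gb = GradientBoostingRegressor(n_estimators=200, learning_rate=0.1,
                               max_depth=2, random_state=7).fit(X, y)
mse_path = [np.mean((y - p) ** 2) for p in gb.staged_predict(X)]
print(f"training MSE after 10 trees: {mse_path[9]:.3f}, "
      f"after 200 trees: {mse_path[-1]:.3f}")
```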

Random forest (RF)
Random forest involves constructing an ensemble of regression trees to mitigate the variability observed in individual trees. In this approach, decision trees come together to form a "forest" by applying the concept of "bootstrap aggregation" (bagging), which entails creating multiple similar datasets sampled from the same source dataset. Bagging involves combining base models trained on different subsets of the training data. Decision trees are known for their low bias but high variance, making them susceptible to overfitting when used in isolation. The key advantage of the random forest method lies in its ability to significantly reduce this instability. To address overfitting, random forest models are typically constructed by aggregating multiple decision trees or through regularization techniques 66,74. An illustration of the RF process is shown in Fig. 11. First, the size of the samples is established, and then a decision tree is constructed for each sample. Each decision tree generates a prediction based on the input parameters. The best forecast is determined by a voting procedure in which the result receiving the most votes is declared the best prediction; the majority vote ultimately determines the final forecast.
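A random-forest sketch on synthetic data. Because each tree sees a bootstrap sample, the samples left out of a tree's bag give a free generalization estimate (the out-of-bag, OOB, score):

```python
# Bagging many decision trees on bootstrap samples; the OOB score
# estimates generalization without a separate test set.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(8)
X = rng.uniform(0, 1, size=(400, 4))
y = 10 * X[:, 0] + 5 * X[:, 1] ** 2 + rng.normal(0, 0.3, 400)

rf = RandomForestRegressor(n_estimators=200, oob_score=True,
                           random_state=8).fit(X, y)
print(f"trees in forest: {len(rf.estimators_)}, OOB R2: {rf.oob_score_:.3f}")
```

Averaging across 200 decorrelated trees is what tames the single-tree variance discussed in the decision-tree section.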

Artificial neural network (ANN)
The artificial neural network (ANN) is an algorithm rooted in deep learning, aiming to emulate the human brain and its neural system 75. It comprises neurons as processing elements that form interconnected networks, with each processing element having multiple inputs and generating a single output. When the input to a processing element surpasses a certain threshold, it transmits information to neighboring elements. The strength of the connections between processing elements is determined by assigned weights, which are adjusted during the training process using training data to minimize the difference between predicted and target values. The hidden layer, situated between the input and output layers, plays a crucial role by applying nonlinear transformations to the inputs. An ANN's operation is shown in Fig. 12. The hidden layers, which apply operations that replicate neural processes in the human brain, receive data from the input nodes. The network learns from the input nodes, which leads to activation inside the hidden nodes. The output layer generates predictions as a result of the patterns and insights that these hidden layers have extracted from the data.
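A minimal ANN sketch using scikit-learn's `MLPRegressor` (a single hidden layer on synthetic data; the architecture here is illustrative, not the network used in the study):

```python
# A single-hidden-layer MLP: weights are adjusted during training to
# reduce the gap between predicted and target values.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, size=(400, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2     # smooth nonlinear target

Xs = StandardScaler().fit_transform(X)
ann = MLPRegressor(hidden_layer_sizes=(32,), activation="relu",
                   max_iter=2000, random_state=9).fit(Xs, y)
r2 = ann.score(Xs, y)
print(f"MLP training R2: {r2:.3f}")
```

The hidden layer's ReLU units supply the nonlinear transformations described above; a purely linear model could not fit this target.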

Ridge regression (RR)
Ridge regression (RR) shares similarities with LR1 but focuses on reducing the model variance observed in LR1 68. In RR analysis, the coefficients are shrunk to minimize the residual sum of squares, as illustrated in Eq. (2):

\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{m}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{m}\beta_j^2 \quad (2)

The notation is the same as in Eq. (1) for lasso regression: m represents the number of input features, and i indexes the samples in the dataset. The shrinkage parameter is denoted as λ, and as λ increases, the coefficients (βj) tend toward zero. Ridge regression is illustrated graphically in Fig. 13, which shows the restriction placed on the coefficients (β) by L2 regularization inside a circle; this lowers variance and strengthens the model's resistance to multicollinearity.
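The effect of λ in Eq. (2) can be checked numerically: a larger penalty (scikit-learn's `alpha`) shrinks the coefficient vector, without zeroing individual entries the way lasso does. A sketch on synthetic data:

```python
# Ridge's L2 penalty shrinks coefficients toward (but not exactly to)
# zero; a larger alpha (lambda) means stronger shrinkage.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(10)
X = rng.normal(size=(150, 4))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 0.5, 150)

norm_small = np.linalg.norm(Ridge(alpha=0.1).fit(X, y).coef_)
norm_large = np.linalg.norm(Ridge(alpha=100.0).fit(X, y).coef_)
print(f"||beta|| at alpha=0.1: {norm_small:.2f}, at alpha=100: {norm_large:.2f}")
```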

Linear regression (LR1)
Linear regression (LR1) is a supervised machine learning approach that establishes a linear connection between dependent and independent variables 76. This technique uses gradient descent to obtain the best coefficients and produces regression coefficients that describe this linear connection. LR1 models the predicted value of the target variable based on the supplied input variables, also known as independent variables. It fits a hyperplane in the m-dimensional space, where m is the number of independent features. The regression coefficients are chosen such that the cost function is minimized. After this minimization procedure, the best-fit line is obtained using gradient descent, where the ideal set of regression coefficients is selected. The process begins with random values for θi and iteratively updates them to reduce the cost function value. The fitted model takes the form of Eq. (3):

Y = \beta_0 + \sum_{j=1}^{m}\beta_j X_j \quad (3)

where Y is the target variable, the Xj are the independent variables, and the βj are the regression coefficients. Figure 14 shows a scatter plot with a linear regression line, illustrating how fitting a linear model to the data points highlights the predicted connection between an independent and a dependent variable.
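Eq. (3) can be verified on synthetic data with known coefficients; scikit-learn's `LinearRegression` recovers them from noisy samples (a sketch, not the study's model):

```python
# Least-squares linear regression: recover known coefficients from
# noisy synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(11)
X = rng.uniform(0, 10, size=(300, 2))
y = 1.5 + 2.0 * X[:, 0] - 3.0 * X[:, 1] + rng.normal(0, 0.5, 300)

lr = LinearRegression().fit(X, y)
print(f"intercept: {lr.intercept_:.2f}, coefficients: {np.round(lr.coef_, 2)}")
```

With 300 samples and modest noise, the fitted β0 and βj land very close to the generating values 1.5, 2.0, and −3.0.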

Methodology

Source and features of literature data
The database utilized for ML model development contains 200 sets of experimental data derived from previous research 77. The dataset includes seven independent variables, which essentially cover the majority of significant experimental condition settings: dosage, ultraviolet light intensity, humidity, wavelength, experimental temperature, initial concentration of the air contaminant, and reactor volume. The output variable for the ML models was the photocatalytic degradation rate of TiO2 (k, min-1/cm2). The absolute values of the reaction rate are usually very small positive numbers; therefore, they are converted to the negative base-10 logarithm, −log(k). Table 1 shows the statistical analysis of the dataset. The mean, minimum (min), and maximum (max) indicate the center and extreme data points. Standard deviation (SD) is used to measure the dispersion of the data. A smaller SD suggests that most data points are close to the mean, while a larger SD indicates greater dispersion across the range of values. Skewness and kurtosis are employed to assess the symmetry and shape of the data distribution relative to a normal probability distribution. Both can be zero, positive, or negative, and may be undefined 78. The data curve is flatter than the normal distribution curve when kurtosis is negative and more peaked when it is positive; a kurtosis of zero corresponds to a normal distribution with a medium peak [79][80][81]. A skewness value of exactly zero indicates symmetry, whereas positive and negative values indicate that the probability distribution curve's tail is stretched to the right or left of the curve, respectively 82. Generally, acceptable ranges for kurtosis and skewness are between −10 and +10 and between −3 and +3, respectively 79,80,82,83.
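The descriptive statistics in Table 1 follow standard definitions; a sketch of how they behave on illustrative synthetic samples (scipy's conventions: Fisher kurtosis, so a normal distribution scores 0; these samples are not the study's dataset):

```python
# Skewness/kurtosis behavior: a normal sample scores near zero on both;
# a log-normal sample has its tail stretched to the right (positive skew).
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)
normal = rng.normal(0, 1, 10000)
right_skewed = rng.lognormal(0, 0.5, 10000)

for name, x in [("normal", normal), ("log-normal", right_skewed)]:
    print(f"{name}: mean={x.mean():.2f}, SD={x.std():.2f}, "
          f"skew={stats.skew(x):.2f}, kurtosis={stats.kurtosis(x):.2f}")
```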

Data preprocessing
Data preprocessing transforms raw, unclean data from various sources into clean and analyzable data. Raw data often come in formats that complicate or outright prevent analysis and machine learning application. For machine learning projects, it is crucial to format the data correctly to enhance the performance of the chosen models. Each ML model has its specific requirements for data format. Organizing the dataset appropriately is key to effectively applying and comparing different machine learning and deep learning algorithms. This includes removing missing values and converting categorical data into numerical values, which is vital for many of these algorithms. Moreover, dataset standardization is a common prerequisite, involving scaling the numerical data by removing the mean and adjusting to unit variance to ensure consistency and improve algorithm efficiency. In this study, data preprocessing and visualization are performed using Python in the Jupyter notebook environment. The StandardScaler tool of scikit-learn is used for data standardization. The formula for standardization is given in Eq. (4):

z = (x − μ) / σ \quad (4)
where z is the standardized value and x is the original value of the feature; μ and σ are the mean and standard deviation of the feature values. After removing anomalies from the dataset, correlation and parametric studies were performed to assess the features involved in the data.
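The per-feature standardization of Eq. (4) is exactly what scikit-learn's StandardScaler computes. A tiny sketch (the two example columns are hypothetical stand-ins, not values from the study's dataset):

```python
# StandardScaler applies z = (x - mu) / sigma column by column, leaving
# each feature with mean 0 and unit variance.
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[0.5, 300.0],
              [1.0, 310.0],
              [1.5, 320.0]])          # e.g. a dosage-like and a temperature-like column

z = StandardScaler().fit_transform(X)
print("means after scaling:", np.round(z.mean(axis=0), 6))
print("SDs after scaling:  ", np.round(z.std(axis=0), 6))
```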

Correlation study
The correlation matrix of the input data is shown in Fig. 15. Lighter hues signify weaker correlations, whereas darker hues indicate greater positive correlation. Notably, there is a large negative correlation (R = −0.47) between T and R and a high positive correlation (R = 0.57) between variables W and I. High correlation values (|r| > 0.5) among the input variables raise the possibility of multicollinearity, which can introduce bias and affect model results. The variance inflation factor (VIF) is an essential diagnostic tool in statistical research that helps identify collinearity between independent variables 84. VIF is a quantitative indicator of collinearity inside a regression model, and its values normally span from 0 to 10 and are occasionally limited to 0 to 5 85. Multicollinearity may also be evaluated using the tolerance value, which is the reciprocal of the VIF. In general, tolerance values between 0.1 and 1 are seen as a sign of negligible multicollinearity 86. Table 2 shows that there is no substantial multicollinearity among the independent variables in this dataset, as indicated by tolerance values greater than 0.1 and VIF values less than 5 86.
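The VIF for feature j is 1/(1 − R²j), where R²j comes from regressing feature j on the remaining features; tolerance is its reciprocal. A sketch of this computation on synthetic data with one deliberately correlated feature (illustrative, not the Table 2 values):

```python
# VIF_j = 1 / (1 - R2_j), with R2_j from regressing feature j on the
# other features; tolerance = 1 / VIF_j.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(13)
X = rng.normal(size=(300, 3))
X[:, 2] = 0.6 * X[:, 0] + 0.4 * rng.normal(size=300)   # induce correlation

vifs = []
for j in range(X.shape[1]):
    others = np.delete(X, j, axis=1)
    r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
    vifs.append(1.0 / (1.0 - r2))
    print(f"feature {j}: VIF = {vifs[-1]:.2f}, tolerance = {1.0 / vifs[-1]:.2f}")
```

The independent feature stays near VIF = 1, while the constructed correlated feature exceeds it noticeably, which is the pattern the VIF screen is designed to flag.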

Parametric study
Figure 16 shows the relationship between the photocatalytic degradation rate and the dataset's seven input factors.

ML models development
The database is divided using a random train-test split to apply ML modeling. The training-to-test data ratio was 75:25. The ratio of data instances to input variables has a significant impact on the accuracy and efficiency of the ML models 87,88. This ratio must be higher than 5 for the data to be considered reliable for developing strong correlations among variables 88,89. The requirement is comfortably satisfied by ratios of about twenty-one for the training data and seven for the test data in this study. It is crucial to choose the appropriate hyperparameters for each algorithm to improve the performance of all the models. The hyperparameters utilized for each model are listed in Table 3.
The hyperparameters listed in Table 3 are tuned using the grid search approach 90 along with tenfold cross-validation to improve the generalizability and robustness of the models.
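The split-then-tune workflow described above can be sketched with scikit-learn's `train_test_split` and `GridSearchCV` (synthetic 7-input data standing in for the literature dataset; the model and grid here are illustrative, not Table 3):

```python
# 75/25 random split, then grid search with 10-fold cross-validation
# on the training portion only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(14)
X = rng.uniform(0, 1, size=(200, 7))            # 7 inputs, as in the dataset
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.2, 200)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=14)
grid = GridSearchCV(
    RandomForestRegressor(random_state=14),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    cv=10, scoring="r2",
).fit(X_tr, y_tr)
print("best hyperparameters:", grid.best_params_)
print(f"held-out test R2: {grid.score(X_te, y_te):.3f}")
```

Keeping the test split out of the grid search is what lets the reported test metrics reflect genuine generalization.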

Statistical metrics for ML model evaluation
The performance of the ML models is also assessed using statistical metrics. These metrics determine the accuracy level of the models in different respects. The correlation between experimental and predicted results is measured using the correlation coefficient (R). An R above 0.8 is considered to indicate a significant and robust correlation between the experimental and model-predicted results 91. However, R has shown insensitivity towards division and multiplication of the given outcomes 92. Therefore, R2 is used due to its unbiased estimate and enhanced performance. An R2 value near one indicates that the maximum variance among the explanatory factors is captured 93. A Nash-Sutcliffe efficiency (NSE) above 0.65 is required for a model to show good efficacy 83. The significance of the root mean square error (RMSE) lies in its emphasis on larger error values relative to smaller ones 94. Nevertheless, RMSE might not be enough to guarantee ideal model performance in some circumstances. As a result, the mean absolute error (MAE) is also calculated. MAE gives greater weight to lower error values and works well with continuous and smooth data 95. In summary, greater NSE and correlation metrics (R2), together with lower error statistics (MAE, RMSE), indicate better model performance. When a model is overfitted to the training data, serious issues arise with machine learning approaches: testing error tends to increase while training error continues to decrease 96. The objective function (OF) was created as a fitness function in machine learning models to mitigate the impact of overfitting. The OF parameter is significant because it considers the RRMSE, R, and relative-percentage statistical features present in the training and testing datasets. According to the literature, the most accurate model is the one with the lowest OF value 97,98. Furthermore, a new engineering index, the a20-index, has recently been proposed
[99][100][101] for the reliability assessment of the developed ML techniques:

a20-index = m20 / M

where m20 is the number of samples having a ratio of experimental value to predicted value between 0.80 and 1.20, and M is the number of dataset samples. Note that a unit value of the a20-index is expected for a flawless prediction model. The a20-index indicates the proportion of samples whose predicted values fall within ±20% of the experimental values, which gives it a physical engineering significance. Table 4 provides details of the statistical parameters employed.
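The a20-index definition above translates directly into code. A sketch on a small hypothetical sample (the numbers are invented for illustration, not the study's data):

```python
# a20-index: fraction of samples whose experimental/predicted ratio
# falls within [0.80, 1.20].
import numpy as np

def a20_index(y_true, y_pred):
    ratio = np.asarray(y_true) / np.asarray(y_pred)
    m20 = np.sum((ratio >= 0.80) & (ratio <= 1.20))
    return m20 / len(ratio)

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_pred = np.array([1.1, 1.9, 3.0, 6.0, 4.8])   # one prediction off by >20%
print(f"a20-index: {a20_index(y_true, y_pred):.2f}")   # → 0.80
```

A perfect model scores exactly 1.0, matching the "unit value for a flawless prediction model" noted above.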

Regression analysis
The study employed an optimal mix of hyperparameters, as listed in Table 3, to create the machine learning models. A prediction model with a high R2 value is typically regarded as superior. Figure 17 presents the regression graphs for the ML model results, with the x-axis representing experimental values and the y-axis representing predicted outcomes. The linear fitting line represents a good fit between experimental and predicted outcomes. The XGB

Statistical analysis of results
The performance of the developed ML models on the statistical metrics is shown in Table 5. In the training phase, the XGB algorithm exhibits exceptional performance, achieving the best values across multiple metrics, including RMSE (0.318 min-1/cm2), R2 (0.932), and MAE (0.211 min-1/cm2). It also demonstrates impressive accuracy in the test phase, with an R2 of 0.937, RMSE of 0.450 min-1/cm2, and MAE of 0.263 min-1/cm2. Similarly consistent performance is given by DT, LR2, and SVR throughout the training and test phases, with low RMSE and MAE values. On the other end of the spectrum, ridge regression and linear regression are identified as the worst-performing algorithms. They exhibit lower R2 values in both the training and testing phases, indicating limited accuracy in predicting the target variable. XGB exhibits the most optimal values in terms of PI, VAF%, and OF, followed by DT, LR2, and SVR. Furthermore, the a20 and a10 indices are used, which have physical meaning and give the percentage of predictions with errors of less than 20% and 10%, respectively. Figure 18 shows the results for the a10 and a20 metrics in the train and test phases. XGB has the highest proportion of values within the 10% and 20% error ranges. Similar performance is shown by DT, LR2, SVR, and CB. On the other hand, LR1, RF, RR, and AB have low values for the a10 and a20 indices, indicating higher errors. XGB, DT, LR2, and SVR have 93% of errors within the 20% error range in the training phase. In the test phase, XGB has 92%, and DT, LR2, and SVR have 94%, of errors within the 20% range. Only KNN, RF, and LR1 have just over 70% of predictions within the 20% error range, while all other models have more than 80% of predictions within that range.

Regression error characteristic curve
In classification tasks, the receiver operating characteristic (ROC) curve is widely used as a tool for comparing and displaying classification results. Regression error characteristic (REC) curves were developed in the regression domain for a similar purpose 102-105. In a REC curve, the proportion of correctly predicted instances within a certain tolerance interval (y-axis) is plotted against the absolute deviation tolerance (x-axis). The curve thus shows the cumulative distribution function of the prediction errors, expressed either as absolute deviations or as squared residuals. The area under the curve (AUC) may be used to calculate the area over the REC curve (AOC), a biased estimate of the expected error, via AOC = 1 - AUC. An ideal regression model would have a curve that hugs the y-axis and an AOC as small as possible. The error is displayed using absolute deviation in Fig. 19(a) and (b) for the training and testing datasets, respectively. AOC values for the machine learning models are shown in Table 6; smaller AOC values represent better model performance. The XGB model performs exceptionally well during training (AOC = 0.038) and testing (AOC = 0.048).
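The REC curve and AOC described above reduce to a short NumPy computation. This is a sketch under our own naming conventions; the tolerance axis is normalised to [0, 1] before integrating so that AOC = 1 - AUC stays in [0, 1].

```python
import numpy as np

def rec_curve(y_true, y_pred, n_points=100):
    """REC curve: fraction of samples whose absolute error falls within
    each tolerance level (x = tolerance, y = accuracy)."""
    errors = np.abs(np.asarray(y_true, float) - np.asarray(y_pred, float))
    tol = np.linspace(0.0, errors.max(), n_points)   # assumes a non-zero max error
    acc = np.array([(errors <= t).mean() for t in tol])
    return tol, acc

def aoc(y_true, y_pred, n_points=1000):
    """Area over the REC curve via AOC = 1 - AUC, with the tolerance
    axis normalised so the result lies in [0, 1]."""
    tol, acc = rec_curve(y_true, y_pred, n_points)
    # Trapezoidal integration, then normalise by the tolerance span.
    auc = float(np.sum((acc[1:] + acc[:-1]) / 2.0 * np.diff(tol))) / tol[-1]
    return 1.0 - auc
```

A model whose errors concentrate near zero yields a curve that rises immediately (small AOC); uniformly large errors push the curve to the right (AOC near 1).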

Taylor Diagram
The Taylor diagram was originally developed by Karl E. Taylor for a comparative analysis of the performance metrics of models 106. The diagram incorporates three primary metrics: SD, R, and RMSE. R and SD measure how well the models' predictions align with the experimental data, whereas RMSE quantifies the discrepancies between predicted and actual values. An optimal model is indicated by an R value close to 1 and an SD that mirrors the SD of the experimental data. The Taylor diagram serves as an effective tool for assessing model accuracy by illustrating which model best aligns with the actual data, since multiple models or datasets can be displayed on a single plot.
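The three quantities plotted on a Taylor diagram can be computed as below. This sketch (our own helper, not the study's code) also shows the law-of-cosines relation that lets the diagram place SD, R, and the centred RMS difference on a single polar plot.

```python
import numpy as np

def taylor_stats(obs, pred):
    """Statistics plotted on a Taylor diagram: the two standard deviations,
    the correlation coefficient R, and the centred RMS difference."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    sd_obs, sd_pred = obs.std(), pred.std()
    r = np.corrcoef(obs, pred)[0, 1]
    # Centred RMSD: RMS of the anomaly (mean-removed) differences. It obeys
    # crmsd^2 = sd_obs^2 + sd_pred^2 - 2 * sd_obs * sd_pred * r,
    # which is the geometric identity underlying the diagram.
    crmsd = np.sqrt(np.mean(((pred - pred.mean()) - (obs - obs.mean())) ** 2))
    return sd_obs, sd_pred, r, crmsd
```

A perfect model sits on the horizontal axis at the observed SD (R = 1, centred RMSD = 0); the further a model's point is from that reference, the worse its agreement with experiment.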

Feature importance
The emerging field of explainable AI (XAI) brings transparency and helps users understand the decision-making process of an ML model. The effort addresses the issue of "AI black boxes", i.e., systems whose core operations are opaque, keeping engineers and practitioners in the dark about how results are obtained. Researchers and engineers push for openness in ML system operation, opposing such opacity. XAI aims to reveal the reasoning behind an ML model's decisions by removing layers of complexity. Feature importances are employed in this study to reveal the role of the input parameters in the final predictions. The absolute significance of each input parameter in the proposed GB, DT, CB, AB, RF, and XGB models is depicted in Fig. 22(a)-(f). The most important feature for estimating the photocatalytic degradation rate is D for GB, DT, CB, and RF, while it is I for AB and W for XGB. A significant influence of D, W, and I on the photocatalytic degradation rate is seen in all the top-performing models. Notably, the XGB model (Fig. 22(f)) emphasizes the importance of W, D, and R as critical components of the photocatalytic degradation rate. In RF, GB, AB, DT, and CB, dosage remains the most important feature, with weightages of 35%, 38.5%, 29.6%, 35.9%, and 26.6%, respectively, while in the XGB model W remains the most important, with 71% weightage. This also has a physical importance, as interpretable machine learning models shed light on the underlying mechanisms and demonstrate their potential as useful instruments for forecasting and identifying the critical variables influencing the TiO2 photocatalytic degradation rate. Previous researchers have used experimental techniques to assess the role of input parameters and experimental settings on the final output. Haghighatmamaghani et al. 108 studied the effect of the operating parameters of concentration, relative humidity, and residence time on removal efficiency and maintained that humidity conditions play an important role. Zhang et al. 109 researched the influence of humidity on photocatalytic degradation and tried to develop a relationship. Li et al. 32 used a TiO2 photocatalyst and drew conclusions on the degradation of organic pollutants.
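Feature importances of the kind discussed above can be extracted directly from any fitted tree ensemble in scikit-learn via the `feature_importances_` attribute. The sketch below uses a RandomForestRegressor on synthetic stand-in data; the feature names and coefficients are hypothetical illustrations, not the study's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
# Hypothetical stand-in inputs named after the study's parameters:
# dosage (D), humidity (W), UV intensity (I), initial concentration.
names = ["D", "W", "I", "InitialC"]
X = rng.uniform(size=(300, 4))
# D is given the strongest effect on the target, so it should dominate
# the impurity-based importances.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + 0.3 * X[:, 2] + rng.normal(0.0, 0.05, 300)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
for name, imp in sorted(zip(names, rf.feature_importances_), key=lambda p: -p[1]):
    print(f"{name}: {imp:.3f}")
```

Note that impurity-based importances sum to 1 and can be biased toward high-cardinality features; permutation importance is a common cross-check.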

Conclusions
In this study, thirteen common ML models are utilized to examine an extensive dataset and predict the photocatalytic degradation rate of TiO2. A variety of ML techniques are used, including linear approaches (linear, ridge, lasso, and support vector regression), decision trees, random forests, K-nearest neighbors, and boosting-based methods. All models are trained and tested using the given input parameters and rigorously evaluated. Feature importances are utilized to reveal the underlying physical phenomena and the decision-making process.
The main results are as follows.
• The XGB model has the highest R2 values among all the developed models.
• The influence of D, W, and I on the photocatalytic degradation rate is consistent across all top-performing models. The most crucial feature for estimating the photocatalytic degradation rate varies among the models, with D being significant for GB, DT, CB, and RF, and I for AB, while W is most important for XGB.
• XGB, DT, and LR2 are the most robust ML models for the prediction problem of TiO2 photocatalytic degradation of air contaminants.

In summary, this research study extensively evaluated thirteen ML techniques for the estimation of TiO2 photocatalytic degradation of air contaminants and proposed the most robust XGB, DT, and LR2 models for prediction. It is also essential to recognize the limitations of the study and recommendations for future work. The developed ML models are trained on the database input parameters, necessitating additional training, testing, and hyperparameter tuning for new inputs. More experimental factors should be studied to enhance the diversity and applicability of the models and to examine their role in TiO2 photocatalytic activity. New machine learning techniques should also be applied to the present database to evaluate the impact of the input parameters and the models' performance. The constructed ML models can also be used for prediction in various engineering problems.

Figure 18. Results of model performance on indicators with physical meaning: (a) a10 training, (b) a10 testing, (c) a20 training, (d) a20 testing.

Figure 19. Regression error characteristic curves for the employed models: (a) training phase, (b) testing phase.

Table 1. Statistical analysis of dataset.

Figure 16(a)-(d) shows that factors such as dosage, UV light intensity, humidity, and InitialC have a significant impact on the photocatalytic degradation rate. A progressive decrease in the photocatalytic degradation rate is evident with increasing D. In contrast, Fig. 16(b) and (e) show that factors such as humidity and reactor volume have a favorable effect on the photocatalytic degradation rate. A completely linear relationship between the input and output variables is improbable, as the measured photocatalytic degradation rate for each input variable spans a large range.

Scientific Reports | (2024) 14:13688 | https://doi.org/10.1038/s41598-024-64486-7

Table 2. VIF analysis of dataset.

Table 3. Hyper-parameter values of the proposed machine learning models.

The XGB model demonstrates the highest R2 values of 0.932 and 0.937 in the training and test phases, respectively. The DT, LR2, and SVR models also exhibit strong performance, achieving R2 values of 0.927 in the training phase and 0.924, 0.924, and 0.923 in the test phase, respectively. Conversely, ANN, RR, and LR1 display lower R2 values of 0.62, 0.63, and 0.31 in the training phase and 0.70, 0.56, and 0.40 in the test phase, respectively. The AB, VR, and CB models deliver average performance, with R2 values of 0.86, 0.86, and 0.82 in the training phase and 0.90, 0.93, and 0.93 in the test phase, respectively.

Table 4. Statistical parameters for performance evaluation. n = data points, Xi = experimental data, Yi = predicted data, X̄i = average experimental values, Ȳi = average predicted values, Training data, V = Testing data.

Score analysis
The score analysis method makes it easy to evaluate and compare the performance of the employed models and to visualize the top-performing models. The models are ranked according to each performance metric, with the best-performing model receiving the highest score and the worst-performing model the lowest. The total number of models (k) in this study is 13. The ranking procedure is carried out independently for the training and test phases, and the final score is the sum of the training-phase and test-phase scores. Table 7 presents a comprehensive overview of the score analysis, while Fig. 21 presents a radar map illustrating the outcomes. Upon analyzing the table and figure, XGB and DT were the two best models during training, with scores of 103 and 90, respectively. Their scores decreased slightly in the test phase, with XGB scoring 98 and DT scoring 88.
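The rank-and-sum procedure described above can be sketched as follows. The four-model, three-metric table is a hypothetical illustration (the real study ranks all thirteen models over many more metrics); only the ranking logic is the point here.

```python
import numpy as np

# Illustrative score analysis: rows = models, columns = metrics.
models = ["XGB", "DT", "LR2", "ANN"]
# Columns: R2, RMSE, MAE (numbers indicative only, not the study's table).
metrics = np.array([
    [0.932, 0.318, 0.211],
    [0.927, 0.340, 0.230],
    [0.927, 0.345, 0.235],
    [0.620, 0.700, 0.500],
])
# Flags whether a larger value is better (R2) or worse (RMSE, MAE).
higher_better = [True, False, False]

k = len(models)
scores = np.zeros(k, dtype=int)
for j, hb in enumerate(higher_better):
    order = np.argsort(metrics[:, j])  # model indices, ascending metric value
    ranks = np.empty(k, dtype=int)
    # Best model on this metric gets score k, worst gets 1.
    ranks[order] = np.arange(1, k + 1) if hb else np.arange(k, 0, -1)
    scores += ranks

for name, s in sorted(zip(models, scores), key=lambda p: -p[1]):
    print(f"{name}: {s}")
```

In the study this scoring is run separately on the training and test metric tables and the two phase totals are summed into the final score.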

Table 5. Results of model performance on statistical parameters.

Table 6. Results of REC curve.

Table 7. Score analysis of model performances.

• The XGB model achieved R2 values of 0.932 and 0.937 in the training and test phases, respectively. The DT, LR2, and SVR models showed good performance, with R2 values of 0.927 in the training phase and 0.924, 0.924, and 0.923 in the test phase, respectively, while ANN, RR, and LR1 showed the lowest R2 values of 0.62, 0.63, and 0.31 in the training phase and 0.70, 0.56, and 0.40 in the test phase, respectively. The AB, VR, and CB models gave average performance, with R2 values of 0.86, 0.86, and 0.82 in training and 0.90, 0.93, and 0.93 in the test phase; their performance improved significantly in the test phase.
• After the regression and statistical analyses, regression error characteristic curves were used to assess and visualize model performance. The XGB and DT models have the lowest AOC values of 0.037 and 0.041 in training and 0.040 and 0.051 in the test phase, respectively.
• XGB, DT, and LR2 achieved the highest scores of 103, 90, and 95, respectively, in training and 98, 88, and 84, respectively, in the test phase. ANN, RF, and LR1 showed the lowest scores of 38, 26, and 16 in training and 35, 57, and 26 in the testing phase.