Modeling crude oil pyrolysis process using advanced white-box and black-box machine learning techniques

Accurate prediction of fuel deposition during crude oil pyrolysis is pivotal for sustaining the combustion front and ensuring the effectiveness of in-situ combustion enhanced oil recovery (ISC EOR). Employing 2071 experimental TGA datasets from 13 diverse crude oil samples extracted from the literature, this study sought to precisely model crude oil pyrolysis. A suite of robust machine learning techniques, encompassing three black-box approaches (Categorical Gradient Boosting—CatBoost, Gaussian Process Regression—GPR, Extreme Gradient Boosting—XGBoost), and a white-box approach (Genetic Programming—GP), was employed to estimate crude oil residue at varying temperature intervals during TGA runs. Notably, the XGBoost model emerged as the most accurate, boasting a mean absolute percentage error (MAPE) of 0.7796% and a determination coefficient (R2) of 0.9999. Subsequently, the GPR, CatBoost, and GP models demonstrated commendable performance. The GP model, while displaying slightly higher error in comparison to the black-box models, yielded acceptable results and proved suitable for swift estimation of crude oil residue during pyrolysis. Furthermore, a sensitivity analysis was conducted to reveal the varying influence of input parameters on residual crude oil during pyrolysis. Among the inputs, temperature and asphaltenes were identified as the most influential factors in the crude oil pyrolysis process. Higher temperatures and oil °API gravity were associated with a negative impact, leading to a decrease in fuel deposition. On the other hand, increased values of asphaltenes, resins, and heating rates showed a positive impact, resulting in an increase in fuel deposition. These findings underscore the importance of precise modeling for fuel deposition during crude oil pyrolysis, offering insights that can significantly benefit ISC EOR practices.


Data gathering and preparation
In this study, 2071 experimental TGA findings related to 13 distinct crude oils were gathered from the literature 12,13,24,29,30,[33][34][35] in order to precisely represent crude oil pyrolysis, which is a crucial reaction in the ISC EOR process. The database used in this work is more comprehensive than the one used in Mohammadi et al.'s study 31 (i.e., 2015 TGA data for 11 distinct crude oils). Since the kind of crude oil affects how it is pyrolyzed, a variety of crude oils with the characteristics mentioned in Table 1 were chosen to serve as input data for our models.
For model training, factors identified in the literature as being important during the pyrolysis of crude oil 9,13,33,36,37 were taken into consideration. In this research, the models' input parameters included the temperature, heating rate, weight percentage of asphaltenes and resins, and oil °API gravity. Since these values are often accessible, there is a large enough database for training the models. The model's output was the residual mass of crude oil at various temperatures. Table 1 lists the characterizations of the crudes and the heating rates utilized in this study's simulation. Additionally, Table 2 lists the output parameter and statistical descriptions of every model input variable, and Fig. 1 visually depicts the distribution of all the arguments.
An asymmetrical distribution, in contrast to a symmetrical one, deviates from the regular, balanced pattern typically illustrated by a bell curve. Skewness quantifies this asymmetry: the skewness value is positive when the majority of the data lies on the left side of the probability function, and vice versa. Kurtosis, in turn, characterizes the distribution shape relative to the normal distribution; a positive kurtosis means the distribution has a sharper peak than the normal distribution 38 . According to the data in Table 2 and Fig. 1, the distribution and variation range of the input variables are broad enough to support a generic model for forecasting crude oil pyrolysis. It should be mentioned that oil °API gravity, heating rate, and especially asphaltene content have a number of outliers, which in turn influence the precision of the models. However, the vast majority of observations are located within the box borders, making the impact of the error term insignificant. Despite the presence of outliers in the data, a thorough examination confirms their validity, indicating that they merely differ statistically from the majority of the data. As evident from the modeling results, these outliers do not significantly skew errors during the modeling process. It is further observed that the residual crude oil and temperature data are provided in a continuous form with insignificant distances between observations, while the other parameters, including oil °API gravity, heating rate, asphaltene, and resin, exhibit considerable gaps. Moreover, the medians of the temperature and residual crude oil used to develop the models in this research are 385.71 and 40.22, respectively. The medians of the other parameters, namely resin, asphaltene, heating rate, and oil °API gravity, are 14, 9.66, 8, and 20.26, accordingly.
Figure 2 shows the correlation matrix of the input data. As demonstrated, the temperature accounted for the greatest influence on the mass estimate, defining around 92% of its behavior. It should be stated that the correlation is negative: the higher the temperature, the lower the residual mass, and vice versa. The other parameters have a much smaller effect on the target than temperature, yet they are essential for differentiating crude oil characteristics and modeling crude pyrolysis.
The dataset was randomly partitioned into a training set comprising 80% of the total data and a test set with the remaining 20%. Training used the training subset, while the test subset assessed the model's predictive performance. Here, K-fold cross-validation with K = 6 was implemented to ensure each observation had an equal chance of appearing in training and validation. This involved randomly splitting the training data into 6 folds, fitting the model on 5 folds, and validating it with the remaining fold, a practice tailored to our dataset size.
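The 80/20 split and 6-fold scheme described above can be sketched in NumPy as follows; the data arrays are placeholders standing in for the 2071-sample TGA dataset:

```python
import numpy as np

rng = np.random.default_rng(42)

# Placeholder stand-in for the 2071-sample TGA dataset (5 inputs, 1 target).
X = rng.random((2071, 5))
y = rng.random(2071)

# Random 80/20 train/test partition.
idx = rng.permutation(len(X))
n_train = int(0.8 * len(X))
train_idx, test_idx = idx[:n_train], idx[n_train:]

# 6-fold cross-validation over the training subset: each fold serves once
# for validation while the remaining 5 folds are used to fit the model.
folds = np.array_split(rng.permutation(train_idx), 6)
for k in range(6):
    val_idx = folds[k]
    fit_idx = np.concatenate([folds[j] for j in range(6) if j != k])
    # model.fit(X[fit_idx], y[fit_idx]); evaluate on X[val_idx], y[val_idx]
```

Every training observation appears in exactly one validation fold, which is what gives each sample "an equal chance" in training and validation.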

Model development
In this study, four different machine learning approaches were used to calculate the residual crude oil during pyrolysis. Among these techniques, one was of white-box nature and the others were of black-box origin. The flowchart in Fig. 3 depicts the general schematic of the research, showing the main steps of each stage.

Gaussian process regression (GPR)
GPR is a common nonparametric modeling approach that places a Gaussian process prior over the regression function 39 . It comprises the prior Gaussian process, solved using Bayesian inference, as well as the regression residual.
A Gaussian process is a collection of random variables that describes a distribution over functions. Since a real process f(x) is fully specified by a mean function m(x) and a covariance function k(x, x′), it can be expressed as 40 :

(1) f(x) ~ GP(m(x), k(x, x′))

The objective of GPR is to determine the mapping between the input vector x and the observation y for a given training dataset D = {(x_i, y_i)}, i = 1, ..., n, where x_i is the input vector of the ith sample and y_i is its observation value 41 :

(2) y = f(x) + ζ

where ζ is an additive noise term following a Gaussian distribution with zero mean and variance σ²_n. The covariance of the noisy observations y is then K(X, X) + σ²_n I_n, where y = [y_1, y_2, ..., y_n]^T, X = [x_1, x_2, ..., x_n]^T, K represents the covariance matrix, and I_n is the n-dimensional identity matrix. Therefore, the joint distribution of the training outputs and a test sample x* under the prior may be formed as 41 :

(3) [y; f(x*)] ~ N(0, [K(X, X) + σ²_n I_n, K(X, x*); K(x*, X), K(x*, x*)])

According to Eq. (3), the posterior mean and covariance of f(x*) may be written as 41 :

(4) mean[f(x*)] = K(x*, X)[K(X, X) + σ²_n I_n]⁻¹ y, cov[f(x*)] = K(x*, x*) − K(x*, X)[K(X, X) + σ²_n I_n]⁻¹ K(X, x*)
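The GPR posterior mean and covariance described above can be sketched directly in NumPy; the RBF kernel, noise level, and toy data below are illustrative assumptions, not the study's actual configuration:

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    """Squared-exponential covariance k(x, x') = exp(-||x - x'||^2 / (2 l^2))."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * length_scale**2))

def gpr_posterior(X, y, X_star, noise_var=1e-2):
    """Posterior mean and covariance of f(x*) for a zero-mean GP prior."""
    K = rbf_kernel(X, X) + noise_var * np.eye(len(X))  # K(X,X) + sigma_n^2 I_n
    K_s = rbf_kernel(X_star, X)                        # K(x*, X)
    K_ss = rbf_kernel(X_star, X_star)                  # K(x*, x*)
    K_inv = np.linalg.inv(K)
    mean = K_s @ K_inv @ y                 # posterior mean of f(x*)
    cov = K_ss - K_s @ K_inv @ K_s.T       # posterior covariance of f(x*)
    return mean, cov

# Toy 1-D data: with low noise, the posterior mean should sit close to y
# at the training points themselves.
X = np.array([[0.0], [1.0], [2.0]])
y = np.sin(X).ravel()
mean, cov = gpr_posterior(X, y, X)
```

In practice a Cholesky factorization replaces the explicit inverse for numerical stability, but the algebra is the same.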

Extreme gradient boosting (XGBoost)
Boosting refers to a family of learning techniques that improve the fit of the final model by combining base models with basic functions 42 . The composite of base models with fairly low individual precision 43 creates a scalable solution that can identify deep interactions and is less susceptible to anomalies 44 . The gradient boosting approach, which consists of an effective linear model solver and a tree learning algorithm, is utilized to develop the model. Several objective functions, including regression, classification, and ranking, are supported by the boosting method. XGBoost, a free software package, delivers cutting-edge solutions to a variety of challenges, notably climate projections 45,46 . With its scalable tree-boosting method, XGBoost runs more than 10 times faster than popular existing solutions on a single machine 44 . XGBoost includes several parameters, making it a complicated model. In addition, hyperparameters are required to limit the danger of over-fitting and forecast variability 47 . The number of iterations (n_estimators) and the learning rate are the two key hyperparameters that prevent overfitting in XGBoost. Here, n_estimators relates to the complexity of the model; raising this parameter may result in a more robust model, but it might still overfit to a certain extent. The number of iterations governs the degree of fit and so influences the optimal learning rate value, and conversely. Generalization effectiveness is often enhanced by reducing the learning rate, and a decreased learning rate may significantly enhance predictive accuracy 48 . The regularization term, proposed by Friedman 43 , helps users avoid overfitting and manages the model's complexity. Throughout the tuning procedure, regularization factors such as lambda and alpha should be adjusted to the required regularization weight in order to improve the quality of the model.
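To make the roles of n_estimators and the learning rate concrete, a minimal gradient-boosting loop with depth-1 regression stumps can be sketched in plain NumPy. This is an illustrative toy of the boosting idea, not the XGBoost implementation itself:

```python
import numpy as np

def fit_stump(x, residual):
    """Best single-threshold (depth-1) regression tree on one feature."""
    best = (np.inf, None, 0.0, 0.0)
    for t in np.unique(x)[:-1]:
        left, right = residual[x <= t], residual[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1], best[2], best[3]

def boost(x, y, n_estimators=100, learning_rate=0.1):
    """Additive ensemble: each stump fits the residuals of the current model,
    and its contribution is shrunk by the learning rate."""
    pred = np.full_like(y, y.mean())
    for _ in range(n_estimators):
        t, lv, rv = fit_stump(x, y - pred)
        pred += learning_rate * np.where(x <= t, lv, rv)
    return pred

x = np.linspace(0.0, 1.0, 200)
y = np.sin(2 * np.pi * x)
pred = boost(x, y)
rmse = np.sqrt(np.mean((y - pred) ** 2))
```

Raising n_estimators lets the ensemble fit finer structure (and eventually overfit), while a smaller learning rate slows each step and typically requires more iterations, which is exactly the trade-off discussed above.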

Categorical gradient boosting (CatBoost)
CatBoost is a machine learning technique founded on the gradient boosting decision tree (GBDT) that was developed by Yandex researchers in 2017 49,50 . Through ordered boosting, it enhances GBDT, ensures that all of the dataset may be utilized for training and learning, and decreases over-fitting during training 51 . Owing to its strong effectiveness, CatBoost has been employed in various sectors, notably driving style identification 52 and diabetes diagnosis 53 . The traditional GBDT method substitutes each categorical feature with the average label value associated with that category, and in a decision tree this mean label value is used as the segmentation criterion for nodes. This technique is referred to as greedy target-based statistics (greedy TBS) 49 . In general, however, a feature carries more information than the label alone, and forcibly representing a category by its mean label value introduces conditional shift. Given a set of observations D = {(X_i, Y_i)}, i = 1, ..., n, and a permutation σ = (σ_1, ..., σ_n), the categorical value may therefore be replaced with a prior-smoothed estimate of the form (sum of labels in the category + a·P) / (category count + a) 49 , where P is the prior and a > 0 is its weight. The addition of the prior reduces the noise produced by low-frequency categories.
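The prior-smoothed target-statistics encoding can be sketched as follows; the category and target arrays, the choice of the global mean as prior P, and the weight a are illustrative assumptions:

```python
import numpy as np

def greedy_tbs(categories, targets, a=1.0):
    """Replace each categorical value with a prior-smoothed mean target:
        encoded = (sum of targets in category + a * P) / (count + a),
    where P is the prior (here the overall target mean) and a > 0 its weight."""
    P = targets.mean()
    encoded = np.empty(len(targets), dtype=float)
    for c in np.unique(categories):
        mask = categories == c
        encoded[mask] = (targets[mask].sum() + a * P) / (mask.sum() + a)
    return encoded

cats = np.array(["A", "A", "B", "B", "C"])
tgts = np.array([1.0, 0.0, 1.0, 1.0, 0.0])
enc = greedy_tbs(cats, tgts)
```

With a = 0 this collapses to the plain greedy TBS mean; a larger a pulls rare categories toward the prior, which is how the smoothing suppresses noise from low-frequency categories.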

Genetic programming (GP)
GP is a frequently used method in evolutionary computing 54,55 . GP may be used to locate globally optimal solutions in a wrapped search space, and it can generate optimization algorithms motivated by Darwin's theory of evolution 56 . GP employs an evolutionary path including selection, crossover, mutation, and cloning procedures to seek syntactic expressions that best capture the relationship between a set of independent (input) and dependent (output) variables 57 . GP is capable of optimizing the model structure on its own, its results are symbolic in nature, and its representation is adaptable. These qualities make GP an excellent method for symbolic regression. GP-evolved solutions, in contrast, provide robust interpretability in terms of how features are learned or retrieved from the signals and how they influence categorization 58 .
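A GP individual is typically a syntax tree over primitive operators; evaluating such a tree and scoring its fitness against data can be sketched as below (the encoding, operator set, and toy target are illustrative assumptions, and the selection/crossover/mutation loop is omitted):

```python
import numpy as np

# A candidate GP individual encoded as a nested tuple syntax tree:
# ("add", ("mul", "x", "x"), ("const", 1.0))  represents  x*x + 1
OPS = {"add": np.add, "sub": np.subtract, "mul": np.multiply}

def evaluate(tree, x):
    """Recursively evaluate a syntax tree on the input array x."""
    if tree == "x":
        return x
    if tree[0] == "const":
        return np.full_like(x, tree[1])
    return OPS[tree[0]](evaluate(tree[1], x), evaluate(tree[2], x))

def fitness(tree, x, y):
    """Fitness = mean squared error between the tree's output and the
    targets; GP's evolutionary loop seeks trees that minimize this."""
    return np.mean((evaluate(tree, x) - y) ** 2)

x = np.linspace(-1.0, 1.0, 50)
y = x * x + 1.0
good = ("add", ("mul", "x", "x"), ("const", 1.0))
bad = ("sub", "x", ("const", 1.0))
```

Because the individual is itself a readable expression, the winning tree can be printed directly as a closed-form correlation, which is precisely what makes GP a white-box technique.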

Model optimization and tuning
Optimal hyperparameter selection is crucial for algorithm performance. Tuning these parameters fine-tunes the model, significantly impacting accuracy and ensuring the algorithm is well-suited to the specific characteristics of the data, ultimately enhancing its predictive capabilities 59 . In constructing each model and addressing overfitting, grid search was utilized to optimize the hyperparameters. The hyperparameters selected for each model differed, as summarized in Table 3.
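Grid search simply evaluates every candidate hyperparameter combination with cross-validation and keeps the best. A minimal sketch over a ridge-regression penalty follows; the model, the grid, and the synthetic data are illustrative assumptions, not the grids actually used in this study:

```python
import numpy as np
from itertools import product

def ridge_fit_predict(X_tr, y_tr, X_val, alpha):
    """Closed-form ridge regression: w = (X^T X + alpha I)^-1 X^T y."""
    d = X_tr.shape[1]
    w = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(d), X_tr.T @ y_tr)
    return X_val @ w

def grid_search(X, y, grid, n_folds=6):
    """Return the hyperparameter value with the lowest mean CV error."""
    folds = np.array_split(np.arange(len(X)), n_folds)
    best_alpha, best_err = None, np.inf
    for (alpha,) in product(grid["alpha"]):
        errs = []
        for k in range(n_folds):
            val = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            pred = ridge_fit_predict(X[tr], y[tr], X[val], alpha)
            errs.append(np.mean((pred - y[val]) ** 2))
        if np.mean(errs) < best_err:
            best_alpha, best_err = alpha, float(np.mean(errs))
    return best_alpha, best_err

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=120)
alpha, err = grid_search(X, y, {"alpha": [0.01, 0.1, 1.0, 10.0]})
```

With several hyperparameters the same nested loop runs over the Cartesian product of all grids, which is why grid search grows expensive quickly and why each model's grid here was kept model-specific.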

Evaluation of models
The accuracy of the suggested models was evaluated using seven statistical indicators: MAPE, SD, RMSE, R², MAE, MBE, and NSE. These indicators were selected because they are commonly considered the most representative and effective ones in statistics and machine learning. The measures are described below 60 . The mean absolute percentage error is MAPE (%) = (1/N) Σ |(y_exp − y_pred)/y_exp| × 100, and the standard deviation (SD) and root mean square error (RMSE) follow their usual definitions, where N is the number of data points, y_exp refers to the experimental data, and y_pred stands for the data predicted by the presented models.

Mean absolute error (MAE):
This metric is a risk measure corresponding to the expected value of the absolute error loss, or l1-norm loss. If ŷ_i is the predicted value of the ith sample and y_i is the matching real value, the MAE over n samples is given by MAE = (1/n) Σ |y_i − ŷ_i|. Mean bias error (MBE): this parameter quantifies the average bias of a forecast and is computed as the mean of the signed errors. Nash-Sutcliffe efficiency (NSE): a normalized measure that compares the residual variation (or "noise") to the variation of the observed data.
Here, y_o represents the mean of the observed data, while y_m signifies the simulated data. Additionally, y_o^t denotes the observed data at time instant t.
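For reference, the listed metrics can be sketched in NumPy as follows; the MBE sign convention (predicted minus experimental) is an assumption, since the paper's equation image is not reproduced here:

```python
import numpy as np

def mape(y_exp, y_pred):
    """Mean absolute percentage error, %."""
    return np.mean(np.abs((y_exp - y_pred) / y_exp)) * 100

def rmse(y_exp, y_pred):
    """Root mean square error."""
    return np.sqrt(np.mean((y_exp - y_pred) ** 2))

def mae(y_exp, y_pred):
    """Mean absolute error (l1-norm loss)."""
    return np.mean(np.abs(y_exp - y_pred))

def mbe(y_exp, y_pred):
    """Mean bias error: average signed error (sign convention assumed)."""
    return np.mean(y_pred - y_exp)

def nse(y_obs, y_sim):
    """Nash-Sutcliffe efficiency: 1 - residual variance / observed variance."""
    return 1.0 - np.sum((y_obs - y_sim) ** 2) / np.sum((y_obs - y_obs.mean()) ** 2)

y_exp = np.array([10.0, 20.0, 30.0])
y_pred = np.array([11.0, 19.0, 30.0])
```

Note that MBE can be near zero even for a poor model when positive and negative errors cancel, which is why it is reported alongside MAE and RMSE rather than on its own.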
In combination with the statistical metrics, graphical analyses were utilized to verify the accuracy of the models. A brief summary of what these graphical analyses imply follows 61 . The error distribution plot shows the percent relative error (E_i), generated using the given equations and plotted against the experimental findings or variable. This graph illustrates the error pattern and the distribution of the estimated E_i values around the zero-error axis.
In a cross plot of predictions against experimental values, the deviation of the data points from the Y = X line reflects the correctness of the model; the fewer points that stray from this line, the more effective the model.
A graph of cumulative frequency vs absolute relative error (E a ) displays the accuracy of the model in anticipating any percentage of data.E a is computed using the following equation:

Developed correlation
For the GP algorithm, the following correlation, which can accurately predict the target parameter, was developed. To optimize the model, a thorough grid search was performed to find the optimal population size, tree depth, tree length, maximum number of generations, etc. As a result, a comprehensible equation consisting of 3 input parameters and 10 additional coefficients was established.

Statistical evaluation of models
According to the statistical analysis provided in Table 4, the XGBoost model provides the highest accuracy and reliability in terms of all the indicators. Its R² and NSE are almost equal to 1, and its RMSE, SD, MAPE, MBE, and MAE are extremely small for the testing and training sets as well as the whole dataset. The GP approach proved to be the weakest among the four developed techniques, yet its precision is still quite high, with an R² of 0.9820 for all portions of the data. The middle positions are held by GPR and CatBoost. Both perform rather well, with RMSE, SD, MBE, and MAE extremely close to zero and R² above 0.99; however, GPR is better on all parameters except MBE. All of these models outperform previously published models and correlations in terms of forecast precision. Summing up the statistical analysis, the models rank from best to weakest as follows: XGBoost, GPR, CatBoost, and GP.

Graphical evaluation of models
In this regard, the graphical assessment of the models' results was performed first by displaying the cross-plots of the algorithms' outcomes versus the real data points, as shown in Fig. 5. Based on these diagrams, the spread of data points forms a line with a unit slope, indicating that the predicted and target data points in all models except GP are in excellent conformity. Despite sharing a unit slope, however, XGBoost, GPR, and CatBoost differ from one another.
As seen in the figures, GPR and CatBoost have a number of insignificant outliers, whereas the XGBoost line is the smoothest of all. A residual graph depicts the residuals along the vertical axis and the independent variable along the horizontal axis, the residual being the discrepancy between the reported and predicted values. According to the visual materials in Fig. 6, the best performance should be attributed to XGBoost, which possesses the smallest y-axis range (from approximately −1 to around 1) and the fewest outliers. GP is the least accurate, with residuals spread from −20 to 20. GPR and CatBoost are somewhat similar, both having the same range containing the majority of observations, yet the outliers of CatBoost make it less precise than GPR.
A histogram of the error distribution shows the probability of each error magnitude for a point prediction. Based on Fig. 7, the distributions are highly centered, with little deviation in all of the approaches employed. The majority of the observations sit at zero relative error for both training and testing. However, XGBoost again leads the assessment, as its relative error spread is the lowest, ranging from roughly −0.14 to 0.08 for both training and testing.
Figure 8 illustrates the relative deviation graph of the created models' results. The horizontal axis of this graph represents the experimental values, while the vertical axis represents the relative deviation of the model results from the experimental values. This graph demonstrates that the relative deviations of the suggested models are generally spread around the zero-deviation line, indicating that the models can predict the target data with tolerable error rates. As in all the cases, XGBoost is the most effective of the techniques utilized, with the smallest relative error variation of around −0.14 to 0.08. The GP range is the greatest, spanning roughly −0.5 to 0.5, which indicates its lower accuracy.
A cumulative frequency curve shows the share of values falling within successive absolute relative error intervals. As depicted in Fig. 9, XGBoost, GPR, and CatBoost are the most effective methods for predicting the correct value of the target parameter, as the relative error of 90% of the data does not exceed 10-15%. The best precision belongs to XGBoost, for which the relative error of around 99% of the data is roughly 5-7%. GP performs worse than the black-box models, as shown by the graph. Although the lower accuracy of the developed correlation is obvious and even predictable from the beginning, its advantage is fast prediction without the need for artificial-intelligence-related knowledge, which is usually required to use black-box models.
In comparing this study with previous research, it can be asserted that a dataset comprising 2071 experimental TGA findings for 13 distinct crude oil samples was harnessed in this study, ensuring a comprehensive foundation for the modeling approach. This dataset represents the most extensive compilation utilized for modeling crude oil pyrolysis to date. The application of advanced machine learning techniques led to the development of models with high accuracy. Specifically, the XGBoost model achieved an overall MAPE of 0.7796% and an R² of 0.9999, signifying a remarkable level of precision. This result compares favorably with prior investigations. Past studies in this domain have also sought to model crude oil pyrolysis and predict fuel deposition. Notably, Rasouli et al. 29 developed a multilayer perceptron model with a 3.5% error for the pyrolysis of 6 crudes, Norouzpour et al. 30 employed a radial basis function neural network with a 5.8% error for the pyrolysis of 6 crudes, and Mohammadi et al. 32 utilized a generalized regression neural network with a 1.04% error for the pyrolysis of 11 crudes. While these models showcased respectable performance, our current study not only extends the dataset size but also harnesses a variety of machine learning techniques, enhancing accuracy and robustness in modeling. Furthermore, this study introduces a straightforward mathematical correlation that achieves remarkable accuracy with a mere 9.73% error. Formulating a coherent correlation between input and output datasets proves challenging in opaque methodologies. The application of black-box models demands sophisticated computer systems and specialized expertise, constraining widespread accessibility. Consequently, the development of user-friendly mathematical correlations using advanced white-box algorithms can streamline the prediction of fuel formation during crude oil pyrolysis, offering rapid and precise predictions without the necessity for specialized tools.

Trend analysis
Lastly, Figs. 10, 11 and 12 illustrate how the XGBoost model predicts residual crude oil during pyrolysis as a function of temperature for various crudes and heating rates. It should be mentioned that the oil samples chosen for each graph were drawn from the same study to guarantee that the TG experimental settings were identical. Table 1 provides a summary of the heating rates and characteristics of these crude oils. The TG curves of several crudes (Oils 1, 2, 3, 5, and 6) with respect to temperature are shown in Fig. 10a,b. As shown in Fig. 10, the XGBoost model successfully estimates the experimental trend for various heating rates and oils. Because crude oils have diverse constituents, their TG curves are likewise distinct. Heavy crudes often leave more residue because they contain more asphaltenes. In this instance, the suggested XGBoost model accurately recognizes the TG curve trend and forecasts the quantity of residue for each crude oil sample at various temperatures.
Figure 11 displays the TG curves for crude sample #5 at three diverse heating rates. As shown in Fig. 11, when the heating rate decreases, the crude oil's TG curve shifts to the left as a consequence of a longer exposure to heat. The XGBoost model successfully predicts the experimental trend and tracks the influence of the heating rate.
Figure 12 displays the TG curves for crude samples #4 and #7 at the same heating rate of 10 °C/min. As shown in Fig. 12, the XGBoost model successfully predicts the experimental trend.

Sensitivity analysis
To assess the comparative significance of the input variables on residual crude oil, the relevance factor (r) and the XGBoost model results were utilized. The following equation is used to calculate the r value for each input parameter 32,62 :

r = Σ_j (inp_{i,j} − inp_{m,i})(σ_j − σ_m) / sqrt( Σ_j (inp_{i,j} − inp_{m,i})² · Σ_j (σ_j − σ_m)² )

where σ_m is the average value of the calculated residual crude oil and σ_j is the jth value of the assessed crude oil residue; inp_{i,j} and inp_{m,i} are the jth and average values of the ith input parameter, respectively, the input parameters being oil °API gravity, heating rate, resins, asphaltenes, and temperature. Figure 13 depicts the relative effect and relevance of the input parameters on residual crude oil. As seen, the largest impact in the XGBoost model is attributed to temperature, with a significance of approximately −0.92. All other parameters, namely resin, asphaltene, heating rate, and oil °API gravity, are far less influential, each with an importance below 0.16. Overall, among the mentioned inputs, temperature and asphaltenes have the highest influence on the crude oil pyrolysis process. In addition, temperature and oil °API gravity had negative impacts on fuel deposition, while asphaltenes, resins, and heating rates had positive impacts during crude oil pyrolysis. This means that the higher the asphaltene and resin content of a crude, the higher the amount of fuel (coke) formation.
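The relevance factor is the Pearson correlation between each input and the predicted residue, which can be sketched directly; the monotone toy "temperature vs residue" data below is an assumption for illustration:

```python
import numpy as np

def relevance_factor(inp, out):
    """r = sum((inp_j - inp_mean)(out_j - out_mean)) /
           sqrt(sum((inp_j - inp_mean)^2) * sum((out_j - out_mean)^2))"""
    di = inp - inp.mean()
    do = out - out.mean()
    return np.sum(di * do) / np.sqrt(np.sum(di**2) * np.sum(do**2))

# Toy example: a residue that decreases linearly with temperature
# should yield a relevance factor of -1 (perfect negative relevance).
temperature = np.linspace(100.0, 600.0, 50)
residue = 100.0 - 0.15 * temperature
r = relevance_factor(temperature, residue)
```

The sign of r carries the direction of influence (negative for temperature and °API gravity here) while |r| carries its strength, which is how Fig. 13 should be read.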
The high negative impact of temperature on fuel deposition is attributed to the fundamental principles of pyrolysis. Elevated temperatures promote the thermal cracking and vaporization of hydrocarbons in crude oil, leading to a substantial reduction in the mass of residual crude oil. This behavior is consistent with the well-established pyrolysis process. Asphaltenes and resins are complex, high-molecular-weight components of crude oil. They tend to break down and contribute to coke formation during pyrolysis. Their positive impact on fuel deposition can be attributed to their transformation into solid carbonaceous residues, which enhance the overall fuel available for sustaining the combustion front in ISC. Overall, an increase in asphaltene and resin content results in a reduction in mass loss during the pyrolysis of crude oil, consequently leading to increased fuel deposition. While the heating rate is essential in governing the speed of the temperature increase, its impact is relatively low in this model. With an escalation in heating rate, the TG curve of crude oil shifts to the right, signifying an increase in the mass of residual crude oil; this result is linked to the reduced exposure time of the crude oil to heat. Finally, oil °API gravity, with its lower significance, implies that its effect on fuel deposition is less pronounced. Typically, heavier crude oils characterized by lower °API gravity tend to leave more residue, primarily due to a higher concentration of asphaltenes. In summary, the technical reasons for these sensitivity analysis outcomes are rooted in the complex chemistry of crude oil pyrolysis. Understanding the behavior of these parameters can aid in optimizing ISC processes and improving the recovery of unburned oil.

Conclusions
Crude oil pyrolysis analysis through TGA runs offers insights into fuel deposition during ISC EOR. This study aimed to precisely model crude oil pyrolysis by leveraging 2071 experimental TGA datasets obtained from literature sources. A suite of robust machine learning techniques, encompassing three black-box approaches (CatBoost, GPR, and XGBoost) and a white-box approach (GP), was employed to estimate crude oil residue at varying temperature intervals during TGA runs. Among the developed models and the mathematical correlation, the XGBoost model exhibited exceptional precision, achieving an overall MAPE of 0.7796% and an R² of 0.9999. Following the XGBoost model, the GPR, CatBoost, and GP models provided the next best results, respectively. Notably, the GP model, despite displaying a slightly higher error compared to the black-box models, provided satisfactory results, making it a viable option for rapid estimation of crude oil residue during pyrolysis. Moreover, a sensitivity analysis was conducted to explore the relative impact and significance of the inputs on residual crude oil during pyrolysis. Among these inputs, temperature and asphaltenes were identified as the most influential factors in the crude oil pyrolysis process. Temperature and oil °API gravity had a negative impact, leading to a decrease in fuel deposition, whereas increased values of asphaltenes, resins, and heating rates had a positive impact, resulting in an increase in fuel deposition.

Figure 1 .
Figure 1. The box plots of the parameters employed in the research.

Figure 2 .
Figure 2. Correlation matrix of inputs and target data used in this study.

Figure 4
Figure 4 represents the schematic of the GP employed for estimating the residual crude oil during pyrolysis.

Figure 4 .
Figure 4. Schematic of the GP model developed in this study.

Figure 5 .
Figure 5. Cross plots for each of the testing and training datasets for target prediction.

Figure 6 .
Figure 6. Residual plots for training and testing datasets.

Figure 7 .
Figure 7. Histograms of error distribution for all developed algorithms.

Figure 11. Figure 12.
Figure 11. Experimental TGA data of crude oil #5 and XGBoost results for different heating rates.

Table 1 .
Parameters of different crude oils utilized in the research.

Table 2 .
The statistical specifications of the input and target parameters of models.
In the traditional GPR (CGPR), the entire training dataset D = {(x_i, y_i)}, i = 1, ..., n, is utilized to develop the nonparametric model and to compute the prediction for a given test sample. The dimension of the covariance matrix K in CGPR is n × n.

Table 3 .
Optimal features for implemented models.

Table 4 .
The statistical errors of the proposed models in train, test, and total data sets.