Modeling interfacial tension of surfactant–hydrocarbon systems using robust tree-based machine learning algorithms

Interfacial tension (IFT) between surfactants and hydrocarbons is one of the important parameters in petroleum engineering for a successful enhanced oil recovery (EOR) operation. Measuring IFT in the laboratory is time-consuming and costly. Since accurate estimation of IFT is of paramount significance, modeling with advanced intelligent techniques has been used as a proper alternative in recent years. In this study, the IFT values between surfactants and hydrocarbons were predicted using tree-based machine learning algorithms. Decision tree (DT), extra trees (ET), and gradient boosted regression trees (GBRT) were used to predict this parameter. For this purpose, 390 experimental data points collected from previous studies were used to implement the intelligent models. Temperature, normal alkane molecular weight, surfactant concentration, hydrophilic–lipophilic balance (HLB), and phase inversion temperature (PIT) were selected as the model inputs (independent variables), and the IFT between the surfactant solution and normal alkanes was selected as the output (dependent variable). The implemented models were evaluated using statistical analyses and graphical methods. The results showed that DT, ET, and GBRT could predict the data with average absolute relative error values of 4.12%, 3.52%, and 2.71%, respectively. The R-squared of all implemented models is higher than 0.98, and for the best model, GBRT, it is 0.9939. Furthermore, sensitivity analysis using the Pearson approach was performed to determine the correlation coefficients of the input parameters; it demonstrated that PIT, surfactant concentration, and HLB had the greatest effect on IFT, in that order. Finally, the GBRT model was statistically validated using the Leverage approach.


Scientific Reports | (2023) 13:10836 | https://doi.org/10.1038/s41598-023-37933-0

• Collecting an extensive database of IFT of surfactant–hydrocarbon systems, including important parameters such as HLB and PITx, which have a significant impact on the better characterization of surfactants.
• Developing accurate models with low error using robust tree-based machine learning algorithms.
• Performing sensitivity analysis to detect the relative effect of temperature, normal alkane molecular weight, surfactant concentration, HLB, and PITx on the IFT of surfactant–hydrocarbon systems.
• Implementing the Leverage method to identify suspicious and outlier data related to the IFT of surfactant–hydrocarbon systems reported in the literature.

Data gathering
In order to develop the models, 390 sample data points were collected from previous studies [68][69][70][71][72][73][74][75][76]. The dataset contains temperature, normal alkane molecular weight, surfactant concentration, HLB, and PITx. In this paper, five different surfactants were used, and their specifications are presented in Table 1. Also, n-hexane, n-heptane, n-octane, n-nonane, n-decane, n-undecane, n-dodecane, n-tetradecane, and n-heptadecane were used as normal alkanes. HLB and PITx were used to represent the type of surfactant. The HLB value quantifies the hydrophilicity and lipophilicity of a surfactant 77,78. Researchers have developed various methods for calculating the HLB 79,80. In this study, the method introduced by Davies (1957) 80 was used, which calculates the HLB from group numbers as follows:

HLB_Davies = 7 + Σ(hydrophilic group numbers) − Σ(lipophilic group numbers)    (2)

The hydrophilic and lipophilic group numbers are obtained from the tables provided by Davies (1957) 80. The HLB values for the surfactants used in this study are presented in Table 1. Another parameter used to characterize a surfactant is the PIT, which depends on the structure of the surfactant 76. For better visualization, the structures of the surfactants utilized in this work are depicted in Fig. 1. The PITx values of the five surfactants used in this study are presented in Table 1. Moreover, the statistical parameters of the databank used in this work are represented in Table 2. The collected data were divided into training and testing subsets.

Decision tree (DT). One widely used DT algorithm is the classification and regression tree (CART), which accepts both discrete and continuous features. The CART is very similar to the C4.5 algorithm; the difference between the two is that CART does not generate rules and can also solve regression problems 84. In this study, an optimized version of the CART algorithm was used.
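Davies' group-contribution formula above can be sketched in a few lines of Python. This is only an illustration, not the authors' code; the SDS group numbers shown (38.7 for the sulfate head, −0.475 per CH2/CH3 unit) are the commonly tabulated Davies values and should be checked against the original tables before use.

```python
def hlb_davies(hydrophilic_groups, lipophilic_groups):
    """Davies (1957) HLB: 7 + sum(hydrophilic numbers) - sum(lipophilic numbers)."""
    return 7 + sum(hydrophilic_groups) - sum(lipophilic_groups)

# Illustrative example for SDS: one -SO4Na head (Davies group number ~38.7)
# and a C12 tail (12 CH2/CH3 units, each contributing -0.475).
hlb_sds = hlb_davies([38.7], [0.475] * 12)
print(round(hlb_sds, 1))  # 40.0
```

The same helper works for any surfactant once its structural groups and the corresponding Davies group numbers are known.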
Dividing nodes during training is one of the most critical parts of implementing a DT algorithm. CART uses the Gini coefficient to divide the nodes 85. The DT implemented in this study consists of four stages 85. In the first stage, the DT grows through the division of nodes. After dividing the training data into two parts, it divides these subsets again with the same logic, and so on. The DT greedily searches for an optimal division: it repeats the data segmentation at each step and does not check whether the split leads to less impurity in later steps. Each node is assigned a predicted target based on the distribution of the target over the sample data in that node. This process continues until no partition can be found that reduces the impurity; when the tree reaches its maximum state, the DT's growth stops. It is necessary to optimize the maximum depth value for this algorithm. In the second stage, after the tree's maximum growth, the process of building the tree stops. At this stage, the DT may not accurately predict the target value on the test data. The third stage involves pruning and simplifying the tree, which improves its predictions. In the fourth stage, the best tree is selected from the pruned trees. The DT continuously divides the data into smaller sections in the same manner to homogenize the data within each partition. The splitting rules may be set to optimize a criterion related to the target's predictive value, or to minimize local node impurity of the dependent variable over the training set.
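As a minimal sketch (not the authors' code), a CART regression tree of the kind described above could be fit with scikit-learn. The synthetic data below merely mimic the shape of the databank (390 points, five inputs); only max_depth = 7 comes from the paper (Table 3), and all other values are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the 390-sample IFT databank; the real inputs are
# temperature, n-alkane molecular weight, surfactant concentration, HLB, PIT.
rng = np.random.default_rng(0)
X = rng.uniform([290, 86, 0.0, 10, 280], [350, 240, 10.0, 41, 330], size=(390, 5))
y = 50 - 3.0 * X[:, 2] + 0.02 * X[:, 4] + rng.normal(0, 0.5, size=390)  # toy IFT, mN/m

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# CART regression tree (Gini/MSE-style impurity splitting); max_depth=7 mirrors
# the optimum reported in Table 3, the other growth controls are defaults.
dt = DecisionTreeRegressor(max_depth=7, min_samples_split=2,
                           min_samples_leaf=1, random_state=0)
dt.fit(X_tr, y_tr)
print(f"test R^2 = {dt.score(X_te, y_te):.3f}")
```

Limiting max_depth plays the role of the stopping/pruning stages discussed above: it caps tree growth before the model memorizes the training set.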
Identifying the number of training data points for the DT is a significant issue, because a small sample will cause over-fitting. Adding a level to the tree may increase the number of samples required to learn the DT, so the size of the tree should be controlled to prevent over-fitting 86. The main parameters for DT optimization are the tree depth, the minimum samples to split a node, and the minimum samples per leaf. Ensemble methods can prevent over-fitting of the DT algorithm 87, 88.

Extra trees (ET). The ET creates a stronger model by combining several decision trees and is one of the ensemble methods. In the ET method, nodes are divided completely at random by selecting the cutting points. Each DT grows independently, and all learning samples are used to grow the trees. The predicted target values are then aggregated for the final prediction: the final answer is the arithmetic mean of the predictions obtained from the base models 89. The ET algorithm builds an ensemble model based on explicit randomization of cutting points and feature combinations, combined with averaging. Also, using all of the learning data to build the base models minimizes the bias of the final model 90. Assuming the trees are balanced, the complexity of tree growth in the ET algorithm is similar to other tree-growing methods 89. This algorithm has three parameters: N_min, the minimum sample size required to split a node; K, the number of randomly selected attributes at each node; and M, the number of trees used as base models. These parameters must be optimized to develop a more robust ET model, and each has a different effect. The value of N_min affects the sensitivity of the model to output noise: the larger the value of N_min, the smaller the trees, so the variance decreases and the bias increases.
The minimum sample size for node splitting should be optimized according to the amount of noise in the model's output; in regression problems, high noise levels lead to over-fitting. Geurts et al. 89 suggested using a higher value of N_min to build a stronger model when the data are noisier; in other words, to optimize ET under high-noise conditions, N_min should be increased. The number of selected attributes, K, determines the strength of the attribute-selection process. The maximum value of K is the number of input features of the model. A low value of K increases the randomness of the trees and makes their structure less dependent on the target values of the learning samples. If K is set to 1, the divisions are selected completely independently of the target variable; if K equals the number of features in the learning data, the features are no longer selected randomly, and the randomization comes only from the selection of cut points 89. A schematic structure of the ET algorithm is illustrated in Fig. 3.
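The three ET parameters above map directly onto scikit-learn's ExtraTreesRegressor. The sketch below is an illustration under assumed parameter values (only the mapping M → n_estimators, K → max_features, N_min → min_samples_split reflects the text), again with synthetic data standing in for the real databank.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform([290, 86, 0.0, 10, 280], [350, 240, 10.0, 41, 330], size=(390, 5))
y = 50 - 3.0 * X[:, 2] + 0.02 * X[:, 4] + rng.normal(0, 0.5, size=390)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

et = ExtraTreesRegressor(
    n_estimators=100,      # M: number of base trees, averaged for the prediction
    max_features=3,        # K: attributes drawn at random for each split
    min_samples_split=5,   # N_min: minimum sample size to split a node
    bootstrap=False,       # ET uses the full learning set for every tree
    random_state=0)
et.fit(X_tr, y_tr)
print(f"test R^2 = {et.score(X_te, y_te):.3f}")
```

Setting bootstrap=False reflects the point made above: unlike bagging, ET trains every tree on all learning samples and injects randomness only through split selection.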

Gradient boosted regression trees (GBRT).
Boosting is another ensemble method that combines several weak learners to create a stronger model for target prediction 85. It can be used to solve both regression and classification problems. The weak learners are trained one after another, each focusing on correcting the errors of the previous step 91. In this study, the GBRT was used, in which a DT is defined as the base learner.
In a modeling problem, one has a system consisting of a set of random "explanatory" or "input" variables x = {x1, …, xn} and a random "response" or "output" variable y. The goal is to construct a function F*(x) that relates y to x such that the expected value of a specified loss function Ψ(y, F(x)) is minimized. In gradient boosting, F(x) is built as an additive expansion

F(x) = Σ β_m h(x; a_m),  m = 0, …, M

where h(x; a_m) is a simple function of x defined as a base learner. The expansion coefficients {β_m} and the parameters {a_m} are jointly fit to the training data in a forward "stage-wise" manner.
Each stage can be solved in two steps for a given loss function by the gradient boosting method 92,93. First, the function h(x; a) is fit to the current residuals by least squares; in the next step, given h(x; a_m), the optimal value of β_m is determined. GBRT specializes this strategy to the case where the base learner h(x; a) is an L-terminal-node regression tree. At each iteration m, the regression tree partitions the input space into L disjoint regions, and the optimization reduces to a simple "location" estimate based on the loss criterion, updated separately in each corresponding region.
The learning rate is controlled by the "shrinkage" parameter 0 < v ≤ 1. Small values of this parameter (v ≤ 0.1) lead to a much better generalization error. Friedman (1999) 92 presented specific algorithms based on this template for several loss criteria, including least absolute deviation, least squares, Huber loss and, for classification, the K-class multinomial negative log-likelihood. It should be noted that hyper-parameters must be tuned to optimize the implemented model. These parameters, such as the number of base estimators, the subsample fraction, the loss function, the maximum depth, the minimum number of samples per leaf node, the maximum number of features, and the minimum number of samples required to split a node, define the structure of the model. A simple architecture of the GBRT algorithm is depicted in Fig. 4.

Results and discussion
Description of the models' development. In the present work, three different data-driven techniques, namely DT, ET, and GBRT, were developed to establish accurate models for estimating the IFT between ionic surfactants and normal alkanes. As mentioned earlier, to create a more robust and faster model, the specific hyper-parameters of each algorithm must be adjusted and optimized; the trial-and-error method was employed to optimize the implemented models. The value of the maximum depth parameter of the DT strongly affects the speed and accuracy of the model, and the depth of the tree should be carefully adjusted to avoid over-fitting and under-fitting. As reported in Table 3, the best value for this parameter is 7. The proposed control parameters for implementing the DT algorithm based on the sample data used in this study are reported in Table 3. In this study, two ensemble algorithms based on the DT were used. Ensemble methods were used to increase the stability and accuracy of the DT model; they can also prevent over-fitting and create a robust and stable model based on the base estimator 94. In addition, hyper-parameters such as alpha must be considered and adjusted. The adjusted parameters for the models implemented in this study are reported in Table 3.
Statistical evaluation. In this study, statistical criteria were used to evaluate the accuracy and ability of the developed models in predicting IFT. For this purpose, the statistical parameters determination coefficient (R2), average percent relative error (APRE, %), root mean square error (RMSE), standard deviation (SD), and average absolute percent relative error (AAPRE, %) were used, with their standard definitions 95:

APRE = (100/N) Σ (IFT_exp,i − IFT_pred,i) / IFT_exp,i
AAPRE = (100/N) Σ |(IFT_exp,i − IFT_pred,i) / IFT_exp,i|
RMSE = sqrt((1/N) Σ (IFT_exp,i − IFT_pred,i)^2)
R2 = 1 − Σ (IFT_exp,i − IFT_pred,i)^2 / Σ (IFT_exp,i − mean(IFT_exp))^2

If the value of R2 is high and the values of AAPRE, APRE, RMSE, and SD are low, the model predicts the IFT with higher accuracy. The maximum value of R2 is one, and the lowest value of AAPRE is zero. Statistical parameters for evaluating the models implemented in this study at different development stages are presented in Table 4. According to the RMSE values presented in Table 4, the accuracy ranking of the models implemented in this study, for both training and testing phases, is GBRT > ET > DT.

Graphical error analysis. In this section, graphical error analysis is used to show the models' validity and accuracy. Therefore, four graphs, including the bar plot, cross-plot, relative error distribution, and cumulative frequency diagram, were investigated. Figure 5 is a cross-plot of the DT, ET, and GBRT models. This diagram plots the predicted values in the training and testing phases versus the experimental values. In this type of diagram, if the train and test points lie close to the unit-slope line (Y = X), the model predicts with high accuracy. As Fig. 5 shows, some of the test points of the DT and ET models lie above or below the Y = X line, which indicates the lower accuracy of these models. In contrast, the points of the GBRT model are scattered tightly around the unit-slope line; this model estimates the IFT close to the experimental values and with higher accuracy than the DT and ET models. Furthermore, the relative error diagram is a practical tool to show the deviation of the predicted value from the experimental value. The absolute error is the difference between the predicted and experimental values, and the relative error is the absolute error divided by the experimental value:

RE% = (IFT_exp − IFT_pred) / IFT_exp × 100    (18)

In Fig. 6, the zero line indicates prediction without error; therefore, if the training and test points are close to the zero line, a robust model has been developed. Figure 6 shows that from an IFT value of 20 mN/m onwards, the relative error of all models implemented in this study is low. According to Eq. (18), points in the negative range of Fig. 6 indicate that the model overestimates. As explained in the modeling section, ensemble methods increase accuracy and mitigate over-fitting 87,96. According to the results presented in Figs. 6 and 7, the ensemble models implemented in this study, ET and GBRT, reduced the variance and controlled over-fitting, as seen in the test phase.
Next, Fig. 8 illustrates the cumulative frequency plot, which displays the proportion of predicted data points whose absolute relative error (%) is less than or equal to a particular value. The closer a model's curve is to the vertical axis, the more data it predicts with low error and, consequently, the more precise it is. As Fig. 8 shows, the GBRT model estimates 70% of the IFT values with less than 3.2% error, while the ET and DT models predict this proportion of the data with errors of 3.6% and 4.2%, respectively. In addition, about 90% of the data estimated by the GBRT model has an error lower than 6.2%, versus 8.2% and 10.8% for the ET and DT models, respectively. As a result, the superiority of the GBRT model over the ET and DT models is confirmed once again.
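The statistical criteria and the cumulative-frequency reading of Fig. 8 can be sketched as follows. This is a minimal illustration with made-up numbers, not the paper's data; the function names and example values are assumptions.

```python
import numpy as np

def metrics(y_exp, y_pred):
    """Statistical criteria of the paper, under their standard definitions (sketch)."""
    rel = (y_exp - y_pred) / y_exp * 100            # percent relative error
    apre = rel.mean()                               # APRE, %
    aapre = np.abs(rel).mean()                      # AAPRE, %
    rmse = np.sqrt(np.mean((y_exp - y_pred) ** 2))  # RMSE
    sd = rel.std(ddof=1)                            # SD of the relative errors
    r2 = 1 - np.sum((y_exp - y_pred) ** 2) / np.sum((y_exp - y_exp.mean()) ** 2)
    return apre, aapre, rmse, sd, r2

def cumulative_fraction(y_exp, y_pred, err_pct):
    """Fraction of points whose absolute relative error is <= err_pct (Fig. 8 style)."""
    are = np.abs((y_exp - y_pred) / y_exp) * 100
    return float(np.mean(are <= err_pct))

# Made-up example values, not the paper's data:
y_exp = np.array([30.0, 25.0, 20.0, 10.0])
y_pred = np.array([29.0, 25.5, 21.0, 10.2])
print(cumulative_fraction(y_exp, y_pred, 3.5))  # 0.75: 3 of 4 points within 3.5%
```

Sweeping err_pct from 0 upward and plotting the resulting fractions reproduces a cumulative frequency curve of the kind shown in Fig. 8.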
Based on the results presented in this section, the GBRT model is proposed for estimating the IFT between surfactant solutions and normal alkanes with high accuracy. A portion of the IFT values predicted by the GBRT model in the test phase is reported in Table 5; no significant deviation from the experimental data is observed. The results in Fig. 7 show that the AAPRE and RMSE of the GBRT model in the test phase were 3.63% and 1.628, respectively, which indicates the high reliability of this model in predicting the IFT of surfactant–hydrocarbon systems.
Sensitivity analysis. In this study, sensitivity analysis was performed to determine the extent and type of relationship between the independent variables presented in Table 2 and the IFT (output). Different sensitivity analysis methods have been introduced for regression models 97,98. In this section, the Pearson coefficient was used to calculate the relevancy factor 99:

r(x_k, y) = Σ (x_k,i − x̄_k)(y_i − ȳ) / sqrt( Σ (x_k,i − x̄_k)^2 · Σ (y_i − ȳ)^2 )

where y_i, ȳ, x_k,i, and x̄_k denote the target value, the average of the target values, the i-th value of the k-th input, and the average of the k-th input, respectively. Figure 9 shows the absolute values of the sensitivity analysis results for the proposed GBRT model. In this dataset, PITx, the surfactant concentration, and HLB had the greatest effect on IFT, in that order. At concentrations lower than the CMC of a surfactant, the IFT decreases with increasing surfactant concentration 100; therefore, the surfactant concentration was expected to have a large effect on IFT along with the type of surfactant. The molecular weight of the alkanes has the lowest Pearson coefficient of the input parameters. Figure 10 shows the IFT values predicted by the GBRT model together with the experimental data 68,69,72. In this figure, the IFT of anionic (SDS) and cationic (C10TAB) surfactants with n-hexane as the hydrocarbon phase is plotted. The temperature of the surfactant solution was held constant (298.2 K), and the IFT data are plotted as a function of surfactant concentration. Based on the results presented in Fig. 10, the GBRT model accurately predicts the IFT of the surfactant/n-hexane systems. The IFT between the surfactant solution and the hydrocarbon depends on the surfactant concentration [101][102][103]: at concentrations lower than the CMC, increasing the surfactant concentration reduces the IFT 38, because surfactant molecules adsorb at the liquid–liquid interface 104. Therefore, increasing the number of adsorbed surfactant molecules at the interfacial boundary further reduces the IFT.
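The Pearson relevancy factor used in the sensitivity analysis is the textbook Pearson correlation and can be sketched in a few lines; this is an illustration, not the authors' code.

```python
import numpy as np

def relevancy_factor(x, y):
    """Pearson correlation between one input column x and the target y (sketch)."""
    xm, ym = x.mean(), y.mean()
    return np.sum((x - xm) * (y - ym)) / np.sqrt(
        np.sum((x - xm) ** 2) * np.sum((y - ym) ** 2))

# Sanity checks: perfectly correlated and anti-correlated data.
x = np.arange(10.0)
assert abs(relevancy_factor(x, 2 * x + 1) - 1.0) < 1e-12
print(relevancy_factor(x, -x))  # -1.0
```

Applied column by column to the five inputs against the IFT target, the absolute values of this factor give a ranking of the kind reported in Fig. 9; the sign indicates whether the input increases or decreases the IFT.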
In the next step, Fig. 11 shows the IFT between C 10 TAB and C 12 TAB in n-octane and n-nonane hydrocarbon phases 69 along with the prediction of the GBRT model. For this analysis, the temperature was considered a constant value of 298.2 K. As can be seen, the heavier the hydrocarbon phase, the greater the IFT of surfactant-hydrocarbon. Again, the GBRT model renders great predictions for the IFT of the surfactants and hydrocarbons considered in this figure.
Another parameter affecting the surfactant IFT considered in this study is temperature. The IFT values were predicted for different concentrations of SDS over a temperature range. A comparison of the IFT trend predicted by the GBRT model over this temperature range with the experimental data 70 is shown in Fig. 12. The predicted trend is similar to the experimental data and shows good agreement with the real trend of IFT variation. As mentioned earlier, the IFT decreases with increasing temperature. This also shows that the proposed GBRT model predicts the interfacial behavior of the surfactant well under the defined conditions.

Outlier detection using the Leverage approach. The Leverage approach 105-107 is a reliable technique for discovering outliers that may exist in a databank due to a variety of circumstances, including experimental errors. Such points lie at an improper distance from the majority of the data, so catching them is critical for preventing model inaccuracy and unreliability. The Leverage method requires the standardized residuals (R) as well as the Hat matrix, which is constructed from the explanatory variables and the values predicted by the model. The Leverage or Hat indices (H) are determined using the following formula 108-110:

H = X (X^T X)^−1 X^T

where X represents the matrix of explanatory variables and T denotes the matrix transpose operator. Moreover, the critical Leverage (H*) is calculated as 111:

H* = 3 (n + 1) / m

where n is the number of model inputs and m is the number of data points. In this work, the databank includes five inputs and 390 data points, leading to H* = 0.0461. On the other hand, taking MSE as the mean squared error, e_j as the error of the j-th data point, and H_j as the j-th Leverage value, the standardized residuals can be determined as follows 90,112:

R_j = e_j / sqrt( MSE (1 − H_j) )

Figure 13 displays the outcomes of the Leverage approach using the GBRT model's results.
In this case, only 6 points are recognized as suspected data lying outside the applicability domain of the model, and only 8 points were identified as outliers. This confirms that the experimental database of IFT between surfactants and hydrocarbons is highly reliable, and that the GBRT model is statistically dependable and valid.
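The Leverage screening above can be sketched as follows. The random matrices here merely stand in for the real inputs and residuals, and the |R| > 3 cutoff for outliers is a common convention assumed for illustration rather than taken from the paper.

```python
import numpy as np

def leverage_outliers(X, residuals):
    """Williams-plot style screening (sketch).
    X: (m, n) matrix of model inputs; residuals: (m,) prediction errors."""
    m, n = X.shape
    # Hat matrix H = X (X^T X)^-1 X^T; only its diagonal is needed.
    H = X @ np.linalg.inv(X.T @ X) @ X.T
    h = np.diag(H)
    h_star = 3 * (n + 1) / m                 # critical Leverage
    mse = np.mean(residuals ** 2)
    R = residuals / np.sqrt(mse * (1 - h))   # standardized residuals
    suspected = h > h_star                   # outside the applicability domain
    outliers = np.abs(R) > 3                 # common |R| > 3 convention (assumed)
    return h_star, suspected, outliers

rng = np.random.default_rng(1)
X = rng.normal(size=(390, 5))                # stand-in for the 5-input databank
res = rng.normal(scale=0.5, size=390)        # stand-in for GBRT residuals
h_star, susp, out = leverage_outliers(X, res)
print(round(h_star, 4))  # 3*(5+1)/390 ~ 0.0462
```

Plotting R against h with the h* and ±3 limits drawn in reproduces a Williams plot of the kind shown in Fig. 13.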

Conclusions
The aim of this study was to develop accurate and reliable models to estimate the IFT of ionic surfactants and normal alkanes. Three intelligent computer-aided algorithms, namely DT, ET, and GBRT, were implemented for this purpose. A databank containing 390 experimental IFT data points from the literature was used to develop these models. The models considered the following input parameters: temperature, normal alkane molecular weight, surfactant concentration, HLB, and PIT. Based on this work, the following conclusions are drawn:
… terms of the independent variables.
5. The cumulative error distribution of the GBRT model was very satisfactory, with approximately 90% of the predicted data having a relative error of less than 6.2%.
6. According to the results of the sensitivity analysis, the effect of the input parameters on the IFT is ranked as follows: PIT > surfactant concentration > HLB > temperature > molecular weight of normal alkane.
7. The Leverage method demonstrated that the majority of the data points (almost 96.5%) are valid, and both the IFT databank and the GBRT model appear highly trustworthy.
As a suggestion for future studies, the simultaneous consideration of the surfactant-bearing aqueous phase, the organic hydrocarbon phase, and a solid phase (reservoir rock or its minerals) could extend this kind of modeling to the governing equations of surface wetting.