A constrained machine learning surrogate model to predict the distribution of water-in-oil emulsions in electrostatic fields

Accurately describing the evolution of water droplet size distribution in crude oil is fundamental for evaluating the water separation efficiency in dehydration systems. Enhancing the separation of an aqueous phase dispersed in a dielectric oil phase, which has a significantly lower dielectric constant than the dispersed phase, can be achieved by increasing the water droplet size through the application of an electrostatic field in the pipeline. Mathematical models, while being accurate, are computationally expensive. Herein, we introduced a constrained machine learning (ML) surrogate model developed based on a population balance model. This model serves as a practical alternative, facilitating fast and accurate predictions. The constrained ML model, utilizing an extreme gradient boosting (XGBoost) algorithm tuned with a genetic algorithm (GA), incorporates the key parameters of the electrostatic dehydration process, including droplet diameter, voltage, crude oil properties, temperature, and residence time as input variables, with the output being the number of water droplets per unit volume. Furthermore, we modified the objective function of the XGBoost algorithm by incorporating two penalty terms to ensure the model’s predictions adhere to physical principles. The constrained model demonstrated accuracy on the test set, with a mean squared error of 0.005 and a coefficient of determination of 0.998. The efficiency of the model was validated through comparison with the experimental data and the results of the population balance mathematical model. The analysis shows that the initial droplet diameter and voltage have the highest influence on the model, which aligns with the observed behaviour in the real-world process.

However, these conventional electro-coalescer vessels tend to be large due to extended residence times needed for effective separation.
Recent advancements in electro-coalescence technology, such as inline electrostatic coalescers (IEC), have improved water separation efficiency 10,11 .These devices subject the water/oil mixture to an AC electric field, magnifying droplet sizes and enhancing coalescence rates in the pipeline to facilitate the water separation downstream reducing the reliance on demulsifying chemicals and promoting an environmentally friendly approach 12 .IECs are particularly crucial for increasing the efficiency of dehydration in heavy oil processing and offshore operations, as they effectively counteract the emulsion stabilizing effects of surface-active compounds in heavy crude oil, such as asphaltene and resin while providing a compact design, light weight, and superior performance particularly beneficial in limited-space offshore units.
Previous studies have primarily focused on the modelling of traditional electro-coalescer vessels, taking into account factors such as the strength of the electric field, flow rates, and residence times [13][14][15][16][17][18] .However, the phenomenon of droplet breakage, which can occur simultaneously and influence separation efficiency, has been largely disregarded.Furthermore, limited attention has been given to modelling inline electrostatic devices.Considering the growing demand to address flow conditioning challenges, particularly in constrained environments like offshore platforms, and the necessity to enhance the efficiency of heavy oil processing, it becomes essential to develop a thorough understanding of IECs.Therefore, in our previous study 11 , we developed a mathematical model using population balance equations (PBE) to consider both coalescence and breakage of emulsion droplets under the influence of a static electric field within an IEC.The model predictions closely matched experimental data, examining factors such as electric field intensity, inlet flow rate, and residence time to understand their impact on droplet size distribution and separation efficiency.
The prediction of electrostatic water separation efficiency, which is based on predicting the temporal size distribution of water-in-oil emulsions requires cumbersome calculations due to the complex interactions of multiphase dynamics, fluid mechanics, and electrostatic forces.Implementing the direct population balance model for these calculations poses challenges in terms of computational efficiency and increased computational cost, making it difficult for scenarios where quick predictions or resource-efficient solutions are required.In response, surrogate machine learning (ML) models can be used as practical alternatives.Machine learning algorithms are inherently data-driven and are capable of identifying meaningful patterns and connections within available data 19,20 .The ML surrogate model captures the essential features and patterns of the original model, enabling faster and more efficient predictions with acceptable accuracy 21 .Surrogate modelling is particularly useful when dealing with complex systems or simulations where direct modelling may be less practical due to computational constraints or resource limitations.There is limited existing research on application of machine learning methods to predict the size distribution of droplets in the crude oil dehydration process.Ranaee et al. conducted a study utilizing artificial intelligence techniques to assess the performance of a traditional crude oil demulsification system.They achieved this by combining global sensitivity analysis, machine learning methods, and rigorous model discrimination criteria 18 .However, to the best of our knowledge, there has been no research conducted on the utilization of machine learning surrogate modelling for inline electrostatic coalescer systems.
In this study, we employed an extreme gradient boosting (XGBoost) model, fine-tuned with a genetic algorithm (GA), to estimate the distribution of droplet sizes across a diverse range of input variables.We used an extensive dataset from our validated mathematical model 10,11 in a controlled environment.In the next step, two penalty terms were incorporated into the objective function of the XGBoost algorithm to prevent high modelling deviations and eliminate negative outputs.The XGBoost model incorporates critical parameters of the electrostatic dehydration process, including droplet diameter, voltage, crude oil properties, temperature, and residence time as input factors, and predicts the number of water droplets per unit volume as its output.We employed permutation and Shapley additive explanations (SHAP) methods to evaluate the influence of the input parameters on the modelling output.The efficiency of the constrained XGBoost model was assessed by comparing its predictions to a standard XGBoost model as well as experimental data 12,22 and the outcomes of the phenomenology-based mathematical model 11 .

Methodology Mathematical model
Kooti et al. utilized a population balance modelling approach to simulate the dynamic and complex processes of coalescence and breakage of water-in-oil emulsions in inline electrostatic coalescers 11 .IEC is a pipe-based device equipped with a series of insulated active and grounded electrodes, exposed to an AC electric field 12 (Fig. 1).As illustrated in Fig. 2, this system enhances the separation of water from crude oil by destabilizing water-in-oil emulsions and promotes a shift in the size distribution towards larger water droplets.The electric field causes dispersed water droplets to become polarized and collide in a moderately turbulent flow regime.The polarized Figure 1.A schematic representation of an inline electrostatic coalescer device 10,12 .
drops are provided with a strong-range attraction force that enables them to break the inter-facial film between them and eventually merge to create a larger droplet 22 .
The Population Balance Equations (PBE) were employed to provide a macroscopic level understanding of how the size distribution of particles changes over time in a liquid-liquid system, assuming that the spatial distribution of droplets was random and homogeneous.The PBE was discretized first in the internal coordinate, representing droplet size, using the method of classes.Subsequently, discretization was applied to the external coordinate, representing time.The closure of the PBE was achieved by developing coalescence and breakage kernels to accurately capture the system's behavior.The results demonstrate the ability of the model to accurately simulate droplet coalescence and breakage in emulsified oil while predicting droplet size distribution and water removal efficiency.For a more in-depth mathematical background, refer to Kooti et al. in the cited literature 11 .

Machine learning surrogate model
Machine learning has exhibited significant success in domains where identifying non-linear relationships is often challenging 18,[23][24][25][26][27] , while mathematical modelling is based on the underlying physical and chemical processes of the phenomenon 28 , particularly in areas like computational fluid dynamics, where deriving causal connections is very important 29 .To utilize the advantages of both approaches, we developed a surrogate ML model to predict the behaviour of an electrostatic coalescer, leveraging a dataset obtained using our previously published mathematical model 11 .The surrogate model is based on the XGBoost algorithm 30 coupled with a genetic algorithm for hyperparameters tuning.To ensure the model's robustness and reliability, we ran the algorithm 500 times and selected the model having the lowest mean squared error (MSE) on the test set as our final GA-XGBoost model.This approach minimizes the impact of random factors and potential inconsistencies.

Extreme gradient boosting
XGBoost is an advanced supervised algorithm for both classification and regression tasks.Its core principle involves the aggregation of multiple weak predictors, predominantly decision trees, to construct a robust predictive model.XGBoost addresses the common challenge of overfitting associated with tree-based algorithms by sequentially integrating numerous tree models 30,31 .The model expression can be written as follows 30,32,33 : where f k represents the k-th tree model, y i stands for the predicted value for the sample x i , and the loss objective function for the learning process is defined as: where l represents the differentiable convex loss function that measures the difference between the prediction y i and the target y i .�(f t ) is the regularization term and can be described as follows: where T denotes the number of branches within the decision tree algorithm, and ω represents the vector of branch parameters, following the second-order expansion of Eq. ( 2), the revised objective function can be written as: www.nature.com/scientificreports/where g i and h i represent the initial and subsequent derivatives of the loss function, denoted as l, at the value of y (t−1) .To prevent overfitting during training, the algorithm does not simultaneously train all regression trees; instead, it sequentially incorporates decision trees.Consequently, when incorporating t trees, the prior t − 1 trees have already undergone training, making l(y i , ŷ(t−1) i ) essentially a fixed factor.Eventually, this simplifies the objective function to: We incorporated two penalty terms in the objective function of the XGBoost algorithm to obtain more meaningful predictions.The first penalty function, Penalty 1 , targeted the residuals between the observed ( y obs. ) and the machine learning predicted ( y pred. ) values of population of droplets per unit volume.This function was defined as: where the threshold t was set at 0.001 , with residuals exceeding this threshold being raised to the power of 3 , the value of t was determined through trial and error in subsequent steps.
Additionally, Penalty 2 was applied to discourage negative predicted values.This function was defined as: where the adjusting parameter c is obtained through an iterative process.The overall mean squared error was computed as follows: where N is the total number of observations.

Genetic algorithm
Genetic Algorithm (GA) is an optimization algorithm that mimics the process of natural evolution, where the survival of fitter creatures and their genes were simulated 34 .This algorithm excels in exploring and exploiting the search space through the iterative application of genetic operators to enhance a population of potential solutions 35 .In this study, GA was utilized to tune the hyperparameters of the XGBoost model, such as the number of estimators, maximum depth, learning rate, subsample ratio, and column subsampling ratio.The GA operates through three main genetic operators: selection, crossover, and mutation 36 .The selection process uses a roulette wheel strategy, where the likelihood of an individual being selected for reproduction is proportional to its fitness.This method ensures that better-performing hyperparameter sets have a higher chance of propagating to subsequent generations 37 .Crossover, specifically a two-point crossover, is then applied to selected individuals.This operator combines parts of two parent solutions to produce new offspring, promoting the mixture of good traits and the discovery of better-performing hyperparameter combinations.The mutation process, described by the following equation, introduces random changes to offspring 38 : where x (t+1) i represents the state of the i − th individual in the population in iteration t + 1 .The function rand() generates a random number between 0 and 1, and the mutation rate is a predefined threshold that determines the likelihood of a mutation.This process enables the algorithm to explore new areas in the hyperparameter space, potentially leading to better solutions.
The GA parameters significantly impact the optimization process.Key parameters include population size and maximum iterations, which define the extent and depth of the search; a larger population and more iterations expand the search but require more computational resources.Mutation probability and elite ratio maintain a balance between discovering new solutions and retaining the best ones.Crossover probability and the proportion of parents affect the population's diversity, with higher crossover probability enhancing diversity and a greater parents' portion ensuring the persistence of best-performing solutions 39 .These GA parameters were chosen iteratively to achieve a balance between comprehensive exploration and computational efficiency.

Model development and evaluation
We employed the data obtained from the mathematical model presented by Kooti et al. for the model development 11 .The dataset consists of 13600 data points.Table 1 summarizes the statistical analysis of the dataset, including properties of the fluid and the electrostatic coalescence system.The compiled dataset encompasses a wide range of characteristics, including the diameter of droplets ( d i ), voltage (V), crude oil density ( ρ ), viscosity ( µ ), residence time (t), temperature (T), and the number of water droplets with a specific diameter per unit volume ( f i ), which represents the size distribution of the dispersed phase.
The entire dataset was divided into a training set and a testing set at the split ratio of 4:1.This ratio ensures that a significant amount of data is used for training, while still retaining a robust and representative test set to evaluate model performance.Therefore, following the completion of the hyperparameter tuning phase, the training set was used for model training, while the testing set was used to assess the model's predictive performance.This step ensures an unbiased evaluation of the model, verifying the model's ability to generalize to unseen data, and is essential for reducing the risk of over-fitting 20,40 .
The performance of the constrained and standard XGBoost models in predicting droplet population per volume was assessed through standard metrics, including mean squared error (MSE), root mean squared error (RMSE), and the coefficient of determination ( R 2 ).We performed residuals analysis, i.e. the difference between predicted values of mathematical and ML model, and comparative analysis to further evaluate the effectiveness of the modelling approach.The relative importance of input parameters was determined by calculating Shapley additive explanations (SHAP) values 41 and permutation feature importance 42 .

Performance analysis
We first ran multiple simulations to determine the optimal bounds of hyperparameters of the XGBoost model and to fine-tune the user-defined parameters in the genetic algorithm.Following this, we chose the hyperparameter boundaries as detailed in Table 2 for building the models.The maximum number of iterations was set to 30, with a population size of 5 individuals.The mutation probability was 0.12, and elitism was applied to the top 2% of the population.Crossover was performed with a probability of 0.9 using a two-point method, and 5% of the population was selected as parents.The selection process employed the 'roulette' method, and mutations were applied randomly.There was no specified limit for the maximum number of iterations without improvement.The parameter c in the equation (7) was determined iteratively to be 100.
The train and test set performance values for the two models, standard GA-XGBoost and Constrained GA-XGBoost, were compared in Table 3.For the standard GA-XGBoost, the MSE, the RMSE, and the R 2 were 0.106, 0.325, and 0.995.In contrast, the constrained GA-XGBoost model demonstrated superior performance with a MSE of 0.005, a RMSE of 0.069, and a R 2 value of 0.998, respectively.These performance metrics indicate that the constrained GA-XGBoost model exhibits better predictive accuracy on the test set across all metrics compared to the standard GA-XGBoost model.In addition to the performance metrics, the regression plots in Figs. 3 and  4 clearly illustrate a stronger correlation between the predicted and observed droplet population per unit volume Table 1.Summarized statistics for electrostatic dehydration system and fluid characteristics. in constrained GA-XGBoost model, showing robust predictive capabilities on previously unseen data, confirming its capability to obtain the fundamental dataset pattern.The empirical cumulative distribution function (eCDF) of absolute residuals was investigated for evaluating the performance of the ML model.Residuals quantify the difference between observed simulation outputs and the corresponding predictions generated by the model.The eCDF plot depicts the fraction of data points where the absolute residuals fall below a specific threshold on the x-axis.As illustrated in Fig. 5, plotted on logarithmic scales, a steep incline in the curve at lower residuals indicates a significant concentration of data points with prediction errors close to zero.Additionally, the 80th and 90th percentiles of the absolute residuals stand at 0.001 and 0.003 µm −3 , respectively, indicating the distribution of errors within the dataset.Percentiles represent spe- cific points in a dataset below which a certain percentage of the observations fall and serve as valuable metrics for understanding the spread and magnitude of errors.Given the better performance of the constrained GA-XGBoost model, it was utilized for further analysis.As shown in Fig. 6, 98.05% of the residual values, out of a total of 13,600 data points, fall within the range of − 0.25 to 0.25 µm −3 .This indicates that predictions made by the constrained GA-XGBoost model lie very close to the values of the mathematical model.Furthermore, the residual analysis in Figs. 6 and 7 revealed that random errors were present across the entire spectrum of F i and no systematic bias was detected.This suggests that the model predictions are not systematically over/under estimated within the entire range.This observation further confirms the reliability of the model.

Feature importance
We utilized the Shapley Additive Explanations (SHAP) and Permutation techniques to assess the relative importance of input parameters in predicting water droplet size distribution.In the permutation method, each feature was subjected to 30 random permutations.The analysis showed that both methods identified similar rankings in terms of importance (Fig. 8).The diameter of droplets emerged as the most influential factor in determining droplet population per unit volume, representing the size distribution of droplets.The other parameters were ranked in descending order of importance as follows: voltage, crude oil density, residence time, crude oil viscosity, and the temperature of the mixture.Temperature was identified as the least influential feature for two main reasons.First, the narrow range and limited variability of the temperature in the dataset (Table 1) resulted in a smaller impact on the target variable.Second, the unique nature of an inline electrostatic coalescer, in contrast to traditional electrostatic vessels, may be less influenced by temperature due to its significantly shorter residence time.This implies that temperature plays a lesser role in influencing outcomes compared to the other features.

Comparative analysis
The comparison between the predicted outputs in the mathematical 11 and constrained GA-XGBoost model is illustrated in Fig. 9.These plots show results of four different voltages over a wide range of droplet diameters (1-1000 µm ) while other parameters are kept constant.The selection of varying parameters in this analysis is based on the previously mentioned examination of parameter importance.Consequently, the base parameters for comparison are the two most influential parameters of the dehydration system: droplet diameter and voltage.
The comparison analysis confirms the accuracy of the developed ML model in predicting the droplet size distribution within an IEC system.The plots follow a normal distribution, showing a characteristic bell-shaped pattern with a peak at the centre and a gradual decline in droplet counts towards the outer edges.This pattern implies a tendency for specific droplet sizes to become more prevalent, while less frequent sizes occur towards the extremes of the distribution.Because the coalescence of droplets is the dominant mechanism compared to the breakage of droplets, the process leads to the merging of smaller water droplets into larger ones, which is evident from the reduction in the peaks of the droplet population per unit volume and the reduction of the overall number of droplets.
Figure 10 illustrates the size distribution of droplets in two scenarios: (i) without the utilization of an electric field (IEC switched off) and (ii) with an electric field (IEC switched on), comparing the predictions of the surrogate model to the experimental data 12 .The distribution is represented by plotting the cumulative volume fraction of droplets against their diameter at the electrostatic coalescer outlet.Both scenarios exhibit smooth curves without sudden jumps, indicating a uniform size distribution of droplets.This means that droplets are distributed relatively evenly without significant clustering or localized variations.In the scenario without the presence of an electric field, the analysis of the cumulative volume fraction reveals a distinctive trend.Initially, as the droplet size increases from 10 µm , the cumulative volume fraction sharply increases until reaching approximately 50 µm ,    where it approaches a plateau at the value equal to one.This indicates that all of the droplets in this size range have been taken into account, and there are no further larger droplets present in the system.
Another important parameter to analyze is the droplet diameter equivalent to the volume fraction of 0.5, which represents the median droplet size of the system.Analysing this value is essential as it provides a direct comparison of droplet sizes between different scenarios.The median droplet size for the scenario without IEC is approximately 30 µm .This indicates that 50% of the dispersed phase volume consists of droplets smaller than or equal to 30 µm in diameter.When comparing the two scenarios, a noticeable change in the droplet diameter range is evident.In the electric field scenario, the cumulative volume fraction curve begins to rise at around 30 µm and reaches one at the droplet diameter of 1000 µm , with the median droplet size value equal to approxi- mately 300 µm .This shift indicates a considerable change in the droplet size distribution, emphasizing the domi- nance of larger droplets in the scenario with the electric field, compared to the scenario without it.Therefore, we conclude that the presence of the electric field has a substantial impact on the distribution of droplet sizes and consequently enhances the water removal process.It is evident that the cumulative droplet size distribution obtained by the constrained GA-XGBoost model is consistent with the experimental results.

Discussion
Coupling XGBoost with Genetic Algorithms for hyperparameter optimization resulted in a robust surrogate model for predicting the behaviour of dispersed droplets in IEC.The GA navigated a high-dimensional search space efficiently, optimizing the hyperparameters.The analysis of feature importance through SHAP and permutation techniques ranked the influence of each input feature on model predictions.The identification of the initial diameter of droplets as the most influential factor aligns with empirical evidence 12 , confirming the model's ability to capture meaningful relationships.Moreover, the absence of systematic bias, illustrated in Figs. 6 and  7, is important for the applicability of the model in practical scenarios.
While prior studies have primarily focused on conventional electro-coalescence vessels and neglected droplet breakage, our study addresses these gaps and provides more realistic predictions of droplet behaviour in an inline electrostatic field.The significance of this study is the pioneering application of machine learning to the domain of inline electrostatic coalescence.This novel approach addresses the critical need in the crude oil production industry to increase the efficiency of the dehydration process, particularly in constrained environments like offshore platforms, and also for processing heavy oil mainly because it contains natural emulsifiers that cause the water-in-oil emulsions to become more stable which consequently results in low efficiency of electrostatic coalescence 43 .
Our model efficiently predicts the temporal size distribution of water-in-oil emulsions and consequently provides a cost-effective and reliable means to enhance separation efficiency by optimizing the process design and operational conditions.Moreover, this model has positive environmental outcomes, specifically by diminishing the reliance on de-emulsifying chemicals in the crude oil dehydration process.Currently, the industry heavily depends on these chemicals to destabilize the interface between oil and water and facilitate water separation.However, improving the water removal efficiency using our model results in a reduced need for de-emulsifying chemicals, contributing to a more sustainable and eco-friendly approach by lowering the environmental impact associated with the production, usage, and disposal of these chemicals.Therefore, the developed model aligns with the global momentum toward greener practices in the crude oil production sector, emphasizing its positive role in fostering environmental responsibility.
However, it is essential to acknowledge the performance of ML models heavily relies on the quality and representatives of the training data.It is essential to highlight characteristics of the electrostatic device that were not explored in our study, such as different electrode configurations, due to the unavailability of relevant experimental data in the literature, and consequently in the mathematical model.Future investigations can focus on expanding the dataset to include a broader array of diverse industrial settings and also investigating the combination of machine learning with different mathematical models.On the other hand, based on the results of the feature importance analysis, it will also be valuable to explore the impact of reducing input features by excluding the least influential ones.This could potentially simplify the model while maintaining its predictive accuracy.Future research could delve into a comparative analysis, assessing the changes in model accuracy when using a reduced set of input features compared to our current model that incorporates six inputs to offer insights into an optimal configuration of input features.

Summary and conclusion
In this study, we developed a constrained machine learning surrogate model for predicting the size distribution of water-in-oil emulsions in inline electrostatic coalescers (IECs) as a practical alternative to a mathematical model based on population balance equations to facilitate fast and accurate predictions.This model is valuable in addressing challenges encountered in crude oil dehydration processing, including equipment corrosion and catalyst deactivation.The compact and lightweight design of IEC not only addresses spatial constraints for offshore operations but also demonstrates high efficiency in processing heavy oil with stable emulsions, reducing reliance on demulsifying chemicals and contributing to a more environmentally friendly approach.
We employed an XGBoost algorithm with hyperparameter optimization using a genetic algorithm.We incorporated two penalty terms into the objective function of the algorithm to enhance the physical interpretability and accuracy of our model compared to the standard GA-XGBoost model.These terms discouraged high modeling deviations and negative predictions.The results of the test set revealed the precision of the constrained GA-XGBoost model with MSE of 0.005 and an R 2 value of 0.998.The comparative analysis further confirmed the accuracy of the constrained ML model when compared to the experimental data and the outcomes of the mathematical model.Residual analyses showed the model's reliability, detecting no systematic bias and revealing a majority of residuals tightly clustered around zero.Furthermore, we used SHAP and Permutation methods to assess the importance of the six input features, and results showed the initial droplet diameter and the electric field voltage were the most influential parameters.In conclusion, the surrogate machine learning model provides a practical alternative for accurately describing the evolution of droplet size distribution in inline electrostatic coalescers, serving as a valuable tool to enhance the performance of the dehydration process and advance its practical applications in the petroleum industry.

Figure 2 .
Figure 2. Temporal changes in the number density distribution of water droplets ( F i ) with different diameters per unit volume at the outlet of IEC 11 .

Figure 3 .
Figure 3. Regression plots comparing the standard GA-XGBoost model predictions to mathematical model data for the droplet population per unit volume.

Figure 4 .
Figure 4. Regression plots comparing the constrained XGBoost model predictions to mathematical model data for the droplet population per unit volume.

Figure 5 .
Figure 5. Empirical cumulative distribution function (eCDF) of absolute residuals of the constrained XGBoost model.

Figure 6 .
Figure 6.Residual plots of the droplet population per unit volume ( F i ) predicted by the constrained XGBoost model for training and testing data.

Figure 7 .
Figure 7. Frequency of residuals in predicting the droplet population per unit volume ( F i ) for training and testing data.

Figure 9 .
Figure 9.Comparison of droplet population per unit volume predicted by the mathematical and the constrained GA-XGBoost models for varying droplet diameters under four voltage conditions: (a) 1 kv/cm, (b) 2 kv/cm, (c) 3 kv/cm, (d) 4 kv/cm.

Figure 10 .
Figure 10.Comparing experimental values of droplet cumulative volume fraction to the constrained GA-XGBoost predicted values.

Table 3 .
Comparison of training and testing performance metrics for GA-XGBoost models.