Utilizing machine learning for flow zone indicator prediction and hydraulic flow unit classification

Reservoir characterization, essential for understanding subsurface heterogeneity, often faces challenges due to scale-dependent variations. This study addresses this issue by utilizing hydraulic flow unit (HFU) zonation to group rocks with similar petrophysical and flow characteristics. The Flow Zone Indicator (FZI), a crucial measure derived from pore throat size, permeability, and porosity, serves as a key parameter, but its determination is time-consuming and expensive. The objective is to employ supervised and unsupervised machine learning to predict FZI and classify the reservoir into distinct HFUs. Unsupervised learning using K-means clustering and supervised algorithms including Random Forest (RF), Extreme Gradient Boosting (XGB), Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were employed. FZI values from RCAL data formed the basis for model training and testing; the developed models were then used to predict FZI at unsampled locations. A methodical approach involving three-fold cross-validation and hyper-parameter tuning, using the random search cross-validation technique over 50 iterations, was applied to optimize each model. The four applied algorithms indicate high performance, with coefficients of determination (R²) of 0.89 and 0.91 in the training and testing datasets, respectively. RF showed the highest performance, with training and testing R² values of 0.957 and 0.908, respectively. Elbow analysis guided the successful clustering of 212 data points into 10 HFUs using k-means clustering and Gaussian mixture techniques. The high-quality reservoir zone was successfully unlocked using the unsupervised technique. The intervals between 2370–2380 feet and 2463–2466 feet are predicted to be high-quality reservoir potential areas, with average FZI values of 500 and 800, respectively.
The application of machine learning in reservoir characterization is deemed highly valuable, offering rapid, cost-effective, and precise results, revolutionizing decision-making in field development compared to conventional methods.


Nomenclature
S_gv	Surface area per unit grain in μm
x_i^(j)	The ith pattern belonging to the jth cluster
x_norm	Normalized value with a range of 0–1
x	Variable in the dataset; max and min refer to the maximum and minimum values of the variable

Reservoir characterization is a fundamental part of petroleum engineering that involves gathering and evaluating data to understand the properties of a subsurface reservoir. This process is necessary for making informed decisions involving the production and recovery of hydrocarbons from the reservoir 1. The information gathered during reservoir characterization is critical for accurate hydrocarbon reserve estimation, optimization of production techniques, risk reduction, and improved recovery, which is vital to financial analysis and decision-making in the oil and gas industry 2. In addition, reservoir characterization provides valuable information on the reservoir's properties and behavior, which contributes to the development of an optimum field development plan, including the determination of the number and placement of wells, production rates, and field infrastructure design.
Reservoir characterization is a challenging task due to the uncertainty imposed by reservoir heterogeneity, which refers to the variability of reservoir properties across various geological scales. To address this uncertainty, hydraulic flow unit (HFU) zonation is used to cluster rocks with identical petrophysical and flow characteristics into the same unit 3. This allows for the prediction of unknown reservoir properties and eliminates unnecessary coring expenses. HFUs are based on geological and physical flow principles and provide a more accurate representation of reservoir heterogeneity compared to traditional lithological or depositional facies-based approaches. The Hydraulic Flow Unit method is related to the Flow Zone Indicator (FZI), a commonly used measure in reservoir characterization. The FZI provides a quantitative method for analyzing the relationship between microscopic characteristics like pore throat size and distribution and macroscopic ones like permeability and porosity. This consequently suggests that rock properties derived from depositional and diagenetic processes play a significant role in determining the surface area, shape factor, and tortuosity of carbonates, and thus the FZI value 4.
Conventional methods for reservoir characterization primarily focus on directly measuring or estimating permeability and porosity, which are crucial for understanding reservoir potential. The primary tools for this purpose are core measurements and well logs. Core measurements involve physically extracting a sample from the reservoir and analyzing it to determine properties like permeability and porosity. Well logs, on the other hand, are continuous recordings of various physical parameters along the wellbore, providing indirect estimates of these reservoir properties.
While core measurements offer high accuracy, they are often expensive, time-consuming, and only provide data for a limited section of the reservoir. Well logs, including tools like bulk density, neutron porosity, sonic, and nuclear magnetic resonance logs, are more extensive but can sometimes yield less satisfactory results. This is due to uncertainties in the empirical parameters used for interpretation and the adaptability issues of response equations to different reservoir conditions. These limitations of conventional methods highlight the need for more efficient and comprehensive approaches in reservoir characterization (rock typing) 5,6. Therefore, there is a need to identify advanced methods capable of overcoming the limitations inherent in traditional reservoir characterization techniques.
AI and Machine Learning (ML) offer solutions to these challenges by efficiently processing vast quantities of data, surpassing the limitations of human analysis in both speed and complexity. These advanced technologies can interpret intricate datasets from logs more effectively, identifying patterns and correlations that might be missed by traditional methods. Furthermore, AI-driven methods are not confined to the data from cored intervals, enabling a more comprehensive analysis of the reservoir. This holistic approach can integrate diverse data sources, including seismic, geological, and production data, offering a more nuanced understanding of reservoir characteristics. Studies have utilized a range of supervised machine learning algorithms, including Random Forest (RF) 7, Support Vector Machines (SVM) 8, Artificial Neural Networks (ANN) 9, the adaptive network fuzzy inference system (ANFIS) 10, and Extreme Gradient Boosting (XGB) 6, to accurately predict permeability values. Additionally, unsupervised machine learning algorithms such as K-Means have been studied to classify the reservoir based on hydraulic flow units (HFUs) 11,12.
The main objective of this study is to create a supervised machine-learning model that directly estimates the flow zone indicator (FZI) at unsampled locations using well-logging data during the initial exploration phase. This approach is highly valuable as it allows for the direct determination of FZI at specific depths of interest, leveraging the power of supervised machine learning. Additionally, an unsupervised machine-learning model will be developed to cluster hydraulic flow unit numbers in the target zone. This clustering approach is also valuable as it enables the assessment of distinct petrophysical properties associated with flow units, which greatly influences reservoir characterization.
To accomplish the study's objective, the implementation of popular machine learning algorithms is planned: K-Means for the unsupervised machine learning model, and Random Forest, Extreme Gradient Boosting, Support Vector Machines, and Artificial Neural Network for the supervised machine learning model. Additionally, the credibility of the results will be ensured by evaluating the physics-based approach in conjunction with the data-driven approach of supervised machine learning. This combination of approaches will enhance the classification of rock reservoir types, resulting in improved accuracy and efficiency. Therefore, this research aims to introduce a dependable and data-driven approach for predicting flow zone indicators in unsampled locations, utilizing advanced machine learning techniques. This innovative methodology is poised to significantly contribute to the advancement of rock reservoir type classification within the petroleum industry, marking a shift towards more sophisticated, analytics-based strategies.

The flow zone indicator (FZI)
In reservoir characterization, predicting permeability is crucial for understanding hydrocarbon production. The Hydraulic Flow Unit (HFU) approach was first introduced by 3 and is based on the following modification of the Kozeny-Carman equation:

k = [φe³ / (1 − φe)²] × [1 / (K_T × S_vgr²)]

where k is permeability in m², φe is effective porosity, K_T is the pore-level effective zoning factor, and S_vgr is the specific surface area per unit grain volume. The K_T parameter is a function of pore size and shape, grain size and shape, pore and grain distribution, tortuosity, cementation, and pore system (intergranular, intracrystalline, vuggy, or fractured) 13.
The HFU approach uses the normalized porosity index or void ratio (φz) and the reservoir quality index (RQI) to predict permeability. The method involves plotting φz against RQI on a log-log scale and fitting a unit-slope trend line. The Flow Zone Indicator (FZI), which characterizes the geological and petrophysical attributes of a given HFU, is determined by the intercept value of the trend line at φz = 1. These parameters are calculated using the following equations:

RQI = 0.0314 √(k / φe)

φz = φe / (1 − φe)

FZI = RQI / φz
where k is permeability in mD, φe is effective porosity in fractions, F_s is the shape factor, τ is the tortuosity, and S_gv is the surface area per unit grain in μm. The permeability can be recalculated for the flow unit of a sample, given the FZI and effective porosity, using the following equation:

k = 1014 × FZI² × φe³ / (1 − φe)²
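For illustration, the RQI, φz, and FZI relations above can be evaluated directly in code. The sketch below is minimal; the permeability and porosity values are hypothetical and not taken from the study's dataset:

```python
import numpy as np

def flow_zone_indicator(k_md, phi_e):
    """Return (RQI, phi_z, FZI) from permeability in mD and fractional porosity."""
    rqi = 0.0314 * np.sqrt(k_md / phi_e)   # Reservoir Quality Index, in microns
    phi_z = phi_e / (1.0 - phi_e)          # normalized porosity index (void ratio)
    return rqi, phi_z, rqi / phi_z         # FZI = RQI / phi_z

# Hypothetical core plug: k = 100 mD, phi_e = 0.20
rqi, phi_z, fzi_val = flow_zone_indicator(100.0, 0.20)

# Recomputing permeability from FZI closes the loop: k = 1014 FZI^2 phi_e^3 / (1 - phi_e)^2
k_back = 1014.0 * fzi_val**2 * 0.20**3 / (1.0 - 0.20)**2
```

Note the round trip recovers the input permeability to within the rounding of the 0.0314 and 1014 constants.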
When the samples for a given HFU are closely aligned with the trend line, the FZI value is equal or close to the FZI arithmetic average of these samples, and the predicted permeability is identical to the measured one. However, if the samples are scattered around the trend line, the FZI value differs greatly from the FZI arithmetic average, and the predicted permeability is far from the measured one, with a significant error 14. Fine-grained rocks, poorly sorted sands, and rocks with authigenic pore-lining, pore-filling, and pore-bridging clays are more likely to have a large surface area and a high degree of tortuosity, as stated by 3. The shape factors and tortuosity of coarse-grained, well-sorted sands are much lower. Integrating FZI with other well logs and core data enables the classification of HFUs, leading to more accurate reservoir characterization and better reservoir management.

Random forest (RF)
Random forest is an ensemble learning algorithm built on bagging 15. The bagging is based on the concept of building multiple decision trees independently from one another using a subset of the input predictor parameters and a bootstrap sample of the training data 16. It randomly selects the training dataset T_b (b = 1, …, B) from the whole training set T with replacement (bootstrap sampling) and randomly selects M features or input variables from the P input variables (M < P) 17,18. By following these steps, the proxy model's bias, excess variance, and overfitting will be reduced to acceptable levels. Like decision trees, random forests are effective at resolving non-linear patterns within data while also being scalable and resistant to outliers in imbalanced datasets 19-23.
For each tree within the Classification and Regression Tree (CART) framework, the ideal split is calculated using a random selection of both T_b and the M features. The collective set of these trees can be expressed as an ensemble {T_b(X), b = 1, …, B}.
In the regression approach used by the Random Forest algorithm, the final prediction is derived through an averaging process rather than majority voting. The prediction Y for a given input X is calculated as the average of the predictions from all the individual trees in the ensemble:

Y(X) = (1/B) Σ_{b=1}^{B} T_b(X)

This equation states that the collective prediction is the mean of the outcomes from each of the B individual Classification and Regression Trees (CART) that constitute the forest. By averaging, Random Forest harnesses the diversity of the ensemble, effectively reducing the overall prediction error. This method capitalizes on the ensemble's ability to minimize the average squared error across the predictions, often resulting in a more accurate prediction than any single tree's output 18,23.
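This averaging can be verified directly with scikit-learn, whose RandomForestRegressor exposes the fitted trees through `estimators_`. The data below are synthetic stand-ins, not the study's well logs:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = rng.uniform(size=(212, 9))                 # synthetic stand-in for nine log inputs
y = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.05, size=212)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# The ensemble prediction equals the mean of the B individual trees' outputs
manual_mean = np.mean([tree.predict(X[:5]) for tree in rf.estimators_], axis=0)
ensemble = rf.predict(X[:5])
```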

Extreme gradient boosting (XGB)
The gradient boosting approach is a robust ensemble training algorithm designed for both non-linear classification and regression applications, upgrading a weak learning model into a strong learner 24-26. The primary objective of the gradient boosting approach is to identify a new sub-model with a lower error rate than the previous model. Hence, this method relies on the sequential use of multiple models (boosting), each trained to minimize the errors of the previous one 17,27.
One of the most well-known gradient-boosting enhancements is Extreme Gradient Boosting (XGB), which employs Gradient Boosting Decision Trees (GBDT) 28. This method avoids overfitting because it considers more regularization terms than standard gradient tree boosting. Furthermore, it enhances model robustness by employing sampling techniques across both rows and columns, effectively diminishing the model's variance 29. A key factor in XGBoost's effectiveness is its ability to scale efficiently across various configurations. The ensemble model of XGBoost is formulated in an additive manner:

ŷᵢ = Σ_{k=1}^{K} f_k(xᵢ), f_k ∈ F
where f_k symbolizes a specific tree within the space F, which encompasses the entire set of regression trees, xᵢ signifies the i-th feature vector, and K is the total count of trees in the model. The cost function is expressed as follows:

Obj = Σᵢ l(yᵢ, ŷᵢ) + Σ_k Ω(f_k)

where the loss function l(yᵢ, ŷᵢ) measures the difference between the observed yᵢ and the predicted ŷᵢ values, and Ω denotes the regularization penalty. The regularization term itself is further defined as a combination of two components:

Ω(f) = γT + (1/2) λ ‖ω‖²

where γ is the coefficient penalizing the complexity of the model through the number of leaf nodes T, while λ is the coefficient for the ℓ2 norm and ω is the leaf weight.
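The additive scheme can be sketched with scikit-learn's GradientBoostingRegressor, used here as a stand-in for the XGBoost library (xgboost's XGBRegressor exposes an analogous fit/predict interface plus the γ and λ regularization terms). Synthetic data replace the study's logs:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=212, n_features=9, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35, random_state=0)

# Each of the K stages adds a shallow regression tree fitted to the current residuals
gb = GradientBoostingRegressor(n_estimators=200, max_depth=3,
                               learning_rate=0.1, random_state=0).fit(X_tr, y_tr)
r2_test = gb.score(X_te, y_te)
```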

Support vector machines (SVM)
Support vector machines are based on the inductive concept of structural risk minimization (SRM), which allows for reasonable generalizations to be made from a limited set of training examples 30-33. This method utilizes a margin-based loss function to control the input space dimensions and a kernel function to project the prediction model onto a higher-dimensional space.
A support vector regressor (SVR) is a member of the Support Vector Machine family, which has extremely potent and flexible performance, is not confined to linear models, and is resistant to outliers. This method utilizes the kernel trick to translate the original data into a higher-dimensional space without explicitly declaring the higher dimension 34,35. This method's compatibility with linear models (using linear kernels) or non-linear models (using polynomial or radial kernels) makes it extremely versatile 17. The effectiveness of the SVR relies heavily on the model selection and kernel function settings (C, Gamma, and Epsilon) 36.
The introduction of Vapnik's epsilon-insensitive loss function has enabled Support Vector Regression (SVR) to effectively address nonlinear regression estimation challenges. This approach involves the approximation of given datasets using this specialized loss function.
With a linear function

f(x) = ⟨ω, x⟩ + b

where ⟨·, ·⟩ denotes the dot product in X, SVR aims to find a function f(x) that approximates output values within a deviation of ε from the actual training data. The choice of ε is crucial, as smaller values lead to tighter models that penalize a larger portion of the training data, while larger values result in looser models with less penalization. The ideal regression function is identified by addressing an optimization problem designed to calculate the values of ω and b:

minimize (1/2) ‖ω‖² + C Σᵢ (ξᵢ + ξᵢ*)
subject to yᵢ − ⟨ω, xᵢ⟩ − b ≤ ε + ξᵢ, ⟨ω, xᵢ⟩ + b − yᵢ ≤ ε + ξᵢ*, ξᵢ, ξᵢ* ≥ 0

where ξᵢ and ξᵢ* are the slack variables, and ω and b are the model parameters. This approach balances minimizing training error and penalizing model complexity, thus controlling the generalization error. The regularization constant C in the optimization formulation helps to trade off between these two aspects. The epsilon-insensitive loss function further adds to this balance by penalizing errors only when they exceed ε. This methodology allows SVR to achieve better generalization performance compared to some other models like neural networks 37.
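A compact sketch of these settings with scikit-learn's SVR, using an RBF kernel with explicit C, gamma, and epsilon on synthetic one-dimensional data:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * X[:, 0]) + rng.normal(scale=0.05, size=200)

# C trades training error against model flatness; epsilon sets the insensitive tube width
model = make_pipeline(MinMaxScaler(),
                      SVR(kernel="rbf", C=10.0, gamma="scale", epsilon=0.01))
model.fit(X, y)
r2_train = model.score(X, y)
```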

Artificial neural network (multi-layer perceptron)
Artificial Neural Network (ANN), or multi-layer perceptron, is one of the most effective machine learning approaches. Its mathematical design is inspired by biological neural networks. This technique consists of three main layers: the input layer receives the input information from the X variables, this data is then received and learned by the hidden layer, and the result is generated by the output layer 38-41.
This study will concentrate on feed-forward back-propagation neural networks, one of the numerous forms of neural networks. In this method, the input information flows in a forward manner from the input layer to the hidden layer and ends up in the output layer. The errors that arise during this procedure are calculated and back-propagated by resetting the network's weights and biases. This procedure is iterated until the smallest error is reached 34,41,42.
Hagan and colleagues 41 stated that a single cycle of the process is described by the following equation:

Z_{k+1} = Z_k − α_k g_k

where g_k represents the current gradient, Z_k denotes the current set of weights and biases, and α_k is the learning rate. To adjust the connection weights for a specific neuron i during a particular iteration p, the following equation outlines the process 43:

w_i(p + 1) = w_i(p) + Δw_i(p)

This equation updates the weight of the i-th neuron for the next iteration (p + 1) by adding a weight correction factor Δw_i(p) to the current weight w_i(p).
The weight correction factor Δw_i(p) is calculated based on the equation

Δw_i(p) = α × x_i(p) × e(p)

where x_i(p) is the input to neuron i and e(p) is the error at iteration p. For the j-th neuron in a hidden layer, with output y_j(p), an alternate expression for the weight correction factor is defined as follows 43,44:

Δw_jk(p) = α × y_j(p) × δ_k(p)

where δ_k(p) denotes the error gradient at neuron k in the output layer during iteration p. This equation is commonly referred to as the delta rule.
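The update Z_{k+1} = Z_k − α_k g_k can be demonstrated with a single linear neuron trained by gradient descent; this is a toy sketch on synthetic data, not the study's network:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 3))
true_w = np.array([0.5, -1.0, 2.0])
y = X @ true_w                      # noiseless target for a linear neuron

w = np.zeros(3)                     # initial weights Z_0
alpha = 0.1                         # learning rate
for _ in range(2000):
    e = X @ w - y                   # prediction error e(p)
    g = X.T @ e / len(y)            # gradient of the mean squared error
    w -= alpha * g                  # Z_{k+1} = Z_k - alpha * g_k
```

After enough cycles the weights converge to the generating coefficients, which is the behavior the back-propagation loop above iterates toward.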

K-means clustering
In this study, the K-Means algorithm is used as the primary unsupervised machine learning technique. It is selected for its simplicity and widespread application in clustering tasks. The algorithm minimizes a performance criterion called P, which is calculated as the sum of squared error distances between data points and their corresponding cluster centers 45. The algorithm begins with a random initial partition, and patterns are then reassigned to clusters based on their similarity to the cluster centers until a convergence criterion is satisfied, such as no further reassignments or a significant reduction in squared error after a certain number of iterations 46. The squared error for a clustering L of a pattern set H containing K clusters is as follows:

P = Σ_{j=1}^{K} Σ_{i=1}^{n_j} ‖x_i^(j) − c_j‖²
where x_i^(j) is the ith pattern belonging to the jth cluster and c_j is the centroid of the jth cluster. In this study, along with the K-means algorithm, the Gaussian Mixture Model will also be implemented to reinforce confidence in the outcomes derived from the K-means clustering. The Gaussian Mixture Model (GMM) offers a probabilistic approach to clustering, presenting the advantage of accommodating clusters of different sizes and orientations due to its use of covariance matrices. This capability enables the GMM to identify and adapt to elliptical or anisotropic clusters, unlike simpler algorithms like k-means, which assume isotropic clusters. Additionally, GMM provides a soft-clustering approach, assigning probabilities of membership to each point for all clusters rather than forcing a hard assignment. This results in a more nuanced understanding of the data's structure, particularly useful when the relationship between variables is complex and not easily separable into distinct groupings 47. Hence, incorporating both K-means and Gaussian Mixture Model (GMM) methods in a single study leverages the strengths of both clustering techniques.
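A minimal sketch of running both clustering methods side by side with scikit-learn, using synthetic one-dimensional log-FZI values drawn from three hypothetical flow units:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
logfzi = np.concatenate([rng.normal(m, 0.1, 70) for m in (0.0, 1.0, 2.0)]).reshape(-1, 1)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(logfzi)
gmm = GaussianMixture(n_components=3, random_state=0).fit(logfzi)

hard_labels = km.labels_                 # K-means: one hard cluster per sample
soft_probs = gmm.predict_proba(logfzi)   # GMM: membership probability per cluster
```

The GMM probabilities sum to one per sample, which is the soft-assignment behavior described above.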

Methodology Data acquisition
This study analyzes open-source data consisting of thousands of well logs from the Halibut Oil Field, which are supplemented with routine core analysis studies. A total of 212 data sets are chosen for analysis, based on the specific formation depth and the availability of porosity and permeability data at that depth. These data sets encompass 17 different types of well-log information, including Corrected Gamma Ray (CGR), Bulk Density Correction (DRHO), Delta-T Compressional (DT5), Gamma Ray (GR), High-Resolution Enhanced Thermal Neutron (HNPO), Laterolog Deep Resistivity (LLD), High-Resolution Laterolog Resistivity (LLHR), Laterolog Shallow Resistivity (LLS), Mud Resistivity (MRES), Micro Spherically Focused Resistivity (MSFC), Thermal Neutron Porosity (NPHI), Enhanced Thermal Neutron Porosity (NPOR), Potassium Concentration (POTA), Bulk Density (RHOB), Spontaneous Potential (SP), Thorium Concentration (THOR), and Uranium Concentration (URAN), alongside porosity and permeability data. The focus of this study is the FZI parameter, which is directly influenced by permeability.
It is acknowledged that the FZI exhibits a non-normal distribution, as evident from Fig. 1. Consequently, predicting the FZI directly could potentially lead to misleading results due to its extremely non-normal distribution. To address this issue, the FZI values are transformed to a logarithmic scale, aiming to approximate a normal distribution, as illustrated in Fig. 1. To provide an initial understanding of the data, Table 1 presents the data statistics, while Fig. 1 showcases the distribution of each parameter considered in the study and a pair chart for the input versus the output parameter. The cross plot between the input and the output parameters in Fig. 1b shows linear (in orange) and nonlinear (in black) relationships between the output and input parameters.
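The effect of the logarithmic transform can be checked numerically. The sketch below uses synthetic lognormal values as a stand-in for the raw FZI distribution and compares sample skewness before and after taking log10:

```python
import numpy as np

def sample_skewness(a):
    """Third standardized moment; near zero for a symmetric distribution."""
    d = a - a.mean()
    return (d ** 3).mean() / a.std() ** 3

rng = np.random.default_rng(0)
raw_fzi = rng.lognormal(mean=1.0, sigma=1.2, size=212)  # heavily right-skewed stand-in
log_fzi = np.log10(raw_fzi)                             # the LOGFZI modelling target
```

The transformed values are close to symmetric, which is why LOGFZI rather than raw FZI is used as the prediction target.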
Eighteen parameters (including LOGFZI) were chosen at the initial phase of this study, as shown in Table 1. It is necessary to reduce the number of parameters to optimize the model's dimensionality and improve its processing time 48. However, initially applying all 18 input factors allows for a more comprehensive understanding of how these parameters affect the precision with which the machine learning model predicts the flow zone indicator. Once the connection between input factors and model accuracy is better understood, it is possible to reduce the number of parameters and thereby boost model efficiency. The correlation coefficient analysis of each input parameter against the output parameter LOGFZI is presented in Fig. 2. The heat map was generated using the seaborn.heatmap Python library. Figure 2a presents Pearson's correlation coefficients, which highlight the linear relationships between the parameters, while Fig. 2b presents Spearman's correlation coefficients, which were used to exclude the effects of nonlinearity and outliers. The correlation coefficients for most parameters remained consistent, except in a few instances where the correlation either strengthened or weakened when Spearman's coefficient was calculated instead of Pearson's. This variation can be attributed to the presence of outliers or nonlinear relationships. For instance, the correlation for DTS strengthened from −0.1 to −0.3, indicating a more negative relation with LogFZI. Similarly, the LLD coefficient strengthened from −0.7 to −0.8 due to the nonlinear relation between LogFZI and LLD. Conversely, the correlation for RHOB weakened from −0.7 to −0.4.
Data normalization needs to be performed to improve integrity and reduce data redundancy, especially for algorithms that fundamentally rely on distance measures (KNN and SVR). This is normally done because the input and output data used in the study have very large differences in units and ranges. The normalization technique employed in this study is the MinMaxScaler. A significant benefit of this scaler is its capability to preserve the original shape of the dataset's distribution. This preservation is critical, as it ensures the integral information within the data remains unaffected during scaling. Unlike several other scaling methods, MinMaxScaler does not alter the core characteristics of the original data, thus maintaining the crucial details and patterns necessary for accurate analysis. The normalization formula applied is:

x_norm = (x − min) / (max − min)

where x_norm is the normalized value with a range of 0–1, x is the variable in the dataset, and max and min refer to the maximum and minimum values of the variable 42,49.
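In scikit-learn this corresponds to MinMaxScaler, applied column by column; the depth and porosity values below are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[2370.0, 0.12],
              [2400.0, 0.25],
              [2466.0, 0.18]])          # e.g. depth (ft) and porosity columns

scaler = MinMaxScaler()
X_norm = scaler.fit_transform(X)        # (x - min) / (max - min) per column
```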

Supervised machine learning design
In the supervised machine learning section of this study, the 212 datasets will be split into two groups: 65% for training purposes and 35% for testing. To prevent overfitting and leakage into the testing data, both the holdout and k-fold cross-validation methods are adopted in this study, serving a dual purpose: ensuring an unbiased evaluation of the model and a thorough assessment of its generalizability. The holdout method provides a clean dataset for final model evaluation, free from any influence of the training process. Meanwhile, k-fold cross-validation is applied to the training data to reduce the potential variance in model performance that could result from a single train-test split, which is particularly important for datasets of limited size. This nested approach is a robust strategy for hyperparameter tuning, enabling the model to demonstrate consistent performance across multiple subsets of the data, thus reinforcing its ability to generalize beyond the training sample. In this scenario, the model will continue to be trained until all folds have been used for testing once. The average score of the testing folds is recognized as the validation score (Fig. 3) 50.
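The nested holdout/k-fold scheme can be sketched with scikit-learn; synthetic data of the same size (212 samples) replace the study's logs:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_regression(n_samples=212, n_features=9, noise=5.0, random_state=0)

# Holdout: 65 % train / 35 % test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.35, random_state=0)

# k-fold CV on the training portion only; the mean fold score is the validation score
fold_scores = cross_val_score(RandomForestRegressor(random_state=0),
                              X_tr, y_tr, cv=3, scoring="r2")
validation_score = fold_scores.mean()
```

The held-out 35% is touched only once, for the final evaluation, after tuning has finished on the training folds.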
The algorithms' hyperparameters will also be tuned to determine the optimal model for each method 18. The set of input hyper-parameters for each algorithm is displayed in Tables 2, 3, 4 and 5.

Unsupervised machine learning design
In the unsupervised machine learning section, the distribution of the log FZI data will be examined through a histogram and a normal probability plot to make initial judgments regarding data clustering. A statistical method incorporating the normal probability plot will be employed, where a straight line in the plot signifies a normal distribution. If multiple straight lines with varying slopes are present, it indicates the existence of different datasets that share the same normal distribution, implying the presence of distinct clusters.
To determine the optimal number of clusters in the K-Means algorithm, the elbow criterion is utilized. The elbow criterion suggests selecting the number of clusters at which the addition of another cluster does not significantly contribute new information 51. In this study, the elbow method incorporates the Root Mean Square Error (RMSE) and R-squared as measures to evaluate the clustering of flow units 52. These metrics quantify the deviation between observed and estimated values, providing insights into the optimum cluster numbers for reservoir characterization of hydraulic flow units. Several previous studies have utilized the elbow method in conjunction with the RMSE and R-squared metrics to determine the optimal number of clusters for hydraulic flow units in reservoir characterization efforts 6,53,54.
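A sketch of the elbow criterion using K-means inertia (the within-cluster sum of squared errors) on synthetic data with four well-separated groups; the sharp drop in error flattens once the true number of clusters is reached:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
data = np.concatenate([rng.normal(m, 0.05, 50) for m in (0.0, 0.5, 1.0, 1.5)]).reshape(-1, 1)

inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(data).inertia_
           for k in range(1, 9)]

# Successive drops in inertia; the elbow sits where the drop collapses
drops = -np.diff(inertia)
```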
It is crucial to follow an organized process to obtain precise and trustworthy outcomes. Figure 4 displays the study methodology in detail.

Pre-processing machine learning model
In this work, a predictive model is constructed using three-fold cross-validation and hyperparameter optimization. Randomized search cross-validation is used as an alternative to grid search cross-validation (exhaustive cross-validation) to reduce computation time when tuning hyperparameters 55. Using this approach, 50 iterations of hyper-parameter sampling are paired with 3 folds of cross-validation to generate 150 training models, which are then assessed using the coefficient of determination (R²). The initial investigation compares models that have undergone scaling to those that have not. Evaluation metrics such as R-squared (R²), Mean Squared Error (MSE), Mean Absolute Error (MAE), and Root Mean Squared Error (RMSE) for each algorithm are compiled in Tables 6, 7, 8, and 9 and illustrated in Figs. 5, 6, 7, and 8.
The outcomes presented in Figs. 5 and 7 indicate that applying scaling techniques improves the performance of both the SVM and ANN models, with notable enhancement observed in the ANN. The SVM is recognized for its robustness, employing a margin-based loss function that effectively manages the dimensionality of the input space. However, SVM may underperform with skewed datasets, as finding the optimal separating hyperplane becomes challenging with imbalanced data 56. A similar challenge is observed with neural network algorithms, which, at their core, rely on linear regression principles. Extreme skewness in the data can substantially impact the performance of neural networks. Meanwhile, the stability of the scores for both the Random Forest and XGBoost models, even after dataset standardization, can be attributed to their foundational decision tree structure. These models utilize bootstrap sampling methods and an aggregation technique known as bagging to produce the final score. This approach equips the models with resilience against imbalanced or skewed datasets, ensuring consistent performance irrespective of data standardization 19-22.

Data processing and features reduction
To enhance machine learning model accuracy, various data processing techniques are used. Parameter reduction is achieved by analyzing the impact of excluding each variable using the feature importance method. The feature importance analysis is performed using the random forest model as the benchmark to identify the most important parameters in the dataset. The random forest model is selected for its high accuracy, as indicated by the high R-squared values observed during the pre-processing stage in both the training and testing sets. Figure 9 presents the relative importance of each input parameter in the output model, as determined by the feature importance analysis. This analysis calculates the decrease in the Mean Squared Error (MSE) of the prediction, where a higher importance score indicates a greater role of the parameter in reducing the MSE 19-22. LLHR emerges as a notable parameter, with an importance score of 27%. It is important to understand that this score does not imply that excluding LLHR would directly result in a 27% change in the model's performance. Instead, it signifies LLHR's relative contribution to enhancing the model's predictive accuracy by reducing the MSE. The parameter selection process in this study was guided by the aim to include input parameters that collectively have a substantial impact on the model's effectiveness. The cumulative relative importance from LLHR to HNPO is 51%, indicating their combined significance in the model. Therefore, the final set of selected input parameters, comprising LLHR, LLS, MSFC, LLD, CGR, NPHI, THOR, NPOR, and HNPO, was chosen based on their collective ability to decrease the MSE and improve the model's overall predictive performance, rather than solely on their individual importance scores.
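The selection step, ranking features by impurity-based importance and keeping the smallest set whose cumulative score passes 50%, can be sketched as follows (synthetic data, with the informative signal concentrated in a few columns):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=212, n_features=17, n_informative=5, random_state=0)
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

importances = rf.feature_importances_          # MSE-reduction shares, summing to 1
order = np.argsort(importances)[::-1]          # rank features from most to least important

# Smallest top-ranked set whose cumulative importance exceeds 50 %
cumulative = np.cumsum(importances[order])
selected = order[: np.searchsorted(cumulative, 0.5) + 1]
```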
The feature importance analysis results are in line with the existing literature, as these parameters demonstrate a strong relationship with the calculation of FZI using a physics-based approach. The LLHR (Laterolog High-Resistivity), LLS (Laterolog Shallow), and LLD (Laterolog Deep) logs are crucial resistivity logs utilized in formation evaluation. The study in ref. 57 explored the relationship between resistivity and permeability using known water saturation and the apparent formation factor, and demonstrated a strong relationship between the two. The MSFC log provides quantitative resistivity data at a micro-scale and can be converted into visual images, allowing for detailed core permeability description through visual examination. Bourke58 observed a strong visual correlation between micro-resistivity and permeability images, indicating their potential for capturing porosity–permeability variations. Micro-resistivity data offer high-resolution permeability transformation, surpassing traditional logs, and have been used in various studies for permeability assessment and characterization. These findings highlight the significance of the MSFC log in permeability prediction.
Yao and Holditch59 established a correlation between core permeability and open-hole well-log data, highlighting the significance of the relationship between gamma ray and permeability estimation, which ultimately contributes to the estimation of FZI. Thus, CGR is an important parameter in this model. NPHI, NPOR, and HNPO are different versions of thermal neutron porosity logs widely used for characterizing reservoir porosity. These logs have been extensively studied in combination with other parameters to determine lithology and estimate clay volume, which reflects their importance for FZI prediction. The THOR (Thorium Concentration) log measures the thorium concentration in parts per million (ppm) using energy emissions from radioactive minerals, which are detected by the spectral gamma ray log. According to ref. 60, high concentrations of thorium are indicative of dark, organic-rich shale, as well as calcareous, brittle, and fractured shale. Hence, thorium concentrations are directly related to permeability, porosity, and rock type.
In addition to parameter reduction, data transformation using the Yeo-Johnson method was applied to the dataset. This transformation technique is employed to address the issue of non-normality in the data distribution. By employing this transformation, the data distribution becomes more symmetrical, thus meeting the assumptions of certain statistical models and improving the accuracy of subsequent analyses.
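A minimal sketch of this transformation step, assuming scikit-learn's implementation of Yeo-Johnson and using synthetic right-skewed data rather than the study's logs:

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Heavily right-skewed placeholder data standing in for a skewed log curve;
# these values are synthetic, not the study's measurements.
rng = np.random.default_rng(1)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(212, 1))

# Yeo-Johnson estimates a power parameter by maximum likelihood and maps the
# data toward symmetry; standardize=True additionally zero-centers the result.
pt = PowerTransformer(method="yeo-johnson", standardize=True)
x_t = pt.fit_transform(x)

print("skewness before:", round(float(stats.skew(x[:, 0])), 2))
print("skewness after:", round(float(stats.skew(x_t[:, 0])), 2))
```

Unlike the Box-Cox transform, Yeo-Johnson also handles zero and negative values, which makes it a convenient default for mixed well-log curves.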

Post-data-processing machine learning model
In this step, each machine learning model retains the same hyper-parameter combination as its counterpart from the previous stage. Figures 10, 11, 12 and 13 present the results of the machine learning algorithms applied after data processing. These cross-plots show the capability of the different machine learning models to predict the flow zone indicator, with most of the data aligned along the 45-degree line. Additionally, Figs. 14 and 15 and Tables 10 and 11 summarize the models' performance after the scaling and transformation process.
Model evaluations demonstrate steady efficacy throughout the training, validation, and testing stages. The Random Forest model stands out with the highest accuracy in training and testing, at 0.9566 and 0.9081, respectively. Table 11 and Fig. 14 collectively suggest that the models retain high accuracy after data processing. Table 12 compares the final models with the initial models, which did not undergo data processing; the final models, which incorporated scaling and transformation, exhibited clear enhancements. This is particularly noticeable for the ANN model, which, as previously discussed, showed significant improvement. Given the superior performance of the final models, it is recommended to use the post-processed models for future research, as they offer a well-tuned blend of dimensionality reduction and predictive capability.
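The hyper-parameter search behind these models (random search with 3-fold cross-validation; the methodology reports 50 iterations) can be sketched as follows. The parameter distributions and the reduced `n_iter` here are illustrative assumptions, not the study's actual search space, and the data are synthetic with the study's training-set dimensions.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem with the study's training-set size
# (159 rows, 9 predictors); target and parameter ranges are illustrative.
rng = np.random.default_rng(2)
X = rng.normal(size=(159, 9))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=159)

# The paper reports random search over 50 iterations with 3-fold CV;
# n_iter is reduced here only to keep the sketch quick.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        "n_estimators": randint(50, 300),
        "max_depth": randint(2, 12),
        "min_samples_leaf": randint(1, 5),
    },
    n_iter=10,
    cv=3,
    scoring="r2",
    random_state=42,
)
search.fit(X, y)
print("best CV R2:", round(search.best_score_, 3))
print("best params:", search.best_params_)
```

Random search samples candidate combinations instead of exhaustively enumerating a grid, which keeps tuning tractable when several hyper-parameters interact.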

Initial observation
Figure 16 displays the histogram of FZI values, showing a non-normal distribution. Despite attempts to transform the heavily non-normal FZI data to log FZI (Fig. 17), the resulting distribution remains non-normal, because FZI is influenced by the direction of fluid flow (permeability) and requires further averaging or upscaling. Consequently, determining the number of hydraulic flow units (HFUs) solely from this plot is challenging. The histogram represents overlapping individual normal distributions, necessitating the isolation and identification of these individual distributions to accurately estimate the number of HFUs54. Therefore, while the histogram provides insights into the variation of HFU distribution across the formation, it offers a qualitative analysis rather than a precise count of HFUs.
The normal probability plot is used as a statistical technique to assess the normality of a dataset. The presence of multiple straight-line segments in the plot indicates the presence of different hydraulic flow units (HFUs), each with its own normal distribution. Figure 18 displays nine distinct straight lines, suggesting the existence of nine HFUs in the formation. However, it is important to note that this approach relies on statistical analysis and visual interpretation, which can be subjective, so caution should be exercised when interpreting the results. Despite its limitations, this method is a valuable tool in data analytics and provides insights into the properties and behavior of a reservoir.
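The multi-segment behaviour described above can be reproduced numerically with `scipy.stats.probplot`. The sketch below uses two well-separated synthetic log-normal populations as a stand-in for multi-HFU FZI data; a mixture bends the probability plot into segments, which lowers the correlation of a single straight-line fit relative to one homogeneous population.

```python
import numpy as np
from scipy import stats

# Two well-separated log-normal populations as a stand-in for multi-HFU FZI
# data; the values are illustrative, not the study's measurements.
rng = np.random.default_rng(3)
fzi = np.concatenate([rng.lognormal(1.0, 0.2, 100), rng.lognormal(3.0, 0.2, 100)])
log_fzi = np.log10(fzi)

# probplot orders the data against theoretical normal quantiles and fits one
# straight line; a mixture of populations bends the plot into segments,
# lowering the single-line correlation r.
(osm, osr), (slope, intercept, r_mix) = stats.probplot(log_fzi, dist="norm")

# Reference: one normal population of the same size fits a single line well.
(_, _), (_, _, r_single) = stats.probplot(rng.normal(size=200), dist="norm")
print("single-line r, mixture:", round(r_mix, 3))
print("single-line r, one population:", round(r_single, 3))
```

In practice each near-linear segment of such a plot would be read off visually, which is exactly the subjective step the paragraph above cautions about.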

The optimum cluster number
In the initial stage of K-Means clustering, the elbow method is utilized to determine the optimal number of flow units (clusters). In this study, the elbow method uses RMSE and R-squared evaluations to determine the optimal number of flow units61. The results of the elbow method are displayed in Fig. 19.
The RMSE and R-squared approaches may suggest different optimal HFU values. Considering the previously assessed heterogeneity, for the RMSE method the optimum is taken as the last HFU count whose RMSE improvement over the preceding count still exceeds 10%. Thus, the optimum HFU number for RMSE is 10, as at 11 the improvement drops below 10%. In contrast, the R-squared method shows very small differences between R-squared values for each HFU number, so its interpretation relies on visual inspection of the plot. Examination of the plot shows that the curve is most nearly horizontal beyond an HFU count of 10. Consequently, an HFU value of 10 is considered the optimum based on both the RMSE and R-squared approaches.
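The elbow criterion above can be sketched as follows on synthetic data. Inertia (within-cluster sum of squares) is used here in place of the paper's RMSE/R-squared curves, which flatten in the same way; the three-blob dataset and the threshold logic are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Elbow scan on three well-separated synthetic blobs standing in for
# flow-unit groupings.
rng = np.random.default_rng(4)
data = np.concatenate([rng.normal(c, 0.2, size=(70, 2)) for c in (0.0, 3.0, 6.0)])

inertias = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(data)
    inertias[k] = km.inertia_

# The paper's style of criterion: pick the last k whose relative improvement
# over k-1 still exceeds a threshold (10% in the study).
for k in range(2, 8):
    drop = 1.0 - inertias[k] / inertias[k - 1]
    print(f"k={k}: inertia drop {drop:.0%}")
```

The relative drop is large up to the true cluster count and flattens beyond it, which is the "elbow" the method looks for.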

The K-means clustering
After determining the optimum cluster number, the K-Means clustering algorithm is utilized. The selected optimum HFU value is 10, and to ensure consistent and reproducible results, the random state parameter is set to 42 during the initialization of the K-Means clustering model. This parameter controls the random initialization of cluster centroids62; by using a specific random seed, the same initial centroids are used each time the code is run. Both the K-means and Gaussian mixture model (GMM) methods delineated similar clusters within the dataset. Although the labeling of the clusters differs between the two methods, the composition of the data points within corresponding clusters is largely analogous. This consistency between the K-means and GMM clustering outcomes suggests that both methods are capturing the inherent groupings within the dataset effectively. The parallelism in results reinforces the reliability of the clustering, affirming that the dataset possesses a structure that is robust to the clustering technique applied. The congruence of these clustering methods provides a validated foundation for further analysis.
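Because cluster labels from K-means and a GMM carry arbitrary numbering, the agreement described above is best quantified with a label-invariant score such as the adjusted Rand index (1.0 means identical partitions up to relabeling). The sketch below demonstrates this on placeholder data, not the study's 212-point set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.mixture import GaussianMixture

# Three well-separated synthetic blobs standing in for flow-unit groupings.
rng = np.random.default_rng(5)
data = np.concatenate([rng.normal(c, 0.3, size=(50, 2)) for c in (0.0, 4.0, 8.0)])

# Fixed random_state values make both partitions reproducible, as in the study.
km_labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(data)
gmm_labels = GaussianMixture(n_components=3, random_state=42).fit_predict(data)

# ARI compares the groupings themselves, ignoring the arbitrary label numbers.
ari = adjusted_rand_score(km_labels, gmm_labels)
print("adjusted Rand index:", round(ari, 3))
```

A high ARI between the two methods is the quantitative counterpart of the qualitative "largely analogous composition" observation in the text.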
The performance of the clustering method was assessed by calculating the permeability from the FZI values for each flow unit cluster using Eq. (5) and comparing it to the actual permeability. Figure 22 displays the comparison between predicted and actual permeability values. The results indicate a high R-squared value of 0.93, demonstrating the effectiveness of the clustering method. This outcome validates the evaluation of reservoir heterogeneity, the determination of the optimum number of HFUs, and the utilization of FZI for clustering. Table 13 provides the average permeability and porosity values for each flow unit cluster. It is important to note that, when addressing heterogeneity, the choice of averaging method (arithmetic, harmonic, geometric) for permeability depends on the distribution of permeabilities within the rock during deposition63. By examining the FZI values alongside their respective average permeabilities, it is possible to predict the permeability quality of a specific location, thus enabling an assessment of its potential for fluid flow.
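Eq. (5) itself is not reproduced in this section; as a sketch, the helper below uses the widely cited Amaefule-type FZI-permeability relation, with the 1014 constant corresponding to FZI in micrometres, permeability in mD, and fractional porosity. Whether this constant and unit convention match the study's Eq. (5) (whose FZI values run to several hundred) is an assumption.

```python
def perm_from_fzi(fzi: float, phi: float) -> float:
    """Permeability (mD) from FZI (micrometres) and fractional porosity via
    the standard Amaefule-type relation k = 1014 * FZI^2 * phi^3 / (1 - phi)^2.
    NOTE: assumed stand-in for the paper's Eq. (5); constant is unit-dependent."""
    return 1014.0 * fzi**2 * phi**3 / (1.0 - phi) ** 2

# Example: a flow unit with FZI = 5 (micrometres) and 20% porosity.
print(round(perm_from_fzi(5.0, 0.20), 1))  # 316.9 mD
```

The quadratic dependence on FZI is what makes small FZI differences between flow units translate into large permeability contrasts, as seen in Table 13.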

Models validation and applications in unsampled formations
Figure 23 shows the results of applying the four models to an additional unseen location. The trend reveals that the ANN model performs the poorest, followed by the SVM, while the XGB and RF models exhibit the highest performance. This result is consistent with the values presented in Tables 10 and 11, which show that the RF model is the optimal choice for predicting the FZI. This decision is based on the highest R-squared and lowest error scores obtained from both the training and testing datasets, with their proximity indicating good generalization.
Several studies further support this decision by acknowledging that different models can be acceptable after data processing, as each model possesses strengths based on the nature of the data. The superior performance of the Random Forest model over the others, particularly SVM and ANN, in this study is likely due to the characteristics of the dataset. Moreover, the dataset used in this study is relatively small, comprising only 159 training and 53 testing data points. This condition is disadvantageous for algorithms like SVM, which depend on the spatial dimensions of the data, and for ANN models, which are based on fundamental linear regression calculations56,64. However, this limitation does not significantly impact the Random Forest algorithm, which employs bootstrap sampling and a technique called bagging for final score computation. This method allows the algorithm to draw each tree's training set randomly from the whole training set with replacement and to randomly select M features from the full set of input variables. Such a methodology makes the model robust against imbalanced or skewed datasets, ensuring stable performance regardless of data standardization19–23. It is crucial to understand that the results of the machine learning models in this study are specific to the dataset employed and should not be generalized. The effectiveness of each algorithm heavily depends on the characteristics of the data used, meaning Random Forest may not always outperform other algorithms in different scenarios.
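The resampling behind bagging, as described above, can be sketched in a few lines: each tree sees a bootstrap sample of rows (drawn with replacement) and a random subset of M features. The sizes mirror the study's 159-row training set; M = 3 is an arbitrary illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n_samples, n_features, m = 159, 9, 3  # study's training size; m is illustrative

row_idx = rng.integers(0, n_samples, size=n_samples)      # rows, with replacement
feat_idx = rng.choice(n_features, size=m, replace=False)  # M distinct features

# Sampling with replacement leaves roughly a third of rows "out-of-bag",
# i.e. unseen by that particular tree.
oob_fraction = 1.0 - len(set(row_idx.tolist())) / n_samples
print("unique rows drawn:", len(set(row_idx.tolist())))
print("out-of-bag fraction:", round(oob_fraction, 2))
```

Averaging many trees trained on such decorrelated samples is what gives the forest its stability on small or skewed datasets.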
Figure 24 illustrates the application of the ML models to an unseen dataset and the forecasting of FZI values with the random forest model in an unsampled location. Figure 24a reveals a noticeable resemblance between the predicted and observed trends in the FZI data, a finding of significant value for reservoir modeling scenarios. Upon careful examination of Fig. 24b, two distinct depth ranges emerge as potential reservoir development zones. The first zone, represented by the red box, has an approximate thickness of 10 ft and an average FZI value of 500. Based on the clustering analysis presented in Table 13, it is likely associated with HFU number 8, which displays a harmonic average permeability of 2806 millidarcy (mD). The second zone, indicated by the blue box, spans approximately 15 ft and exhibits an average FZI value of around 800. Referring to Table 13, HFU number 9 is linked to an FZI value of around 800, suggesting the presence of a zone characterized by remarkably high permeability of approximately 6410 mD. These findings strongly indicate the existence of favorable reservoir zones within the delineated areas. By combining the clustering analysis of HFUs with machine learning models that predict FZI from well-log data, it becomes possible to estimate potential reservoir characterization zones. However, for an optimized approach to hydrocarbon recovery, it is imperative to consider additional petrophysical properties such as water and hydrocarbon saturation. Furthermore, accurate calculations of the initial hydrocarbon in place within these predicted potential zones should be incorporated into the analysis.

Conclusion
This study utilized state-of-the-art machine learning methodologies to augment the efficacy of reservoir characterization. The supervised learning algorithms, including Random Forest (RF), Extreme Gradient Boosting (XGB), Support Vector Machines (SVM), and Artificial Neural Network (ANN), were used to predict Flow Zone Indicator (FZI) values in unsampled locations, while the unsupervised K-Means and Gaussian mixture clustering techniques were employed to classify Hydraulic Flow Units (HFUs) in the reservoir. The findings of this study are summarized as follows:
• The four implemented algorithms demonstrate robust performance in estimating the flow zone indicator of the reservoir, yielding high coefficients of determination (R2) of 0.89 and 0.95 in the training and testing datasets, respectively.
• The RF model emerged as the optimal choice for FZI prediction in unsampled locations, with R2 values of 0.957 for training and 0.908 for testing.
The study's findings hold significant implications for reservoir characterization practices in the petroleum industry. The successful integration of machine learning, particularly Random Forest, into conventional workflows allows for rapid and cost-effective reservoir assessments. This approach not only enhances decision-making speed but also identifies specific zones with high-quality reservoir potential. The study showcases the robustness of machine learning in petroleum engineering applications, marking a shift towards more efficient and accurate reservoir characterization. To further advance the field, future research should explore additional machine learning models, incorporate a broader set of features for a comprehensive analysis, and validate the results on different datasets.
https://doi.org/10.1038/s41598-024-54893-1

Figure 1 .
Figure 1. Histogram of 17 well-log parameters, illustrating the diverse distribution types for each parameter. The LOGFZI distribution demonstrates a closer resemblance to a log-normal distribution compared to the original FZI distribution.

Figure 2 .
Figure 2. Heatmap of correlation coefficients between each parameter, illustrating the strength of correlation for all parameters; (A) Pearson's coefficients, (B) Spearman's coefficients. The LOGFZI exhibits noticeably strong correlations (mostly negative) with several parameters. (Heat map generated using the seaborn.heatmap Python library.)

Figure 4 .
Figure 4. Workflow of the study for supervised and unsupervised machine learning models.

Figure 5 .
Figure 5. Coefficient of determination (R2) summary for the different ML methods using unscaled datasets.

Figure 6 .
Figure 6. Error summary for the different ML methods using unscaled datasets.

Figure 7 .
Figure 7. Coefficient of determination (R2) summary for the different ML methods using scaled datasets.

Figure 8 .
Figure 8. Error summary for the different ML methods using scaled datasets.

Figure 9 .
Figure 9. Feature importance analysis quantifying the significance of each parameter in the model construction.

Figure 16 .
Figure 16. The distribution of FZI values.

Figure 17 .
Figure 17. The histogram of the log FZI data.

Figure 18 .
Figure 18. The normal probability plot for the log FZI data.

Figure 19.
Figure 19. The elbow method plot illustrating the RMSE and R-squared values.

Figure 23 .
Figure 23. Performance of the different algorithms on additional unseen data.

Figure 24 .
Figure 24. (a) Comparison of the random forest and other algorithms' performance on unseen sample data; (b) prediction of FZI values for unsampled data based on the random forest model.

Table 1 .
Statistical summary of 17 well-log parameters, including FZI and LOGFZI.

Table 12 .
Relative differences between initial and final model scores (R2).

Table 13 .
Average porosity and permeability for each HFU.