Hierarchical automated machine learning (AutoML) for advanced unconventional reservoir characterization

Recent advances in machine learning (ML) have transformed the landscape of energy exploration, including hydrocarbon, CO2 storage, and hydrogen. However, building competent ML models for reservoir characterization necessitates specific in-depth knowledge in order to fine-tune the models and achieve the best predictions, limiting the accessibility of machine learning in geosciences. To mitigate this issue, we implemented the recently emerged automated machine learning (AutoML) approach to perform an algorithm search for conducting an unconventional reservoir characterization with a more optimized and accessible workflow than traditional ML approaches. In this study, over 1000 wells from Alberta's Athabasca Oil Sands were analyzed to predict various key reservoir properties such as lithofacies, porosity, volume of shale, and bitumen mass percentage. Our proposed workflow consists of two stages of AutoML predictions: (1) the first stage focuses on predicting the volume of shale and porosity by using conventional well log data, and (2) the second stage combines the predicted outputs with well log data to predict the lithofacies and bitumen percentage. The findings show that, out of the ten different models tested for predicting the porosity (78% in accuracy), the volume of shale (80.5%), bitumen percentage (67.3%), and lithofacies classification (98%), distributed random forest and gradient boosting machine emerged as the best models. When compared to the manually fine-tuned conventional machine learning algorithms, the AutoML-based algorithms provide a notable improvement in reservoir property predictions, with weighted average f1-scores higher by up to 15–20% in the classification problem and 5–10% in the adjusted-R2 score for the regression problems in the blind test dataset, achieved after only ~ 400 s of training and testing.
In addition, the feature ranking extraction technique shows good agreement with domain experts regarding the most significant input parameters in each prediction. Therefore, it is evident that the AutoML workflow is powerful in performing advanced petrophysical analysis and reservoir characterization with minimal time and human intervention, allowing more accessibility to domain experts while maintaining the model's explainability. Integration of AutoML and subject matter experts could advance artificial intelligence technology implementation in optimizing data-driven energy geosciences.


Athabasca oil sands
The study area is located in the Athabasca oil sands in Alberta, Canada, which is considered one of the world's largest bitumen deposits 31. The majority of these bitumen resources were discovered in four major deposits: Athabasca, Cold Lake, Wabasca, and Peace River 32 (Fig. 1). With estimated resources of around 1 trillion barrels of bitumen, the Athabasca is considered the world's largest bitumen deposit 33,34. These deposits are part of the Western Canada sedimentary basin, which is bounded on the west by the Rocky Mountains and on the east by the Canadian Shield and is divided into two sections: the Williston intracratonic basin in the southwest and the Alberta foreland basin (Fig. 1). The basin was formed during the Paleozoic rifting period, which was followed by the development of a passive margin due to thermal subsidence 35. The Devonian-aged mixed succession of carbonate, evaporites, and shales deposited along the passive margin represents the oldest preserved sediments in the Athabasca oil sand deposits. As a result, several studies suggest that these Devonian shales may be a source rock for the Athabasca Oil Sands 36. This was followed by a period of siliciclastic deposition from the Late Paleozoic to the Late Jurassic, which could have resulted in the formation of the Jurassic source rock 35. The development of the Rocky Mountains fold and thrust belt, which controlled the deposition of the foreland basin megasequence, resulted in a significant shift in sediment provenance during the Late Jurassic.
This megasequence was dominated by siliciclastic deposition during the Early Cretaceous, and it includes the Lower Cretaceous Mannville group reservoirs, the primary reservoir interval in the Athabasca oil sands 35. The McMurray Formation, which unconformably overlies the Devonian carbonate, is the first Mannville group sedimentary unit found in Alberta, followed by the Wabiskaw member of the Clearwater Formation, which sits unconformably on the McMurray Formation 33 (Fig. 1). The primary reservoirs in the Athabasca oil sands are the McMurray-Wabiskaw clastic deposits, which are capped by the Clearwater Formation shales as the ultimate regional seal 37. In general, the McMurray-Wabiskaw interval is composed primarily of a deepening-upward complex system of sediments controlled by the sub-Cretaceous unconformity configuration 37. These deposits are primarily composed of four facies associations: fluvial, tidal flat, tidal bar complex, and tidal bar cap 38. The McMurray and Wabiskaw reservoirs have a thickness of up to 40 m and a porosity of up to 30% 39. The majority of the Athabasca oil sand is hosted in the Lower Cretaceous McMurray-Wabiskaw interval, from which the majority of bitumen resources can be recovered using thermal in-situ and surface mining methods 40.

Well log data
This study utilized a publicly available dataset of 2173 wells provided by the Alberta Geological Survey as part of a regional study conducted in 1985. The primary goal of acquiring this dataset was to map the Lower Cretaceous McMurray Formation and the overlying Wabiskaw member of the Clearwater Formation in the Athabasca Oil Sand area of Alberta, Canada. The following petrophysical and other measurements are available: lithology log (LITH), bitumen mass percentage (W_Tar), water saturation (Sw), shale volume (VSH), porosity (PHI), and water resistivity (Rw). A suite of well logs with variable coverage, such as gamma ray (GR), resistivity (ILD), caliper (CALI), density (RHOB), neutron (NPHI), and porosity derived from density (DPHI), is also available (Fig. 2). Four distinct lithologies were identified using 750 wells and core data analysis (Sand, Shaly Sand, Shale, and Coal; Fig. 2). According to the attached report from the Alberta Geological Survey in 1994, the interpreted lithology log was then populated using various petrophysical equations, primarily the volume of shale and porosity calculated from the density and neutron logs.

Exploratory data analysis
In this study, we followed a standard exploratory petrophysical data analysis workflow to preprocess the data and unravel any statistical patterns/trends (Fig. 3). The Python programming language and its libraries (e.g., pandas, scikit-learn) were used to process and analyze the available data. Because this study involves a large number of well data, data cleaning was performed by sorting, rescaling, grouping, and reformatting to ensure the data is uniform and ready for machine learning analysis (Fig. 3). In addition, data preparation required analyzing the outlier values/trends observed in the well log values using log normalization across different wells, removing outliers, and scaling for consistency. To avoid miscalculation and error during machine learning training and prediction, all missing values were removed from the dataset. The exploratory data analysis was carried out using various visualization techniques such as cross-plots and histograms. This step is critical for identifying patterns and analyzing anomalous values using descriptive statistics. It is also useful for determining the significance of certain features in order to aid the prediction of logs based on the identified relationships, which can be recognized in Fig. 4.
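As an illustration of the cleaning steps described above, a minimal pandas sketch might look as follows. The function name, column names, and thresholds are illustrative assumptions, not the authors' exact code:

```python
import pandas as pd

# Hypothetical sketch of the cleaning steps: drop missing values, remove
# z-score outliers, and rescale for consistency. Column names mirror the
# log mnemonics in the dataset (GR, ILD, NPHI, DPHI).
def clean_logs(df: pd.DataFrame, z_thresh: float = 3.0) -> pd.DataFrame:
    logs = ["GR", "ILD", "NPHI", "DPHI"]
    df = df.dropna(subset=logs).copy()          # remove all missing values
    z = (df[logs] - df[logs].mean()) / df[logs].std()
    df = df[(z.abs() <= z_thresh).all(axis=1)]  # drop outlier rows
    # rescale each log to [0, 1] for consistency across wells
    df[logs] = (df[logs] - df[logs].min()) / (df[logs].max() - df[logs].min())
    return df
```

In practice the normalization would be applied per well (log normalization across different wells, as described above) rather than globally; the global version here keeps the sketch short.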

Machine learning
Supervised machine learning
Several supervised machine learning models were evaluated and compared as baseline models against the AutoML model in this study. Logistic regression and a gradient boosting machine classifier were used for the discrete task (facies prediction), while linear regression and a gradient boosting machine regressor were used to predict continuous data. For example, the gradient boosting machine is utilized for VSH and W_Tar, while the random forest regressor is utilized for PHI. The total dataset was divided into 80% training and 20% blind test for all learning techniques. The training dataset was further divided into 80% for training and 20% for validation. The data for training and validation was completely separated from the test set in order to obtain independent results. The following set of logs was used as training features to predict the lithology log: GR, DPHI, NPHI, and ILD. The same input logs were also used to predict VSH and W_Tar.
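The splitting scheme above (20% blind test held out first, then the remainder split 80/20 into training and validation) can be sketched with scikit-learn as follows; the function and variable names are illustrative:

```python
from sklearn.model_selection import train_test_split

# Sketch of the 80/20 splitting scheme described above. X is the feature
# matrix (e.g., GR, DPHI, NPHI, ILD) and y the target log; names are
# illustrative, not the authors' exact code.
def split_dataset(X, y, seed=1234):
    # hold out 20% as a blind test set, fully separated from training
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed)
    # split the remainder again: 80% training, 20% validation
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=seed)
    return X_tr, X_val, X_test, y_tr, y_val, y_test
```

Note that this yields a 64/16/20 split of the full dataset, which is what two successive 80/20 splits imply.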
Breiman 41 first introduced the random forest (RF) algorithm as an ensemble supervised machine learning algorithm that relies on decision trees. In each tree, RF combines bagging and different bootstrapping processes, adding an extra layer of randomness to the model. Furthermore, while the RF algorithm is inspired by the decision tree algorithm, it introduces randomness in separating each node and selecting the best predictors in that node 42. Overall, when compared to a decision tree, RF reduces overfitting and its performance is robust to outliers in the dataset 42,43. The gradient boosting machine (GBM) is a concept that was developed to iteratively improve the performance of weak learners and create an efficient learner 44,45. In general, GBM consists of three components: (1) a loss function to be optimized; (2) a weak learner, typically a decision tree, to make predictions; and (3) an additive model that adds weak learners to minimize the loss function. The main advantage of GBM is its ability to work with large and complex datasets, as well as its robustness to bias and outliers in the dataset. However, GBM, like RF, can be costly to train and tune. Furthermore, GBM is known to suffer from overfitting to training datasets, so regularization methods (L1 and L2), as implemented in the extreme gradient boosting algorithm (XGB), are required to mitigate this issue.
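A minimal scikit-learn sketch of these two baselines, run with default parameters as in the study, is shown below on synthetic stand-in data (the data and target are purely illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Synthetic stand-in data: four features mimicking the GR, DPHI, NPHI,
# and ILD inputs; the linear target is illustrative only.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 2.0 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=500)

gbm = GradientBoostingRegressor().fit(X, y)  # default hyperparameters only
rf = RandomForestRegressor().fit(X, y)       # default hyperparameters only

print(f"GBM train R^2: {gbm.score(X, y):.2f}")
print(f"RF  train R^2: {rf.score(X, y):.2f}")
```

Both estimators expose the same `fit`/`predict` interface, which is what makes the like-for-like baseline comparison in the study straightforward.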

Automated machine learning (AutoML) implementation
Recent advances in artificial intelligence technologies enable the development and implementation of automated machine learning (AutoML), which automates the architectural design, selection, and parameterization of machine learning models 26,46. In this study, we chose H2O, an open-source, distributed machine learning platform built to scale to large datasets, as the AutoML tool because of its scalability, user-friendliness, versatility, and extensive libraries for exploring models 47. In this case, AutoML employs a combination of random grid search and stacked ensembles, as diverse models improve the accuracy of the ensemble method. To make the tool accessible to non-experts, only a few parameters are required to train the model within the H2O tool in this study. These parameters serve as constraints for the AutoML process, so as soon as any of them is met, the AutoML process will stop:
▪ Max_runtime_secs: This constraint specifies the amount of time the AutoML process will run to train various models (e.g., generalized linear model (GLM), gradient boosting machine (GBM), and distributed random forest (DRF)), followed by fine-tuning the associated hyperparameters and evaluating the best models based on certain metrics (e.g., root mean squared error). This is based solely on predefined parameters until the runtime is reached.
▪ Max_models: This specifies the number of models to be included in the AutoML process. Stacked ensemble models are an exception; they combine the different models to obtain the best results.
▪ Seed: This option specifies the random number generator (RNG) seed for algorithms that are dependent on randomization.
In this work, the following conditions were applied while running the H2O AutoML modelling, including the training and validation process: max_models = 10, max_runtime_secs = 400, seed = 1234. In addition, we excluded the stacked ensemble model generated by the H2O tool to allow a fair comparison with other conventional ML models.
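The run configuration above can be expressed with H2O's Python API roughly as follows. This is a configuration sketch only: it requires a running H2O cluster, and the file name, feature list, and target column are illustrative assumptions:

```python
import h2o
from h2o.automl import H2OAutoML

h2o.init()                                 # start or attach to an H2O cluster
frame = h2o.import_file("wells.csv")       # illustrative file name
features, target = ["GR", "DPHI", "NPHI", "ILD"], "VSH"

aml = H2OAutoML(
    max_models=10,                         # number of base models to train
    max_runtime_secs=400,                  # stop after ~400 s of training
    seed=1234,                             # reproducible randomization
    exclude_algos=["StackedEnsemble"],     # fair comparison with single models
)
aml.train(x=features, y=target, training_frame=frame)
print(aml.leaderboard)                     # models ranked by the default metric
```

The leader model is then available as `aml.leader` for prediction on the validation and blind test frames.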

Evaluation metrics
The models were evaluated using various evaluation metrics, such as the adjusted coefficient of determination (adjusted R2; Eq. 1), root mean squared error (RMSE; Eq. 2), and mean absolute error (MAE; Eq. 3) for regression tasks. For regression tasks, the adjusted R2 is insensitive to insignificant independent variables, which better captures the model performance 48.
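Equations (1)–(3) are not reproduced in this excerpt; the sketch below implements the standard definitions these metrics are assumed to follow, with `n_predictors` denoting the number of independent variables:

```python
import math

# Standard definitions assumed for Eqs. (1)-(3); function and variable
# names are illustrative.
def rmse(y_true, y_pred):
    """Root mean squared error (assumed form of Eq. 2)."""
    return math.sqrt(sum((t - q) ** 2 for t, q in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error (assumed form of Eq. 3)."""
    return sum(abs(t - q) for t, q in zip(y_true, y_pred)) / len(y_true)

def adjusted_r2(y_true, y_pred, n_predictors):
    """Adjusted R^2 (assumed form of Eq. 1): penalizes insignificant predictors."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - q) ** 2 for t, q in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)
```

The `(n - 1)/(n - n_predictors - 1)` factor is what makes the adjusted R2 insensitive to adding uninformative predictors, as noted above.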
For classification evaluation, the confusion matrix, precision, recall, and f1-score were also accounted for, based on the counts of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Precision is calculated as TP/(TP + FP), while recall is TP/(TP + FN). The classification accuracy, (TP + TN)/(TP + TN + FP + FN), and the f1-score, 2 × (precision × recall)/(precision + recall), are the most widely used metrics to evaluate the performance of machine learning algorithms for classification problems 23.
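As a minimal sketch, these ratios can be written directly from the confusion-matrix counts (the function name is illustrative):

```python
def classification_scores(tp, fp, tn, fn):
    """Precision, recall, accuracy, and f1-score from confusion-matrix counts."""
    precision = tp / (tp + fp)                  # TP/(TP + FP)
    recall = tp / (tp + fn)                     # TP/(TP + FN)
    accuracy = (tp + tn) / (tp + tn + fp + fn)  # (TP + TN)/(all)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, accuracy, f1
```

For the multi-class lithofacies problem, these quantities are computed per class and then combined as a weighted average, as reported in Table 2.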

Petrophysical properties prediction
For simplicity, all the algorithms involved in this study were implemented using default parameters, i.e., each algorithm was run without specifying any related parameters. This is primarily to avoid the fine-tuning of hyperparameters associated with specific algorithms.
The first experiment used a linear regression-based algorithm to predict three different continuous logs: volume of shale (VSH), porosity (PHI), and mass percentage of bitumen (W_Tar). The first model was trained to predict the volume of shale (VSH), and it scored 71.15% adj_R2, 1.45% RMSE, and 8.32% MAE in the training phase. During the validation phase, the model received the following scores: 70.43% adj_R2, 1.46% RMSE, and 8.29% MAE (Table 1). The same model was then used to predict VSH on a completely separate dataset as a blind test of model performance. The model received the following scores: 71.93% adj_R2, 1.52% RMSE, and 8.73% MAE. This demonstrates very similar performance during training and generalization during the blind test (Table 1). In the porosity (PHI) prediction, the model predicted PHI with 70.29% adj_R2, 0.53% RMSE, and 3.13% MAE (Table 1). In the validation phase, the model achieved the following results: 69.68% adj_R2 and 0.53% RMSE (Table 1). The other continuous log to be predicted is the mass percentage of bitumen (W_Tar), which has sparse sampling within the available dataset. As a result, predicting such a feature is expected to be more challenging due to insufficient overall data to train the model and evaluate its performance. Using a similar linear regression algorithm to train the model, the following scores were reported during the training phase: adj_R2 of 12.96%, RMSE of 1.22%, and MAE of 3.43% (Table 1). When applied to the validation dataset, the model produced similar results: 13.55% adj_R2, 1.22% RMSE, and 3.43% MAE. The test results, on the other hand, revealed a dramatic drop in performance: adj_R2 of 1.1%, RMSE of 1.22%, and MAE of 3.04% (Table 1). This result can be explained by the lack of sufficient sampling for training and high bias within the available dataset. As a result, the model is unable to provide a reasonable prediction during the training, validation, and blind test phases.
A similar approach has been used with various supervised machine learning techniques, but with more sophisticated and resource-intensive algorithms such as the gradient boosting machine (GBM) and random forest (RF). Using the same training and validation datasets, these algorithms were employed to predict the three different parameters. Learning algorithms such as GBM and RF can be customized using a variety of hyperparameters, but for simplicity, no pre-set parameters were used in this study; these learning models were applied using only the default set of parameters. The first feature (log) to be trained for, as in the previous workflow, is the volume of shale (VSH). The gradient boosting machine model performed better in this case than linear regression (up to a 5% improvement), scoring 76.2% adj_R2, 1.4% RMSE, and 8.09% MAE (Table 1). For the other parameter, porosity (PHI), the random forest algorithm yielded higher scores: 77.76% adj_R2, 0.45% RMSE, and 2.60% MAE. The gradient boosting machine algorithm performed best in the bitumen mass percentage (W_Tar) prediction, scoring 67.85% adj_R2, 0.69% RMSE, and 0.53% MAE despite the limited available data. This result shows a significant improvement over the simple linear regression model. It is therefore evident that the more advanced conventional machine learning models outperform simple linear regression in all petrophysical property prediction tasks evaluated in this study (Table 1). However, there are some discrepancies between the actual logs and the logs predicted by RF and GBM, as shown in Fig. 5. For example, the learning models underpredict porosity values, especially in the high-porosity intervals, and overpredict the values across the relatively tighter intervals.
Meanwhile, another parallel training model was constructed using the H2O tool to apply AutoML to the prediction of these three continuous logs. A similar approach was used by running the model with only simple default parameters (max_models = 10, max_runtime_secs = 400, seed = 1234) and using the same training, validation, and testing datasets for an absolute performance comparison. The first feature to be explored is VSH, similar to the workflow used with supervised machine learning. In such a case, the AutoML approach tests a variety of supervised learning algorithms (e.g., GBM, XGB, DRF) with various parameters. The primary model is then chosen using the best mean per class error metric. As in the conventional ML model for VSH, the GBM algorithm performed the best in this case and obtained the following results: 78.77% adj_R2, 1.33% RMSE, and 7.90% MAE (Table 1). These metrics show an overall improvement of up to 3% when compared with conventional supervised machine learning with similar default parameters, and the prediction is visually closer to the actual dataset (Fig. 5). The exact same approach using the H2O tool was also applied to train the model to predict porosity. In this modelling, the AutoML process identified the distributed random forest (DRF), with a total number of trees of 50, as the best fit given the run constraints. This allows a direct comparison with the conventional RF model for porosity (PHI) prediction. The DRF modelling achieved the following results on the blind test dataset: 80.45% adj_R2, 0.42% RMSE, and 2.60% MAE (Table 1). This shows a similar magnitude of improvement (up to 3% in adj_R2) over the conventional RF model. Comparison with the actual test dataset reveals that the AutoML approach provides a much closer prediction than the conventional method (Fig. 5).
The last continuous log to be modelled by AutoML is W_Tar, for which the previous linear regression model exhibited poor correlation. The AutoML process picked the GBM algorithm as the fittest, based on the mean per class error score, to predict W_Tar, similar to the conventional approach. The GBM model developed by the AutoML process scored 67.34% adj_R2, 0.71% RMSE, and 0.28% MAE despite the very limited training data available (Table 1), which shows comparable performance with the conventional GBM model (Fig. 5). For the lithofacies classification task, the AutoML-based model has shown a significant improvement when compared with the conventional GBM and achieved a weighted F1-score of 98% (Table 2). In addition, the AutoML approach provided a more consistent prediction across all the lithofacies, with high precision values ranging from 0.95 to 0.99 and recall values ranging from 0.97 to 0.99 (Table 2 and Fig. 6). Furthermore, prediction results from the blind test wells and the confusion matrix demonstrate that the various lithologies were properly assessed and correctly classified (Figs. 6 and 7).

Discussion
Predicting various petrophysical properties, such as porosity and volume of shale, as well as categorical features such as lithofacies, using AutoML has shown promising potential, as demonstrated in this study. The study shows that the AutoML approach outperformed conventional regression and advanced machine learning algorithms, such as RF and GBM, in the predictions of different petrophysical parameters (Figs. 5 and 7). Across all the predictions, the proposed AutoML has shown a significant improvement in lithofacies prediction (up to 15%), a very challenging prediction task, in particular when dealing with heterogeneous reservoirs 18,49.
In addition, the AutoML model can achieve such high performance within a short period of time (less than 400 s) and with minimal human intervention. A study by Palacios Salinas et al. 50 further supports the advantage of AutoML in geosciences, specifically for remote sensing analysis. Furthermore, such an approach would help democratize advanced machine learning analysis in general and make it more accessible to non-machine learning experts, such as the geoscientists or petrophysicists in the case of subsurface well log interpretation. Several major drawbacks of AutoML have been actively discussed in the literature, including high-cost training, overfitting, and low interpretability 26,27. The high-cost training issue is mostly associated with the iterative training process, but with current technology and advanced libraries, most AutoML can be trained on a low-specification PC or personal laptop, as is the case for our study. The overfitting issue is commonly related to limited and unrepresentative datasets. In this study, we utilized close to five million data points collected from 2000 wells (Fig. 4), and the validation and blind test datasets were curated carefully in order to have representative test sets. To address the interpretability issue, we extracted the feature importance ranking from the best-performing model to show how the model made its decisions and predictions. This is key information when building any learning model, as it helps classify the relevant input logs and hence identify relationships. Furthermore, it also provides good insight into where some logs might actually be redundant and hence can be eliminated from the modelling workflow. For the VSH prediction, the gamma ray log was by far the most important log, scoring around 74%, which is not surprising since the volume of shale is typically driven by gamma ray calculations in conventional petrophysical analysis (Fig. 8a). The DPHI, ILD, and NPHI logs scored 13%, 8%, and 5%, respectively, as contributing factors in the calculation of VSH (Fig. 8a). This further supports that the AutoML model uses parameters similar to those expert petrophysicists use to calculate VSH 51. Similarly, both the gamma ray and density logs play a significant role in predicting porosity, with 48% and 34%, respectively (Fig. 8b). While density is commonly used to calculate total porosity from well logs, gamma ray is typically thought to have an insignificant influence on the porosity calculation. In addition, the neutron log has the lowest importance (18%) in the porosity prediction, which is counterintuitive with respect to conventional petrophysical analysis (Fig. 8b). However, this phenomenon can be explained by the lithofacies types in this Athabasca oil sands field, where the majority of lithofacies are sand, shale, and shaly sand, in which the porosity can be significantly influenced by the gamma ray logs, as illustrated in Fig. 4. Finally, according to the feature importance report, both the density and neutron porosity (DPHI & NPHI) logs play a major role in training the model to predict W_Tar (Fig. 8c). For lithofacies prediction, VSH emerges as the most influential parameter, followed by the gamma ray and density logs. Given the types of lithofacies analyzed in this study, it is understandable why the model places VSH as a more dominant feature than GR in predicting lithofacies (Fig. 9). This information would be helpful for future studies that focus on well log interpretation in reservoir characterization.

Conclusion
This study highlights the untapped potential of AutoML to accurately predict wireline logs, and thus reservoir properties, with a more robust and efficient workflow and a lower carbon footprint by eliminating time-consuming manual analysis. Our findings show that the proposed AutoML method could predict different logs with high consistency and high levels of accuracy while using a workflow that is genuinely simple to implement. Overall, the AutoML processes are distinguished by the extreme simplicity they provide to novice users with limited experience in the fields of machine learning and data science. Another advantage is that they save time and effort when experimenting with different algorithms and tuning their associated hyperparameters. The proposed model and library used in this study have advantages over traditional machine learning because of their ability to handle a large number of wells and different types of data, and their scalability for real-world deployments.
Furthermore, AutoML has provided useful insights into which specific algorithm could potentially be offered to solve a specific issue. The gradient boosting algorithm, for example, is considered powerful in classification modeling, such as the facies/lithology prediction performed in this study. Furthermore, the feature importance percentage reporting embedded in the AutoML process is a useful tool for identifying relationships between various features (logs) and helps to explain how the model bases its decisions when performing predictions. This will also result in better utilization of available data and improved data acquisition in future projects. Finally, this experiment shows that AutoML has promising potential for improving formation evaluation using simple workflows. This can be validated by implementing the AutoML workflow on more complex case studies in the future.

Figure 1. Location of the four major oil sand deposits in Alberta, Canada 32.

Figure 2. Examples of the available well data and lithofacies interpretation in the datasets.

Figure 4. Cross plots between different parameters available in the dataset. It is evident that different lithologies show different log responses and variable results in the laboratory measurements.

Figure 5. Plots showing the comparison between different ML algorithms and AutoML to the actual logs.

Figure 7. Comparison of lithofacies prediction using different machine learning algorithms in two different wells.

Figure 8. Histogram showing the feature importance ranking on the prediction of (a) volume of shale, (b) porosity and (c) bitumen mass percentage with AutoML.

Table 1. Summary of the performance of various supervised machine learning algorithms in regression tasks (VSH, PHI, and W_Tar) for the blind test dataset. The best performing model is highlighted in bold.