Machine learning to predict effective reaction rates in 3D porous media from pore structural features

Large discrepancies between well-mixed reaction rates and effective reactions rates estimated under fluid flow conditions have been a major issue for predicting reactive transport in porous media systems. In this study, we introduce a framework that accurately predicts effective reaction rates directly from pore structural features by combining 3D pore-scale numerical simulations with machine learning (ML). We first perform pore-scale reactive transport simulations with fluid–solid reactions in hundreds of porous media and calculate effective reaction rates from pore-scale concentration fields. We then train a Random Forests model with 11 pore structural features and effective reaction rates to quantify the importance of structural features in determining effective reaction rates. Based on the importance information, we train artificial neural networks with varying number of features and demonstrate that effective reaction rates can be accurately predicted with only three pore structural features, which are specific surface, pore sphericity, and coordination number. Finally, global sensitivity analyses using the ML model elucidates how the three structural features affect effective reaction rates. The proposed framework enables accurate predictions of effective reaction rates directly from a few measurable pore structural features, and the framework is readily applicable to a wide range of applications involving porous media flows.


Scientific Reports
| (2022) 12:5486 | https://doi.org/10.1038/s41598-022-09495-0 www.nature.com/scientificreports/ pore-scale direct numerical simulation methods are computationally demanding, which limits most studies to focus on a few porous media samples. Hence, results are often specific to the studied geometries, while pore geometrical complexity is enormously diverse 56,57 . Machine learning (ML) methods are considered as a powerful alternative to time-consuming numerical simulations while maintaining the accuracy of pore-scale direct numerical simulations 55,[58][59][60][61][62] . Recently, deep learning frameworks based on neural networks have been successfully applied to make rapid predictions of the physical properties of porous media, including permeability, porosity, and specific surface area 55,59,61,[63][64][65] . However, ML models have rarely been applied to reactive transport problems 66,67 , and have not yet been used to predict reaction rates in porous media. Furthermore, these previous ML-based investigations of porous media are mostly based on a single ML algorithm. Combinations of different ML algorithms, such as Random Forests (RF) and neural networks, have been used for predictive diagnostics with great success in medical and bioinformatics studies, where each individual algorithm is able to contribute its own strengths toward solving a specific problem 68,69 . Yet, combinations of ML models have rarely been used in reactive transport studies.
We propose an ML-based framework that characterizes the quantitative link between pore structural features and effective surface reaction rates. The pore structural features extracted from hundreds of porous media and effective reaction rates estimated from pore-scale simulations are used as the input data to train ML models. We first train an RF learning model to estimate the importance of pore structural features in determining the reaction rates. Based on this importance information, we then train an artificial neural network (ANN) model and accurately predict effective reaction rates with only three pore structural features. We demonstrate the developed framework in different flow (Pe) and reaction (Da) regimes. Finally, using the ANN model, the effects of the key pore structural features on effective reaction rates are comprehensively evaluated through global sensitivity analyses, and the physical relevance of the results is discussed.

Methods
In this section, we present the pore-scale numerical method for building the training data, and the ML methods for estimating pore structural features' importance and predicting the effective reaction rates.
Building training data with pore-scale reactive transport modeling. We perform a large number of pore-scale reactive transport simulations in an ensemble of 3D porous media structures obtained from an open-source database 64 . Rabbani et al. 64 generated a large set of porous structures through the texture transformation and porosity manipulation of 60 measured tomographic images. Each sample has a size of 256 3 voxels, and dimensionless units are used to estimate the physical features of porous media. For each porous media, Rabbani et al. 64 quantified various pore structural features including 11 of the most common single-value pore structural features. They are specific surface, pore sphericity, coordination number, throat radius, pore radius, tortuosity, pore-throat ratio, grain radius, pore density, grain sphericity, and throat length.
We apply a previously verified 3D pore-scale reactive transport model [70][71][72] to calculate effective surface reaction rates in the ensemble of porous media. In this model, fluid flow, solute transport, and chemical reactions are solved. We consider incompressible fluid flow at low Reynolds numbers and solve the continuity equation and the Stokes equation 73 , where v is the fluid velocity vector, p is the pressure, and µ is the dynamic viscosity. The governing equations of fluid flow are solved by the discrete Boltzmann equation based on the D3Q19 scheme 29 . A no-slip boundary condition is applied via the bounce-back rule at the fluid-solid interface. Details on the applied lattice Boltzmann method can be found in Mostaghimi et al. 34 .
The advection-diffusion equation is then solved to consider the transport of reactive solutes through pore spaces 74 , where C is the local solute concentration, t is the time and D is the molecular diffusion coefficient. To study different flow regimes, we consider three different Péclet (Pe) numbers that cover typical flow conditions in porous media: Pe = 0.1, 1, 10. Pe is defined as U av L D where U av is the average velocity and L is the characteristic length that is calculated via π/s A where s A is the specific surface area 34 .
For fluid-solid reactions in the porous media, we consider a bimolecular heterogeneous reaction which can be expressed as a generic equation, where A is chemical species in aqueous solutions, B is the reactant in solid phase and C is the product that can be either in aqueous phase (dissolution) or solid phase (adsorption). This equation has been reported to be adequate for describing chemical reactions in various systems and applications [75][76][77] . We implement the bimolecular heterogeneous reaction by applying irreversible first-order reaction kinetics as the boundary condition at fluid-solid interfaces via,  78 . We consider reactive transport of H + and Na + for calcite dissolution and salt ion adsorption scenarios, respectively. We use Damköhler numbers Da = k r U av to describe the reaction rate relative to the mass transfer rate by advection 79 , and use Da II = Pe Da = k r L D to compare the reaction rate to the mass transfer rate by molecular diffusion 51,80 . The diffusion coefficient of H + , D H + , is 1.0 × 10 −9 [m 2 s −1 ] and D Na + is 1.3 × 10 −9 [m 2 s −1 ] 81 , and the corresponding Da II values are 0.03 and 0.27. Note that Da is independent of initial concentration because the reaction follows the first-order reaction. Thus, the studied system is determined by Pe and Da (or Da II ).
From the steady-state concentration fields, we directly estimate the local reaction rate at each interfacial grid ( Fig. 1) and average over all interfacial grids to obtain effective reaction rates. The effective surface reaction rate can be defined as 22 , where N is the total number of voxels at fluid-solid interfaces and C i is the steady-state concentration of i-th voxel at fluid-solid interfaces. We define the normalized effective reaction rate as, where C in is the injection concentration. R norm quantifies the discrepancy between the effective reaction rate and the well-mixed reaction rate for C = C in .
Machine learning methods. The pore structures of porous media can be characterized by multiple features, such as specific surface, tortuosity, and pore radius. We use the 11 single-value pore structural features and effective reaction rates calculated from pore-scale simulations to train a two-step ML framework (Fig. 2), aiming to identify the key pore structural features that control effective reaction rates. We combine RF and ANN models to quantify the importance of pore structural features and to predict effective reaction rates directly from a few key pore structural features. In comparison with neural networks, RF is less computationally expensive and can effectively estimate the importance scores of input features, i.e., RF is a useful algorithm for feature importance ranking 82,83 . In spite of the higher computational cost for training, the ANN offers better model accuracy and performance, if the model is well optimized. Thus, we use RF to estimate feature importance, then use that importance information to optimize ANN training. Feature selection for ML models can help reduce redundant data, minimize overfitting, and improve model accuracy by removing unnecessary data. Further, ML algorithms with fewer features can be trained faster 84,85 .
RF learning algorithm creates multiple decision trees on input data and then selects the mean predictions of each decision tree as the best solution 82,83 . We use the data of 11 pore structural features and corresponding effective reaction rates as input to train a bagged ensemble of decision trees to estimate the importance value  (1), and depth of tree (9). The hyperparameters are optimized to maximize R 2 values and to ensure enough splits and tree depth. ANN model applies a learning algorithm for nonlinear statistical data modeling by mimicking the way nerve cells work in the human brain, and the model is particularly efficient in implicitly estimating complex nonlinear relationships between input features and target predictions 86 . We choose a single layer feed-forward neural network consisting of an input layer, hidden layer, and output layer [87][88][89] . The input is the pore structural features, and effective reaction rates from pore-scale simulations are the target predictions. The hidden layer consists of 10 neurons, where multiple functions are applied for data transformation. The neurons learn about the data and then send it to the output layer. Bayesian Regularization is used as the training algorithm in ANN, which is efficient for training small-size datasets with noises. We first train the ANN with the 11 pore structural features, and then reduce the number of input features based on the importance values obtained from the RF model.

Results and discussion
In this section, we first present results with an RF learning model that ranks the importance of each pore structural feature in predicting effective reaction rates. Then, we combine the RF importance ranking result with the ANN model to identify the most critical pore structural features for predicting effective reaction rates. We start with the Pe = 0.1 and Da II = 0.03 case and expand to different Pe and Da regimes. Finally, we conduct global sensitivity analyses with the validated ANN model and discuss the results. Importance ranking of pore structural features. For ML predictions, there is generally a trade-off between data preparation cost and model accuracy. ML models often show better performance with more datasets or instances for predictions, though it depends on the particular dataset 62,90 . To test this, we train the RF learning model with pore-scale simulation results of 100, 200, 300, 400, and 500 instances (i.e., the number of porous media samples) to determine how many simulation results are needed to achieve adequate accuracy of the learning model. We use the coefficient of determination, R 2 , to measure the accuracy of the ML models 91 . The black square data points in Fig. 3a show the R 2 values of RF models as a function of the number of instances used in training. The R 2 increases as the number of instances increases, but the increase is relatively minimal beyond 300. The R 2 of the model with 300 training instances is 0.938, which is comparable to that with 500 instances (R 2 = 0.946). Hence, we train the RF model with 300 instances and estimate the importance of 11 pore structural features.
The RF successfully ranked the importance of each feature, and the results are validated with the fivefold crossvalidations. The importance rankings are shown in Fig. 3b, and the importance of each feature is estimated from the increase in the prediction errors after permuting the feature in the datasets 82,83 . The error bars are obtained from fivefold cross-validation, which is commonly used to test the performance of the ML model and detect overfitting 92 . The short error bars indicate that there is small uncertainty in the predicted importance values and the average values of the importance are reliable 82,83 .
Because the importance ranking can be affected by the variability of feature values 90 , we estimate the coefficient of variation of each feature, which quantifies the variability around its mean value 93 . The inset of Fig. 3a shows the coefficient of variation of each pore structural feature. It is worth noting that pore sphericity and tortuosity have lower variability than most of the other features, while they are estimated as the second and sixth most important among the 11 features. There is no noticeable correlation between the feature importance and coefficient of variation, and this indicates that the variability of feature values in the input data is large enough to evaluate the importance of pore structural features.
Predicting effective reaction rates from key pore structural features. Based on the importance information, we now train the ANN model with a varying number of features to identify the key features for predicting effective reaction rates. Figure 4a shows the R 2 values for ANN predictions using 11 pore structural features. When the number of instances used for training is larger than 300, highly accurate and stable pre- www.nature.com/scientificreports/ dictions are achieved, which suggests 300 instances (R 2 = 0.972) are also sufficient for establishing an accurate ANN model. The importance estimation of pore structural features by RF provides the basis for identifying key features [94][95][96] . To identify the most critical pore structural features for ANN predictions, we train the ANN using 300 instances with 11 features, and then reduce the number of input features one by one, removing the features with the lowest importance value. The importance ranking from the most important feature to the least important feature are the following: specific surface, pore sphericity, coordination number, throat radius, pore radius, tortuosity, pore-throat ratio, grain radius, pore density, grain sphericity, throat length. The inset in Fig. 4a shows no significant variations in R 2 values when the ANN model is trained with three or more input features. When using one or two features for training, the R 2 values are much lower, indicating the necessity of using the three most important pore structural features for achieving accurate predictions 64,[97][98][99] . This implies that the other pore structural features are highly related to these three features or have limited contributions to effective reaction rates, which can be confirmed from the pair-wise correlations between the 11 features (see the Supplementary Document). For example, throat radius shows a high correlation with pore sphericity, and tortuosity shows a high correlation with coordination number. Specific surface, pore sphericity, and coordination number are identified as the three most important pore structural features. Specific surface area has the highest importance, which is defined as the ratio of total surface area to bulk volume 64 . This result is intuitive because the specific surface is directly linked to the reactive surface  www.nature.com/scientificreports/ area 100,101 . Pore sphericity is a shape factor describing the smoothness of reactive surface 97,98 , and it will affect the efficiency of mass transfer from the fluid to solid surfaces. Coordination number measures the average number of throats connected to a pore and describes the average connectivity of the pore space, which will govern the overall accessibility of reactive surface 99 . We use these three features from 300 instances as inputs to train the ANN model that predicts effective reactions rates. We use 70% of the data for training, 15% for validating, and 15% for testing the model. The testing data is independent of the training and validation data, which is applied to measure the performance of the trained ANN model. The prediction performance of the ANN model is shown in Fig. 4b. The validation R 2 is 0.980, and the testing R 2 is 0.972, indicating the ANN model provides good performance in predicting effective reaction rates with only three pore structural features. The marginal discrepancy between the R 2 values from validation and testing shows the stable performance of the ML model 90,102 . The effects of Pe and Da. We extend the developed framework to different flow (Pe) and reaction (Da) regimes. We first present the results under different Pe numbers. Figure 5a shows the results of importance estimation at Pe = 0.1, 1, and 10 with Da II = 0.03. Specific surface, pore sphericity, and coordination number remain the three most important features, though the most important feature becomes pore sphericity at Pe = 1 and 10. At higher Pe, advection is stronger, meaning pore structural features that are sensitive to flow will become more important. Indeed, the importance of pore sphericity and coordination number increases as Pe increases. The pore sphericity measures the shape of fluid-pore interfaces and determines the smoothness of the flow lines in pore space, thereby also the accessibility of the reactive surface area. Therefore, the influence of the pore shape factor on reaction increases as Pe increases. The results show that the three most important features remain www.nature.com/scientificreports/ the same across Pe numbers, indicating that the three features are the key predictors for effective reaction rates in typical porous media flow conditions. Figure 5b shows the test R 2 of the ANN predictions at different Pe numbers. The test R 2 values for ANN predictions at Pe = 0.1, 1 and 10, are calculated as 0.972, 0.974 and 0.980, respectively. The R 2 values of ANN predictions are high across Pe numbers, showing stable and high-quality predictions across these different flow conditions. We now extend the framework to a different Da number (Da II = 0.27) with a higher intrinsic reaction constant ( 1.1 × 10 −6 [ms −1 ] ), which is relevant for Capacitive Deionization (CDI), an emerging desalination method 78 . The transport of Na + and its adsorption at the solid surface is considered in the model. As shown in Fig. 5c, the three most important features are the same at both Da II = 0.03 and 0.27 with Pe = 0.1. The increase in reaction constant leads to a large increase in the estimated importance of the specific surface (the first green bar), which indicates that the specific surface plays a much more important role at higher reaction rates. At low flow rates (Pe = 0.1) but high reaction constant, the reaction rate is much larger than the advective mass transport rate, making the surface area play a more dominant role in determining the amount of fluid-solid reaction. Figure 5d shows the ANN predictions with two Da numbers. A high R 2 value of 0.976 is again achieved at Da II = 0.27, only with the three pore structural features. This result confirms that the ML framework and findings are also valid for different Da regimes.
Global sensitivity analysis with machine learning. We perform global sensitivity analyses using the trained ANN model to elucidate the combined effects of the key pore structural features on effective reaction rates under various flow and reaction conditions. Each row in Fig. 6 shows the effects of pore structural features on normalized effective reaction rates ( R norm ) at fixed Pe and Da. For each column, the coordination number, pore sphericity, and specific surface are fixed respectively with their average values in the datasets. This enables us to plot the combined effects of two features on the effective reaction rates. Effective reaction rates expectedly www.nature.com/scientificreports/ increase with specific surface and pore sphericity, as these two features determine the area and accessibility of pore reactive surface. A larger coordination number also leads to a larger effective reaction rate because a large coordination number implies an enhanced mass transfer between pores 64 . The overall effects of the three features on effective reaction rates are similar for the explored Pe and Da cases. However, the magnitude of normalized effective reaction rate is sensitive to Pe and Da. At higher Pe but low Da, the system is reaction-limited 21,46 . In this regime, the mass transfer rate is faster than the reaction rate, such that the concentration and reaction rates in the pore space become spatially uniform, leading to the increase in R norm (compare Fig. 6a,b and d-f). We confirmed that the Pe = 10 case leads to a further increase in R norm , meaning less discrepancy between the well-mixed reaction rate and effective reaction rate. Figure 6g-i show the ANN predictions with Da = 2.7 at Pe = 0.1, where a large variability in effective reaction rates is estimated. This reaction regime is transport-limited due to weak advection (Pe < 1) and high reaction rate (Da > 1) 33,46 , where the reaction rate is higher than the mass transfer rate by advection. In such regimes, uneven distributions of concentrations and large concentration gradients emerge in the pore space, leading to discrepancies between the well-mixed reaction rate and effective reaction rate. The global sensitivity analyses elucidate the effects of the key pore structural features on effective reaction rates, and the effective reaction rates could be used as input parameters to Darcy-scale reactive transport modeling.
If solid phase alteration is considered, uniform or transitional (between uniform and wormholing) dissolution patterns are most likely to be observed due to the low Da numbers 35,103 . However, it is known that the dissolution patterns depend not only on Pe and Da but also on pore structural heterogeneity 38,46,104 . The ML framework proposed in this study could be extended to predict dissolution regimes from Pe, Da, and pore structural features. However, reactive flow simulations with solid alterations, which is computational very expensive, should be performed to obtain accurate dissolution patterns.

Conclusions
Numerous environmental applications rely on reactive transport in porous media, but the accurate estimation of reaction rates has been a major challenge, limiting the predictive capability of reactive transport models. This study established a quantitative link between pore structure features and effective surface reaction rates by combining pore-scale simulations with ML algorithms. For the first time, we identified the three key pore structural features that determine effective surface reaction rates. The three features remained as the top critical features for the explored values of Pe and Da, which cover typical flow and reaction regimes in porous media 105,106 . The identified three features indeed capture the key factors that control reactive transport with heterogeneous reactions: specific surface quantifies surface area effect, pore sphericity quantifies pore shape effect, and coordination number quantifies flow/connectivity effect. We also applied this ML-based framework to perform global sensitivity analyses of the input features in determining effective reaction rates. The established ML model served as a surrogate model and enabled us to exhaustively and efficiently evaluate the effects of various system parameters (e.g., pore structures, flow rates, reaction rate constants) on effective reaction rates, which was otherwise not feasible due to the computational limitations. Extending the applicability of the proposed framework to wider ranges of Pe and Da will be an important next step. Also, with a larger dataset with wider Pe and Da values, one may be able to develop a more generic ML model that includes Pe and Da as input parameters.
The presented ML framework can be readily extended to a wide range of geological and environmental applications that involve complex coupled processes. For example, the lifetime of bentonite barriers in geologic repositories of spent nuclear fuel could be efficiently estimated by using clay structural features, temperature, chemical reaction constants, water saturation, and swelling rates as inputs to the ML model training. Further, the framework can be used to not only establish a quantitative link between input variables and target output variables but also to identify optimal values of pore structural features that can maximize the performance of porous materials. For example, by linking the pore structural features to effective corrosion rates, the framework can identify novel corrosion-resistant porous materials with optimized pore structural features that minimize corrosion. The framework can also be naturally extended to optimize other material properties such as mechanical strength and filtration efficiency. In particular, the model could identify optimal membrane properties for maximizing filtration efficiency by considering fiber diameter, hierarchical surface structure, pore size distribution, and surface area as input parameters and the effective water filtration rate as target prediction.
The proposed ML framework also provides an attractive approach for obtaining upscaled model parameters that are physically parameterized with subscale properties. In subsurface applications, the continuum model is often incapable of properly capturing pore-scale effects on Darcy-scale properties such as permeability, dispersion coefficients, and effective reaction rates. The proposed framework will enable us to establish ML-based quantitative correlations between pore-scale information and upscaled parameters. In a future study, the effects of porous media sample size should be further investigated. Studying the sample size effect will require substantial computational resources, but it is an important step for achieving upscaling. In summary, the proposed framework can not only elucidate the key parameters that control various physicochemical processes in porous media systems but also can be extended to improve model predictability and to identify optimal properties of porous materials.

Data availability
The training data and trained ML models are all made available open access at https:// drive. google. com/ drive/u/ 2/ folde rs/ 17nTP OjOVs livzZ G8u0_-l4gib CV3vz HX.