Predicting permeability via statistical learning on higher-order microstructural information

Quantitative structure–property relationships are crucial for the understanding and prediction of the physical properties of complex materials. For fluid flow in porous materials, characterizing the geometry of the pore microstructure facilitates prediction of permeability, a key property that has been extensively studied in material science, geophysics and chemical engineering. In this work, we study the predictability of different structural descriptors via both linear regressions and neural networks. A large data set of 30,000 virtual, porous microstructures of different types, including both granular and continuous solid phases, is created for this end. We compute permeabilities of these structures using the lattice Boltzmann method, and characterize the pore space geometry using one-point correlation functions (porosity, specific surface), two-point surface-surface, surface-void, and void-void correlation functions, as well as the geodesic tortuosity as an implicit descriptor. Then, we study the prediction of the permeability using different combinations of these descriptors. We obtain significant improvements of performance when compared to a Kozeny-Carman regression with only lowest-order descriptors (porosity and specific surface). We find that combining all three two-point correlation functions and tortuosity provides the best prediction of permeability, with the void-void correlation function being the most informative individual descriptor. Moreover, the combination of porosity, specific surface, and geodesic tortuosity provides very good predictive performance. This shows that higher-order correlation functions are extremely useful for forming a general model for predicting physical properties of complex materials. Additionally, our results suggest that artificial neural networks are superior to the more conventional regression methods for establishing quantitative structure–property relationships. We make the data and code used publicly available to facilitate further development of permeability prediction methods.

The porosity φ (volume fraction of the pore phase) and specific surface s (pore-solid interface area per unit volume) are perhaps the most basic geometrical characteristics. These two characteristics are the most frequently used in empirical expressions for the permeability. The Kozeny-Carman equation 17,18 is the most notable example, usually written as where k is the permeability and c is the Kozeny-Carman constant. However, the remarkably simple form comes with great limitations. The Kozeny-Carman constant was found not to be a universal quantity. It does not only vary for different systems, but can also depend on the porosity 19 . Additionally, it does not distinguish portions of pore space that carries significant flow from portions that do not 1 .
To tackle this difficulty, countless modified versions of the original Kozeny-Carman equation have been proposed. However, these models are usually ad hoc and only applicable to a specific class of structures 20 . More importantly, although in many cases tortuosity is incorporated in the Kozeny-Carman constant [21][22][23][24] , it usually only depends on the porosity alone in simplified models, thus the final expression of the permeability is essentially nothing more than a function of porosity and specific surface, i.e., f (φ)/s 2 . However, it is well-known that the microstructure is highly degenerate given only porosity and specific surface 25,26 , which leads to a wide range of permeabilities as we show later. Thus, any function of the form f (φ)/s 2 cannot be a general predictor and suffers from the intrinsic variances in the set of infinite degenerate microstructures.
Indeed, accurate prediction of the effective physical properties of the porous media requires a complete quantitative characterization of the microstructure in d-dimensional Euclidean space R d via a variety of n-point correlation functions 1 . However, while such complete structural information about the medium is generally not available, reduced information in the form of lower-order correlation functions is often very beneficial. Twopoint void-void and three-point void-void-void correlation functions have been used to produce both bounds and estimates for the effective electrical conductivity, diffusion coefficient and permeability [27][28][29][30][31][32][33][34][35][36] . In addition to the void-void correlation function, two-point surface-surface and surface-void correlation functions (where the surface is the interfacial surface between two phases) can also be defined and provide improved reconstructions of two-phase media from imaging data 37,38 , as well as sharper bounds on permeability compared to only using the void-void correlation function 1 .
On the other hand, it has been shown that permeability can be simply connected to the electrical formation factor of the porous material 39,40 . This has been proved rigorously by Avellaneda and Torquato 41 . However, from a prediction point of view, the formation factor itself needs to be measured experimentally or solved numerically, thus is not that helpful for establishing an explicit link to the microstructure. Although the formation factor is related to the hydraulic tortuosity 42 , the later also requires heavy computations.
As a complement to rigorous approaches to estimate effective properties from the microstructure, data-driven methodologies to establish structure-property relationships are increasingly being used [43][44][45][46][47][48][49] . The rapid increase in computational resources facilitates the computation of effective properties for very large data sets (hundreds or thousands) of different microstructures. Moreover, as noted above, affordable high-resolution 3D digitized images of actual microstructures provide valuable data sets. As a consequence, it becomes manageable to generate large numbers of realistic virtual microstructures, and using those to perform exploratory computational screening of structure-property relationships. For example, van der Linden et al. 43 use a data set of 536 virtual granular materials, compute 27 geometrical descriptors and use log-linear regression and other statistical learning methods as well as different variable selection schemes to understand the usefulness of the different descriptors for predicting permeability in these systems. Stenzel et al. 44 study effective conductivity prediction in 43 virtual realizations of a stochastic spatial network model structure, using porosity and different tortuosity and constrictivity measures. This study was extended to 8,119 microstructures 50 , which is likely the largest study published before, and the same data set was used again later to predict effective conductivity and permeability 45 . Barman et al. 46 studied effective diffusivity prediction in 36 virtual porous polymer films using tortuosity and constrictivity. In a different direction, there are several attempts to use 2D and 3D convolutional neural networks (CNNs) to extract information directly from the binary image data describing the structure [47][48][49][51][52][53] in order to predict effective properties. However, these models are usually difficult to interpret and hard to rescale.
In this work, we are primarily interested in the predictive power of the information content contained in different microstructural descriptors. Specifically, we investigate the two-point surface-surface, surface-void, and void-void correlation functions, and also porosity, specific surface, and geodesic tortuosity using different regression methods. Unlike the hydraulic tortuosity mentioned above, the geodesic tortuosity is a purely geometric quantity that can be computed efficiently, and has been shown to be superior for diffusivity prediction 46 . We compare different regression methods, including conventional linear regression with linear and quadratic terms, as well as deep artificial neural networks (deep learning). While conventional linear regression has an advantage in so far as the transparency of the prediction mechanism, deep learning has the potential to extract nearly the full information content of the descriptors, providing insight into the utility of the different descriptors for establishing the structure-property relationship. We find that the information content contained in these two-point correlation functions and geodesic tortuosity are indeed helpful to overcome the difficulty of applying a unique Kozeny-Carman-type equation to a variety of distinct microstructures, by yielding much better prediction performance. Moreover, our results suggest that artificial neural networks are superior to the more conventional regression methods for establishing quantitative structure-property relationships.
Consistent with the purpose of the paper, we have generated a large data set of virtual, porous, isotropic, and stationary microstructures of three different types, based on (i) thresholded Gaussian random fields, (ii) thresholded spinodal decomposition simulations of phase separation, and (iii) non-overlapping ellipsoid systems. www.nature.com/scientificreports/ Varying porosity and length scale and other parameters, we generate 10,000 structures of each of the three types, yielding a data set of 30,000 virtual microstructures in total. This is likely to be the largest data set of virtual microstructures ever created for studying permeability prediction, and covers both granular (ellipsoids) and continuous solid phases to provide a broad variability in the pore space geometry. Fluid flow is simulated using the lattice Boltzmann method. The large number of simulated microstructures makes it feasible to use not only scalar descriptors but also high-dimensional descriptors such as the two-point correlation functions, while still avoiding the well-known ' curse of dimensionality' in regression caused by having too many dimensions but too little data. To facilitate further investigation and development of permeability prediction methods, we make the microstructural descriptors, the computed permeabilities, the trained models, and the code used herein publicly available 54 . The paper is organized as follows. First, we introduce necessary definitions for the geometric descriptors used throughout the paper. Second, we describe how the virtual microstructures are generated, and the flow simulations and computations of permeability are described. Third, computation of the different microstructural descriptors is covered. Fourth, the prediction models for permeability are investigated. Finally, we make concluding remarks and discussions.

Background and definitions
Geodesic tortuosity. We compute geodesic tortuosity in the flow direction according to Barman et al. 46 in the following manner. As a first step, a pointwise geodesic tortuosity is computed as τ (x) = d(x)/d . Here, d is the length of the microstructure in the flow direction, and d(x) is the length of the shortest path from any inlet pore to any outlet pore through x . The shortest path is calculated as the sum of two geodesic distance transforms computed in the pore space of the binary voxel array: one using the set of edge voxels constituting the inlet pores as seeding points, and the other using the set of edge voxels constituting the outlet pores as seeding points. Let P be the set of voxels for which both geodesic distances are finite, i.e., the set of pore voxels connected to both inlet and outlet. Then, the geodesic tortuosity τ can be computed as In Barman et al. 46 , it was found that accounting for both inlet and outlet in this manner is superior (in terms of diffusivity prediction) to just accounting for the inlet as is commonly done 44,55 . Tortuosity calculations were implemented in Matlab (Mathworks, Natick, MA, US). correlation functions. Let I(x) be the indicator function for the void phase (pore space) V 1 , i.e., The two-point void-void correlation function is then generally defined by For statistically homogeneous materials, S 2 is only dependent on the vector difference r = x 2 − x 1 . Further, if the material is also statistically isotropic, S 2 is only dependent on the radial distance r = |r| . Introducing a notation that is consistent with the other correlation functions defined below, the two-point void-void correlation function is now defined as where the average is taken over all x and over all r with magnitude r. We proceed to the correlation functions involving the interfacial surface. Let M(x) be the interface indicator function defined by 1 Still assuming ergodicity and statistical isotropy, the surface-void correlation function can be written as and the surface-surface correlation function can be written as Importantly, the information content of one-point correlation functions (porosity φ and specific surface s) is automatically encoded into these two-point correlation functions. When r goes to infinity, F vv (r) , F sv (r) and F ss (r) converge to φ 2 , sφ and s 2 respectively. Interestingly, the slope of F sv (r) at the origin is proportional to the integrated mean curvature of the system 56 , which has recently been shown to be a useful predictor of both permeabilities 57 and diffusion coefficients 58 .
Accurate and robust computation of F vv , F sv , and F ss from discretized structures is a non-trivial task (in particular for the later two). The calculations recently became accessible due to the algorithms devised by Ma and Torquato 56 . There, the calculations involving the interfacial surface are performed using a scalar field which when thresholded yields the corresponding two-phase medium. The details of the algorithms and the software can be found in Ref 56 .

Microstructure data preparation
Microstructure generation. To achieve a large, representative data set, three different types of microstructures that are commonly studied in the materials literature are generated, including (i) thresholded Gaussian random fields, (ii) thresholded spinodal decomposition simulations of phase separation, and (iii) non-overlapping (hard) ellipsoid systems. We simulate 10,000 realizations for each type, with porosities φ selected uniformly in 0.3 ≤ φ ≤ 0.7 and varying characteristic length scales. In the end, all structures are converted to N 3 binary voxel arrays with N = 192 voxels. In Fig. 1, one example of each type of structure is shown. We verified that the choice of the system volume size is both computationally manageable and representative. The correlation functions are evaluated for integer radii value bins r from 1 to 96 voxels. In Fig. 2, some examples of correlation functions are shown. Note that these correlation functions are considerably distinct from each other, as seen by their different magnitudes and functional shapes. On the other hand, they have already converged to the large-r limits within the sample size. The details of how these samples are generated are presented in the following subsections.
Gaussian random fields. Gaussian random fields are generated according to Lang and Potthoff 59 . Assuming that we wish to simulate a Gaussian random field G(x) , x ∈ R 3 , with mean zero and covariance function x, y , it utilizes the fact that the covariance function can be written where γ p is the spectral density of the Gaussian random field and �·, ·� is the inner product. We wish to generate structures with length scale parameter L and resolution  Spinodal decomposition. The lattice Boltzmann method 61,62 , a numerical framework for solving partial differential equations based on kinetic theory, is used to simulate phase separation kinetics (spinodal decomposition) using the Navier-Stokes and Cahn-Hilliard equations. Very briefly, the time evolution of a spatially dependent concentration C(x, t) , 0 ≤ C ≤ 1 , is described by As an initial condition, the values of C are uniformly distributed in 0 ≤ C ≤ 1 , independently in all grid points. The phase separation is the coarsening of regions with C ≈ 0 and C ≈ 1 (ideally equal to 0 and 1). Here, u is a fluid velocity governed by the Navier-Stokes equations, M is a mobility, i.e., a diffusion coefficient, and µ is the chemical potential. The simulation is performed using a dimensionless time step unity and both the density ratio and viscosity ratio between the phases are unity. The interface width, i.e., the characteristic length scale of the transition between the phases is 5 voxels. The simulations are performed in the resolution 96 3 voxels with periodic boundary conditions, and are run until an appropriate degree of coarsening is obtained. The number of iterations K is chosen randomly between 5 and 20,000 such that K 1/3 is approximately uniformly distributed; this is because according to the Lifschitz-Slyozov law, the typical length scale in the structure will be proportional to the cubic root of the simulation time. After terminating the simulation, the solutions are upscaled to 192 3 voxels and thresholded to obtain the desired porosity. The spinodal decomposition simulations are run using in-house software 61,62 with efficient scaling to many cores using the MPI interface. The average execution time is approximately 13 min (32 cores) for each structure, and up to 60 min for the longest computation.
Non-overlapping ellipsoids. Random configurations of non-overlapping, hard ellipsoids are generated using a hard particle Markov Chain Monte Carlo (MCMC) algorithm. The Perram-Wertheim criterion 63 for two ellipsoids of arbitrary orientation is used for overlap detection. First, particles are assigned uniformly distributed locations and orientations (the latter encoded using a quaternion representation). Second, the configurations are relaxed by sequentially performing random translations of all particles and then random rotations of all particles until no two particles overlap. Proposed translations and rotations are only accepted if they lead to a lower or equal degree of overlap for the considered particle. These "local" stochastic optimization steps eventually lead to a "global" optimization resulting in no overlap. Third, the configurations are equilibrated by performing a large number of random translations and rotations, ensuring a distribution in location and orientation that is as uniform as possible. Now, if the desired porosity φ is larger than 0.50, non-overlapping configurations can be generated easily at constant porosity as described above. Otherwise, as a final step, the configuration is further compressed in small steps, �φ = 10 −5 , until the target porosity φ target is reached (in some cases, the configuration becomes jammed before reaching φ target ). The proposed translations are normal distributed with standard deviation σ t in each direction. The proposed rotations are normal distributed with standard deviation σ r in a random direction. In every step, σ t and σ r are chosen in an adaptive fashion to aim for an acceptance probability of 0.25. The number of ellipsoids M is distributed in 8 ≤ M ≤ 512 such that M 1/3 is approximately uniformly distributed, yielding an approximately uniform distribution of length scales. Further, the ellipsoids have semiaxes (1, 1, η) where η is uniform in 0.25 ≤ η ≤ 1 (oblate) with probability 0.5 and otherwise uniform in 1 ≤ η ≤ 4 (prolate). The random microstructures are generated using in-house developed software implemented in Julia (http://www.julia lang.org) 64 and available in a Github repository (https ://githu b.com/rodin g/white fish_gener ation , version 0.2). The average execution time is approximately 1 min (single core) for each structure, and up to 30 min for the longest computation. The obtained configurations are further smoothed with a Gaussian filter www.nature.com/scientificreports/ with σ = 3 voxels and thresholded again to regain the original porosity; the reason for this is that computation of some of the correlation functions requires the binary structures to be described as a thresholded version of smooth scalar fields.
Flow simulations. The lattice Boltzmann method 61,62 , a numerical framework for solving partial differential equations based on kinetic theory, is used to simulate fluid flow through the structures. The Navier-Stokes equations for pressure-driven flow are solved for the steady state using no-slip, bounce-back boundary conditions on the solid/liquid interface and periodic boundary conditions orthogonal to the flow direction. We use the two relaxation time collision model with the free parameter eo = 3 16 , which guarantees that the computed permeability is independent of the relaxation time (and thus the viscosity) 65 . The relaxation time τ = − 1 e is kept at 1.25. The flow is driven by constant pressure difference boundary conditions across the structure in the primary flow direction 66 , and a linear gradient is used as initial condition. The computational grid coincides with the voxels of the binary structure, i.e., it has 192 3 grid points. After convergence to steady state flow, the permeability k is obtained from Darcy's law, Here, ū is the average velocity, p is the applied pressure difference, µ is the dynamic viscosity, and d is the length of the microstructure in the flow direction. The permeability is independent of the fluid and the pressure difference and a property solely of the microstructure provided that the Reynolds number is sufficiently small ( Re < 0.01 ), which also ensures that the velocity is proportional to the pressure difference. The computed permeabilities have units voxels 2 , where the voxels have unit length.
Convergence of the computation is assessed in the following fashion. The energy of the fluid flow field, i.e. the integral of the squared velocity magnitude in the pore space, is computed for each iteration. In each iteration, the coefficient of variation (the standard deviation divided by the mean) of this energy is computed for the latest 500 iterations. The computation is terminated once this coefficient of variation reaches below 10 −4 . The convergence criterion is the same throughout, although the number of iterations required for convergence differs between different types of microstructures. The mean number of iterations needed are approximately 2,840 for Gaussian random fields, 2,330 for spinodal decompositions, and 2,270 for non-overlapping ellipsoids. However, the number of iterations is also highly dependent on e.g. porosity and varies approximately in the range 1,000 to 10,000 for all types of microstructures. The average execution time is 97 s (utilizing 32 cores, with efficient scaling using the MPI interface). Figure 3 illustrates the result of a flow simulation in one of the Gaussian random field structures. We choose 4,500 miscrostructures, 1,500 for each type, and plot their scaled permeabilities ks 2 versus porosities φ in Fig. 4. It is noteworthy that although a clear overall trend can be seen, the scaled permeability is never a function of φ alone. In fact, we observe that for the same porosity, the largest scaled permeability is approximately Consequentially, more detailed information is needed to pinpoint the true permeability on this "band". Another interesting observation is that the scaled permeabilities of spinodal decomposition patterns are almost always lower than those of the other two types. It has been shown that spinodal decomposition gives rise to hyperuniform structures in the scaling region 67 . This observation is consistent with the fact that hyperuniform structures cannot tolerate large "holes" and the pore space is more evenly distributed compared to nonhyperuniform structures, thus their permeabilities are generally lower 40 .

Microstructural descriptors
We study the performance of the different microstructural descriptors introduced above and the combinations of them for predicting the permeability k (dimension length 2 ). The descriptors used are porosity φ (dimensionless), specific surface s (dimension 1/length), tortuosity τ (dimensionless), the correlation functions F ss (dimension 1/ length 2 ), F sv (dimension 1/length), and F vv (dimensionless). Additionally, we investigate a particular combination of the correlation functions: inspired by a rigorous upper bound for permeability in isotropic media 56 , i.e., we define a function which is also used for prediction (F is dimensionless and converges to zero when r goes to infinity). Each correlation function is represented by a 96-dimensional vector. In Table 1, the models, denoted 1 through 7, and the sizes of the corresponding input features are listed. Additionally, we consider a rescaling of the problem, predicting the rescaled, dimensionless permeability ks 2 instead of k directly. For model 1, we remove s from the descriptors since it is already absorbed in the permeability; for models 2, 3, 4, 6, and 7, the correlation functions   www.nature.com/scientificreports/ are rescaled to dimensionless versions where applicable. In Table 2, the modified models, denoted 1 ′ through 7 ′ , and the dimensions of the corresponding input vectors (that changes only for model 1 ′ ) are listed. In practice, we use the logarithm of permeabilities, i.e., log 10 k and log 10 ks 2 , instead of their original values. The reason for using logarithms of the permeabilities is that they span several orders of magnitude. By taking logarithms, the predictions are simplified, and guaranteed to be positive. We also use the logarithms of porosity, specific surface, and tortuosity since we know that models 1 and 1 ′ are naturally multiplicative in these Kozeny-Carman-like equations.

predictive models
We assess the predictive performance of the different descriptors/inputs using several regression methods, namely, linear regression with linear terms only or combined with quadratic terms, and deep artificial neural networks. The inputs are as described above, with no normalization (such as subtracting feature-wise means; our investigation suggested no improvement from normalization in this setting). For each microstructure class, the data are split randomly into training data (70 %; 7,000 per class), validation data (15 %; 1,500 per class), and test data (15 %; 1,500 per class). In total, the training, validation, and test data sets hence consist of 21,000, 4,500, and 4,500 samples, respectively. The split is kept fixed across all inputs and all regression methods. Training data is used for the actual estimation of a functional relationship mapping input to output. Validation data is used for hyperparameter selection, i.e., finding optimal values for e.g. learning rates for ANNs (in the case of linear regressions, the validation data is not used because we do not have any hyperparameters to optimize). Test data is used for final assessment of the predictive performance. To quantify error/loss in prediction, we use several different measures. Let k be the 'true' permeability, i.e., the value obtained from the lattice Boltzmann simulations, and let k be the predicted value. We use mean squared error (MSE) in the logarithmic scale, root mean squared error (RMSE) which is just RMSE = MSE 1/2 , and mean absolute percentage error (MAPE) in the linear scale, i.e., Using MSE is the most practical and most common choice for model fitting. However, for final assessment of performance, the linear scale and MAPE is a more straightforward and intuitive choice.
Linear regression with linear terms. First, we consider using linear regression with only linear terms (i.e., only the input descriptors to the power of unity are used). For models 1 and 1 ′ , this becomes multiplicative regression in a Kozeny-Carman-like form, i.e., and It is well established that a > 0 , b < 0 , and c < 0 in this setting (and due to the dimensions, b = −2 would be preferable).
On the other hand, the rationale behind the linear regression model of correlation functions is inspired by the rigorous bounds involving correlation functions, such as Eq. (16). Since the integral in the bounds can be seen as the inner product between the correlation functions and another predetermined function, it is natural to assume that a functional regression on the correlation functions may yield a reasonable estimation of permeabilities. (20) log 10 k = c 0 + a log 10 φ + b log 10 s + c log 10 τ (21) log 10 ks 2 = c 0 + a log 10 φ + c log 10 τ . The rest of the models are formulated in an equivalent fashion. For the correlation function-based models, α , β , and γ are just vectors of coefficients but can also be thought of as discretized forms of continuous coefficient functions α(r) , β(r) , and γ (r) . We use least squares fitting, finding the coefficients that minimize the training set MSE. We also include the reference Kozeny-Carman model in this category as a benchmark. Fitting is performed in Matlab (Mathworks, Natick, MA, US). Specifically, the fitted Kozeny-Carman model writes as For models 1 and 1 ′ , the estimated relationships become and Interestingly, the estimated coefficients for models 1 and 1 ′ are quite similar, and they are also comparable to the Kozeny-Carman model. Specifically, even without being forced to have the dimensionally correct exponent − 2 for s, as in the case of model 1 ′ , the estimated exponent from the regression (− 1.88) in model 1 is very close.
To get an idea of the dependence of the coefficient functions on r, we plot the estimated α(r)/r in Fig. 5 as an example. We scale α(r) by r in order to make it have the appropriate weight consistent with the one that multiplies the correlation functions in the integrand of Eq. (16). It is clear that instead of weighting on different parts of F vv (r) equally, the regression emphasizes on the small-to-intermediate-r behavior of the correlation function as expected.
We also made an attempt to further smooth the coefficient functions according to a proposed functional regression scheme published earlier 68 . Briefly, the idea is to introduce a penalty term in the least squares objective function that is proportional to the integral of the squared second derivative of the coefficient function, such that the functional form becomes smoother. To exemplify, pick a model with a coefficient function α(r) , and we minimize (22)    www.nature.com/scientificreports/ for a range of penalty parameters , and pick such that the validation MSE is minimized. Interestingly, the smoothing procedure provides only a negligible improvement in validation and test errors, and a negligible increase in smoothness of the coefficient functions. Thus, we stick to the models without penalties. The errors for training, validation, and test sets for the aforementioned models are shown in Table 3. Although we perform no hyperparameter optimization for these regression models (such as variable selection), we include the validation set errors for consistency. There are several noteworthy observations about our results. First, the Kozeny-Carman-like model (1 and 1 ′ ) with geodesic tortuosity almost reduce the relative error by half compared to the original Kozeny-Carman model, while the combination of all three correlation functions (6 and 6 ′ ) also give comparable performance. Interestingly, when we further combine the correlation functions with geodesic tortuosity (7 and 7 ′ ), the error shrinks to approximately only one-third of the error generated by the Kozeny-Carman model. Among single correlation functions, the best performance is given by F vv , while the worst is given by F ss . This result is expected, since F vv alone can yield a bound on the permeability, while F ss alone does not even contain the most important information, i.e., the porosity. The reason we keep model 2 and 2 ′ is indeed just for self-consistency. Interestingly, the compound correlation function F performs relatively poorly. This is probably due to the fact that it washes out the information content contained in individual correlation functions. However, its error can be interpreted as a lower bound on how good the bound in Eq. (16) would work on our data set. To better visualize our findings, the predicted values vs the true (simulated) values and histograms of relative errors for a few selected models are shown in Fig. 6. It is obvious that by adding geodesic tortuosity and correlation functions the prediction error can be greatly reduced. It can also be seen that although the Kozeny-Carman-like model (Fig. 6b) and the correlation function based one (Fig. 6c) has similar MAPE, the error distribution of the correlation function based one is more symmetric.
Linear regression with quadratic terms. Second, we generalize the previous models that utilized linear terms only to incorporate both linear and quadratic terms. For the correlation function models, we use both the correlation functions themselves and their squares. For example, we use both F ss and F 2 ss as input data in model 2. It is worth to point out that we use only pure quadratic terms, such as F 2 ss (r i ) , but not mixed quadratic terms, such as F ss (r i )F ss (r j ) for i = j . We include models 1 and 1 ′ in this investigation as well, mainly for completeness, and add terms of the type log 10 φ 2 . Fitting is performed in Matlab (Mathworks, Natick, MA, US). The errors for training, validation, and test sets are again shown in Table 4. In Fig. 7, the predicted values vs the true (simulated) values and histograms of relative errors for a few selected models are shown. Importantly, we note that adding the quadratic terms leads to an improvement for every model. This suggests that the relation between the microstructural descriptors and the permeability can be quite complex such that a simple linear model may not be able to fully capture it. However, the relative rank of performances roughly remain the same, showing the validity of our previous arguments. The estimated coefficient functions are quite www.nature.com/scientificreports/ noisy and their physical meaning is not obvious. We also make an attempt to use a full quadratic model, incorporating also mixed terms such as F ss (r i )F ss (r j ) , or even mixed between correlation functions, such as F sv (r i )F vv (r j ) . The numbers of variables in the models then become very large, leading to ill-conditioned estimation problems. We investigated whether the Lasso variable selection technique 69 , which forces a variable number of coefficients in a linear model to become zero by penalizing the sum of absolute values of the coefficients, could act as an efficient means of reducing the model dimensionality. However, it turns out that no amount of Lasso regularization can decrease the validation MSE in this case. There are two likely reasons for this: Lasso is primarily intended for  www.nature.com/scientificreports/ high-dimensional variable spaces where a large fraction of the variables contain little information and mostly noise, and can be disregarded easily. This is likely not the case for the correlation functions. Also, because the values are taken from continuous functions, they are strongly correlated, which is known to compromise the underlying rationale of Lasso.
Neural networks. The complexity of the linear regression models could be further increased, for example by incorporating pure cubic terms. Although we expect to see further improvements, the linear regression model can quickly become ill-conditioned and intractable on this track. For this reason, we proceed to consider deep neural networks, which can potentially fully capture the complex structure-property relationships. Thus we can exploit the complete information content contained in the descriptors. We use four fully-connected hidden layers with 128 nodes each and rectified linear unit (ReLU) activations. Given that the input dimension is n, and the output dimension is unity, the number of weights in the network is (n + 1) × 128 + 3 × 129 × 128 + 128 + 1 , i.e., there are between 50,049 and 86,785 weights to be optimized. The network is shown in Fig. 8. Random initial weights are selected using the Glorot/Xavier uniform initializer. The networks are trained using the Adam optimizer 70 with learning rate 10 −4 , batch size 128, and mean squared error loss. All models are trained 100 times for 10,000 epochs using different random weight initializations, and the models with the globally minimal validation loss (MSE) are selected (hence utilizing early stopping, but performed over multiple realizations/initializations). The reason for this procedure is to minimize the impact of the random weight initializations. The models are implemented in TensorFlow 2.1.0 (http://www.tenso rflow .org) 71 . An example of training and validation loss curves is shown in Fig. 9. Again, the errors for training, validation, and test sets are shown in Table 5. In Fig. 10, the predicted values vs the true (simulated) values are shown. Indeed, the neural networks based regressions perform noticeably better than the linear counterpart. Again, we see that the combination of correlation functions and geodesic tortuosity gives the best performance, achieving an impressive MAPE that is less than 4% . We also notice that all correlation function based models (except for F ss and F for aforementioned reasons) perform very well, and all better than the Kozeny-Carman-like model.
To gain some understanding of the neural network and how the prediction is performed, we perform for the case of model 4 an analysis of the network's sensitivity with respect to perturbations in the input. Specifically, for the test set, we add random Gaussian noise to F vv (r) for one r value at a time. The perturbation in the output is quantified by the standard deviation of the difference between the original prediction and the perturbed prediction. In Fig. 11, we show the results for σ = 0.02 (it turns out that for a broad range of perturbations, 0.0001 ≤ σ ≤ 0.1 , the result changes only by a constant scaling). We see a rough resemblance to Fig. 5 in the sense that large magnitudes are mostly found for small r, indicating that to some extent the models are using the same information in the correlation function. www.nature.com/scientificreports/

conclusions and discussion
We have studied data-driven structure-property relationships between fluid permeabilities and a variety of microstructural descriptors in a large data set of 30,000 virtual, porous microstructures of different types. The data set includes both granular and continuous solid phases, and is the largest one ever generated for the study of transport properties to our knowledge. To characterize the pore space geometry, we computed one-point correlation functions (porosity, specific surface), two-point surface-surface, surface-void, and void-void correlation functions, and geodesic tortuosity. Different combinations of these descriptors were used as input for different statistical learning methods, including linear regression with linear and quadratic terms, as well as deep neural networks. We find that the performance improves as the regression models become more complex, suggesting the complex relationship between the structural descriptors and the physical properties. Sufficiently large neural networks are able to fully capture the information content of the descriptors and reveal their utilities. With higherorder descriptors, we obtain significant improvements of performance when compared to a Kozeny-Carman regression with only lowest-order descriptors (porosity and specific surface). We found that combining all three two-point correlation functions and tortuosity provides the best prediction of permeability. The void-void correlation function was found to be the most informative individual descriptor. Also, the combination of porosity, specific surface, and geodesic tortuosity provides comparable predictive performance, in spite of its simplicity. Indeed, this shows that the greater information content contained in higher-order correlation functions are  . An example of a training (blue) and validation (red) loss curves for model 6 ′ . This is the best run for this model, i.e., the one that yielded the minimum validation MSE, which is indicated by the vertical line (black). www.nature.com/scientificreports/ extremely useful for predicting physical properties of complex materials. Moreover, our work demonstrates that advanced machine learning methods can be very useful in establishing structure-property relationships. An interesting observation is that in general the rescaling of permeabilities seem to improve the performance of the simple linear model, as seen from Table 3 that all models except the Kozeny-Carman-like one outperform their unscaled counterparts. However, as we add quadratic terms and the predictive model becomes more complex, the advantage of rescaling does not hold any more. Finally, for the highly nonlinear neural networks, the relative performance is completely inverted. This may suggest that the relation between the permeability and the specific surface cannot be captured by a simple rescaling, and doing so may reduce the information content  www.nature.com/scientificreports/ contained in the original data. These observed effects of rescaling may lead to some guidelines on developing physics-aware machine learning models for other physical properties as well.
As a final remark, we emphasize that by incorporating microstructures with different length scales in our data set we make our models very robust and can be applied to real-world data. Since the permeability has the dimension L 2 , we can easily obtain the permeability of a sample with a different length scale but the same microstructure. However, there is no universally applicable characteristic microstructural length to scale the permeability that enables a comparison of permeabilities for different microstructures. For example, for sphere packings the natural choice can be the radii of particles, but for continuous structures we need to resort to other quantities 40 . Thus, by training our models on samples with varying length scales, we circumvent this problem by only requiring a rescaling to the right order of magnitude. Finally, we make the data and code used publicly available to facilitate further development of permeability prediction methods 54 .