A reduced order model (ROM) is devised to provide an acceptable accuracy while utilizing a much lower computational cost compared to the full order model (FOM)1. In recent years, a non-intrusive or data-driven ROM approach has grasped attention because (1) it has a straightforward implementation (i.e., does not require any modifications of FOM), (2) it easily lends itself to different kinds of physical problems, and (3) it allows for more stable and much faster prediction than intrusive ROM for nonlinear problems2,3,4,5,6,7. Traditionally, proper orthogonal decomposition (POD) is used as a data compression tool (i.e., linear subspace approach), which is the optimal way to construct the linear reduced manifolds. However, POD-based solutions on a linear subspace are often restrictive for highly nonlinear problems where reduced spaces lie in nonlinear manifolds. More recently, nonlinear compression using autoencoder-based deep learning (DL) architectures or nonlinear manifold approach5,6,8,9 has been suggested to reconstruct these nonlinear manifolds, resulting in generic and more refined predictive capabilities than linear subspace approaches for nonlinear problems. Recent extensive comparisons, however, show a performance deficit for DL-ROM approaches in some cases6.

Kadeethum et al.6 illustrate that there are two essential issues for DL-ROM. First, the nonlinear approach outperforms its linear counterpart in specific settings (e.g., boundary conditions and domain geometry), but the opposite can occur in other settings. This is because POD provides the optimal data compression in a linear subspace for the problems with fast-decaying Kolmogorov’s n-width that measures the degree of accuracy by n-dimensional linear subspaces10,11,12,13. Therefore, the DL-ROM approach could not exceed the level of POD accuracy for problems that naturally lie within linear manifolds. However, for problems with slowly decaying Kolmogorov’s width, the nonlinear manifold approach outperforms the linear subspace one. Even though the authors hypothesize that a visual comparison between principal component analysis (PCA) and t-Distributed Stochastic Neighbor Embedding (t-SNE) could indicate which method will perform better before employing any specific compression strategy, there is no unified model that could be used across problem settings without an extensive case-based hyperparameter search. Second, although the nonlinear approach excels in very complex settings, it relies on convolutional operators, hindering its application for unstructured meshes and limiting DL-ROM approaches to less practical problems. Hence, these limitations in DL-ROM methods need to be resolved and tested with varying degrees of complex problems.

Convection in porous media is an important process in various applications in natural and engineered environments (e.g., biomedical engineering, multiphase flow in the subsurface, seawater intrusion, geothermal energy, and storage of nuclear and radioactive waste)14,15,16,17. As the media temperature and composition (fluid concentration) are altered, the dynamics of fluid density and viscosity variations could drive the flow field through flow instabilities18,19. The gravity-driven flow problem is usually characterized by Rayleigh number (\(\textrm{Ra}\)) in which if the \(\textrm{Ra}\) is low, the flow field is laminar, while if the \(\textrm{Ra}\) is high, the flow turns into a turbulent regime. In cases where the driving force is strong enough (very high \(\textrm{Ra}\)), the flow might also exhibit fingering behavior20.

Numerical simulation of gravity-driven flow in porous media has been a subject of extensive research. Notable examples of full order model (FOM) include: (1) TOUGH software suite, which includes multi-dimensional numerical models for simulating the coupled thermo-hydro-mechanical-chemical (THMC) processes in porous and fractured media21,22, (2) SIERRA Mechanics, which has simulation capabilities for coupling thermal, fluid, aerodynamics, solid mechanics and structural dynamics23, (3) PyLith, a finite-element code for modeling dynamic and quasi-static simulations of coupled multiphysics processes24, (4) OpenGeoSys project, which is developed mainly based on the finite element method using object-oriented programming THMC processes in porous media25, (5) IC-FERST, a reservoir simulator based on control-volume finite element methods and dynamic unstructured mesh optimization26, (6) DYNAFLOW™, a nonlinear transient finite element analysis platform27, (7) DARSim, multiscale multiphysics finite volume based simulator28, (8) the CSMP, an object-oriented application program interface, for the simulation of complex geological processes, e.g. THMC, and their interactions29, and (9) PorePy, an open-source modeling platform for multiphysics processes in fractured porous media30. In this study, we utilize the FOM developed in the previous works, a locally conservative mixed finite element framework for coupled hydro-mechanical-chemical processes in heterogeneous porous media31,32 in which interior penalty enriched Galerkin and mixed finite element are employed. This FOM, however, is computationally expensive for two reasons. The first one is the problem of interest is highly nonlinear; hence, it takes more nonlinear iterations to converge. The second reason is to satisfy the Courant–Friedrichs–Lewy (CFL) condition, the FOM needs to march through many intermediate time-steps to reach the time-steps of interest33,34,35.

Kadeethum et al.6 propose a data-driven reduced order model (ROM) that can reduce computation cost while maintaining an acceptable accuracy for natural convection in porous media problems. The model is applicable to parameterized problems1,13,36,37,38,39,40,41, depending on a set of parameters (\({\varvec{\mu }}\)) which could correspond to physical properties, geometric characteristics, or boundary conditions. This model sequentially follows (1) the offline and (2) online stages1,42. The offline stage begins with initializing a set of input parameters, which we call a training set. Then the FOM is solved corresponding to each member in the training set (in the following, we will refer to the corresponding solutions as snapshots). Either linear, relying on POD4,43 or nonlinear compression, depending on deep convolutional autoencoder (DL-AE or DL-ROM)5,6,8, is then used to compress FOM snapshots to produce basis functions that span reduced spaces of very low dimensionality, but still guarantee accurate reproduction of the snapshots44,45. The ROM can then be solved during the online stage for any new value of \({\varvec{\mu }}\) by seeking an approximated solution in the reduced space.

In this work, we propose a unified data-driven ROM using a combination of Barlow Twins (BT) self-supervised learning and an autoencoder (BT–AE) that bridges the performance gap between linear and nonlinear manifold approaches. In particular, we use BT self-supervised learning to maximize the information content of the embedding with the latent space through a joint embedding architecture46. With four different example cases that span the degree of complexity to cover both linear and nonlinear problems, a comparison of the proposed BT–AE framework with both linear (POD) and nonlinear (DL-AE) ROM approaches is conducted to demonstrate the performance of the unified data-driven ROM framework that (1) excels in all test cases (whether the solution can be captured in a linear or nonlinear manifold) and (2) operates on either structured or unstructured meshes. Importantly, this model is fully data-driven; it could be trained by data produced by FOM, on-site measurement, experimental data, or a combination of them. This characteristic can provide flexibility across the spectrum in more complex problems. Since it is not limited by the Courant–Friedrichs–Lewy condition for conventional FOMs, it could deliver quantities of interest at any given time contrary to the FOM6.


Data generation

We present a summary of all geometries and boundary conditions we use in Fig. 1. In short, Examples 1, 2, and 3 represent cases where \({\varvec{\mu }}\) is a scalar quantity, namely \(\textrm{Ra}\), while Example 4 illustrates a case where \({\varvec{\mu }}\) is a four-dimensional vector, composed of \(\mathrm {Ra_1}\), \(\mathrm {Ra_2}\), \(\mathrm {Ra_3}\), and \(\mathrm {Ra_4}\). The information of each example is presented in Table 1. We note that \(\textrm{M}_{\textrm{validation}}\) and \(\textrm{M}_{\textrm{test}}\) represent the number of the validation and testing sets with varying Rayleigh number (\(\textrm{Ra}\)), respectively (Table 1). Due to time dependence, the total number of training, validation, and test samples is the product of \(\textrm{M}\) and \(N^t\) with varying \(N^t\) ranges. Specifically the validation samples, \(\textrm{M}_{\textrm{validation}} N^t\), is determined by \(\textrm{M}_{\textrm{validation}} N^t = 0.1\textrm{M}N^t\) by randomly sampling 10% of the sum of training/validation sets (\(\textrm{M}N^t\)).

Figure 1
figure 1

Domain and boundary conditions for (a) Example 1 (heating from the left boundary), (b) Example 2 (Elder problem), (c) Example 3 (unit cell of micromodel), and (d) Example 4 (modified Hydrocoin with four subdomains). The red line indicates the region of the boundary where the temperature is elevated.

Table 1 Summary of main information for each example.

It is noted that the final total training samples are \(0.9\textrm{M} N^t\) because we allocate 10% of the training samples for the validation set. The total of testing data is \(\mathrm {M_{test}} N^t\). We want to emphasize that our \(N^t\) is not constant, but it is a function of \({\varvec{\mu }}\). To elaborate, the higher \(\textrm{Ra}\) value will result in the higher \(N^t\) to satisfy CFL condition.

The summary of each model, including the subspace dimension and compression method, is presented in Table 2. The detailed description of POD, AE, and DC–AE models is provided in Kadeethum et al.6, and our newly developed BT–AE models are described in “Methodology” section. In short, for POD models, we use proper orthogonal decomposition as a compression tool. The AE models use an autoencoder as a compression method. We employ a deep convolutional autoencoder to compress our training snapshots (\(\textrm{M} N^t\)) for DC–AE models. The BT–AE models utilize a combination of an autoencoder and Barlow Twins self-supervised learning in their compression procedure. For the POD models, linear compression, subspace dimension refers to the number of reduced basis or \(\textrm{N}\) as well as the number of intermediate reduced basis or \(\mathrm {N_{int}}\). We assume \(\textrm{N} = \mathrm {N_{int}}\) for all models for simplicity. The subspace dimension is the number of latent space (\(\textrm{Q}\)) for the nonlinear compression, AE, DC–AE, and BT–AE models.

Table 2 Summary of naming for each model.

Details of POD, AE, DC–AE models are provided in Kadeethum et al.6.

Comparisons of BT–AE with POD, AE, and DC–AE models in simple domains

We first compare the BT–AE model accuracy (for different numbers of \(\textrm{Q}\)) with the models developed by Kadeethum et al.6,43 (i.e., POD, AE, and DC–AE models) in relatively simple model domains. Example 1 illustrates a case where a linear manifold is optimal, while Example 2 presents a case where a nonlinear manifold is optimal. The results of POD, AE, and DC–AE models presented in Kadeethum et al.6 demonstrated that the POD-based and DL-ROM approaches are more suitable for the linear and nonlinear manifold problems, respectively, and they are used in this manuscript to evaluate the performance of BT–AE models.

Example 1: Heating from the left boundary

The geometry and boundary conditions are shown in Fig. 1a, and we adopt this example from Zhang et al. and Kadeethum et al.6,47. This example represents a case where its fluid flow is driven by buoyancy as the fluid is heated on the left side of the domain. The fluid then flows upwards and rotates to the right side of the domain. We set \({\varvec{\mu }} = (\textrm{Ra})\), and its admissible range of variation is [40.0, 80.0], see Table 1. For the training set, we use \(\textrm{M} = 40\), which lead to, in total, \(\textrm{M} N^t = 16802\) training data points.

We present the test case results of the BT–AE model (BT–AE 16 Q) as supplimental information (SI-Animation-Example 1). The difference between solutions produced by the FOM and ROM (DIFF) is calculated by

$$\begin{aligned} {\text {DIFF}}_\varphi (t^k, {\varvec{\mu }}_{\textrm{test}}^{(i)})= \left| \varphi _h(\cdot ; t^k, {\varvec{\mu }}_{\textrm{test}}^{(i)}) - \widehat{\varphi }_h(\cdot ; t^k, {\varvec{\mu }}_{\textrm{test}}^{(i)})\right| \end{aligned}$$

where \(\varphi _h\) is a finite-dimensional approximation of the set of primary variables corresponding to velocity, pressure, and temperature fields. \(\widehat{\varphi }_h\) is an approximation of \(\varphi _h\) produced by the ROM. Thus, \(\varphi _h(\cdot ; t^k, {\varvec{\mu }}_{\textrm{test}}^{(i)})\) and \(\widehat{\varphi }_h(\cdot ; t^k, {\varvec{\mu }}_{\textrm{test}}^{(i)})\) represent \(\varphi _h\) and \(\widehat{\varphi }_h\) at all space coordinates (i.e., evaluations at each DOF) at time \(t^k\) with input parameter \({\varvec{\mu }}_{\textrm{test}}^{(i)}\), respectively. Note that we only present the results of the temperature field. Hence, \(\varphi _h\) and \(\widehat{\varphi }_h\) represent \(T_h\) and \(\widehat{T}_h\), respectively. From SI-Animation-Example 1, we observe that BT–AE 16 Q provides a reasonable approximation of the temperature field.

Figure 2
figure 2

Example 1—results: (a) mean squared error (MSE) of each model (please refer to Table 2), and blue texts represent a mean value of the box plots—here we show that BT–AE 16 gives performance similar to POD-based approaches, but AE and DC–AE models do not, (b) data compression loss for validation set (Eq. 18), (c) mapping using ANN loss for validation set (Eq. 19), (d) latent space plot of DC–AE 16 Q model, and (e) latent space plot of BT–AE 16 Q model. Latent space plots are constructed using t-Distributed Stochastic Neighbor Embedding (t-SNE). Different colors represent each value of \(\textrm{Ra}\) value. We calculate the t-SNE plots using Scikit-Learn package using its default setting and perplexity of 15.

The results of Example 1 is presented in Fig. 2. In Fig. 2a, The performance of the different models (Table 2) is evaluated with the mean square error (\(\textrm{MSE}_\varphi (:, {\varvec{\mu }}_{\textrm{test}}^{(i)})\)) of the test cases defined as follows

$$\begin{aligned} {\textrm{MSE}_\varphi (:, {\varvec{\mu }}_{\textrm{test}}^{(i)}) :=\frac{1}{N^t} { \sum _{k=0}^{N^t} \left| \varphi _h(\cdot ; t^k, {\varvec{\mu }}_{\textrm{test}}^{(i)}) - \widehat{\varphi }_h(\cdot ; t^k, {\varvec{\mu }}_{\textrm{test}}^{(i)})\right| _{\varphi }^2}} \end{aligned}$$

where \(\textrm{MSE}_\varphi (:, {\varvec{\mu }}_{\textrm{test}}^{(i)})\) represents the MSE values of all t for each \({\varvec{\mu }}_{\textrm{test}}^{(i)}\). The \(\textrm{MSE}\) results show that BT–AE models perform better than AE and DC–AE models. Besides, BT–AE 16 Q delivers similar \(\textrm{MSE}\) results to those of the POD models. In contrast to the findings presented in Kadeethum et al.6 where the linear compression (POD) outperforms nonlinear compression (AE and DC–AE), BT–AE models in this study could perform similar to the POD models. To be accurate, BT–AE models still underperform, but errors are comparable.

We then investigate how the performance of BT–AE models compares to DC–AE. First, we examine the data compression loss of the validation set (see Eq. 18) which is presented in Fig. 2b. From this figure, the data compression losses of BT–AE models are slightly better than those of the DC–AE models. Subsequently, we illustrate the mapping using ANN loss of the validation set, see Eq. (19), in Fig. 2c. From Fig. 2c, we observe that the mapping losses of the BT–AE models are six orders of magnitude less than those of the DC–AE models. This behavior shows that the BT–AE’s latent spaces are easier to be mapped (i.e., ANN loss of the validation set for the BT–AE mapping is much lower than that of the DC–AE.). This speculation is explained by Fig. 2d,e, using a t-Distributed Stochastic Neighbor Embedding (t-SNE) plot. From Fig. 2d, one could see that all latent variables of DC–AE 16 Q blend (i.e., you cannot differentiate among cases with different \(\textrm{Ra}\) values.). The latent variables of the BT–AE 16 Q model, on the other hand, shown in Fig. 2e, behave in a much better structure (i.e., we can differentiate among cases with different \(\textrm{Ra}\) values.).

Example 2: Elder problem

The Elder problem48 is a significantly more complicated and ill-posed problem48,49. High \(\textrm{Ra}\) numbers considered in this case may cause the flow instability to be fingering behavior. The domain and boundary conditions are presented in Fig. 1b6,47,50. In short, the model domain is heated from the half of the bottom boundary (Fig. 1b), and the flow is driven upwards by buoyancy force. We set \({\varvec{\mu }} = (\textrm{Ra})\), and its admissible range as [350.0, 400.0] (Table 1). Compared to Example 1, this higher range of \(\textrm{Ra}\) values affects the minimum and maximum \(N^t\) as its range increases to [790, 1010].

The results of Example 2 are presented in Fig. 3. From Fig. 3a, we observe that all the models using nonlinear compression (AE, DC–AE, and BT–AE) perform better than the linear compression (POD). Furthermore, the BT–AE model accuracy is comparable to that of the DC–AE models. However, the BT–AE model results seem to be insensitive to the number of \(\textrm{Q}\), while the DC–AE model results are affected by the number of \(\textrm{Q}\) (i.e., the DC–AE 16 Q and DC–AE 256 Q are more accurate than the DC–AE 4 Q). We also present the results of the test cases for the BT–AE 16 Q model in the supplemental animation (SI-Animation-Example 2). From these results, we observe that the BT–AE 16 Q model delivers a reasonable approximation of the solution \(T_h\) (i.e., \(\widehat{T}_h\)).

Figure 3
figure 3

Example 2—results: (a) mean squared error (MSE) of each model (please refer to Table 2), and blue texts represent a mean value of the box plots—here we show that BT–AE models provide performance similar to DC–AE models, but POD-based approaches and AE models do not, (b) data compression loss for validation set (Eq. 18), (c) mapping using ANN loss for validation set (Eq. 19), (d) latent space plot of DC–AE 16 Q model, and (e) latent space plot of BT–AE 16 Q model. Latent space plots are constructed using t-Distributed Stochastic Neighbor Embedding (t-SNE). Different colors represent each value of \(\textrm{Ra}\) value. We calculate the t-SNE plots using Scikit-Learn package using its default setting and perplexity of 15.

We present the data compression loss of the validation set (Eq. 18) in Fig. 3b. In contrast to the ones shown in Fig. 2b, the DC–AE models have a slightly lower loss than that of the BT–AE models. We then investigate the ANN mapping loss (see Eq. 19) of the validation set in Fig. 3c. Similar to those presented in Fig. 2c, the BT–AE models have much lower mapping losses compared to those of the DC–AE models. Among the BT–AE models, BT–AE 256 Q has the highest value of ANN mapping loss, which is expected since it has the highest output dimension (i.e., we are mapping t and \({\varvec{\mu }}\) to \({\varvec{z}}^{\textrm{Q}}\)). Again, we observe a much better structure of the BT–AE 16 Q latent space than the one from DC–AE 16 Q (see Fig. 3d,e). To elaborate, the latent variables of the DC–AE 16 Q are overlapped to hinder us from differentiating among each case (different \(\textrm{Ra}\) values). The latent variables of the BT–AE, on the contrary, are structured in a way that one can clearly observe different parts that represent different \(\textrm{Ra}\) values as shown in Fig. 3e.

Model performance of BT–AE models on complex geometries

From Examples 1 and 2, we have observed that the BT–AE models could provide good results while operating on unstructured data. In this section, more challenging geometries which require an unstructured mesh for the FOM are evaluated with BT–AE models only since other methods are not suitable for unstructured mesh problems.

Example 3: Unit cell of micromodel

Example 3 uses a unit cell of micromodel where a central part of honeycomb shape and four corners are impermeable for flow. Still, the heat can conduct through these five subdomains as presented in Fig. 1c. Over the past decade, the micromodel has been used to study multiple coupled processes, including flow, reactive transport, bioreaction, and flow instability19,20,51,52,53,54. The flow is initiated from an influx at the bottom of the domain. This geometry is more complex than those utilized in Examples 1 and 2 (see Fig. 1a,b). The higher temperature at the bottom surface (shown in red) alters a fluid density at the bottom, and subsequently, a buoyancy force drives the flow upwards from the bottom (shown in red) to the top of the domain. Five subdomains contain very low flow conductivity, but they can conduct heat. Again, we set \({\varvec{\mu }} = (\textrm{Ra})\) and its range as [350.0, 400.0] (Table 1). The range of \(\textrm{Ra}\) can also cause flow instability. We use \(\textrm{M} = 40\), \(\textrm{M}_{\textrm{validation}}N^t = 10\)% of \(\textrm{M}N^{t}\), and \(\textrm{M}_{\textrm{test}} = 10\). We have in total \(\textrm{M} N^t = 44354\) training data points.

The summary of the Example 3 results is shown in Fig. 4. For all test cases the MSE values over time in Fig. 4a–c are in the range of \(\approx 1 \times 10^{-5}\). The MSE values tend to decrease over time until the temperature field becomes a steady state. Besides, BT–AE models with different \(\textrm{Q}\) values provide approximately similar results (in line with our findings from Examples 1 and 2). The behavior infers that utilizing only a small number of latent spaces; the model can achieve the same level of accuracy as the one with a large number of latent spaces. This behavior is very beneficial because the mapping between parameter space and latent space becomes more manageable. We also present the results of the test cases for the BT–AE 16 Q model in the supplemental animation (SI-Animation-Example 3). Overall, BT–AE 16 Q delivers a reasonable approximation of the \(T_h\) (i.e., DIFF results are low, and the relative error lies within 2%).

Figure 4
figure 4

Example 3 results: the moving average (a window size of 50) of mean squared error (MSE) of (a) BT–AE 4 Q, (b) BT–AE 16 Q, (c) BT–AE 256 Q (please refer to Table 2). (d) Data compression loss for validation set (Eq. 18), (e) Barlow Twins loss for validation set (Eq. 14), (f) mapping using ANN loss for validation set (Eq. 19), and (g) latent space plot of BT–AE 16 Q model. Latent space plots are constructed using t-Distributed Stochastic Neighbor Embedding (t-SNE). Different colors represent each value of \(\textrm{Ra}\) value. We calculate the t-SNE plots using Scikit-Learn package using its default setting and perplexity of 15.

The data compression loss (Eq. 18) is in the range of \(\approx 1 \times 10^{-5}\) to \(1 \times 10^{-6}\) (Fig. 4d) which is similar to that of Example 1 (Fig. 2b), but slightly lower than that of the Example 2 (Fig. 3b). The data compression loss seems to be invariant to \(\textrm{Q}\) values. We also present the Barlow Twins loss (Eq. 14) in Fig. 4e. We observe that the Barlow Twins loss increases with increasing the \(\textrm{Q}\) value as in Zbontar et al.46. This can be explained that as the \(\textrm{Q}\) value grows larger, the cross-correlation matrix \(\textbf{C}^{T}\left( t, {\varvec{\mu }}\right)\) becomes bigger, resulting in more members in Eqs. (15) and (16). As stated by Zbontar et al.46, the absolute value of Eqs. (15) and (16) is not as important as their trend. To elaborate this, in Fig. 4e, all models (different \(\textrm{Q}\) values) reach their saturated points around 40 epochs, meaning that the minimization of Eqs. (15) and (16) is completed.

The mapping of the latent space using ANN loss (Eq. (19)) is presented in Fig. 4f. Similar to Examples 1 and 2 (Figs. 2c, 3c), the mapping loss is in range of \(\approx 1 \times 10^{-5}\) to \(1 \times 10^{-7}\). The higher \(\textrm{Q}\) values, the mapping loss grows larger because there are more outputs to map. We present the latent space structure in Fig. 4g (only for BT–AE 16 Q). Following the results shown in Figs. 2e and 3e, the latent structure of the BT–AE model has a good structure since we can differentiate among different \(\textrm{Ra}\) values. This behavior stems from the fact that the BT losses maximize the information content of the embedding with the latent space through a joint embedding architecture.

Example 4: Modified hydrocoin with four subdomains

Example 4 uses the hydrocoin problem55,56 with the domain geometry shown in Fig. 1d. In this example, the domain is subdivided into four subdomains with different \(\textrm{Ra}\) values (i.e., \({\varvec{\mu }} = \left( \mathrm {Ra_1}, \mathrm {Ra_2}, \mathrm {Ra_3}, \mathrm {Ra_4}\right)\)). The range of \(\textrm{Ra}\) values is [350.0, 400.0]. Similar to the previous examples, this \(\textrm{Ra}\) range causes fingering behavior as shown in the supplemental animation (SI-Animation-Example4). We use \(\textrm{M} = 81\), \(\textrm{M}_{\textrm{validation}}N^t = 10\)% of \(\textrm{M}N^{t}\), and \(\textrm{M}_{\textrm{test}} = 10\). We have in total \(\textrm{M} N^t = 90{,}175\) training data points. We note that as we use \(\textrm{M} = 81 = 3^4\) equally spaced samples, for each parameter \(\textrm{Ra}_i\), \(i = 1,2,3,4\), we only have three values. As an example, for \(\mathrm {Ra_1}\) we only sample \(\mathrm {Ra_1} = \left( 350, 400, 450\right)\) for the training set. The same goes for \(\mathrm {Ra_2}, \mathrm {Ra_3},\) and \(\mathrm {Ra_4}\). As a result, training with relatively sparse samples of each parameter \(\textrm{Ra}_i\) makes it very challenging to obtain an accurate data-driven framework in general1,4.

Even though this setting is very challenging, we still observe that the BT–AE 16 Q delivers a reasonable approximation of the \(T_h\) as seen in the supplemental animation (SI-Animation-Example4). The summary of the Example 4 results is shown in Fig. 5. We present the MSE values as a function of time in Fig. 5a–c. We can observe that the MSE values for all test cases are in the range of \(\approx 1 \times 10^{-1}\) to \(1 \times 10^{-5}\), which are significantly higher than those of Examples 1, 2, and 3. Moreover, the MSE values generally increase as we approach steady-state solutions, unlike the behaviors shown in Example 3. Again, BT–AE models with different \(\textrm{Q}\) provide approximately similar results (in line with our finding from Examples 1, 2, and 3).

Figure 5
figure 5

Example 4 results: the moving average (a window size of 50) of mean squared error (MSE) of (a) BT–AE 4 Q, (b) BT–AE 16 Q, (c) BT–AE 256 Q (please refer to Table 2). (d) Data compression loss for validation set (Eq. 18), (e) Barlow Twins loss for validation set (Eq. 14), (f) mapping using ANN loss for validation set (Eq. 19), and latent space plot of BT–AE 16 Q model for (g) \(Ra_1\) and (h) \(Ra_4\). Latent space plots are constructed using t-Distributed Stochastic Neighbor Embedding (t-SNE). Different colors represent each value of \(\textrm{Ra}\) value. We calculate the t-SNE plots using Scikit-Learn package using its default setting and perplexity of 15.

The data compression loss (Eq. 18) is in the range of \(\approx 1 \times 10^{-2}\) to \(1 \times 10^{-4}\) (Fig. 5d), which is significantly higher than that of Examples 1, 2, and 3. This behavior illustrates that this example is the most challenging case for the BT–AE models. The data compression loss is the lowest for \(\textrm{Q} = 256\) and the highest for \(\textrm{Q} = 4\), but the difference is not critical. As shown in the Barlow Twins loss (Eq. 14) in Fig. 5e, the higher values of \(\textrm{Q}\) the larger Barlow Twins loss is (as we discussed in the previous example.).

The mapping of the latent space using ANN loss (Eq. 19) is presented in Fig. 5f. The mapping loss is in the range of \(\approx 1 \times 10^{-4}\) to \(1 \times 10^{-5}\), which is significantly higher than those of Examples 1, 2, and 3 (see Figs. 2c, 3c, 4f). This behavior also contributes to the higher MSE values of the BT–AE models. We present the latent space structure (only for BT–AE 16 Q) in Fig. 5g,h for \(\mathrm {Ra_1}\) and \(\mathrm {Ra_4}\), respectively. Since Example 4 has different \(\textrm{Ra}\) values in four subdomains, the differentiation of the latent space of individual \(\textrm{Ra}\) does not provide good solutions as each latent space of each subdomain might also be interconnected.


Recent developments in ML-based data-driven reduced order modeling (DL-ROM or DC–AE in this study)5,6 have shown promising results in capturing parametrized solutions of systems of nonlinear equations. These models, however, rely on convolutional operators, which hinders the applicability of these models to complex geometries where an unstructured mesh is required for FOMs, as in Examples 3 and 4. Though we could utilize an autoencoder without convolutional layers, the model could not achieve the same level of accuracy as DL-ROM6. Kadeethum et al.6 also illustrate that in a specific setting (simple geometry and boundary conditions), a linear compression approach using POD can outperform the DL-ROM model (Example 1). We have demonstrated that the autoencoder model through Barlow Twins self-supervised learning (BT–AE) could achieve the same accuracy as DL-ROM (Example 2 where POD models perform much worse than DL-ROM) by regularizing the latent space or nonlinear manifolds. Besides, it also yields optimal results in the case where the linear compression model outperforms the DL-ROM (Example 1). It means that the BT–AE model excels in all the test cases (Examples 1 and 2) while it still can operate on an unstructured mesh. This behavior has a significant advantage in scientific computing since most realistic problems require unstructured mesh representations. Besides, the BT–AE’s performance is insensitive to the number of latent spaces, suggesting that with only a small number of latent spaces, the model can achieve the same level of accuracy as the one with a large number of latent spaces. This behavior is very beneficial because the mapping between parameter space and latent space becomes more manageable.

The computational time used to develop our ROM can be broken down into three primary parts: (1) generation of training data through FOM (the second step in Fig. 6), (2) training BT–AE (the third step in Fig. 6), and (3) mapping of t and \({\varvec{\mu }}\) to reduced subspace (the fourth step in Fig. 6). Each FOM model (corresponding to each set of \({\varvec{\mu }}\) or \(\textrm{Ra}\) in this work) takes, on average, about two hours on AMD Ryzen Threadripper 3970X (4 threads). We note that our FOM utilizes the adaptive time-stepping; hence, each \({\varvec{\mu }}^{(i)}\) may require a substantially different computational time. To elaborate, cases that have higher \(\textrm{Ra}\) usually have a smaller time-step (\(N^t\) becomes larger), and subsequently, they require more time to complete.

The wall time used to train BT–AE is approximately 0.4 hours using a single Quadro RTX 6000. It is noted that this computational cost is much cheaper than that of the DC–AE model, taking around four to six hours6. This is because DC–AE relies on convolutional layers, dropout, and batch normalization, which require much higher computational resources. The BT–AE, on the other hand, utilizes only a plain autoencoder. The BT–AE model is also cheaper than the POD model. However, we note that this may not be a fair comparison as we perform POD and BT–AE using different machines (i.e., our POD framework only works on CPU, but our BT–AE is trained using GPU). Please refer to Kadeethum et al.6 for detailed wall time comparisons among POD and DC–AE models. The mapping of t and \({\varvec{\mu }}\) to reduced subspace through artificial neural networks (ANN) takes around half an hour to one hour using a single Quadro RTX 6000. As mentioned in “Methodology” section, we do not terminate the training of both BT–AE and mapping of t and \({\varvec{\mu }}\) to reduced subspace through ANN early, but rather use the model with the best validation loss through the final epochs. For example, we train for 50 epochs, but the model that offers the best validation loss might be the model at 20 epochs. However, the training time we report here is for 50 epochs. Thus, our training time provided here is considered conservative.

Even though the ROM training time is not trivial, it could provide a fast prediction during the online phase. Using AMD Ryzen Threadripper 3970X, the ROM takes approximately several milliseconds for a query of a pair of \(t^{k}\) and \({\varvec{\mu }}^{(i)}\). We also note that, as discussed previously, our ROM is needed to be trained on GPU for the problems at hand. Still, it could utilize CPU during an online time since we do not have to deal with back-propagation or optimization during the prediction time. On the contrary, one FOM simulation (for each \({\varvec{\mu }}^{(i)}\) for all t \(\in\) \(0=: t^{0}<t^{1}<\cdots <t^{N} := \tau\)) takes about two hours. So, assuming that we query all t similar to those of the FOM, ROM takes only a matter of several seconds. In practice, however, we might not need to evaluate all timestamps in \(0=: t^{0}<t^{1}<\cdots <t^{N}:= \tau\) because the quantities of interest at the specific time may be more important. Since ROM is not bound by the CFL condition and can predict the quantities of interest at any specific time without intermediate computation, we could simply perform one query—\(t^{N}\) and \({\varvec{\mu }}^{(i)}\), resulting in saving computational time significantly. Our ROM could provide a speed-up of \(7 \times 10^{6}\) at any specific time step for Example 2, and a speed-up of \(7 \times 10^{3}\) to \(7 \times 10^{6}\) for all examples considered in this work.

Our model is developed upon the data-driven paradigm, which is applicable to any FOM. Besides, it could be trained using data produced by FOM, on-site measurements, experimental data, or a combination among them. This characteristic provides flexibility, which intrusive approaches could not provide. The data-driven model, though, is usually hungry for training samples. We have illustrated that as the dimensionality of our parameter space grows, the model requires more training samples, or it will suffer by losing its accuracy significantly as in Example 4 compared to accurate prediction in Example 3. We speculate that an adaptive sampling technique57,58,59, incorporating physical information60,61, or including multimodal unsupervised training62 might provide a resolution to this issue in the future work. Another gap in data-driven machine learning ROM is that a posteriori error is exceptionally challenging to quantify. An error estimator developed by Xiao63 for linear manifolds could be adapted and extended to the nonlinear manifold paradigm. Additionally, epistemic uncertainty could also be quantified by adopting the ensemble technique proposed by Jacquier et al.64.


A graphical summary of our procedure is presented in Fig. 6: the computations are divided into an offline phase for the ROM construction, which we will show through four consecutive main steps and (single-step) online stage for the ROM evaluation.

Figure 6
figure 6

The summary of procedures taken to establish the proposed BT–AE.

The first step of the offline stage represents an initialization of a training set (\({\varvec{\mu }}\)), validation set (\({\varvec{\mu }}_{\textrm{validation}}\)), and test set (\({\varvec{\mu }}_{\textrm{test}}\)) of parameters used to train, validate, and test the framework, of cardinality \(\textrm{M}\), \(\textrm{M}_{\textrm{validation}}\), \(\textrm{M}_{\textrm{test}}\). For the rest of sections we will discuss only \({\varvec{\mu }}\). The same analogy goes for \({\varvec{\mu }}_{\textrm{validation}}\) and \({\varvec{\mu }}_{\textrm{test}}\). Let \(\mathbb {P} \subset \mathbb {R}^P\), \(P \in \mathbb {N}\), be a compact set representing the range of variation of the parameters \({\varvec{\mu }} \in \mathbb {P}\). For the sake of notation we denote by \(\mu _p\), \(p = 1, \ldots , P\), the p-th component of \({\varvec{\mu }}\). To explore the parametric dependence of the phenomena, we define a discrete training set of \(\textrm{M}\) parameter instances. Each parameter instance in the training set will be indicated with the notation \({\varvec{\mu }}^{(i)}\), for \(i = 1, \ldots , \textrm{M}\). Thus, the p-th component of the i-th parameter instance in the training set is denoted by \(\mu _p^{(i)}\) in the following. The choice of the value of \(\textrm{M}\), as well as the sampling procedure from the range \(\mathbb {P}\), is typically user- and problem-dependent. In this work, we use an equispaced distribution for the training set as done in6,43.

In the second step, we query the FOM, based on the finite element solver proposed and made publicly available in Kadeethum et al.6,32, for each parameter \({\varvec{\mu }}\) in the training set. In short, we are interested in gravity driven flow in porous media, and here we briefly describe all the equations used in this study: (1) mass balance and (2) heat advection–diffusion equations. Let \(\Omega \subset \mathbb {R}^d\) (\(d \in \{1,2,3\}\)) denote the computational domain and \(\partial \Omega\) denote the boundary. \({\varvec{X}}^{*}\) are spatial coordinates in \(\Omega\) (e.g., \({\varvec{X}}^{*}=[x^{*}, y^{*}]\) when \(d=2\), which we will focus on throughout this study). The time domain is denoted by \(\mathbb {T} = \left( 0,\tau \right]\) with \(\tau >0\) (i.e., \(\tau\) is the final time). Primary variables used in this paper are \({\varvec{u}}^* (\cdot , t^*) : \Omega \times \mathbb {T} \rightarrow \mathbb {R}^d\), which is a vector-valued Darcy velocity (m/s), \(p^* (\cdot , t^*) : \Omega \times \mathbb {T} \rightarrow \mathbb {R}^d\), which is a scalar-valued fluid pressure (Pa), and \(T^* (\cdot , t^*) : \Omega \times \mathbb {T} \rightarrow \mathbb {R}^d\), which is a scalar-valued fluid temperature (C). Time is denoted as \(t^*\) (s).

Following Joseph65, the Boussinesq approximation to the mass balance equations results in the density difference only appearing in the buoyancy term. The mass balance equation reads

$$\begin{aligned} {\varvec{u}}^{*}+{\varvec{\kappa }}\left( \nabla p^{*}+{\textbf{y}}\left( \rho -\rho _{0}\right) g\right) =0, \end{aligned}$$


$$\begin{aligned} \nabla \cdot {\varvec{u}}^{*}=0 \end{aligned}$$

where \({\varvec{\kappa }}={\varvec{k}}/\mu _f\) is the porous medium conductivity, \({\varvec{k}}\) is the matrix permeability tensor, \(\mu _f\) is the fluid viscosity, \(\textbf{y}\) is a unit vector in the direction of gravitational force, g is the constant acceleration due to gravity, \(\rho\) and \(\rho _0\) are the fluid density at current and initial states, respectively. We assume that \(\rho\) is a linear function of \(T^{*}\)47,66

$$\begin{aligned} \rho =\rho _{0}\left( 1-\alpha \left( T^{*}-T_{0}^{*}\right) \right) , \end{aligned}$$

where \(\alpha\) is the thermal expansion coefficient, and \(T_{0}^{*}\) is the reference fluid temperature. We note that Eq. (5) is the simplest approximation, and one may easily adapt the proposed method when employing a more complex relationship provided in67. The heat advection–diffusion equation defined as

$$\begin{aligned} \gamma \frac{\partial T^{*}}{\partial t^{*}}+{\varvec{u}}^{*} \cdot \nabla T^{*}-K \nabla ^{2} T^{*} - {f_c}^* = 0. \end{aligned}$$

Here, \(\gamma\) is the ratio between the porous medium heat capacity and the fluid heat capacity, K is the effective thermal conductivity, and \({f_c}^*\) is a sink/source. We follow Nield and Bejan18 and define dimensionless variables as follows

$$\begin{aligned} {\varvec{X}}:=\frac{1}{H} {\varvec{X}}^{*}, \quad t:=\frac{\kappa }{\mu \gamma H^{2}} t^{*}, \quad p:=\frac{\kappa }{K} p^{*}, \quad {\varvec{u}}:=\frac{H}{K} {\varvec{u}}^{*}, \quad T:=\frac{T^{*}-T_{0}^{*}}{\Delta T^{*}}, \quad f_c:=\frac{t^*}{\Delta T^{*}}{f_c}^*, \end{aligned}$$

where H is the dimensional layer depth, and \(\Delta T^{*}\) is the temperature difference between two boundary layers. From these dimensionless variables, we could rewrite our Eqs. (3) and (4) as

$$\begin{aligned} \begin{aligned} {\varvec{u}}+\nabla p-\textbf{y} {\text {Ra}} T=0,&\, \text{ in } \, \Omega \times \mathbb {T}, \\ \nabla \cdot {\varvec{u}}=0,&\, \text{ in } \, \Omega \times \mathbb {T}, \\ p=p_{D}&\, \text{ on } \, \partial \Omega _{p} \times \mathbb {T}, \\ {\varvec{u}} \cdot \textbf{n}=q_{D}&\, \text{ on } \, \partial \Omega _{q} \times \mathbb {T}, \\ p=p_{0}&\, \text{ in } \, \Omega \text{ at } t = 0, \end{aligned} \end{aligned}$$

where \(\partial \Omega _{p}\) and \(\partial \Omega _{q}\) are the pressure and flux boundaries (i.e., Dirichlet and Neumann boundary conditions), respectively. Here, \(\textrm{Ra}\) is the Rayleigh number

$$\begin{aligned} \textrm{Ra}:=\frac{g \alpha \kappa \Delta T^{*} H}{K}. \end{aligned}$$

We then write Eq. (6) in dimensionless form as follows

$$\begin{aligned} \begin{aligned} \frac{\partial T}{\partial t}+{\varvec{u}} \cdot \nabla T-\nabla ^{2} T - f_c=0,&\, \text {in} \, \Omega \times (0,\mathbb {T}], \\ T = T_D&\, \text {on} \, \partial \Omega _{{T}} \times (0,\mathbb {T}], \\ (-{\varvec{u}} T+\nabla T) \cdot \textbf{n}={T_{\textrm{in}}} {\varvec{u}}\cdot \textbf{n}&\, \text {on} \, \partial \Omega _{\textrm{in}} \times (0,\mathbb {T}], \\ \nabla T \cdot \textbf{n} = 0&\, \text {on} \, \partial \Omega _{\textrm{out}} \times (0,\mathbb {T}], \\ T={T_{0}}&\, \text {in} \, \Omega \text{ at } t = 0, \end{aligned} \end{aligned}$$

where \(\partial \Omega _{{T}}\) is temperature boundary (Dirichlet boundary condition), \(\partial \Omega _{\textrm{in}}\) and \(\partial \Omega _{\textrm{out}}\) denote inflow and outflow boundaries, respectively, which are defined as

$$\begin{aligned} \partial \Omega _{\textrm{in}} := \{ {\varvec{X}} \in \partial \Omega : {\varvec{u}} \cdot \textbf{n} < 0\} \quad \text { and } \quad \partial \Omega _{\textrm{out}} := \{ {\varvec{X}} \in \partial \Omega : {\varvec{u}} \cdot \textbf{n} \ge 0\}. \end{aligned}$$

The detail of discretization could be found in Kadeethum et al.6,32, and the FOM source codes are provided in Kadeethum et al.32. After the second step, we have \(\textrm{M}\) snapshots of FOM results associated with the different parametric configurations in \({\varvec{\mu }}\). Since the problem formulation is time-dependent, the output of the FOM solver for each parameter instance \({\varvec{\mu }}^{(i)}\) collects the time series representing the time evolution of the primary variables for each time-step t. Thus, each snapshot contains approximations of the primary variables (\({\varvec{u}}_{h}\), \(p_h\), and \(T_h\)) at each time-step of the partition of the time domain \(\mathbb {T}\). Therefore, based on the training set cardinality \(\textrm{M}\) and the number \(N^t\) of time-steps, we have a total of \(N^t \textrm{M}\) training data to be employed in the subsequent steps. We note that as our finite element solver utilizes an adaptive time-stepping6,32, each snapshot may have a different number of time-steps \(N^t\), i.e. \(N^t = N^t({\varvec{\mu }})\).

The third step aims to compress the information provided by the training snapshots provided by the second step. Kadeethum et al.6 provide detailed derivations and comparisons between linear and nonlinear compression. Especially the convolutional layers, in their classical form, could not deal with an unstructured data structure (unstructured mesh), which is very common in scientific computing or, more specifically, finite element analysis. Hence, our goal is to develop a nonlinear compression that (1) consistently outperforms (or at least delivers similar accuracy) the linear compression and (2) is compatible with an unstructured data structure.

To achieve this goal, we propose a nonlinear compression utilizing feedforward layers in combination with self-supervised learning (SSL) of Barlow Twins (BT) (Fig. 6). The BT for redundancy reduction is proposed by Zbontar et al.46. It operates on a joint embedding of noisy images by producing two distorted images from an original one through a series of random cropping, resizing, horizontal flipping, color jittering, converting to grayscale, Gaussian blurring, and solarization. Since we do not operate on structured data (image) but unstructured data produced by finite element solver, we only employ random noise and Gaussian blur operations to produce our noisy data set, see Fig. 6.

Let \({z}_1^{{\varvec{u}}}, \cdots , {z}_{\textrm{Q}}^{{\varvec{u}}}\), \({z}_1^p, \cdots , {z}_{\textrm{Q}}^p\), and \({z}_1^T, \cdots , {z}_{\textrm{Q}}^T\) be the nonlinear manifolds of the \({\varvec{u}}_{h}\), \(p_h\), and \(T_h\), respectively. For the sake of compactness, we will only discuss primary variable \(T_h\). The same procedure holds for \({\varvec{u}}_{h}\) and \(p_h\). Our goal is to achieve \(\textrm{Q} \ll \textrm{M} N^t\) where \(\textrm{M} N^t\) is the total training data, which implies that our nonlinear manifolds could represent our training data using much lower dimension. We employ a vanilla AE (using only feedforward layers) that is regularized by Barlow Twins SSL to obtain \({\varvec{z}}^T = \left[ {z}_1^T, \cdots , {z}_{\textrm{Q}}^T \right]\). We do not use any batch normalization or dropout. The summary of the training process is presented in Algorithm 1. We will provide the detailed implementation in

figure a

In short, during the training phase, our BT–AE model is composed of one encoder, one decoder, and one projector. The training entails two sub-tasks; the first is BT (encoder and projector), which takes place in the outer loop. The second sub-task is responsible for the training of AE (encoder and decoder), which takes place inside the inner loop. The main reasons for this procedure are two folds. The first reason is Zbontar et al.46 states that the BT works better with large batch sizes. The AE, however, generally requires a small batch size68,69. Our previous numerical experiments based on DC–AE6 also align with this statement. Consequently, we set our batch size of the outer loop as \(\textbf{B}_{\textrm{outer}} = 512\), and the batch size of the inner loop as \(\textbf{B}_{\textrm{inner}} = 32\)

Prior to the training, we distort our training set (i.e., creating \(T_{h,A}\left( t, {\varvec{\mu }}\right)\) and \(T_{h,B}\left( t, {\varvec{\mu }}\right)\) from \(T\left( t, {\varvec{\mu }}\right)\)) through a series of two operations. First, add random noise is added as follows

$$\begin{aligned} \widetilde{T_{h,A}}\left( t, {\varvec{\mu }}\right) , \widetilde{T_{h,B}}\left( t, {\varvec{\mu }}\right) = T\left( t, {\varvec{\mu }}\right) + \epsilon {\text {SD}}\left( T\left( t, {\varvec{\mu }}\right) \right) \mathscr {G}\left( 0,1\right) \end{aligned}$$

where \(\widetilde{T_{h,A}}\left( t, {\varvec{\mu }}\right) , \widetilde{T_{h,B}}\left( t, {\varvec{\mu }}\right)\) are intermediate distorted input data. The constant \(\epsilon\), which is set to 0.1, determines the noise level as it is multiplied with the standard deviation of the input field. \(\mathscr {G}\left( 0,1\right)\) is a random value which is sampled from the standard normal distribution with mean and standard deviation of zero and one, respectively.

Subsequently, we pass \(\widetilde{T_{h,A}}\left( t, {\varvec{\mu }}\right) , \widetilde{T_{h,B}}\left( t, {\varvec{\mu }}\right)\) through Gaussian blur operation, which reads

$$\begin{aligned} {T_{h,A}}\left( t, {\varvec{\mu }}\right) , {T_{h,B}}\left( t, {\varvec{\mu }}\right) =\frac{1}{\sqrt{2 \pi {\text {SD}}\left( \widetilde{T_{h,A}}\left( t, {\varvec{\mu }}\right) , \widetilde{T_{h,B}}\left( t, {\varvec{\mu }}\right) \right) ^{2}}} \textrm{exp}\left( {-\frac{{\widetilde{T_{h,A}}\left( t, {\varvec{\mu }}\right) , \widetilde{T_{h,B}}\left( t, {\varvec{\mu }}\right) }^{2}}{2 {\text {SD}}\left( \widetilde{T_{h,A}}\left( t, {\varvec{\mu }}\right) , \widetilde{T_{h,B}}\left( t, {\varvec{\mu }}\right) \right) ^{2}}}\right) \end{aligned}$$

to obtain \(T_{h,A}\left( t, {\varvec{\mu }}\right)\) and \(T_{h,B}\left( t, {\varvec{\mu }}\right)\).

We use a number of the epoch of 50, see Algorithm 1. The outer loop works as follows: training BT begins with passing \(T_{h,A}\left( t, {\varvec{\mu }}\right)\) and \(T_{h,B}\left( t, {\varvec{\mu }}\right)\) to the encoder (it is noted we have only one encoder) resulting in \({\varvec{z}}_A^T\left( t, {\varvec{\mu }}\right)\) and \({\varvec{z}}_B^T\left( t, {\varvec{\mu }}\right)\). We then use \({\varvec{z}}_A^T\left( t, {\varvec{\mu }}\right)\) and \({\varvec{z}}_B^T\left( t, {\varvec{\mu }}\right)\) as input to the projector resulting in cross-correlation matrix \(\textbf{C}^{T}\left( t, {\varvec{\mu }}\right)\). \(\textbf{C}^{T}\left( t, {\varvec{\mu }}\right)\) is a square matrix with the dimensionality of the projector’s output, and its values range between -1, perfect anti-correlation, and 1, perfect correlation.

The Barlow Twins loss \(\mathscr {L}_{\textrm{BT}}^{T}\) (BT loss) is then calculated using

$$\begin{aligned} \mathscr {L}_{\textrm{BT}}^{T} := \mathscr {L}_{\textrm{I}}^{T} + \mathscr {L}_{\textrm{RR}}^{T} \end{aligned}$$


$$\begin{aligned} \mathscr {L}_{\textrm{I}}^{T} := \sum _{i}\left( 1-\textbf{C}_{i i}^T\left( t, {\varvec{\mu }}\right) \right) ^{2}, \end{aligned}$$


$$\begin{aligned} \mathscr {L}_{\textrm{RR}}^{T} := \lambda \sum _{i} \sum _{j \ne i} {\textbf{C}_{i j}^T\left( t, {\varvec{\mu }}\right) }^{2}, \end{aligned}$$

where \(\textbf{C}_{i i}^T\left( t, {\varvec{\mu }}\right)\) denotes the i-th diagonal entry of \(\textbf{C}^{T}\left( t, {\varvec{\mu }}\right)\), \(\lambda\) is a positive constant, which is set to \(5 \times 10^{-3}\) as recommended by Zbontar et al.46, and \(\textbf{C}_{i j}^T\) are off-diagonal entries of \(\textbf{C}^{T}\). In short, we train our BT part by trying to force \(\mathscr {L}_{\textrm{I}}^{T}\) to 1, but \(\mathscr {L}_{\textrm{RR}}^{T}\) to 0 resulting in teaching the encoder and projector learn how to get rid off noise from the distorted data, \(T_{h,A}\left( t, {\varvec{\mu }}\right)\) and \(T_{h,B}\left( t, {\varvec{\mu }}\right)\), and construct a representation that conserves as much \(T\left( t, {\varvec{\mu }}\right)\) information as possible.

Here, we follow the training procedures used by Kadeethum et al.6,70. We use the ADAM algorithm71 to adjust learnable parameters of encoder(\(\textrm{W}\) and \(\textrm{b}\)) and projector(\(\textrm{W}\) and \(\textrm{b}\)). The learning rate (\(\eta\)) is calculated as72

$$\begin{aligned} \eta _{c}=\eta _{\min }+\frac{1}{2}\left( \eta _{\max }-\eta _{\min }\right) \left( 1+\cos \left( \frac{\mathrm {step_c}}{\mathrm {step_f}} \pi \right) \right) \end{aligned}$$

where \(\eta _{c}\) is a learning rate at step \(\mathrm {step_c}\), \(\eta _{\min }\) is the minimum learning rate, which is set as \(1 \times 10^{-16}\), \(\eta _{\max }\) is the maximum or initial learning rate, which is selected as \(1 \times 10^{-4}\), \(\mathrm {step_c}\) is the current step, and \(\mathrm {step_f}\) is the final step. We note that each step refers to each time we perform back-propagation, including updating both encoder and projector’s parameters.

The inner loop is as follows: training AE starts with obtaining \({\varvec{z}}^T\left( t, {\varvec{\mu }}\right)\) by passing \(T_h\left( t, {\varvec{\mu }}\right)\) to the encoder. We then use \({\varvec{z}}^T\left( t, {\varvec{\mu }}\right)\) to reconstruct \(\widehat{T}_h\left( t, {\varvec{\mu }}\right)\) through the decoder. Subsequently, we calculate our data compression loss or AE loss (\(\mathscr {L}_{\textrm{AE}}^{T}\)) using

$${\mathcal{L}}_{{{\text{AE}}}}^{T} : = {\text{MSE}}^{T} = \frac{1}{{{\text{M}}N^{t} }}\sum\limits_{{i = 1}}^{{\text{M}}} {\sum\limits_{{k = 0}}^{{N^{t} }} {\left| {\hat{T}_{h} \left( {t^{k} , {\varvec{\mu }}^{{(i)}} } \right) - T_{h} \left( {t^{k} , {\varvec{\mu }}^{{(i)}} } \right)} \right|^{2} } } .$$

Similar to the training of BT, we use ADAM to adjust learnable parameters of encoder(\(\textrm{W}\) and \(\textrm{b}\)) and decoder(\(\textrm{W}\) and \(\textrm{b}\)) according the gradient of Eq. (18). The \(\eta _{c}\) is adjusted by Eq. (17). In contrast to the training of BT, we use \(\eta _{\min } = 1 \times 10^{-16}\), and \(\eta _{\max } = 1 \times 10^{-5}\).

Following the training of the BT–AE, we now establish the manifold \({\varvec{z}}^T\left( t, {\varvec{\mu }}\right) , \quad \forall t \in \mathbb {T} \, \text {and} \, \forall {\varvec{\mu }} \in \mathbb {P}\) during the fourth step shown in Fig. 6. The data available for this task are the pairs \((t, {\varvec{\mu }})\) and \({\varvec{z}}^T\left( t, {\varvec{\mu }}\right)\) in the training set. We achieve this through the training of artificial neural networks (ANN). Following Kadeethum et al.6,43, our ANN has five hidden layers, and each layer has seven neurons. We use tanh as our activation function. Here, we use a mean squared error (\({\textrm{MSE}^{{\varvec{z}}^T}}\)) as the metric of our network loss function, defined as follows

$$\begin{aligned} {\mathscr {L}_{\textrm{ANN}}^{T}} := {\textrm{MSE}^{{\varvec{z}}^T}}=\frac{1}{\textrm{M} N^t} \sum _{i=1}^{\textrm{M}}\sum _{k=0}^{N^t}\left| \widehat{{\varvec{z}}}^T\left( t^k, {\varvec{\mu }}^{(i)}\right) -{{\varvec{z}}^T}\left( t^k, {\varvec{\mu }}^{(i)}\right) \right| ^{2}. \end{aligned}$$

To minimize Eq. (19), we use the ADAM algorithm to adjust each neuron \(\textrm{W}\) and \(\textrm{b}\), a batch size of 32, a learning rate of 0.001, a number of epoch of 10,000, and we normalize both our input (\(t, {\varvec{\mu }}\)) and output (\({\varvec{z}}^T\)) to [0, 1]. To prevent our networks from overfitting behavior, we follow early stopping and generalized cross-validation criteria4,73,74. Note that instead of literally stopping our training cycle, we only save the set of trained weight and bias to be used in the online phase when the current validation loss is lower than the lowest validation from all the previous training cycle.

During the online phase (the fifth step shown in Fig. 6), we utilize the trained ANN and the trained decoder to approximate \(\widehat{{T}}_{h}\left( \cdot ; t, {\varvec{\mu }}\right)\) for each inquiry (i.e., a pair of \((t, {\varvec{\mu }})\) ) through

$$\begin{aligned} \widehat{{\varvec{z}}}^T\left( \cdot ; t, {\varvec{\mu }}\right) = {\text {ANN}} \left( t, {\varvec{\mu }} \right) , \end{aligned}$$

and, subsequently,

$$\begin{aligned} \widehat{{T}}_{h}\left( \cdot ; t, {\varvec{\mu }}\right) = {\text {decoder}} \left( \widehat{{\varvec{z}}}^{T}\left( \cdot ; t, {\varvec{\mu }}\right) \right) . \end{aligned}$$

We note that, for the prediction phase, our ROM could be evaluated using any timestamps, including one that does not exist in the training phase (i.e., any t that lies within \([t^{0}, \tau\)]) because our ROM treats the time domain \(\mathbb {T}\) continuously. Besides, in contrast with the FOM, the ROM is not bound by the CFL condition and can predict the quantities of interest at any specific time without intermediate computation. Hence, our proposed framework can reduce the computational time significantly.