Background & Summary

Numerical simulations in weather forecasting, wind and hydroelectric energy, aerospace vehicle design, automotive design, turbomachinery, nuclear plant design, and many other applications all rely on closure models to accelerate simulations while modelling the complex physical phenomenon of turbulence. While higher resolution techniques such as large-eddy simulation (LES) and direct numerical simulation (DNS) are becoming more widespread, the computational demands compared to current capabilities make these techniques unaffordable for many industrial simulations. For this reason, Reynolds-averaged Navier-Stokes (RANS) simulations are expected to remain the dominant tool for predicting flows of practical relevance to engineering and industrial problems over the next few decades1. However, flows with strong adverse pressure gradients2, separation3, streamline curvature4, and reacting chemistry are often poorly predicted by RANS approaches. Developing methods to improve the accuracy of RANS simulations will help bridge this critical capability gap between RANS and LES5.

Several recent investigations have demonstrated the potential of applying machine learning to the development of corrective turbulence closure models for RANS. Ling et al.6 constructed a tensor basis neural network (TBNN), which predicts the anisotropy tensor using five invariant scalars derived from the mean strain and rotation rate tensors. The TBNN turbulence closure model developed by Ling et al.6 is effectively a fifth-order eddy viscosity model, with locally varying coefficients predicted via deep learning. The ability to express such a locally-tuned, high-order relationship between the strain rate and anisotropy tensors is a powerful method to improve the accuracy of RANS simulations. Wu et al.7 developed a random-forests-based model, which directly predicts the Reynolds stress anisotropy. Kaandorp8 and Kaandorp and Dwight9 proposed a tensor basis random forest (TBRF) model, which is the random forests analogue to the TBNN proposed by Ling et al.6. While the different models by Ling et al.6, Wu et al.7, Kaandorp and Dwight9, Zhu and Dinh10, Zhang et al.11, Fang et al.12, and Song et al.13 all show promise, the results cannot be directly compared—each investigation used a different set of input features and labels, with different numerical settings chosen for feature generation. For this reason, Duraisamy14 recently highlighted the need for a benchmark dataset for machine-learnt closure models.

The approach used by Ling et al.6, Wu et al.7, Kaandorp and Dwight9, Zhang et al.11 and others is referred to as corrective or open loop augmented closure modelling. In this open loop framework, the machine learning model is used to generate a one-time mapping between the fields from a converged RANS solution, and fields from DNS. A contrasting approach is the closed loop framework15,16, where the training process involves conducting RANS simulations in an iterative manner, to repeatedly update the feature set until the model predictions and turbulence closure coefficients converge. The present work aims to present a dataset useful primarily for corrective augmented closure modelling, where the machine learning model is queried once to predict the Reynolds stress. However, this dataset can also be used as an initial set of fields for closed loop closure modelling, and the provided OpenFOAM case files are convenient for performing iterative simulations in a closed loop framework.

To generate a set of input features, the current requirement is for every investigator to generate a set of RANS simulations that match the DNS/LES reference cases. This requirement has several drawbacks. As the number of included datasets grows, the effort required grows. The development of the ImageNet dataset spurred rapid growth of the computer vision field, which would not have been possible otherwise. From an effort point of view, the availability of a curated dataset dramatically increases the time spent developing the models themselves, rather than setting up many RANS simulations to gather input features.

Another major drawback of the current approach arises from the issue of reproducibility in the field of computational fluid dynamics (CFD). Often, CFD studies are difficult to reproduce, due to a large number of input conditions17. Each investigation will use different meshes, numerical schemes, turbulence models, and other selections which affect the solution. The field of machine learning has also been plagued with reproducibility challenges, even with the widespread use of benchmark datasets18. While machine-learnt turbulence models are a promising approach, the development of these models could be significantly impeded by mixing two fields where reproducibility is a challenge. A well-documented, widely available dataset solves at least one aspect of the reproducibility issue, in that all models can at least be trained in the same environment, using the same input features and labels.

Motivated by the lack of a sufficient dataset, the present work aims to develop a set of RANS simulations of highly resolved reference cases in order to generate a curated dataset19. In this work, the numerical methods for the RANS simulations are presented, along with the selection and calculation of the input features for machine learning models. In doing so, the present work aims to present a large computational dataset, curated and logically structured for immediate use in developing next-generation turbulence closure models for RANS using data-driven machine learning. Table 1 summarizes the inputs and outputs of the present work.

Table 1 Inputs and outputs of the present study.

Methods

Selection of reference cases

An important aspect of dataset selection for data-driven turbulence modelling is sweeping of a parameter space. A deep insight followed by a deeper understanding of the fluid phenomena can be obtained by providing information on how the geometry and/or the Reynolds number changes the flow behaviour. In contrast, single-point measurements are only valuable in approximating a universal mapping between inputs and outputs. The majority of the datasets used here involve sweeping through some parameter space. Table 2 summarizes the cases used in the dataset.

Table 2 Cases in the dataset. ReL is the Reynolds number based on the characteristic length and velocity scales shown in Figs. 1 to 7.

Computational method

The flow is assumed to be incompressible, viscous, steady, and turbulent for all cases. Under these conditions, the fluid properties are specified by the kinematic molecular viscosity v. Table 3 summarizes the viscosity used for each case.

Table 3 Kinematic molecular viscosity used for each case.

The open-source library OpenFOAM v200620 was used to generate the dataset. The ability to replicate CFD is greatly improved by supplying the mesh and settings files17. The dataset includes the OpenFOAM case files, including the meshes used for all the cases, and the full details of the settings used. Supplying the OpenFOAM files also reduces the effort required for a posteriori testing. This practice is following Xiao et al.21, who included the OpenFOAM files with their dataset. While this section highlights the basic numerical settings used, the reader is referred to the dataset for the complete OpenFOAM settings.

Numerical schemes

A standardized set of numerical schemes was used for all cases. The numerical schemes represent commonly used RANS schemes, which represent a good trade-off between stability and accuracy. For discretizing the convective terms in the momentum equations, a second-order upwind scheme was used. For discretizing convective terms in the turbulence transport equations, a first-order upwind scheme was used. For the diffusive terms, a second-order central difference scheme was used. Since all the flow cases are steady, the transient terms were set to zero.

The simpleFoam solver was used to solve the equations iteratively. The semi-implicit method for pressure-linked equations-consistent (SIMPLEC) algorithm was used to accelerate convergence. For some cases, additional non-orthogonality correcting loops were applied to the pressure equation. The generalized geometric algebraic multigrid (GAMG) solver was used for the pressure equation, and the preconditioned bi-conjugate gradient (PBiCGStab) solver was used for all other equations.

Iterative residual convergence below 10−6 was generally achieved, with most simulations converging below 10−8. The residual plots for each simulation are provided along with the dataset. The exceptions to this tight residual convergence criteria are the Uy Uz, and p fields for the square duct cases. The linear eddy viscosity model is unable to accurately predict the secondary vortices resulting from non-zero Uy and Uz components in the square duct case, and therefore minimal convergence is seen in these residuals as the in-plane velocity fields remain close to the initial condition of zero. The pressure field for the square duct case does not converge below 10−6 due to the presence of a forcing term which maintains the bulk velocity, resulting in uniform streamwise zero pressure equal to the initial condition of zero.

Turbulence modelling

The two most common families of turbulence closure models, k-ε and k-ω, include many sub-models. Previous investigations on machine-learnt models for predicting the anisotropy tensor have augmented the standard k-ε model6, the Launder-Sharma low Reynolds number k-ε model8, and the k-ω model7,9. Four representative turbulence models were selected for the dataset: namely, the standard k-ε22, k-ε-ϕt-f 23, k-ω2, and the k-ω shear stress transport (SST)24 turbulence closure models. In this work, ϕt is used to denote the anisotropy measure \(\bar{{v{\prime} }^{2}}/k\) to align with the variable naming in OpenFOAM. Here, \(\bar{{v{\prime} }^{2}}\) denotes the wall-normal Reynolds stress. The default coefficients were used for all turbulence models20.

The k-ε-ϕt-f model is a more sophisticated model than the k-ε and k-ω models, through the inclusion of an additional transport equation for the anisotropy measure \({\phi }_{t}\equiv \bar{{v{\prime} }^{2}}/k\), and an elliptic equation for f. f is a scalar which predicts TKE redistribution from the streamwise to the wall-normal Reynolds stress. This model is an improved version of the original \(\bar{{v{\prime} }^{2}}\)-f model proposed by Durbin25, and the improved “code-friendly” version developed by Lien and Kalitzin26. The additional quantities enable the creation of new input features not available in the previous two-equation investigations. Both additional scalars satisfy all desired invariance properties, including Galilean invariance.

For all turbulence models, the mesh was sufficient for a low Reynolds number wall treatment. Low Reynolds number wall boundary conditions are provided for k, ε, and ω in OpenFOAM27. A fixed-value k = 0 boundary condition was applied at no-slip walls. At no-slip walls, the following low Reynolds number fixed value boundary condition was applied for ε:

$$\varepsilon ={\varepsilon }_{vis}=2wk\frac{\nu }{{y}^{2}},$$
(1)

where w are the cell corner weights20. For ω the following fixed value boundary condition was applied at no-slip walls.

$$\omega =\frac{6{\rm{\nu }}}{{\beta }_{1}\,{y}^{2}},$$
(2)

where β1 = 0.075.

Domain and boundary conditions

The domain and boundary conditions for all cases were selected to match the DNS or LES reference simulations. There are two main types of boundary conditions used in the dataset: fixed-free, and streamwise cyclic. While the periodic hills and square duct cases utilize a streamwise cyclic boundary condition, the bump, converging-diverging channel, and curved-backward facing step cases employ a fully-developed inlet velocity profile, and a zero-gradient outlet. The simulations here involve four different turbulence models, each with different fields. The units used for each variable are given in Table 4.

Table 4 Units for each variable requiring boundary conditions.

Mesh

OpenFOAM’s utilities were used to generate the meshes. The mesh generation method varied from case to case, as some cases have changing geometries. Table 5 summarizes the meshes used. All meshes met the low Reynolds number wall treatment criterion of \({y}^{+}\approx 1\) or below. Here, \({y}^{+}\equiv {u}_{\tau }{y}_{w}/\nu \) is the normalized wall-normal distance, where yw is the wall-normal distance, and uτ is the wall friction velocity. In all cases, the mesh was either hexahedral or hexahedral-dominant. A high-quality mesh is important for generating input features for machine learning, in that some terms are sensitive to the mesh quality. For example, the basis tensor \({\widehat{{\mathscr{T}}}}_{10}\) in a general representation of the Reynolds stress tensor proposed by Pope4 is fifth order in terms of the velocity gradient tensor. In developing the feature set here, we found that to keep these terms stable, the number of tetrahedral cells in the domain must be minimized. However, many industrial meshes contain tetrahedral cells, and are of poorer quality than the structured meshes generated here. While CFD results are normally sensitive to the mesh used, machine learning models are especially sensitive to the mesh quality. Poorer meshes result in increased noise and more outliers in the input feature set.

Table 5 Meshes used for discretizing the domain.

Periodic hills

Flow over periodic hills with cyclic boundary conditions is a common benchmark problem for turbulence modelling. The periodic hills case features separation, an important phenomenon for RANS models to accurately capture due to the prominence of strongly separated flows in many industrial settings. To provide a parameterized dataset for data-driven turbulence modelling, Xiao et al.21 performed DNS of flow over a series of periodic hills. This dataset consists of five cases, characterized by the steepness ratio α. The values of α selected are α = 0.5, 0.8, 1.0, 1.2, and 1.5, which results in a range of separated flows. The geometry for the five periodic hills cases is shown in Fig. 1. The Reynolds number based on bulk velocity and crest height for all cases is fixed at Re = 5,600.

Fig. 1
figure 1

The geometry for the five periodic hills cases. Further detail is given in Xiao et al.21. The Reynolds number for this case is calculated based on the hill height H and mean bulk velocity Ub. These parameters are fixed for all cases, so ReH remains fixed at 5,600.

The periodic hills case is a two-dimensional (2D) flow, with the domain geometry characterized in terms of the hill height H, as shown in Fig. 1. The domain height is fixed at 3.04H, and the domain width changes from 7.07H to 10.9H, as the parameter α changes. The boundary conditions for the periodic hills case are streamwise cyclic for all flow variables. Both the top and bottom boundaries are treated as no-slip walls. To maintain a constant bulk velocity in the flow, a mean pressure gradient source term is added to the momentum equation. Therefore, the pressure field for cases with cyclic boundary conditions should be interpreted as the deviation from the mean pressure field.

The mesh for the steepest periodic hills case (α = 0.5) is shown in Fig. 2. The RANS meshes for all periodic hills cases were provided by Xiao et al.21. The periodic hills mesh is a structured mesh, with cells concentrated near the boundary layer. While the geometry changes by varying the hill steepness and domain length, the number of cells for all cases is the same.

Fig. 2
figure 2

Structured hexahedral mesh used to discretize the α = 0.5 periodic hills case.

Square duct

The DNS dataset for flow in a square duct by Pinelli et al.28 has been widely used in data-driven turbulence modelling. This dataset consists of 16 cases, all with the same fixed geometry shown in Fig. 3. The Reynolds number based on the duct half-width varies between 1,100 and 3,500. The flow in a square duct is a challenging test case for eddy viscosity models. Linear eddy viscosity models are unable to predict the secondary corner vortices which form in the duct. These structures are Prandtl’s secondary motion of the second kind28. The dataset contains the mean velocities and Reynolds stresses in a cross-section of the duct. The inclusion of this dataset allows the machine-learnt model to incorporate the Reynolds number dependence of these challenging secondary motions, from the transitional to the fully turbulent regimes. Additionally, it is the only three-dimensional (3D) flow in the dataset, for which the Reynolds shear stresses \(\overline{u{\prime} w{\prime} }\) and \(\overline{v{\prime} w{\prime} }\) are nonzero.

Fig. 3
figure 3

The mesh and geometry for the square duct cases. The cases vary by changing the Reynolds number from 1,100 to 3,500, which is calculated based on the duct half-width H and mean bulk velocity Ub.

The geometry for the square duct is shown in Fig. 3. The dimensions for this 3D case are given in terms of the duct half-width H. The duct is a 2H × 2H × 5H box. Wall boundary conditions were applied for the top, bottom and sides of the duct. The streamwise cyclic boundary conditions for the square duct case are summarized in Table 6. A mean pressure gradient source term was added to the momentum equation, to maintain a constant bulk velocity.

Table 6 Boundary conditions for the periodic hills and square duct cases.

The mesh for the square duct case is shown in Fig. 3. This mesh is also structured. Cells are concentrated near the boundary layer. The mesh for all square duct cases is identical. The \({y}^{+}\le 1\) criterion was verified for the highest Reynolds number flow case. The mesh is 3D, with the dataset for machine learning being generated using a cross-section of the mesh.

Parametric bumps

The LES dataset for flow over a family of bumps by Matai and Durbin29 has been recently made available for data-driven closures. The bump is a circular arc, with convex fillets on either end. The dataset is characterized by the bump height h, which is the highest point of the circular arc as shown in Fig. 4. The Reynolds number based on momentum thickness and inlet free stream velocity U is fixed at Reθ = 2,500, while the Reynolds number based on bump height and U varies from Reh≈13,250 to 27,850. At h = 20 mm, the flow remains attached along the bump, while increasing the height further results in slight separation at h = 26 mm. For the highest bump corresponding to h = 42 mm, a small separated region forms behind the bump. While the periodic hills dataset features massively separated flows, the bump cases incorporate a smaller degree of separation. Matai and Durbin found that the mild separation causes a high turbulent kinetic energy (TKE) zone to depart from the bump ahead of the separated region, which is not the case for massively separated flows. Matai and Durbin attributed this region to the adverse pressure gradient generating a mean shear profile. Another important effect captured in the parametric bump case is strong disequilibrium. The parametric bump dataset is highly valuable for training machine-learnt closure models due to the high Reynolds number, parametrically sweeping geometry, physics unique to mildly separated flows, and strong disequilibrium.

Fig. 4
figure 4

The geometry for the five parametric bump cases. The bump length C is fixed at 305 mm, and the bump height varies as h = 20, 26, 31, 38, and 42 mm. Further detail is given in Matai and Durbin29. The Reynolds number based on maximum inlet velocity and step height varies from Reh = 13,260 to Reh = 27,850, with the momentum thickness Reynolds number fixed at Reθ = 2,500.

For the parametric bumps, the DNS and LES simulations utilized a fully-developed inlet flow generated by a “feeder” simulation. To generate the RANS inlet condition, a similar approach to the DNS and LES was taken: a flat version of the domain was simulated with fixed-free boundary conditions to allow the flow to fully develop before entering the domain of interest. Equations for isotropic turbulence are commonly used to estimate the RANS boundary conditions for fixed turbulence inlets. For the feeder simulations, the following equations were used to estimate turbulence quantities at the inlet:

$$k=\frac{3}{2}{(UI)}^{2},$$
(3)
$$\varepsilon ={C}_{\mu }^{3/4}\frac{{k}^{3/2}}{{L}_{t}},$$
(4)
$$\omega =\frac{\varepsilon }{0.09k},$$
(5)

where I is the turbulent intensity, Lt is the turbulence length scale and Cμ is a turbulence closure coefficient.

The parametric bumps case is unique in this dataset in that the top boundary is zero-gradient, compared to the walls used in the other cases. The inlet free-stream velocity U for the LES reference simulation was 16.77 m/s. To recreate these conditions, the inlet boundary conditions for the flat cases were adjusted to produce U = 16.77 m/s. It should be noted that this is an approximation of the LES inlet condition, because the four different turbulence models all produce different U. For the dataset, the mean velocity used for all turbulence models was the same (Table 7), so that the boundary conditions are comparable between turbulence models. The boundary conditions for generating a fully-developed inlet profile for the bump case are summarized in Table 7. After generating a fully-developed profile, the U, k, ε, and ω fields were used as fixed-value inlet conditions for the bump cases. The boundary conditions for the bump cases are summarized in Table 8. The domain size for the parametric bump set is fixed at 2C × 0.5C.

Table 7 Boundary conditions for the flat developing flow case, used to generate an inlet profile for the bump cases.
Table 8 Boundary conditions for the bump cases.

The parametric bump mesh is shown in Fig. 5, and the converging-diverging channel mesh is shown in Fig. 6. Both cases use a structured mesh over an obstruction in the flow. Cells are concentrated in the wake region, and the boundary layer. For the parametric bump, the changing geometry was created by adjusting the bump profile in the structured mesh generator, which resulted in the same number of cells for all cases. The mesh shown in Fig. 5 is for the highest bump. For the converging-diverging channel, the mesh density for both Reynolds numbers is identical, with the Re = 20,580 having an extended domain, and therefore more cells.

Fig. 5
figure 5

Structured hexahedral mesh used to discretize the h = 42 mm parametric bump case.

Fig. 6
figure 6

The mesh and geometry for the two converging-diverging channel cases, corresponding to Reynolds numbers of ReH = 12,600 and 20,580. The Reynolds number for these two cases is based on the channel half height H and the maximum inlet velocity Umax. The ReH = 12,600 converging-diverging channel case uses a smaller domain than the ReH = 20,580 case, but with an identical mesh.

Converging-diverging channel

Two datasets are available for flow over an identical converging-diverging geometry at ReH = 12,600 and ReH = 20,580, shown in Fig. 6. The Reynolds number for this case is based on the maximum inlet velocity and the channel half-height H. The lower Reynolds number dataset comes from the DNS by Laval and Marquielle30, and Marquillie et al.31. The higher Reynolds number dataset was generated by Schiavo et al.32 using LES. The bump height is approximately 2H/3. A fully developed internal channel flow enters the domain and impinges on the abrupt upstream side of the bump. The flow accelerates as the channel converges, then decelerates over the gradual downstream side of the bump. At ReH = 12,600, a thin separation bubble forms along the downstream slope. Along the flat upper wall, the flow remains attached but on the cusp of separation. At ReH = 20,580, the separation bubble grows. The cases contain valuable information about the Reynolds number effect on separation, reattachment, and development of a turbulent boundary layer under an adverse pressure gradient. The long domain downstream of the bump for ReH = 20,580 effectively provides an additional set of LES information for developing plane channel flow.

A similar procedure for the bump case was completed to generate inlet conditions for the converging-diverging channel. However, for the converging-diverging channel, the top boundary is a wall. The boundary conditions for the converging-diverging channel case were adjusted to produce a maximum velocity of Umax = 1.0 m/s, to match the reference simulations. The boundary conditions for the flat, developing flow case is shown in Table 9, and the boundary conditions for the cases in the data set are shown in Table 10. The domain size for the ReH = 12,600 converging-diverging channel is 12.6H × 2H, while for ReH = 20,580 the domain is enlarged to 25.3H × 2H by extending the outlet length.

Table 9 Boundary conditions for the flat developing flow case, used to generate an inlet profile for the converging-diverging channel cases.
Table 10 Boundary conditions for the converging-diverging channel cases.

Curved backward-facing step

The curved backward-facing step case simulated by Bentaleb et al.33 using LES was also included in the dataset. The geometry for this case is shown in Fig. 7. While this is the only case that does not feature parametric variation, it contains an additional set of data on separation and reattachment. While other cases in the dataset feature separation after an acceleration of the flow, the curved backward-facing step case features separation of a fully developed turbulent boundary layer. This phenomenon is difficult for RANS models to predict, and therefore the LES results were included in the dataset. While the original work by Bentaleb et al.33 defined a Reynolds number based on the maximum inlet velocity, we found that the large channel height meant that the mean velocity for all turbulence models was within 10% of the maximum velocity, so to approximate the reference case, defining the Reynolds number for the dataset based on the mean inlet velocity was sufficient. The top and bottom boundaries are walls. An identical procedure to the converging-diverging channel case was used to develop the inlet boundary condition, with Umax = 1.0 m/s. The curved backward-facing step domain is 22.7H × 9.48H. The boundary conditions for the flat, developing flow case is shown in Table 11, and the boundary conditions for the cases in the data set are shown in Table 12.

Fig. 7
figure 7

The geometry and mesh for the curved backward-facing step case. The Reynolds number ReH = 13,700 is based on the mean inlet velocity \(\bar{u}\) and the step height H.

Table 11 Boundary conditions for the flat developing flow case, used to generate an inlet profile for the curved backward facing step cases.
Table 12 Boundary conditions for curved backward facing step case.

The only unstructured mesh in the dataset is the curved backward-facing step, shown in Fig. 7. While it was feasible to generate a structured mesh for this case, an unstructured mesh was generated to include some more typical industrial cells into the dataset. Specifically, near the backward-facing step, the mesh transitions out of the inflation layer using some triangular cells.

Data Records

The dataset19 is hosted on Kaggle, a common platform for machine learning. A total of 29 simulations (Table 2) per turbulence model were completed to match the reference data. The DNS or LES reference data were interpolated onto the RANS grid, using linear interpolation. Any points which required extrapolation of the reference data were dropped, and the interpolated reference data were checked for realizability using the criteria from Banerjee et al.34. After interpolation and data quality checks, 895,640 points of RANS data paired with corresponding DNS or LES data are available for each turbulence model. Each data point represents a cell of a RANS simulation in the dataset, with the corresponding DNS/LES quantity interpolated onto the RANS cell. At each data point, the full set of base, derived, and labels fields described in the following sections are provided. A separate file is provided for each field, for each RANS simulation. An explanation of the file naming convention is provided with the dataset.

To maximize the usefulness of the dataset, a comprehensive set of input features and labels was generated. The dataset is organized into two types of data: base variables, and derived quantities provided for convenience. The base variables contain the bare minimum fields that need to be provided to construct the rest of the fields, which are the RANS fields and grid points. The available base fields in the dataset are summarized in Table 13, and the derived fields are summarized in Tables 14 and 15.

Table 13 Base fields available in the dataset.
Table 14 Derived feature fields available in the dataset.
Table 15 Derived label fields available in the dataset.

The more useful portion of this dataset is the set of pre-constructed machine learning input features. The selection of input features is a critical area of ongoing research in machine-learnt turbulence models. The typical practice in machine learning Reynolds stress modelling is to derive a set of invariants from a tensor basis, combined with other invariant scalars. This was the approach used in6,7,8,9,21 and others. While the input feature set varies, an effort has been made to provide sufficient fields in the dataset to conveniently reproduce past feature sets, and develop new ones. For example, all of the input features and labels used by Ling et al.6 are directly provided: the five invariants of the mean strain and rotation rate tensor, the ten basis tensors described in Pope4, and the anisotropy tensor labels.

Labels

This dataset is suited for models that predict the Reynolds stress tensor, an equivalent problem to predicting the anisotropy tensor. The provided label set includes the individual Reynolds stress components (the base labels), and other fields that are sometimes more convenient to use. The Reynolds stress tensor, TKE, and anisotropy tensor are provided as ready-to-use labels.

Invariants of tensor bases

The invariants are derived from a set of basis tensors, which form a basis for the space spanned by a set of feature tensors. First, the feature tensors need to be selected. The selection of the feature tensors determines what flow variable gradients are incorporated into the model. Previous investigations have selected the set of feature tensors as \(\left\{\widehat{S},\widehat{R}\right\}\)6, \(\left\{\widehat{S},\widehat{R},\nabla k\right\}\)8,9, and \(\left\{\widehat{S},\widehat{R},\nabla k,\nabla p\right\}\)7. If the feature tensors were directly employed as input features, the model would not be invariant because these inputs change with the coordinate system. Therefore, the procedure presented by Spencer and Rivlin35 is commonly employed to generate a tensor basis for the feature set. After constructing the tensor basis, the invariants of the tensor basis are taken—in other words, the traces of the basis tensors are used as input features. This procedure guarantees that the model has the same invariance properties as the trace of the basis tensors.

The dataset includes several quantities which are convenient in generating tensor bases. Along with the velocity gradient tensor U, the strain rate and rotation rate tensors S, R are provided. While the strain and rotation rate tensors are provided without normalization, a set of pre-normalized strain and rotation rate tensors \(\widehat{S},\widehat{R}\) are provided, with the normalizations shown in Table 14. A similar set of features for the kinematic pressure and TKE gradients are provided. The gradients themselves, a vector quantity, and the associated antisymmetric tensors for both the un-normalized and normalized forms are provided.

The provided dataset is sufficient to form the most comprehensive tensor bases used to date, which is the 47 tensor basis used by Wu et al.7. However, it is the traces of these 47 tensors which are of interest. These 47 invariant traces are included in the dataset to be directly used as input features to a machine learning model. Also included is the set of 5 invariants (λi), which arise from using the strain and rotation rate as the feature tensors, as in Ling et al.6.

Other input scalars

After gathering the set of tensor basis invariants, an additional set of scalars is added. Care must be taken that these scalars are invariant to not corrupt the invariance of the constructed tensor basis invariants. While many scalars have been proposed, many of them are not Galilean invariant, which is a property desired in machine-learnt turbulence models. Therefore, four Galilean invariant scalars used by Kaandorp and Dwight9 are included as ready-to-use features in the dataset. While this set of input scalars is not comprehensive, the dataset includes sufficient fields to conveniently generate other scalar quantities.

Technical Validation

The RANS results are sensitive to the mesh used. While the mesh must be compatible with the selected wall treatment, it must also be sufficiently fine to reduce discretization errors. To demonstrate that the selected meshes do not affect the result, a mesh independence study was completed for each of the five flow cases. The most demanding case was selected for each flow type: the steepest periodic hills case, the highest Reynolds number square duct, the highest bump, the highest Reynolds number converging-diverging channel, and the curved backward-facing step. Mesh independence was demonstrated using the k-ε turbulence model. The mesh study was conducted by examining the change in the velocity fields between varying mesh sizes.

Supplementary Figs. 1 and 2 show the results of the mesh convergence study for the periodic hills case. The meshes provided by Xiao et al.21 were refined two times, each by a factor of 2 in the x and y directions. A small group of cells could not be refined while maintaining reasonable quality, which is why the meshes shown in Supplementary Figs. 1 and 2 do not exactly contain N, 4N,and 16N cells. The results for the periodic hills case demonstrate good mesh convergence for the grid with the smallest number of cells used in the study. There is almost no change for the U velocity for grids whose number of cells is greater than N = 14,751. The V profiles near the inlet boundary shown changes between the mesh sizes. For this case, the mesh convergence is non-monotonic, but the differences of the V profiles between the various meshes used are small. Therefore, the N = 14,751 mesh is sufficiently converged.

One of the main considerations for the square duct mesh is sufficient resolution in the yz plane to extract machine learning features. The reference data by Pinelli et al.28 are provided as a set of statistics in the yz plane. Even though Supplementary Fig. 3 shows that the solution is mesh-converged at N = 87,552, the resolution in the yz plane is too coarse. The N = 87,552 mesh results in 2,304 dataset points per case, while the N = 691,200 mesh results in 9,216 points per case. Therefore, the N = 691,200 mesh is selected for generating the dataset, because the solution is mesh independent, and there are sufficient cells in the yz plane to generate features for machine learning.

The parametric bump is the highest Reynolds number flow in the dataset (ReH ≈ 27,850) and, as a consequence, it requires a dense mesh. Solution convergence at the coarsest mesh with N = 72,100 cells was demonstrated by increasing the number of cells in the structured mesh generator by a factor of two, and then four, and comparing the velocity profiles for the corresponding N, 4N, and 16N cases. Supplementary Figs. 4 and 5 show the comparisons made. For the U velocity profile, there are small differences in the wake of the bump, and in the far-field above the bump. The V velocity field reflects these small far-field differences above the bump. However, the differences are comparatively small, and the mesh demonstrates good convergence to generate the dataset.

Mesh convergence for the converging-diverging channel case was demonstrated similarly to the bump case. The number of cells in the structured mesh generator was increased by a factor of two, and then four. Supplementary Figs. 6 and 7 show that there are almost no differences between the solutions as the mesh is refined, even by a factor of 16. Therefore, the mesh for the converging-diverging channel case is sufficiently converged at N = 183,750.

Demonstrating mesh convergence for the curved backward-facing step case was completed similarly to the periodic hills case, by refining the mesh twice in each direction. Some cells could not be refined while maintaining reasonable mesh quality, which is the reason that the meshes in Supplementary Figs. 8 and 9 do not exactly have N, 4N, and 16N cells. The solution has excellent mesh convergence at N = 37,082, in both the U and V velocity fields.

Usage Notes

The dataset structure consists of a folder for each turbulence model, with an additional folder for the DNS/LES labels19. The RANS features for each case are provided using a consistent naming scheme. This structure allows the data to be accessed and processed in a coherent manner for immediate use in open-source machine learning frameworks such as TensorFlow and PyTorch. An example notebook of how to use the data to develop a simple machine learning model for the Reynolds stress anisotropy tensor is provided on the dataset page. Another example notebook is provided that demonstrates the field formats, using the square duct case as an example. The dataset will be updated as more DNS/LES reference datasets become available, or if there is demand to include additional RANS turbulence models. The curated dataset is most suitable for direct use in corrective (open loop) RANS turbulence modelling using machine learning. While the dataset presented here is not targeted for iterative (closed loop) machine learning-based RANS turbulence modelling (Schmelzer et al.15, Taghizadeh et al.16), it nevertheless can be used to provide the initial set of fields as well as to facilitate the implementation of the iterative approach for a particular RANS closure model (at least for the four turbulence closure models included in the dataset).

The dataset includes ready-to-use quantities, OpenFOAM files, and residual plots for all simulations. The ready-to-use input features are provided in the folders named by each turbulence model. The ready-to-use labels are provided in the labels folder. The openfoam folder provides the base quantities in OpenFOAM format, which is convenient for testing the corrective model. The residuals folder contains residual plots for all simulations.

There are approximately 1,000 fields per turbulence model, provided as numpy arrays. The first index for all fields in the dataset is the data point index, equivalent to the cell index. The remaining indices in the array depends on the nature of the field. For example, all tensors are given with shape (N, 3, 3), where N is the data point index. The ten basis tensors used in a general representation of the anisotropy tensor proposed by Pope4 are given as an array with shape (N, 10, 3, 3). Relatively few pre-processing steps have been performed on the dataset—no normalization or outlier elimination has been performed. The only deletions arise from a small subset (less than 50 points) of non-realizable LES label values, and any points requiring extrapolation of the reference data. Therefore, it is recommended that after a specific input feature set is formed using the provided fields, the input features should be standardized as is typical in machine learning. The RANS results also contain some outliers that may need to be dropped. For example, Kaandorp8 dropped datapoints outside of μ ± 5σ, where μ is the mean, and σ is the standard deviation.