A curated dataset for data-driven turbulence modelling

McConkey, Ryley; Yee, Eugene; Lien, Fue-Sang

doi:10.1038/s41597-021-01034-2


Data Descriptor
Open access
Published: 30 September 2021


Scientific Data volume 8, Article number: 255 (2021)



Abstract

The recent surge in machine learning augmented turbulence modelling is a promising approach for addressing the limitations of Reynolds-averaged Navier-Stokes (RANS) models. This work presents the development of the first open-source dataset, curated and structured for immediate use in machine learning augmented corrective turbulence closure modelling. The dataset features a variety of RANS simulations with matching direct numerical simulation (DNS) and large-eddy simulation (LES) data. Four turbulence models are selected to form the initial dataset: k-ε, k-ε-ϕ_t-f, k-ω, and k-ω SST. The dataset consists of 29 cases per turbulence model, for several parametrically sweeping reference DNS/LES cases: periodic hills, square duct, parametric bumps, converging-diverging channel, and a curved backward-facing step. At each of the 895,640 points, various RANS features with DNS/LES labels are available. The feature set includes quantities used in current state-of-the-art models, and additional fields which enable the generation of new feature sets. The dataset reduces effort required to train, test, and benchmark new corrective RANS models. The dataset is available at https://doi.org/10.34740/kaggle/dsv/2637500.

Measurement(s)	velocity fields • pressure fields • turbulence fields • related gradients
Technology Type(s)	numerical simulation
Factor Type(s)	turbulence model • flow geometry

Machine-accessible metadata file describing the reported data: https://doi.org/10.6084/m9.figshare.15124857


Background & Summary

Numerical simulations in weather forecasting, wind and hydroelectric energy, aerospace vehicle design, automotive design, turbomachinery, nuclear plant design, and many other applications all rely on closure models to accelerate simulations while modelling the complex physical phenomenon of turbulence. While higher resolution techniques such as large-eddy simulation (LES) and direct numerical simulation (DNS) are becoming more widespread, the computational demands compared to current capabilities make these techniques unaffordable for many industrial simulations. For this reason, Reynolds-averaged Navier-Stokes (RANS) simulations are expected to remain the dominant tool for predicting flows of practical relevance to engineering and industrial problems over the next few decades¹. However, flows with strong adverse pressure gradients², separation³, streamline curvature⁴, and reacting chemistry are often poorly predicted by RANS approaches. Developing methods to improve the accuracy of RANS simulations will help bridge this critical capability gap between RANS and LES⁵.

Several recent investigations have demonstrated the potential of applying machine learning to the development of corrective turbulence closure models for RANS. Ling et al.⁶ constructed a tensor basis neural network (TBNN), which predicts the anisotropy tensor using five invariant scalars derived from the mean strain and rotation rate tensors. The TBNN turbulence closure model developed by Ling et al.⁶ is effectively a fifth-order eddy viscosity model, with locally varying coefficients predicted via deep learning. The ability to express such a locally-tuned, high-order relationship between the strain rate and anisotropy tensors is a powerful method to improve the accuracy of RANS simulations. Wu et al.⁷ developed a random-forests-based model, which directly predicts the Reynolds stress anisotropy. Kaandorp⁸ and Kaandorp and Dwight⁹ proposed a tensor basis random forest (TBRF) model, which is the random forests analogue to the TBNN proposed by Ling et al.⁶. While the different models by Ling et al.⁶, Wu et al.⁷, Kaandorp and Dwight⁹, Zhu and Dinh¹⁰, Zhang et al.¹¹, Fang et al.¹², and Song et al.¹³ all show promise, the results cannot be directly compared—each investigation used a different set of input features and labels, with different numerical settings chosen for feature generation. For this reason, Duraisamy¹⁴ recently highlighted the need for a benchmark dataset for machine-learnt closure models.

The approach used by Ling et al.⁶, Wu et al.⁷, Kaandorp and Dwight⁹, Zhang et al.¹¹ and others is referred to as corrective or open loop augmented closure modelling. In this open loop framework, the machine learning model is used to generate a one-time mapping between the fields from a converged RANS solution, and fields from DNS. A contrasting approach is the closed loop framework¹⁵,¹⁶, where the training process involves conducting RANS simulations in an iterative manner, to repeatedly update the feature set until the model predictions and turbulence closure coefficients converge. The present work aims to present a dataset useful primarily for corrective augmented closure modelling, where the machine learning model is queried once to predict the Reynolds stress. However, this dataset can also be used as an initial set of fields for closed loop closure modelling, and the provided OpenFOAM case files are convenient for performing iterative simulations in a closed loop framework.

To generate a set of input features, the current requirement is for every investigator to generate a set of RANS simulations that match the DNS/LES reference cases. This requirement has several drawbacks. As the number of included datasets grows, the effort required grows. The development of the ImageNet dataset spurred rapid growth of the computer vision field, which would not have been possible otherwise. From an effort point of view, the availability of a curated dataset dramatically increases the time spent developing the models themselves, rather than setting up many RANS simulations to gather input features.

Another major drawback of the current approach arises from the issue of reproducibility in the field of computational fluid dynamics (CFD). Often, CFD studies are difficult to reproduce, due to a large number of input conditions¹⁷. Each investigation will use different meshes, numerical schemes, turbulence models, and other selections which affect the solution. The field of machine learning has also been plagued with reproducibility challenges, even with the widespread use of benchmark datasets¹⁸. While machine-learnt turbulence models are a promising approach, the development of these models could be significantly impeded by mixing two fields where reproducibility is a challenge. A well-documented, widely available dataset solves at least one aspect of the reproducibility issue, in that all models can at least be trained in the same environment, using the same input features and labels.

Motivated by the lack of a sufficient dataset, the present work aims to develop a set of RANS simulations of highly resolved reference cases in order to generate a curated dataset¹⁹. In this work, the numerical methods for the RANS simulations are presented, along with the selection and calculation of the input features for machine learning models. In doing so, the present work aims to present a large computational dataset, curated and logically structured for immediate use in developing next-generation turbulence closure models for RANS using data-driven machine learning. Table 1 summarizes the inputs and outputs of the present work.

Table 1 Inputs and outputs of the present study.


Methods

Selection of reference cases

An important aspect of dataset selection for data-driven turbulence modelling is sweeping of a parameter space. Deeper insight into the fluid phenomena can be obtained by providing information on how the geometry and/or the Reynolds number changes the flow behaviour. In contrast, single-point measurements are less valuable in approximating a universal mapping between inputs and outputs. The majority of the datasets used here involve sweeping through some parameter space. Table 2 summarizes the cases used in the dataset.

Table 2 Cases in the dataset. Re_L is the Reynolds number based on the characteristic length and velocity scales shown in Figs. 1 to 7.


Computational method

The flow is assumed to be incompressible, viscous, steady, and turbulent for all cases. Under these conditions, the fluid properties are specified by the kinematic molecular viscosity ν. Table 3 summarizes the viscosity used for each case.

Table 3 Kinematic molecular viscosity used for each case.


The open-source library OpenFOAM v2006²⁰ was used to generate the dataset. The ability to replicate CFD is greatly improved by supplying the mesh and settings files¹⁷. The dataset includes the OpenFOAM case files, including the meshes used for all the cases, and the full details of the settings used. Supplying the OpenFOAM files also reduces the effort required for a posteriori testing. This practice is following Xiao et al.²¹, who included the OpenFOAM files with their dataset. While this section highlights the basic numerical settings used, the reader is referred to the dataset for the complete OpenFOAM settings.

Numerical schemes

A standardized set of numerical schemes was used for all cases. The numerical schemes represent commonly used RANS schemes, which represent a good trade-off between stability and accuracy. For discretizing the convective terms in the momentum equations, a second-order upwind scheme was used. For discretizing convective terms in the turbulence transport equations, a first-order upwind scheme was used. For the diffusive terms, a second-order central difference scheme was used. Since all the flow cases are steady, the transient terms were set to zero.

The simpleFoam solver was used to solve the equations iteratively. The semi-implicit method for pressure-linked equations-consistent (SIMPLEC) algorithm was used to accelerate convergence. For some cases, additional non-orthogonality correcting loops were applied to the pressure equation. The generalized geometric algebraic multigrid (GAMG) solver was used for the pressure equation, and the stabilized preconditioned bi-conjugate gradient (PBiCGStab) solver was used for all other equations.

Iterative residual convergence below 10⁻⁶ was generally achieved, with most simulations converging below 10⁻⁸. The residual plots for each simulation are provided along with the dataset. The exceptions to this tight residual convergence criterion are the U_y, U_z, and p fields for the square duct cases. The linear eddy viscosity model is unable to accurately predict the secondary vortices resulting from non-zero U_y and U_z components in the square duct case, and therefore minimal convergence is seen in these residuals as the in-plane velocity fields remain close to the initial condition of zero. The pressure field for the square duct case does not converge below 10⁻⁶ due to the presence of a forcing term which maintains the bulk velocity, resulting in uniform streamwise zero pressure equal to the initial condition of zero.

Turbulence modelling

The two most common families of turbulence closure models, k-ε and k-ω, include many sub-models. Previous investigations on machine-learnt models for predicting the anisotropy tensor have augmented the standard k-ε model⁶, the Launder-Sharma low Reynolds number k-ε model⁸, and the k-ω model⁷,⁹. Four representative turbulence models were selected for the dataset: namely, the standard k-ε²², k-ε-ϕ_t-f²³, k-ω², and the k-ω shear stress transport (SST)²⁴ turbulence closure models. In this work, ϕ_t is used to denote the anisotropy measure $\overline{v^{\prime 2}}/k$ to align with the variable naming in OpenFOAM. Here, $\overline{v^{\prime 2}}$ denotes the wall-normal Reynolds stress. The default coefficients were used for all turbulence models²⁰.

The k-ε-ϕ_t-f model is a more sophisticated model than the k-ε and k-ω models, through the inclusion of an additional transport equation for the anisotropy measure $\phi_{t}\equiv \overline{v^{\prime 2}}/k$, and an elliptic equation for f. Here, f is a scalar which predicts TKE redistribution from the streamwise to the wall-normal Reynolds stress. This model is an improved version of the original $\overline{v^{\prime 2}}$-f model proposed by Durbin²⁵, and the improved “code-friendly” version developed by Lien and Kalitzin²⁶. The additional quantities enable the creation of new input features not available in the previous two-equation investigations. Both additional scalars satisfy all desired invariance properties, including Galilean invariance.

For all turbulence models, the mesh was sufficient for a low Reynolds number wall treatment. Low Reynolds number wall boundary conditions are provided for k, ε, and ω in OpenFOAM²⁷. A fixed-value k = 0 boundary condition was applied at no-slip walls. At no-slip walls, the following low Reynolds number fixed value boundary condition was applied for ε:

$$\varepsilon = \varepsilon_{vis} = 2wk\frac{\nu}{y^{2}}, \tag{1}$$

where w are the cell corner weights²⁰. For ω, the following fixed-value boundary condition was applied at no-slip walls:

$$\omega = \frac{6\nu}{\beta_{1}\,y^{2}}, \tag{2}$$

where β₁ = 0.075.
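As an illustration, Eqs. (1) and (2) can be evaluated directly. The sketch below is a minimal transcription of the two formulas, not the OpenFOAM implementation. It assumes that y is the wall distance of the near-wall cell and that k is the TKE in that cell; the numerical values and the default weight w = 1 are hypothetical.

```python
def epsilon_wall(k, nu, y, w=1.0):
    """Eq. (1): epsilon = 2 * w * k * nu / y**2.

    k  : TKE in the near-wall cell (assumed interpretation)
    nu : kinematic molecular viscosity
    y  : wall distance of the near-wall cell (assumed interpretation)
    w  : cell corner weight (hypothetical default of 1)
    """
    return 2.0 * w * k * nu / y**2


def omega_wall(nu, y, beta1=0.075):
    """Eq. (2): omega = 6 * nu / (beta1 * y**2)."""
    return 6.0 * nu / (beta1 * y**2)


if __name__ == "__main__":
    nu = 1.0e-5      # hypothetical viscosity, m^2/s
    y = 1.0e-4       # hypothetical wall distance, m
    k_cell = 1.0e-3  # hypothetical near-wall TKE, m^2/s^2
    print("epsilon at wall:", epsilon_wall(k_cell, nu, y))
    print("omega at wall:  ", omega_wall(nu, y))
```

Both expressions scale as 1/y², so the imposed wall values grow rapidly as the near-wall mesh is refined.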

Domain and boundary conditions

The domain and boundary conditions for all cases were selected to match the DNS or LES reference simulations. There are two main types of boundary conditions used in the dataset: fixed-free, and streamwise cyclic. While the periodic hills and square duct cases utilize a streamwise cyclic boundary condition, the bump, converging-diverging channel, and curved backward-facing step cases employ a fully-developed inlet velocity profile, and a zero-gradient outlet. The simulations here involve four different turbulence models, each with different fields. The units used for each variable are given in Table 4.

Table 4 Units for each variable requiring boundary conditions.


Mesh

OpenFOAM’s utilities were used to generate the meshes. The mesh generation method varied from case to case, as some cases have changing geometries. Table 5 summarizes the meshes used. All meshes met the low Reynolds number wall treatment criterion of ${y}^{+}\approx 1$ or below. Here, ${y}^{+}\equiv {u}_{\tau }{y}_{w}/\nu $ is the normalized wall-normal distance, where y_w is the wall-normal distance, and u_τ is the wall friction velocity. In all cases, the mesh was either hexahedral or hexahedral-dominant. A high-quality mesh is important for generating input features for machine learning, in that some terms are sensitive to the mesh quality. For example, the basis tensor ${\widehat{{\mathscr{T}}}}_{10}$ in a general representation of the Reynolds stress tensor proposed by Pope⁴ is fifth order in terms of the velocity gradient tensor. In developing the feature set here, we found that to keep these terms stable, the number of tetrahedral cells in the domain must be minimized. However, many industrial meshes contain tetrahedral cells, and are of poorer quality than the structured meshes generated here. While CFD results are normally sensitive to the mesh used, machine learning models are especially sensitive to the mesh quality. Poorer meshes result in increased noise and more outliers in the input feature set.
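The y⁺ criterion above can be checked with a few lines of code. The following is a minimal sketch of the definition y⁺ ≡ u_τ y_w/ν, together with the standard relation u_τ = (τ_w/ρ)^{1/2}, which is not stated in the text and is added here as an assumption. All numerical values are hypothetical.

```python
import math


def friction_velocity(tau_w, rho):
    """Wall friction velocity u_tau = sqrt(tau_w / rho) (standard definition)."""
    return math.sqrt(tau_w / rho)


def y_plus(u_tau, y_w, nu):
    """Normalized wall-normal distance y+ = u_tau * y_w / nu."""
    return u_tau * y_w / nu


def max_wall_distance(u_tau, nu, target_y_plus=1.0):
    """Largest wall distance that still satisfies y+ <= target_y_plus."""
    return target_y_plus * nu / u_tau


if __name__ == "__main__":
    u_tau = 0.05   # hypothetical friction velocity, m/s
    nu = 1.0e-5    # hypothetical kinematic viscosity, m^2/s
    y_w = 2.0e-4   # hypothetical first-cell wall distance, m
    print("y+ =", y_plus(u_tau, y_w, nu))
    print("wall distance for y+ = 1:", max_wall_distance(u_tau, nu))
```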

Table 5 Meshes used for discretizing the domain.


Periodic hills

Flow over periodic hills with cyclic boundary conditions is a common benchmark problem for turbulence modelling. The periodic hills case features separation, an important phenomenon for RANS models to accurately capture due to the prominence of strongly separated flows in many industrial settings. To provide a parameterized dataset for data-driven turbulence modelling, Xiao et al.²¹ performed DNS of flow over a series of periodic hills. This dataset consists of five cases, characterized by the steepness ratio α. The values of α selected are α = 0.5, 0.8, 1.0, 1.2, and 1.5, which results in a range of separated flows. The geometry for the five periodic hills cases is shown in Fig. 1. The Reynolds number based on bulk velocity and crest height for all cases is fixed at Re = 5,600.

The periodic hills case is a two-dimensional (2D) flow, with the domain geometry characterized in terms of the hill height H, as shown in Fig. 1. The domain height is fixed at 3.04H, and the domain width changes from 7.07H to 10.9H, as the parameter α changes. The boundary conditions for the periodic hills case are streamwise cyclic for all flow variables. Both the top and bottom boundaries are treated as no-slip walls. To maintain a constant bulk velocity in the flow, a mean pressure gradient source term is added to the momentum equation. Therefore, the pressure field for cases with cyclic boundary conditions should be interpreted as the deviation from the mean pressure field.

The mesh for the steepest periodic hills case (α = 0.5) is shown in Fig. 2. The RANS meshes for all periodic hills cases were provided by Xiao et al.²¹. The periodic hills mesh is a structured mesh, with cells concentrated near the boundary layer. While the geometry changes by varying the hill steepness and domain length, the number of cells for all cases is the same.

Square duct

The DNS dataset for flow in a square duct by Pinelli et al.²⁸ has been widely used in data-driven turbulence modelling. This dataset consists of 16 cases, all with the same fixed geometry shown in Fig. 3. The Reynolds number based on the duct half-width varies between 1,100 and 3,500. The flow in a square duct is a challenging test case for eddy viscosity models. Linear eddy viscosity models are unable to predict the secondary corner vortices which form in the duct. These structures are Prandtl’s secondary motion of the second kind²⁸. The dataset contains the mean velocities and Reynolds stresses in a cross-section of the duct. The inclusion of this dataset allows the machine-learnt model to incorporate the Reynolds number dependence of these challenging secondary motions, from the transitional to the fully turbulent regimes. Additionally, it is the only three-dimensional (3D) flow in the dataset, for which the Reynolds shear stresses $\overline{u{\prime} w{\prime} }$ and $\overline{v{\prime} w{\prime} }$ are nonzero.

The geometry for the square duct is shown in Fig. 3. The dimensions for this 3D case are given in terms of the duct half-width H. The duct is a 2H × 2H × 5H box. Wall boundary conditions were applied for the top, bottom and sides of the duct. The streamwise cyclic boundary conditions for the square duct case are summarized in Table 6. A mean pressure gradient source term was added to the momentum equation, to maintain a constant bulk velocity.

Table 6 Boundary conditions for the periodic hills and square duct cases.


The mesh for the square duct case is shown in Fig. 3. This mesh is also structured. Cells are concentrated near the boundary layer. The mesh for all square duct cases is identical. The ${y}^{+}\le 1$ criterion was verified for the highest Reynolds number flow case. The mesh is 3D, with the dataset for machine learning being generated using a cross-section of the mesh.

Parametric bumps

The LES dataset for flow over a family of bumps by Matai and Durbin²⁹ has been recently made available for data-driven closures. The bump is a circular arc, with convex fillets on either end. The dataset is characterized by the bump height h, which is the highest point of the circular arc as shown in Fig. 4. The Reynolds number based on momentum thickness and inlet free stream velocity U_∞ is fixed at Re_θ = 2,500, while the Reynolds number based on bump height and U_∞ varies from Re_h≈13,250 to 27,850. At h = 20 mm, the flow remains attached along the bump, while increasing the height further results in slight separation at h = 26 mm. For the highest bump corresponding to h = 42 mm, a small separated region forms behind the bump. While the periodic hills dataset features massively separated flows, the bump cases incorporate a smaller degree of separation. Matai and Durbin found that the mild separation causes a high turbulent kinetic energy (TKE) zone to depart from the bump ahead of the separated region, which is not the case for massively separated flows. Matai and Durbin attributed this region to the adverse pressure gradient generating a mean shear profile. Another important effect captured in the parametric bump case is strong disequilibrium. The parametric bump dataset is highly valuable for training machine-learnt closure models due to the high Reynolds number, parametrically sweeping geometry, physics unique to mildly separated flows, and strong disequilibrium.
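As a quick consistency check on the quoted Reynolds numbers, Re_h = U_∞h/ν is linear in h at fixed U_∞ and ν, so the ratio of the two limiting values should match the ratio of the bump heights. The sketch below assumes that the lower limit Re_h ≈ 13,250 corresponds to h = 20 mm and uses U_∞ = 16.77 m/s, the inlet free-stream velocity of the LES reference simulation. The viscosity is back-calculated for illustration only; Table 3 remains the authoritative source.

```python
def reynolds_number(velocity, length, nu):
    """Re = U * L / nu."""
    return velocity * length / nu


U_INF = 16.77   # m/s, LES inlet free-stream velocity quoted in the text
H_LOW = 0.020   # m, bump height assumed to give the lower Re_h limit
H_HIGH = 0.042  # m, highest bump

# Back-calculated (illustrative) viscosity from the lower quoted Re_h.
nu_implied = U_INF * H_LOW / 13250.0

re_h_highest = reynolds_number(U_INF, H_HIGH, nu_implied)

if __name__ == "__main__":
    print("implied nu [m^2/s]:", nu_implied)
    print("Re_h for h = 42 mm:", re_h_highest)
```

Under these assumptions the computed upper value agrees with the quoted 27,850 to within about 0.1%, which is consistent with rounding of the quoted limits.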

For the parametric bumps, the DNS and LES simulations utilized a fully-developed inlet flow generated by a “feeder” simulation. To generate the RANS inlet condition, a similar approach to the DNS and LES was taken: a flat version of the domain was simulated with fixed-free boundary conditions to allow the flow to fully develop before entering the domain of interest. Equations for isotropic turbulence are commonly used to estimate the RANS boundary conditions for fixed turbulence inlets. For the feeder simulations, the following equations were used to estimate turbulence quantities at the inlet:

$$k = \frac{3}{2}{(UI)}^{2}, \tag{3}$$

$$\varepsilon = C_{\mu}^{3/4}\frac{k^{3/2}}{L_{t}}, \tag{4}$$

$$\omega = \frac{\varepsilon}{0.09k}, \tag{5}$$

where I is the turbulent intensity, L_t is the turbulence length scale and C_μ is a turbulence closure coefficient.
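A minimal sketch of Eqs. (3) to (5) is given below. The function name and the example values of U, I, and L_t are hypothetical and are not taken from the dataset. The default C_μ = 0.09 is the standard value, which is also the constant that appears explicitly in Eq. (5).

```python
def inlet_turbulence(U, I, L_t, C_mu=0.09):
    """Estimate inlet k, epsilon, and omega from Eqs. (3)-(5).

    U    : inlet velocity magnitude
    I    : turbulent intensity (fraction, e.g. 0.05 for 5%)
    L_t  : turbulence length scale
    C_mu : turbulence closure coefficient (standard value 0.09)
    """
    k = 1.5 * (U * I) ** 2                   # Eq. (3)
    epsilon = C_mu ** 0.75 * k ** 1.5 / L_t  # Eq. (4)
    omega = epsilon / (0.09 * k)             # Eq. (5)
    return k, epsilon, omega


if __name__ == "__main__":
    # Hypothetical inlet values, for illustration only.
    k, epsilon, omega = inlet_turbulence(U=10.0, I=0.05, L_t=0.01)
    print("k =", k, " epsilon =", epsilon, " omega =", omega)
```

With C_μ = 0.09, substituting Eq. (4) into Eq. (5) gives ω = k^{1/2}/(C_μ^{1/4} L_t), so the ε and ω estimates share the same length scale.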

The parametric bumps case is unique in this dataset in that the top boundary is zero-gradient, compared to the walls used in the other cases. The inlet free-stream velocity U_∞ for the LES reference simulation was 16.77 m/s. To recreate these conditions, the inlet boundary conditions for the flat cases were adjusted to produce U_∞ = 16.77 m/s. It should be noted that this is an approximation of the LES inlet condition, because the four different turbulence models all produce different U_∞. For the dataset, the mean velocity used for all turbulence models was the same (Table 7), so that the boundary conditions are comparable between turbulence models. The boundary conditions for generating a fully-developed inlet profile for the bump case are summarized in Table 7. After generating a fully-developed profile, the U, k, ε, and ω fields were used as fixed-value inlet conditions for the bump cases. The boundary conditions for the bump cases are summarized in Table 8. The domain size for the parametric bump set is fixed at 2C × 0.5C.

Table 7 Boundary conditions for the flat developing flow case, used to generate an inlet profile for the bump cases.


Table 8 Boundary conditions for the bump cases.


The parametric bump mesh is shown in Fig. 5, and the converging-diverging channel mesh is shown in Fig. 6. Both cases use a structured mesh over an obstruction in the flow. Cells are concentrated in the wake region, and the boundary layer. For the parametric bump, the changing geometry was created by adjusting the bump profile in the structured mesh generator, which resulted in the same number of cells for all cases. The mesh shown in Fig. 5 is for the highest bump. For the converging-diverging channel, the mesh density for both Reynolds numbers is identical, with the Re_H = 20,580 case having an extended domain, and therefore more cells.

Converging-diverging channel

Two datasets are available for flow over an identical converging-diverging geometry at Re_H = 12,600 and Re_H = 20,580, shown in Fig. 6. The Reynolds number for this case is based on the maximum inlet velocity and the channel half-height H. The lower Reynolds number dataset comes from the DNS by Laval and Marquillie³⁰, and Marquillie et al.³¹. The higher Reynolds number dataset was generated by Schiavo et al.³² using LES. The bump height is approximately 2H/3. A fully developed internal channel flow enters the domain and impinges on the abrupt upstream side of the bump. The flow accelerates as the channel converges, then decelerates over the gradual downstream side of the bump. At Re_H = 12,600, a thin separation bubble forms along the downstream slope. Along the flat upper wall, the flow remains attached but on the cusp of separation. At Re_H = 20,580, the separation bubble grows. The cases contain valuable information about the Reynolds number effect on separation, reattachment, and development of a turbulent boundary layer under an adverse pressure gradient. The long domain downstream of the bump for Re_H = 20,580 effectively provides an additional set of LES information for developing plane channel flow.

A procedure similar to that used for the bump case was followed to generate inlet conditions for the converging-diverging channel. However, for the converging-diverging channel, the top boundary is a wall. The boundary conditions for the converging-diverging channel case were adjusted to produce a maximum velocity of U_max = 1.0 m/s, to match the reference simulations. The boundary conditions for the flat, developing flow case are shown in Table 9, and the boundary conditions for the cases in the data set are shown in Table 10. The domain size for the Re_H = 12,600 converging-diverging channel is 12.6H × 2H, while for Re_H = 20,580 the domain is enlarged to 25.3H × 2H by extending the outlet length.

Table 9 Boundary conditions for the flat developing flow case, used to generate an inlet profile for the converging-diverging channel cases.


Table 10 Boundary conditions for the converging-diverging channel cases.


Curved backward-facing step

The curved backward-facing step case simulated by Bentaleb et al.³³ using LES was also included in the dataset. The geometry for this case is shown in Fig. 7. While this is the only case that does not feature parametric variation, it contains an additional set of data on separation and reattachment. While other cases in the dataset feature separation after an acceleration of the flow, the curved backward-facing step case features separation of a fully developed turbulent boundary layer. This phenomenon is difficult for RANS models to predict, and therefore the LES results were included in the dataset. The original work by Bentaleb et al.³³ defined a Reynolds number based on the maximum inlet velocity. However, we found that the large channel height meant that the mean velocity for all turbulence models was within 10% of the maximum velocity. Defining the Reynolds number for the dataset based on the mean inlet velocity was therefore sufficient to approximate the reference case. The top and bottom boundaries are walls. An identical procedure to the converging-diverging channel case was used to develop the inlet boundary condition, with U_max = 1.0 m/s. The curved backward-facing step domain is 22.7H × 9.48H. The boundary conditions for the flat, developing flow case are shown in Table 11, and the boundary conditions for the cases in the data set are shown in Table 12.

Table 11 Boundary conditions for the flat developing flow case, used to generate an inlet profile for the curved backward-facing step cases.


Table 12 Boundary conditions for the curved backward-facing step case.


The only unstructured mesh in the dataset is the curved backward-facing step, shown in Fig. 7. While it was feasible to generate a structured mesh for this case, an unstructured mesh was generated to include some more typical industrial cells into the dataset. Specifically, near the backward-facing step, the mesh transitions out of the inflation layer using some triangular cells.

Data Records

The dataset¹⁹ is hosted on Kaggle, a common platform for machine learning. A total of 29 simulations (Table 2) per turbulence model were completed to match the reference data. The DNS or LES reference data were interpolated onto the RANS grid, using linear interpolation. Any points which required extrapolation of the reference data were dropped, and the interpolated reference data were checked for realizability using the criteria from Banerjee et al.³⁴. After interpolation and data quality checks, 895,640 points of RANS data paired with corresponding DNS or LES data are available for each turbulence model. Each data point represents a cell of a RANS simulation in the dataset, with the corresponding DNS/LES quantity interpolated onto the RANS cell. At each data point, the full set of base, derived, and label fields described in the following sections is provided. A separate file is provided for each field, for each RANS simulation. An explanation of the file naming convention is provided with the dataset.
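The interpolate-then-drop step described above can be illustrated with a simplified one-dimensional sketch. This is not the procedure used to build the dataset, which operates on the 2D and 3D RANS grids. The function name and sample values are hypothetical, and NumPy's `interp` stands in for whichever linear interpolation routine was actually used.

```python
import numpy as np


def interpolate_reference_1d(x_ref, q_ref, x_rans):
    """Linearly interpolate reference data onto RANS points (1D sketch).

    RANS points outside the range of the reference data would require
    extrapolation, so they are marked NaN and then dropped.
    Returns the retained RANS coordinates and the interpolated values.
    """
    x_ref = np.asarray(x_ref, dtype=float)
    q_ref = np.asarray(q_ref, dtype=float)
    x_rans = np.asarray(x_rans, dtype=float)

    order = np.argsort(x_ref)  # np.interp needs increasing abscissae
    q_interp = np.interp(x_rans, x_ref[order], q_ref[order],
                         left=np.nan, right=np.nan)

    keep = ~np.isnan(q_interp)  # drop points needing extrapolation
    return x_rans[keep], q_interp[keep]


if __name__ == "__main__":
    x_kept, q_kept = interpolate_reference_1d(
        x_ref=[0.0, 1.0, 2.0],
        q_ref=[0.0, 10.0, 20.0],
        x_rans=[-0.5, 0.5, 1.5, 2.5],
    )
    print(x_kept, q_kept)
```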

To maximize the usefulness of the dataset, a comprehensive set of input features and labels was generated. The dataset is organized into two types of data: base variables, and derived quantities provided for convenience. The base variables contain the bare minimum fields that need to be provided to construct the rest of the fields, which are the RANS fields and grid points. The available base fields in the dataset are summarized in Table 13, and the derived fields are summarized in Tables 14 and 15.

Table 13 Base fields available in the dataset.


Table 14 Derived feature fields available in the dataset.


Table 15 Derived label fields available in the dataset.


The more useful portion of this dataset is the set of pre-constructed machine learning input features. The selection of input features is a critical area of ongoing research in machine-learnt turbulence models. The typical practice in machine learning Reynolds stress modelling is to derive a set of invariants from a tensor basis, combined with other invariant scalars. This was the approach used in refs. 6, 7, 8, 9, and 21, among others. While the input feature set varies, an effort has been made to provide sufficient fields in the dataset to conveniently reproduce past feature sets, and develop new ones. For example, all of the input features and labels used by Ling et al.⁶ are directly provided: the five invariants of the mean strain and rotation rate tensor, the ten basis tensors described in Pope⁴, and the anisotropy tensor labels.

Labels

This dataset is suited for models that predict the Reynolds stress tensor, an equivalent problem to predicting the anisotropy tensor. The provided label set includes the individual Reynolds stress components (the base labels), and other fields that are sometimes more convenient to use. The Reynolds stress tensor, TKE, and anisotropy tensor are provided as ready-to-use labels.
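The relationship between these labels can be sketched as follows, using the standard definitions k = ½ tr(τ) and b = τ/(2k) − I/3. These definitions are assumptions here, since the text does not state them, and the dataset's own fields should be preferred where available. The realizability function is only a basic necessary check (a positive semi-definite Reynolds stress) and is not claimed to be the full set of criteria from Banerjee et al.³⁴. The sample tensor is hypothetical.

```python
import numpy as np


def tke(tau):
    """Turbulent kinetic energy k = 0.5 * trace(tau)."""
    return 0.5 * np.trace(tau)


def anisotropy(tau):
    """Anisotropy tensor b = tau / (2k) - I / 3 (standard definition)."""
    return tau / (2.0 * tke(tau)) - np.eye(3) / 3.0


def is_positive_semidefinite(tau, tol=1e-12):
    """Basic realizability check: all eigenvalues of tau are non-negative."""
    eigenvalues = np.linalg.eigvalsh(tau)  # tau is symmetric
    return bool(np.all(eigenvalues >= -tol))


if __name__ == "__main__":
    # Hypothetical Reynolds stress tensor, for illustration only.
    tau = np.array([[0.020, -0.005, 0.000],
                    [-0.005, 0.010, 0.000],
                    [0.000, 0.000, 0.015]])
    b = anisotropy(tau)
    print("k =", tke(tau))
    print("trace(b) =", np.trace(b))  # zero up to round-off
    print("realizable (basic check):", is_positive_semidefinite(tau))
```

Because b is trace-free by construction, predicting b together with k is equivalent to predicting the full Reynolds stress tensor.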

Invariants of tensor bases

The invariants are derived from a set of basis tensors, which form a basis for the space spanned by a set of feature tensors. First, the feature tensors need to be selected. The selection of the feature tensors determines what flow variable gradients are incorporated into the model. Previous investigations have selected the set of feature tensors as $\left\{\widehat{S},\widehat{R}\right\}$⁶, $\left\{\widehat{S},\widehat{R},\nabla k\right\}$⁸,⁹, and $\left\{\widehat{S},\widehat{R},\nabla k,\nabla p\right\}$⁷. If the feature tensors were directly employed as input features, the model would not be invariant because these inputs change with the coordinate system. Therefore, the procedure presented by Spencer and Rivlin³⁵ is commonly employed to generate a tensor basis for the feature set. After constructing the tensor basis, the invariants of the tensor basis are taken; in other words, the traces of the basis tensors are used as input features. This procedure guarantees that the model has the same invariance properties as the trace of the basis tensors.

The dataset includes several quantities which are convenient in generating tensor bases. Along with the velocity gradient tensor ∇U, the strain rate and rotation rate tensors S, R are provided. While the strain and rotation rate tensors are provided without normalization, a set of pre-normalized strain and rotation rate tensors $\widehat{S},\widehat{R}$ is also provided, with the normalizations shown in Table 14. A similar set of features is provided for the kinematic pressure and TKE gradients: the gradients themselves (vector quantities) and the associated antisymmetric tensors, each in both un-normalized and normalized form.
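
As a rough illustration of how these fields relate to one another, the sketch below splits a velocity gradient into its symmetric and antisymmetric parts and builds an antisymmetric tensor from a gradient vector. The function names are hypothetical, and the sign conventions (the index ordering of ∇U and the sign of the antisymmetric tensor) are assumptions that should be checked against the provided arrays. The normalizations of Table 14 are not reproduced here.

```python
import numpy as np


def strain_and_rotation(grad_u):
    """Split a velocity gradient of shape (N, 3, 3) into S and R.

    S is the symmetric part and R the antisymmetric part. The sign of R
    depends on the index ordering of grad_u, which is assumed here.
    """
    grad_u_T = np.swapaxes(grad_u, -1, -2)
    S = 0.5 * (grad_u + grad_u_T)
    R = 0.5 * (grad_u - grad_u_T)
    return S, R


def antisymmetric_from_vector(v):
    """Map vectors of shape (N, 3) to antisymmetric tensors of shape (N, 3, 3).

    Uses A_ij = -eps_ijk v_k; the sign is an assumed convention.
    """
    eps = np.zeros((3, 3, 3))
    eps[0, 1, 2] = eps[1, 2, 0] = eps[2, 0, 1] = 1.0
    eps[0, 2, 1] = eps[2, 1, 0] = eps[1, 0, 2] = -1.0
    return -np.einsum("ijk,nk->nij", eps, v)
```

Turning the gradient vectors into antisymmetric tensors lets them enter the tensor basis on the same footing as the rotation rate tensor.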

The provided dataset is sufficient to form the most comprehensive tensor basis used to date, the 47-tensor basis used by Wu et al.⁷. However, it is the traces of these 47 tensors which are of interest. These 47 invariant traces are included in the dataset to be directly used as input features to a machine learning model. Also included is the set of 5 invariants (λ_i), which arise from using the strain and rotation rate as the feature tensors, as in Ling et al.⁶.
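
For reference, the five invariants can be recomputed from the strain and rotation rate tensors in a few lines. The sketch assumes the commonly used definitions λ₁ = tr(S²), λ₂ = tr(R²), λ₃ = tr(S³), λ₄ = tr(R²S), λ₅ = tr(R²S²); the function name is hypothetical and the ordering in the dataset files should be verified.

```python
import numpy as np


def strain_rotation_invariants(S, R):
    """Return the five invariants of S and R with shape (N, 5).

    S, R: (assumed normalized) strain and rotation rate tensors, shape (N, 3, 3).
    """
    def tr(A):
        return np.trace(A, axis1=-2, axis2=-1)

    S2, R2 = S @ S, R @ R
    return np.stack(
        [tr(S2), tr(R2), tr(S2 @ S), tr(R2 @ S), tr(R2 @ S2)], axis=-1
    )
```

For simple shear with a unit velocity gradient, the only non-zero invariants are λ₁ = 0.5, λ₂ = −0.5 and λ₅ = −0.125, which provides a quick hand check.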

Other input scalars

After gathering the set of tensor basis invariants, an additional set of scalars is added. Care must be taken that these scalars are themselves invariant, so that they do not compromise the invariance of the constructed tensor basis invariants. While many scalars have been proposed, many of them are not Galilean invariant, which is a property desired in machine-learnt turbulence models. Therefore, four Galilean invariant scalars used by Kaandorp and Dwight⁹ are included as ready-to-use features in the dataset. While this set of input scalars is not comprehensive, the dataset includes sufficient fields to conveniently generate other scalar quantities.

Technical Validation

The RANS results are sensitive to the mesh used. While the mesh must be compatible with the selected wall treatment, it must also be sufficiently fine to reduce discretization errors. To demonstrate that the selected meshes do not affect the result, a mesh independence study was completed for each of the five flow cases. The most demanding case was selected for each flow type: the steepest periodic hills case, the highest Reynolds number square duct, the highest bump, the highest Reynolds number converging-diverging channel, and the curved backward-facing step. Mesh independence was demonstrated using the k-ε turbulence model. The mesh study was conducted by examining the change in the velocity fields between varying mesh sizes.
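
One way to put a number on the change between two mesh levels is to interpolate a coarse-mesh profile onto the fine-mesh sample points and compare. The sketch below is an illustration only: the metric, the function name, and the synthetic profiles are assumptions, and the study described here does not state that such a metric was used.

```python
import numpy as np


def profile_difference(y_coarse, u_coarse, y_fine, u_fine):
    """Maximum difference between two profiles, relative to the fine-mesh peak.

    The coarse profile is linearly interpolated onto the fine sample points.
    y_coarse must be sorted in increasing order for np.interp.
    """
    u_coarse_on_fine = np.interp(y_fine, y_coarse, u_coarse)
    return np.max(np.abs(u_coarse_on_fine - u_fine)) / np.max(np.abs(u_fine))


# Synthetic parabolic profiles sampled on two hypothetical mesh levels
y_c = np.linspace(0.0, 1.0, 11)
y_f = np.linspace(0.0, 1.0, 41)
diff = profile_difference(y_c, 4 * y_c * (1 - y_c), y_f, 4 * y_f * (1 - y_f))
```

A small value at every sampled station suggests that further refinement changes the profile little, which is the qualitative criterion applied to the velocity profiles in this section.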

Supplementary Figs. 1 and 2 show the results of the mesh convergence study for the periodic hills case. The meshes provided by Xiao et al.²¹ were refined two times, each by a factor of 2 in the x and y directions. A small group of cells could not be refined while maintaining reasonable quality, which is why the meshes shown in Supplementary Figs. 1 and 2 do not contain exactly N, 4N, and 16N cells. The results for the periodic hills case demonstrate good mesh convergence for the grid with the smallest number of cells used in the study. There is almost no change in the U velocity for grids with more than N = 14,751 cells. The V profiles near the inlet boundary show changes between the mesh sizes. For this case, the mesh convergence is non-monotonic, but the differences in the V profiles between the meshes are small. Therefore, the N = 14,751 mesh is sufficiently converged.

One of the main considerations for the square duct mesh is sufficient resolution in the y–z plane to extract machine learning features. The reference data by Pinelli et al.²⁸ are provided as a set of statistics in the y–z plane. Even though Supplementary Fig. 3 shows that the solution is mesh-converged at N = 87,552, the resolution in the y–z plane is too coarse. The N = 87,552 mesh results in 2,304 dataset points per case, while the N = 691,200 mesh results in 9,216 points per case. Therefore, the N = 691,200 mesh is selected for generating the dataset, because the solution is mesh independent, and there are sufficient cells in the y–z plane to generate features for machine learning.

The parametric bump is the highest Reynolds number flow in the dataset (Re_H ≈ 27,850) and, as a consequence, it requires a dense mesh. Solution convergence at the coarsest mesh with N = 72,100 cells was demonstrated by increasing the number of cells in the structured mesh generator by a factor of two, and then four, and comparing the velocity profiles for the corresponding N, 4N, and 16N cases. Supplementary Figs. 4 and 5 show the comparisons made. For the U velocity profile, there are small differences in the wake of the bump, and in the far-field above the bump. The V velocity field reflects these small far-field differences above the bump. However, the differences are comparatively small, and the mesh demonstrates good convergence to generate the dataset.

Mesh convergence for the converging-diverging channel case was demonstrated similarly to the bump case. The number of cells in the structured mesh generator was increased by a factor of two, and then four. Supplementary Figs. 6 and 7 show that there are almost no differences between the solutions as the mesh is refined, even by a factor of 16. Therefore, the mesh for the converging-diverging channel case is sufficiently converged at N = 183,750.

Demonstrating mesh convergence for the curved backward-facing step case was completed similarly to the periodic hills case, by refining the mesh twice in each direction. Some cells could not be refined while maintaining reasonable mesh quality, which is the reason that the meshes in Supplementary Figs. 8 and 9 do not exactly have N, 4N, and 16N cells. The solution has excellent mesh convergence at N = 37,082, in both the U and V velocity fields.

Usage Notes

The dataset structure consists of a folder for each turbulence model, with an additional folder for the DNS/LES labels¹⁹. The RANS features for each case are provided using a consistent naming scheme. This structure allows the data to be accessed and processed in a coherent manner for immediate use in open-source machine learning frameworks such as TensorFlow and PyTorch. An example notebook of how to use the data to develop a simple machine learning model for the Reynolds stress anisotropy tensor is provided on the dataset page. Another example notebook is provided that demonstrates the field formats, using the square duct case as an example. The dataset will be updated as more DNS/LES reference datasets become available, or if there is demand to include additional RANS turbulence models. The curated dataset is most suitable for direct use in corrective (open loop) RANS turbulence modelling using machine learning. While the dataset presented here is not targeted for iterative (closed loop) machine learning-based RANS turbulence modelling (Schmelzer et al.¹⁵, Taghizadeh et al.¹⁶), it nevertheless can be used to provide the initial set of fields as well as to facilitate the implementation of the iterative approach for a particular RANS closure model (at least for the four turbulence closure models included in the dataset).

The dataset includes ready-to-use quantities, OpenFOAM files, and residual plots for all simulations. The ready-to-use input features are provided in the folders named by each turbulence model. The ready-to-use labels are provided in the labels folder. The openfoam folder provides the base quantities in OpenFOAM format, which is convenient for testing the corrective model. The residuals folder contains residual plots for all simulations.

There are approximately 1,000 fields per turbulence model, provided as numpy arrays. The first index for all fields in the dataset is the data point index, equivalent to the cell index. The remaining indices in the array depend on the nature of the field. For example, all tensors are given with shape (N, 3, 3), where N is the number of data points. The ten basis tensors used in a general representation of the anisotropy tensor proposed by Pope⁴ are given as an array with shape (N, 10, 3, 3). Relatively few pre-processing steps have been performed on the dataset: no normalization or outlier elimination has been applied. The only deletions arise from a small subset (less than 50 points) of non-realizable LES label values, and any points requiring extrapolation of the reference data. Therefore, it is recommended that, after a specific input feature set is formed using the provided fields, the input features be standardized, as is typical in machine learning. The RANS results also contain some outliers that may need to be dropped. For example, Kaandorp⁸ dropped datapoints outside of μ ± 5σ, where μ is the mean, and σ is the standard deviation.
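
A minimal sketch of these two pre-processing steps is given below for a feature matrix of shape (N, n_features). The function names are hypothetical, and applying the μ ± 5σ test independently to each feature column is an assumption about how the criterion is used.

```python
import numpy as np


def drop_outliers(features, n_sigma=5.0):
    """Keep rows whose features all lie within mu +/- n_sigma * sigma.

    Returns the filtered features and the boolean mask, so that the same
    mask can be applied to the labels.
    """
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    keep = np.all(np.abs(features - mu) <= n_sigma * sigma, axis=1)
    return features[keep], keep


def standardize(features):
    """Scale each feature column to zero mean and unit standard deviation."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    sigma = np.where(sigma > 0, sigma, 1.0)  # leave constant columns unscaled
    return (features - mu) / sigma, mu, sigma
```

In practice the mean and standard deviation would normally be computed on the training split only and then reused for the validation and test splits, so that no information leaks from the held-out cases.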

Code availability

Both the code used for generating this dataset and input files for the OpenFOAM simulations are available on the Kaggle page for this dataset¹⁹. The software used was OpenFOAM v2006, with all scripts written in Python 3.

References

Slotnick, J., Khodadoust, A., Alonso, J. & Darmofal, D. CFD vision 2030 study: A path to revolutionary computational aerosciences. Tech. Rep. March (2014).
Wilcox, D. C. Turbulence Modelling for CFD (DCW Industries, Inc., 1994), 2 edn.
Catalano, P. & Amato, M. An evaluation of RANS turbulence modelling for aerodynamic applications. Aerospace Science and Technology 7, 493–509, https://doi.org/10.1016/S1270-9638(03)00061-0 (2003).
Pope, S. B. A more general effective-viscosity hypothesis. Journal of Fluid Mechanics 72, 331, https://doi.org/10.1017/S0022112075003382 (1975).
Witherden, F. D. & Jameson, A. Future directions of computational fluid dynamics. 23rd AIAA Computational Fluid Dynamics Conference 2017, 1–16 (2017).
Ling, J., Kurzawski, A. & Templeton, J. Reynolds averaged turbulence modelling using deep neural networks with embedded invariance. Journal of Fluid Mechanics 807, 155–166, https://doi.org/10.1017/jfm.2016.615 (2016).
Wu, J. L., Xiao, H. & Paterson, E. Physics-informed machine learning approach for augmenting turbulence models: A comprehensive framework. Physical Review Fluids 7, 1–28, https://doi.org/10.1103/PhysRevFluids.3.074602 (2018).
Kaandorp, M. Machine learning for data-driven RANS turbulence modelling (Master’s Thesis, Delft University of Technology, 2018).
Kaandorp, M. L. & Dwight, R. P. Data-driven modelling of the Reynolds stress tensor using random forests with invariance. Computers and Fluids 202, 104497, https://doi.org/10.1016/j.compfluid.2020.104497 (2020).
Zhu, Y. & Dinh, N. A data-driven approach for turbulence modeling. Preprint at https://arxiv.org/abs/2005.00426 (2020).
Zhang, Z. et al. Application of deep learning method to Reynolds stress models of channel flow based on reduced-order modeling of DNS data. Journal of Hydrodynamics 31, 58–65, https://doi.org/10.1007/s42241-018-0156-9 (2019).
Fang, R., Sondak, D., Protopapas, P. & Succi, S. Neural network models for the anisotropic Reynolds stress tensor in turbulent channel flow. Journal of Turbulence 21, 525–543, https://doi.org/10.1080/14685248.2019.1706742 (2020).
Song, X. D., Zhang, Z., Wang, Y. W., Ye, S. R. & Huang, C. G. Reconstruction of RANS model and cross-validation of flow field based on tensor basis neural network. Proceedings of the ASME-JSME-KSME 2019 8th Joint Fluids Engineering Conference 1–6 (2019).
Duraisamy, K. Perspectives on machine learning-augmented Reynolds-averaged and large eddy simulation models of turbulence. Phys. Rev. Fluids 6, 050504, https://doi.org/10.1103/PhysRevFluids.6.050504 (2021).
Schmelzer, M., Dwight, R. & Cinnella, P. Discovery of algebraic Reynolds-stress models using sparse symbolic regression. Flow Turbulence Combust. 104, 579–603, https://doi.org/10.1007/s10494-019-00089-x (2020).
Taghizadeh, S., Witherden, F. D. & Girimaji, S. S. Turbulence closure modeling with data-driven techniques: physical compatibility and consistency considerations. New Journal of Physics 22, 0–32, https://doi.org/10.1088/1367-2630/abadb3 (2020).
Mesnard, O. & Barba, L. A. Reproducible and replicable computational fluid dynamics: It’s harder than you think. Computing in Science and Engineering 19, 44–55, https://doi.org/10.1109/MCSE.2017.3151254 (2017).
Pineau, J. et al. Improving reproducibility in machine learning research (A report from the NeurIPS 2019 reproducibility program). Journal of Machine Learning Research 22, 1-20 (2021).
McConkey, R., Yee, E. & Lien, F. S. Turbulence modelling using machine learning: Curated dataset for modelling the Reynolds stress tensor in RANS. Kaggle https://doi.org/10.34740/kaggle/dsv/2637500 (2021).
OpenCFD Ltd. OpenFOAM: User Guide v2006 (2019).
Xiao, H., Wu, J. L., Laizet, S. & Duan, L. Flows over periodic hills of parameterized geometries: A dataset for data-driven turbulence modeling from direct simulations. Computers and Fluids 200, 104431, https://doi.org/10.1016/j.compfluid.2020.104431 (2020).
Launder, B. & Spalding, D. The numerical computation of turbulent flows. Computer Methods in Applied Mechanics and Engineering 3, 269–289, https://doi.org/10.1016/0045-7825(74)90029-2 (1974).
Laurence, D. R., Uribe, J. C. & Utyuzhnikov, S. V. A robust formulation of the v2-f model. Flow, Turbulence and Combustion 73, 169–185, https://doi.org/10.1007/s10494-005-1974-8 (2005).
Menter, F. R., Kuntz, M. & Langtry, R. Ten years of industrial experience with the SST turbulence model. Turbulence, Heat and Mass Transfer 4, 625–632 (2003).
Durbin, P. A. Near-wall turbulence closure modeling without “damping functions”. Theoretical and Computational Fluid Dynamics 3, 1–13, https://doi.org/10.1007/BF00271513 (1991).
Lien, F. S. & Kalitzin, G. Computations of transonic flow with the v2-f turbulence model. International Journal of Heat and Fluid Flow 22, 53–61, https://doi.org/10.1016/S0142-727X(00)00073-4 (2001).
Liu, F. A thorough description of how wall functions are implemented in OpenFOAM. Proceedings of CFD with OpenSource Software 1–33 (2016).
Pinelli, A., Uhlmann, M., Sekimoto, A. & Kawahara, G. Reynolds number dependence of mean flow structure in square duct turbulence. Journal of Fluid Mechanics 644, 107–122, https://doi.org/10.1017/S0022112009992242 (2010).
Matai, R. & Durbin, P. Large-eddy simulation of turbulent flow over a parametric set of bumps. Journal of Fluid Mechanics 866, 503–525, https://doi.org/10.1017/jfm.2019.80 (2019).
Laval, J. P. & Marquillie, M. Direct numerical simulations of converging-diverging channel flow. In Stanislas, M., Jimenez, J. & Marusic, I. (eds.) ERCOFTAC Series, vol. 14, 203–209, https://doi.org/10.1007/978-90-481-9603-6_21 (Springer Netherlands, Dordrecht, 2011).
Marquillie, M., Laval, J. P. & Dolganov, R. Direct numerical simulation of a separated channel flow with a smooth profile. Journal of Turbulence 9, 1–23, https://doi.org/10.1080/14685240701767332 (2008).
Schiavo, L. A., Jesus, A. B., Azevedo, J. L. & Wolf, W. R. Large eddy simulations of convergent-divergent channel flows at moderate Reynolds numbers. International Journal of Heat and Fluid Flow 56, 137–151, https://doi.org/10.1016/j.ijheatfluidflow.2015.07.006 (2015).
Bentaleb, Y., Lardeau, S. & Leschziner, M. A. Large-eddy simulation of turbulent boundary layer separation from a rounded step. Journal of Turbulence 13, 1–28, https://doi.org/10.1080/14685248.2011.637923 (2012).
Banerjee, S., Krahl, R., Durst, F. & Zenger, C. Presentation of anisotropy properties of turbulence, invariants versus eigenvalue approaches. Journal of Turbulence 8, 1–27, https://doi.org/10.1080/14685240701506896 (2007).
Spencer, A. J. & Rivlin, R. S. Isotropic integrity bases for vectors and second-order tensors - Part I. Archive for Rational Mechanics and Analysis 9, 45–63, https://doi.org/10.1007/BF00253332 (1962).
Rhie, C. M. & Chow, W. L. Numerical study of the turbulent flow past an airfoil with trailing edge separation. AIAA Journal 21, 1525–1532, https://doi.org/10.2514/3.8284 (1983).

Acknowledgements

R.M. is supported by the Ontario Graduate Scholarship program (OGS), and the Natural Sciences and Engineering Research Council of Canada (NSERC). The computational resources for this work were supported by the Tyler Lewis Clean Energy Research Foundation (TLCERF), and the Shared Hierarchical Academic Research Computing Network (SHARCNET).

Author information

These authors jointly supervised this work: Eugene Yee, Fue-Sang Lien.

Authors and Affiliations

University of Waterloo, Department of Mechanical and Mechatronics Engineering, 200 University Avenue, Waterloo, ON, N2L 3G1, Canada
Ryley McConkey, Eugene Yee & Fue-Sang Lien

Authors

Ryley McConkey
Eugene Yee
Fue-Sang Lien

Contributions

All authors conceived the experiments. R.M. conducted the simulations, and prepared the dataset. All authors reviewed the manuscript.

Corresponding author

Correspondence to Ryley McConkey.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Figures

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver http://creativecommons.org/publicdomain/zero/1.0/ applies to the metadata files associated with this article.

About this article

Cite this article

McConkey, R., Yee, E. & Lien, FS. A curated dataset for data-driven turbulence modelling. Sci Data 8, 255 (2021). https://doi.org/10.1038/s41597-021-01034-2

Received: 25 March 2021
Accepted: 19 August 2021
Published: 30 September 2021
DOI: https://doi.org/10.1038/s41597-021-01034-2
