Main

Understanding the statistical and geometrical properties of particles advected by turbulent flows is a challenging problem of utmost importance for modelling, predicting and controlling many applications such as combustion, industrial mixing, pollutant dispersion, quantum fluids, protoplanetary disk accretion, cloud formation and prey–predator dynamics, to cite just a few1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16. The main difficulties arise from the vast range of time scales involved, spanning from the longest, τL, governed by the stirring mechanism, to the shortest, τη, associated with viscous dissipation, and from the presence of strong non-Gaussian fluctuations (intermittency). Indeed, the ratio τL/τη is proportional to the Taylor Reynolds number, Rλ, a dimensionless measure of the turbulent intensity, varying from a few thousand in laboratory experiments to millions and even larger in atmospheric and astrophysical contexts17. Similarly, non-Gaussian fat tails become more pronounced with increasing Rλ, resulting in rare-but-intense velocity and acceleration fluctuations of up to 50–60 standard deviations that can be easily measured even in table-top laboratory flows at moderate Rλ (Figs. 1a and 2). Owing to the combined influence of long-distance sweeping, multitime fluctuations and small-scale trapping within intense mini-tornadoes, the problem remains insurmountable from both theoretical and modelling perspectives at the present time.

Fig. 1: Comparison between DNSs and DMs.

a, Standardized PDFs of one generic component of the velocity increment, δτVi, at τ/τη = 1, 2, 5, 100 for ground-truth DNS data (black lines), synthetically generated data from DM-1c (blue lines with circles) and data from DM-1c-10% (green lines with squares), a DM-1c model trained on 10% of the DNS data. PDFs for different τ are vertically shifted for the sake of presentation. σ is the standard deviation. b–d, DM-1c trajectories for one generic velocity component with large (b), medium (c) and small (d) time increments, τ/τη = 100, 5, 1, respectively. e, Comparison of 3D trajectories showing small-scale vortex structures for both DNS and DM-3c data, where different curves correspond to the three standardized velocity components i = x, y, z. For the DNS, the highly oscillatory correlations between the three components are consistent with the presence of strong vortical structures. Similarly, in the case of DM-3c, these correlations can be interpreted as reflecting vortical structures within the hypothetical Eulerian flow. f, Examples of 3D trajectories reconstructed from DNS (bottom) and DM-3c (top). Notice in panel a the remarkable generalizability of our data-driven DM, able to explore and capture extreme velocity fluctuations with far larger intensities than observed in the DNS dataset, represented by much more extended tails, while still maintaining the ground-truth statistics inherent in the training data. Here, the statistics for the DM-1c and DM-1c-10% data are derived from 86 and 22 times the number of trajectories in the DNS, respectively.


Fig. 2: Statistics of acceleration.

Standardized PDFs of one generic component of the acceleration, ai, for ground-truth DNS data (black line), synthetically generated data from DM-1c (blue line with circles) and data from DM-1c-10% (green line with squares). Notice the ability of DM-1c to generalize well the statistical trend of rare, intense fluctuations never experienced during the training phase with the DNS data. The statistics of the DM-1c and DM-1c-10% data are based on 86 and 22 times the number of trajectories in the DNS, respectively. Inset: acceleration correlation function.


Over the past 30 years, many different Lagrangian phenomenological models have been proposed, employing various methods such as two-time Ornstein–Uhlenbeck stochastic approaches, to capture the dynamics at the two spectrum extremes, τL and τη (refs. 18,19), as well as multitime infinitely differentiable processes20. Numerous other models have been explored with differing degrees of success, including applications to passive scalar fluctuations21,22,23,24,25. Moreover, both Markovian and non-Markovian modelizations based on multifractal and/or multiplicative models have been employed previously to reproduce certain observed Lagrangian and Eulerian multiscale turbulent features26,27,28,29,30,31; see ref. 32 for a recent attempt to combine multifractal scaling and stochastic partial differential equations. However, although all these previous attempts are able to reproduce some non-trivial features of the turbulent statistics well, we still lack a systematic way to generate synthetic trajectories with the correct multiscale statistics over the full range of dynamics encountered in a real turbulent environment, from the large forcing scales, through the intermittent inertial range, to the coupled regime between inertial and dissipative scales33.

As a result, new approaches are needed to attack the problem. Machine learning (ML) synthetic data-driven models, including variational autoencoders34, generative adversarial networks (GANs)35 and, more recently, diffusion models (DMs)36, have exhibited remarkable success across diverse fields such as computer vision, audio generation, natural language processing, healthcare and various other domains37,38,39,40. Building upon this success, there is a growing interest in applying these techniques to scientific challenges. Specifically, ML methods have shown strong potential to tackle open problems in fluid mechanics41,42. ML tools have been further developed for tasks like generation, super-resolution, prediction and inpainting of dynamical systems43,44, two-dimensional (2D) and three-dimensional (3D) Eulerian turbulent snapshots45,46,47,48,49,50; see ref. 51 for a short summary. In many cases, the validation of these tools when applied to fluid mechanics is primarily limited to simple 2D smooth and quasi-Gaussian turbulent flows or focused on single-point measurements such as mean profiles and two-point spectral properties. There is often a lack of comprehensive quantitative assessments concerning the more intricate multiscale non-Gaussian properties at high Reynolds numbers. Recently, a fully convolutional model has been proposed to generate one-dimensional Eulerian cuts of high-Reynolds-number turbulence52. This model has demonstrated success in capturing up to the fourth-order structure function; however, its generalization to higher-order statistics exhibits less accuracy. Given the state of the art, it is fair to say that we lack both equation-informed and data-driven tools to generate 3D single- or multiparticle Lagrangian trajectories possessing statistical and geometrical properties that quantitatively agree with experiments and direct numerical simulations (DNSs). The demand for the synthetic generation of high-quality and high-quantity data is crucial in various turbulent applications, particularly in the Lagrangian domain, where having even a single trajectory requires the reproduction of the entire Eulerian field over huge spatial domains, which is often a daunting or impossible task for DNSs or extremely laborious for experiments.

Here we present a stochastic data-driven model able to match numerical and experimental data concerning single-particle statistics in homogeneous and isotropic turbulence at high Reynolds numbers. The model is based on a state-of-the-art generative DM36,37,53. We have trained two distinct DMs for our study: DM-1c, which generates a single component of the Lagrangian velocity, and DM-3c, which simultaneously outputs all three correlated components (Methods). Our synthetic generation protocol is able to reproduce the scaling of velocity increments over the full range of available frequencies and for all statistically converged moments up to the eighth order in the original training data. Moreover, the protocol successfully captures acceleration fluctuations of up to 60 standard deviations and even beyond, including the cross-correlations between the three velocity components. We train the model using high-quality data obtained from DNS at Rλ ≈ 310. The results also show excellent agreement with the numerical ground-truth data for the generalized flatness of fourth, sixth and eighth orders, whose intensities, due to the presence of intermittent fluctuations, are found to be an order of magnitude larger than the values expected for Gaussian statistics. Remarkably, our model exhibits strong generalization properties, enabling the synthesis of events with intensities never encountered during the training phase. These extreme fluctuations, resulting from small-scale vortex trapping and sharp u-turn trajectories with unprecedented excursions and rarity, consistently follow the realistic statistics inherent in the training data.

Problem setup

Lagrangian turbulence

The dataset used for training is extracted from a high-resolution DNS of the 3D Navier–Stokes equations (NSE) in a cubic periodic domain with large-scale isotropic forcing. Lagrangian point-like particles have an instantaneous velocity, \({{{\bf{V}}}}(t)=\dot{{{{\bf{X}}}}}(t)\), coinciding with the local instantaneous flow velocity at the particle position, X(t):

$$\dot{{{{\bf{X}}}}}(t)={{{\bf{u}}}}({{{\bf{X}}}}(t),t),$$
(1)

where u solves the NSE; see equation (6) in Methods. To construct a high-quality ground-truth database, we tracked a total number of trajectories, Np = 327,680, each spanning a duration T ≈ 1.3τL ≈ 200τη, with a temporal sampling interval of dts ≈ 0.1τη. Consequently, each trajectory is discretized into a total of K = 2,000 points; see Table 1. Particles are injected randomly in the 3D volume once a statistically stationary evolution is reached for the underlying Eulerian flow, thus ensuring that the Lagrangian statistics are also stationary. The set of multitime observables utilized to benchmark the quality of the single-particle 3D trajectory generation primarily relies on the statistics of Lagrangian velocity increments:

$${\delta }_{\tau }{V}_{i}(t)={V}_{i}(t+\tau )-{V}_{i}(t),$$
(2)

where i = x, y, z indicates any of the three velocity components and τ represents the time increment. The instantaneous particle acceleration is obtained from the limit \({a}_{i}(t)=\mathop{\lim }\nolimits_{\tau \to 0}{\delta }_{\tau }{V}_{i}/\tau\), where we use a time resolution of 0.1τη for both DNS and DM. The probability density functions (PDFs) of δτVi in Fig. 1a and ai in Fig. 2 show strongly non-Gaussian fluctuations. The non-Gaussian tails of the PDFs of δτVi become more pronounced as the time scale τ decreases. It is a well-known empirical fact that Lagrangian velocity increments develop scaling power laws in the inertial range, τη ≪ τ ≪ τL, as measured by the Lagrangian structure functions33,54,55 of order p:

$${S}_{\tau }^{(p)}=\langle {({\delta }_{\tau }{V}_{i})}^{p}\rangle \propto {\tau }^{\xi (p)},$$
(3)

where 〈 〉 indicates an average over all Np trajectories and over time. For both DNS and DM-3c, \({S}_{\tau }^{(p)}\) is calculated by further averaging over all velocity components. Henceforth, we neglect the dependence on the velocity component because of isotropy. Concerning the scaling exponents, ξ(p), there exists a whole spectrum of anomalous corrections, Δ(p), to the mean-field dimensional estimate, p/2, leading to ξ(p) = p/2 + Δ(p). Furthermore, beyond global scaling laws, the statistics of velocity fluctuations can be quantitatively captured scale by scale for each τ by measuring the local scaling exponents, which are obtained from the logarithmic derivatives of \({S}_{\tau }^{(p)}\):

$$\zeta (p,\tau )=\frac{{\mathrm{d}}\,\log {S}_{\tau }^{(p)}}{{\mathrm{d}}\,\log {S}_{\tau }^{(2)}}.$$
(4)
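As an illustration, a minimal numerical sketch of how equations (2) and (3) can be evaluated from an ensemble of discretized trajectories is given below; the array name V and the toy data are assumptions of this sketch and do not correspond to the authors' code or to the DNS database.

```python
# Minimal sketch (not the authors' implementation): estimating the Lagrangian
# structure functions of equation (3) from discretized trajectories. The array
# `V`, of shape (Np, K), holding one velocity component sampled every
# dts ~ 0.1 tau_eta, is a hypothetical stand-in for the DNS or DM output.
import numpy as np

def structure_functions(V, taus, orders=(2, 4, 6)):
    """S_tau^(p) = <(V(t + tau) - V(t))^p>, averaged over particles and time."""
    S = {p: np.empty(len(taus)) for p in orders}
    for j, tau in enumerate(taus):            # tau in units of the sampling interval
        dV = V[:, tau:] - V[:, :-tau]         # Lagrangian velocity increments, equation (2)
        for p in orders:
            S[p][j] = np.mean(dV**p)          # equation (3)
    return S

# Toy usage with placeholder random-walk trajectories (illustration only):
rng = np.random.default_rng(0)
V = np.cumsum(rng.normal(size=(1000, 2000)), axis=1) * 0.01
taus = np.unique(np.logspace(0, 3, 30).astype(int))
S = structure_functions(V, taus)
```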
Table 1 Eulerian and Lagrangian DNS parameters

DMs

DMs have emerged in recent years, outperforming the previous state-of-the-art GANs on image synthesis37. DMs are built upon forward and backward diffusion processes (Fig. 3a and Methods). The forward process is a Markov chain that gradually introduces Gaussian noise into the training data until the original signal is transformed into pure noise. In the opposite direction, the backward process starts from pure Gaussian-noise realizations and learns to progressively denoise the signal, effectively generating the desired data samples, as shown in Fig. 3f. The diffusion processes stem from non-equilibrium statistical physics, leveraging Markov chains to progressively morph one distribution into another56,57. The training of DMs involves the use of a variational inference lower bound to estimate the loss function along a finite, but large, number of diffusion steps. By focusing on these small incremental changes, the loss term becomes tractable, eliminating the need to resort to the less stable adversarial training, a strategy commonly used by GANs, which aims to reproduce the entire data distribution in a single jump from the input noise. Our implementation of DMs adopts the UNet architecture of the cutting-edge DM used in computer vision37. An optimized noise schedule for the diffusion processes has also been developed to enhance both efficiency and performance when constructing the multiscale features of the signal, as presented in Fig. 3b and discussed in more detail in the Methods.

Fig. 3: Illustration of the DM and in-depth examination of its backward generation process.

a, Schematic representation of the DM and associated UNet sketch, complemented by a table of hyperparameters. Here, N denotes the total number of diffusion steps and n denotes the intermediate step. More details on the network architecture can be found in the Methods section and ref. 37. b, Three distinct noise schedules for the DM’s forward and backward processes explored in this study (Methods). Points A–D indicate four different stages during the backward generation process (from \({{{{\mathcal{V}}}}}_{N}\) to \({{{{\mathcal{V}}}}}_{0}\)) along the optimal noise schedule, curve (tanh6-1). At an early step during the backward process, we have very noisy signals, n = 0.52N (D), followed by two intermediate steps at n = 0.27N (C) and n = 0.06N (B) and the final synthetic trajectory obtained for n = 0 (A). c–e, A few statistical properties of the DM-1c signals generated at the four backward steps A–D: PDF of δτVi for τ = τη (c), second-order structure function, \({S}_{\tau }^{(2)}\) (d), fourth-order flatness, \({F}_{\tau }^{(4)}\) (e). f, Illustration of one trajectory generation from D to A, corresponding to b.


Results

PDFs

In Fig. 1a, we show the success of the DM in generating more and more intense (non-Gaussian) velocity fluctuations, δτVi, as τ → 0, in very good statistical agreement with the ground truth. Typical trajectories generated by DM-1c are also shown qualitatively in Fig. 1b–d for different time lags, τ, with local events belonging to both laminar and intense fluctuations. Note the ability of DMs to overcome the additional difficulty of simultaneously generating the three correlated components (DM-3c) required to produce highly complex topological–vortical structures, as shown in Fig. 1e,f. In Fig. 2, we present the PDF of one generic component of the acceleration, ai, from DM-1c, showing very close agreement with the fat-tailed ground-truth DNS distribution up to fluctuations around 60–70 times the standard deviation. To illustrate the convergence and generalizability of the DMs, we include in Figs. 1a and 2 results from the DM-1c model trained on only 10% of the DNS data, denoted as DM-1c-10%. The DM-1c and DM-1c-10% results match closely, demonstrating the training convergence. In Fig. 1a, the alignment of DM-1c-10% with the DNS data further underscores the DM's ability to generate extreme events unseen in the training data, which, importantly, adhere to the realistic statistical properties. Further details and comparisons of other statistical measurements for DM-1c-10% are provided in the Supplementary Information.

Lagrangian structure functions and generalized flatness

In Fig. 4a, we show for both DM-1c and DM-3c the Lagrangian structure functions given by equation (3) for p = 2, 4, 6; and in Fig. 4b, we show the generalized flatness

$${F}_{\tau }^{\;(p)}={S}_{\tau }^{(p)}/{[{S}_{\tau }^{(2)}]}^{p/2}.$$
(5)

Because the odd-order structure functions vanish owing to the symmetry of the PDFs of the velocity increments, we focus only on the even orders. Structure functions and generalized flatness of different orders are superimposed with the ground-truth DNS for comparison. The capacity of both DM-1c and DM-3c to reproduce the ground truth over many time-scale decades is striking, especially for τ ≳ τη. However, below the dissipative scale, as τ → 0, we observe a tendency for the DM-3c model to generate a slightly smoother signal compared to the DNS, consistent with our observations in Fig. 2. The fourth-order mixed flatness, \({F}_{\tau }^{(4,ij)}=\langle {({\delta }_{\tau }{V}_{i})}^{2}{({\delta }_{\tau }{V}_{j})}^{2}\rangle /{[{S}_{\tau }^{(2)}]}^{2}\), calculated by averaging over ij = xy, xz and yz, is shown in Fig. 4c to check the ability of DM-3c to reproduce the correlation among different components of the velocity vector, confirming quantitatively the agreement between DM-3c and DNS shown in Fig. 1e,f. It is worth noting that although the results are very good, there is still room for further refinement in the dissipative range of scales.
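As a companion to the structure-function sketch above, the following hypothetical snippet evaluates the generalized flatness of equation (5) and the fourth-order mixed flatness; the three-component array V3 is again an assumption standing in for DNS or DM-3c output.

```python
# Minimal sketch, assuming a hypothetical array `V3` of shape (Np, 3, K) with the
# three velocity components of each trajectory; evaluates the generalized
# flatness of equation (5) and the fourth-order mixed flatness F_tau^(4,ij).
import numpy as np
from itertools import combinations

def flatness(V3, tau, orders=(4, 6, 8)):
    dV = V3[:, :, tau:] - V3[:, :, :-tau]          # increments of all three components
    S2 = np.mean(dV**2)                            # component-averaged S_tau^(2)
    F = {p: np.mean(dV**p) / S2**(p / 2) for p in orders}   # equation (5)
    # mixed flatness averaged over the pairs ij = xy, xz, yz
    mixed = [np.mean(dV[:, i]**2 * dV[:, j]**2) for i, j in combinations(range(3), 2)]
    F['mixed'] = np.mean(mixed) / S2**2
    return F
```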

Fig. 4: Multiscale statistical properties of velocity increments.

a, log–log plot of Lagrangian structure functions, \({S}_{\tau }^{(p)}\), for p = 2, 4 and 6, compared across DNS, DM-1c and DM-3c. b, log–log plot of the generalized flatness, \({F}_{\tau }^{(p)}\), for p = 4, 6 and 8, compared across DNS, DM-1c and DM-3c. c, log–log plot of fourth-order mixed flatness, \({F}_{\tau }^{(4,ij)}\), averaged over combinations of ij = xy, xz and yz for both DNS and DM-3c. The error bars represent the minimum and maximum values obtained for each measure by dividing the entire dataset used to compute the statistics into ten different independent batches of smaller size. Error bars may appear smaller than the data points.


Acceleration correlation function

In the inset of Fig. 2, we also present the synthetic single-component acceleration correlation function, Cτ = 〈ai(t + τ)ai(t)〉, where i = x, y, z. The result demonstrates a strong alignment with the DNS. This multiscale Lagrangian correlation function has been the subject of intense study and modelling in the past, owing to the whole set of hierarchical time scales affecting its properties58,59,60,61.
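One possible way to estimate this observable from sampled trajectories is sketched below; the finite-difference acceleration and the array layout are illustrative assumptions, not the diagnostic code used in the paper.

```python
# Minimal sketch (illustration only): acceleration from a finite-difference
# estimate of a_i = lim_{tau -> 0} delta_tau V_i / tau at the sampling resolution
# dts ~ 0.1 tau_eta, followed by its normalized correlation function C_tau.
import numpy as np

def acceleration_correlation(V, dts, max_lag):
    a = np.gradient(V, dts, axis=1)            # a_i(t) approximated at resolution dts
    a = a - a.mean()
    K = a.shape[1]
    C = np.array([np.mean(a[:, lag:] * a[:, :K - lag]) for lag in range(max_lag)])
    return C / C[0]                            # normalized so that C_0 = 1
```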

Local scaling exponents

Let us now introduce what is perhaps the most stringent and quantitative multiscale test for turbulence studies: the comparison of local scaling properties provided by the scale-by-scale exponent defined in equation (4). In practice, we compute ζ(p, τ) by first computing \(\text{d}\log {S}_{\tau }^{(p)}/\text{d}\log \tau\) and \(\text{d}\log {S}_{\tau }^{(2)}/\text{d}\log \tau\) on a grid with τ intervals of 1 (from 1 to 1,024) using second-order accurate central differences and then performing the division. It is easy to realize that in the inertial range, where equation (3) is supposed to hold, we have ζ(p, τ) = ξ(p)/ξ(2), independently of τ. On the other hand, it is known that most of the ‘turbulent’ deadlocks develop at the interface between viscous and inertial ranges, τ ≈ τη, where the highest level of non-Gaussian fluctuations is observed. Multifractal statistical models are able to fit the whole complexity of the ζ(p, τ) curves in the entire range of time scales33,54,62,63. This is achieved by introducing a multiplicative cascade model in the inertial range, terminated by a fluctuating dissipative time scale, \({\tilde{\tau }}_{\eta }\) (refs. 64,65). Despite numerous attempts, we still lack a proper constructive method for embedding the above phenomenology to generate synthetic, realistic 3D Lagrangian trajectories27,29,32,66. In Fig. 5a, we show the local exponent for p = 4 for DM-1c and DM-3c and for the DNS data used for training. For comparison, in Fig. 5b we show a state-of-the-art collection of experimental and other DNS data published in the past. Similar results are obtained for p = 6 and 8 (not shown). The agreement of results from DMs with experimental and DNS data is remarkable. This is considered a high-quality benchmark, demanding the reproduction of the rate of variation of the local scaling properties over a range of frequencies/time lags spanning more than three decades and a corresponding variation of the structure functions (equation (3)) over four to five decades (Fig. 4). Such substantial variations are distilled into the measurement of O(1) quantities (equation (4)) with an error margin within 5%. There are no other tests that can check the scaling properties with greater precision, because statistical accuracy typically does not allow one to go beyond a simple, and inaccurate, log–log fit of scaling laws over the full range of variation.
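A compact sketch of this local-slope estimate is given below; it assumes precomputed structure-function arrays on the uniform grid of integer τ described above and uses NumPy's second-order accurate gradient as shorthand for the central-difference procedure.

```python
# Minimal sketch of equation (4): logarithmic local slopes from precomputed
# structure functions `Sp` and `S2`, sampled on a uniform grid of integer tau
# (1, 2, ..., 1024 sampling intervals). np.gradient provides second-order
# accurate central differences, as mentioned in the text.
import numpy as np

def local_exponent(Sp, S2):
    tau = np.arange(1, len(S2) + 1, dtype=float)
    dSp = np.gradient(np.log(Sp), np.log(tau))   # d log S^(p) / d log tau
    dS2 = np.gradient(np.log(S2), np.log(tau))   # d log S^(2) / d log tau
    return dSp / dS2                             # zeta(p, tau)
```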

Fig. 5: Scale-by-scale intermittent properties.

a, Comparison between the ground-truth DNS and the two DMs, on the lin-log scale, for the fourth-order logarithmic local slope ζ(4, τ) defined in equation (4). b, The same quantity shown in a from a state-of-the-art collection of DNS84,85,86,87,88 and experimental data3,33,89,90. The dotted horizontal lines represent the non-intermittent dimensional scaling, \({S}_{\tau }^{(4)}\propto {[{S}_{\tau }^{(2)}]}{\,}^{2}\). Statistics and error bars in a are derived as in Fig. 4. This resulted in 30 batches for DNS and DM-3c and ten batches for DM-1c. The error bars in panel b are computed solely over the three different velocity components.


Discussion

We have presented a data-driven model capable of reproducing all recognized statistical properties of single-particle Lagrangian turbulence in homogeneous and isotropic turbulence from the large scales down to the inertial and inertial-viscous scaling range, including the enhanced intermittent properties observed around τη. This achievement is summarized by the PDFs of velocity increments in the inertial range and of acceleration (Figs. 1 and 2) as well as by the structure functions, the flatness among different components and the local scaling exponents shown in Figs. 4 and 5. In Table 2, we further summarize a comparison of single-time two-point correlations of velocity and acceleration, showing excellent agreement of the DM synthetic data with DNS, except for the cross-correlation among different acceleration components, ΣA, where DM-3c gives a smaller value than DNS. This trend is also reflected in the smoother transition observed in the limit τ → 0 for the single- and mixed-component flatness in Fig. 4b,c. Furthermore, it is important to highlight the ability of both DM-1c and DM-3c to break the deadlock of viscous intermittency by reproducing the dip structure in the local scaling exponents, as shown in Fig. 5 in the range τ ≈ τη. Fig. 6 shows how DM generation improves multiscale statistics as training progresses. We also evaluated another prominent generative model, the Wasserstein GAN, for this task. Despite efforts to train and select the best-performing model, its accuracy was satisfactory only at large and intermediate scales and degraded considerably at smaller time scales. Further details can be found in the Supplementary Information.

Fig. 6: DM training protocol.

The training loss function, \(\langle {L}_{n}^{{{{\rm{simple}}}}}\rangle\), against iterations for DM-1c. Here, 〈  〉 represents the average over a batch of training data, each of which has a corresponding random step n with 0 ≤ n ≤ N. The inset presents the fourth-order flatness obtained from DM-1c at different iterations (A: 10 × 10³, B: 30 × 10³, C: 250 × 10³), in comparison with that from DNS data. Statistics and error bars are derived as in Fig. 4.


Table 2 Single-time second-order correlations

Generalizability

Having AI models capable of generating high-quality trajectories can considerably increase the availability of well-validated synthetic data for pretraining physical applications based on Lagrangian single-particle dispersion. Even more surprisingly, our DM shows the ability to generate trajectories with extremely intense events, thus generalizing beyond the information absorbed during the training phase while still preserving realistic statistical properties. This is clearly illustrated by the striking observation of the extended tails of the PDFs measured from the larger dataset generated by the DM compared to those measured from the smaller set of training data, as shown in Figs. 1a and 2. Currently, our DM is not configured to generalize to different flow configurations, such as different boundary conditions, forcing mechanisms or higher Reynolds numbers. Achieving this adaptability may require the use of a conditional diffusion model37,53. By integrating data composed of diverse flows and geometries, such a model could interpolate between different setups and adapt to new conditions, providing a promising avenue for future research.

Explainability

The fundamental physical model learned by the DM to generate the correct set of multitime fluctuations remains elusive. The DM is based on nested non-linear Gaussian denoising, resembling in spirit the multiscale buildup of fluctuations used in the creation of multifractal signals and measures. The progressive enrichment of signal properties along the backward diffusion process is displayed in Fig. 3c–f. In Fig. 3e, we show quantitatively the buildup of non-trivial flatness at different stages of the backward process. Similarly, but more qualitatively, Fig. 3f shows the emerging non-Gaussian and non-trivial properties within a single trajectory, transitioning from a very noisy signal (n = 0.52N) to the final step of the backward process (n = 0). Figure 3c–f illustrates that during the generation process, the model initially generates statistics at larger scales and gradually builds up statistics at smaller scales. Decrypting this multiscale process in terms of precise non-linear mapping could lead to important discoveries in our phenomenological understanding of turbulence. A promising approach to enhance the interpretability of the model is to factorize the data with wavelet decomposition and implement DMs to synthesize the wavelet coefficients, conditioning on the low-frequency ones67.

Impact

Synthetic stochastic generative models offer remarkable advantages. They (1) provide access to open data without copyright or ethical issues connected to real-data usage and (2) enable the production of high-quality and high-quantity datasets, which can be used to train other models that require such data as input. The ultimate goal is to provide synthetic datasets that enable new models for downstream applications to reach enhanced accuracy, replacing the necessity for real-data pretraining with synthetic pretraining. Our study opens the way for addressing many questions for which the use of real Lagrangian trajectories requires an unfeasible computational or experimental effort. These questions include the relative dispersion problem between two or more particles to study Richardson diffusion68,69, shape dynamics70,71, data augmentation of datasets for drifter trajectories in specific oceanic applications72,73, generation and classification of inertial particle trajectories8 and data inpainting48.

Methods

Navier–Stokes simulations for Lagrangian tracers

We solve the 3D NSE:

$$\left\{\begin{array}{l}{\partial }_{t}{{{{{\mathbf{u}}}}}}+{{{{{\mathbf{u}}}}}}\cdot \nabla {{{{{\mathbf{u}}}}}}=-\nabla p+\nu \Delta {{{{{\mathbf{u}}}}}}+{{{\bf{F}}}}\quad \\ \nabla \cdot {{{{{\mathbf{u}}}}}}=0\quad \end{array}\right.,$$
(6)

for an incompressible fluid of viscosity ν17. The flow is driven to a non-equilibrium statistically steady state by a homogeneous and isotropic forcing, F, obtained via a second-order Ornstein–Uhlenbeck process18. For the DNS of the Eulerian field, we used a standard pseudospectral solver, fully dealiased with the two-thirds rule. Details on the simulation can be found in ref. 74. Parameters of the DNS used in this work are given in Table 1. The database of Lagrangian trajectories used in this study is stored every dts = 15dt ≈ 0.1τη (ref. 75). Lagrangian tracers are integrated using a sixth-order B-spline interpolation scheme, to obtain the fluid velocity at the particle position, and a second-order Adams–Bashforth time-marching scheme76.
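To make the time-marching step concrete, here is a minimal sketch of one second-order Adams–Bashforth update for a batch of tracers; the interpolator interp_u (which would be the sixth-order B-spline in the actual DNS) and the periodic box size L are placeholders of our own.

```python
# Minimal sketch of one Adams--Bashforth (AB2) step for equation (1):
#   X_{k+1} = X_k + dt * (1.5 * u(X_k, t_k) - 0.5 * u(X_{k-1}, t_{k-1})).
# `interp_u` is a stand-in callable returning the fluid velocity at the particle
# positions; in the actual DNS this role is played by the sixth-order B-spline
# interpolation of the Eulerian field.
import numpy as np

def advance_tracers(X, u_prev, interp_u, dt, L=2.0 * np.pi):
    u_now = interp_u(X % L)                       # fluid velocity at the Np particle positions
    X_new = X + dt * (1.5 * u_now - 0.5 * u_prev)
    return X_new % L, u_now                       # wrap back into the periodic cubic domain
```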

DMs

The specific implementation of DMs utilized in this work is based on recent research37 that demonstrated extremely good performance of DMs even in comparison with GANs for image synthesis. The network architecture, depicted in Fig. 3, relies on the typical UNet structure77, which is commonly used for image analysis tasks as it is designed to capture both high-level contextual information and precise spatial detail. The UNet consists of two primary components: a contracting and an expanding path. Acting as an encoder, the contracting path progressively reduces the spatial dimension of the input data while increasingly extracting abstract features that contain the global context of the input data. The expanding path acts as a decoder, interpreting the learned features and systematically recovering the spatial resolution to generate the final output (see the later section ‘DM architecture and noise schedule’ and Fig. 3 for more details).
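For orientation only, the following PyTorch fragment sketches this contracting/expanding structure on a 1D signal; it is a deliberately tiny cartoon (two resolutions, plain convolutions, no attention or step embedding, assuming an even trajectory length such as K = 2,000) and not the five-level architecture with ResBlocks and AttentionBlocks summarized in Fig. 3a.

```python
# Tiny 1D UNet cartoon (illustration only, far simpler than the network of ref. 37):
# an encoder that halves the temporal resolution, a bottleneck, and a decoder that
# restores it, with a skip connection between matching resolutions.
import torch
import torch.nn as nn

class TinyUNet1D(nn.Module):
    def __init__(self, channels=1, width=32):
        super().__init__()
        self.enc1 = nn.Sequential(nn.Conv1d(channels, width, 3, padding=1), nn.SiLU())
        self.enc2 = nn.Sequential(nn.Conv1d(width, 2 * width, 3, stride=2, padding=1), nn.SiLU())
        self.mid = nn.Sequential(nn.Conv1d(2 * width, 2 * width, 3, padding=1), nn.SiLU())
        self.up = nn.Sequential(nn.ConvTranspose1d(2 * width, width, 4, stride=2, padding=1), nn.SiLU())
        self.out = nn.Conv1d(2 * width, channels, 3, padding=1)

    def forward(self, x, n):
        # x: (batch, channels, K); n: diffusion step, which the real model injects
        # through an embedding and which is ignored here to keep the sketch minimal.
        h1 = self.enc1(x)                            # encoder, full resolution
        h2 = self.mid(self.enc2(h1))                 # encoder + bottleneck, half resolution
        u = self.up(h2)                              # decoder, back to full resolution
        return self.out(torch.cat([u, h1], dim=1))   # skip connection; predicts the noise
```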

Training algorithm

We train two different classes of DM: one to generate a single component of the Lagrangian velocity field (DM-1c) and one for the three components simultaneously (DM-3c). Let us denote each entire trajectory as \({{{\mathcal{V}}}}\), where

$${{{\mathcal{V}}}}=\{{V}_{i}({t}_{k})| {t}_{k}\in [0,T];i=x,y,z\};\qquad \,{{\mbox{(DM-1c)}}}\,$$

and

$${{{\mathcal{V}}}}=\{{V}_{x}({t}_{k}),{V}_{y}({t}_{k}),{V}_{z}({t}_{k})\,| {t}_{k}\in [0,T]\};\qquad \,{{\mbox{(DM-3c)}}}\,$$

and k = 1, …, K goes over the total number of discretized sampling times for each trajectory (Table 1). The distribution of the ground-truth trajectories obtained from DNS of the NSE is denoted as \(q({{{\mathcal{V}}}})\). We introduce a forward noising process that starts from the ground-truth trajectory \({{{{\mathcal{V}}}}}_{0}={{{\mathcal{V}}}}\) and transforms it, after N steps, to a set of trajectories identical to pure random uncorrelated Gaussian noise. This process generates latent variables \({{{{\mathcal{V}}}}}_{1},\ldots ,{{{{\mathcal{V}}}}}_{N}\) by introducing Gaussian noise at step n with a variance βn ∈ (0, 1) according to the following formulation

$$q({{{{\mathcal{V}}}}}_{1:N}| {{{{\mathcal{V}}}}}_{0}):=\mathop{\prod }\limits_{n=1}^{N}q({{{{\mathcal{V}}}}}_{n}| {{{{\mathcal{V}}}}}_{n-1}),$$
(7)

where we have introduced the shorthand notation \({{{{\mathcal{V}}}}}_{1:N}\) to denote the entire chain of the ensemble of noisy trajectories \({{{{\mathcal{V}}}}}_{1},{{{{\mathcal{V}}}}}_{2},\ldots ,{{{{\mathcal{V}}}}}_{N}\); with the symbol ∼ denoting ‘distributed as’, each step is defined as

$$q({{{{\mathcal{V}}}}}_{n}| {{{{\mathcal{V}}}}}_{n-1})\to {{{{\mathcal{V}}}}}_{n} \sim {{{\mathcal{N}}}}\left(\sqrt{1-{\beta }_{n}}{{{{\mathcal{V}}}}}_{n-1},{\beta }_{n}{{{\bf{I}}}}\right).$$
(8)

Equation (7) is obtained using the Markovian property of the n steps in the forward process. For a large enough N and a suitable sequence of βn, the latent vector \({{{{\mathcal{V}}}}}_{N} \sim {{{\mathcal{N}}}}(0,{{{\bf{I}}}})\) approximates a delta-correlated Gaussian signal with zero mean and unitary variance. A second remarkable property of the above process, which follows from the Gaussian nature of the noise introduced at each step (equation (8)), is that given \({{{{\mathcal{V}}}}}_{0}\), we can sample \({{{{\mathcal{V}}}}}_{n}\) at any arbitrary n in closed form by defining αn: = 1 − βn and \({\bar{\alpha }}_{n}:=\mathop{\prod }\nolimits_{i = 0}^{n}{\alpha }_{i}\) as

$$q({{{{\mathcal{V}}}}}_{n}| {{{{\mathcal{V}}}}}_{0})\to {{{{\mathcal{V}}}}}_{n} \sim {{{\mathcal{N}}}}(\sqrt{{\bar{\alpha }}_{n}}{{{{\mathcal{V}}}}}_{0},(1-{\bar{\alpha }}_{n}){{{\bf{I}}}}).$$
(9)

In other words, starting from any ground-truth trajectory, \({{{{\mathcal{V}}}}}_{0}\), we can evaluate its corresponding realization after n steps in the forward process as

$${{{{\mathcal{V}}}}}_{n}=\sqrt{{\bar{\alpha }}_{n}}{{{{\mathcal{V}}}}}_{0}+\sqrt{1-{\bar{\alpha }}_{n}}\epsilon ,$$
(10)

where \(\epsilon \sim {{{\mathcal{N}}}}({{{\boldsymbol{0}}}},{{{\bf{I}}}})\). Now, it is clear that if we can reverse the above process and sample from \(p({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n})\), we will be able to generate new true samples starting from the Gaussian-noise input, \(p({{{{\mathcal{V}}}}}_{N})={{{\mathcal{N}}}}({{{\boldsymbol{0}}}},{{{\bf{I}}}})\). In general, the backward distribution, \(p({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n})\), is unknown. However, in the limit of continuous diffusion (small βn), the reverse process has the identical functional form of the forward process56. Because \(q({{{{\mathcal{V}}}}}_{n}| {{{{\mathcal{V}}}}}_{n-1})\) is a Gaussian distribution and βn is chosen to be small, \(p({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n})\) will also be a Gaussian. In this way, the UNet, with trainable parameters θ, needs to model the mean \({\mu }_{\theta }({{{{\mathcal{V}}}}}_{n},n)\) and standard deviation \({\Sigma }_{\theta }({{{{\mathcal{V}}}}}_{n},n)\) of the transition probabilities for all steps in the backward diffusion process:

$${p}_{\theta }({{{{\mathcal{V}}}}}_{0:N})=p({{{{\mathcal{V}}}}}_{N})\mathop{\prod }\limits_{n=1}^{N}{p}_{\theta }({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n}),$$
(11)

where each reverse step can be written as

$${p}_{\theta }({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n})\to {{{{\mathcal{V}}}}}_{n-1} \sim {{{\mathcal{N}}}}({\mu }_{\theta }({{{{\mathcal{V}}}}}_{n},n),{\Sigma }_{\theta }({{{{\mathcal{V}}}}}_{n},n)).$$
(12)
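Before moving to the loss, the closed-form forward sampling of equation (10), which later provides the training target, can be sketched in a few lines; the array alpha_bar and the indexing convention are assumptions of this sketch.

```python
# Minimal sketch of the one-shot forward noising of equation (10), assuming a
# precomputed tensor `alpha_bar` of length N + 1 with alpha_bar[n] = prod_i alpha_i.
import torch

def forward_noise(V0, n, alpha_bar):
    """Sample V_n ~ N(sqrt(alpha_bar_n) * V0, (1 - alpha_bar_n) * I) for a scalar step n."""
    eps = torch.randn_like(V0)                                      # epsilon ~ N(0, I)
    Vn = alpha_bar[n].sqrt() * V0 + (1.0 - alpha_bar[n]).sqrt() * eps
    return Vn, eps                                                  # eps is the target for epsilon_theta
```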

During training, the optimization involves minimizing the cross entropy, LCE, between the ground-truth distribution and the likelihood of the generated data

$$\begin{array}{rcl}{L}_{\text{CE}}&:=&-{{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0})}\log \left({p}_{\theta }({{{{\mathcal{V}}}}}_{0})\right)\\ &=&-{{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0})}\log \left(\displaystyle\int{p}_{\theta }({{{{\mathcal{V}}}}}_{0:N})d{{{{\mathcal{V}}}}}_{1:N}\right).\end{array}$$
(13)

However, integrating over all possible backward paths from 1 to N and averaging over all ground-truth data, \({{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0})}[\ldots ]=\int[\ldots ]q({{{{\mathcal{V}}}}}_{0})d{{{{\mathcal{V}}}}}_{0}\), for every network update is numerically intractable. A way out is to exploit a variational lower bound LVLB for the cross entropy56:

$$\begin{array}{rc}&{L}_{\text{CE}}\le {{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0})}{{\mathbb{E}}}_{p({{{{\mathcal{V}}}}}_{1:N}| {{{{\mathcal{V}}}}}_{0})}\left[\log \frac{p({{{{\mathcal{V}}}}}_{1:N}| {{{{\mathcal{V}}}}}_{0})}{{p}_{\theta }({{{{\mathcal{V}}}}}_{0:N})}\right]:={L}_{\text{VLB}}.\end{array}$$
(14)

To make the above expression computable, the expectation value can be split into its independent steps. Consequently, it can be rewritten as a summation of several Kullback–Leibler divergences, DKL, plus one entropy term (see details in Appendix B of ref. 56). In this way, LVLB becomes

$$\begin{array}{rcl}{L}_{\text{VLB}}&=&{{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0})}\left[\underbrace{{D}_{{{{\rm{KL}}}}}\!\left(p({{{{\mathcal{V}}}}}_{N}| {{{{\mathcal{V}}}}}_{0})\parallel {p}_{\theta }({{{{\mathcal{V}}}}}_{N})\right)}_{\begin{array}{c}{L}_{N}\end{array}}\right.\\ &&+\mathop{\sum }\limits_{n > 1}^{N}\underbrace{{D}_{{{{\rm{KL}}}}}\!\left(p({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0})\parallel {p}_{\theta }({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n})\right)}_{\begin{array}{c}{L}_{n-1}\end{array}}\\ &&\left.\underbrace{-\log {p}_{\theta }({{{{\mathcal{V}}}}_{0}| {{{{\mathcal{V}}}}}_{1})}}_{\begin{array}{c}{L}_{0}\end{array}}\right].\end{array}$$
(15)

The first term, LN, can be ignored during training, as \(p({{{{\mathcal{V}}}}}_{N}| {{{{\mathcal{V}}}}}_{0})\) does not depend on the network parameters, θ, and \({p}_{\theta }({{{{\mathcal{V}}}}}_{N})={{{\mathcal{N}}}}(0,{{{\bf{I}}}})\) is just the Gaussian distribution. Hence, the network must minimize only the terms Ln with n < N to reproduce the entire backward diffusion process and generate correct data. At this point, the last remarkable property that allows each term of the variational lower bound to be written in a tractable way is that the inverse conditional probability can be calculated analytically when conditioned on a particular realization of the ground-truth data. Using Bayes’ theorem, we can write

$$p({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0})=q({{{{\mathcal{V}}}}}_{n}| {{{{\mathcal{V}}}}}_{n-1},{{{{\mathcal{V}}}}}_{0})\frac{q({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{0})}{q({{{{\mathcal{V}}}}}_{n}| {{{{\mathcal{V}}}}}_{0})}.$$
(16)

All probabilities on the right-hand side of equation (16) describe forward steps as defined in equations (8) and (9). Therefore, equation (16) can be regarded as the product of three Gaussians

$$\begin{array}{rcl}p({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0})&\propto &\exp \left(-\frac{{({{{{\mathcal{V}}}}}_{n}-\sqrt{{\alpha }_{n}}{{{{\mathcal{V}}}}}_{n-1})}^{2}}{2{\beta }_{n}}\right)\\ &&\cdot \exp \left(-\frac{{({{{{\mathcal{V}}}}}_{n-1}-\sqrt{{\bar{\alpha }}_{n-1}}{{{{\mathcal{V}}}}}_{0})}^{2}}{2(1-{\bar{\alpha }}_{n-1})}\right)\\ &&\cdot \exp \left(\frac{{({{{{\mathcal{V}}}}}_{n}-\sqrt{{\bar{\alpha }}_{n}}{{{{\mathcal{V}}}}}_{0})}^{2}}{2(1-{\bar{\alpha }}_{n})}\right),\end{array}$$
(17)

which can be rewritten as

$$p({{{{\mathcal{V}}}}}_{n-1}| {{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0})\to {{{{\mathcal{V}}}}}_{n-1} \sim {{{\mathcal{N}}}}(\tilde{\mu }({{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0}),{\tilde{\beta }}_{n}{{{\bf{I}}}}),$$
(18)

where the mean and the standard deviation of the conditioned reverse probability are, respectively,

$${\tilde{\mu }}_{n}({{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0}):=\frac{\sqrt{{\bar{\alpha }}_{n-1}}{\beta }_{n}}{1-{\bar{\alpha }}_{n}}{{{{\mathcal{V}}}}}_{0}+\frac{\sqrt{{\alpha }_{n}}(1-{\bar{\alpha }}_{n-1})}{1-{\bar{\alpha }}_{n}}{{{{\mathcal{V}}}}}_{n}$$
(19)

and

$${\tilde{\beta }}_{n}:=\frac{1-{\bar{\alpha }}_{n-1}}{1-{\bar{\alpha }}_{n}}{\beta }_{n}.$$
(20)

All terms denoted by Ln−1 in the variational lower bound are DKL between two Gaussians, which depend only on the difference between their mean values and standard deviations. Assuming that the standard deviations of the reverse and forward processes are identical, that is, Σθ = βnI, we only need to model the mean values of the backward Gaussians. Consequently, the Kullback–Leibler divergence simplifies to the (squared) difference between the two mean values: the one given in equation (19) and the output of the UNet model, \({\mu }_{\theta }({{{{\mathcal{V}}}}}_{n},n)\), in equation (12). From this simplification, it follows that each loss term becomes

$${L}_{n-1}={{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0})}\left[\frac{1}{2{\beta }_{n}}| | {\tilde{\mu }}_{n}({{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0})-{\mu }_{\theta }({{{{\mathcal{V}}}}}_{n},n)| {| }^{2}\right].$$

Expressing \({{{{\mathcal{V}}}}}_{0}\) in terms of \({{{{\mathcal{V}}}}}_{n}\) by inverting equation (10) and substituting it in equation (19), the mean value of the reverse conditioned probability can be rewritten as

$$\tilde{\mu }({{{{\mathcal{V}}}}}_{n},{{{{\mathcal{V}}}}}_{0})=\frac{1}{\sqrt{{\alpha }_{n}}}\left({{{{\mathcal{V}}}}}_{n}-\frac{{\beta }_{n}}{\sqrt{1-{\bar{\alpha }}_{n}}}{\boldsymbol{\epsilon }}_{{{{{\mathcal{V}}}}}_{0},n}\right),$$
(21)

where the subscripts of the noise term, \({{{{\boldsymbol{\epsilon }}}}}_{{{{{\mathcal{V}}}}}_{0},n}\), indicate that this is the specific noise realization used to obtain \({{{{\mathcal{V}}}}}_{n}\) from \({{{{\mathcal{V}}}}}_{0}\), as defined in equation (10). Now, because \({{{{\mathcal{V}}}}}_{n}\) is known by the network, one may reparameterize the predicted mean \({\mu }_{\theta }({{{{\mathcal{V}}}}}_{n},n)\) as

$${\mu }_{\theta }({{{{\mathcal{V}}}}}_{n},n)=\frac{1}{\sqrt{{\alpha }_{n}}}\left({{{{\mathcal{V}}}}}_{n}-\frac{{\beta }_{n}}{\sqrt{1-{\bar{\alpha }}_{n}}}{{{{\boldsymbol{\epsilon }}}}}_{\theta }({{{{\mathcal{V}}}}}_{n},n)\right),$$
(22)

where ϵθ is a function approximator designed to predict \({{{{\boldsymbol{\epsilon }}}}}_{{{{{\mathcal{V}}}}}_{0},n}\) from \({{{{\mathcal{V}}}}}_{n}\), leading to the following reformulation of the loss terms:

$${L}_{n-1}={{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0}),{{{{\boldsymbol{\epsilon }}}}}_{{}_{{{{{\mathcal{V}}}}}_{0},n}}}\left[\frac{{\beta }_{n}}{2{\alpha }_{n}(1-{\bar{\alpha }}_{n})}| | {{{{\boldsymbol{\epsilon }}}}}_{{{{{\mathcal{V}}}}}_{0},n}-{{{{\boldsymbol{\epsilon }}}}}_{\theta }({{{{\mathcal{V}}}}}_{n},n)| {| }^{2}\right].$$

Namely, during training, the noise ϵθ predicted by the DM is compared with the realization used to build \({{{{\mathcal{V}}}}}_{n}\) from \({{{{\mathcal{V}}}}}_{0}\). This formulation leads to faster and more stable training36. Moreover, it has been shown36 that one can obtain good results even by performing the training without learning the variance of the reverse process and introducing a simpler, reweighted loss function defined as

$${L}_{n-1}^{{{{\rm{simple}}}}}={{\mathbb{E}}}_{q({{{{\mathcal{V}}}}}_{0}),{{{{\boldsymbol{\epsilon }}}}}_{{{{{\mathcal{V}}}}_{0},n}}}\left[| | {{{{\boldsymbol{\epsilon }}}}}_{{{{{\mathcal{V}}}}}_{0},n}-{{{{\boldsymbol{\epsilon }}}}}_{\theta }({{{{\mathcal{V}}}}}_{n},n)| {| }^{2}\right],$$
(23)

which is identical to the one we implemented in this work. It is worth noting that, due to the Gaussian form of \({p}_{\theta }({{{{\mathcal{V}}}}}_{0}| {{{{\mathcal{V}}}}}_{1})\), the term \({L}_{0}\) results in the same loss function as depicted in equation (23). Therefore, the optimized loss functions can be expressed as \({L}_{n}^{{{{\rm{simple}}}}}\), where n ranges from 0 to N − 1.
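Putting the pieces together, a single optimization step on this simplified loss can be sketched as follows; the model interface, the per-sample step indexing and the optimizer handling are assumptions of this sketch rather than the training code used in the paper.

```python
# Minimal sketch of one training step on the simplified loss of equation (23):
# draw a random diffusion step per trajectory, noise it following equation (10)
# and regress the injected noise. `model(Vn, n)` is a stand-in for the UNet
# output epsilon_theta; `alpha_bar` is the cumulative-product schedule of length N + 1.
import torch
import torch.nn.functional as F

def training_step(model, optimizer, V0, alpha_bar, N):
    n = torch.randint(1, N + 1, (V0.shape[0],), device=V0.device)   # random step per sample
    a_bar = alpha_bar.to(V0.device)[n].view(-1, 1, 1)                # broadcast over (batch, channels, K)
    eps = torch.randn_like(V0)
    Vn = a_bar.sqrt() * V0 + (1.0 - a_bar).sqrt() * eps              # forward noising, equation (10)
    loss = F.mse_loss(model(Vn, n), eps)                             # L^simple, equation (23)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```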

DM architecture and noise schedule

The UNet architecture we have implemented is one of the most advanced networks described in the literature, demonstrating state-of-the-art performance in image generation37. It is capable of extracting hidden, spatially correlated information that is essential both for image generation and for accomplishing our specific task. The details of the architecture, including the hyperparameters, are summarized in the table in Fig. 3a. Each encoder and decoder part consists of five levels. Progressing to the next level entails doubling or halving the resolution as one passes through an Upsample or Downsample layer, respectively. The Depth parameter controls the number of ResBlocks with or without AttentionBlocks at each level. Within each level, layers share the same number of features, which can be determined using the Channels and Channels multiple parameters from the table. Attention mechanisms78 allow neural networks to prioritize certain regions or features within the data. In this study, we employed multihead attention with four heads. AttentionBlocks were utilized at levels with resolutions of 250 and 125. For the DM-1c model, we utilized 250 × 10³ iterations, while 400 × 10³ iterations were used for the DM-3c model. In each iteration, we sample a batch of training data, assign a random step index n to each sample and then optimize \({L}_{n}^{{{{\rm{simple}}}}}\) across the data batch. Figure 6 shows the training loss as a function of iteration for DM-1c, alongside the fourth-order flatness of samples generated from it at different iteration checkpoints: A, B and C. Here, C corresponds to the final model. It reveals that although the loss rapidly reaches a ‘plateau’, it is crucial to continue training for model convergence. This is because \(\langle {L}_{n}^{{{{\rm{simple}}}}}\rangle\) is an average derived from a data batch where each sample is assigned a random n, which does not truly represent the inherent loss LCE described in equation (13). Although LCE can be approximated as the summed expectation of \({L}_{n}^{{{{\rm{simple}}}}}\) across the training dataset for 0 < n ≤ N, direct evaluation of LCE is impractical. Instead, we rely on examining the statistical properties to measure training progress.

Concerning the noise schedule to improve the training and sampling protocols, we explored three different laws and found that the optimal one for our application is given in terms of a tanh profile; see Fig. 3b. Indeed, all results shown in the main text and in Fig. 3c–e have been obtained by following the schedule (tanh6-1):

$${\bar{\alpha }}_{n}=\frac{-\tanh (7n/N-6)+\tanh 1}{-\tanh (-6)+\tanh 1},$$
(24)

which allowed us to use N = 800 diffusion steps rather than the N = 4,000 needed for the linear case, where the forward-process variances increase linearly from β1 = 10⁻⁴ to βN = 0.02. As a result, a fivefold improvement in performance is achieved. We also explored an alternative noise schedule (power4) with the functional form \({\bar{\alpha }}_{n}=1-{(n/N)}^{4}\) and N = 800, which turned out to be slightly inferior to (tanh6-1). Note that applying methods to speed up DM sampling with pretrained models remains worthy of future exploration79,80.
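For completeness, the schedule of equation (24) and the backward generation loop built on equations (12) and (22) are sketched below; the clipping of βn near the final step and the indexing conventions are assumptions of this sketch, introduced only to keep the toy loop numerically well defined.

```python
# Minimal sketch of the (tanh6-1) schedule of equation (24) and of the backward
# generation loop of equations (12) and (22), with Sigma_theta = beta_n * I.
# The clipping of beta_n is an assumption of this sketch, keeping the final step
# non-singular (alpha_bar_N is exactly zero in equation (24)).
import numpy as np
import torch

def tanh6_1_schedule(N=800):
    n = np.arange(N + 1)
    abar = (-np.tanh(7.0 * n / N - 6.0) + np.tanh(1.0)) / (-np.tanh(-6.0) + np.tanh(1.0))
    beta = np.clip(1.0 - abar[1:] / abar[:-1], 0.0, 0.999)   # beta_n = 1 - alpha_bar_n / alpha_bar_{n-1}
    return torch.tensor(abar, dtype=torch.float32), torch.tensor(beta, dtype=torch.float32)

@torch.no_grad()
def generate(model, shape, N=800):
    alpha_bar, beta = tanh6_1_schedule(N)
    V = torch.randn(shape)                                    # V_N: pure Gaussian noise
    for n in range(N, 0, -1):
        eps = model(V, torch.full((shape[0],), n))            # predicted noise epsilon_theta
        mean = (V - beta[n - 1] / (1.0 - alpha_bar[n]).sqrt() * eps) / (1.0 - beta[n - 1]).sqrt()
        noise = torch.randn_like(V) if n > 1 else torch.zeros_like(V)
        V = mean + beta[n - 1].sqrt() * noise                 # one reverse step, equation (12)
    return V                                                  # V_0: synthetic trajectories
```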

Computational cost

To illustrate the computational cost of our case, the DNS of the Eulerian field takes about 7.2 hours on 4,096 cores. This step is essential even to generate a single Lagrangian trajectory. An additional 64% of this time is required to track 4 million Lagrangian tracers. All training and sampling of the DMs in our study were performed on four NVIDIA A100 GPUs. Training takes approximately 1 hour per 10,000 iterations, resulting in approximately 25 hours for DM-1c and 40 hours for DM-3c. Sampling an equivalent number of 4 million trajectories takes about 200 hours.