Synthetic Lagrangian Turbulence by Generative Diffusion Models

Lagrangian turbulence lies at the core of numerous applied and fundamental problems related to the physics of dispersion and mixing in engineering, bio-fluids, atmosphere, oceans, and astrophysics. Despite exceptional theoretical, numerical, and experimental efforts conducted over the past thirty years, no existing models are capable of faithfully reproducing statistical and topological properties exhibited by particle trajectories in turbulence. We propose a machine learning approach, based on a state-of-the-art diffusion model, to generate single-particle trajectories in three-dimensional turbulence at high Reynolds numbers, thereby bypassing the need for direct numerical simulations or experiments to obtain reliable Lagrangian data. Our model demonstrates the ability to reproduce most statistical benchmarks across time scales, including the fat-tail distribution for velocity increments, the anomalous power law, and the increased intermittency around the dissipative scale. Slight deviations are observed below the dissipative scale, particularly in the acceleration and flatness statistics. Surprisingly, the model exhibits strong generalizability for extreme events, producing events of higher intensity and rarity that still match the realistic statistics. This paves the way for producing synthetic high-quality datasets for pre-training various downstream applications of Lagrangian turbulence.

Understanding the statistical and geometrical properties of particles advected by turbulent flows is a challenging problem of utmost importance for modeling, predicting, and controlling many applications such as combustion, industrial mixing, pollutant dispersion, quantum fluids, protoplanetary disk accretion, cloud formation, and prey-predator dynamics, to cite just a few [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16]. The main difficulties arise from the vast range of time scales involved, spanning from the longest, τ_L, governed by the stirring mechanism, to the shortest, τ_η, associated with viscous dissipation, and from the presence of strong non-Gaussian fluctuations (intermittency). Indeed, the ratio τ_L/τ_η is proportional to the Taylor Reynolds number, R_λ, a dimensionless measure of the turbulent intensity, varying from a few thousand in laboratory experiments to millions and even larger in atmospheric and astrophysical contexts [17]. Similarly, non-Gaussian fat tails become more pronounced with increasing R_λ, resulting in rare-but-intense velocity and acceleration fluctuations of up to 50-60 standard deviations that can be easily measured even in table-top laboratory flows at moderate R_λ (see Fig. 1a and Fig. 2). Due to the combined influence of long-distance sweeping, multi-time fluctuations, and small-scale trapping within intense mini-tornadoes, the problem remains insurmountable from both theoretical and modeling perspectives at the present time.
Over the past thirty years, many different Lagrangian phenomenological models have been proposed, employing various methods such as two-time Ornstein-Uhlenbeck stochastic approaches to capture the dynamics at the two spectrum extremes, τ_L and τ_η [18,19], as well as multi-time infinitely differentiable processes [20]. Numerous other models have been explored with differing degrees of success, including applications to passive-scalar fluctuations [21][22][23][24][25]. Moreover, both Markovian and non-Markovian modeling approaches based on multifractal and/or multiplicative models have been employed previously to reproduce certain observed Lagrangian and Eulerian multiscale turbulent features [26][27][28][29][30][31]; see [32] for a recent attempt to combine multifractal scaling and stochastic partial differential equations. However, although all these previous attempts are able to reproduce some non-trivial features of the turbulent statistics well, we still lack a systematic way to generate synthetic trajectories with the correct multiscale statistics over the full range of dynamics encountered in a real turbulent environment, from the large forcing scales, through the intermittent inertial range, to the coupled regime between inertial and dissipative scales [33].
As a result, new approaches are needed to attack the problem. Machine learning (ML) synthetic data-driven models, including variational autoencoders (VAEs) [34], generative adversarial networks (GANs) [35], and, more recently, diffusion models (DMs) [36], have exhibited remarkable success across diverse fields such as computer vision, audio generation, natural language processing, healthcare, and various other domains [37][38][39][40]. Building upon this success, there is a growing interest in applying these techniques to scientific challenges. Specifically, ML methods have shown strong potential to tackle open problems in fluid mechanics [41,42]. ML tools have been further developed for tasks like generation, super-resolution, prediction, and inpainting of dynamical systems [43,44] and of two-dimensional (2D) and three-dimensional (3D) Eulerian turbulent snapshots [45][46][47][48][49][50]; see [51] for a short summary. In many cases, the validation of these tools when applied to fluid mechanics is primarily limited to simple 2D smooth and quasi-Gaussian turbulent flows, or focused on single-point measurements such as mean profiles and two-point spectral properties. There is often a lack of comprehensive quantitative assessments concerning the more intricate multiscale non-Gaussian properties at high Reynolds numbers. Recently, a fully convolutional model has been proposed to generate one-dimensional Eulerian cuts of high-Reynolds-number turbulence [52]. This model has demonstrated success in capturing up to the 4th-order structure function; however, its generalization to higher-order statistics is less accurate. Given the state of the art, it is fair to say that we lack both equation-informed and data-driven tools to generate 3D single- or multi-particle Lagrangian trajectories possessing statistical and geometrical properties that quantitatively agree with experiments and direct numerical simulations (DNS). The demand for the synthetic generation of high-quality and high-quantity data is crucial in various turbulent applications, particularly in the Lagrangian domain, where obtaining even a single trajectory requires the reproduction of the entire Eulerian field over huge spatial domains, which is often a daunting or impossible task for DNS or extremely laborious for experiments.
Here we present a stochastic data-driven model able to match numerical and experimental data concerning single-particle statistics in homogeneous and isotropic turbulence (HIT) at high Reynolds numbers. The model is based on a novel application of state-of-the-art generative DMs [36,37,53]. We have trained two distinct DMs for our study: DM-1c, which generates a single component of the Lagrangian velocity, and DM-3c, which simultaneously outputs all three correlated components (see Methods). Our synthetic generation protocol is able to reproduce the scaling of velocity increments over the full range of available frequencies and for all statistically converged moments up to the 8th order in the original training data. Moreover, the protocol successfully captures acceleration fluctuations of up to 60 standard deviations and even beyond, including the cross-correlations between the three velocity components. We train the model using high-quality data obtained from DNS at R_λ ≃ 310. The results also show excellent agreement with the numerical ground-truth data for the generalized flatness of 4th, 6th, and 8th orders, whose intensities, due to the presence of intermittent fluctuations, are found to be orders of magnitude larger than the values expected for Gaussian statistics. Remarkably, our model exhibits strong generalization properties, enabling the synthesis of events with intensities never encountered during the training phase. These extreme fluctuations, resulting from small-scale vortex trapping and sharp u-turn trajectories with unprecedented excursions and rarity, consistently follow the realistic statistics inherent in the training data.

I. PROBLEM SET-UP
Lagrangian turbulence. The dataset used for training is extracted from a high-resolution DNS of the 3D Navier-Stokes equations (NSE) in a cubic periodic domain with large-scale isotropic forcing. Lagrangian point-like particles have an instantaneous velocity, V(t) = Ẋ(t), coinciding with the local instantaneous flow streamlines at the particle position, X(t):

Ẋ(t) = V(t) = u(X(t), t),    (1)

where u solves the NSE, see Eq. (6) in Methods. To construct a high-quality ground-truth database, we tracked a total of N_p = 327680 trajectories, each spanning a length of T ≃ 1.3τ_L ≃ 200τ_η, with a temporal sampling interval of dt_s ≃ 0.1τ_η. Consequently, each trajectory is discretized into a total of K = 2000 points, see Table I. Particles are injected randomly in the 3D volume once a statistically stationary evolution is reached for the underlying Eulerian flow, thus ensuring that the Lagrangian statistics are also stationary. The set of multi-time observables utilized to benchmark the quality of the single-particle 3D trajectory generation primarily relies on the statistics of Lagrangian velocity increments:

δ_τ V_i(t) = V_i(t + τ) − V_i(t),    (2)

where i = x, y, z indicates any of the three velocity components and τ represents the time increment. The instantaneous particle acceleration is obtained from the limit a_i(t) = lim_{τ→0} δ_τ V_i/τ, where we use a time resolution of 0.1τ_η for both DNS and DM. Both the probability density functions (PDFs) of δ_τ V_i, in Fig. 1a, and that of a_i, in Fig. 2, show strongly non-Gaussian fluctuations. The tails of the PDFs of δ_τ V_i become more pronounced at decreasing time scale τ. It is a well-known empirical fact that Lagrangian velocity increments develop power-law scaling in the inertial range, τ_η ≪ τ ≪ τ_L, as measured by the Lagrangian structure functions [33,54,55]:

S_p(τ) = ⟨(δ_τ V_i)^p⟩ ∼ τ^{ξ(p)},    (3)

where with ⟨•⟩ we indicate an average over all N_p trajectories and over time. For both DNS and DM-3c, S_p(τ) is computed by averaging over the three velocity components.

Diffusion Models. DMs have emerged in recent years, outperforming state-of-the-art GANs on image synthesis [37]. DMs are built upon forward and backward diffusion processes (see Fig. 3a and Methods). The forward process is a Markov chain that gradually introduces Gaussian noise into the training data until the original signal is transformed into pure noise.
In the opposite direction, the backward process starts from pure Gaussian-noise realizations and learns to progressively denoise the signal, effectively generating the desired data samples, as shown in Fig. 3f. The diffusion processes stem from non-equilibrium statistical physics, leveraging Markov chains to progressively morph one distribution into another [56,57]. The training of DMs involves the use of a variational-inference lower bound to estimate the loss function along a finite, but large, number of diffusion steps. By focusing on these small incremental changes, the loss term becomes tractable, eliminating the need to resort to the less stable adversarial training, a strategy commonly used by GANs, which aims to reproduce the entire data distribution in a single jump from the input noise. Our implementation of DM adopts the UNet architecture of a cutting-edge DM used in computer vision [37]. An optimized noise schedule for the diffusion processes has also been developed in order to enhance both the efficiency and the performance when constructing the multiscale features of the signal, as presented in Fig. 3b and discussed in more detail in the Methods section.

TABLE I. Eulerian and Lagrangian DNS parameters. N_L is the resolution in each spatial dimension; L is the physical dimension of the cubic periodic box; dt represents the time step in the DNS integration; ν stands for the kinematic viscosity; ε = ν⟨∂_i u_j ∂_i u_j⟩ is the total mean energy dissipation, averaged over time and space; τ_η = (ν/ε)^{1/2} is the Kolmogorov dissipative time; η = (ν³/ε)^{1/4} is the Kolmogorov dissipative scale; R_λ = u_rms λ/ν signifies the 'Taylor-scale' Reynolds number, where u_rms is the root-mean-squared velocity and λ = (5E_tot/Ω)^{1/2} ≃ 0.14 represents the 'Taylor scale', with E_tot ≃ 4.5 and Ω ≃ 1200 being, respectively, the total mean energy and enstrophy in the flow. Additionally, τ_L = L/u_rms ≃ 3.5 is the integral time scale. Parameters of the Lagrangian particles: N_p is the total number of trajectories; dt_s is the time lag between two consecutive Lagrangian dumps; T is the total length of each trajectory; and K = T/dt_s is the total number of points in each trajectory.
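As a cross-check of the Table I caption, the quoted Taylor scale follows directly from the quoted energy and enstrophy. A short sketch (ν and ε below are illustrative placeholders, not the actual DNS values):

```python
import math

# Values quoted in the Table I caption.
E_tot = 4.5     # total mean energy
Omega = 1200.0  # total mean enstrophy

# Taylor scale: lambda = (5 * E_tot / Omega)^(1/2).
lam = math.sqrt(5.0 * E_tot / Omega)
print(f"lambda ~ {lam:.3f}")  # ~0.137, consistent with the quoted 0.14

# Kolmogorov scales from viscosity nu and dissipation eps.
# NOTE: nu and eps here are hypothetical placeholders, not the DNS parameters.
nu, eps = 1.0e-3, 0.5
tau_eta = math.sqrt(nu / eps)   # Kolmogorov time, (nu/eps)^(1/2)
eta = (nu**3 / eps) ** 0.25     # Kolmogorov length, (nu^3/eps)^(1/4)
```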

II. RESULTS
Probability density functions. In Fig. 1a we show the success of the DM in generating more and more intense (non-Gaussian) velocity fluctuations, δ_τ V_i, as τ → 0, with very good statistical agreement with the ground truth. Typical trajectories generated by DM-1c are also shown qualitatively in Fig. 1b-d for different time lags, τ, with local events belonging to both laminar and intense fluctuations. Note the ability of DMs to overcome the additional difficulty of simultaneously generating the three correlated components (DM-3c), required to produce highly complex topological -vortical- structures, as shown in Fig. 1e,f. In Fig. 2 we present the PDF of one generic component of the acceleration, a_i, from DM-1c, showing very close agreement with the fat-tail ground-truth DNS distribution up to fluctuations around 60-70 times the standard deviation. To illustrate the convergence and generalizability of the DM models, we include in Fig. 1a and Fig. 2 results from the DM-1c model trained on only 10% of the DNS data, denoted as DM-1c-10%. The DM-1c and DM-1c-10% results closely match, demonstrating the training convergence. In Fig. 1a, the alignment of DM-1c-10% with the DNS data further underscores the DM's generalizability: it generates extreme events unseen in the training data which, importantly, adhere to the realistic statistical properties. Further details and comparisons of other statistical measurements for DM-1c-10% are provided in the Supplementary Material.
Lagrangian Structure Functions and Generalized Flatness. In Fig. 4 we show, for both DM-1c and DM-3c, the Lagrangian structure functions given by (3) for p = 2, 4, 6 in panel a, and in panel b the generalized flatness, F_p(τ) = S_p(τ)/[S_2(τ)]^{p/2}. Because the symmetry of the PDFs of the velocity increments makes the odd-order structure functions vanish, we focus only on the even orders. Structure functions and generalized flatness of different orders are superimposed with the ground-truth DNS for comparison. The capacity of both DM-1c and DM-3c to reproduce the ground truth over many time-scale decades is striking, especially for τ ≳ τ_η. However, below the dissipative scale, as τ → 0, we observe a tendency for the DM-3c model to generate a slightly smoother signal compared to the DNS, consistent with our observations in Fig. 2. The 4th-order mixed flatness, F_4^{ij}(τ) = ⟨(δ_τ V_i)² (δ_τ V_j)²⟩/[S_2(τ)]², calculated by averaging over ij = xy, xz and yz, is shown in panel c of the same figure, in order to check the ability of DM-3c to reproduce the correlation among different components of the velocity vector, confirming quantitatively the agreement between DM-3c and DNS shown in Fig. 1e,f. It is worth noting that, while the results are very good, there is still room for further refinement at scales in the dissipative range.
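In practice, Eq. (3) and the generalized flatness reduce to a few array operations over the (n_trajectories, K) velocity record. A minimal numpy sketch on synthetic Gaussian data (array shapes and names are illustrative, not those of the released code):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((64, 2000))  # toy stand-in: (n_trajectories, K) velocity samples

def structure_function(V, tau, p):
    """S_p(tau) = <(delta_tau V_i)^p>, averaged over trajectories and time."""
    dV = V[:, tau:] - V[:, :-tau]    # Lagrangian velocity increments, Eq. (2)
    return np.mean(dV**p)

def flatness(V, tau, p=4):
    """Generalized flatness F_p(tau) = S_p(tau) / [S_2(tau)]^(p/2)."""
    return structure_function(V, tau, p) / structure_function(V, tau, 2) ** (p / 2)

# For Gaussian data the 4th-order flatness is ~3 at every scale; real Lagrangian
# data instead show a strong growth of F_4(tau) as tau decreases towards tau_eta.
F4 = flatness(V, tau=10, p=4)
```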
Acceleration correlation function. In the inset of Fig. 2 we also present the synthetic single-component acceleration correlation function, C(τ) = ⟨a_i(t + τ) a_i(t)⟩, where i = x, y, z. The result demonstrates a strong alignment with the DNS. This multiscale Lagrangian correlation function has been the subject of intense study and modeling in the past, due to the presence of a whole set of hierarchical time scales affecting its properties [58][59][60][61].
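The acceleration autocorrelation C(τ) is likewise a one-liner over the trajectory ensemble. A hedged numpy sketch on white-noise stand-in data (real accelerations decorrelate over a few τ_η rather than immediately):

```python
import numpy as np

rng = np.random.default_rng(1)
a = rng.standard_normal((32, 2000))  # toy stand-in: (n_trajectories, K) acceleration samples

def acc_correlation(a, tau):
    """C(tau) = <a_i(t + tau) a_i(t)>, averaged over trajectories and time."""
    if tau == 0:
        return np.mean(a * a)
    return np.mean(a[:, tau:] * a[:, :-tau])

C0 = acc_correlation(a, 0)   # variance of the acceleration
C5 = acc_correlation(a, 5)   # ~0 for white noise; finite for DNS data
```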

Local Scaling Exponents. Let us now introduce what is perhaps the most stringent and quantitative multiscale test for turbulence studies: the comparison of local scaling properties provided by the scale-by-scale exponent

ζ(p, τ) = [d log S_p(τ)/d log τ] / [d log S_2(τ)/d log τ].    (4)

In practice, we compute the two logarithmic derivatives d log S_p(τ)/d log τ and d log S_2(τ)/d log τ on a grid with τ intervals of 1 (from 1 to 1024) using second-order accurate central differences, and then perform the division. It is easy to realize that in the inertial range, where (3) is supposed to hold, we have ζ(p, τ) = ξ(p)/ξ(2), independently of τ. On the other hand, it is known that most of the 'turbulent' deadlocks develop at the interface between viscous and inertial ranges, τ ∼ τ_η, where the highest level of non-Gaussian fluctuations is observed. Multifractal statistical models are able to fit the whole complexity of the ζ(p, τ) curves in the entire range of time scales [33,54,62,63]. This is achieved by introducing a multiplicative cascade model in the inertial range, terminated by a fluctuating dissipative time scale, τ_η [64,65]. Despite numerous attempts, we miss a proper constructive method for embedding the above phenomenology to generate synthetic, realistic 3D Lagrangian trajectories [27,29,32,66]. In Fig. 5a we show the local exponent for p = 4 for DM-1c and DM-3c, and for the DNS data used for training; for comparison, in Fig. 5b we show a state-of-the-art collection of experimental and other DNS data published in the past. Similar results are obtained for p = 6 and 8 (not shown). The agreement of results from DMs with experimental and DNS data is remarkable. This is considered a high-quality benchmark, demanding the reproduction of the rate of variation of the local scaling properties over a range of frequencies/time lags spanning more than 3 decades and a corresponding variation of the structure functions (3) over 4-5 decades (see Fig. 4). Such substantial variations are distilled into the measurement of O(1) quantities (4) with an error margin within 5%. There are no other tests that can check the scaling properties with greater precision, because statistical accuracy typically does not allow one to go beyond a simple -and inaccurate- log-log fit of scaling laws over the full range of variation.
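The local-exponent computation described above can be sketched on a toy power law, where the answer is known exactly (the exponent values below are illustrative, not fitted DNS values):

```python
import numpy as np

# Toy structure functions with pure power-law scaling S_p ~ tau^xi_p,
# for which zeta(p, tau) = xi_p / xi_2 at every tau.
tau = np.arange(1.0, 1025.0)   # grid with tau intervals of 1, from 1 to 1024
xi2, xi4 = 1.0, 1.7            # illustrative exponents only
S2, S4 = tau**xi2, tau**xi4

# Logarithmic derivatives d log S_p / d log tau via second-order accurate
# central differences (np.gradient), then the division, as in Eq. (4).
zeta4 = np.gradient(np.log(S4), np.log(tau)) / np.gradient(np.log(S2), np.log(tau))
```

On a pure power law the ratio is constant, zeta4 = xi4/xi2 = 1.7, at every τ; on DNS data the same recipe traces out the dip around τ ∼ τ_η.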

III. DISCUSSIONS
We have presented a data-driven model capable of reproducing all recognized statistical properties of single-particle Lagrangian turbulence in HIT, from the large scales down to the inertial and inertial-viscous scaling range, including the enhanced intermittent properties observed around τ_η. This achievement is summarized by the PDFs of velocity increments in the inertial range and of acceleration (Fig. 1 and Fig. 2), as well as by the structure functions, the flatness among different components, and the local scaling exponents shown in Figs. 4 and 5. In Table II, we further summarize a comparison of single-time two-point correlations of velocity and acceleration, showing an excellent matching of DM synthetic data with DNS, except for the cross-correlation among different acceleration components, Σ_A, where DM-3c gives a smaller value than DNS. This trend is also reflected in the smoother transition observed in the limit τ → 0 for the single- and mixed-component flatness in Figs. 4b,c. Furthermore, it is important to highlight the ability of both DM-1c and DM-3c to break the deadlock of viscous intermittency by reproducing the dip structure in the local scaling exponents shown in Fig. 5 in the range τ ∼ τ_η. Fig. 6 shows how DM generation improves multiscale statistics as training progresses. We also evaluated another prominent generative model, the Wasserstein GAN, for this task. Despite efforts to train and select the best-performing model, its accuracy was only satisfactory at large and intermediate scales, and it failed considerably at smaller time scales. Further details can be found in the Supplementary Material. (Table II reports quantities related to both velocity and acceleration for DNS, DM-1c and DM-3c; in the last two expressions the summation runs only over ij = xy, xz and yz. Fig. 5b collects state-of-the-art DNS [84][85][86][87][88] and experimental data [3,89,90], redrawn from Fig. 1 of [33]; the dotted horizontal lines represent the non-intermittent dimensional scaling.)

Generalizability. Having AI models capable of generating high-quality trajectories can considerably increase the availability of well-validated synthetic data for pre-training physical applications based on Lagrangian single-particle dispersion. Even more surprisingly, our DM model shows the ability to generate trajectories with extremely intense events, thus generalizing beyond the information absorbed during the training phase while still preserving realistic statistical properties. This is clearly illustrated by the striking observation of the extended tails of the PDFs measured from the larger dataset generated by the DM, compared to those measured from the smaller set of training data, as shown in Fig. 1a and Fig. 2. Currently, our DM model is not configured to generalize to different flow configurations, such as different boundary conditions, forcing mechanisms or higher Reynolds numbers. Achieving this adaptability may require the use of a conditional diffusion model [37,53]. By integrating data composed of diverse flows and geometries, such a model could interpolate between different setups and adapt to new conditions, providing a promising avenue for future research.
Explainability. The fundamental physical model learned by the DM to generate the correct set of multi-time fluctuations remains elusive. The DM is based on nested non-linear Gaussian denoising, resembling in spirit the multiscale build-up of fluctuations used in the creation of multifractal signals and measures. The progressive enrichment of signal properties along the backward diffusion process is displayed in Fig. 3c-f. In panel e we show quantitatively the build-up of non-trivial flatness at different stages of the backward process. Similarly, but more qualitatively, panel f shows the emerging non-Gaussian and non-trivial properties within a single trajectory, transitioning from a very noisy signal (n = 0.52N) to the final step of the backward process (n = 0). Fig. 3c-f illustrates that during the generation process the model initially generates statistics at larger scales and gradually builds up statistics at smaller scales. Decrypting this multiscale process in terms of a precise non-linear mapping could lead to important discoveries in our phenomenological understanding of turbulence. A promising approach to enhance the interpretability of the model is to factorize the data with a wavelet decomposition and implement DMs to synthesize the wavelet coefficients, conditioning on the low-frequency ones [67].
Impact. Synthetic stochastic generative models offer remarkable advantages. They (i) provide access to open data without copyright or ethical issues connected to real-data usage, and (ii) enable the production of high-quality and high-quantity datasets, which can be used to train other models that require such data as input. The ultimate goal is to provide synthetic datasets that enable new models for downstream applications to reach enhanced accuracy, replacing the necessity for real-data pre-training with synthetic pre-training. Our study opens the way for addressing many questions for which the use of real Lagrangian trajectories requires an unfeasible computational or experimental effort. These questions include the relative dispersion problem between two or more particles to study Richardson diffusion [68,69], shape dynamics [70,71], data augmentation of datasets for drifter trajectories in specific oceanic applications [72,73], generation and classification of inertial particle trajectories [8], and data inpainting [48].

IV. METHODS

Navier-Stokes simulations for Lagrangian tracers.
We solve the 3D Navier-Stokes equations

∂_t u + u · ∇u = −∇p + ν∇²u + F,  ∇ · u = 0,    (6)

for an incompressible fluid of viscosity ν [17]. The flow is driven to a non-equilibrium statistically steady state by a homogeneous and isotropic forcing, F, obtained via a second-order Ornstein-Uhlenbeck process [18]. For the DNS of the Eulerian field, we used a standard pseudospectral solver, fully dealiased with the two-thirds rule. Details on the simulation can be found in [74]. The parameters of the DNS used in this work are given in Table I. The database of Lagrangian trajectories used in this study is dumped every dt_s = 15 dt ≃ 0.1τ_η [75]. Lagrangian integration of tracers is performed with a 6th-order B-spline interpolation scheme to obtain the fluid velocity at the particle position and with a second-order Adams-Bashforth time-marching scheme [76].
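The two-thirds dealiasing mentioned above amounts to zeroing all Fourier modes with |k| > N/3 before evaluating the quadratic nonlinear term. A minimal 1D sketch (the actual solver is 3D; this mask is purely illustrative):

```python
import numpy as np

def dealias_mask(N):
    """Two-thirds-rule mask: keep Fourier modes with |k| <= N/3 (1D sketch)."""
    k = np.fft.fftfreq(N, d=1.0 / N)   # integer wavenumbers 0..N/2-1, -N/2..-1
    return np.abs(k) <= N // 3

# Applied to u-hat before forming products, the mask removes the aliasing
# errors generated by the quadratic term of the NSE.
mask = dealias_mask(128)
```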
Diffusion Models. The specific implementation of DM utilized in this work is based on recent research [37], which demonstrated extremely good performance of DMs even in comparison with GANs for image synthesis. The network architecture, depicted in Fig. 3, relies on the typical UNet structure [77], which is commonly used for image-analysis tasks, as it is designed to capture both high-level contextual information and precise spatial detail. The UNet consists of two primary components: a contracting and an expanding path. Acting as an encoder, the contracting path progressively reduces the spatial dimension of the input data while increasingly extracting abstract features that contain the global context of the input data. The expanding path acts as a decoder, interpreting the learned features and systematically recovering the spatial resolution to generate the final output (see the later section DM architecture and Noise schedule and Fig. 3 for more details).
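A toy bookkeeping of the contracting path helps fix ideas: with K = 2000 input points per trajectory (Table I) and five levels, each Downsample halving the 1D resolution, the levels land exactly on the resolutions at which attention is later applied. Note that this level-to-resolution mapping is our inference from the numbers quoted in the text, not code from the released repository:

```python
# Contracting-path bookkeeping: each Downsample halves the 1D resolution.
K = 2000                                   # points per trajectory (Table I)
levels = 5                                 # encoder/decoder levels
resolutions = [K // 2**l for l in range(levels)]
print(resolutions)                         # [2000, 1000, 500, 250, 125]

# Multi-head attention (4 heads) acts only at the two coarsest resolutions.
attention_resolutions = [r for r in resolutions if r in (250, 125)]
```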
Training algorithm. We train two different classes of DM: one to generate a single component of the Lagrangian velocity field (DM-1c) and one for the three components simultaneously (DM-3c). Let us denote each entire trajectory as V = {V(t_k)}, where t_k = k dt_s and k = 1, ..., K goes over the total number of discretized sampling times for each trajectory (see Table I). The distribution of the ground-truth trajectories obtained from DNS of the NSE is denoted as q(V). We introduce a forward noising process that starts from the ground-truth trajectory, V_0 = V, and transforms it, after N steps, into a set of trajectories identical to pure random uncorrelated Gaussian noise. This process generates latent variables V_1, ..., V_N by introducing Gaussian noise at step n with a variance β_n ∈ (0, 1) according to the following formulation:

q(V_{1:N} | V_0) = ∏_{n=1}^{N} q(V_n | V_{n−1}),    (7)

where we have introduced the shorthand notation V_{1:N} to denote the entire chain of the ensemble of noisy trajectories V_1, V_2, ..., V_N, and each step is defined as

q(V_n | V_{n−1}) = N(V_n; √(1 − β_n) V_{n−1}, β_n I).    (8)

Eq. (7) is obtained using the Markovian property of the n steps in the forward process. For a large enough N and a suitable sequence of β_n, the latent vector V_N ∼ N(0, I) approximates a delta-correlated Gaussian signal with zero mean and unitary variance. A second remarkable property of the above process, which follows from the Gaussian property of the noise introduced at each step (8), is that, given V_0, we can sample V_n at any arbitrary n in closed form: defining α_n := 1 − β_n and ᾱ_n := ∏_{i=0}^{n} α_i, we have

q(V_n | V_0) = N(V_n; √ᾱ_n V_0, (1 − ᾱ_n) I).    (9)

In other words, starting from any ground-truth trajectory, V_0, we can evaluate its corresponding realization after n steps in the forward process as

V_n = √ᾱ_n V_0 + √(1 − ᾱ_n) ϵ,    (10)

where ϵ ∼ N(0, I). Now, it is clear that if we can reverse the above process and sample from p(V_{n−1} | V_n), we will be able to generate new true samples starting from the Gaussian-noise input, p(V_N) = N(0, I). In general, the backward distribution, p(V_{n−1} | V_n), is unknown. However, in the limit of continuous diffusion (small β_n), the reverse process has the identical functional form of the forward process [56]. Since q(V_n | V_{n−1}) is a Gaussian distribution, and β_n is chosen to be small, p(V_{n−1} | V_n) will also be Gaussian. In this way, the UNet needs to model the mean μ_θ(V_n, n) and standard deviation Σ_θ(V_n, n) of the transition probabilities for all steps in the backward diffusion process:

p_θ(V_{0:N}) = p(V_N) ∏_{n=1}^{N} p_θ(V_{n−1} | V_n),    (11)

where each reverse step can be written as

p_θ(V_{n−1} | V_n) = N(V_{n−1}; μ_θ(V_n, n), Σ_θ(V_n, n)).    (12)

During training, the optimization involves minimizing the cross entropy, L_CE, between the ground-truth distribution and the likelihood of the generated data,

L_CE = −E_{q(V_0)}[log p_θ(V_0)].    (13)

However, integrating over all possible backward paths from 1 to N, and averaging over all ground-truth data, E_{q(V_0)}[..] = ∫ [..] q(V_0) dV_0, to evaluate every network update is numerically intractable. A way out is to exploit a variational lower bound, L_VLB, for the cross entropy [56]:

L_CE ≤ L_VLB = E_{q(V_{0:N})}[log (q(V_{1:N} | V_0) / p_θ(V_{0:N}))].    (14)

To make the above expression computable, the expectation value can be split into its independent steps. Consequently, it can be rewritten as a summation of several Kullback-Leibler (KL) divergences, D_KL, plus one entropy term (see details in Appendix B of [56]). In this way, L_VLB becomes

L_VLB = E_q[ D_KL(q(V_N | V_0) ∥ p_θ(V_N)) + Σ_{n=2}^{N} D_KL(q(V_{n−1} | V_n, V_0) ∥ p_θ(V_{n−1} | V_n)) − log p_θ(V_0 | V_1) ] ≡ L_N + Σ_{n=2}^{N} L_{n−1} + L_0.    (15)

The first term, L_N, can be ignored during training, as q(V_N | V_0) does not depend on the network hyperparameters and p_θ(V_N) = N(0, I) is just the Gaussian distribution. Hence, the network must minimize only the terms L_n with n < N to reproduce the entire backward diffusion process and generate correct data. At this point, the last remarkable property, which allows each term of the variational lower bound to be written in a tractable way, is that the inverse conditional probability can be calculated analytically when conditioned on a particular realization of the ground-truth data. Using Bayes' theorem, we can write

q(V_{n−1} | V_n, V_0) = q(V_n | V_{n−1}, V_0) q(V_{n−1} | V_0) / q(V_n | V_0).    (16)

All probabilities on the right-hand side of Eq. (16) describe forward steps as defined in Eq. (8) and Eq. (9). Therefore, Eq. (16) can be regarded as the product of three Gaussians, which can be rewritten as

q(V_{n−1} | V_n, V_0) = N(V_{n−1}; μ̃_n(V_n, V_0), β̃_n I),    (17)

where the standard deviation and the mean of the conditioned reverse probability are, respectively,

β̃_n = β_n (1 − ᾱ_{n−1}) / (1 − ᾱ_n),    (18)

and

μ̃_n(V_n, V_0) = [√α_n (1 − ᾱ_{n−1}) / (1 − ᾱ_n)] V_n + [√ᾱ_{n−1} β_n / (1 − ᾱ_n)] V_0.    (19)

All terms denoted by L_{n−1} in the variational lower bound are D_KL between two Gaussians, which depend only on the difference between their mean values and standard deviations. Assuming that the standard deviations of the reverse and forward processes are identical, i.e., Σ_θ = β_n I, we only need to model the mean values of the backward Gaussians. Consequently, the KL divergence simplifies to the difference between the two mean values, given by Eq. (19) and by the output of the UNet model, μ_θ(V_n, n), in Eq. (12). From this simplification, it follows that each loss term becomes

L_{n−1} = E_q[ (1 / 2β_n) ∥μ̃_n(V_n, V_0) − μ_θ(V_n, n)∥² ] + const.    (20)

Expressing V_0 in terms of V_n by inverting (10) and substituting it into (19), the mean value of the reverse conditioned probability can be rewritten as

μ̃_n = (1/√α_n) [ V_n − (β_n / √(1 − ᾱ_n)) ϵ_{V_0,n} ],    (21)

where the subscript of the noise term, ϵ_{V_0,n}, indicates that this is the specific noise realization used to obtain V_n from V_0, as defined in Eq. (10). Now, since V_n is known by the network, one may re-parameterize the predicted mean μ_θ(V_n, n) as

μ_θ(V_n, n) = (1/√α_n) [ V_n − (β_n / √(1 − ᾱ_n)) ϵ_θ(V_n, n) ],    (22)

where ϵ_θ is a function approximator designed to predict ϵ_{V_0,n} from V_n, leading to the following reformulation of the loss terms:

L_{n−1} = E_q[ (β_n / (2α_n(1 − ᾱ_n))) ∥ϵ_{V_0,n} − ϵ_θ(V_n, n)∥² ],    (23)

namely, in the training, the ϵ_θ predicted by the DM is compared with the noise used to build V_n from V_0. This formulation leads to faster and more stable training [36]. Moreover, it has been shown [36] that one can obtain good results even by performing the training without learning the variance of the reverse process and introducing a simpler, re-weighted loss function defined as

L_n^simple = E_{V_0, ϵ, n}[ ∥ϵ − ϵ_θ(√ᾱ_n V_0 + √(1 − ᾱ_n) ϵ, n)∥² ],    (24)

which is identical to the one implemented in this work. It is worth noting that, due to the Gaussian form of p_θ(V_0 | V_1), L_0 results in the same loss function as depicted in Eq. (23). Therefore, the optimized loss functions can be expressed as L_n^simple, where n ranges from 0 to N − 1.
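The closed-form forward sampling of Eq. (10) and the simple loss of Eq. (24) are straightforward to sketch numerically. Below, a hedged numpy illustration with a linear β_n schedule and a trivial stand-in for the UNet predictor ϵ_θ (none of this is the released training code):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 800                                  # number of diffusion steps
beta = np.linspace(1e-4, 0.02, N)        # illustrative linear schedule
alpha = 1.0 - beta
alpha_bar = np.cumprod(alpha)            # cumulative products of alpha_n

V0 = rng.standard_normal(2000)           # toy stand-in for a ground-truth trajectory

def forward_sample(V0, n):
    """Closed-form forward step, Eq. (10): V_n = sqrt(abar_n) V_0 + sqrt(1 - abar_n) eps."""
    eps = rng.standard_normal(V0.shape)
    return np.sqrt(alpha_bar[n]) * V0 + np.sqrt(1.0 - alpha_bar[n]) * eps, eps

def simple_loss(eps_theta, V0, n):
    """One-sample Monte-Carlo estimate of L_simple, Eq. (24)."""
    Vn, eps = forward_sample(V0, n)
    return np.mean((eps - eps_theta(Vn, n)) ** 2)

# A predictor that always outputs zeros has loss ~ Var(eps) = 1; training drives
# eps_theta towards the actual noise realization used in the forward step.
loss = simple_loss(lambda Vn, n: np.zeros_like(Vn), V0, n=400)
```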
DM architecture and Noise schedule. The UNet architecture we have implemented is one of the most advanced networks described in the literature, demonstrating state-of-the-art performance in image generation [37]. It is capable of extracting the hidden, spatially correlated information that is essential both for image generation and for accomplishing our specific task. The details of the architecture, including the hyperparameters, are summarized in the table in Fig. 3a. The encoder and decoder each consist of five levels. Progressing to the next level entails doubling or halving the resolution as one passes through an Upsample or Downsample layer, respectively. The Depth parameter controls the number of ResBlocks, with or without AttentionBlocks, at each level. Within each level, layers share the same number of features, which can be determined using the Channels and Channels-multiple parameters from the table. Attention mechanisms [78] allow neural networks to prioritize certain regions or features within the data. In this study, we employed multi-head attention with four heads. AttentionBlocks were utilized at the levels with resolutions of 250 and 125. For the DM-1c model, we used 250 × 10³ training iterations, while 400 × 10³ iterations were used for the DM-3c model. In each iteration, we sample a batch of training data, assign a random step index n to each sample, and then optimize L_n^simple across the data batch. Fig. 6 shows the training loss as a function of iteration for DM-1c, alongside the fourth-order flatness of samples generated from it at different iteration checkpoints: A, B, and C, where C corresponds to the final model. It reveals that, while the loss rapidly reaches a 'plateau', it is crucial to continue training for the model to converge. This is because ⟨L_n^simple⟩ is an average derived from a data batch where each sample is assigned a random n, which does not truly represent the inherent loss L_CE described in Eq.
(13).While L CE can be approximated as the summed expectation of L simple n across the training dataset for 0 < n ≤ N , direct evaluation of L CE is impractical.Instead, we rely on examining the statistical properties to measure training progress.Concerning the noise schedule to improve the training and sampling protocols, we explored three different laws and found that the optimal one for our application is given in terms of a tanh-profile, see Fig.
which allowed us to use N = 800 diffusion steps rather than the N = 4000 needed in the linear case, where the forward-process variances increase linearly from β_1 = 10^−4 to β_N = 0.02. As a result, a five-fold improvement in performance is achieved. We also explored an alternative noise schedule (power4) with functional form ᾱ_n = 1 − (n/N)^4 and N = 800, which proved slightly inferior to (tanh6-1). Note that applying methods to speed up DM sampling with pre-trained models remains worthy of future exploration [79,80].
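The schedules can be compared directly through their cumulative products ᾱ_n. In the NumPy sketch below, the linear and power4 forms follow the text; the tanh profile is only a hypothetical stand-in, since the exact functional form of (tanh6-1) is not given here:

```python
import numpy as np

def alpha_bar_linear(N=4000, beta_1=1e-4, beta_N=0.02):
    """Linear schedule: the betas increase linearly from beta_1 to beta_N."""
    betas = np.linspace(beta_1, beta_N, N)
    return np.cumprod(1.0 - betas)

def alpha_bar_power4(N=800):
    """'power4' schedule as quoted in the text: abar_n = 1 - (n/N)^4."""
    n = np.arange(1, N + 1)
    return 1.0 - (n / N) ** 4

def alpha_bar_tanh(N=800, a=6.0, b=1.0):
    """Hypothetical tanh ramp, normalized to run from ~1 (n ~ 1) to 0 (n = N).
    This is NOT the paper's (tanh6-1) profile, only an assumed stand-in."""
    x = a * np.arange(1, N + 1) / N - b
    return (np.tanh(a - b) - np.tanh(x)) / (np.tanh(a - b) - np.tanh(-b))

def betas_from_alpha_bar(ab):
    """Any abar_n schedule implies per-step variances beta_n = 1 - abar_n / abar_{n-1}."""
    ab_prev = np.concatenate(([1.0], ab[:-1]))
    return 1.0 - ab / ab_prev
```

The practical point of the schedule choice is visible in the shapes: a profile whose ᾱ_n decays to zero in fewer steps permits a shorter diffusion chain (N = 800 instead of 4000) at comparable sample quality.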
Computational cost. To illustrate the computational cost of our approach: the DNS of the Eulerian field takes about 7.2 hours on 4096 cores.

V. DATA AVAILABILITY

The Lagrangian trajectories used in this study, which include the positions, velocities, and accelerations of each particle, are available for download from the open-access Smart-TURB portal http://smart-turb.roma2.infn.it, in the TURB-Lagr repository [74,75].
It is also possible to download from the same repository a minimal dataset for testing the code, as well as the generated Lagrangian trajectories (velocities over time) used for all analyses in this paper. TURB-Lagr is a newly developed database of 3D turbulent Lagrangian trajectories obtained by DNS of the Navier-Stokes equations (NSE) with homogeneous and isotropic forcing. Details on how to download and read the database are also given on the portal. All data related to this study have also been uploaded to the Open Access Repository [81].

VI. CODE AVAILABILITY
The code to train the DM model and generate new trajectories can be found at https://github.com/SmartTURB/diffusion-lagr [82]. A ready-to-run Code Ocean capsule with the complete environment is available at https://codeocean.com/capsule/0870187/tree/v1 [83].

FIG. 1. Comparison between direct numerical simulations (DNS) and diffusion models (DMs). a, Standardized probability density functions (PDFs) of one generic component of the velocity increment, δ_τ V_i, at τ/τ_η = 1, 2, 5, 100 for ground-truth DNS data (black lines), synthetically generated data from DM-1c (blue lines with circles), and data from DM-1c-10% (green lines with squares), a DM-1c model trained on 10% of the DNS data. PDFs for different τ are vertically shifted for the sake of presentation. b,c,d, DM-1c trajectories for one generic velocity component with large, medium, and small time increments, τ/τ_η = 100, 5, 1, respectively. e, Comparison of 3D trajectories showing small-scale vortex structures, for both DNS and DM-3c data, where different curves correspond to the three standardized velocity components, i = x, y, z. For the DNS, the highly oscillatory correlations between the three components are consistent with the presence of strong vortical structures. Similarly, in the case of DM-3c, these correlations can be interpreted as reflecting vortical structures within the hypothetical Eulerian flow. f, Examples of 3D trajectories reconstructed from DNS (bottom) and DM-3c (top). Notice in panel a the remarkable generalizability of our data-driven DM, able to explore and capture extreme events for velocity fluctuations with far larger intensities than observed in the DNS dataset, represented by much more extended tails, while still maintaining the ground-truth statistics inherent in the training data. Here, the statistics for DM-1c and DM-1c-10% data are derived from 86 and 22 times the number of trajectories in the DNS, respectively.
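As an illustration of how such increment PDFs are assembled from trajectory data, the NumPy sketch below computes the standardized PDF of δ_τ V from an ensemble of single-component signals. The function name, binning choices, and the Gaussian random-walk toy signals are our assumptions for illustration; they are not the paper's analysis code and do not reproduce turbulent statistics.

```python
import numpy as np

def standardized_increment_pdf(v, tau, bins=100):
    """PDF of the velocity increment delta_tau V = V(t + tau) - V(t),
    standardized to zero mean and unit variance (as plotted in panel a).
    v: array of shape (n_traj, n_times); tau: time lag in samples."""
    dv = v[:, tau:] - v[:, :-tau]          # all increments, all trajectories
    dv = (dv - dv.mean()) / dv.std()       # standardize
    hist, edges = np.histogram(dv, bins=bins, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers, hist

# Toy usage on synthetic (non-turbulent) random-walk signals.
rng = np.random.default_rng(1)
v = rng.standard_normal((64, 2000)).cumsum(axis=1)
x, pdf = standardized_increment_pdf(v, tau=10)
```

For real Lagrangian data the same routine would simply be applied to the DNS or DM velocity signals at each lag τ/τ_η of interest.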
τ is calculated by further averaging over all velocity components. Henceforth, we neglect the dependence on the velocity component because of isotropy. Concerning the scaling exponents, ξ(p), there exists a whole spectrum of anomalous corrections, ∆(p), to the mean-field dimensional estimate, p/2, leading to ξ(p) = p/2 + ∆(p). Furthermore, beyond global scaling laws, the statistics of velocity fluctuations can be quantitatively captured scale-by-scale, for each τ, by measuring the local scaling exponents, which are obtained from the logarithmic derivatives of S^(p)_τ:

ζ(p, τ) = [d log S^(p)_τ / d log τ] / [d log S^(2)_τ / d log τ].
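A minimal sketch of how such local exponents could be evaluated from measured structure functions (NumPy; the function and the pure power-law self-check are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def local_exponents(taus, S_p, S_2):
    """Local scaling exponents zeta(p, tau) = d log S^(p)_tau / d log S^(2)_tau,
    evaluated as the ratio of the two logarithmic derivatives."""
    dlogSp = np.gradient(np.log(S_p), np.log(taus))
    dlogS2 = np.gradient(np.log(S_2), np.log(taus))
    return dlogSp / dlogS2

# Self-check on pure (non-intermittent) power laws S^(p)_tau ~ tau^(p/2):
# the local exponent zeta(4, tau) should then equal (4/2)/(2/2) = 2 at all tau.
taus = np.logspace(0, 2, 50)
S2, S4 = taus ** 1.0, taus ** 2.0
zeta4 = local_exponents(taus, S4, S2)
```

On real (intermittent) data, ζ(p, τ) deviates from p/2 in a scale-dependent way, which is exactly what this diagnostic is designed to expose.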

FIG. 3. Illustration of the DM model and in-depth examination of its backward generation process. a, Schematic representation of the DM model and associated UNet sketch, complemented by a table of hyperparameters. Here, N denotes the total number of diffusion steps and n denotes the intermediate step. More details on the network architecture can be found in the Methods section and in [37]. b, Three distinct noise schedules for the DM's forward and backward processes explored in this study (see Methods). Points A-D indicate four different stages during the backward generation process (from V_N to V_0) along the optimal noise schedule, curve (tanh6-1). At an early step during the backward process we have very noisy signals, n = 0.52N (D), followed by two intermediate steps at n = 0.27N (C) and n = 0.06N (B), and the final synthetic trajectory obtained for n = 0 (A). Please see panel f for the corresponding illustration of one trajectory generation from D to A. A few statistical properties of the DM-1c signals generated at the four backward steps A-D: c, PDF of δ_τ V_i for τ = τ_η; d, second-order structure function, S^(2)_τ.

We compute ζ(p, τ) by first computing d log S^(p)_τ / d log τ and d log S^(2)_τ / d log τ, and then taking their ratio.

FIG. 4. Multiscale statistical properties of velocity increments. a, Log-log plot of Lagrangian structure functions, S^(p)_τ, for p = 2, 4, and 6, compared across DNS, DM-1c, and DM-3c. b, Log-log plot of the generalized flatness, F^(p)_τ, for p = 4, 6, and 8, compared across DNS, DM-1c, and DM-3c. c, Log-log plot of the 4th-order mixed flatness, F^(4,ij)_τ, averaged over the combinations ij = xy, xz, and yz for both DNS and DM-3c. Error bars are computed as the min-max range over the fluctuations of 10 different independent batches sub-sampled from the N_p trajectories for each velocity component. Error bars may appear smaller than the data points.
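The quantities entering this figure can be sketched as follows (a NumPy illustration under stated assumptions: the function names, the batch-splitting choice, and the Gaussian random-walk toy data are ours, used only to spell out S^(p)_τ = ⟨|δ_τ V|^p⟩, the generalized flatness F^(p)_τ = S^(p)_τ/[S^(2)_τ]^(p/2), and the min-max error-bar recipe):

```python
import numpy as np

def structure_function(v, tau, p):
    """Lagrangian structure function S^(p)_tau = <|delta_tau V|^p>."""
    dv = v[:, tau:] - v[:, :-tau]
    return np.mean(np.abs(dv) ** p)

def flatness(v, tau, p=4):
    """Generalized flatness F^(p)_tau = S^(p)_tau / [S^(2)_tau]^(p/2)."""
    return structure_function(v, tau, p) / structure_function(v, tau, 2) ** (p / 2)

def minmax_errorbar(v, tau, p=4, n_batches=10):
    """Min-max spread of the flatness over independent sub-sampled batches
    of trajectories, mirroring the error-bar recipe described in the caption."""
    vals = [flatness(b, tau, p) for b in np.array_split(v, n_batches)]
    return min(vals), max(vals)

# Toy usage: Gaussian-increment signals, whose 4th-order flatness is ~3.
rng = np.random.default_rng(2)
v = rng.standard_normal((100, 2000)).cumsum(axis=1)
lo, hi = minmax_errorbar(v, tau=5)
```

Turbulent signals would instead show F^(4)_τ growing well above the Gaussian value 3 as τ decreases toward τ_η, which is the intermittency signature plotted in panel b.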

FIG. 5. Statistics and error bars in panel a are derived as in Fig. 4; this resulted in 30 batches for DNS and DM-3c, and 10 batches for DM-1c. The error bars in panel b are computed solely over the three different velocity components.

FIG. 6. DM training protocol. The training loss function, ⟨L_n^simple⟩, against iterations for DM-1c. Here, ⟨·⟩ represents the average over a batch of training data, each element of which is assigned a random step n with 0 ≤ n ≤ N. The inset presents the fourth-order flatness obtained from DM-1c at different iterations (A: 10 × 10^3, B: 30 × 10^3, C: 250 × 10^3), in comparison with that from DNS data. Statistics and error bars are derived as in Fig. 4.

TABLE II. Single-time second-order correlations.
This step is essential even to generate a single Lagrangian trajectory. An additional 64% of that time is required to track the 4 million Lagrangian tracers. All training and sampling of the DM models in our study was performed on 4 NVIDIA A100 GPUs. Training takes approximately 1 hour per 10,000 iterations, resulting in approximately 25 hours for DM-1c and 40 hours for DM-3c. Sampling an equivalent number of 4 million trajectories takes about 200 hours.