h-Analysis and data-parallel physics-informed neural networks

We explore the data-parallel acceleration of physics-informed machine learning (PIML) schemes, with a focus on physics-informed neural networks (PINNs) on multiple graphics processing unit (GPU) architectures. In order to develop scale-robust and high-throughput PIML models for sophisticated applications that may require a large number of training points (e.g., involving complex and high-dimensional domains, non-linear operators or multi-physics), we detail a novel protocol based on h-analysis and data-parallel acceleration through the Horovod training framework. The protocol is backed by new convergence bounds for the generalization error and the train-test gap. We show that the acceleration is straightforward to implement, does not compromise training, and proves to be highly efficient and controllable, paving the way towards generic scale-robust PIML. Extensive numerical experiments of increasing complexity illustrate its robustness and consistency, offering a wide range of possibilities for real-world simulations.


Introduction
Simulating physics through accurate surrogates is a hard task for engineers and computer scientists. Numerical methods such as finite element methods, finite difference methods and spectral methods can be used to approximate the solution of partial differential equations (PDEs) by representing it in a finite-dimensional function space, delivering an approximation to the desired solution or mapping 1.
Real-world applications often incorporate partial information of the physics together with observations, which can be noisy. This hints at data-driven solutions through machine learning (ML) techniques. In particular, deep learning (DL) 2,3 principles have been praised for their good performance, granted by the capability of deep neural networks (DNNs) to approximate high-dimensional and non-linear mappings and to offer great generalization with large datasets. Furthermore, the exponential growth of GPU capabilities has made it possible to implement ever larger DL models.
Recently, a novel paradigm called physics-informed machine learning (PIML) 4 was introduced to bridge the gap between data-driven 5 and physics-based 6 frameworks. PIML enhances the capability and generalization power of ML by adding prior information on physical laws to the scheme, restricting the output space (e.g., via additional constraints or a regularization term). This simple yet general approach was applied successfully to a wide range of complex real-world applications, including structural mechanics 7,8 and biological, biomedical and behavioral sciences 9.
In particular, physics-informed neural networks (PINNs) 10 apply PIML by means of DNNs. They encode the physics in the loss function and rely on automatic differentiation (AD) 11. PINNs have been used to solve inverse problems 12, stochastic PDEs 13,14, complex applications such as the Boltzmann transport equation 15 and large-eddy simulations 16, and to perform uncertainty quantification 17,18.
Concerning the challenges faced by the PINNs community, efficient training 19, proper hyper-parameter setting 20, and scaling PINNs 21 are of particular interest. Regarding the latter, two research areas are gaining attention.
First, it is important to understand how PINNs behave for an increasing number of training points N (or equivalently, for a suitable bounded and fixed domain, a decreasing maximum distance between points h). Throughout this work, we refer to this study as h-analysis: the analysis of the number of training data needed to obtain a stable generalization error. In their pioneering works 22,23, Mishra and Molinaro provided a bound for the generalization error with respect to N for data-free and unique continuation problems, respectively. More precise bounds have been obtained using characterizations of the DNN 24.
Second, PINNs are typically trained over graphics processing units (GPUs), which have limited memory capabilities. To ensure models scale well with increasingly complex settings, two paradigms emerge: data-parallel and model-parallel acceleration. The former splits the training data over different workers, while the latter distributes the model weights. However, general DL backends do not readily support multi-GPU acceleration. To address this issue, Horovod 25 is a distributed training framework specifically designed for DL, featuring a ring-allreduce algorithm 26 and implementations for TensorFlow, Keras and PyTorch.
As model size becomes prohibitive, domain decomposition-based approaches allow for distributing the computational domain. Examples of such approaches include conservative PINNs (cPINNs) 27, extended PINNs (XPINNs) 28,29, and distributed PINNs (DPINNs) 30. cPINNs and XPINNs were compared in 31. These approaches are compatible with data-parallel acceleration within each subdomain. Additionally, a recent review concerning distributed PIML 21 is also available. Regarding existing data-parallel implementations, TensorFlow MirroredStrategy in TensorDiffEq 32 and NVIDIA Modulus 33 should be mentioned. However, to the authors' knowledge, there is no systematic study of the background of data-parallel PINNs and their implementation.
In this work, we present a procedure to attain efficient data-parallel PINNs. It relies on h-analysis and is backed by a Horovod-based acceleration. Concerning h-analysis, we observe that PINNs exhibit three phases of behavior as a function of the number of training points N: 1. A pre-asymptotic regime, where the model does not learn the solution due to missing information; 2. A transition regime, where the error decreases with N; 3. A permanent regime, where the error remains stable.
To illustrate this, Fig. 1 presents the relative L2 error distribution with respect to N_f (the number of domain collocation points) for the forward "1D Laplace" case. The experiment was conducted over 8 independent runs with a learning rate of 10^-4 and 20000 iterations of the ADAM algorithm 34. The transition regime, where variability in the results is high and some models converge while others do not, lies between N_f = 64 and N_f = 400. For more information on the experimental setting and the definition of the precision ρ, please refer to "1D Laplace". Figure 1. Error v/s the number of domain collocation points N_f for the "1D Laplace" case. A pre-asymptotic regime (pink) is followed by a rapid transition regime (blue), eventually leading to a permanent regime (green). This transition occurs over a few extra training points.
Building on the empirical observations, we use the setting in 22,23 to supply a rigorous theoretical background to h-analysis. One of the main contributions of this manuscript is the bound on the "Generalization error for generic PINNs", which allows for a simple analysis of the h-dependence. Furthermore, this bound is accompanied by a practical "Train-test gap bound", supporting regime detection.
To summarize the latter results, a simple yet powerful recipe for any PIML scheme could be: 1. Choose the right model and hyper-parameters to achieve a low training loss; 2. Use enough training points N to reach the permanent regime (e.g., such that the training and test losses are similar).
Any practitioner strives to reach the permanent regime for their PIML scheme, and we provide the necessary details for an easy implementation of Horovod-based data acceleration for PINNs, with direct application to any PIML model. Fig. 2 (left) further illustrates the scope of data-parallel PIML. For the sake of clarity, Fig. 2 (right) supplies a comprehensive review of important notations defined throughout this manuscript, along with their corresponding introductions.
Next, we apply the procedure to increasingly complex problems and demonstrate that Horovod acceleration is straightforward, using the pioneering PINNs code of Raissi as an example. Our main practical findings concerning data-parallel PINNs for up to 8 GPUs are the following:
• They do not require modifying the hyper-parameters;
• They show training convergence similar to the 1 GPU case;
• They lead to high efficiency for both weak and strong scaling (e.g., E_ff > 80% for the Navier-Stokes problem with 8 GPUs).
This work is organized as follows: in "Problem formulation", we introduce the PDEs under consideration, PINNs and convergence estimates for the generalization error. We then move to "Data-parallel PINNs" and present "Numerical experiments". Finally, we close this manuscript in "Conclusion".

General notation
Throughout, vectors and matrices are expressed using bold symbols. For a natural number k, we set N_k := {k, k + 1, . . .}. We denote by N a spatio-temporal differential operator, B the boundary conditions (BCs) operator, λ the material parameters (the latter being unknown for inverse problems), and u(x,t) ∈ R^m for any m ∈ N_1. Accordingly, for any function û defined over D, we introduce and define the residuals ξ_v for each v ∈ Λ and any observation function u_obs:

PINNs
Following 20,35, let σ be a smooth activation function. Given an input (x,t) ∈ R^{d+1}, we define NN_θ as an L-layer feed-forward neural network with W_0 = d + 1, W_L = m and W_l neurons in the l-th layer for 1 ≤ l ≤ L − 1. For constant-width DNNs, we set W_l = W. For 1 ≤ l ≤ L, let us denote the weight matrix and bias vector in the l-th layer by W^l and b^l. This results in the representation z_L(x,t), with θ the (trainable) parameters, or weights, in the network. We set Θ = R^{|Θ|}. Application of PINNs to Eq. (1) yields the approximation u_θ(x,t) = z_L(x,t).
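As an illustration of this representation, the following minimal NumPy sketch builds an L-layer feed-forward network with hypothetical widths and a tanh activation; the Glorot-style uniform initialization mirrors the one used later in "Methodology". It is a schematic stand-in, not the paper's implementation.

```python
import numpy as np

def init_params(widths, seed=0):
    # widths = [W_0, W_1, ..., W_L]; Glorot-style uniform initialization
    rng = np.random.default_rng(seed)
    params = []
    for w_in, w_out in zip(widths[:-1], widths[1:]):
        bound = np.sqrt(6.0 / (w_in + w_out))
        params.append((rng.uniform(-bound, bound, (w_in, w_out)),
                       np.zeros(w_out)))
    return params

def forward(params, xt, sigma=np.tanh):
    # xt: batch of inputs (x, t) with shape (batch, W_0 = d + 1)
    z = xt
    for W, b in params[:-1]:
        z = sigma(z @ W + b)      # hidden layers: affine map followed by activation
    W, b = params[-1]
    return z @ W + b              # linear output layer: z_L(x, t) = u_theta(x, t)
```

For instance, `forward(init_params([2, 5, 5, 1]), xt)` realizes a network with d + 1 = 2 inputs, two hidden layers of width 5 and a scalar output.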
We introduce the training dataset: to each training point for v ∈ Λ we associate a quadrature weight w_v^i > 0. All throughout this manuscript, we set: Note that M (resp. N) represents the amount of information for the data-driven (resp. physics) part, by virtue of the PIML paradigm (refer to Fig. 2). The network weights θ in Eq. (5) are trained (e.g., via the ADAM optimizer 34) by minimizing the weighted loss: We seek to obtain: The formulation for PINNs addresses the cases with no data (i.e., M = 0) or no physics (i.e., N = 0), thus exemplifying the PIML paradigm. Furthermore, it is able to handle time-independent operators with only minor changes; a schematic representation of a forward time-independent PINN is shown in Fig. 3. Our setting assumes that the material parameters λ are known. If M > 0, one can solve the inverse problem by seeking: Similarly, unique continuation problems 23, which assume incomplete information for f, g and h, are solved through PINNs without changes. Indeed, "2D Navier-Stokes" combines a unique continuation problem with unknown parameters λ_1, λ_2.

Automatic Differentiation
We aim to give further details about back-propagation algorithms and their dual role in the context of PINNs: 1. Training the DNN by calculating ∂L_θ/∂θ; 2. Evaluating the partial derivatives in N[u_θ(x,t); λ] and B[u_θ(x,t); λ] so as to compute the loss L_θ.
They consist of a forward pass to evaluate the output u_θ (and L_θ), and a backward pass to assess the derivatives. To further elucidate back-propagation, we reproduce the informative diagram from 11 in Fig. 4. TensorFlow includes reverse-mode AD by default. Its cost is bounded with respect to |Θ| for scalar-output NNs (i.e., for m = 1). The application of back-propagation (and reverse-mode AD in particular) to any training point is independent of other information, such as neighboring points or the volume of training data. This allows for data-parallel PINNs. Before detailing its implementation, we justify the h-analysis through an abstract theoretical background.
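To make the forward and backward passes of Fig. 4 concrete, the toy scalar reverse-mode AD sketch below (a minimal stand-in for TensorFlow's AD, not the paper's implementation) evaluates y = sin(w · x) in a forward pass and then propagates the adjoint back to obtain both ∂y/∂x and ∂y/∂w in a single backward pass.

```python
import math

class Var:
    """Minimal scalar reverse-mode AD node: value, accumulated gradient,
    and parent nodes with their local derivatives."""
    def __init__(self, value, parents=()):
        self.value, self.grad, self.parents = value, 0.0, parents

    def backward(self, seed=1.0):
        # Propagate the adjoint from the output back towards the inputs.
        self.grad += seed
        for parent, local_grad in self.parents:
            parent.backward(seed * local_grad)

def mul(a, b):
    # d(ab)/da = b, d(ab)/db = a
    return Var(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    # d(sin a)/da = cos a
    return Var(math.sin(a.value), [(a, math.cos(a.value))])

# Forward pass: y = sin(w * x); backward pass yields dy/dx = w cos(wx), dy/dw = x cos(wx)
x, w = Var(2.0), Var(3.0)
y = sin(mul(w, x))
y.backward()
```

After `y.backward()`, `x.grad` holds 3 cos(6) and `w.grad` holds 2 cos(6), matching the chain rule applied by hand.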

Convergence estimates
To better understand how PINNs scale with N, we follow the method in 22,23 under a simple setting, allowing us to control the data and physics counterparts in PIML. Set s ≥ 0 and define the spaces: We assume that Eq. (1) can be recast as: We suppose that Eq. (11) is well-posed and that for any u, v ∈ X, there holds: The latter is a stability estimate, allowing us to control the total error by means of a bound on the PINNs residual. The residuals in Eq. (3) are: From the expression of the residuals, we are interested in approximating the integrals: We assume that we are provided with quadratures: for weights w_D^i, w_u^i and quadrature points τ_D^i, τ_u^i ∈ D such that for α, β > 0: For any ω_u > 0, the loss is defined as follows: with ε_{T,D} and ε_{T,u} the training errors for collocation points and observations, respectively. Notice that application of Eq. (17) to ξ_{D,θ} and ξ_{u,θ} yields: We seek to quantify the generalization error: We now detail a new result concerning the generalization error for PINNs.
Theorem 1 (Generalization error for generic PINNs) Under the presented setting, there holds that: with μ := ∥u − u obs ∥ X .
The novelty of Theorem 1 is that it describes the generalization error for a simple case involving collocation points and observations. It states that the PINN generalizes well as long as the training error is low and sufficient training points are used. To make the result more intuitive, we rewrite Eq. (21), with ∼ expressing the terms up to positive constants:

The generalization error depends on the training errors (which are tractable during training), the parameters N and M, and the bias μ.
Returning to h-analysis, we now have theoretical grounding for the three regimes presented in the "Introduction". Let us assume that μ = 0. For small values of N or M, the bound in Theorem 1 is too high to yield a meaningful estimate. Subsequently, the convergence behaves as max(N^{−α/2}, M^{−β/2}), marking the transition regime. It is paramount for practitioners to reach the permanent regime when training PINNs, giving ground to data-parallel PINNs.
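The three regimes can be illustrated numerically. The sketch below evaluates a schematic version of the bound in Theorem 1 with μ = 0; the training error ε_T, the constant C and the quadrature rates α, β are hypothetical placeholders, not values from the paper.

```python
def generalization_bound(N, M, eps_T=1e-3, alpha=2.0, beta=2.0, C=1.0):
    # Schematic bound: training error plus the slowest quadrature term.
    # For small N, M the second term dominates (pre-asymptotic/transition);
    # once it falls below eps_T, the bound plateaus (permanent regime).
    return eps_T + C * max(N ** (-alpha / 2), M ** (-beta / 2))

Ns = [4, 16, 64, 256, 1024]
bounds = [generalization_bound(N, N) for N in Ns]
```

With these placeholder rates the bound decays like max(N^{-1}, M^{-1}) until it saturates at the training error, mimicking the transition into the permanent regime.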
In general applications, the exact solution u is not available. Moreover, it is relevant to determine whether N is large enough. To this end, we introduce a testing (or validation) set of the same cardinality. Interestingly, the entire analysis above and Theorem 1 remain valid for another set of testing points, with the testing errors ε_{V,D} and ε_{V,u} set as in Eq. (18). The train-test gap, which is tractable, can be quantified as follows.
Theorem 2 (Train-test gap bound) Under the presented setting, there holds that: Proof 2 Consider the setting of Theorem 2. For v ∈ {D, u} and • ∈ {Y, X}, there holds, by Eq. (19):
The bound in Theorem 2 is valuable as it allows one to assess the quadrature error convergence, and hence the regime, with respect to the number of training points.

Data-distribution and Horovod
In this section, we present the data-parallel distribution for PINNs. Let us set size ∈ N_1 and define the ranks (or workers) rank ∈ {0, . . . , size − 1}, each rank generally corresponding to a GPU. Data-parallel distribution requires the appropriate partitioning of the training points across ranks.
We introduce N_1, M_1 ∈ N_1 collocation points and observations, respectively, for each rank (e.g., a GPU), yielding: The data-parallel approach is as follows: we send the same synchronized copy of the DNN NN_θ defined in Eq. (4) to each rank. Each rank evaluates the loss L_θ^rank and the gradient ∇_θ L_θ^rank. The gradients are then averaged using an all-reduce operation, such as the ring all-reduce implemented in Horovod 26,36, which is known to be bandwidth optimal with respect to the number of ranks 36. The process is illustrated in Figure 5 for size = 4. The ring-allreduce algorithm involves each of the size nodes communicating with two of its peers 2 × (size − 1) times 26.
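The gradient-averaging step can be sketched as follows. This toy function only reproduces the semantics of the ring all-reduce (every rank ends up with the mean gradient) and its per-node communication count; it does not implement the actual chunked scatter-reduce and all-gather exchanges of Horovod.

```python
import numpy as np

def ring_allreduce_average(grads):
    """Simulate the outcome of a ring all-reduce over per-rank gradients.

    grads: list of gradient arrays, one per worker (rank).
    Returns the averaged gradient (identical on every rank after the
    all-reduce) and the number of communication steps each node performs,
    2 * (size - 1), as in the ring algorithm."""
    size = len(grads)
    avg = np.mean(grads, axis=0)      # all-reduce semantics: sum over ranks / size
    steps_per_node = 2 * (size - 1)   # scatter-reduce phase + all-gather phase
    return avg, steps_per_node
```

For size = 4 ranks, each node exchanges messages with its two ring neighbors 6 times, after which all ranks hold the same averaged gradient used for the synchronized weight update.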
It is noteworthy that data generation for data-free PINNs (i.e., with M = 0) requires no modification to existing codes, provided that each rank has a different seed for random or pseudo-random sampling. Horovod allows applying data-parallel acceleration with minimal changes to existing code. Moreover, our approach and Horovod can easily be extended to multiple computing nodes. As pointed out in the "Introduction", Horovod supports popular DL backends such as TensorFlow, PyTorch and Keras. In Listing 1, we demonstrate how to integrate data-parallel distribution using Horovod with a generic PINNs implementation in TensorFlow 1.x. The highlighted changes in pink show the steps for incorporating Horovod, which include: (i) initializing Horovod; (ii) pinning available GPUs to specific workers; (iii) wrapping the optimizer with the Horovod distributed optimizer; and (iv) broadcasting the initial variables from the master rank (rank = 0).

Weak and strong scaling
Two key concepts in data distribution paradigms are weak and strong scaling, which can be explained as follows: weak scaling involves increasing the problem size proportionally with the number of processors, while strong scaling involves keeping the problem size fixed and increasing the number of processors. To reformulate: • Weak scaling: each worker has (N_1, M_1) training points, and we increase the number of workers size; • Strong scaling: we fix a total number of (N_1, M_1) training points, and we split the data over an increasing number of workers size.
We portray weak and strong scaling in Fig. 6 for a data-free PINN with N_1 = 16. Each box represents a GPU, colored by its number of collocation points. On the left of each scaling option, we present the unaccelerated case. Finally, we introduce the training time t_size for size workers. This allows us to define the efficiency and speed-up as:
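Since the displayed formulas are not reproduced here, the sketch below assumes the common conventions: for weak scaling the ideal time is constant, so E_ff = t_1/t_size and the speed-up is size × E_ff; for strong scaling the ideal time is t_1/size, so the speed-up is t_1/t_size and E_ff is the speed-up divided by size. These definitions are our assumption, not a quotation of the paper's equation.

```python
def weak_scaling_metrics(t1, t_size, size):
    # Weak scaling: per-rank workload fixed, ideal time is constant (t_size = t1).
    efficiency = t1 / t_size
    speedup = size * efficiency       # effective throughput gain over one worker
    return efficiency, speedup

def strong_scaling_metrics(t1, t_size, size):
    # Strong scaling: total workload fixed, ideal time is t1 / size.
    speedup = t1 / t_size
    efficiency = speedup / size
    return efficiency, speedup
```

For example, a weak-scaling run that takes 5 s on 8 workers versus 4 s on one worker yields an efficiency of 0.8 and a speed-up of 6.4, in line with the magnitudes reported in the experiments below.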
For each case, we perform an h-analysis followed by Horovod data-parallel acceleration, which is applied to the domain training points (and observations for the Navier-Stokes case). Boundary loss terms are negligible due to the sufficient number of boundary data points.

Methodology
We perform simulations in single float precision on an AMAX DL-E48A AMD Rome EPYC server with 8 Quadro RTX 8000 Nvidia GPUs, each one with 48 GB of memory. We use a Docker image of Horovod 0.26.1 with CUDA 12.1, Python 3.6.9 and TensorFlow 2.6.2. All throughout, we use tensorflow.compat.v1 as a backend without eager execution. All the results are available in the HorovodPINNs GitHub repository and fully reproducible, also ensuring compliance with FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management and stewardship 37. We run experiments 8 times with seeds defined as seed + 1000 × rank, in order to obtain rank-varying training points. For domain points, Latin hypercube sampling is performed with pyDOE 0.3.8. Boundary points are defined over uniform grids.
We use Glorot uniform initialization [3, Chapter 8]. "Error" refers to the relative L2 error taken over T_test, and "Time" stands for the training time in seconds. For each case, the loss in Eq. (7) uses unit weights ω_v = 1 and the Monte-Carlo quadrature rule w_v^i = 1/N_v for v ∈ Λ. Also, we denote by vol(D) the volume of the domain D and define the precision ρ := (N_f /vol(D))^{1/(d+1)}.

We introduce t_k, the time to perform k iterations. The number of training points processed per second is then as follows: For the sake of simplicity, we summarize the parameters and hyper-parameters for each case in Table 1.
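Assuming the throughput metric is the total number of points processed divided by the elapsed time (the displayed formula is not reproduced here, so this is our reading), a one-line sketch is:

```python
def points_per_second(n_points, t_k, k):
    """Hypothetical throughput metric: training points processed per second,
    given n_points per iteration and t_k, the time to perform k iterations."""
    return n_points * k / t_k
```

For instance, 512 points per iteration over 500 iterations completed in 4 s corresponds to 64000 points processed per second.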

Data-parallel implementation
We set N_{f,1} ≡ N_1 = 64 and compare both weak and strong scaling to the original implementation, referred to as "no scaling".
• Middle: error for weak scaling with N_1 = 64 and size ∈ {1, 2, 4, 8}; • Right-hand side: error for strong scaling with N_1 = 512 and size ∈ {1, 2, 4, 8}. To reduce ambiguity, we use the *-superscript for no scaling, as size* is performed over 1 rank. The color of each violin box in the figure corresponds to the number of domain collocation points used for each GPU. Fig. 9 demonstrates that both weak and strong scaling yield convergence results similar to their unaccelerated counterpart. This result is one of the main findings of this work: PINNs scale properly with respect to accuracy, validating the intuition behind h-analysis and justifying the data-parallel approach. This allows one to move from the pre-asymptotic to the permanent regime by using weak scaling, or to reduce the cost of a permanent-regime application by dispatching the training points over different workers. Furthermore, the hyper-parameters, including the learning rate, remained unchanged. Next, we summarize the data-parallelization results in Table 2 with respect to size. In the first column, we present the time required to run 500 iterations of ADAM, referred to as t_500. This value is averaged over one run of 30000 iterations with a warm-up of 1500 iterations (i.e., we discard the values corresponding to iterations 0, 500 and 1000), and we present the resulting mean value ± standard deviation. The second column displays the efficiency of the run, evaluated with respect to t_500. Table 2 reveals that data-parallel acceleration results in additional training time, as anticipated. Weak scaling efficiency varies between 78.01% for size = 2 and 66.67% for size = 8, resulting in a speed-up of 5.47 when using 8 GPUs. Strong scaling shows similar behavior. Furthermore, it can be observed that size = 1 yields almost equal t_500 for N_f = 64 (4.08 s) and N_f = 512 (4.19 s).
To conclude, the Laplace case does not reach its full potential with Horovod due to the small batch size N_1. However, the larger values of N_1 in the next cases will lead to a noticeable increase in efficiency.

h-analysis
To begin with, we perform the h-analysis for the parameters in Fig. 10. Again, the transition regime begins at a density of ρ = 5 and spans between N_f = 350 and N_f = 4000. At higher magnitudes of N_f, the error remains approximately the same. To illustrate this more complex case, we present the total execution time distribution in Fig. 11. We note that the training times remain stable for N_f ≤ 4000. This observation is of great importance and should be emphasized, as it adds additional parameters to the analysis. Our work primarily focuses on the error in h-analysis; however, execution time and memory requirements are also important considerations. We see that in this case, weak scaling is not necessary (the optimal option is to use size = 1 and N_f = 4000). Alternatively, strong scaling can be done with N_{f,1} = 4000.
To gain further insight into the variability of the transition regime, we focus on N_f = 1000. We compare the solutions for seed = 1234 and seed = 1236 in Fig. 12. The train-test gap converged as O(N_f^{−1}). Visually, one can consider the losses close enough for N_f = 4000 (or N_f = 8000), in accordance with the h-analysis performed in Fig. 10.

Data-parallel implementation
We compare the error for the unaccelerated and data-parallel implementations for N_{f,1} ≡ N_1 = 500 in Fig. 14, analogous to the analysis in Fig. 9. Again, the error is stable with size. Both no scaling and weak scaling are similar, and strong scaling is unaffected by size. We plot the training time in Fig. 15. We show the training points processed per second (refer to Eq. (24)) and the efficiency with respect to size for both weak and strong scaling, with white bars representing the ideal scaling. The efficiency E_ff shows a gradual decrease with size, with results surpassing those of the previous section. The efficiency for size = 8 reaches 77.22% and 76.20% for weak and strong scaling, respectively, representing speed-ups of 6.18 and 6.10.
wherein u(x, y, t) and v(x, y, t) are the x- and y-components of the velocity field, and p(x, y, t) the pressure. We assume that there exists ϕ(x, y, t) such that: Under Eq. (27), the last row in Eq. (26) is satisfied. The latter leads to the definition of the residuals: We introduce N_f pseudo-random points x_i ∈ D and observations (u_i, v_i) = (u(x_i), v(x_i)), yielding the loss: Throughout this case, we have N_f = M, and we plot the results with respect to N_f. Note that N = 2N_f, and that both λ_1, λ_2 and the BCs are unknown.

h-analysis
We conduct the h-analysis and show the error in Fig. 17, showing no differences with the previous cases. Surprisingly, the permanent regime is reached already for N_f = 1000, despite the problem being three-dimensional, non-linear and inverse. This corresponds to low values of ρ, indicating that PINNs seem to mitigate the curse of dimensionality. In fact, it was achieved with only 1.21 points per unit per dimension. The total training time is presented in Fig. 18, where it can be seen to remain stable up to N_f = 5000, and then to increase linearly with N_f. Again, we represent the training and test losses in Fig. 19. The train-test gap is shown to decrease for N_f ≥ 350 as O(N_f^{−1}). Furthermore, the train-test gap can be considered, visually, as small enough for N_f = 1000.

Data-parallel implementation
We run the data-parallel simulations, setting N_{f,1} ≡ N_1 = M_1 = 500. As shown in Fig. 20, the simulations exhibit stable accuracy with size. The execution time increases moderately with size, as illustrated in Fig. 21. The training time decreases with N for the no scaling case; however, this behavior is temporary (refer to Fig. 17). We conclude our analysis by plotting the efficiency with respect to size in Fig. 22. Efficiency decreases with increasing size, but shows the best results so far, with 80.55% (resp. 86.31%) weak (resp. strong) scaling efficiency for size = 8. For the sake of completeness, the weak efficiency for N_{f,1} = 50000 and size = 8 improved to 86.15%. This encouraging result sets the stage for further exploration of more intricate applications.

Conclusion
In this work, we proposed a novel data-parallelization approach for PIML with a focus on PINNs. We provided a thorough h-analysis and associated theoretical results to support our approach, as well as practical considerations to facilitate implementation with Horovod data acceleration. Additionally, we ran reproducible numerical experiments to demonstrate the scalability of our approach. Further work includes the implementation of Horovod acceleration in the DeepXDE 35 library, the coupling of localized PINNs with domain decomposition methods, and applications on larger GPU servers (e.g., with more than 100 GPUs).

Figure 2 .
Figure 2. Left: scope of data-parallel PIML. Right: comprehensive review of important notations defined throughout this manuscript, along with their corresponding introductions.
and an open set D ⊆ R^d with d ∈ N_1, let L^p(D) be the standard class of functions with bounded L^p-norm over D. Given s ∈ R^+, we refer to [1, Section 2] for the definitions of the Sobolev function spaces H^s(D). Norms are denoted by ∥ · ∥, with subscripts indicating the associated functional spaces. For a finite set T, we introduce the notation |T| := card(T); closed subspaces are denoted by a ⊂_cl symbol and ı² = −1.
Abstract PDE
In this work, we consider a domain D ⊂ R^d, d ∈ N_1, with boundary Γ = ∂D. For any T > 0, D := D × [0, T], we solve a general non-linear PDE of the form:

Figure 3 .
Figure 3. Schematic representation of a PINN. A DNN with L = 3 (i.e., L − 1 = 2 hidden layers) and W = 5 learns the mapping x → u(x, t). The PDE is taken into account through the residual L_θ, and the trainable weights are optimized, leading to optimal θ⋆.

Figure 4 .
Figure 4. Overview of back-propagation. A forward pass generates activations y_i and computes the error L_θ(y_3, u). This is followed by a backward pass, through which the error adjoint is propagated to obtain the gradient with respect to the weights ∇L_θ, where θ = (w_1, · · · , w_6). Additionally, spatio-temporal partial derivatives can be computed in the same backward pass.

Figure 9 .
Figure 9. Error v/s size * for the different scaling options.
Figure 12. The upper figures depict |u(t, x)| predicted by the PINN. The lower figures show the results for seed = 1236.

Figure 14 .
Figure 14. Schrödinger: Error v/s size* for the different scaling options.

Figure 15 .
Figure 15. Schrödinger: Time (in seconds) v/s size* for the different scaling options.

Figure 19 .
Figure 19. Navier-Stokes: Training (blue) and test (orange) losses for ADAM v/s N_f.

Figure 20 .
Figure 20. Navier-Stokes: Error v/s size* for the different scaling options.

Figure 21 .
Figure 21. Navier-Stokes: Time (in seconds) v/s size* for the different scaling options.

Table 1 .
Overview of the parameters and hyper-parameters for each case.

Table 2 .
1D Laplace: t_500 in seconds and efficiency E_ff for the weak and strong scaling.
Figure 17. Navier-Stokes: Error v/s N_f.
Figure 18. Navier-Stokes: Time (in seconds) v/s N_f.