Abstract
We explore the data-parallel acceleration of physics-informed machine learning (PIML) schemes, with a focus on physics-informed neural networks (PINNs) for multiple graphics processing unit (GPU) architectures. In order to develop scale-robust and high-throughput PIML models for sophisticated applications that may require a large number of training points (e.g., involving complex and high-dimensional domains, nonlinear operators or multi-physics), we detail a novel protocol based on h-analysis and data-parallel acceleration through the Horovod training framework. The protocol is backed by new convergence bounds for the generalization error and the train-test gap. We show that the acceleration is straightforward to implement, does not compromise training, and proves to be highly efficient and controllable, paving the way towards generic scale-robust PIML. Extensive numerical experiments with increasing complexity illustrate its robustness and consistency, offering a wide range of possibilities for real-world simulations.
Introduction
Simulating physics through accurate surrogates is a hard task for engineers and computer scientists. Numerical methods such as finite element methods, finite difference methods and spectral methods can be used to approximate the solution of partial differential equations (PDEs) by representing it in a finite-dimensional function space, delivering an approximation of the desired solution or mapping^{1}.
Real-world applications often incorporate partial information of physics and observations, which can be noisy. This hints at using data-driven solutions through machine learning (ML) techniques. In particular, deep learning (DL)^{2,3} principles have been praised for their good performance, granted by the capability of deep neural networks (DNNs) to approximate high-dimensional and nonlinear mappings and to offer great generalization with large datasets. Furthermore, the exponential growth of GPU capabilities has made it possible to implement ever larger DL models.
Recently, a novel paradigm called physics-informed machine learning (PIML)^{4} was introduced to bridge the gap between data-driven^{5} and physics-based^{6} frameworks. PIML enhances the capability and generalization power of ML by adding prior information on physical laws to the scheme, restricting the output space (e.g., via additional constraints or a regularization term). This simple yet general approach has been applied successfully to a wide range of complex real-world applications, including structural mechanics^{7,8} and the biological, biomedical and behavioral sciences^{9}.
In particular, physics-informed neural networks (PINNs)^{10} apply PIML by means of DNNs. They encode the physics in the loss function and rely on automatic differentiation (AD)^{11}. PINNs have been used to solve inverse problems^{12}, stochastic PDEs^{13,14}, complex applications such as the Boltzmann transport equation^{15} and large-eddy simulations^{16}, and to perform uncertainty quantification^{17,18}.
Concerning the challenges faced by the PINNs community, efficient training^{19}, proper hyperparameter setting^{20}, and scaling PINNs^{21} are of particular interest. Regarding the latter, two research areas are gaining attention.
First, it is important to understand how PINNs behave for an increasing number of training points N (or equivalently, for a suitable bounded and fixed domain, a decreasing maximum distance between points h). Throughout this work, we refer to this study as h-analysis: the analysis of the number of training data needed to obtain a stable generalization error. In their pioneering works^{22,23}, Mishra and Molinaro provided a bound for the generalization error with respect to N for data-free and unique continuation problems, respectively. More precise bounds have been obtained using characterizations of the DNN^{24}.
Second, PINNs are typically trained on graphics processing units (GPUs), which have limited memory capabilities. To ensure that models scale well with increasingly complex settings, two paradigms emerge: data-parallel and model-parallel acceleration. The former splits the training data over different workers, while the latter distributes the model weights. However, general DL backends do not readily support multi-GPU acceleration. To address this issue, Horovod^{25} provides a distributed framework specifically designed for DL, featuring a ring-allreduce algorithm^{26} and implementations for TensorFlow, Keras and PyTorch.
As model size becomes prohibitive, domain decomposition-based approaches allow for distributing the computational domain. Examples of such approaches include conservative PINNs (cPINNs)^{27}, extended PINNs (XPINNs)^{28,29}, and distributed PINNs (DPINNs)^{30}; cPINNs and XPINNs were compared in^{31}. These approaches are compatible with data-parallel acceleration within each subdomain. Additionally, a recent review concerning distributed PIML^{21} is available. Regarding existing data-parallel implementations, TensorFlow MirroredStrategy in TensorDiffEq^{32} and NVIDIA Modulus^{33} should be mentioned. However, to the authors' knowledge, there is no systematic study of the background of data-parallel PINNs and their implementation.
In this work, we present a procedure to attain efficient data-parallel PINNs. It relies on h-analysis and is backed by a Horovod-based acceleration. Concerning h-analysis, we observe that PINNs exhibit three phases of behavior as a function of the number of training points N:

1.
A pre-asymptotic regime, where the model does not learn the solution due to missing information;

2.
A transition regime, where the error decreases with N;

3.
A permanent regime, where the error remains stable.
To illustrate this, Fig. 1 presents the relative \(L^2\) error distribution with respect to \(N_f\) (the number of domain collocation points) for the forward “1D Laplace” case. The experiment was conducted over 8 independent runs with a learning rate of \(10^{-4}\) and 20,000 iterations of the ADAM^{34} algorithm. The transition regime—where variability in the results is high and some models converge while others do not—lies between \(N_f={64}\) and \(N_f={400}\). For more information on the experimental setting and the definition of the precision \(\rho \), please refer to “1D Laplace”.
Building on these empirical observations, we use the setting in^{22,23} to supply a rigorous theoretical background for h-analysis. One of the main contributions of this manuscript is the bound on the “Generalization error for generic PINNs”, which allows for a simple analysis of the h-dependence. Furthermore, this bound is accompanied by a practical “Train-test gap bound”, supporting regime detection.
To summarize the latter results, a simple yet powerful recipe for any PIML scheme could be:

1.
Choose the right model and hyperparameters to achieve a low training loss;

2.
Use enough training points N to reach the permanent regime (e.g., such that the training and test losses are similar).
Any practitioner strives to reach the permanent regime for their PIML scheme, and we provide the necessary details for an easy implementation of Horovod-based data acceleration for PINNs, with direct application to any PIML model. Figure 2 (left) further illustrates the scope of data-parallel PIML. For the sake of clarity, Fig. 2 (right) supplies a comprehensive review of the important notations defined throughout this manuscript, along with where they are introduced.
Next, we apply the procedure to increasingly complex problems and demonstrate that Horovod acceleration is straightforward, using the pioneering PINNs code of Raissi as an example. Our main practical findings concerning data-parallel PINNs for up to 8 GPUs are the following:

They do not require modifying the hyperparameters;

They show training convergence similar to the 1-GPU case;

They lead to high efficiency for both weak and strong scaling (e.g., \(E_\text {ff} > 80\%\) for the Navier–Stokes problem with 8 GPUs).
This work is organized as follows: in “Problem formulation”, we introduce the PDEs under consideration, PINNs, and convergence estimates for the generalization error. We then move to “Data-parallel PINNs” and present “Numerical experiments”. Finally, we close this manuscript in “Conclusion”.
Problem formulation
General notation
Throughout, vectors and matrices are expressed using bold symbols. For a natural number k, we set \({\mathbb {N}}_k:= \{k,k+1,\ldots \}\). For \(p \in {\mathbb {N}}_0 = \{0,1,\ldots \}\) and an open set \(D\subseteq {\mathbb {R}}^d\) with \(d\in {\mathbb {N}}_1\), let \(L^p(D)\) be the standard class of functions with bounded \(L^p\)-norm over D. Given \(s\in {\mathbb {R}}^+\), we refer to^{1}, Section 2 for the definitions of the Sobolev function spaces \(H^s(D)\). Norms are denoted by \(\Vert \cdot \Vert \), with subscripts indicating the associated functional spaces. For a finite set \({\mathscr {T}}\), we introduce the notation \(|{\mathscr {T}}| := \text {card}({\mathscr {T}})\), closed subspaces are denoted by a \(\mathop \subset \nolimits_{{{\text{cl}}}}\) symbol, and \(\imath ^2=-1\).
Abstract PDE
In this work, we consider a domain \(D \subset {\mathbb {R}}^d\), \(d \in {\mathbb {N}}_1\), with boundary \(\Gamma = \partial D\). For any \(T>0\), we set \({\mathbb {D}}:= D\times [0,T]\) and solve a general nonlinear PDE of the form:
with \({\mathscr {N}}\) a spatiotemporal differential operator, \({\mathscr {B}}\) the boundary conditions (BCs) operator, \({\varvec{\lambda }}\) the material parameters—the latter being unknown for inverse problems—and \(u(\textbf{x},t)\in {\mathbb {R}}^m\) for any \(m\in {\mathbb {N}}_1\). Accordingly, for any function \({\hat{u}}\) defined over \({\mathbb {D}}\), we introduce
and define the residuals \(\xi _v\) for each \(v \in \Lambda \) and any observation function \(u_{obs} \):
PINNs
Following^{20,35}, let \(\sigma \) be a smooth activation function. Given an input \((\textbf{x},t) \in {\mathbb {R}}^{d+1}\), we define \({\mathscr {N}}\!\!{\mathscr {N}}_\theta \) as an L-layer feedforward neural network with \(W_0= d+1\), \(W_L = m\), and \(W_l\) neurons in the l-th layer for \(1 \le l \le L-1\). For constant-width DNNs, we set \(W=W_1=\cdots = W_{L-1}\). For \(1 \le l \le L\), we denote the weight matrix and bias vector in the l-th layer by \(\textbf{W}^l \in {\mathbb {R}}^{d_l \times d_{l-1}}\) and \(\textbf{b}^l \in {\mathbb {R}}^{d_l}\), respectively, resulting in:
This results in representation \(\textbf{z}^L(\textbf{x},t)\), with
the (trainable) parameters—or weights—in the network. We set \(\Theta = {\mathbb {R}}^{|\theta |}\). Application of PINNs to Eq. (1) yields the approximation \(u_\theta (\textbf{x},t) = \textbf{z}^L(\textbf{x},t)\).
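For concreteness, a minimal sketch of such a network in TensorFlow 1.x (the backend used later in “Methodology”) follows; the function name, layer sizes, and the choice of \(\tanh \) for \(\sigma \) are ours for illustration, not taken from the paper's code:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

def neural_net(X, layers, scope="pinn"):
    """Feedforward DNN z^L(x,t): layers = [d+1, W_1, ..., W_{L-1}, m]."""
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        H = X
        for l in range(len(layers) - 1):
            W = tf.get_variable(f"W{l}", [layers[l], layers[l + 1]],
                                initializer=tf.glorot_uniform_initializer())
            b = tf.get_variable(f"b{l}", [1, layers[l + 1]],
                                initializer=tf.zeros_initializer())
            H = tf.matmul(H, W) + b      # affine map W^l H + b^l
            if l < len(layers) - 2:
                H = tf.tanh(H)           # smooth activation sigma
        return H                         # z^L(x,t), i.e., u_theta(x,t)
```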
We introduce the training dataset \({\mathscr {T}}_v:=\{\tau _v^i\}_{i=1}^{N_v}\), with \(\tau _v^i \in D_v\) and \(N_v \in {\mathbb {N}}\) for \(i = 1,\cdots , N_v\), \(v \in \Lambda \), and observations \(u_\text {obs}(\tau _u^i)\), \(i=1,\cdots ,N_u\). Furthermore, to each training point \(\tau _v^i\) we associate a quadrature weight \(w_v^i>0\). Throughout this manuscript, we set:
Note that M (resp. \({\hat{N}}\)) represents the amount of information for the data-driven (resp. physics) part, by virtue of the PIML paradigm (refer to Fig. 2). The network weights \(\theta \) in Eq. (5) are trained (e.g., via the ADAM optimizer^{34}) by minimizing the weighted loss:
We seek to obtain:
The formulation for PINNs addresses the cases with no data (i.e., \(M=0\)) or no physics (i.e., \({\hat{N}}=0\)), thus exemplifying the PIML paradigm. Furthermore, it can handle time-independent operators with only minor changes; a schematic representation of a forward time-independent PINN is shown in Fig. 3.
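As a sketch, the weighted loss of Eq. (7) could be assembled as follows; `residuals`, `weights`, and `omegas` are hypothetical names for the per-term quantities \(\xi _v\), \(w_v^i\), and \(\upomega _v\), and the helper reuses the TensorFlow setup above:

```python
def weighted_loss(residuals, weights, omegas):
    """L_theta = sum_v omega_v * sum_i w_v^i |xi_v(tau_v^i)|^2 (sketch)."""
    return tf.add_n([omega * tf.reduce_sum(w * tf.square(xi))
                     for xi, w, omega in zip(residuals, weights, omegas)])
```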
Our setting assumes that the material parameters \({\varvec{\lambda }}\) are known. If \(M>0\), one can solve the inverse problem by seeking:
Similarly, unique continuation problems^{23}, which assume incomplete information for f, g and \(\hbar \), are solved via PINNs without changes. Indeed, “2D Navier–Stokes” combines a unique continuation problem with unknown parameters \(\lambda _1,\lambda _2\).
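In code, switching from the forward to the inverse setting amounts to declaring the material parameters as trainable variables, optimized jointly with \(\theta \); a minimal sketch (the zero initial values are an arbitrary assumption of ours):

```python
# Trainable material parameters for the inverse problem; the optimizer
# updates them alongside the network weights theta.
lambda_1 = tf.Variable(0.0, dtype=tf.float32, name="lambda_1")
lambda_2 = tf.Variable(0.0, dtype=tf.float32, name="lambda_2")
```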
Automatic differentiation
We aim to give further details about backpropagation algorithms and their dual role in the context of PINNs:

1.
Training the DNN by calculating \(\frac{\partial {\mathscr {L}}_\theta }{\partial \theta }\);

2.
Evaluating the partial derivatives in \({\mathscr {N}}[u_\theta (\textbf{x},t);{\varvec{\lambda }}]\) and \({\mathscr {B}}[u_\theta (\textbf{x},t);{\varvec{\lambda }}]\) so as to compute the loss \({\mathscr {L}}_\theta \).
They consist of a forward pass to evaluate the output \(u_\theta \) (and \({\mathscr {L}}_\theta \)), and a backward pass to assess the derivatives. To further elucidate backpropagation, we reproduce the informative diagram from^{11} in Fig. 4.
TensorFlow includes reverse-mode AD by default. Its cost is bounded with respect to \(|\theta |\) for scalar-output NNs (i.e., for \(m=1\)). The application of backpropagation (and reverse-mode AD in particular) to any training point is independent of other information, such as neighboring points or the volume of training data. This enables data-parallel PINNs. Before detailing the implementation, we justify the h-analysis through an abstract theoretical background.
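To illustrate role (2.) above, the partial derivatives entering \({\mathscr {N}}[u_\theta ;{\varvec{\lambda }}]\) can be obtained by repeated calls to `tf.gradients`; the placeholders and layer sizes below are illustrative, reusing the `neural_net` sketch from “PINNs”:

```python
x = tf.placeholder(tf.float32, shape=[None, 1])
t = tf.placeholder(tf.float32, shape=[None, 1])

u = neural_net(tf.concat([x, t], axis=1), layers=[2, 20, 20, 1])
u_t = tf.gradients(u, t)[0]      # du/dt via reverse-mode AD
u_x = tf.gradients(u, x)[0]      # du/dx
u_xx = tf.gradients(u_x, x)[0]   # d2u/dx2, differentiating once more
```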
Convergence estimates
To better understand how PINNs scale with N, we follow the method in^{22,23} under a simple setting, allowing us to control the data and physics counterparts in PIML. Set \(s \ge 0\) and define the spaces:
We assume that Eq. (1) can be recast as:
We suppose that Eq. (11) is well-posed and that for any \(u,v \in {\hat{X}}\), there holds that:
Equation (12) is a stability estimate, allowing the total error to be controlled by means of a bound on the PINN residual. The residuals in Eq. (3) are:
From the expression of residuals, we are interested in approximating integrals:
We assume that we are provided with quadratures:
for weights \({w^i_D},{w^i_u}\) and quadrature points \({\tau _D^i}, {\tau _u^i} \in {\mathbb {D}}\) such that for \(\alpha ,\beta > 0\):
For any \(\upomega _u>0\), the loss is defined as follows:
with \(\varepsilon _{T,D}\) and \(\varepsilon _{T,u}\) the training errors for collocation points and observations, respectively.
Notice that application of Eq. (17) to \(\xi _{D,\theta }\) and \(\xi _{u,\theta }\) yields:
We seek to quantify the generalization error:
We detail a new result concerning the generalization error for PINNs.
Theorem 1
(Generalization error for generic PINNs) Under the presented setting, there holds that:
with \({\hat{\mu }}:= \Vert u-u_{obs} \Vert _X\).
Proof
Consider the setting of Theorem 1. There holds that:
\(\square \)
The novelty of Theorem 1 is that it describes the generalization error for a simple case involving collocation points and observations. It states that the PINN generalizes well as long as the training error is low and sufficient training points are used. To make the result more intuitive, we rewrite Eq. (21), with \(\sim \) expressing the terms up to positive constants (denoting by \(\varepsilon _G\) the generalization error):
\(\varepsilon _G \sim \varepsilon _{T,D} + \varepsilon _{T,u} + {\hat{\mu }} + {\hat{N}}^{-\alpha /2} + M^{-\beta /2}.\)
The generalization error depends on the training errors (which are tractable during training), parameters \({\hat{N}}\) and M and bias \({\hat{\mu }}\).
Returning to h-analysis, we now have a theoretical justification of the three regimes presented in the “Introduction”. Let us assume that \({\hat{\mu }}=0\). For small values of \({\hat{N}}\) or M, the bound in Theorem 1 is too large to yield a meaningful estimate. Subsequently, the convergence behaves as \(\max ({\hat{N}}^{-\alpha /2},M^{-\beta /2})\), marking the transition regime. It is paramount for practitioners to reach the permanent regime when training PINNs, giving ground to data-parallel PINNs.
In general applications, the exact solution u is not available. Moreover, it is relevant to determine whether N is large enough. To this end, we introduce a testing (or validation) set of the same cardinality. Interestingly, the entire analysis above and Theorem 1 remain valid for another set of testing points, with the testing errors \(\varepsilon _{V,D}\) and \(\varepsilon _{V,u}\) defined as in Eq. (18). The train-test gap, which is tractable, can be quantified as follows.
Theorem 2
(Train-test gap bound) Under the presented setting, there holds that:
Proof
Consider the setting of Theorem 2. For \(v \in \{D, u\}\) and \(\cdot \in \{ Y,X\}\) there holds that:
\(\square \)
The bound in Theorem 2 is valuable as it allows one to assess the quadrature error convergence—and the regime—with respect to the number of training points.
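In practice, this suggests a simple regime check: monitor the relative gap between the training and test losses. A hedged sketch follows (the helper name and the tolerance are our own choices, not prescribed by the theory):

```python
def train_test_gap(loss_train, loss_test, eps=1e-12):
    """Relative train-test gap; a small value at the best iteration
    suggests that N is large enough (permanent regime)."""
    return abs(loss_train - loss_test) / max(abs(loss_test), eps)

# e.g., in the 1D Laplace case below, this gap drops from ~1.6e-1
# (N = 64) to ~6.0e-5 (N = 400), signalling the permanent regime.
in_permanent_regime = train_test_gap(1.0e-4, 1.0002e-4) < 1e-2
```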
Data-parallel PINNs
Data distribution and Horovod
In this section, we present the data-parallel distribution for PINNs. Let us set \(\texttt {size}\in {\mathbb {N}}_1\) and define the ranks (or workers):
each rank generally corresponding to a GPU. Data-parallel distribution requires the appropriate partitioning of the training points across ranks.
We introduce \({\hat{N}}_1,M_1\in {\mathbb {N}}_1\) collocation points and observations, respectively, for each \(\texttt {rank}\) (e.g., a GPU) yielding:
with
The data-parallel approach is as follows: we send the same synchronized copy of the DNN \({\mathscr {N}}\!\!{\mathscr {N}}_\theta \) defined in Eq. (4) to each rank. Each rank evaluates the loss \({\mathscr {L}}^\texttt {rank}_\theta \) and the gradient \(\nabla _\theta {\mathscr {L}}^\texttt {rank}_\theta \). The gradients are then averaged using an allreduce operation, such as the ring-allreduce implemented in Horovod^{26,36}, which is known to be bandwidth optimal with respect to the number of ranks^{36}. The process is illustrated in Fig. 5 for \(\texttt {size}=4\). The ring-allreduce algorithm involves each of the \(\texttt {size}\) nodes communicating with two of its peers \(2 \times (\texttt {size}-1)\) times^{26}.
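For intuition, the communication pattern can be mimicked in plain NumPy. The toy simulation below is our own illustrative code (Horovod's production implementation runs on MPI/NCCL): each rank's gradient is split into `size` chunks, a reduce-scatter phase accumulates the sums, and an allgather phase circulates the reduced chunks, each phase taking size-1 steps:

```python
import numpy as np

def ring_allreduce_mean(grads):
    """Toy simulation of ring-allreduce gradient averaging."""
    size = len(grads)
    chunks = [np.array_split(np.asarray(g, dtype=float), size) for g in grads]
    # Reduce-scatter: after size-1 steps, rank r owns the fully
    # reduced chunk (r+1) % size.
    for step in range(size - 1):
        sent = [chunks[r][(r - step) % size].copy() for r in range(size)]
        for r in range(size):
            src = (r - 1) % size
            chunks[r][(src - step) % size] += sent[src]
    # Allgather: circulate reduced chunks until every rank has all of them.
    for step in range(size - 1):
        sent = [chunks[r][(r + 1 - step) % size].copy() for r in range(size)]
        for r in range(size):
            src = (r - 1) % size
            chunks[r][(src + 1 - step) % size] = sent[src]
    return [np.concatenate(c) / size for c in chunks]

# Sanity check: 4 ranks, 8-dimensional gradients.
gs = [np.arange(8) + r for r in range(4)]
avg = ring_allreduce_mean(gs)
assert np.allclose(avg[0], np.mean(gs, axis=0))
```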
It is noteworthy that data generation for data-free PINNs (i.e., with \(M=0\)) requires no modification to existing codes, provided that each rank uses a different seed for random or pseudo-random sampling. Horovod allows one to apply data-parallel acceleration with minimal changes to existing code. Moreover, our approach and Horovod extend readily to multiple computing nodes. As pointed out in the “Introduction”, Horovod supports popular DL backends such as TensorFlow, PyTorch and Keras. In Listing 1, we demonstrate how to integrate data-parallel distribution using Horovod with a generic PINNs implementation in TensorFlow 1.x. The highlighted changes in pink show the steps for incorporating Horovod, which include: (i) initializing Horovod; (ii) pinning available GPUs to specific workers; (iii) wrapping the optimizer in the Horovod distributed optimizer; and (iv) broadcasting the initial variables from the master rank (\(\texttt {rank}= 0\)) to all workers.
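A condensed sketch of these four steps for a generic TensorFlow 1.x training script is given below; the dummy loss is a stand-in for the PINN loss \({\mathscr {L}}^\texttt {rank}_\theta \) built from the rank's data shard, and the learning rate and iteration count are placeholders of ours:

```python
import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

tf.disable_eager_execution()
hvd.init()                                           # (i) initialize Horovod

config = tf.ConfigProto()                            # (ii) pin one GPU per rank
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Stand-in for the PINN loss assembled from this rank's training points.
w = tf.get_variable("w", [1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(w - 1.0))

opt = tf.train.AdamOptimizer(learning_rate=1e-3)
opt = hvd.DistributedOptimizer(opt)                  # (iii) allreduce-averaged grads
train_op = opt.minimize(loss)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))      # (iv) sync weights from rank 0
    for _ in range(1000):
        sess.run(train_op)
```

Such a script would be launched with, e.g., `horovodrun -np 8 python train.py`.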
Weak and strong scaling
Two key concepts in data distribution paradigms are weak and strong scaling: weak scaling increases the problem size proportionally with the number of processors, while strong scaling keeps the problem size fixed while increasing the number of processors. In our setting:

Weak scaling: Each worker has \(({\hat{N}}_1,M_1)\) training points, and we increase the number of workers \(\texttt {size}\);

Strong scaling: We set a fixed total number of \(({\hat{N}}_1,M_1)\) training points, and we split the data over increasing \(\texttt {size}\) workers.
We portray weak and strong scaling in Fig. 6 for a data-free PINN with \({\hat{N}}_1=16\). Each box represents a GPU, with the number of collocation points encoded by color. To the left of each scaling option, we present the unaccelerated case. Finally, we introduce the training time \(t_\texttt {size}\) for \(\texttt {size}\) workers. This allows us to define the efficiency and speedup as:
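A plausible reading of these definitions, consistent with the figures reported below (where speedup equals \(\texttt {size}\times E_\text {ff}\) in both modes), can be sketched as:

```python
def weak_efficiency(t_1, t_size):
    """Weak scaling: per-rank workload fixed, so ideally t_size == t_1."""
    return t_1 / t_size

def strong_efficiency(t_1, t_size, size):
    """Strong scaling: total workload fixed, so ideally t_size == t_1 / size."""
    return t_1 / (size * t_size)

def speedup(efficiency, size):
    """S = size * E_ff in both scaling modes."""
    return size * efficiency

# e.g., the Schrodinger case below reports E_ff = 77.22% at size = 8,
# i.e., a speedup of about 8 * 0.7722 = 6.18.
```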
Numerical experiments
Throughout, we apply our procedure to three cases of interest:

“1D Laplace” equation (forward problem);

“1D Schrödinger” equation (forward problem);

“2D Navier–Stokes” equation (inverse problem).
For each case, we perform an h-analysis followed by Horovod data-parallel acceleration, which is applied to the domain training points (and observations for the Navier–Stokes case). Boundary loss terms are negligible due to the sufficient number of boundary data points.
Methodology
We perform simulations in single (float32) precision on an AMAX DLE48A AMD Rome EPYC server with 8 Quadro RTX 8000 Nvidia GPUs, each with 48 GB of memory. We use a Docker image of Horovod 0.26.1 with CUDA 12.1, Python 3.6.9 and TensorFlow 2.6.2. Throughout, we use tensorflow.compat.v1 as a backend without eager execution.
All results are available in the HorovodPINNs GitHub repository and are fully reproducible, ensuring compliance with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management and stewardship^{37}. We run the experiments 8 times, with seeds defined as:
in order to obtain rank-varying training points. For domain points, Latin Hypercube Sampling is performed with pyDOE 0.3.8. Boundary points are defined over uniform grids.
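A hedged sketch of this sampling step with pyDOE follows; the helper name and the exact seed offset are ours (any rank-dependent seed yields rank-varying points):

```python
import numpy as np
from pyDOE import lhs

def sample_domain_points(N_f, lb, ub, seed):
    """Draw N_f collocation points in the box [lb, ub] via Latin
    Hypercube Sampling, with a per-rank seed for rank-varying points."""
    np.random.seed(seed)
    lb, ub = np.asarray(lb), np.asarray(ub)
    return lb + (ub - lb) * lhs(lb.size, samples=N_f)

# e.g., (x, t) in [-5, 5] x [0, pi/2] for the 1D Schrodinger case;
# add, e.g., hvd.rank() to the seed for distinct points per rank.
X_f = sample_domain_points(2000, [-5.0, 0.0], [5.0, np.pi / 2], seed=1234)
```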
We use Glorot uniform initialization (see^{3}, Chapter 8). “Error” refers to the relative \(L^2\)-error taken over \({\mathscr {T}}^{test} \), and “Time” stands for the training time in seconds. For each case, the loss in Eq. (7) uses unit weights \(\upomega _v=1\) and the Monte-Carlo quadrature rule \(w_v^i = \frac{1}{N_v}\) for \(v\in \Lambda \). Also, we denote by \(\text {vol}({\mathbb {D}})\) the volume of the domain \({\mathbb {D}}\) and define the precision \(\rho := (N_f/\text {vol}({\mathbb {D}}))^{1/\dim ({\mathbb {D}})}\), i.e., the number of training points per unit per dimension.
We introduce \(t^k\), the time required to perform k iterations. The number of training points processed per second is then \(kN/t^k\).
For the sake of simplicity, we summarize the parameters and hyperparameters for each case in Table 1.
1D Laplace
We first consider the 1D Laplace equation in \(D = [-1,7]\):
Note that the exact solution is \(u(x) = \sin (\pi x)\). We solve the problem for:
We set \(N_g=2\) and \(N_u = 0\). Points in \({\mathscr {T}}_D\) are generated randomly over D, and \({\mathscr {T}}_b = \{-1,7\}\). The residual in Eq. (3):
yields the loss:
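A minimal sketch of this residual and loss in TensorFlow follows, reusing the `neural_net` sketch from “PINNs”. It assumes the strong form \(-u''(x) = \pi ^2 \sin (\pi x)\) with homogeneous Dirichlet BCs, which is consistent with the exact solution but stated here as our assumption; layer sizes are illustrative:

```python
import numpy as np

x_f = tf.placeholder(tf.float32, shape=[None, 1])   # collocation points in D
x_b = tf.constant([[-1.0], [7.0]])                  # boundary points T_b

u = neural_net(x_f, layers=[1, 30, 30, 1])
u_x = tf.gradients(u, x_f)[0]
u_xx = tf.gradients(u_x, x_f)[0]

# Residual xi_f = u'' + pi^2 sin(pi x), assuming -u'' = pi^2 sin(pi x).
xi_f = u_xx + np.pi ** 2 * tf.sin(np.pi * x_f)

# Same weights reused at the boundary (AUTO_REUSE); u vanishes at x = -1, 7.
u_b = neural_net(x_b, layers=[1, 30, 30, 1])
loss = tf.reduce_mean(tf.square(xi_f)) + tf.reduce_mean(tf.square(u_b))
```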
h-Analysis
We perform the h-analysis for the error, as portrayed before in Fig. 1. The transition regime occurs between \(N_f={64}\) and \(N_f={400}\), with a precision of \(\rho =8\), in accordance with general results for the h-analysis of traditional solvers. The permanent regime shows a slight improvement in accuracy, with the mean “Error” dropping from \(8.75 \times 10^{-3}\) for \(N_f=400\) to \(3.91 \times 10^{-3}\) for \(N=65{,}536\). To complete the h-analysis, Fig. 7 shows the convergence of the ADAM optimizer for all values of \(N_f\). This plot reveals that each regime exhibits similar patterns. The high variability in convergence during the transition regime is particularly interesting, with some runs converging and others not. In the permanent regime, the convergence shows almost identical and stable patterns irrespective of \(N_f\).
Furthermore, we plot the training and test losses in Fig. 8. Note that the validation loss and “Error” show similar behaviors. We use this figure as a reference to define each transition regime. In particular, it hints that the permanent regime is reached for \(N=400\), as the relative error at the best iteration between \({\mathscr {L}}_\theta ^\text {train}\) and \({\mathscr {L}}_\theta ^\text {test}\) drops from \(1.60 \times 10^{-1}\) to \(6.02 \times 10^{-5}\). For the sake of precision, the value for \(N=512\) is \(2.20 \times 10^{-5}\).
Data-parallel implementation
We set \(N_{f,1}\equiv N_1= {64}\) and compare both weak and strong scaling to the original implementation, referred to as “no scaling”.
We provide a detailed description of Fig. 9, as it will serve as the basis for future cases:

Left-hand side: Error for \(\texttt {size}^* \in \{1,2,4,8\}\), corresponding to \(N_f\in \{64,128,256,512\}\);

Middle: Error for weak scaling with \(N_{1}=64\) and for \(\texttt {size}\in \{1,2,4,8\}\);

Right-hand side: Error for strong scaling with \(N_{1}={512}\) and for \(\texttt {size}\in \{1,2,4,8\}\).
To reduce ambiguity, we use the \(*\) superscript for no scaling, as \(\texttt {size}^*\) is performed over 1 rank. The color of each violin box in the figure corresponds to the number of domain collocation points used on each GPU.
Figure 9 demonstrates that both weak and strong scaling yield convergence results similar to their unaccelerated counterparts. This is one of the main findings of this work: PINNs scale properly with respect to accuracy, validating the intuition behind h-analysis and justifying the data-parallel approach. This allows one to move from the pre-asymptotic to the permanent regime via weak scaling, or to amortize the cost of a permanent-regime application by dispatching the training points over different workers. Furthermore, the hyperparameters, including the learning rate, remain unchanged.
Next, we summarize the data-parallelization results in Table 2 with respect to \(\texttt {size}\).
In the first column, we present the time required to run 500 ADAM iterations, referred to as \(t^\text {500}\). This value is averaged over one run of 30,000 iterations with a warm-up of 1500 iterations (i.e., we discard the values corresponding to iterations 0, 500 and 1000). We present the resulting mean value ± standard deviation. The second column displays the efficiency of the run, evaluated with respect to \(t^\text {500}\).
Table 2 reveals that data-parallel acceleration incurs additional training time, as anticipated. Weak scaling efficiency varies from \({78.01}\%\) for \(\texttt {size}={2}\) to \({66.67}\%\) for \(\texttt {size}=8\), resulting in a speedup of 5.47 when using 8 GPUs. Strong scaling shows similar behavior. Furthermore, it can be observed that \(\texttt {size}=1\) yields almost equal \(t^{500}\) for \(N_f={64}\) (4.08 s) and \(N_f={512}\) (4.19 s).
1D Schrödinger
We solve the nonlinear Schrödinger equation with periodic BCs (refer to^{10}, Section 3.1.1), posed over \({\overline{D}}\times {[0, T ] }\) with \(D:=(-5,5)\) and \(T := \pi / 2\):
where \(u(x,t) = u^0(x,t) + \imath u^1(x,t)\). We apply PINNs to the system with \(m=2\) in Eq. (4), i.e., \((u^0_\theta ,u^1_\theta ) \in {\mathbb {R}}^m={\mathbb {R}}^2\). We set \(N_g=N_\hbar = 200\).
h-Analysis
To begin with, we perform the h-analysis, with results portrayed in Fig. 10. Again, the transition regime begins at a density of \(\rho =5\) and spans between \(N_f=350\) and \(N_f=4000\). At higher magnitudes of \(N_f\), the error remains approximately the same. To further illustrate this more complex case, we present the total execution time distribution in Fig. 11.
We note that the training times remain stable for \(N_f\le 4000\). This observation is important, as it adds execution time as a further parameter to the analysis: our work focuses primarily on the error in the h-analysis, but execution time and memory requirements are also relevant considerations. In this case, weak scaling is not necessary (the optimal option is to use \(\texttt {size}= 1\) and \(N_f = 4000\)). Alternatively, strong scaling can be applied with \(N_{f,1} = 4000\).
To gain further insight into the variability of the transition regime, we focus on \(N_f = 1000\). We compare the solutions for \(\texttt {seed}= 1234\) and \(\texttt {seed}= 1236\) in Fig. 12. The upper figures depict u(t, x) predicted by the PINN. The lower figures compare the exact and predicted solutions at \(t\in \{0.59,0.79,0.98\}\). It is evident that the solution for \(\texttt {seed}=1234\) closely resembles the exact solution, whereas the solution for \(\texttt {seed}=1236\) fails to accurately represent the solution near \(x=0\), thereby illustrating the importance of reaching the permanent regime. Next, we show the training and testing losses in Fig. 13. We remark that the training and test losses converge for \(N_f >200\). Analysis of the train-test gap shows that it converges as \({\mathscr {O}}(N_f^{-1})\). Visually, one can assume that the losses are close enough for \(N_f=4000\) (or \(N_f=8000\)), in accordance with the h-analysis performed in Fig. 10.
Data-parallel implementation
We compare the error for the unaccelerated and data-parallel implementations for \(N_{f,1}\equiv N_{1}=500\) in Fig. 14, analogous to the analysis in Fig. 9. Again, the error is stable with \(\texttt {size}\): no scaling and weak scaling behave similarly, and strong scaling is unaffected by \(\texttt {size}\). We plot the training time in Fig. 15. Both weak and strong scaling increase the training time linearly and slightly with \(\texttt {size}\), with similar behavior. Figure 16 portrays the number of training points processed per second [refer to Eq. (24)] and the efficiency with respect to \(\texttt {size}\), with white bars representing ideal scaling. The efficiency \(E_\text {ff}\) decreases gradually with \(\texttt {size}\), with results surpassing those of the previous section: for \(\texttt {size}=8\) it reaches 77.22% and 76.20% for weak and strong scaling, respectively, representing speedups of 6.18 and 6.10.
Inverse problem: Navier–Stokes equation
We consider the Navier–Stokes problem with \(D := [1,8] \times [-2,2]\), \(T=20\), and unknown parameters \(\uplambda _1,\uplambda _2 \in {\mathbb {R}}\). The resulting divergence-free Navier–Stokes system is expressed as follows:
wherein u(x, y, t) and v(x, y, t) are the x- and y-components of the velocity field, and p(x, y, t) is the pressure. We assume that there exists \(\varphi (x,y,t)\) such that:
Under Eq. (27), the last row in Eq. (26) is satisfied automatically. This leads to the definition of the residuals:
We introduce \(N_f\) pseudo-random points \(\textbf{x}^i \in D\) and observations \((u^i,v^i)= (u (\textbf{x}^i), v (\textbf{x}^i)) \), yielding the loss:
with
Throughout this case, we have \(N_f=M\), and we plot the results with respect to \(N_f\). Note that \(N = 2 N_f\), and that both \(\lambda _1,\lambda _2\) and the BCs are unknown.
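A hedged sketch of this stream-function formulation follows, in the spirit of the classic Raissi implementation; variable names and layer sizes are ours, and it reuses the `neural_net` and trainable `lambda_1`, `lambda_2` sketches from earlier sections. The network outputs \((\varphi , p)\), and \(u = \partial _y \varphi \), \(v = -\partial _x \varphi \) satisfy the divergence-free constraint by construction:

```python
x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])
t = tf.placeholder(tf.float32, [None, 1])
u_obs = tf.placeholder(tf.float32, [None, 1])
v_obs = tf.placeholder(tf.float32, [None, 1])

phi_p = neural_net(tf.concat([x, y, t], axis=1), layers=[3, 20, 20, 2],
                   scope="ns")
phi, p = phi_p[:, 0:1], phi_p[:, 1:2]
u = tf.gradients(phi, y)[0]        # u = dphi/dy
v = -tf.gradients(phi, x)[0]       # v = -dphi/dx

u_t, u_x, u_y = (tf.gradients(u, z)[0] for z in (t, x, y))
v_t, v_x, v_y = (tf.gradients(v, z)[0] for z in (t, x, y))
u_xx, u_yy = tf.gradients(u_x, x)[0], tf.gradients(u_y, y)[0]
v_xx, v_yy = tf.gradients(v_x, x)[0], tf.gradients(v_y, y)[0]
p_x, p_y = tf.gradients(p, x)[0], tf.gradients(p, y)[0]

# Momentum residuals with trainable lambda_1, lambda_2 (see "PINNs"):
f = u_t + lambda_1 * (u * u_x + v * u_y) + p_x - lambda_2 * (u_xx + u_yy)
g = v_t + lambda_1 * (u * v_x + v * v_y) + p_y - lambda_2 * (v_xx + v_yy)

# Loss: data misfit on (u, v) observations plus PDE residuals.
loss = (tf.reduce_mean(tf.square(u - u_obs)) +
        tf.reduce_mean(tf.square(v - v_obs)) +
        tf.reduce_mean(tf.square(f)) + tf.reduce_mean(tf.square(g)))
```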
h-Analysis
We conduct the h-analysis and show the error in Fig. 17; the behavior mirrors the previous cases. Surprisingly, the permanent regime is reached for as few as \(N_f=1000\) points, despite the problem being three-dimensional, nonlinear, and inverse. This corresponds to low values of \(\rho \), suggesting that PINNs can mitigate the curse of dimensionality: in fact, the permanent regime was achieved with only 1.21 points per unit per dimension. The total training time is presented in Fig. 18, where it can be seen to remain stable up to \(N_f=5000\) and then increase linearly with \(N_f\).
Again, we present the training and test losses in Fig. 19. The train-test gap is shown to decrease for \(N_f \ge 350\) as \({\mathscr {O}}(N_f^{-1})\). Furthermore, the train-test gap can be considered visually small enough for \(N_f = 1000\).
Data-parallel implementation
We run the data-parallel simulations, setting \(N_{f,1} \equiv N_{1} = M_1= 500\). As shown in Fig. 20, the simulations exhibit stable accuracy with \(\texttt {size}\). The execution time increases moderately with \(\texttt {size}\), as illustrated in Fig. 21. The training time decreases with N for the no-scaling case; however, this behavior is temporary (refer to Fig. 17). We conclude our analysis by plotting the efficiency with respect to \(\texttt {size}\) in Fig. 22. Efficiency decreases with increasing \(\texttt {size}\) but shows the best results so far, with 80.55% (resp. 86.31%) weak (resp. strong) scaling efficiency for \(\texttt {size}= 8\). For the sake of completeness, the weak efficiency for \(N_{f,1}=50{,}000\) and \(\texttt {size}=8\) improved to 86.15%. This encouraging result sets the stage for further exploration of more intricate applications.
Conclusion
In this work, we proposed a novel data-parallelization approach for PIML with a focus on PINNs. We provided a thorough h-analysis and associated theoretical results to support our approach, as well as practical guidelines to facilitate its implementation with Horovod data acceleration. Additionally, we ran reproducible numerical experiments to demonstrate the scalability of our approach. Future work includes the implementation of Horovod acceleration in the DeepXDE^{35} library, the coupling of localized PINNs with domain decomposition methods, and applications on larger GPU servers (e.g., with more than 100 GPUs).
Data availability
The code required to reproduce these findings is available for download from https://github.com/pescap/HorovodPINNs.
Change history
27 May 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-62284-9
References
Steinbach, O. Numerical Approximation Methods for Elliptic Boundary Value Problems: Finite and Boundary Elements (Springer, 2007).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Bengio, Y., Goodfellow, I. & Courville, A. Deep Learning Vol. 1 (MIT Press, 2017).
Karniadakis, G. E. et al. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440. https://doi.org/10.1038/s42254-021-00314-5 (2021).
You, H., Yu, Y., Trask, N., Gulian, M. & D'Elia, M. Data-driven learning of nonlocal physics from high-fidelity synthetic data. Comput. Methods Appl. Mech. Eng. 374, 113553. https://doi.org/10.1016/j.cma.2020.113553 (2021).
Sun, L., Gao, H., Pan, S. & Wang, J.-X. Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data. Comput. Methods Appl. Mech. Eng. 361, 112732. https://doi.org/10.1016/j.cma.2019.112732 (2020).
Lai, Z. et al. Neural Modal ODEs: Integrating physics-based modeling with neural ODEs for modeling high dimensional monitored structures. http://arxiv.org/abs/2207.07883 (2022).
Lai, Z., Mylonas, C., Nagarajaiah, S. & Chatzi, E. Structural identification with physicsinformed neural ordinary differential equations. J. Sound Vib. 508, 116196. https://doi.org/10.1016/j.jsv.2021.116196 (2021).
Alber, M. et al. Integrating machine learning and multiscale modeling: Perspectives, challenges, and opportunities in the biological, biomedical, and behavioral sciences. NPJ Digit. Med. 2, 1–11. https://doi.org/10.1038/s41746-019-0193-y (2019).
Raissi, M., Perdikaris, P. & Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045 (2019).
Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 18, 5595–5637. https://doi.org/10.5555/3122009.3242010 (2017).
Chen, Y., Lu, L., Karniadakis, G. E. & Negro, L. D. Physicsinformed neural networks for inverse problems in nanooptics and metamaterials. Opt. Express 28, 11618–11633. https://doi.org/10.1364/OE.384875 (2020).
Chen, X., Duan, J. & Karniadakis, G. E. Learning and metalearning of stochastic advectiondiffusionreaction systems from sparse measurements. Eur. J. Appl. Math. 32, 397–420. https://doi.org/10.1017/S0956792520000169 (2021).
Meng, X. & Karniadakis, G. E. A composite neural network that learns from multifidelity data: Application to function approximation and inverse PDE problems. J. Comput. Phys. 401, 109020. https://doi.org/10.1016/j.jcp.2019.109020 (2020).
Li, R., Wang, J.-X., Lee, E. & Luo, T. Physics-informed deep learning for solving phonon Boltzmann transport equation with large temperature non-equilibrium. NPJ Comput. Mater. 8, 1–10. https://doi.org/10.1038/s41524-022-00712-y (2022).
Yang, X. I. A., Zafar, S., Wang, J.-X. & Xiao, H. Predictive large-eddy-simulation wall modeling via physics-informed neural networks. Phys. Rev. Fluids 4, 034602. https://doi.org/10.1103/PhysRevFluids.4.034602 (2019).
Zhang, D., Lu, L., Guo, L. & Karniadakis, G. E. Quantifying total uncertainty in physicsinformed neural networks for solving forward and inverse stochastic problems. J. Comput. Phys. 397, 108850. https://doi.org/10.1016/j.jcp.2019.07.048 (2019).
Escapil-Inchauspé, P. & Ruz, G. A. Physics-informed neural networks for operator equations with stochastic data. https://doi.org/10.48550/ARXIV.2211.10344 (2022).
Wang, S., Yu, X. & Perdikaris, P. When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys. 449, 110768. https://doi.org/10.1016/j.jcp.2021.110768 (2022).
Escapil-Inchauspé, P. & Ruz, G. A. Hyperparameter tuning of physics-informed neural networks: Application to Helmholtz problems. Neurocomputing 1, 126826. https://doi.org/10.1016/j.neucom.2023.126826 (2023).
Shukla, K., Xu, M., Trask, N. & Karniadakis, G. E. Scalable algorithms for physics-informed neural and graph networks. Data-Centric Eng. 3, e24. https://doi.org/10.1017/dce.2022.24 (2022).
Mishra, S. & Molinaro, R. Estimates on the generalization error of physics-informed neural networks for approximating PDEs. IMA J. Numer. Anal. https://doi.org/10.1093/imanum/drab093 (2022).
Mishra, S. & Molinaro, R. Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs. IMA J. Numer. Anal. 42, 981–1022. https://doi.org/10.1093/imanum/drab032 (2021).
Shin, Y., Darbon, J. & Karniadakis, G. E. On the convergence of physics informed neural networks for linear secondorder elliptic and parabolic type PDEs. http://arxiv.org/abs/2004.01806 (2020).
Khoo, Y., Lu, J. & Ying, L. Solving parametric PDE problems with artificial neural networks. Eur. J. Appl. Math. 32, 421–435 (2021).
Sergeev, A. & Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. http://arxiv.org/abs/1802.05799 (2018).
Jagtap, A. D., Kharazmi, E. & Karniadakis, G. E. Conservative physics-informed neural networks on discrete domains for conservation laws: Applications to forward and inverse problems. Comput. Methods Appl. Mech. Eng. 365, 113028. https://doi.org/10.1016/j.cma.2020.113028 (2020).
Jagtap, A. D. & Karniadakis, G. E. Extended physics-informed neural networks (XPINNs): A generalized space-time domain decomposition based deep learning framework for nonlinear partial differential equations. Commun. Comput. Phys. 28, 2002–2041. https://doi.org/10.4208/cicp.OA-2020-0164 (2020).
Hu, Z., Jagtap, A. D., Karniadakis, G. E. & Kawaguchi, K. When do extended physics-informed neural networks (XPINNs) improve generalization? SIAM J. Sci. Comput. 44, A3158–A3182. https://doi.org/10.1137/21M1447039 (2022).
Dwivedi, V., Parashar, N. & Srinivasan, B. Distributed physics informed neural network for dataefficient solution to partial differential equations. http://arxiv.org/abs/1907.08967 (2019).
Shukla, K., Jagtap, A. D. & Karniadakis, G. E. Parallel physicsinformed neural networks via domain decomposition. J. Comput. Phys. 447, 110683. https://doi.org/10.1016/j.jcp.2021.110683 (2021).
McClenny, L. D., Haile, M. A. & Braga-Neto, U. M. TensorDiffEq: Scalable multi-GPU forward and inverse solvers for physics informed neural networks. http://arxiv.org/abs/2103.16034 (2021).
Hennigh, O. et al. NVIDIA SimNet™: An AIaccelerated multiphysics simulation framework. In Computational Science–ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part V, 447–461 (Springer, 2021).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. http://arxiv.org/abs/1412.6980 (2014).
Lu, L., Meng, X., Mao, Z. & Karniadakis, G. E. DeepXDE: A deep learning library for solving differential equations. SIAM Rev. 63, 208–228. https://doi.org/10.1137/19M1274067 (2021).
Patarasuk, P. & Yuan, X. Bandwidth optimal allreduce algorithms for clusters of workstations. J. Parallel Distrib. Comput. 69, 117–124 (2009).
Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 1–9 (2016).
Acknowledgements
The authors would like to thank the Data Observatory Foundation, ANID FONDECYT 1230315, ANID FONDECYT 3230088, the FES-UAI postdoc grant, ANID PIA/BASAL FB0002, and ANID/PIA/ANILLOS ACT210096 for financially supporting this research.
Author information
Authors and Affiliations
Contributions
P.E.I. conceived the experiment(s). All authors analyzed the results and reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this Article was revised: The original version of this Article contained errors in section ‘General Notation’, Equation 3, Equation 13, Equation 14, Equation 21, section ‘Proof Consider the setting of Theorem 1.’ and in section ‘Methodology’, where notations were incorrect. Full information regarding the corrections made can be found in the correction for this Article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Escapil-Inchauspé, P., Ruz, G.A. h-Analysis and data-parallel physics-informed neural networks. Sci Rep 13, 17562 (2023). https://doi.org/10.1038/s41598-023-44541-5
DOI: https://doi.org/10.1038/s41598-023-44541-5