Abstract
We explore the data-parallel acceleration of physics-informed machine learning (PIML) schemes, with a focus on physics-informed neural networks (PINNs) for multiple graphics processing unit (GPU) architectures. In order to develop scale-robust and high-throughput PIML models for sophisticated applications that may require a large number of training points (e.g., involving complex and high-dimensional domains, nonlinear operators or multi-physics), we detail a novel protocol based on h-analysis and data-parallel acceleration through the Horovod training framework. The protocol is backed by new convergence bounds for the generalization error and the train-test gap. We show that the acceleration is straightforward to implement, does not compromise training, and proves to be highly efficient and controllable, paving the way towards generic scale-robust PIML. Extensive numerical experiments with increasing complexity illustrate its robustness and consistency, offering a wide range of possibilities for real-world simulations.
Introduction
Simulating physics through accurate surrogates is a hard task for engineers and computer scientists. Numerical methods such as finite element methods, finite difference methods and spectral methods can be used to approximate the solution of partial differential equations (PDEs) by representing it in a finite-dimensional function space, delivering an approximation of the desired solution or mapping^{1}.
Real-world applications often incorporate partial information of physics and observations, which can be noisy. This hints at using data-driven solutions through machine learning (ML) techniques. In particular, deep learning (DL)^{2,3} principles have been praised for their good performance, granted by the capability of deep neural networks (DNNs) to approximate high-dimensional and nonlinear mappings and to offer great generalization with large datasets. Furthermore, the exponential growth of GPU capabilities has made it possible to implement ever larger DL models.
Recently, a novel paradigm called physics-informed machine learning (PIML)^{4} was introduced to bridge the gap between data-driven^{5} and physics-based^{6} frameworks. PIML enhances the capability and generalization power of ML by adding prior information on physical laws to the scheme, restricting the output space (e.g., via additional constraints or a regularization term). This simple yet general approach has been applied successfully to a wide range of complex real-world applications, including structural mechanics^{7,8} and the biological, biomedical and behavioral sciences^{9}.
In particular, physics-informed neural networks (PINNs)^{10} apply PIML by means of DNNs. They encode the physics in the loss function and rely on automatic differentiation (AD)^{11}. PINNs have been used to solve inverse problems^{12}, stochastic PDEs^{13,14}, complex applications such as the Boltzmann transport equation^{15} and large-eddy simulations^{16}, and to perform uncertainty quantification^{17,18}.
Concerning the challenges faced by the PINNs community, efficient training^{19}, proper hyperparameter setting^{20}, and scaling PINNs^{21} are of particular interest. Regarding the latter, two research areas are gaining attention.
First, it is important to understand how PINNs behave for an increasing number of training points N (or equivalently, for a suitable bounded and fixed domain, a decreasing maximum distance between points h). Throughout this work, we refer to this study as h-analysis: the analysis of the number of training data needed to obtain a stable generalization error. In their pioneering works^{22,23}, Mishra and Molinaro provided a bound for the generalization error with respect to N for data-free and unique continuation problems, respectively. More precise bounds have been obtained using characterizations of the DNN^{24}.
Second, PINNs are typically trained on graphics processing units (GPUs), which have limited memory capabilities. To ensure that models scale well with increasingly complex settings, two paradigms emerge: data-parallel and model-parallel acceleration. The former splits the training data over different workers, while the latter distributes the model weights. However, general DL backends do not readily support multi-GPU acceleration. To address this issue, Horovod^{25} provides a distributed framework specifically designed for DL, featuring a ring-allreduce algorithm^{26} and implementations for TensorFlow, Keras and PyTorch.
As model size becomes prohibitive, domain decomposition-based approaches allow for distributing the computational domain. Examples of such approaches include conservative PINNs (cPINNs)^{27}, extended PINNs (XPINNs)^{28,29}, and distributed PINNs (DPINNs)^{30}; cPINNs and XPINNs were compared in^{31}. These approaches are compatible with data-parallel acceleration within each subdomain. Additionally, a recent review concerning distributed PIML^{21} is available. Regarding existing data-parallel implementations, TensorFlow MirroredStrategy in TensorDiffEq^{32} and NVIDIA Modulus^{33} should be mentioned. However, to the authors' knowledge, there is no systematic study of the background of data-parallel PINNs and their implementation.
In this work, we present a procedure to attain efficient data-parallel PINNs. It relies on h-analysis and is backed by a Horovod-based acceleration. Concerning h-analysis, we observe that PINNs exhibit three phases of behavior as a function of the number of training points N:

1.
A pre-asymptotic regime, where the model does not learn the solution due to missing information;

2.
A transition regime, where the error decreases with N;

3.
A permanent regime, where the error remains stable.
To illustrate this, Fig. 1 presents the relative \(L^2\) error distribution with respect to \(N_f\) (the number of domain collocation points) for the forward “1D Laplace” case. The experiment was conducted over 8 independent runs with a learning rate of \(10^{-4}\) and 20,000 iterations of the ADAM^{34} algorithm. The transition regime—where variability in the results is high and some models converge while others do not—lies between \(N_f={64}\) and \(N_f={400}\). For more information on the experimental setting and the definition of the precision \(\rho \), please refer to “1D Laplace”.
Building on these empirical observations, we use the setting in^{22,23} to supply a rigorous theoretical background for h-analysis. One of the main contributions of this manuscript is the bound on the “Generalization error for generic PINNs”, which allows for a simple analysis of the h-dependence. Furthermore, this bound is accompanied by a practical “Train-test gap bound”, supporting regime detection.
To summarize the latter results, a simple yet powerful recipe for any PIML scheme could be:

1.
Choose the right model and hyperparameters to achieve a low training loss;

2.
Use enough training points N to reach the permanent regime (e.g., such that the training and test losses are similar).
Any practitioner strives to reach the permanent regime for their PIML scheme, and we provide the necessary details for an easy implementation of Horovod-based data acceleration for PINNs, with direct application to any PIML model. Figure 2 (left) further illustrates the scope of data-parallel PIML. For the sake of clarity, Fig. 2 (right) supplies a comprehensive review of the important notations defined throughout this manuscript, along with where they are introduced.
Next, we apply the procedure to increasingly complex problems and demonstrate that Horovod acceleration is straightforward, using the pioneering PINNs code of Raissi as an example. Our main practical findings concerning data-parallel PINNs for up to 8 GPUs are the following:

They do not require modifying the hyperparameters;

They show training convergence similar to the 1-GPU case;

They lead to high efficiency for both weak and strong scaling (e.g., \(E_\text {ff} > 80\%\) for the Navier–Stokes problem with 8 GPUs).
This work is organized as follows: in “Problem formulation”, we introduce the PDEs under consideration, PINNs, and convergence estimates for the generalization error. We then move to “Data-parallel PINNs” and present “Numerical experiments”. Finally, we close this manuscript in “Conclusion”.
Problem formulation
General notation
Throughout, vectors and matrices are expressed using bold symbols. For a natural number k, we set \({\mathbb {N}}_k:= \{k,k+1,\ldots \}\). For \(p \in {\mathbb {N}}_0 = \{0,1,\ldots \}\) and an open set \(D\subseteq {\mathbb {R}}^d\) with \(d\in {\mathbb {N}}_1\), let \(L^p(D)\) be the standard class of functions with bounded \(L^p\)-norm over D. Given \(s\in {\mathbb {R}}^+\), we refer to^{1}, Section 2 for the definitions of the Sobolev function spaces \(H^s(D)\). Norms are denoted by \(\Vert \cdot \Vert \), with subscripts indicating the associated functional spaces. For a finite set \({\mathscr {T}}\), we introduce the notation \(|{\mathscr {T}}| := \text {card}({\mathscr {T}})\), closed subspaces are denoted by a \(\mathop \subset \nolimits_{{{\text{cl}}}}\) symbol, and \(\imath ^2=-1\).
Abstract PDE
In this work, we consider a domain \(D \subset {\mathbb {R}}^d\), \(d \in {\mathbb {N}}_1\), with boundary \(\Gamma = \partial D\). For any \(T>0\), we set \({\mathbb {D}}:= D\times [0,T]\) and solve a general nonlinear PDE of the form:
with \({\mathscr {N}}\) a spatiotemporal differential operator, \({\mathscr {B}}\) the boundary conditions (BCs) operator, \({\varvec{\lambda }}\) the material parameters—the latter being unknown for inverse problems—and \(u(\textbf{x},t)\in {\mathbb {R}}^m\) for any \(m\in {\mathbb {N}}_1\). Accordingly, for any function \({\hat{u}}\) defined over \({\mathbb {D}}\), we introduce
and define the residuals \(\xi _v\) for each \(v \in \Lambda \) and any observation function \(u_{obs} \):
PINNs
Following^{20,35}, let \(\sigma \) be a smooth activation function. Given an input \((\textbf{x},t) \in {\mathbb {R}}^{d+1}\), we define \({\mathscr {N}}\!\!{\mathscr {N}}_\theta \) as an L-layer feedforward neural network with \(W_0= d+1\), \(W_L = m\), and \(W_l\) neurons in the l-th layer for \(1 \le l \le L-1\). For constant-width DNNs, we set \(W=W_1=\cdots = W_{L-1}\). For \(1 \le l \le L\), we denote the weight matrix and bias vector in the l-th layer by \(\textbf{W}^l \in {\mathbb {R}}^{d_l \times d_{l-1}}\) and \(\textbf{b}^l \in {\mathbb {R}}^{d_l}\), respectively, resulting in:
This results in representation \(\textbf{z}^L(\textbf{x},t)\), with
the (trainable) parameters—or weights—in the network. We set \(\Theta = {\mathbb {R}}^{|\theta |}\). Application of PINNs to Eq. (1) yields the approximation \(u_\theta (\textbf{x},t) = \textbf{z}^L(\textbf{x},t)\).
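For concreteness, a minimal sketch of such a network in TensorFlow 1.x (the backend used later in “Methodology”) follows; the function name, layer sizes, and the choice of \(\tanh \) for \(\sigma \) are ours for illustration, not taken from the paper's code:

```python
import tensorflow.compat.v1 as tf
tf.disable_eager_execution()

def neural_net(X, layers, scope="pinn"):
    """Feedforward DNN z^L(x,t): layers = [d+1, W_1, ..., W_{L-1}, m]."""
    with tf.variable_scope(scope, reuse=tf.AUTO_REUSE):
        H = X
        for l in range(len(layers) - 1):
            W = tf.get_variable(f"W{l}", [layers[l], layers[l + 1]],
                                initializer=tf.glorot_uniform_initializer())
            b = tf.get_variable(f"b{l}", [1, layers[l + 1]],
                                initializer=tf.zeros_initializer())
            H = tf.matmul(H, W) + b      # affine map W^l H + b^l
            if l < len(layers) - 2:
                H = tf.tanh(H)           # smooth activation sigma
        return H                         # z^L(x,t), i.e., u_theta(x,t)
```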
We introduce the training dataset \({\mathscr {T}}_v:=\{\tau _v^i\}_{i=1}^{N_v}\), with \(\tau _v^i \in D_v\) and \(N_v \in {\mathbb {N}}\) for \(i = 1,\cdots , N_v\), \(v \in \Lambda \), and observations \(u_\text {obs}(\tau _u^i)\), \(i=1,\cdots ,N_u\). Furthermore, to each training point \(\tau _v^i\) we associate a quadrature weight \(w_v^i>0\). Throughout this manuscript, we set:
Note that M (resp. \({\hat{N}}\)) represents the amount of information for the data-driven (resp. physics) part, by virtue of the PIML paradigm (refer to Fig. 2). The network weights \(\theta \) in Eq. (5) are trained (e.g., via the ADAM optimizer^{34}) by minimizing the weighted loss:
We seek to obtain:
The formulation for PINNs addresses the cases with no data (i.e., \(M=0\)) or no physics (i.e., \({\hat{N}}=0\)), thus exemplifying the PIML paradigm. Furthermore, it can handle time-independent operators with only minor changes; a schematic representation of a forward time-independent PINN is shown in Fig. 3.
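As a sketch, the weighted loss of Eq. (7) could be assembled as follows; `residuals`, `weights`, and `omegas` are hypothetical names for the per-term quantities \(\xi _v\), \(w_v^i\), and \(\upomega _v\), and the helper reuses the TensorFlow setup above:

```python
def weighted_loss(residuals, weights, omegas):
    """L_theta = sum_v omega_v * sum_i w_v^i |xi_v(tau_v^i)|^2 (sketch)."""
    return tf.add_n([omega * tf.reduce_sum(w * tf.square(xi))
                     for xi, w, omega in zip(residuals, weights, omegas)])
```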
Our setting assumes that the material parameters \({\varvec{\lambda }}\) are known. If \(M>0\), one can solve the inverse problem by seeking:
Similarly, unique continuation problems^{23}, which assume incomplete information for f, g and \(\hbar \), are solved via PINNs without changes. Indeed, “2D Navier–Stokes” combines a unique continuation problem with unknown parameters \(\lambda _1,\lambda _2\).
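In code, switching from the forward to the inverse setting amounts to declaring the material parameters as trainable variables, optimized jointly with \(\theta \); a minimal sketch (the zero initial values are an arbitrary assumption of ours):

```python
# Trainable material parameters for the inverse problem; the optimizer
# updates them alongside the network weights theta.
lambda_1 = tf.Variable(0.0, dtype=tf.float32, name="lambda_1")
lambda_2 = tf.Variable(0.0, dtype=tf.float32, name="lambda_2")
```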
Automatic differentiation
We aim to give further details about backpropagation algorithms and their dual role in the context of PINNs:

1.
Training the DNN by calculating \(\frac{\partial {\mathscr {L}}_\theta }{\partial \theta }\);

2.
Evaluating the partial derivatives in \({\mathscr {N}}[u_\theta (\textbf{x},t);{\varvec{\lambda }}]\) and \({\mathscr {B}}[u_\theta (\textbf{x},t);{\varvec{\lambda }}]\) so as to compute the loss \({\mathscr {L}}_\theta \).
They consist of a forward pass to evaluate the output \(u_\theta \) (and \({\mathscr {L}}_\theta \)), and a backward pass to assess the derivatives. To further elucidate backpropagation, we reproduce the informative diagram from^{11} in Fig. 4.
TensorFlow includes reverse-mode AD by default. Its cost is bounded with respect to \(|\theta |\) for scalar-output NNs (i.e., for \(m=1\)). The application of backpropagation (and reverse-mode AD in particular) to any training point is independent of other information, such as neighboring points or the volume of training data. This enables data-parallel PINNs. Before detailing the implementation, we justify the h-analysis through an abstract theoretical background.
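To illustrate role (2.) above, the partial derivatives entering \({\mathscr {N}}[u_\theta ;{\varvec{\lambda }}]\) can be obtained by repeated calls to `tf.gradients`; the placeholders and layer sizes below are illustrative, reusing the `neural_net` sketch from “PINNs”:

```python
x = tf.placeholder(tf.float32, shape=[None, 1])
t = tf.placeholder(tf.float32, shape=[None, 1])

u = neural_net(tf.concat([x, t], axis=1), layers=[2, 20, 20, 1])
u_t = tf.gradients(u, t)[0]      # du/dt via reverse-mode AD
u_x = tf.gradients(u, x)[0]      # du/dx
u_xx = tf.gradients(u_x, x)[0]   # d2u/dx2, differentiating once more
```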
Convergence estimates
To better understand how PINNs scale with N, we follow the method in^{22,23} under a simple setting, allowing us to control the data and physics counterparts in PIML. Set \(s \ge 0\) and define the spaces:
We assume that Eq. (1) can be recast as:
We suppose that Eq. (11) is well-posed and that for any \(u,v \in {\hat{X}}\), there holds that:
Equation (12) is a stability estimate, allowing the total error to be controlled by means of a bound on the PINN residual. The residuals in Eq. (3) are:
From the expression of residuals, we are interested in approximating integrals:
We assume that we are provided with quadratures:
for weights \({w^i_D},{w^i_u}\) and quadrature points \({\tau _D^i}, {\tau _u^i} \in {\mathbb {D}}\) such that for \(\alpha ,\beta > 0\):
For any \(\upomega _u>0\), the loss is defined as follows:
with \(\varepsilon _{T,D}\) and \(\varepsilon _{T,u}\) the training errors for collocation points and observations, respectively.
Notice that application of Eq. (17) to \(\xi _{D,\theta }\) and \(\xi _{u,\theta }\) yields:
We seek to quantify the generalization error:
We detail a new result concerning the generalization error for PINNs.
Theorem 1
(Generalization error for generic PINNs) Under the presented setting, there holds that:
with \({\hat{\mu }}:= \Vert u-u_{obs} \Vert _X\).
Proof
Consider the setting of Theorem 1. There holds that:
\(\square \)
The novelty of Theorem 1 is that it describes the generalization error for a simple case involving collocation points and observations. It states that the PINN generalizes well as long as the training error is low and sufficient training points are used. To make the result more intuitive, we rewrite Eq. (21), with \(\sim \) expressing the terms up to positive constants (denoting by \(\varepsilon _G\) the generalization error):
\(\varepsilon _G \sim \varepsilon _{T,D} + \varepsilon _{T,u} + {\hat{\mu }} + {\hat{N}}^{-\alpha /2} + M^{-\beta /2}.\)
The generalization error depends on the training errors (which are tractable during training), parameters \({\hat{N}}\) and M and bias \({\hat{\mu }}\).
Returning to h-analysis, we now have a theoretical justification of the three regimes presented in the “Introduction”. Let us assume that \({\hat{\mu }}=0\). For small values of \({\hat{N}}\) or M, the bound in Theorem 1 is too large to yield a meaningful estimate. Subsequently, the convergence behaves as \(\max ({\hat{N}}^{-\alpha /2},M^{-\beta /2})\), marking the transition regime. It is paramount for practitioners to reach the permanent regime when training PINNs, giving ground to data-parallel PINNs.
In general applications, the exact solution u is not available. Moreover, it is relevant to determine whether N is large enough. To this end, we introduce a testing (or validation) set of the same cardinality. Interestingly, the entire analysis above and Theorem 1 remain valid for another set of testing points, with the testing errors \(\varepsilon _{V,D}\) and \(\varepsilon _{V,u}\) defined as in Eq. (18). The train-test gap, which is tractable, can be quantified as follows.
Theorem 2
(Train-test gap bound) Under the presented setting, there holds that:
Proof
Consider the setting of Theorem 2. For \(v \in \{D, u\}\) and \(\cdot \in \{ Y,X\}\) there holds that:
\(\square \)
The bound in Theorem 2 is valuable as it allows one to assess the quadrature error convergence—and the regime—with respect to the number of training points.
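In practice, this suggests a simple regime check: monitor the relative gap between the training and test losses. A hedged sketch follows (the helper name and the tolerance are our own choices, not prescribed by the theory):

```python
def train_test_gap(loss_train, loss_test, eps=1e-12):
    """Relative train-test gap; a small value at the best iteration
    suggests that N is large enough (permanent regime)."""
    return abs(loss_train - loss_test) / max(abs(loss_test), eps)

# e.g., in the 1D Laplace case below, this gap drops from ~1.6e-1
# (N = 64) to ~6.0e-5 (N = 400), signalling the permanent regime.
in_permanent_regime = train_test_gap(1.0e-4, 1.0002e-4) < 1e-2
```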
Data-parallel PINNs
Data distribution and Horovod
In this section, we present the data-parallel distribution for PINNs. Let us set \(\texttt {size}\in {\mathbb {N}}_1\) and define the ranks (or workers):
each rank generally corresponding to a GPU. Data-parallel distribution requires the appropriate partitioning of the training points across ranks.
We introduce \({\hat{N}}_1,M_1\in {\mathbb {N}}_1\) collocation points and observations, respectively, for each \(\texttt {rank}\) (e.g., a GPU) yielding:
with
The data-parallel approach is as follows: we send the same synchronized copy of the DNN \({\mathscr {N}}\!\!{\mathscr {N}}_\theta \) defined in Eq. (4) to each rank. Each rank evaluates the loss \({\mathscr {L}}^\texttt {rank}_\theta \) and the gradient \(\nabla _\theta {\mathscr {L}}^\texttt {rank}_\theta \). The gradients are then averaged using an allreduce operation, such as the ring-allreduce implemented in Horovod^{26,36}, which is known to be bandwidth optimal with respect to the number of ranks^{36}. The process is illustrated in Fig. 5 for \(\texttt {size}=4\). The ring-allreduce algorithm involves each of the \(\texttt {size}\) nodes communicating with two of its peers \(2 \times (\texttt {size}-1)\) times^{26}.
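For intuition, the communication pattern can be mimicked in plain NumPy. The toy simulation below is our own illustrative code (Horovod's production implementation runs on MPI/NCCL): each rank's gradient is split into `size` chunks, a reduce-scatter phase accumulates the sums, and an allgather phase circulates the reduced chunks, each phase taking size-1 steps:

```python
import numpy as np

def ring_allreduce_mean(grads):
    """Toy simulation of ring-allreduce gradient averaging."""
    size = len(grads)
    chunks = [np.array_split(np.asarray(g, dtype=float), size) for g in grads]
    # Reduce-scatter: after size-1 steps, rank r owns the fully
    # reduced chunk (r+1) % size.
    for step in range(size - 1):
        sent = [chunks[r][(r - step) % size].copy() for r in range(size)]
        for r in range(size):
            src = (r - 1) % size
            chunks[r][(src - step) % size] += sent[src]
    # Allgather: circulate reduced chunks until every rank has all of them.
    for step in range(size - 1):
        sent = [chunks[r][(r + 1 - step) % size].copy() for r in range(size)]
        for r in range(size):
            src = (r - 1) % size
            chunks[r][(src + 1 - step) % size] = sent[src]
    return [np.concatenate(c) / size for c in chunks]

# Sanity check: 4 ranks, 8-dimensional gradients.
gs = [np.arange(8) + r for r in range(4)]
avg = ring_allreduce_mean(gs)
assert np.allclose(avg[0], np.mean(gs, axis=0))
```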
It is noteworthy that data generation for data-free PINNs (i.e., with \(M=0\)) requires no modification to existing codes, provided that each rank uses a different seed for random or pseudo-random sampling. Horovod allows one to apply data-parallel acceleration with minimal changes to existing code. Moreover, our approach and Horovod extend readily to multiple computing nodes. As pointed out in the “Introduction”, Horovod supports popular DL backends such as TensorFlow, PyTorch and Keras. In Listing 1, we demonstrate how to integrate data-parallel distribution using Horovod with a generic PINNs implementation in TensorFlow 1.x. The highlighted changes in pink show the steps for incorporating Horovod, which include: (i) initializing Horovod; (ii) pinning available GPUs to specific workers; (iii) wrapping the optimizer in the Horovod distributed optimizer; and (iv) broadcasting the initial variables from the master rank (\(\texttt {rank}= 0\)) to all workers.
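A condensed sketch of these four steps for a generic TensorFlow 1.x training script is given below; the dummy loss is a stand-in for the PINN loss \({\mathscr {L}}^\texttt {rank}_\theta \) built from the rank's data shard, and the learning rate and iteration count are placeholders of ours:

```python
import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

tf.disable_eager_execution()
hvd.init()                                           # (i) initialize Horovod

config = tf.ConfigProto()                            # (ii) pin one GPU per rank
config.gpu_options.visible_device_list = str(hvd.local_rank())

# Stand-in for the PINN loss assembled from this rank's training points.
w = tf.get_variable("w", [1], initializer=tf.zeros_initializer())
loss = tf.reduce_mean(tf.square(w - 1.0))

opt = tf.train.AdamOptimizer(learning_rate=1e-3)
opt = hvd.DistributedOptimizer(opt)                  # (iii) allreduce-averaged grads
train_op = opt.minimize(loss)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))      # (iv) sync weights from rank 0
    for _ in range(1000):
        sess.run(train_op)
```

Such a script would be launched with, e.g., `horovodrun -np 8 python train.py`.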
Weak and strong scaling
Two key concepts in data distribution paradigms are weak and strong scaling: weak scaling increases the problem size proportionally with the number of processors, while strong scaling keeps the problem size fixed while increasing the number of processors. In our setting:

Weak scaling: Each worker has \(({\hat{N}}_1,M_1)\) training points, and we increase the number of workers \(\texttt {size}\);

Strong scaling: We set a fixed total number of \(({\hat{N}}_1,M_1)\) training points, and we split the data over increasing \(\texttt {size}\) workers.
We portray weak and strong scaling in Fig. 6 for a data-free PINN with \({\hat{N}}_1=16\). Each box represents a GPU, with the number of collocation points encoded by color. To the left of each scaling option, we present the unaccelerated case. Finally, we introduce the training time \(t_\texttt {size}\) for \(\texttt {size}\) workers. This allows us to define the efficiency and speedup as:
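A plausible reading of these definitions, consistent with the figures reported below (where speedup equals \(\texttt {size}\times E_\text {ff}\) in both modes), can be sketched as:

```python
def weak_efficiency(t_1, t_size):
    """Weak scaling: per-rank workload fixed, so ideally t_size == t_1."""
    return t_1 / t_size

def strong_efficiency(t_1, t_size, size):
    """Strong scaling: total workload fixed, so ideally t_size == t_1 / size."""
    return t_1 / (size * t_size)

def speedup(efficiency, size):
    """S = size * E_ff in both scaling modes."""
    return size * efficiency

# e.g., the Schrodinger case below reports E_ff = 77.22% at size = 8,
# i.e., a speedup of about 8 * 0.7722 = 6.18.
```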
Numerical experiments
Throughout, we apply our procedure to three cases of interest:

“1D Laplace” equation (forward problem);

“1D Schrödinger” equation (forward problem);

“2D Navier–Stokes” equation (inverse problem).
For each case, we perform an h-analysis followed by Horovod data-parallel acceleration, which is applied to the domain training points (and observations for the Navier–Stokes case). Boundary loss terms are negligible due to the sufficient number of boundary data points.
Methodology
We perform simulations in single (float32) precision on an AMAX DLE48A AMD Rome EPYC server with 8 Quadro RTX 8000 Nvidia GPUs, each with 48 GB of memory. We use a Docker image of Horovod 0.26.1 with CUDA 12.1, Python 3.6.9 and TensorFlow 2.6.2. Throughout, we use tensorflow.compat.v1 as a backend without eager execution.
All results are available in the HorovodPINNs GitHub repository and are fully reproducible, ensuring compliance with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management and stewardship^{37}. We run the experiments 8 times, with seeds defined as:
in order to obtain rank-varying training points. For domain points, Latin Hypercube Sampling is performed with pyDOE 0.3.8. Boundary points are defined over uniform grids.
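A hedged sketch of this sampling step with pyDOE follows; the helper name and the exact seed offset are ours (any rank-dependent seed yields rank-varying points):

```python
import numpy as np
from pyDOE import lhs

def sample_domain_points(N_f, lb, ub, seed):
    """Draw N_f collocation points in the box [lb, ub] via Latin
    Hypercube Sampling, with a per-rank seed for rank-varying points."""
    np.random.seed(seed)
    lb, ub = np.asarray(lb), np.asarray(ub)
    return lb + (ub - lb) * lhs(lb.size, samples=N_f)

# e.g., (x, t) in [-5, 5] x [0, pi/2] for the 1D Schrodinger case;
# add, e.g., hvd.rank() to the seed for distinct points per rank.
X_f = sample_domain_points(2000, [-5.0, 0.0], [5.0, np.pi / 2], seed=1234)
```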
We use Glorot uniform initialization (see^{3}, Chapter 8). “Error” refers to the relative \(L^2\)-error taken over \({\mathscr {T}}^{test} \), and “Time” stands for the training time in seconds. For each case, the loss in Eq. (7) uses unit weights \(\upomega _v=1\) and the Monte-Carlo quadrature rule \(w_v^i = \frac{1}{N_v}\) for \(v\in \Lambda \). Also, we denote by \(\text {vol}({\mathbb {D}})\) the volume of the domain \({\mathbb {D}}\) and define the precision \(\rho := (N_f/\text {vol}({\mathbb {D}}))^{1/\dim ({\mathbb {D}})}\), i.e., the number of training points per unit per dimension.
We introduce \(t^k\), the time required to perform k iterations. The number of training points processed per second is then \(kN/t^k\).
For the sake of simplicity, we summarize the parameters and hyperparameters for each case in Table 1.
1D Laplace
We first consider the 1D Laplace equation in \(D = [-1,7]\):
Note that the exact solution is \(u(x) = \sin (\pi x)\). We solve the problem for:
We set \(N_g=2\) and \(N_u = 0\). Points in \({\mathscr {T}}_D\) are generated randomly over D, and \({\mathscr {T}}_b = \{-1,7\}\). The residual in Eq. (3):
yields the loss:
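A minimal sketch of this residual and loss in TensorFlow follows, reusing the `neural_net` sketch from “PINNs”. It assumes the strong form \(-u''(x) = \pi ^2 \sin (\pi x)\) with homogeneous Dirichlet BCs, which is consistent with the exact solution but stated here as our assumption; layer sizes are illustrative:

```python
import numpy as np

x_f = tf.placeholder(tf.float32, shape=[None, 1])   # collocation points in D
x_b = tf.constant([[-1.0], [7.0]])                  # boundary points T_b

u = neural_net(x_f, layers=[1, 30, 30, 1])
u_x = tf.gradients(u, x_f)[0]
u_xx = tf.gradients(u_x, x_f)[0]

# Residual xi_f = u'' + pi^2 sin(pi x), assuming -u'' = pi^2 sin(pi x).
xi_f = u_xx + np.pi ** 2 * tf.sin(np.pi * x_f)

# Same weights reused at the boundary (AUTO_REUSE); u vanishes at x = -1, 7.
u_b = neural_net(x_b, layers=[1, 30, 30, 1])
loss = tf.reduce_mean(tf.square(xi_f)) + tf.reduce_mean(tf.square(u_b))
```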
h-Analysis
We perform the h-analysis for the error, as portrayed before in Fig. 1. The transition regime occurs between \(N_f={64}\) and \(N_f={400}\), with a precision of \(\rho =8\), in accordance with general results for the h-analysis of traditional solvers. The permanent regime shows a slight improvement in accuracy, with the mean “Error” dropping from \(8.75 \times 10^{-3}\) for \(N_f=400\) to \(3.91 \times 10^{-3}\) for \(N=65{,}536\). To complete the h-analysis, Fig. 7 shows the convergence of the ADAM optimizer for all values of \(N_f\). This plot reveals that each regime exhibits similar patterns. The high variability in convergence during the transition regime is particularly interesting, with some runs converging and others not. In the permanent regime, the convergence shows almost identical and stable patterns irrespective of \(N_f\).
Furthermore, we plot the training and test losses in Fig. 8. Note that the validation loss and “Error” show similar behaviors. We use this figure as a reference to define each transition regime. In particular, it hints that the permanent regime is reached for \(N=400\), as the relative error at the best iteration between \({\mathscr {L}}_\theta ^\text {train}\) and \({\mathscr {L}}_\theta ^\text {test}\) drops from \(1.60 \times 10^{-1}\) to \(6.02 \times 10^{-5}\). For the sake of precision, the value for \(N=512\) is \(2.20 \times 10^{-5}\).
Data-parallel implementation
We set \(N_{f,1}\equiv N_1= {64}\) and compare both weak and strong scaling to the original implementation, referred to as “no scaling”.
We provide a detailed description of Fig. 9, as it will serve as the basis for future cases:

Left-hand side: Error for \(\texttt {size}^* \in \{1,2,4,8\}\), corresponding to \(N_f\in \{64,128,256,512\}\);

Middle: Error for weak scaling with \(N_{1}=64\) and for \(\texttt {size}\in \{1,2,4,8\}\);

Right-hand side: Error for strong scaling with \(N_{1}={512}\) and for \(\texttt {size}\in \{1,2,4,8\}\).
To reduce ambiguity, we use the \(*\) superscript for no scaling, as \(\texttt {size}^*\) is performed over 1 rank. The color of each violin box in the figure corresponds to the number of domain collocation points used on each GPU.
Figure 9 demonstrates that both weak and strong scaling yield convergence results similar to their unaccelerated counterparts. This is one of the main findings of this work: PINNs scale properly with respect to accuracy, validating the intuition behind h-analysis and justifying the data-parallel approach. This allows one to move from the pre-asymptotic to the permanent regime via weak scaling, or to amortize the cost of a permanent-regime application by dispatching the training points over different workers. Furthermore, the hyperparameters, including the learning rate, remain unchanged.
Next, we summarize the data-parallelization results in Table 2 with respect to \(\texttt {size}\).
In the first column, we present the time required to run 500 ADAM iterations, referred to as \(t^\text {500}\). This value is averaged over one run of 30,000 iterations with a warm-up of 1500 iterations (i.e., we discard the values corresponding to iterations 0, 500 and 1000). We present the resulting mean value ± standard deviation. The second column displays the efficiency of the run, evaluated with respect to \(t^\text {500}\).
Table 2 reveals that data-parallel acceleration incurs additional training time, as anticipated. Weak scaling efficiency varies from \({78.01}\%\) for \(\texttt {size}={2}\) to \({66.67}\%\) for \(\texttt {size}=8\), resulting in a speedup of 5.47 when using 8 GPUs. Strong scaling shows similar behavior. Furthermore, it can be observed that \(\texttt {size}=1\) yields almost equal \(t^{500}\) for \(N_f={64}\) (4.08 s) and \(N_f={512}\) (4.19 s).
1D Schrödinger
We solve the nonlinear Schrödinger equation with periodic BCs (refer to^{10}, Section 3.1.1), posed over \({\overline{D}}\times {[0, T ] }\) with \(D:=(-5,5)\) and \(T := \pi / 2\):
where \(u(x,t) = u^0(x,t) + \imath u^1(x,t)\). We apply PINNs to the system with \(m=2\) in Eq. (4), i.e., \((u^0_\theta ,u^1_\theta ) \in {\mathbb {R}}^m={\mathbb {R}}^2\). We set \(N_g=N_\hbar = 200\).
h-Analysis
To begin with, we perform the h-analysis, with results portrayed in Fig. 10. Again, the transition regime begins at a density of \(\rho =5\) and spans between \(N_f=350\) and \(N_f=4000\). At higher magnitudes of \(N_f\), the error remains approximately the same. To further illustrate this more complex case, we present the total execution time distribution in Fig. 11.
We note that the training times remain stable for \(N_f\le 4000\). This observation is important, as it adds execution time as a further parameter to the analysis: our work focuses primarily on the error in the h-analysis, but execution time and memory requirements are also relevant considerations. In this case, weak scaling is not necessary (the optimal option is to use \(\texttt {size}= 1\) and \(N_f = 4000\)). Alternatively, strong scaling can be applied with \(N_{f,1} = 4000\).
To gain further insight into the variability of the transition regime, we focus on \(N_f = 1000\). We compare the solutions for \(\texttt {seed}= 1234\) and \(\texttt {seed}= 1236\) in Fig. 12. The upper figures depict u(t, x) predicted by the PINN. The lower figures compare the exact and predicted solutions at \(t\in \{0.59,0.79,0.98\}\). It is evident that the solution for \(\texttt {seed}=1234\) closely resembles the exact solution, whereas the solution for \(\texttt {seed}=1236\) fails to accurately represent the solution near \(x=0\), thereby illustrating the importance of reaching the permanent regime. Next, we show the training and testing losses in Fig. 13. We remark that the training and test losses converge for \(N_f >200\). Analysis of the train-test gap shows that it converges as \({\mathscr {O}}(N_f^{-1})\). Visually, one can assume that the losses are close enough for \(N_f=4000\) (or \(N_f=8000\)), in accordance with the h-analysis performed in Fig. 10.
Data-parallel implementation
We compare the error for the unaccelerated and data-parallel implementations for \(N_{f,1}\equiv N_{1}=500\) in Fig. 14, analogous to the analysis in Fig. 9. Again, the error is stable with \(\texttt {size}\): no scaling and weak scaling behave similarly, and strong scaling is unaffected by \(\texttt {size}\). We plot the training time in Fig. 15. Both weak and strong scaling increase the training time linearly and slightly with \(\texttt {size}\), with similar behavior. Figure 16 portrays the number of training points processed per second [refer to Eq. (24)] and the efficiency with respect to \(\texttt {size}\), with white bars representing ideal scaling. The efficiency \(E_\text {ff}\) decreases gradually with \(\texttt {size}\), with results surpassing those of the previous section: for \(\texttt {size}=8\) it reaches 77.22% and 76.20% for weak and strong scaling, respectively, representing speedups of 6.18 and 6.10.
Inverse problem: Navier–Stokes equation
We consider the Navier–Stokes problem with \(D := [1,8] \times [-2,2]\), \(T=20\), and unknown parameters \(\uplambda _1,\uplambda _2 \in {\mathbb {R}}\). The resulting divergence-free Navier–Stokes system is expressed as follows:
wherein u(x, y, t) and v(x, y, t) are the x- and y-components of the velocity field, and p(x, y, t) is the pressure. We assume that there exists \(\varphi (x,y,t)\) such that:
Under Eq. (27), the last row in Eq. (26) is satisfied automatically. This leads to the definition of the residuals:
We introduce \(N_f\) pseudo-random points \(\textbf{x}^i \in D\) and observations \((u^i,v^i)= (u (\textbf{x}^i), v (\textbf{x}^i)) \), yielding the loss:
with
Throughout this case, we have \(N_f=M\), and we plot the results with respect to \(N_f\). Note that \(N = 2 N_f\), and that both \(\lambda _1,\lambda _2\) and the BCs are unknown.
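A hedged sketch of this stream-function formulation follows, in the spirit of the classic Raissi implementation; variable names and layer sizes are ours, and it reuses the `neural_net` and trainable `lambda_1`, `lambda_2` sketches from earlier sections. The network outputs \((\varphi , p)\), and \(u = \partial _y \varphi \), \(v = -\partial _x \varphi \) satisfy the divergence-free constraint by construction:

```python
x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])
t = tf.placeholder(tf.float32, [None, 1])
u_obs = tf.placeholder(tf.float32, [None, 1])
v_obs = tf.placeholder(tf.float32, [None, 1])

phi_p = neural_net(tf.concat([x, y, t], axis=1), layers=[3, 20, 20, 2],
                   scope="ns")
phi, p = phi_p[:, 0:1], phi_p[:, 1:2]
u = tf.gradients(phi, y)[0]        # u = dphi/dy
v = -tf.gradients(phi, x)[0]       # v = -dphi/dx

u_t, u_x, u_y = (tf.gradients(u, z)[0] for z in (t, x, y))
v_t, v_x, v_y = (tf.gradients(v, z)[0] for z in (t, x, y))
u_xx, u_yy = tf.gradients(u_x, x)[0], tf.gradients(u_y, y)[0]
v_xx, v_yy = tf.gradients(v_x, x)[0], tf.gradients(v_y, y)[0]
p_x, p_y = tf.gradients(p, x)[0], tf.gradients(p, y)[0]

# Momentum residuals with trainable lambda_1, lambda_2 (see "PINNs"):
f = u_t + lambda_1 * (u * u_x + v * u_y) + p_x - lambda_2 * (u_xx + u_yy)
g = v_t + lambda_1 * (u * v_x + v * v_y) + p_y - lambda_2 * (v_xx + v_yy)

# Loss: data misfit on (u, v) observations plus PDE residuals.
loss = (tf.reduce_mean(tf.square(u - u_obs)) +
        tf.reduce_mean(tf.square(v - v_obs)) +
        tf.reduce_mean(tf.square(f)) + tf.reduce_mean(tf.square(g)))
```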
h-Analysis
We conduct the h-analysis and show the error in Fig. 17; the behavior mirrors the previous cases. Surprisingly, the permanent regime is reached for as few as \(N_f=1000\) points, despite the problem being three-dimensional, nonlinear, and inverse. This corresponds to low values of \(\rho \), suggesting that PINNs can mitigate the curse of dimensionality: in fact, the permanent regime was achieved with only 1.21 points per unit per dimension. The total training time is presented in Fig. 18, where it can be seen to remain stable up to \(N_f=5000\) and then increase linearly with \(N_f\).
Again, we present the training and test losses in Fig. 19. The train-test gap is shown to decrease for \(N_f \ge 350\) as \({\mathscr {O}}(N_f^{-1})\). Furthermore, the train-test gap can be considered visually small enough for \(N_f = 1000\).
Data-parallel implementation
We run the data-parallel simulations, setting \(N_{f,1} \equiv N_{1} = M_1= 500\). As shown in Fig. 20, the simulations exhibit stable accuracy with \(\texttt {size}\). The execution time increases moderately with \(\texttt {size}\), as illustrated in Fig. 21. The training time decreases with N for the no-scaling case; however, this behavior is temporary (refer to Fig. 17). We conclude our analysis by plotting the efficiency with respect to \(\texttt {size}\) in Fig. 22. Efficiency decreases with increasing \(\texttt {size}\) but shows the best results so far, with 80.55% (resp. 86.31%) weak (resp. strong) scaling efficiency for \(\texttt {size}= 8\). For the sake of completeness, the weak efficiency for \(N_{f,1}=50{,}000\) and \(\texttt {size}=8\) improved to 86.15%. This encouraging result sets the stage for further exploration of more intricate applications.
Conclusion
In this work, we proposed a novel data-parallelization approach for PIML with a focus on PINNs. We provided a thorough h-analysis and associated theoretical results to support our approach, as well as practical guidelines to facilitate its implementation with Horovod data acceleration. Additionally, we ran reproducible numerical experiments to demonstrate the scalability of our approach. Future work includes the implementation of Horovod acceleration in the DeepXDE^{35} library, the coupling of localized PINNs with domain decomposition methods, and applications on larger GPU servers (e.g., with more than 100 GPUs).
Data availability
The code required to reproduce these findings is available for download from https://github.com/pescap/HorovodPINNs.
Change history
27 May 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-62284-9
References
Steinbach, O. Numerical Approximation Methods for Elliptic Boundary Value Problems: Finite and Boundary Elements (Springer, 2007).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Bengio, Y., Goodfellow, I. & Courville, A. Deep Learning Vol. 1 (MIT Press, 2017).
Karniadakis, G. E. et al. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440. https://doi.org/10.1038/s42254-021-00314-5 (2021).
You, H., Yu, Y., Trask, N., Gulian, M. & D'Elia, M. Data-driven learning of nonlocal physics from high-fidelity synthetic data. Comput. Methods Appl. Mech. Eng. 374, 113553. https://doi.org/10.1016/j.cma.2020.113553 (2021).
Sun, L., Gao, H., Pan, S. & Wang, J.-X. Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data. Comput. Methods Appl. Mech. Eng. 361, 112732. https://doi.org/10.1016/j.cma.2019.112732 (2020).
Lai, Z. et al. Neural Modal ODEs: Integrating physics-based modeling with neural ODEs for modeling high dimensional monitored structures. http://arxiv.org/abs/2207.07883 (2022).
Lai, Z., Mylonas, C., Nagarajaiah, S. & Chatzi, E. Structural identification with physicsinformed neural ordinary differential equations. J. Sound Vib. 508, 116196. https://doi.org/10.1016/j.jsv.2021.116196 (2021).
Alber, M. et al. Integrating machine learning and multiscale modeling: Perspectives, challenges, and opportunities in the biological, biomedical, and behavioral sciences. NPJ Digit. Med. 2, 1–11. https://doi.org/10.1038/s41746-019-0193-y (2019).
Raissi, M., Perdikaris, P. & Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045 (2019).
Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 18, 5595–5637. https://doi.org/10.5555/3122009.3242010 (2017).
Chen, Y., Lu, L., Karniadakis, G. E. & Negro, L. D. Physicsinformed neural networks for inverse problems in nanooptics and metamaterials. Opt. Express 28, 11618–11633. https://doi.org/10.1364/OE.384875 (2020).
Chen, X., Duan, J. & Karniadakis, G. E. Learning and metalearning of stochastic advectiondiffusionreaction systems from sparse measurements. Eur. J. Appl. Math. 32, 397–420. https://doi.org/10.1017/S0956792520000169 (2021).
Meng, X. & Karniadakis, G. E. A composite neural network that learns from multifidelity data: Application to function approximation and inverse PDE problems. J. Comput. Phys. 401, 109020. https://doi.org/10.1016/j.jcp.2019.109020 (2020).
Li, R., Wang, J.-X., Lee, E. & Luo, T. Physics-informed deep learning for solving phonon Boltzmann transport equation with large temperature non-equilibrium. NPJ Comput. Mater. 8, 1–10. https://doi.org/10.1038/s41524-022-00712-y (2022).
Yang, X. I. A., Zafar, S., Wang, J.-X. & Xiao, H. Predictive large-eddy-simulation wall modeling via physics-informed neural networks. Phys. Rev. Fluids 4, 034602. https://doi.org/10.1103/PhysRevFluids.4.034602 (2019).
Zhang, D., Lu, L., Guo, L. & Karniadakis, G. E. Quantifying total uncertainty in physicsinformed neural networks for solving forward and inverse stochastic problems. J. Comput. Phys. 397, 108850. https://doi.org/10.1016/j.jcp.2019.07.048 (2019).
Escapil-Inchauspé, P. & Ruz, G. A. Physics-informed neural networks for operator equations with stochastic data. https://doi.org/10.48550/ARXIV.2211.10344 (2022).
Wang, S., Yu, X. & Perdikaris, P. When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys. 449, 110768. https://doi.org/10.1016/j.jcp.2021.110768 (2022).
Escapil-Inchauspé, P. & Ruz, G. A. Hyperparameter tuning of physics-informed neural networks: Application to Helmholtz problems. Neurocomputing 1, 126826. https://doi.org/10.1016/j.neucom.2023.126826 (2023).
Shukla, K., Xu, M., Trask, N. & Karniadakis, G. E. Scalable algorithms for physics-informed neural and graph networks. Data-Centric Eng. 3, e24. https://doi.org/10.1017/dce.2022.24 (2022).
Mishra, S. & Molinaro, R. Estimates on the generalization error of physics-informed neural networks for approximating PDEs. IMA J. Numer. Anal. https://doi.org/10.1093/imanum/drab093 (2022).
Mishra, S. & Molinaro, R. Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs. IMA J. Numer. Anal. 42, 981–1022. https://doi.org/10.1093/imanum/drab032 (2021).
Shin, Y., Darbon, J. & Karniadakis, G. E. On the convergence of physics informed neural networks for linear secondorder elliptic and parabolic type PDEs. http://arxiv.org/abs/2004.01806 (2020).
Khoo, Y., Lu, J. & Ying, L. Solving parametric PDE problems with artificial neural networks. Eur. J. Appl. Math. 32, 421–435 (2021).
Sergeev, A. & Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. http://arxiv.org/abs/1802.05799 (2018).
Jagtap, A. D., Kharazmi, E. & Karniadakis, G. E. Conservative physics-informed neural networks on discrete domains for conservation laws: Applications to forward and inverse problems. Comput. Methods Appl. Mech. Eng. 365, 113028. https://doi.org/10.1016/j.cma.2020.113028 (2020).
Jagtap, A. D. & Karniadakis, G. E. Extended physics-informed neural networks (XPINNs): A generalized space-time domain decomposition based deep learning framework for nonlinear partial differential equations. Commun. Comput. Phys. 28, 2002–2041. https://doi.org/10.4208/cicp.OA-2020-0164 (2020).
Hu, Z., Jagtap, A. D., Karniadakis, G. E. & Kawaguchi, K. When do extended physics-informed neural networks (XPINNs) improve generalization? SIAM J. Sci. Comput. 44, A3158–A3182. https://doi.org/10.1137/21M1447039 (2022).
Dwivedi, V., Parashar, N. & Srinivasan, B. Distributed physics informed neural network for dataefficient solution to partial differential equations. http://arxiv.org/abs/1907.08967 (2019).
Shukla, K., Jagtap, A. D. & Karniadakis, G. E. Parallel physicsinformed neural networks via domain decomposition. J. Comput. Phys. 447, 110683. https://doi.org/10.1016/j.jcp.2021.110683 (2021).
McClenny, L. D., Haile, M. A. & Braga-Neto, U. M. TensorDiffEq: Scalable multi-GPU forward and inverse solvers for physics informed neural networks. http://arxiv.org/abs/2103.16034 (2021).
Hennigh, O. et al. NVIDIA SimNet™: An AIaccelerated multiphysics simulation framework. In Computational Science–ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part V, 447–461 (Springer, 2021).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. http://arxiv.org/abs/1412.6980 (2014).
Lu, L., Meng, X., Mao, Z. & Karniadakis, G. E. DeepXDE: A deep learning library for solving differential equations. SIAM Rev. 63, 208–228. https://doi.org/10.1137/19M1274067 (2021).
Patarasuk, P. & Yuan, X. Bandwidth optimal allreduce algorithms for clusters of workstations. J. Parallel Distrib. Comput. 69, 117–124 (2009).
Wilkinson, M. D. et al. The FAIR guiding principles for scientific data management and stewardship. Sci. Data 3, 1–9 (2016).
Acknowledgements
The authors would like to thank the Data Observatory Foundation, ANID FONDECYT 1230315, ANID FONDECYT 3230088, the FES-UAI postdoc grant, ANID PIA/BASAL FB0002, and ANID/PIA/ANILLOS ACT210096 for financially supporting this research.
Author information
Authors and Affiliations
Contributions
P.E.I. conceived the experiment(s). All authors analyzed the results and reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this Article was revised: The original version of this Article contained errors in section ‘General Notation’, Equation 3, Equation 13, Equation 14, Equation 21, section ‘Proof Consider the setting of Theorem 1.’ and in section ‘Methodology’, where notations were incorrect. Full information regarding the corrections made can be found in the correction for this Article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Escapil-Inchauspé, P., Ruz, G.A. h-Analysis and data-parallel physics-informed neural networks. Sci Rep 13, 17562 (2023). https://doi.org/10.1038/s41598-023-44541-5
DOI: https://doi.org/10.1038/s41598-023-44541-5