Abstract
We explore the data-parallel acceleration of physics-informed machine learning (PIML) schemes, with a focus on physics-informed neural networks (PINNs) for multi-GPU (graphics processing unit) architectures. In order to develop scale-robust and high-throughput PIML models for sophisticated applications which may require a large number of training points (e.g., involving complex and high-dimensional domains, non-linear operators or multi-physics), we detail a novel protocol based on h-analysis and data-parallel acceleration through the Horovod training framework. The protocol is backed by new convergence bounds for the generalization error and the train-test gap. We show that the acceleration is straightforward to implement, does not compromise training, and proves to be highly efficient and controllable, paving the way towards generic scale-robust PIML. Extensive numerical experiments with increasing complexity illustrate its robustness and consistency, offering a wide range of possibilities for real-world simulations.
Introduction
Simulating physics through accurate surrogates is a hard task for engineers and computer scientists. Numerical methods such as finite element methods, finite difference methods and spectral methods can be used to approximate the solution of partial differential equations (PDEs) by representing it in a finite-dimensional function space, delivering an approximation to the desired solution or mapping1.
Real-world applications often incorporate partial information of the physics along with observations, which can be noisy. This hints at using data-driven solutions through machine learning (ML) techniques. In particular, deep learning (DL)2,3 principles have been praised for their good performance, granted by the capability of deep neural networks (DNNs) to approximate high-dimensional and non-linear mappings and to offer great generalization with large datasets. Furthermore, the exponential growth of GPU capabilities has made it possible to implement even larger DL models.
Recently, a novel paradigm called physics-informed machine learning (PIML)4 was introduced to bridge the gap between data-driven5 and physics-based6 frameworks. PIML enhances the capability and generalization power of ML by adding prior information on physical laws to the scheme, restricting the output space (e.g., via additional constraints or a regularization term). This simple yet general approach has been applied successfully to a wide range of complex real-world applications, including structural mechanics7,8 and the biological, biomedical and behavioral sciences9.
In particular, physics-informed neural networks (PINNs)10 consist in applying PIML by means of DNNs. They encode the physics in the loss function and rely on automatic differentiation (AD)11. PINNs have been used to solve inverse problems12, stochastic PDEs13,14, complex applications such as the Boltzmann transport equation15 and large-eddy simulations16, and to perform uncertainty quantification17,18.
Concerning the challenges faced by the PINNs community, efficient training19, proper hyper-parameters setting20, and scaling PINNs21 are of particular interest. Regarding the latter, two research areas are gaining attention.
First, it is important to understand how PINNs behave for an increasing number of training points N (or equivalently, for a suitable bounded and fixed domain, a decreasing maximum distance between points h). Throughout this work, we refer to this study as h-analysis: the analysis of the amount of training data needed to obtain a stable generalization error. In their pioneering works22,23, Mishra and Molinaro provided a bound for the generalization error with respect to N for data-free and unique continuation problems, respectively. More precise bounds have been obtained using characterizations of the DNN24.
Second, PINNs are typically trained over graphics processing units (GPUs), which have limited memory capabilities. To ensure models scale well with increasingly complex settings, two paradigms emerge: data-parallel and model-parallel acceleration. The former splits the training data over different workers, while the latter distributes the model weights. However, general DL backends do not readily support multi-GPU acceleration. Horovod25 addresses this issue: it is a distributed training framework specifically designed for DL, featuring a ring-allreduce algorithm26 and implementations for TensorFlow, Keras and PyTorch.
As model size becomes prohibitive, domain decomposition-based approaches allow for distributing the computational domain. Examples of such approaches include conservative PINNs (cPINNs)27, extended PINNs (XPINNs)28,29, and distributed PINNs (DPINNs)30. cPINNs and XPINNs were compared in31. These approaches are compatible with data-parallel acceleration within each subdomain. Additionally, a recent review concerning distributed PIML21 is also available. Regarding existing data-parallel implementations, the TensorFlow MirroredStrategy in TensorDiffEq32 and NVIDIA Modulus33 should be mentioned. However, to the authors' knowledge, there is no systematic study of the background of data-parallel PINNs and their implementation.
In this work, we present a procedure to attain data-parallel efficient PINNs. It relies on h-analysis and is backed by a Horovod-based acceleration. Concerning h-analysis, we observe that PINNs exhibit three phases of behavior as a function of the number of training points N:
1. A pre-asymptotic regime, where the model does not learn the solution due to missing information;
2. A transition regime, where the error decreases with N;
3. A permanent regime, where the error remains stable.
To illustrate this, Fig. 1 presents the relative \(L^2\) error distribution with respect to \(N_f\) (number of domain collocation points) for the forward “1D Laplace” case. The experiment was conducted over 8 independent runs with a learning rate of \(10^{-4}\) and 20,000 iterations of the ADAM34 algorithm. The transition regime—where variability in the results is high and some models converge while others do not—lies between \(N_f={64}\) and \(N_f={400}\). For more information on the experimental setting and the definition of the precision \(\rho \), please refer to “1D Laplace”.
Building on the empirical observations, we use the setting in22,23 to supply a rigorous theoretical background to h-analysis. One of the main contributions of this manuscript is the bound on the “Generalization error for generic PINNs”, which allows for a simple analysis of the h-dependence. Furthermore, this bound is accompanied by a practical “Train-test gap bound”, supporting regimes detection.
To summarize the latter results, a simple yet powerful recipe for any PIML scheme could be:
1. Choose the right model and hyper-parameters to achieve a low training loss;
2. Use enough training points N to reach the permanent regime (e.g., such that the training and test losses are similar).
Any practitioner strives to reach the permanent regime for their PIML scheme, and we provide the necessary details for an easy implementation of Horovod-based data acceleration for PINNs, with direct application to any PIML model. Figure 2 (left) further illustrates the scope of data-parallel PIML. For the sake of clarity, Fig. 2 (right) supplies a comprehensive review of important notations defined throughout this manuscript, along with their corresponding introductions.
Next, we apply the procedure to increasingly complex problems and demonstrate that Horovod acceleration is straightforward, using the pioneering PINNs code of Raissi as an example. Our main practical findings concerning data-parallel PINNs for up to 8 GPUs are the following:
- They do not require modifying the hyper-parameters;
- They show training convergence similar to the 1-GPU case;
- They lead to high efficiency for both weak and strong scaling (e.g., \(E_\text {ff} {> 80\% }\) for the Navier–Stokes problem with 8 GPUs).
This work is organized as follows: In “Problem formulation”, we introduce the PDEs under consideration, PINNs and convergence estimates for the generalization error. We then move to “Data-parallel PINNs” and present “Numerical experiments”. Finally, we close this manuscript in “Conclusion”.
Problem formulation
General notation
Throughout, vectors and matrices are expressed using bold symbols. For a natural number k, we set \({\mathbb {N}}_k:= \{k,k+1,\ldots \}\). For \(p \in {\mathbb {N}}_0 = \{0,1,\ldots \}\), and an open set \(D\subseteq {\mathbb {R}}^d\) with \(d\in {\mathbb {N}}_1\), let \(L^p(D)\) be the standard class of functions with bounded \(L^p\)-norm over D. Given \(s\in {\mathbb {R}}^+\), we refer to1, Section 2 for the definitions of the Sobolev function spaces \(H^s(D)\). Norms are denoted by \(\Vert \cdot \Vert \), with subscripts indicating the associated functional spaces. For a finite set \({\mathscr {T}}\), we introduce the notation \(|{\mathscr {T}}| := \text {card} ({\mathscr {T}})\), closed subspaces are denoted by a \(\subset _{\text {cl}}\)-symbol, and \(\imath ^2=-1\).
Abstract PDE
In this work, we consider a domain \(D \subset {\mathbb {R}}^d\), \(d \in {\mathbb {N}}_1\), with boundary \(\Gamma = \partial D\). For any \(T>0\), \({\mathbb {D}}:= D\times [0,T]\), we solve a general non-linear PDE of the form:
with \({\mathscr {N}}\) a spatio-temporal differential operator, \({\mathscr {B}}\) the boundary conditions (BCs) operator, \({\varvec{\lambda }}\) the material parameters—the latter being unknown for inverse problems—and \(u(\textbf{x},t)\in {\mathbb {R}}^m\) for any \(m\in {\mathbb {N}}_1\). Accordingly, for any function \({\hat{u}}\) defined over \({\mathbb {D}}\), we introduce
and define the residuals \(\xi _v\) for each \(v \in \Lambda \) and any observation function \(u_{obs} \):
PINNs
Following20,35, let \(\sigma \) be a smooth activation function. Given an input \((\textbf{x},t) \in {\mathbb {R}}^{d+1}\), we define \({\mathscr {N}}\!\!{\mathscr {N}}_\theta \) as an L-layer feed-forward neural network with \(W_0= d+ 1\), \(W_L = m\) and \(W_l\) neurons in the l-th layer for \(1 \le l \le L-1\). For constant-width DNNs, we set \(W=W_1=\cdots = W_{L-1}\). For \(1 \le l \le L\), let us denote the weight matrix and bias vector in the l-th layer by \(\textbf{W}^l \in {\mathbb {R}}^{W_l \times W_{l-1}}\) and \(\textbf{b}^l \in {\mathbb {R}}^{W_l}\), respectively, resulting in:
This results in representation \(\textbf{z}^L(\textbf{x},t)\), with
the (trainable) parameters—or weights—in the network. We set \(\Theta = {\mathbb {R}}^{|\Theta |}\). Application of PINNs to Eq. (1) yields the approximation \(u_\theta (\textbf{x},t) = \textbf{z}^L(\textbf{x},t)\).
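For concreteness, a sketch of the standard feed-forward recursion in this notation (our rendering of the usual PINN architecture) reads:

$$\textbf{z}^0 = (\textbf{x},t), \qquad \textbf{z}^l = \sigma \big (\textbf{W}^l \textbf{z}^{l-1} + \textbf{b}^l\big ), \quad 1 \le l \le L-1, \qquad \textbf{z}^L = \textbf{W}^L \textbf{z}^{L-1} + \textbf{b}^L,$$

so that the output layer is linear and \(u_\theta \) inherits the smoothness of \(\sigma \) with respect to \((\textbf{x},t)\).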
We introduce the training dataset \({\mathscr {T}}_v:=\{\tau _v^i\}_{i=1}^{N_v}\), \(\tau _v^i \in D_v\), \(N_v \in {\mathbb {N}}\) for \(i = 1,\cdots , N_v, v \in \Lambda \) and observations \(u_\text {obs}(\tau _u^i)\), \(i=1,\cdots ,N_u\). Furthermore, to each training point \(\tau _v^i\) we associate a quadrature weight \(w_v^i>0\). All throughout this manuscript, we set:
Note that M (resp. \({\hat{N}}\)) represents the amount of information for the data-driven (resp. physics) part, by virtue of the PIML paradigm (refer to Fig. 2). The network weights \(\theta \) in Eq. (5) are trained (e.g., via ADAM optimizer34) by minimizing the weighted loss:
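A weighted least-squares sketch of this loss, consistent with the residuals of Eq. (3) and the quadrature weights above (the grouping is ours and may differ from the authors' Eq. (7)), is:

$${\mathscr {L}}_\theta \;=\; \sum _{v \in \Lambda } \upomega _v \sum _{i=1}^{N_v} w_v^i \, \big |\xi _v\big (u_\theta ; \tau _v^i\big )\big |^2,$$

with \(\upomega _v > 0\) user-chosen weights and \(\xi _v(u_\theta ;\cdot )\) the residual of Eq. (3) evaluated at the network output.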
We seek to obtain:
The formulation for PINNs addresses the cases with no data (i.e., \(M=0\)) or no physics (i.e., \({\hat{N}}=0\)), thus exemplifying the PIML paradigm. Furthermore, it is able to handle time-independent operators with only minor changes; a schematic representation of a forward time-independent PINN is shown in Fig. 3.
Our setting assumes that the material parameters \({\varvec{\lambda }}\) are known. If \(M>0\), one can solve the inverse problem by seeking:
Similarly, unique continuation problems23, which assume incomplete information for f, g and \(\hbar \), are solved with PINNs without changes. Indeed, “2D Navier–Stokes” combines a unique continuation problem with unknown parameters \(\lambda _1,\lambda _2\).
Automatic differentiation
We now give further details about back-propagation algorithms and their dual role in the context of PINNs:
1. Training the DNN by calculating \(\frac{\partial {\mathscr {L}}_\theta }{\partial \theta }\);
2. Evaluating the partial derivatives in \({\mathscr {N}}[u_\theta (\textbf{x},t);{\varvec{\lambda }}]\) and \({\mathscr {B}}[u_\theta (\textbf{x},t);{\varvec{\lambda }}]\) so as to compute the loss \({\mathscr {L}}_\theta \).
Both consist of a forward pass to evaluate the output \(u_\theta \) (and \({\mathscr {L}}_\theta \)), and a backward pass to evaluate the derivatives. To further elucidate back-propagation, we reproduce the informative diagram from11 in Fig. 4.
TensorFlow includes reverse-mode AD by default. Its cost is bounded in terms of \(|\Theta |\) for scalar-output NNs (i.e., for \(m=1\)). The application of back-propagation (and reverse-mode AD in particular) to any training point is independent of other information, such as neighboring points or the volume of training data. This allows for data-parallel PINNs. Before detailing its implementation, we justify the h-analysis through an abstract theoretical background.
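As a minimal sketch of this second role (assuming, for illustration, the 1D Laplace convention \(-u''=f\) with exact solution \(u(x)=\sin (\pi x)\); the network width and all names below are ours), reverse-mode AD in graph-mode TensorFlow evaluates the derivatives entering a residual as follows:

```python
import numpy as np
import tensorflow.compat.v1 as tf

tf.disable_eager_execution()

# Collocation points x and a small surrogate u_theta(x); the width is illustrative only.
x = tf.placeholder(tf.float32, shape=[None, 1])
u = tf.layers.dense(tf.layers.dense(x, 20, activation=tf.tanh), 1)

# Reverse-mode AD: first and second derivatives of u_theta with respect to x.
u_x = tf.gradients(u, x)[0]
u_xx = tf.gradients(u_x, x)[0]

# Residual xi_D = -u_xx - f with f(x) = pi^2 sin(pi x), so that u(x) = sin(pi x) solves -u'' = f.
f = np.pi**2 * tf.sin(np.pi * x)
residual = -u_xx - f

# Monte-Carlo quadrature with unit weights gives the physics part of the loss.
loss = tf.reduce_mean(tf.square(residual))

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    x_f = np.random.uniform(-1.0, 7.0, size=(64, 1)).astype(np.float32)
    print(sess.run(loss, feed_dict={x: x_f}))
```

The same pattern extends to any operator \({\mathscr {N}}\) expressible through compositions of differentiable TensorFlow operations.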
Convergence estimates
To better understand how PINNs scale with N, we follow the method in22,23 under a simple setting, which allows us to control the data and physics counterparts in PIML. Set \(s \ge 0\) and define the spaces:
We assume that Eq. (1) can be recast as:
We suppose that Eq. (11) is well-posed and that for any \(u,v \in {\hat{X}}\), there holds that:
Eq. (12) is a stability estimate, allowing one to control the total error by means of a bound on the PINN residuals. The residuals in Eq. (3) are:
From the expression of residuals, we are interested in approximating integrals:
We assume that we are provided quadratures:
for weights \({w^i_D},{w^i_u}\) and quadrature points \({\tau _D^i}, {\tau _u^i} \in {\mathbb {D}}\) such that for \(\alpha ,\beta > 0\):
For any \(\upomega _u>0\), the loss is defined as follows:
with \(\varepsilon _{T,D}\) and \(\varepsilon _{T,u}\) the training errors for collocation points and observations, respectively.
Notice that application of Eq. (17) to \(\xi _{D,\theta }\) and \(\xi _{u,\theta }\) yields:
We seek to quantify the generalization error:
We detail a new result concerning the generalization error for PINNs.
Theorem 1
(Generalization error for generic PINNs) Under the presented setting, there holds that:
with \({\hat{\mu }}: = \Vert u-u_{obs} \Vert _X\).
Proof
Consider the setting of Theorem 1. There holds that:
\(\square \)
The novelty of Theorem 1 is that it describes the generalization error for a simple case involving collocation points and observations. It states that the PINN generalizes well as long as the training error is low and that sufficient training points are used. To make the result more intuitive, we rewrite Eq. (21), with \(\sim \) expressing the terms up to positive constants:
The generalization error depends on the training errors (which are tractable during training), the parameters \({\hat{N}}\) and M, and the bias \({\hat{\mu }}\).
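Writing \({\mathscr {E}}_G\) for the generalization error, a simplified sketch of the bound (ours; constants and exact groupings omitted) is:

$${\mathscr {E}}_G \;\lesssim \; \varepsilon _{T,D} + \varepsilon _{T,u} + {\hat{\mu }} + {\hat{N}}^{-\alpha /2} + M^{-\beta /2},$$

so that once the training errors and the bias \({\hat{\mu }}\) are small, the quadrature terms \({\hat{N}}^{-\alpha /2}\) and \(M^{-\beta /2}\) dominate.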
Returning to h-analysis, we now have a theoretical justification of the three regimes presented in “Introduction”. Let us assume that \({\hat{\mu }}=0\). For small values of \({\hat{N}}\) or M, the bound in Theorem 1 is too large to yield a meaningful estimate. Subsequently, the convergence behaves as \(\max ({\hat{N}}^{-\alpha /2},M^{-\beta /2})\), marking the transition regime. It is paramount for practitioners to reach the permanent regime when training PINNs, giving ground to data-parallel PINNs.
In general applications, the exact solution u is not available. Moreover, it is relevant to determine whether N is large enough. To this end, we introduce a testing (or validation) set of the same cardinality. Interestingly, the entire analysis above and Theorem 1 remain valid for another set of testing points, with the testing errors \(\varepsilon _{V,D}\) and \(\varepsilon _{V,u}\) defined as in Eq. (18). The train-test gap, which is tractable, can be quantified as follows.
Theorem 2
(Train-test gap bound) Under the presented setting, there holds that:
Proof
Consider the setting of Theorem 2. For \(v \in \{D, u\}\) and \(\cdot \in \{ Y,X\}\) there holds that:
\(\square \)
The bound in Theorem 2 is valuable as it allows one to assess the quadrature error convergence—and the regime—with respect to the number of training points.
Data-parallel PINNs
Data-distribution and Horovod
In this section, we present the data-parallel distribution for PINNs. Let us set \(\texttt {size}\in {\mathbb {N}}_1\) and define ranks (or workers):
each rank corresponding generally to a GPU. Data-parallel distribution requires the appropriate partitioning of the training points across ranks.
We introduce \({\hat{N}}_1,M_1\in {\mathbb {N}}_1\) collocation points and observations, respectively, for each \(\texttt {rank}\) (e.g., a GPU) yielding:
with
The data-parallel approach is as follows: we send the same synchronized copy of the DNN \({\mathscr {N}}\!\!{\mathscr {N}}_\theta \) defined in Eq. (4) to each rank. Each rank evaluates the loss \({\mathscr {L}}^\texttt {rank}_\theta \) and the gradient \(\nabla _\theta {\mathscr {L}}^\texttt {rank}_\theta \). The gradients are then averaged using an all-reduce operation, such as the ring-allreduce implemented in Horovod26,36, which is known to be bandwidth optimal with respect to the number of ranks36. The process is illustrated in Fig. 5 for \(\texttt {size}=4\). The ring-allreduce algorithm involves each of the \(\texttt {size}\) nodes communicating with two of its peers \(2 \times (\texttt {size}-1)\) times26.
It is noteworthy that data generation for data-free PINNs (i.e., with \(M=0\)) requires no modification to existing codes, provided that each rank has a different seed for random or pseudo-random sampling. Horovod allows data-parallel acceleration to be applied with minimal changes to existing code. Moreover, our approach and Horovod can easily be extended to multiple computing nodes. As pointed out in “Introduction”, Horovod supports popular DL backends such as TensorFlow, PyTorch and Keras. In Listing 1, we demonstrate how to integrate data-parallel distribution using Horovod with a generic PINNs implementation in TensorFlow 1.x. The highlighted changes in pink show the steps for incorporating Horovod, which include: (i) initializing Horovod; (ii) pinning available GPUs to specific workers; (iii) wrapping the optimizer with the Horovod distributed optimizer; and (iv) broadcasting the initial variables from the master rank (\(\texttt {rank}= 0\)) to the other workers.
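A minimal sketch of these four steps for a graph-mode TensorFlow 1.x PINN is given below (this is not the authors' Listing 1; the network, loss and sampling are illustrative placeholders):

```python
import numpy as np
import tensorflow.compat.v1 as tf
import horovod.tensorflow as hvd

tf.disable_eager_execution()

hvd.init()                                                       # (i) initialize Horovod

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.gpu_options.visible_device_list = str(hvd.local_rank())   # (ii) pin one GPU per rank

# Placeholder PINN graph: x_f holds rank-local collocation points, loss_op a toy residual loss.
x_f = tf.placeholder(tf.float32, shape=[None, 1])
u = tf.layers.dense(tf.layers.dense(x_f, 20, activation=tf.tanh), 1)
loss_op = tf.reduce_mean(tf.square(tf.gradients(u, x_f)[0]))     # illustrative only

opt = tf.train.AdamOptimizer(learning_rate=1e-4)
opt = hvd.DistributedOptimizer(opt)                              # (iii) average gradients via ring-allreduce
train_op = opt.minimize(loss_op)

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(hvd.broadcast_global_variables(0))                  # (iv) broadcast weights from rank 0

    rng = np.random.default_rng(1234 + hvd.rank())               # rank-dependent seed: distinct samples per GPU
    for it in range(1000):
        batch = rng.uniform(-1.0, 7.0, size=(64, 1)).astype(np.float32)  # 64 collocation points per rank
        sess.run(train_op, feed_dict={x_f: batch})
```

Such a script would be launched with, e.g., horovodrun -np 8 python pinn_hvd.py, where pinn_hvd.py is a hypothetical file name.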
Weak and strong scaling
Two key concepts in data distribution paradigms are weak and strong scaling, which can be explained as follows: Weak scaling involves increasing the problem size proportionally with the number of processors, while strong scaling involves keeping the problem size fixed and increasing the number of processors. To reformulate:
- Weak scaling: each worker has \(({\hat{N}}_1,M_1)\) training points, and we increase the number of workers \(\texttt {size}\);
- Strong scaling: we set a fixed total number of \(({\hat{N}}_1,M_1)\) training points, and we split the data over an increasing number \(\texttt {size}\) of workers.
We portray weak and strong scaling in Fig. 6 for a data-free PINN with \({\hat{N}}_1=16\). Each box represents a GPU, with the number of collocation points as a color. On the left of each scaling option, we present the unaccelerated case. Finally, we introduce the training time \(t_\texttt {size}\) for \(\texttt {size}\) workers. This allows us to define the efficiency and speed-up as:
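A standard convention, which we assume here and which is consistent with the figures reported below, is:

$$S_{\texttt {size}} := \frac{t_1}{t_{\texttt {size}}}, \qquad E_{\text {ff}} := \frac{S_{\texttt {size}}}{\texttt {size}} \ \ \text {(strong scaling)}, \qquad E_{\text {ff}} := \frac{t_1}{t_{\texttt {size}}} \ \ \text {(weak scaling)},$$

where for weak scaling the ideal is \(t_{\texttt {size}} = t_1\) (constant work per rank), and for strong scaling the ideal is \(t_{\texttt {size}} = t_1/\texttt {size}\).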
Numerical experiments
Throughout, we apply our procedure to three cases of interest:
- the “1D Laplace” equation (forward problem);
- the “1D Schrödinger” equation (forward problem);
- the “2D Navier–Stokes” equation (inverse problem).
For each case, we perform a h-analysis followed by Horovod data-parallel acceleration, which is applied to the domain training points (and observations for the Navier–Stokes case). Boundary loss terms are negligible due to the sufficient number of boundary data points.
Methodology
We perform simulations in single float precision on an AMAX DL-E48A AMD Rome EPYC server with 8 Nvidia Quadro RTX 8000 GPUs, each with 48 GB of memory. We use a Docker image of Horovod 0.26.1 with CUDA 12.1, Python 3.6.9 and Tensorflow 2.6.2. All throughout, we use tensorflow.compat.v1 as a backend without eager execution.
All the results are available in the HorovodPINNs GitHub repository and are fully reproducible, also ensuring compliance with the FAIR principles (Findability, Accessibility, Interoperability, and Reusability) for scientific data management and stewardship37. We run experiments 8 times with seeds defined as:
in order to obtain rank-varying training points. For domain points, Latin Hypercube Sampling is performed with pyDOE 0.3.8. Boundary points are defined over uniform grids.
We use Glorot uniform initialization3, Chapter 8. “Error” refers to the relative \(L^2\)-error taken over \({\mathscr {T}}^{test} \), and “Time” stands for the training time in seconds. For each case, the loss in Eq. (7) uses unit weights \(\upomega _v=1\) and the Monte-Carlo quadrature rule \({w_v^i} = \frac{1}{N_v}\) for \(v\in \Lambda \). Also, we set \(\text {vol}({\mathbb {D}})\) the volume of the domain \({\mathbb {D}}\) and
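A definition of the precision consistent with the values of \(\rho \) reported in the experiments (our inference, stated as an assumption) is the number of domain collocation points per unit length and per dimension:

$$\rho := \left( \frac{N_f}{\text {vol}({\mathbb {D}})}\right) ^{1/\dim ({\mathbb {D}})}.$$

For instance, for the 1D Laplace case with \(D=[-1,7]\), \(N_f=64\) gives \(\rho = 64/8 = 8\).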
We introduce \(t^k\), the time to perform k iterations. The number of training points processed per second is as follows:
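A natural expression, which we assume here, is:

$$\text {points/s} \;=\; \frac{k \, ({\hat{N}} + M)}{t^k},$$

with \({\hat{N}} + M\) the total number of training points summed over all ranks.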
For the sake of simplicity, we summarize the parameters and hyper-parameters for each case in Table 1.
1D Laplace
We first consider the 1D Laplace equation in \(D = [-1,7]\) as being:
Note that the exact solution is \(u(x) = \sin (\pi x)\). We solve the problem for:
We set \(N_g=2\) and \(N_u = 0\). Points in \({\mathscr {T}}_D\) are generated randomly over D, and \({\mathscr {T}}_b = \{-1,7\}\). The residual in Eq. (3):
yields the loss:
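A sketch of this loss under the Monte-Carlo rule of “Methodology” (assuming Dirichlet BCs and the sign convention \(-u''=f\) with \(f(x)=\pi ^2\sin (\pi x)\), consistent with \(u(x)=\sin (\pi x)\)) is:

$${\mathscr {L}}_\theta = \frac{1}{N_f}\sum _{i=1}^{N_f} \big | {-}\partial _{xx}u_\theta (x^i) - \pi ^2 \sin (\pi x^i)\big |^2 + \frac{1}{N_g}\sum _{x\in {\mathscr {T}}_b} \big | u_\theta (x)\big |^2,$$

with \(x^i \in {\mathscr {T}}_D\) and where the boundary term uses \(u(-1)=u(7)=0\).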
h-Analysis
We perform the h-analysis for the error as portrayed before in Fig. 1. The transition regime occurs between \(N_f={64}\) and \(N_f={400}\), with a precision of \(\rho =8\), in accordance with general results for the h-analysis of traditional solvers. The permanent regime shows a slight improvement in accuracy, with the mean “Error” dropping from \(8.75 \times 10^{-3}\) for \(N_f=400\) to \(3.91 \times 10^{-3}\) for \(N_f=65{,}536\). To complete the h-analysis, Fig. 7 shows the convergence results of the ADAM optimizer for all the values of \(N_f\). This plot reveals that each regime exhibits similar patterns. The high variability in convergence during the transition regime is particularly interesting, with some runs converging and others not. In the permanent regime, the convergence shows almost identical and stable patterns irrespective of \(N_f\).
Furthermore, we plot the training and test losses in Fig. 8. Note that the validation loss and the “Error” show similar behaviors. We use this figure as a reference to define each transition regime. In particular, it hints that the permanent regime is reached for \(N_f=400\), as the relative error at the best iteration between \({\mathscr {L}}_\theta ^\text {train}\) and \({\mathscr {L}}_\theta ^\text {test}\) drops from \(1.60 \times 10^{-1}\) to \(6.02 \times 10^{-5}\). For the sake of precision, the value for \(N_f=512\) is \(2.20 \times 10^{-5}\).
Data-parallel implementation
We set \(N_{f,1}\equiv N_1= {64}\) and compare both weak and strong scaling to original implementation, referred to as “no scaling”.
We provide a detailed description of Fig. 9, as it will serve as the basis for future cases:
- Left-hand side: error for \(\texttt {size}^* \in \{1,2,4,8\}\) (no scaling), corresponding to \(N_f\in \{64,128, {256,512}\}\);
- Middle: error for weak scaling with \(N_{1}=64\) and \(\texttt {size}\in \{1,2,4,8\}\);
- Right-hand side: error for strong scaling with \(N_{1}={512}\) and \(\texttt {size}\in \{1,2,4,8\}\).
To reduce ambiguity, we use the \(*\)-superscript for no scaling, as \(\texttt {size}^*\) is performed over 1 rank. The color for each violin box in the figure corresponds to the number of domain collocation points used for each GPU.
Figure 9 demonstrates that both weak and strong scaling yield convergence results similar to their unaccelerated counterparts. This result is one of the main findings of this work: PINNs scale properly with respect to accuracy, validating the intuition behind h-analysis and justifying the data-parallel approach. This allows one to move from the pre-asymptotic to the permanent regime by using weak scaling, or to reduce the cost of a permanent-regime application by dispatching the training points over different workers. Furthermore, the hyper-parameters, including the learning rate, remained unchanged.
Next, we summarize the data-parallelization results in Table 2 with respect to \(\texttt {size}\).
In the first column, we present the time required to run 500 ADAM iterations, referred to as \(t^\text {500}\). This value is averaged over one run of 30,000 iterations with a warm-up of 1500 iterations (i.e., we discard the values corresponding to iterations 0, 500 and 1000). We report the mean value ± standard deviation of the resulting vector. The second column displays the efficiency of the run, evaluated with respect to \(t^\text {500}\).
Table 2 reveals that data-parallel acceleration incurs additional training time, as anticipated. The weak scaling efficiency varies from \({78.01}\%\) for \(\texttt {size}={2}\) to \({66.67}\%\) for \(\texttt {size}=8\), resulting in a speed-up of 5.47 when using 8 GPUs. Strong scaling shows similar behavior. Furthermore, it can be observed that \(\texttt {size}=1\) yields almost equal \(t^{500}\) for \(N_f={64}\) (4.08 s) and \(N_f={512}\) (4.19 s).
1D Schrödinger
We solve the non-linear Schrödinger equation along with periodic BCs (refer to10, Section 3.1.1) given over \({\overline{D}}\times {[0, T ] }\) with \(D:=(-5,5)\) and \(T := \pi / 2\):
where \(u(x,t) = u^0(x,t) + \imath u^1(x,t)\). We apply PINNs to the system with \(m=2\) in Eq. (4), i.e., \((u^0_\theta ,u^1_\theta ) \in {\mathbb {R}}^m={\mathbb {R}}^2\). We set \(N_g=N_\hbar = 200\).
h-Analysis
To begin with, we perform the h-analysis for the parameters in Fig. 10. Again, the transition regime begins at a density of \(\rho =5\) and spans between \(N_f=350\) and \(N_f=4000\). For larger values of \(N_f\), the error remains approximately the same. To illustrate this more complex case further, we present the total execution time distribution in Fig. 11.
We note that the training times remain stable for \(N_f\le 4000\). This observation is important and should be emphasized, as it adds further parameters to the analysis. Our work primarily focuses on the error in the h-analysis; however, execution time and memory requirements are also important considerations. We see that in this case, weak scaling is not necessary (the optimal option is to use \(\texttt {size}= 1\) and \(N_f = 4000\)). Alternatively, strong scaling can be applied with \(N_{f,1} = 4000\).
To gain further insight into the variability of the transition regime, we focus on \(N_f = 1000\). We compare the solutions for \(\texttt {seed}= 1234\) and \(\texttt {seed}= 1236\) in Fig. 12. The upper figures depict |u(t, x)| predicted by the PINN. The lower figures compare the exact and predicted solutions for \(t\in \{0.59,0.79,0.98\}\). It is evident that the solution for \(\texttt {seed}=1234\) closely resembles the exact solution, whereas the solution for \(\texttt {seed}=1236\) fails to accurately represent the solution near \(x=0\), thereby illustrating the importance of reaching the permanent regime. Next, we show the training and testing losses in Fig. 13. We remark that the training and test losses converge for \(N_f >200\). Analysis of the train-test gap showed that it converged as \({\mathscr {O}}(N_f^{-1})\). Visually, one can assume that the losses are close enough for \(N_f=4000\) (or \(N_f=8000\)), in accordance with the h-analysis performed in Fig. 10.
Data-parallel implementation
We compare the error for unaccelerated and data-parallel implementations of simulations for \(N_{f,1}\equiv N_{1}=500\) in Fig. 14, analogous to the analysis in Fig. 9. Again, the error is stable with \(\texttt {size}\). No scaling and weak scaling behave similarly, and strong scaling is unaffected by \(\texttt {size}\). We plot the training time in Fig. 15. We observe that both weak and strong scaling increase linearly and slightly with \(\texttt {size}\), showing similar behaviors. Figure 16 portrays the number of training points processed per second [refer to Eq. (24)] and the efficiency with respect to \(\texttt {size}\), with white bars representing ideal scaling. The efficiency \(E_\text {ff}\) shows a gradual decrease with \(\texttt {size}\), with results surpassing those of the previous section. The efficiency for \(\texttt {size}=8\) reaches \(77{.22}\%\) and \(76{.20}\%\) for weak and strong scaling, respectively, representing speed-ups of 6.18 and 6.10.
Inverse problem: Navier–Stokes equation
We consider the Navier–Stokes problem with \(D : =[-1,8] \times [-2,2]\), \(T=20\) and unknown parameters \(\uplambda _1, \uplambda _2 \in {\mathbb {R}}\). The resulting divergence-free Navier–Stokes equations are expressed as follows:
wherein u(x, y, t) and v(x, y, t) are the x and y components of the velocity field, and p(x, y, t) the pressure. We assume that there exists \(\varphi (x,y,t)\) such that:
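Presumably, this is the standard stream-function ansatz (the sign convention shown is ours), under which the continuity equation holds identically:

$$u = \frac{\partial \varphi }{\partial y}, \qquad v = -\frac{\partial \varphi }{\partial x} \quad \Longrightarrow \quad \frac{\partial u}{\partial x} + \frac{\partial v}{\partial y} = \frac{\partial ^2\varphi }{\partial x\,\partial y} - \frac{\partial ^2\varphi }{\partial y\,\partial x} = 0.$$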
Under Eq. (27), the last row in Eq. (26) is satisfied automatically. This leads to the definition of the residuals:
We introduce \(N_f\) pseudo-random points \(\textbf{x}^i \in D\) and observations \((u^i,v^i)= (u^i (\textbf{x}^i), v^i (\textbf{x}^i)) \), yielding the loss:
with
Throughout this case, we have \(N_f=M\), and we plot the results with respect to \(N_f\). Note that \(N = 2 N_f\), and that both \(\lambda _1,\lambda _2\) and the BCs are unknown.
h-Analysis
We conduct the h-analysis and show the error in Fig. 17, which reveals no differences with the previous cases. Surprisingly, the permanent regime is reached for only \(N_f=1000\), despite the problem being three-dimensional, non-linear, and inverse. This corresponds to low values of \(\rho \), indicating that PINNs seem to mitigate the curse of dimensionality. In fact, the permanent regime was achieved with only 1.21 points per unit per dimension. The total training time is presented in Fig. 18; it remains stable up to \(N_f=5000\) and then increases linearly with \(N_f\).
Again, we present the training and test losses in Fig. 19. The train-test gap decreases for \(N_f \ge 350\) as \({\mathscr {O}}(N_f^{-1})\). Furthermore, the train-test gap can be considered visually small enough for \(N_f = 1000\).
Data-parallel implementation
We run the data-parallel simulations, setting \(N_{f,1} \equiv N_{1} = M_1= 500\). As shown in Fig. 20, the simulations exhibit stable accuracy with \(\texttt {size}\). The execution time increases moderately with \(\texttt {size}\), as illustrated in Fig. 21. The training time decreases with N for the no-scaling case; however, this behavior is temporary (refer to Fig. 18). We conclude our analysis by plotting the efficiency with respect to \(\texttt {size}\) in Fig. 22. Efficiency decreases with increasing \(\texttt {size}\), but shows the best results so far, with \(80{.55}\%\) (resp. \(86.31\%\)) weak (resp. strong) scaling efficiency for \(\texttt {size}= 8\). For the sake of completeness, the weak efficiency for \(N_{f,1}=50{,}000\) and \(\texttt {size}=8\) improves to \(86.15\%\). This encouraging result sets the stage for further exploration of more intricate applications.
Conclusion
In this work, we proposed a novel data-parallelization approach for PIML with a focus on PINNs. We provided a thorough h-analysis and associated theoretical results to support our approach, as well as practical considerations to facilitate implementation with Horovod data acceleration. Additionally, we ran reproducible numerical experiments to demonstrate the scalability of our approach. Further work includes the implementation of Horovod acceleration in the DeepXDE35 library, the coupling of localized PINNs with domain decomposition methods, and applications on larger GPU servers (e.g., with more than 100 GPUs).
Data availability
The code required to reproduce these findings is available to download from https://github.com/pescap/HorovodPINNs.
Change history
27 May 2024
A Correction to this paper has been published: https://doi.org/10.1038/s41598-024-62284-9
References
Steinbach, O. Numerical Approximation Methods for Elliptic Boundary Value Problems: Finite and Boundary Elements (Springer, 2007).
LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
Bengio, Y., Goodfellow, I. & Courville, A. Deep Learning Vol. 1 (MIT Press, 2017).
Karniadakis, G. E. et al. Physics-informed machine learning. Nat. Rev. Phys. 3, 422–440. https://doi.org/10.1038/s42254-021-00314-5 (2021).
You, H., Yu, Y., Trask, N., Gulian, M. & D’Elia, M. Data-driven learning of nonlocal physics from high-fidelity synthetic data. Comput. Methods Appl. Mech. Eng. 374, 113553. https://doi.org/10.1016/j.cma.2020.113553 (2021).
Sun, L., Gao, H., Pan, S. & Wang, J.-X. Surrogate modeling for fluid flows based on physics-constrained deep learning without simulation data. Comput. Methods Appl. Mech. Eng. 361, 112732. https://doi.org/10.1016/j.cma.2019.112732 (2020).
Lai, Z. et al. Neural Modal ODEs: Integrating Physics-based Modeling with Neural ODEs for Modeling High Dimensional Monitored Structures. http://arxiv.org/abs/2207.07883 (2022).
Lai, Z., Mylonas, C., Nagarajaiah, S. & Chatzi, E. Structural identification with physics-informed neural ordinary differential equations. J. Sound Vib. 508, 116196. https://doi.org/10.1016/j.jsv.2021.116196 (2021).
Alber, M. et al. Integrating machine learning and multiscale modeling-perspectives, challenges, and opportunities in the biological, biomedical, and behavioral sciences. NPJ Dig. Med. 2, 1–11. https://doi.org/10.1038/s41746-019-0193-y (2019).
Raissi, M., Perdikaris, P. & Karniadakis, G. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J. Comput. Phys. 378, 686–707. https://doi.org/10.1016/j.jcp.2018.10.045 (2019).
Baydin, A. G., Pearlmutter, B. A., Radul, A. A. & Siskind, J. M. Automatic differentiation in machine learning: A survey. J. Mach. Learn. Res. 18, 5595–5637. https://doi.org/10.5555/3122009.3242010 (2017).
Chen, Y., Lu, L., Karniadakis, G. E. & Negro, L. D. Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Opt. Express 28, 11618–11633. https://doi.org/10.1364/OE.384875 (2020).
Chen, X., Duan, J. & Karniadakis, G. E. Learning and meta-learning of stochastic advection-diffusion-reaction systems from sparse measurements. Eur. J. Appl. Math. 32, 397–420. https://doi.org/10.1017/S0956792520000169 (2021).
Meng, X. & Karniadakis, G. E. A composite neural network that learns from multi-fidelity data: Application to function approximation and inverse PDE problems. J. Comput. Phys. 401, 109020. https://doi.org/10.1016/j.jcp.2019.109020 (2020).
Li, R., Wang, J.-X., Lee, E. & Luo, T. Physics-informed deep learning for solving phonon Boltzmann transport equation with large temperature non-equilibrium. NPJ Comput. Mater. 8, 1–10. https://doi.org/10.1038/s41524-022-00712-y (2022).
Yang, X. I. A., Zafar, S., Wang, J.-X. & Xiao, H. Predictive large-eddy-simulation wall modeling via physics-informed neural networks. Phys. Rev. Fluids 4, 034602. https://doi.org/10.1103/PhysRevFluids.4.034602 (2019).
Zhang, D., Lu, L., Guo, L. & Karniadakis, G. E. Quantifying total uncertainty in physics-informed neural networks for solving forward and inverse stochastic problems. J. Comput. Phys. 397, 108850. https://doi.org/10.1016/j.jcp.2019.07.048 (2019).
Escapil-Inchauspé, P. & Ruz, G. A. Physics-informed neural networks for operator equations with stochastic data. https://doi.org/10.48550/ARXIV.2211.10344 (2022).
Wang, S., Yu, X. & Perdikaris, P. When and why PINNs fail to train: A neural tangent kernel perspective. J. Comput. Phys. 449, 110768. https://doi.org/10.1016/j.jcp.2021.110768 (2022).
Escapil-Inchauspé, P. & Ruz, G. A. Hyper-parameter tuning of physics-informed neural networks: Application to Helmholtz problems. Neurocomputing 1, 126826. https://doi.org/10.1016/j.neucom.2023.126826 (2023).
Shukla, K., Xu, M., Trask, N. & Karniadakis, G. E. Scalable algorithms for physics-informed neural and graph networks. Data-Centric Eng. 3, e24. https://doi.org/10.1017/dce.2022.24 (2022).
Mishra, S. & Molinaro, R. Estimates on the generalization error of physics-informed neural networks for approximating PDEs. IMA J. Numer. Anal.https://doi.org/10.1093/imanum/drab093 (2022).
Mishra, S. & Molinaro, R. Estimates on the generalization error of physics-informed neural networks for approximating a class of inverse problems for PDEs. IMA J. Numer. Anal. 42, 981–1022. https://doi.org/10.1093/imanum/drab032 (2021).
Shin, Y., Darbon, J. & Karniadakis, G. E. On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs. http://arxiv.org/abs/2004.01806 (2020).
Khoo, Y., Lu, J. & Ying, L. Solving parametric PDE problems with artificial neural networks. Eur. J. Appl. Math. 32, 421–435 (2021).
Sergeev, A. & Del Balso, M. Horovod: Fast and easy distributed deep learning in TensorFlow. http://arxiv.org/abs/1802.05799 (2018).
Jagtap, A. D., Kharazmi, E. & Karniadakis, G. E. Conservative physics-informed neural networks on discrete domains for conservation laws: Applications to forward and inverse problems. Comput. Methods Appl. Mech. Eng. 365, 113028. https://doi.org/10.1016/j.cma.2020.113028 (2020).
Jagtap, A. D. & Karniadakis, G. E. Extended physics-informed neural networks (XPINNs): A generalized space-time domain decomposition based deep learning framework for nonlinear partial differential equations. Commun. Comput. Phys. 28, 2002–2041. https://doi.org/10.4208/cicp.OA-2020-0164 (2020).
Hu, Z., Jagtap, A. D., Karniadakis, G. E. & Kawaguchi, K. When do extended physics-informed neural networks (XPINNs) improve generalization?. SIAM J. Sci. Comput. 44, A3158–A3182. https://doi.org/10.1137/21M1447039 (2022).
Dwivedi, V., Parashar, N. & Srinivasan, B. Distributed physics informed neural network for data-efficient solution to partial differential equations. http://arxiv.org/abs/1907.08967 (2019).
Shukla, K., Jagtap, A. D. & Karniadakis, G. E. Parallel physics-informed neural networks via domain decomposition. J. Comput. Phys. 447, 110683. https://doi.org/10.1016/j.jcp.2021.110683 (2021).
McClenny, L. D., Haile, M. A. & Braga-Neto, U. M. TensorDiffEq: Scalable Multi-GPU Forward and Inverse Solvers for Physics Informed Neural Networks. http://arxiv.org/abs/2103.16034 (2021).
Hennigh, O. et al. NVIDIA SimNet™: An AI-accelerated multi-physics simulation framework. In Computational Science–ICCS 2021: 21st International Conference, Krakow, Poland, June 16–18, 2021, Proceedings, Part V, 447–461 (Springer, 2021).
Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. http://arxiv.org/abs/1412.6980 (2014).
Lu, L., Meng, X., Mao, Z. & Karniadakis, G. E. DeepXDE: A deep learning library for solving differential equations. SIAM Rev. 63, 208–228. https://doi.org/10.1137/19M1274067 (2021).
Patarasuk, P. & Yuan, X. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput. 69, 117–124 (2009).
Wilkinson, M. D. et al. The fair guiding principles for scientific data management and stewardship. Sci. Data 3, 1–9 (2016).
Acknowledgements
The authors would like to thank the Data Observatory Foundation, ANID FONDECYT 1230315, ANID FONDECYT 3230088, FES-UAI postdoc grant, ANID PIA/BASAL FB0002, and ANID/PIA/ANILLOS ACT210096, for financially supporting this research.
Author information
Authors and Affiliations
Contributions
P.E.I. conceived the experiment(s). All authors analyzed the results and reviewed the manuscript.
Corresponding author
Ethics declarations
Competing interests
The authors declare no competing interests.
Additional information
Publisher's note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original online version of this Article was revised: The original version of this Article contained errors in section ‘General Notation’, Equation 3, Equation 13, Equation 14, Equation 21, section ‘Proof Consider the setting of Theorem 1.’ and in section ‘Methodology’, where notations were incorrect. Full information regarding the corrections made can be found in the correction for this Article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Escapil-Inchauspé, P., Ruz, G.A. h-Analysis and data-parallel physics-informed neural networks. Sci Rep 13, 17562 (2023). https://doi.org/10.1038/s41598-023-44541-5
DOI: https://doi.org/10.1038/s41598-023-44541-5