An improved data-free surrogate model for solving partial differential equations using deep neural networks

Partial differential equations (PDEs) are ubiquitous in natural science and engineering problems. Traditional discrete methods for solving PDEs are usually time-consuming and labor-intensive due to the need for tedious mesh generation and numerical iterations. Recently, deep neural networks have shown new promise in cost-effective surrogate modeling because of their universal function approximation abilities. In this paper, we borrow the idea from physics-informed neural networks (PINNs) and propose an improved data-free surrogate model, DFS-Net. Specifically, we devise an attention-based neural structure containing a weighting mechanism to alleviate the problem of unstable or inaccurate predictions by PINNs. The proposed DFS-Net takes expanded spatial and temporal coordinates as the input and directly outputs the observables (quantities of interest). It approximates the PDE solution by minimizing the weighted residuals of the governing equations and data-fit terms, where no simulation or measured data are needed. The experimental results demonstrate that DFS-Net offers a good trade-off between accuracy and efficiency. It outperforms the widely used surrogate models in terms of prediction performance on different numerical benchmarks, including the Helmholtz, Klein–Gordon, and Navier–Stokes equations.

Numerical simulations play a vital role in scientific and engineering applications, such as aerospace, finance, civil engineering, energy engineering, and biological engineering [1][2][3] . The principle of the simulation process is to solve linear/nonlinear partial differential equations (PDEs). Since the 1970s, various mesh-based numerical methods, such as the finite difference (FD), finite element (FE), and finite volume (FV) methods, have been developed to solve PDE systems 4 . These methods first discretize the computational domain into mesh units and then iteratively solve the system of PDEs on each subdomain to obtain a numerical approximation of the unknown functions.
However, traditional discrete methods often involve tedious meshing and iterative solving of large sparse nonlinear systems, which are computationally cumbersome on modern parallelized architectures [5][6][7][8] . Moreover, the current meshing process is still a highly specialized activity that remains in the empirical, descriptive realm of knowledge, especially for complex geometries and physical configurations 9,10 . Careful human-computer interaction is usually required to ensure a valid, high-quality mesh for the convergence of the PDE solvers. The extensive computational overhead and manual interaction limit the use of a principled PDE model for real-time analysis and optimization design. Therefore, developing a cost-effective surrogate model is desirable. An ideal model is one that takes the coordinates of some random points in the computational domain and automatically outputs the corresponding degrees of freedom of the PDE.
To fulfill this role, Raissi et al. 11,12 employed machine learning techniques (Gaussian processes and Bayesian regression) to devise functional representations for linear/nonlinear operators in physical and mathematical problems. Tartakovsky et al. 13 presented a physics-informed machine learning approach, PICKLE, for elliptic diffusion equations. This approach uses conditional Karhunen–Loève expansion (cKLE) to minimize the PDE residuals and approximate the observed parameters and states. Ahalpara 14 developed a surrogate model for solving the Korteweg–de Vries (KdV) equation using a genetic algorithm. A random forest regression model was introduced by Wang et al. 15 to predict the Reynolds stresses in the flow over periodic hills. However, the above approaches typically rely on measured or simulated data for training, which limits their use as data-free PDE surrogates.

Methodology

Problem setup. We begin with the description of a system of nonlinear partial differential equations (PDEs) in the following generalized form:

D[u(x, t)] = 0, x ∈ Ω, t ∈ [0, T], (1)

where the spatial domain Ω ⊂ R^d, D represents the differential operator, and u(x, t) is the unknown solution we wish to solve for. The above system is subject to the well-posed boundary condition

D_bc[u(x, t)] = h(x, t), x ∈ ∂Ω, t ∈ [0, T], (2)

and the initial condition

u(x, 0) = g(x), x ∈ Ω, (3)

where ∂Ω denotes the boundary of Ω, and D_bc is the differential operator that imposes boundary conditions for the PDE. h(x, t) : R^{d+1} → R and g(x) : R^d → R are given functions. When a set of physical parameters and data-fit terms are given, the unknown function of interest u(x, t) can be solved for by discretizing the nonlinear/linear system using traditional numerical methods, such as the finite difference, finite element, and finite volume methods. However, these methods involve tedious mesh generation and iterative solving, which heavily increase the simulation overhead.
Deep neural network and physics-informed training. Deep neural networks have proven their powerful learning ability in many time-consuming classification- and regression-based physical applications 9,27-29 . Well-trained networks utilize multiple layers of neural units to automatically approximate complex input-output mappings over high-dimensional parameter spaces. Mathematically, a surrogate network model F̂ is built to approximate the latent solution F for the underlying application:

F̂(x, t; θ_W,b) ≈ F(x, t),

where θ_W,b is the set of network parameters, i.e., the weights and biases of the constructed network. Minimizing the mismatch between the desired solution F and the DNN-based prediction F̂ is a nonconvex optimization problem, whose key difficulty lies in tuning parameters over a very high-dimensional space. This process can be formulated as:

θ* = arg min_θ Loss(θ_W,b, x, t), with Loss(θ_W,b, x, t) = ‖F̂(x, t; θ_W,b) − F(x, t)‖²,

where ‖ · ‖ denotes the L2-norm over the domain Ω. After suitable training, one can find a set of optimal or suboptimal network parameters W*, b*, such that Loss(θ_W,b, x, t) is as close to zero as possible.
Recently, physics-informed neural networks (PINNs) 23,24,30 have been employed to infer PDE solutions by building complex features from multiple layers of neural units via input-output relationships. In PINNs, neurons are fully connected, and a loss function satisfying the PDEs [Eqs. (1)-(3)] is employed to constrain the optimization of the neuron parameters. Let {(x_r^n, t_r^n)}_{n=1}^{N_r} be a preselected set of spatial and temporal points inside the solution domain Ω, and let {(x_b^n, t_b^n)}_{n=1}^{N_b} and {x_i^n}_{n=1}^{N_i} denote the preselected point data sampled from the boundary and the initial condition, respectively. The loss function of PINN is formulated as:

Loss(θ_W,b, x, t) = (1/N_r) Σ_{n=1}^{N_r} |D[û](x_r^n, t_r^n)|² + (1/N_b) Σ_{n=1}^{N_b} |D_bc[û](x_b^n, t_b^n) − h(x_b^n, t_b^n)|² + (1/N_i) Σ_{n=1}^{N_i} |û(x_i^n, 0) − g(x_i^n)|².

Given a specific neural network architecture, the PINN can be viewed collectively as a function of the input data and the parameters θ_W,b. It maps the time t, the spatial coordinates x, and the variables to the quantities of interest, e.g., the velocity field u or the pressure field p, thus allowing a data-free PDE "solver" that requires neither meshing nor numerical iteration. If the physics-based loss function Loss(θ_W,b, x, t) becomes identically zero, the output predictions will exactly satisfy the underlying PDE system. However, experimental results in 31,32 indicate that PINN has difficulty fitting all equation residuals and fails to guarantee a stable approximation of the solution. This may lead to serious errors in the PINN and return incorrect predictions, especially in the near-wall domains.

DFS-Net: a data-free surrogate model for solving PDEs. Following the original works on PINNs, we propose an improved deep neural network, DFS-Net, to tackle the aforementioned challenges.
To alleviate the problem of unstable predictions in different subdomains and obtain more accurate results, we first introduce a weighting mechanism in DFS-Net. The key idea is to associate the weight of each training point with its coordinates in the computational domain. Through trial and observation, we found that the center points are more informative than the edge points. Based on this observation, we design the point weighting mechanism as a function that gives more weight to these "pivotal" points. Taking a 2-D domain as an example (see Fig. 1), for training points sampled inside the domain Ω, we use a piecewise function ω to assign different weights to points in different subdomains, where d_l is the shortest distance of the weighted point from the boundaries (bc1-bc4), and d_t is an empirical parameter controlling the scope of the weighting area.
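As a minimal sketch, one plausible instantiation of ω is a linear distance ramp (this particular form is an assumption, not necessarily the paper's exact piecewise function): points on the boundary (d_l = 0, including discontinuity points) receive zero weight, and the weight saturates to 1 beyond the cutoff d_t.

```python
import numpy as np

def point_weight(d_l, d_t=0.1):
    """Hypothetical piecewise weight omega(d_l): a linear ramp that
    down-weights points within distance d_t of the boundary and gives
    full weight to interior "pivotal" points.

    d_l: shortest distance from the training point to the boundary.
    d_t: empirical parameter controlling the scope of the weighting area.
    """
    d_l = np.asarray(d_l, dtype=float)
    return np.where(d_l < d_t, d_l / d_t, 1.0)

# Weights for points at increasing distance from the wall.
w = point_weight([0.0, 0.05, 0.1, 0.5], d_t=0.1)
```

Note that a boundary-discontinuity point (d_l = 0) gets weight 0 under this choice, which is consistent with the treatment of sharp lid-corner discontinuities discussed later for the cavity-flow benchmark.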
In this way, DFS-Net is able to better balance the contributions of training points sampled from different subdomains (e.g., interior or boundary points) and accelerate the loss convergence. Moreover, this mechanism also suppresses the potential singularities caused by discontinuities or sudden changes in the boundary conditions, thus allowing us to achieve better accuracy.
We now introduce the overall pipeline of the proposed DFS-Net. Instead of employing a very deep neural network, DFS-Net uses a lightweight structure to learn solution-related features from the space-time input. As depicted in Fig. 2, we first introduce a linear expanding layer as a form of data enhancement in DFS-Net. This layer defines a mapping from the input layer z_in = (x, t) ∈ R² to the output z_1 ∈ R^8:

z_1 = [x, t, pow(x), pow(t), sin(x), sin(t), cos(x), cos(t)],

where pow(·) computes the square of (x, t) element-wise, and sin(·) and cos(·) compute the sine and cosine of (x, t), respectively.
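The eight-feature expansion can be sketched directly; the exact ordering of the features below is an assumption, but the composition (raw coordinates plus element-wise squares, sines, and cosines) follows the description above.

```python
import numpy as np

def expand(x, t):
    """Linear expanding layer: maps (x, t) in R^2 to z1 in R^8."""
    return np.array([x, t,
                     x**2, t**2,             # pow(.): element-wise squares
                     np.sin(x), np.sin(t),   # sin(.)
                     np.cos(x), np.cos(t)])  # cos(.)

z1 = expand(0.5, 0.0)  # 8 expanded features fed to the first hidden layer
```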
The affine transformation in the expanding layer makes the input (x, t) better suited for nonlinear partial differential function approximation, thereby capturing the complex high-dimensional features inherent in conservation laws. Our experiments show that the introduced expanding layer improves the performance of neural network-based surrogate models compared with no expansion, albeit at a higher computational cost at the beginning of the training phase.
After expansion, a series of hidden layers is employed to extract the features of interest from the expanded spatial and temporal coordinates. In these hidden layers, the neurons of adjacent layers are fully connected, and each hidden layer receives the output of the previous layer. Inspired by 33 , we introduce one-dimensional excitation blocks to further increase the approximating ability of DFS-Net. An excitation block is a computational unit built upon a transformation between two layers. It takes the output of the last hidden layer as its input and produces a collection of per-channel modulation weights, which can be regarded as a simple self-gating mechanism. The affine transformation in each hidden layer is computed as:

z_l = σ(W_l z_{l−1} + b_l) ⊙ W_excitation,

where the subscript l denotes the index of the hidden layer, and W_l and b_l are the weight matrix and bias vector in layer l, respectively. The element-wise activation function σ is applied to the transformed vector (W_l z_{l−1} + b_l), for which a number of options can be chosen, e.g., sigmoids, rectified linear units (ReLU), and tanh functions 34 . ⊙ represents the point-wise multiplication operator, and W_excitation denotes the modulation weights provided by the excitation block. In DFS-Net, we provide two excitation modes: (1) a fixed excitation mode and (2) an unfixed excitation mode. The fixed excitation mode adaptively recalibrates the per-channel feature responses by explicitly modeling interdependencies between the channels using a single excitation block shared by all hidden layers, whereas the unfixed excitation mode employs a different excitation block for each hidden layer. Both modes help DFS-Net reweight channel attention without requiring additional supervision, and the fixed mode incurs a relatively small computational and memory overhead.
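A minimal numerical sketch of one hidden layer with an excitation gate. The one-layer excitation block used here (a sigmoid-gated linear map producing per-channel weights in (0, 1)) is an assumed simplification; the paper's block may have a different internal structure.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(z_prev, W, b, W_exc, b_exc, sigma=np.tanh):
    """z_l = sigma(W_l z_{l-1} + b_l) (element-wise) * w_excitation.

    The excitation weights are computed from the previous layer's output,
    acting as a simple self-gating (channel attention) mechanism.
    """
    w_excitation = sigmoid(W_exc @ z_prev + b_exc)  # per-channel gates in (0, 1)
    return sigma(W @ z_prev + b) * w_excitation

rng = np.random.default_rng(0)
z0 = rng.normal(size=8)  # output of the expanding layer
out = hidden_layer(z0, rng.normal(size=(50, 8)), np.zeros(50),
                   rng.normal(size=(50, 8)), np.zeros(50))
```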
The detailed training procedure of the data-free surrogate model DFS-Net is shown in Algorithm 1. In our work, we consider the spatial and temporal coordinates as inputs. We first randomly select the training data from the solution domain. One can also sample training data uniformly depending upon space (time) scales or using other randomized designs, such as Latin hypercube sampling strategy and a truncated Gaussian distribution 23 . Then, we adopt the weighting mechanism to compute weights for each set of preselected training data, aimed at alleviating the prediction inaccuracy in the near-wall domains.
In step 3, we construct the loss function corresponding to the underlying PDE system. The loss function is used to constrain the DFS-Net, such that the conservation laws, boundary conditions, and initial conditions are satisfied at each iteration of subsequent training. The loss function Loss(θ_W,b, x, t) of DFS-Net for solving PDEs is defined as follows:

Loss(θ_W,b, x, t) = Loss_Nr + λ1 Loss_Nb + λ2 Loss_Ni,

with Loss_Nr = (1/N_r) Σ_{n=1}^{N_r} ω_n |D[û](x_r^n, t_r^n)|² and analogous weighted mean-squared terms for the boundary and initial conditions, where Loss_Nr, Loss_Nb, and Loss_Ni represent the loss terms corresponding to the residual of the governing equation, the boundary condition, and the initial condition, respectively. N_r, N_b, and N_i denote the numbers of preselected point samples for the different loss terms, and ω_n is the weight of the n-th point calculated by the weighting mechanism.
λ1 and λ2 are penalty coefficients introduced by the learning rate annealing method, which act as a dynamic penalty strategy to overcome the imbalanced contributions of the governing equation term and the boundary/initial condition terms in Loss(θ_W,b, x, t). The balanced interplay between the different terms helps DFS-Net better learn the underlying rules of the PDE system and accelerates convergence during training. In our experiments, λ1 and λ2 are initialized to 10 and are updated every 10 gradient descent steps. After constructing the network structure (Step 4) and initializing the network parameters (Step 5), we feed the preselected data into DFS-Net for feedforward training. The input signals are passed through all hidden layers (with activation functions) and converted into high-level features. During the backpropagation process, the partial derivatives of the loss function with respect to the layer outputs and the network variables are computed. The network variables (weights, biases, and λ) are optimized via non-convex optimization algorithms (e.g., stochastic gradient descent or quasi-Newton methods) to minimize Loss(θ_W,b, x, t). The optimization process stops after converging to a local optimum. Thereafter, the well-trained DFS-Net with a set of (sub)optimal network parameters can be used as a black box to rapidly compute the predicted solution, such as the velocity, temperature, or pressure field, for any given input vector (coordinates). Since this feedforward prediction only involves a few matrix multiplications, its computational cost is negligible compared to that of a traditional numerical simulation.
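The weighted composite loss described above can be sketched as follows. The exact placement of the point weights ω_n (residual term only) and the penalty coefficients λ1, λ2 is an assumption based on the surrounding description.

```python
import numpy as np

def dfs_loss(res, bc_err, ic_err, omega, lam1=10.0, lam2=10.0):
    """Composite loss: weighted PDE residual plus penalized boundary and
    initial-condition mismatches. lam1/lam2 start at 10 and would be
    updated every 10 gradient steps by the annealing method.
    """
    loss_r = np.mean(omega * res**2)  # weighted governing-equation residual
    loss_b = np.mean(bc_err**2)       # boundary-condition mismatch
    loss_i = np.mean(ic_err**2)       # initial-condition mismatch
    return loss_r + lam1 * loss_b + lam2 * loss_i
```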
Training. The activation functions play an important role in neural network training. They apply a nonlinear transformation to the output of each hidden layer, making it possible for neurons to approximate complex patterns. The swish activation function is a widely used nonlinear mapping for deep neural networks, defined as:

swish(x) = x · sigmoid(βx) = x / (1 + e^{−βx}).

Previous studies show that swish is less prone to the vanishing and exploding gradient problems 35 . In DFS-Net, we use the swish function with β = 10 for activation in each hidden layer, except for the last layer, where a linear activation function is used.
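The swish activation is straightforward to implement; with β = 10 it closely approximates ReLU while remaining smooth and differentiable everywhere.

```python
import numpy as np

def swish(x, beta=10.0):
    """swish(x) = x * sigmoid(beta * x) = x / (1 + exp(-beta * x))."""
    return x / (1.0 + np.exp(-beta * x))
```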
For the non-convex optimization, we combine the Adam and L-BFGS-B optimizers 36 to minimize the loss function. We first apply the Adam optimizer for stochastic gradient descent training and then employ the L-BFGS-B optimizer to fine-tune the results. During the Adam-based training, the optimizer randomly samples a subset of the data (a mini-batch) from the training set to calculate the direction of the gradient at each iteration. In our work, the initial learning rate is 1 × 10−4 and decays by a factor of 0.9 every 1000 epochs (iterations). The mini-batch size is 128, and the number of training epochs is 1 × 10^4 for the Adam phase. L-BFGS-B is a limited-memory quasi-Newton optimizer for bound-constrained optimization. It is known to work very well at escaping from local optima during network training and requires little tuning. For L-BFGS-B training, the input training set consists of 1280 points randomly sampled from the solution domain. We set the stopping criterion of L-BFGS-B to sys.float_info.min, the minimum positive normalized float value in Python 37 .
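The Adam-phase step-decay schedule described above can be sketched as:

```python
def learning_rate(epoch, lr0=1e-4, decay=0.9, every=1000):
    """Step decay: multiply the initial rate by `decay` every `every` epochs."""
    return lr0 * decay ** (epoch // every)
```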
For all test cases, we trained DFS-Net on Intel(R) Xeon(R) Gold 6150 CPUs. The partial differential operators in the governing equations are computed using "tf.gradients()", based on the chain rule and automatic differentiation in TensorFlow 1.15.0 38,39 . During training, the random seeds for TensorFlow and NumPy 37 are set to 666 to ensure the reproducibility of the experimental results.

Results and discussion
In this section, we study and compare the performance of DFS-Net with other widely used DNN-based surrogate models for PDE solving, including PINN 23 , PINN with the learning rate annealing algorithm (PINN-anneal) 25 , DGM 26 , and GP-PINN 25 . To evaluate the prediction accuracy of the different models, we use the relative L2-error criterion, defined as:

error = ‖u_pred − u_ref‖_2 / ‖u_ref‖_2,

where u_ref denotes the reference solution given by the analytical solution or high-fidelity DNS numerical results, and u_pred denotes the predicted solution obtained by the surrogate models.
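The relative L2-error criterion can be sketched as:

```python
import numpy as np

def rel_l2_error(u_pred, u_ref):
    """Relative L2-error between the predicted and reference solutions."""
    return np.linalg.norm(u_pred - u_ref) / np.linalg.norm(u_ref)
```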

Helmholtz equation.
In the first test case, we use the two-dimensional Helmholtz equation as a benchmark to investigate the function approximation capability of the proposed methodology. This equation is one of the fundamental PDEs arising in various fields, such as acoustics, electromagnetism, and elastic mechanics 40 . The two-dimensional Helmholtz equation we used is given by:

Δu(x, y) + k² u(x, y) = q(x, y), (x, y) ∈ Ω,
u(x, y) = 0, (x, y) ∈ ∂Ω,

where Δ denotes the Laplace operator, Ω = [0, 1] × [0, 1], and q(x, y) is a source term given by:

q(x, y) = −(α1 π)² sin(α1 πx) sin(α2 πy) − (α2 π)² sin(α1 πx) sin(α2 πy) + k² sin(α1 πx) sin(α2 πy).

The reference solution u(x, y) is computed by the analytical solution (k = 1) given by:

u(x, y) = sin(α1 πx) sin(α2 πy).

To obtain the predicted solution of the above system, we formulate a loss function (Eq. 18) penalizing the residual of the governing equation and the mismatch of the boundary condition to guide the subsequent training. In this case, we construct a three-hidden-layer DFS-Net with 50 neural units per layer to find the optimal network parameters for which the loss function (Eq. 18) is minimized. We conduct experiments on two different equation settings: (1) α1 = 1 and α2 = 4, and (2) α1 = 2 and α2 = 3. Figure 3 depicts the convergence of DFS-Net on these two Helmholtz benchmarks. From the variation curves of the loss value, we can observe that the two-step optimization (Adam followed by L-BFGS-B) is robust for DFS-Net. During the Adam training phase, the value of the loss terms (Loss_Nr and Loss_Nb) decreases as the learning rate decays with increasing epochs. During the L-BFGS-B phase, DFS-Net rapidly converges after 3417 epochs (for α1 = 1 and α2 = 4) and 2251 epochs (for α1 = 2 and α2 = 3), respectively.
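As a sanity check (not part of the training pipeline), one can verify numerically that the stated analytical solution satisfies the Helmholtz equation with the source term q for the first setting (α1 = 1, α2 = 4, k = 1):

```python
import numpy as np

# Check that u = sin(a1*pi*x) * sin(a2*pi*y) solves
# Laplacian(u) + k^2 * u = q(x, y) with the source term above.
a1, a2, k = 1.0, 4.0, 1.0

def u(x, y):
    return np.sin(a1 * np.pi * x) * np.sin(a2 * np.pi * y)

def q(x, y):
    s = u(x, y)
    return -(a1 * np.pi)**2 * s - (a2 * np.pi)**2 * s + k**2 * s

def lap_u(x, y):
    # Exact Laplacian of the product-of-sines solution.
    return -((a1 * np.pi)**2 + (a2 * np.pi)**2) * u(x, y)

x, y = np.meshgrid(np.linspace(0.0, 1.0, 11), np.linspace(0.0, 1.0, 11))
residual = lap_u(x, y) + k**2 * u(x, y) - q(x, y)  # ~0 everywhere on the grid
```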
In Fig. 4, we compare the predicted solution u_pred with the reference solution u_ref and report the point-wise absolute error between them. It is evident that the DFS-Net-based approximation fits the governing equation and boundary conditions well on the Helmholtz benchmarks. In particular, the introduced point weighting mechanism and excitation blocks in DFS-Net effectively alleviate the problem of unstable predictions by PINNs. Compared with the absolute errors of PINN and GP-PINN depicted in Fig. 4b and d, we can clearly see the advantage of the proposed DFS-Net: the predicted solution is more accurate, especially in subdomains near the boundaries.
In Table 1, we compare the DFS-Nets (fixed excitation mode, DFS-Net fix, and unfixed excitation mode, DFS-Net unfix) against the other surrogate models. To ensure a fair comparison, we set all models to the same architecture of 3 hidden layers with 50 neurons each and use the same hyperparameter settings. The experimental results show that the DFS-Nets outperform the existing neural network-based solvers. When α1 = 1 and α2 = 4, DFS-Net fix achieves an average L2-error of 3.27e−03, while DFS-Net unfix yields 1.48e−03. The prediction errors of DFS-Net are approximately two orders of magnitude lower than those of PINN and DGM and one order of magnitude lower than those of PINN-anneal. Similar results can be seen in Table 2, where α1 = 2 and α2 = 3: the proposed DFS-Nets better extract the solution-related features inherent in the conservation laws compared to the other models, and DFS-Net unfix achieves the best performance of 2.95e−03 on this benchmark.
The training time column of Table 1 reports the computational overhead of training each model.

Klein-Gordon equation. This equation is a second-order nonlinear PDE closely related to many scientific fields, such as quantum, solid-state, and condensed matter physics 41 . The initial boundary value problem of the one-dimensional Klein-Gordon equation takes the general form

u_tt + α u_xx + β u + γ u^k = f(x, t), (x, t) ∈ Ω × [0, T],

subject to appropriate boundary and initial conditions. A composite loss function that includes the governing equation term, the boundary condition term, and the initial condition term is formulated for this benchmark in the same weighted form as before. The DFS-Net used in this case consists of five hidden layers with 50 neural units in each layer. During the DFS-Net training process, we seek the minimum loss value by tuning the network parameters. For this purpose, we use the chain rule to back-propagate derivatives from the output layer to the inputs and update the weights, biases, and λ. After offline training, this DNN-based surrogate model is expected to provide rapid online predictions of the observables with a set of optimal parameters θ.
In Fig. 5a, we present the comparison between the reference solution and the predicted solution given by DFS-Net. We also summarize the pointwise absolute error of different surrogate models on the Klein-Gordon equation in Fig. 5b. As expected, DFS-Net presents good agreement with the reference solution and achieves the smallest pointwise absolute error in the solution domain of this problem. Table 3 provides a more detailed evaluation of the L2-error for different models. As shown in Table 3, both modes of DFS-Net are able to yield solutions with high accuracy. The network with unfixed excitation blocks (DFS-Net unfix) achieves the best performance of 1.45e−03 at the end of 13,775 epochs (10,000 epochs of Adam-based training and 3775 epochs of L-BFGS-B-based training). We can also observe that by introducing the learning rate annealing method, the DNN-based surrogate models obtain more robust prediction results (corresponding to the models PINN-anneal, GP-PINN, and DFS-Nets). In contrast, PINN and DGM fail to yield the desired prediction accuracy, leading to L2-errors of 1.38e−01 and 2.09e−01, respectively.
The reference solution of this benchmark is given analytically by u(x, t) = x cos(5πt) + (xt)³. In this case, we also investigate the effect of different architectures, i.e., the number of hidden layers and the number of neurons per layer, on the relative L2-error of the predicted solution. Figure 6a shows the performance when the number of model layers varies from 1 to 10. As we can see, a single-layer architecture tends to return incorrect predictions. By increasing the number of layers, the improved surrogate model yields better accuracy. However, we also observe that excessive layers lead to overfitting, resulting in suboptimal results. Similar trends can be seen in Fig. 6b, which analyzes the performance for a varying number of neurons. An increase in the number of neurons does not guarantee improved performance, and the model works best when the number of neurons per layer is 50 to 70.

Lid-driven cavity flow. In the last case, we employ a canonical benchmark problem, the steady-state flow in a two-dimensional lid-driven cavity (see Fig. 7), to analyze the performance of the DNN-based surrogate models. The flow system is governed by the incompressible Navier-Stokes equations 42,43 , which can be written as:

u(x, y) · ∇u(x, y) + ∇p(x, y) − (1/Re) Δu(x, y) = 0, (x, y) ∈ Ω, (26)
∇ · u(x, y) = 0, (x, y) ∈ Ω,
u(x, y) = (1, 0), (x, y) ∈ Γ1,
u(x, y) = (0, 0), (x, y) ∈ Γ0,

where u(x, y) is the velocity vector field, p(x, y) is the scalar pressure field, Re is the Reynolds number of the flow, Ω = [0, 1] × [0, 1], Γ1 denotes the top boundary (lid) of the two-dimensional square cavity, and Γ0 denotes the other three sides.
In our experiments, we perform a neural network-based simulation using the vorticity-velocity (VV) formulation of the Navier-Stokes equations 42 . In this formulation, the velocity components u, v are obtained by taking derivatives of the scalar potential (stream) function ψ(x, y) with respect to the y and x coordinates, respectively:

u = ∂ψ/∂y, v = −∂ψ/∂x.

As a result, the continuity equation ∇ · u(x, y) = 0 for incompressible fluids is automatically satisfied. Moreover, since only steady-state solutions are considered for this proof of concept, the temporal terms can be neglected. The corresponding loss function for this benchmark combines the momentum-equation residuals with the boundary data-fit terms, where h1(x, y) = (1, 0) for (x, y) ∈ Γ1, and h0(x, y) = (0, 0) for (x, y) ∈ Γ0. The first and second derivatives of ψ, u, v, and p with respect to the spatial coordinates (x, y) are computed using automatic differentiation.
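The automatic satisfaction of continuity is easy to check numerically: for any smooth ψ, the velocities u = ∂ψ/∂y and v = −∂ψ/∂x are divergence-free. The sketch below uses central finite differences and an arbitrary test ψ (a made-up smooth function, not the cavity solution):

```python
import numpy as np

h = 1e-4  # finite-difference step

def psi(x, y):
    """Arbitrary smooth test stream function."""
    return np.sin(np.pi * x) * np.cos(2 * np.pi * y) + x**2 * y

def u(x, y):  # u = d(psi)/dy, central difference
    return (psi(x, y + h) - psi(x, y - h)) / (2 * h)

def v(x, y):  # v = -d(psi)/dx, central difference
    return -(psi(x + h, y) - psi(x - h, y)) / (2 * h)

def divergence(x, y):  # du/dx + dv/dy, should vanish identically
    return ((u(x + h, y) - u(x - h, y)) + (v(x, y + h) - v(x, y - h))) / (2 * h)

div = divergence(0.3, 0.7)  # ~0 up to floating-point round-off
```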
In this case, we conduct experiments at Reynolds numbers Re = 100, 300, and 600 to comprehensively study the prediction performance of neural network-based surrogate models. The DFS-Net we used again contains three hidden layers. To better infer the PDE solutions, we gradually increase the number of neurons in each hidden layer as the Reynolds number increases. For each Re, we implement the two modes of DFS-Net for training by means of the point weighting mechanism and excitation blocks. The models take the expanded spatial coordinates as inputs and output the pressure and vorticity fields. To validate the prediction accuracy of the trained models, we solve the Navier-Stokes equations to generate the reference solution using the open-source CFD solver OpenFOAM 44 . For the different Reynolds numbers, DFS-Nets show rapid convergence and reach their local optimum after approximately 20,000 epochs. The reference and predicted velocity distributions are shown in Fig. 8. It is observed that DFS-Net can accurately capture the intricate nonlinear behavior of the Navier-Stokes equations and agrees well with the CFD solutions.
It is worth noting that in this case, the velocity component u changes sharply from 0 to 1 at the junctions of Γ1 and Γ0 [i.e., points (0, 1) and (1, 1)] according to the definition of the boundary velocity. These sharp discontinuities can lead to instability during neural network training in the near-wall regions. To emphasize the ability of the proposed model to handle nonlinearity in different subdomains, we plot and compare the point-wise absolute error obtained by PINN, GP-PINN, and DFS-Net in Fig. 9. It is observed that PINN fails to provide satisfactory prediction results for the underlying NS equations at the different Re settings. At a Reynolds number of 100, PINN suffers a large error near the right boundary, yielding a prediction error of 3.47e−01. As the Reynolds number increases, the errors become more severe: 6.25e−01 at Re = 300 and 7.76e−01 at Re = 600.
When Re = 100 , GP-PINN can mitigate the prediction errors caused by sharp discontinuities. However, the incorrect prediction in the near-wall subdomains is still obvious (see Fig. 9). In contrast, the proposed DFS-Net appears to be more robust as the Reynolds number increases. The visualization results show that DFS-Net has a better ability to approximate complicated functions than other models and obtains a stable prediction accuracy in all three Re cases. This proves that the combined use of the weighting mechanism and the excitation blocks in DFS-Net has a positive effect on the velocity prediction of near-wall regions. The weighting mechanism assigns different weights to the sample points, thus eliminating any potential discontinuities (the loss weight of the discontinuous points is 0). Meanwhile, the introduced excitation blocks work as a tool to bias the allocation of available processing resources towards the most informative components of the expanded channels and correspondingly increase the weights of these solution-related features. As a result, they speed up the convergence of DFS-Net and allow us to achieve better accuracy.
To further analyze the performance of our method, we compare the L 2 -error given by DFS-Nets against the other four widely used surrogate models. The experimental results are summarized in Tables 4, 5 and 6. When Re = 100 , DFS-Net fix and DFS-Net unfix perform better than the other comparative models, and the resulting prediction error is measured at 1.34e−02 and 2.91e−02 in the relative L 2 -error, respectively. These two models improve the prediction accuracy of PINN and PINN-anneal by a factor of 5-25, although more training time is required. Compared with DGM and GP-PINN, the proposed models are more accurate and more efficient. Benefiting from the lightweight structure of DFS-Nets, the total training time of DFS-Net fix is 1610.1 s, while that of DFS-Net unfix is 1730.3 s. The prediction overhead for each sample is approximately 0.5 ms.
When Re = 300 and 600, DFS-Net uniformly leads to the most accurate results we have obtained for this benchmark. Compared with the widely used models, the prediction error of the two DFS-Nets is reduced by nearly an order of magnitude, yielding L2-errors ranging from 3.31-3.80% at Re = 300 and 7.25-8.50% at Re = 600. We also conducted experiments at higher Re settings to verify the performance of the proposed methodology. However, problems with higher Reynolds numbers tend to have more stringent requirements on the depth (width) of the underlying network, which can lead to very large training overheads (e.g., more than 100k epochs of training are required to ensure an L2-error of less than 10% at a Reynolds number of 1000). This excessive training overhead weakens the practicability of neural network-based methods and is therefore not discussed further in this paper.
The tuning of hyperparameters is an essential part of deep learning methods. Here, we evaluate the effect of two basic training hyperparameters, namely the learning rate and the batch size, at Re = 600. During Adam-based training, the learning rate controls the step size of the gradient descent. An overly large learning rate might overshoot and prevent convergence, while an overly small one may get stuck in a local minimum, thus providing suboptimal solutions. A comparison of the L2-error for different learning rates is shown in Fig. 10a. We can see that in this test case, the range of 1.0e−02 to 1.0e−03 yields good convergence. Figure 10b depicts the performance when DFS-Net is trained with different batch sizes. Due to the memory limitations of the underlying system, we only test batch sizes smaller than 512. The experimental results show that a relatively large batch size is required to achieve the desired accuracy, and the minimum error is achieved when the batch size is 256.
Overall, we performed a comprehensive study on the robustness of the proposed DFS-Net model using three PDE benchmarks (the Helmholtz, Klein-Gordon, and Navier-Stokes equations). To keep the training and prediction costs low, we did not consider very deep architectures in any of the test cases. Instead, we employed a fixed neural architecture (fewer than 256 neural units per layer) to evaluate the L2-error against the reference solution, as well as the time cost required to complete each simulation. The experimental results demonstrate that DFS-Net is able to alleviate the problem of unstable predictions exhibited by existing neural network-based surrogate models and infer a solution of the underlying partial differential equations with remarkable accuracy.

Conclusion
Deep neural networks provide an efficient substitute for inferring PDE solutions because of their universal approximation capabilities in high-dimensional parameter spaces. In this paper, we designed an improved neural network-based surrogate model, DFS-Net, for PDE solving. The proposed model employs a series of attention-based neural units to approximate the nonlinear mapping relations between the coordinate inputs and the predictions. Moreover, we introduced a weighting mechanism in DFS-Net to enhance its ability to encode the underlying physical laws that govern a given PDE system. After suitable training, DFS-Net provides a computationally efficient and fully differentiable surrogate, where the quantities of interest can be immediately obtained by evaluating the trained network at any given input point, without meshing.
To verify the robustness of DFS-Net, we conducted a collection of numerical studies on different surrogate models in terms of their learning efficiency and prediction accuracy. The experiments demonstrated that DFS-Net is able to yield a good trade-off between accuracy and efficiency. It outperforms the widely used surrogate models and achieves the best prediction performance on different numerical benchmarks, including the Helmholtz, Klein-Gordon, and Navier-Stokes equations. Designing a suitable deep neural network for a PDE system remains challenging, as there are many architectural and parametric possibilities to consider. In future work, we will focus on studying the implicitly encoded features in the current DFS-Net and calibrating the model to more complex tasks. As deep learning technology continues to grow rapidly in terms of both methodological and algorithmic developments, we believe that neural network-based surrogate models will play an increasingly important role in scientific computing.

Figure 10. A comparison of the L2-error for different learning rates and batch sizes.