Physics-informed learning of governing equations from scarce data

Harnessing data to discover the underlying governing laws or equations that describe the behavior of complex physical systems can significantly advance our modeling, simulation and understanding of such systems in various science and engineering disciplines. This work introduces a novel approach called physics-informed neural network with sparse regression to discover governing partial differential equations from scarce and noisy data for nonlinear spatiotemporal systems. In particular, this discovery approach seamlessly integrates the strengths of deep neural networks for rich representation learning, physics embedding, automatic differentiation and sparse regression to approximate the solution of system variables, compute essential derivatives, as well as identify the key derivative terms and parameters that form the structure and explicit expression of the equations. The efficacy and robustness of this method are demonstrated, both numerically and experimentally, on discovering a variety of partial differential equation systems with different levels of data scarcity and noise accounting for different initial/boundary conditions. The resulting computational framework shows the potential for closed-form model discovery in practical applications where large and accurate datasets are intractable to capture.


Algorithm 1
The proposed ADO for network training: rθ best , Λ best s " ADO pD u , D c , α, γ,∆δ, n max , n str q 1: Input: Measurement data Du, collocation points Dc " txi, tiui"1,2,...,N c , relative weighting of loss functions α and γ, threshold tolerance increment ∆δ for STRidge, maximum number of ADO iterations nmax, and maximum number of STRidge cycles nstr. # we take a 2D system in a 2D domain as an example: u " tu, vu and x " tx, yu 2: Split measurement data Du into training-validation sets (ntr{nva " 80{20): D tr u P R n trˆ2 and D va u P R nvaˆ2 . # Nm " ntr`nva 3: Split collocation points Dc into training-validation sets (mtr{mva " 80{20): D tr c P R m trˆ3 and D va c P R mvaˆ3 . # Nc " mtr`mva 4: Initialize the Tensor Graph for the entire network. # κ can be determined via a Pareto front analysis, e.g., κ " 1.

5:
Enforce sparsity to small values by setting them to zero:λI " 0. 6: Update remaining non-zero values with ridge regression: . # the parameter 1ˆ10´5 is small and tunable 7: until maximum number of iterations reached. 8: Output: The best solutionλ "λI YλJ

Selection of hyper-parameters
The criteria for selecting hyper-parameters tα, β, γ, n max u have been introduced in Main Text (see Method). Other required hyper-parameters are selected based on the criteria below.
• Maximum Number of Epochs: Our intuition of selecting the number of epochs for DNN training follows the general practice: the network should be sufficiently trained. It depends on the specific problem complexity and the number of trainable parameters. The pre-training typically consists of up to 10 4 " 10 5 epochs for Adam and/or BFGS, while, in each ADO iteration, 10 3 epochs for Adam are used. Post-training might consist of up to 3ˆ10 4 epochs for Adam and/or BFGS.
• Maximum Number of STRidge Cycles (n str ): We set the number of STRidge cycles large enough (e.g., 100) to allow the algorithm to heuristically locate an optimal threshold.
• Threshold Tolerance Increment (∆δ): The role of ∆δ is to adaptively determine δ that introduces proper sparsity in the discovered equation. Since the optimal threshold is automatically determined and normalized in STRidge, we take a small positive value (e.g., ∆δ " 1) as long as the number of STRidge cycles is sufficient. However, in the case of complex-value systems (such as the nonlinear Schrödinger equation), since the threshold is compared with the modulus of complex numbers, we find empirically setting ∆δ as 100 is more proper.
The list of hyper-parameters used in all examples in the paper is presented in Section 3.8.

Examples
We observe the efficacy and robustness of our methodology on a group of canonical PDEs used to represent a wide range of physical systems with nonlinear, periodic and/or chaotic behaviors. In particular, we discover the closed forms of Burgers', Kuramoto-Sivashinsky (KS), nonlinear Schrödinger, Navier-Stokes (NS), and λ-ω Reaction-Diffusion (RD) equations from scarce and noisy time-series measurements recorded by a number of sensors at fixed locations from a single I/BC. Gaussian white noise is added to the synthetic response with the noise level defined as the root-meansquare ratio between the noise and the exact solution. To demonstrate the "root-branch" network for discovery of PDE(s) based on multiple independent datasets sampled under different I/BCs, we consider (1) the 1D Burgers' equation with light viscosity that exhibits a shock behavior, and (2) a 2D Fitzhugh-Nagumo (FN) type reaction-diffusion system that describes activator-inhibitor neuron activities excited by external stimulus. At last, we test our framework on the experimental data of cell migration and proliferation. Our method is also compared with SINDy (the PDE-FIND approach presented in [2]) which is also presented herein. The identification error is defined as the average relative error of the identified non-zero PDE coefficients with respect to the ground truth, which is used to evaluate the accuracy of the discovered PDEs for the following examples. Simulations in this paper are performed on a workstation with 28 Intel Core i9-7940X CPUs and 2 NVIDIA GTX 1080Ti GPU cards (otherwise, noted separately).

Burgers' equation
We first consider a dissipative system with the dynamics governed by a 1D viscous Burgers' equation expressed as where u is a field variable, x and t are the spatial and temporal coordinates, and ν denotes the diffusion coefficient. The equation describes the decaying stationary viscous shock of a system after a finite period of time, commonly found in simplified fluid mechanics, nonlinear acoustics, gas dynamics and traffic flow. In this work, solution for the Burgers' Equation is from an open dataset [2], in which the diffusion coefficient ν is assumed to be 0.1 and u is discretized into 256 spatial grid points for 101 time steps with a Gaussian initial condition. In particular, 10 sensors are randomly placed at fixed locations among the 256 spatial grid points to record the wave response for 101 time steps, leading to 3.19% of the dataset used in [2]. Among all the measurements, 80% are allocated for training purpose and the rest 20% for validation. A total number of 5ˆ10 4 collocation points, e.g., in the pair of tx, tu, are sampled by the Latin hypercube sampling [3]. A group of 16 candidate functions (φ P R 1ˆ16 ) are used to reconstruct the PDE, consisting of polynomial terms (u, u 2 , u 3 ), derivatives (u x , u xx , u xxx ) and their multiplications whose coefficient vector Λ is initialized uninformatively as zeros. The fully connected DNN has 8 hidden layers and a width of 20 neuron nodes in each layer where activation functions are hyperbolic tangent, weights are initialized based on Glorot normal distribution [4] and biases are initialized as zeros. The training efforts are performed via up to 1ˆ10 4 epochs of L-BFGS for pre-training and 6 ADO iterations. In each ADO iteration, we use 1000 epochs of Adam to train the DNN for alternation with STRidge. The discovered equation for the dataset with 10 % noise reads: where the aggregated relative identification error for all non-zero elements in Λ is 0.88˘0.03%. Despite that only 3.19% subsampled responses are measured, the PINN-SR approach can accurately extrapolate the full-field solution with a 2 error of 1.32% (see Fig. 1A). Fig. 1B shows the comparison of spatial and temporal snapshots between the predicted and the exact solutions which agree extremely well.

Kuramoto-Sivashinsky equation
Another dissipative system with intrinsic instabilities is considered, governed by the 1D Kuramoto-Sivashinsky (KS) equation: u t "´uu x´uxx´uxxxx where the reverse diffusion term´u xx leads to the blowup behavior while the fourth-order derivative u xxxx introduces chaotic patterns as shown in Fig. 2B, making an ideal test problem for equation discovery. Starting with a smooth initial condition, the KS system evolves to an unstable laminar status due to the highly nonlinear terms including the high-order derivative. The KS equation is widely used to model the instabilities in laminar flame fronts and dissipative trapped-ion modes among others. We subsample the open dataset [2] by randomly choosing 320 points from the 1024 spatial grid nodes as fixed sensors and record the wave response for 101 time steps, occupying about 12.6% of the original dataset. The training/validation measurements are separated based on an 80-20 principle. A set of 2ˆ10 4 collocation points, generated using the Latin hypercube sampling in the spatiotemporal domain, are employed to evaluate the residual physics loss. A library of 36 candidate functions are used to construct the PDE, consisting of polynomials (u, u 2 , u 3 , u 4 , u 5 ), derivatives (u x , u xx , u xxx , u xxxx , u xxxxx ) and their multiplications, whose initial coefficients are uninformatively chosen to be zeros. The DNN architecture has 8 hidden layers each with 40 nodes whose activation functions and initialization are the same as the previous Burgers' case. The training efforts are performed via 8ˆ10 4 epochs of Adam and up to 8ˆ10 4 epochs of L-BFGS for pre-training and 6 ADO iterations. In each ADO iteration, we use 1000 epochs of Adam to train the DNN for alternation with STRidge. As for post-training, 2ˆ10 4 Adam epochs and up to 2ˆ10 4 L-BFGS epochs are used. It is notable that the chaotic behavior poses significant challenges in approximating the fullfield spatiotemporal derivatives, especially the high-order u xxxx , from poorly measured data for discovery of such a PDE. Existing methods (for example the family of SINDy methods [2,5]) eventually fail in this case given very coarse and noisy measurements. Nevertheless, the PINN-SR approach successfully distils the closed form of the KS equation from subsampled sparse data even with 10% noise: where the coefficients have an average relative error, for all non-zero elements in Λ, of 0.94˘0.05%.
Although the available measurement data are sparsely sampled in the spatiotemporal domain under a high-level noise corruption, the predicted full-field wave by the trained PINN-SR also agrees well with the exact solution with a relative 2 error of 2.14% ( Fig. 2A). The spatial and temporal snapshots of the predicted response match seamlessly the ground truth as shown in The relative full-field 2 error of the prediction is 2.14%.

Nonlinear Schrödinger equation
In the third example, we discover the nonlinear Schrödinger equation, originated as a classical wave equation, given by iu t "´0.5u xx´| u| 2 u where u is a complex field variable. This well-known equation is widely used in modeling the propagation of light in nonlinear optical fibers, Bose-Einstein condensates, Langmuir waves in hot plasmas, and so on. The solution to this Schrödinger equation is simulated based on a Gaussian initial condition with the problem domain meshed into 512 spatial points and 501 temporal steps, while the measurements are taken from 256 randomly chosen spatial "sensors" for 375 time instants, resulting in 37.5% data used in [2] for uncovering the closed form of the equation. A library of 40 candidate functions are used for constructing the PDE, varying among polynomial functions (u, u 2 , u 3 ), absolute values (|u|, |u| 2 , |u| 3 ), derivatives (u x , u xx , u xxx ) and their combination, whose coefficients are initialized as zeros. Since the function is complex valued, we model separately the real part (u R ) and the imaginary part (u I ) of the solution in the output of the DNN, assemble them to obtain the complex solution u " u R`i u I , and construct the complex valued candidate functions for discovery. To avoid complex gradients in optimization, we compute the mean square error of the modulus |u| for the residual physics loss L p . The fully connected DNN has 8 hidden layers and a width of 40 neuron nodes in each layer with the same activation functions and initialization for network's weights and biases as previous. The pre-training takes 8ˆ10 4 epochs of Adam (with at most 1.6ˆ10 5 epochs of L-BFGS) followed by 6 ADO iterations. In each ADO iteration, we use The discovered equation under 10 % noise is written as where the average relative error for non-zero coefficients is 0.08˘0.03%. The predicted full-field response, for both real and imaginary parts, matches well the exact solution with a slight relative 2 error of about 0.26% ( Fig. 3A and B ). The comparison of spatiotemporal snapshots between the predicted and the exact solutions for the real part (Fig. 3C ) and imaginary part (Fig. 3D) also shows almost perfect agreement.

Navier-Stokes equation
We consider a 2D fluid flow passing a circular cylinder with the local rotation dynamics (see Fig.  4). For incompressible and isotropic fluids which also have conservative body forces, the well-known Navier-Stokes vorticity equation reads w t "´uw x´v w y`0 .01w xx`0 .01w yy where w is the spatiotemporally variant vorticity, u " tu, vu denotes the fluid velocities at Reynolds number 100. The full-field solution to the NS vorticity equation is obtained using the immersed boundary projection method [6]. The dimensionless domain is discretized into a 499ˆ199 spatial grid and 151 time steps. The cylinder has a unit diameter and the input flow from the left side has a unit velocity. Measurements of velocities tu, vu and vorticity w are taken at 500 random spatial locations lasting 60 time steps in the boxed area behind the cylinder as shown in Fig. 4, namely 0.20% subsamples from the total dataset and 1/10 of the data used in [2]. The residual physics loss is evaluated on 6ˆ10 4 collocation points randomly sampled in the spatiotemporal doamin using the Latin hypercube sampling method. The library of candidate functions consists of 60 components including polynomial terms (u, v, w, uv, uw, vw, u 2 , v 2 , w 2 ), derivatives (w x , w y , w xx , w xy , w yy ) and their combination, whose coefficients are initialized as zeros. The latent output in the DNN contains u, v and w. The DNN has 8 fully connected hidden layers and a width of 60 nodes in each layer with the same activation functions and initialization methods as previous. The pre-training takes 5ˆ10 3 epochs of Adam (with additional L-BFGS tuning up to 1ˆ10 4 epochs) followed by 6 ADO iterations. In each ADO iteration, we use the Adam optimizer with 1000 epochs to train the DNN for each alternation within STRidge.
The discovered NS vorticity equation for the case of 10% noise is given as follows where the aggregated relative identification error for all non-zero elements in Λ is 1.22˘0.69%. It is encouraging that the uncovered vorticity equation is almost identical to the ground truth, for both the derivative terms and their coefficients, even under 10% noise corruption. The vorticity patterns and magnitudes are also well predicted as indicated by multiple spatial snapshots at different time instants (t " 0.  Fig. 5C ). Note that the response in these snapshots is not used in training the network. The 2 error of the predicted full-field vorticity response is about 2.58%. This example provides a compelling test case for the proposed PINN-SR approach which is capable of discovering the closed-form NS equation with scarce and noisy data.

λ-ω type Reaction-Diffusion equations
The examples discussed previously are low-dimensional (1D) models with limited complexity. We herein consider a λ-ω reaction-diffusion (RD) system in a 2D domain with the pattern forming behavior governed by two coupled PDEs: where u and v are two field variables. The λ-ω equations are typically used to describe the multiscale phenomenon of local reactive transformation and the global diffusion in chemical reactions, with wide applications in pattern formation [7], biological morphogenesis [8], and ecological invasions [9], among others. The λ-ω equations exhibit a wide range of behaviors including wave-like phenomena and self-organized patterns found in chemical and biological systems. The binomial system is also called an activator-inhibitor system because one state variable encourages the increase of both states while the other state component inhibits their growth. The particular λ-ω equations in this test example display spiral waves subjected to periodic boundary conditions. The domain for generating the solution is divided into 65,536 (256ˆ256) spatial points with 201 time steps. We take randomly 2,500 spatial points as fixed sensors recording the wave response for 15 randomly sampled time steps, leading to 1/4 of the subsampled dataset used in [2] and 0.29% of the total data. We sample 8ˆ10 4 collocation points using the Latin hypercube sampling to evaluate the residual physics loss. A total of 110 candidate functions are employed, including polynomials up to the 3rd order (u, v, u 2 , v 2 , uv, u 3 , u 2 v, uv 2 , v 3 ), derivatives up to the 2nd order (u x , u y , v x , v y , u xx , u xy , u yy , v xx , v xy , v yy ) and their combination, for the sparse discovery of the two PDEs, whose coefficients are initialized as zeros. Since the system dimension is relatively high, we enhance the discovery by post-training (post-tuning) of the DNN and the uncovered non-zero PDE coefficients, after the ADO stage, resulting in refined/improved discovery. The DNN has 8 fully connected hidden layers and a width of 60 nodes in each layer, whose activation functions and initializations are the same as previous case studies. Due to the high dimensionality of the system and a large candidate library, we run this case on a workstation with an NVIDIA Tesla V100 GPU card (32 GB). The pre-training takes 1ˆ10 4 epochs of Adam (with additional L-BFGS tuning up to 4ˆ10 4 epochs) followed by 6 ADO iterations. In each ADO iteration, we use 1ˆ10 3 Adam epochs to train the DNN for each alternation within STRidge. After discovering the PDE equation structure, we further use 1ˆ10 4 Adam and up to 1ˆ10 4 L-BFGS to refine DNN and equation

coefficients.
The reconstructed equations for the case of 10% noise are given by where the the average relative error for all non-zero coefficients is 1.84˘1.48%. The predicted response snapshots by the trained PINN-SR at different time instants, e.g., t " t0.5, 2.15, 4.35, 5.45, 7.2, 9.65u, are shown in in Fig. 6A and B, which are very close to the ground truth (the errors are distributed within a small range). This example shows especially the great ability and robustness of our method for discovering governing PDEs for high-dimensional systems from highly noisy data. The average relative full-field 2 error of the predictedû andv is 2.14%.

Comparison with SINDy
We have performed the comparison study between the proposed PINN-SR approach and the state-of-the-art PDE-FIND method (an extended version of SINDy) [2], in the context of different levels of data size and noise. We test the five PDEs described previously and summarize the discovery errors for the sparse coefficients in Table 1. The error is defined as the average relative error of the identified non-zero PDE coefficients with respect to the ground truth. If the terms in the PDEs are discovered incorrectly, we mark it as "Fail". It is seen from Table 1 that the proposed PINN-SR approach is capable of correctly uncovering the closed-form PDEs for all cases, regardless of the varying levels of data size and noise, which demonstrates excellent robustness. Although PDE-FIND shows great success in PDE discovery with negligible error for large and clean (or approximately noise-free) measurement data, this method eventually fails when the level of data scarcity and/or noise increases. In general, PDE-FIND relies on the strict requirement of measurement quality and quantity. However, PINN-SR is able to alleviate and resolve this limitation thanks to the combination of the strengths of DNNs for rich representation learning of nonlinear functions, automatic differentiation for accurate derivative calculation as well as 0 sparse regression. In addition, the use of collocation points introduces additional "pseudo datasets", compensates indirectly the scarcity of measurement data, and enriches the constraint for constructing the closed form of PDEs.
We must admit that, since our method involves DNN training, the computational cost is much higher compared with SINDy (e.g., 553 seconds for ours vs. 2 seconds for SINDy, in the Burgers' example). However, the critical bottleneck of SINDy lies in its requirement of large high-quality (clean) structured measurement data, owing to its use of numerical differentiation, which poses critical limitation of SINDy in practical applications where data is sparse and noisy (e.g., the experimental data in the cell migration and proliferation example, discussed in Section 2.4). While PINN-SR is more computationally costly, it is capable of producing accurate discovery even in the presence of sparse and noisy datasets. There is obviously a trade-off between computational efficiency and need of high-quality data. Furthermore, this issue of longer computational time can be well managed through parallel computing on a powerful GPU platform and remains a less important concern compared to the aim for successful discovery of correct underlying PDEs.

Burgers' equation with shock behavior
We consider to discover the previously discussed Burgers' equation (see Section 2.1.1) with a small diffusion/viscosity parameter, expressed as .01 π u xx based on datasets generated by imposing three different I/BCs. The small diffusion coefficient 0.01{π « 0.0032 creates creates shock formation in a compact area with sharp gradient and poses notorious difficulty for many numerical methods to resolve, which could challenge the DNN's approximation ability and thus affect the discovery. The three I/BCs used for data generation include: where G denotes a Gaussian function. The ground truth solution is simulated by MATLAB function pdede in a spatiotemporal domain Ωˆr0, T s " r´1, 1s d"200ˆr 0, 1s d"1000 . For all I/BCs, we assume that there are 30 sensors randomly deployed in space measuring the wave traveling (e.g., u) for 500 time instants (7.5% of the total grid points). A denser sensor grid is needed herein, compared with the previous Burgers' example, in order to capture the shock behaviors. All measurements are polluted with 10% Gaussian noise. The noisy measurements are depicted in Fig. 7A for the three datasets. For visualization purpose, we only draw a handful of signals out of a total of 30 time series for each I/BC. We design a "root-branch" DNN: the root takes the spatiotemporal coordinates tx, tu as input followed by 4 hidden layers of 20 nodes, while each of the three branches is separately connected to the last hidden layer of the root followed by 4 hidden layers of 30 nodes before the output layer, whose activation functions and initializations are the same as the previous. The motivation for this design is that the branch nets can capture the solution difference due to different I/BCs while the shared root net learns the common response patterns that obey the unique Burgers' equation. Note that although we have three distinctive solution approximations, we stack them into one candidate library followed by a unified form of PDE. Therefore, we can combine the information from three datasets to discover one physics equation. A group of 4.5ˆ10 4 collocation points are generated by the Latin hypercube sampling strategy [3] for determining the residual physics loss. The PDE library consists of 16 candidate functions, exactly the same as the Burgers' case in Section 2.1.1, whose coefficients are initialized as zeros. Due to the size of the composite deep neural networks, we run this case on a workstation with an NVIDIA Tesla V100 GPU card (32 GB). The training efforts include up to 8ˆ10 4 epochs pretraining by L-BFGS followed by 6 ADO iterations. In each ADO iteration, we use 1ˆ10 3 Adam epochs to train the DNN in synergy with STRidge. The discovered PDE is given by u t "´1.002uu x`0 .0032u xx which shows great agreement with the ground truth. The trained "root-branch" network can accurately reproduce distinctive system responses even with limited measurements under 10% noise, giving a full-field 2 error of 0.65%, as shown in Fig. 7B-D.

Fitzhugh-Nagumo type of Reaction-Diffusion equations
We consider the Fitzhugh-Nagumo (FN) type reaction-diffusion system, in a 2D domain Ω " r0, 150sˆr0, 150s with periodic boundary conditions, whose governing equations are expressed by two coupled PDEs [10,11]: where u and v represent two interactive components/matters (e.g., biological), γ u " 1 and γ v " 100 are diffusion coefficients, α " 0.01 and β " 0.25 are the coefficients for reaction terms, and ∆ is the Laplacian operator. The FN equations are commonly used to describe biological neuron activities excited by external stimulus (α), which exhibit an activator-inhibitor system because one equation boosts the production of both components while the other equation dissipates their new growth. The  ground truth data is generated by the finite difference method (dx " dy " 0.5 and dt " 0.0002) for the time period of r0, T s " r0, 36s, with three random fields as initial conditions. Three measurement datasets are then generated, each of which consists of 31 low-resolution snapshots (projected into a 31ˆ31 grid) uniformly down-sampled from full-field synthetic data during the period of r7.18, 36s under a 10% noise condition (see Fig. 8). Similar to the previous example in Section 2.3.1, we design a "root-branch" DNN with three branches: the root net has 2 hidden layers of 60 nodes while each of the three branch nets has 3 hidden layers of 60 nodes, whose activation functions and initializations are the same as previous. We sample 5ˆ10 4 spatiotemporal collocation points using Latin hypercube sampling [3] to construct the physics residuals.
We assume the diffusion terms (∆u and ∆v) are known in the PDEs, whose coefficients (γ u and γ v ) yet need to be identified. We employ the bounds to these two positive coefficients to speed up the convergence, namely, γ u P r0, 5s and γ v P r0, 150s. We design 70 candidate functions, composed of up to third-order polynomials (including the constant term "1" as the zero order), derivatives tu x , u y , u xy , v x , v y , v xy u and their mutual multiplication, to reconstruct the nonlinear reaction terms in the PDEs. Hence, the final library has 72 candidate terms, whose initial coefficients are randomly chosen between´1 and 1. To account for the small stimulus term (e.g., 0.01 in the first equation), we increase the sensitivity of the constant candidate "1" in the library by down-scaling its magnitude to the order of 10´5 and 5ˆ10´4 for u and v equations respectively. Due to the memory issue, we run this case on a workstation with an NVIDIA Tesla V100 GPU card (32 GB). The training efforts include the pretraining stage with 2ˆ10 3 Adam epochs and 2ˆ10 4 L-BFGS epochs, 6 ADO iterations, and an extra post-training with 2ˆ10 4 Adam epochs and 2ˆ10 4 L-BFGS epochs. In each ADO iteration, we use 1ˆ10 3 Adam epochs in synergy with STRidge. To deal with the aforementioned bounds for γ u and γ v in an unconstrained optimization process, we set γ u " 5σpγ u q and γ v " 150σpγ v q and take tγ u ,γ v u as trainable variables, where σp¨q denotes the Sigmoid function. The discovered equations under 10 % noise is It is seen that the form of the PDEs is precisely uncovered with all correct active terms (including the unknown external stimulus in the first equation). The corresponding identified coefficients are generally close to the ground truth (error of non-zero coefficients: 9.05˘5.66%) except the diffusion coefficient for v (i.e., γ v ) which seems to be a less sensitive parameter according to our test. It should be noted that, given very scarce and noisy measurement datasets in this example, the "root-branch" DNN is faced with challenges to accurately model the solutions with sharp propagating fronts (see Fig. 9C -D). The less accurate solution approximation by DNN then affects the discovery precision. This issue can be naturally alleviated by increasing the spatiotemporal measurement resolution (even still under fairly large noise pollution, e.g., 10%). Nevertheless, the exact form of the PDEs is successfully discovered in this challenging example, which is deemed more important since the coefficients can be further tuned/calibrated when additional data arrives. The evolution of the PDE coefficients corresponding to 72 candidate functions forû andv is illustrated in Fig. 9A and B, respectively. Note that, for visualization purpose, we re-scale the identified coefficients of the constant stimulus term "1" in the u-equation by multiplying 100 in Fig. 9A and the diffusion term ∆v in the v-equation by dividing 50 in Fig. 9B. The trained network is finally used to predict the full-field responses under three I/BCs (see the snapshots in Fig. 9C -D at two unmeasured time instants). The stacked full-field 2 error is 5.02%.

Experimental discovery of cell migration and proliferation
In this example, we consider to discover a biological system based on scratch assay experiments [12] investigating the cell migration and proliferation process. The 1D cell density distributions at different time instants (0h, 12h, 24h, 36h, 48h) were extracted and simplified from high-resolution imaging via image segmentation and cell counting (see Fig. 10). A series of assays were performed under different initial cell densities (e.g., the total number of cells spans from 10,000 to 20,000 following the designated initial distribution in the test well. More detailed description of the experiment setup and datasets can be found in [12]. Our objective herein is to uncover a parsimonious PDE for modeling the dynamics of cell density ρpx, tq. Here, we consider four scenarios with the initial number of cells ranging from 14,000, 16,000, 18,000 to 20,000. We take the mean of the test data from three identically-prepared experimental replicates for each scenario for PDE discovery. Each mean dataset has a total of 38ˆ5 measurement points for five time instances. To improve the optimization condition, all measurements are upscaled by 1000 before being used in network training. Given our prior knowledge that the cell dynamics can be described by a diffusion (migration) and reaction (proliferation) process, we assume the PDE holds the form of ρ t " γρ xx`F pρq, where γ is the unknown diffusion coefficient and F denotes the underlying nonlinear reaction functional. We use 8 additional candidate terms (e.g., t1, ρ, ρ 2 , ρ 3 , ρ x , ρρ x , ρ 2 ρ x , ρ 3 ρ x u) to reconstruct F, whose coefficients are sparse. Hence, the total number of trainable coefficients remains 9 (e.g., Λ P R 9ˆ1 ), whose initial values are randomly sampled from [´1, 1]. We believe incorporating our domain-specific prior knowledge is reasonable and should be encouraged in interpretable model discovery, which could help improve our solution confidence when available data is very sparse and noisy (e.g., in this example). We sample 1ˆ10 4 collocation pairs using Latin hypercube sampling [3] in the spatiotemporal domain of Ωˆr0, T s " r0, 1900sµmˆr0, 48sh. The DNN has 3 hidden layers of 30 nodes activated by by the tanh function (see Fig. 1 in the Main Text). Considering that the cell density is constantly positive, we impose a softplus function (e.g., lnp1`e z q) in the output layer to curb the final output of ρ. To account for potential large magnitude variation of the candidate term coefficients, we apply the tanh function to squash magnitude gaps. Specifically, we set γ " 1000tanhpγq and λ " 50tanhpλq, whereγ andλ are trainable "proxies" for diffusion coefficient γ and other 8 coefficients λ (note: Λ " tγ, λu P R 9ˆ1 ). The training efforts include the pretraining stage with 8ˆ10 3 Adam epochs and up to 8ˆ10 3 L-BFGS epochs, 6 ADO iterations, and an extra post-training with 2ˆ10 4 Adam epochs and up to 2ˆ10 4 L-BFGS epochs. In each ADO iteration, we use 1ˆ10 3 Adam epochs in synergy with STRidge. The discovered underlying PDEs under different initial cell states are given as follows: 14k cells: ρ t " 530.39ρ xx`0 .066ρ´46.42ρ 2 16k cells: ρ t " 484.74ρ xx`0 .065ρ´43.15ρ 2 18k cells: ρ t " 636.68ρ xx`0 .070ρ´45.48ρ 2 20k cells: ρ t " 982.26ρ xx`0 .078ρ´47.65ρ 2 which share a unified form of ρ t " γρ xx`λ1 ρ`λ 2 ρ 2 which exactly matches the famous Fisher-Kolmogorov model [13,14]. The rates of migration (diffusion) and proliferation (reaction) generally increase along with the number of cells, as seen from the identified coefficients. Fig, 11A-D depict the learned cell density profiles by the trained DNN, which capture the critical patterns of the measurement while showing little evidence of overfitting. With the discovered PDEs, we simulate/predict the evolution of cell densities at different time instants (12h, 24h, 36h and 48h) presented in Fig. 11E -H, where the measurement at 0h is used as the initial condition while ρ x px " 0, tq " ρ x px " 1900, tq " 0 is employed as the Neumann boundary condition. The satisfactory agreement between the prediction and the measurement provides a clear validation of our discovered PDEs.

Discussion
In this section, we discuss several other features, influence factors and limitations of the proposed PINN-SR method for data-driven discovery of PDEs, and highlight the potential future work. ,000 and 20,000 cells, respectively, where the measurement at 0h is used as the initial condition while ρxpx " 0, tq " ρxpx " 1900, tq " 0 is employed as the Neumann boundary condition. The simulation result is represented by solid curves while the markers denote the measurement data.

Simultaneous identification of unknown source
In practical applications, the physical system might be subjected to spatiotemporal source input (p) which is unknown and can be only sparsely measured. When discovering the underlying governing equation for such a system, the source should be considered and reconstructed concurrently. In this case, we incorporate the source candidate functions into the library φ for simultaneous discovery of the PDE and reconstruction of the unknown source. Thus, the sparse representation of the PDE(s) can be written as u t " rφ u φ p srΛ u Λ p s T where φ u and φ p denote the libraries of candidate functions, while Λ u and Λ p are the corresponding sparse coefficients, for the field variable u and the source p, respectively. To demonstrate this concept, we test the Burgers' equation driven by a source term, expressed as u t`u u x´0 .1u xx " sinpxq sinptq.
To generate the solution, the problem domain is meshed into 201 spatial grid points for x P r´5, 5s and 101 time steps for t P r0, 10s. We use 20 fixed sensors randomly selected from the spatial grid points to record the wave response (u) for 50 time steps, polluted with 10% noise. Note that the source is not measured and regarded as unknown.
The following libraries of candidate function are used to reconstruct the PDE and the source: ta, b, c, d, a 2 , b 2 , c 2 , d 2 , ac, ab, ad, bc, bd, cdu where a " sinptq, b " sinpxq, c " cosptq and d " cospxq. The hyperparameters for the PINN-SR network are similar those used in the previous Burgers' example. The pre-training takes up to 15ˆ10 3 epochs of Adam and about 1ˆ10 3 epochs of L-BFGS, followed by 20 ADO iterations. In each ADO iteration, we use the Adam optimizer with 1ˆ10 3 epochs and the L-BFGS with up to 1ˆ10 3 epochs to train the DNN for each alternation within STRidge. The discovered PDE along with the uncovered source term is given by It can be seen by comparing the above two equations that both the sparse terms and the corresponding coefficients are accurately identified, despite only scarce and noisy measurement of the system response is supplied. Although only 4.9% subsampled responses are measured while the source information is completely unknown, the PINN-SR approach can reasonably well extrapolate the full-field solution with a 2 error of 13.8% (see Fig. 12A). The major errors are mostly distributed close to the boundaries due to scarce training data. Fig. 12B shows the comparison of spatial and temporal snapshots between the predicted and the exact solutions which match well with each other. Nevertheless, if the source is very complex with its general expression or form completely unknown, distinct challenges arise when designing the source library of candidate functions φ p . This may require an extraordinarily large-space library to retain diversifying representations, and thus pose additional computational complexity for accurate discovery of the PDEs. In some specific cases, the unknown source term can probably be approximated by the combination of continuous basis functions such as the Fourier series and the Lagrange polynomials, instead of finding its closed form. These open questions will be addressed in our future work.

Effect of missing candidate terms
Since the majority of well-known first-order PDEs w.r.t. time can be represented by linear combination of several active linear/nonlinear terms, we try to include as many as possible commonly seen terms following polynomial basis. Failing to include essential candidate functions will lead to false-positive discovery of parsimonious closed form of PDEs, despite that a "best-of-fit" form can be found. To investigate the effect of missing candidate term(s), we consider a special case where the actual candidate uw x is intentionally ignored from the original library for discovery of the NS equation, leading to an inadequate library with 59 candidate functions. All hyper-parameters and training efforts are the same as those in Section 2.1.4. The discovered equation is given by: w t "´0.253w x`0 .008w yy`0 .035uw xx´0 .782u 2 w x´0 .026u 2 w xx´0 .616vw y´0 .155v 2 w x´0 .526uvw y It can be seen that the discovered equation is less parsimonious and has much more terms compared with the ground truth. The predicted full-field error for w increases from 2.58% (see Section 2.  Ground truth: u t + uu x 0.1u xx = sin(x) sin(t) Figure 12: Discovery of Burgers' equation and source term for measurement data with 10% noise: (A) Evolution of the sparse coefficients Λ P R 30ˆ1 for 30 candidate functions φ P R 1ˆ30 used to form the PDE and the unknown source term, where the color represents the coefficient value. The predicted response in comparison with the exact solution with the prediction error. (B ) Comparison of spatial and temporal snapshots between the predicted and the exact solutions. The relative full-field 2 error of the prediction is 13.8%. The major errors are mostly distributed close to the boundaries due to scarce training data. to 5.25%. We observe that given an insufficient library, the discovery is likely to result in a nonparsimonious equation, where redundant terms are found to replenish the lost pattern of uw x , and consequently yields less accurate response prediction.

Noisy measurements and collocation points
The total loss function is evaluated on the measurement data (for L d ) and the collocation points (for L p ). Therefore, the availability of noisy measurement data and the number of collocation points sampled from the spatiotemporal space will affect the convergence of the PINN-SR model and thus the PDE discovery accuracy. We herein study the sensitivity of PINN-SR to these factors in the context of discovery accuracy based on the Burgers' equation example. In particular, we use the relative 2 -norm error to reflect the global accuracy of of the identified sparse coefficients, defined as e λ " ||Λ´Λ true || 2 {||Λ true || 2 whereΛ denotes the identified coefficients and Λ true is the ground truth. Fig. 13 shows the error metrics for discovering the Burgers' equation under different quantities of measurement points and collocation points and noise levels. Increasing the number of data points in the measurement (e.g., recorded by more sensors) can well compensate the noise  effect as shown in Fig. 13A (the number of collocation points is fixed at 1.6ˆ10 5 ), which agrees with our common sense. Although optimal sensor placement might alleviate the need of large datasets [15], this is out of the scope of this work. The use of more collocation points can mitigate the noise effect and improve the discovery accuracy as illustrated in Fig. 13B (the number of measurement points is fixed at 1ˆ10 3 ). For this specific case, 2ˆ10 4 (or more) collocation points are able to maintain a satisfactory discovery accuracy for measurements under noise corruption at a realistic level (e.g., ď 20%). When the data are sampled under a very noisy condition (e.g., 40% noise level), the proposed method is still robust if a larger number of collocation points are used (e.g., ě 8ˆ10 4 ). It is noteworthy that the collocation points require no correlation with the measurement data. In particular, we use the Sobol sequence [16] (or Latin hypercube sampling [3] which is also applicable) to simulate a finer uniform partitions of the problem domain, making the random sampling of collocation points more representative. Intuitively, the more the collocation points are used, the more generalizable the trained network will be and the more accurate the discovered PDE is. However, a large number of collocation points also impose heavy computational burden, limited by hardware resources. A fairly large amount of collocation samples (e.g., on the order of magnitude of ą 10 4 ), comparable to the complexity and dimension of the discovery problem, are suggested in practical applications meanwhile considering the memory of the computing machine.
We further conduct a comparative study on the role of collocation points and seek for numerical understanding of how much they can help for extrapolation. Taking the Burgers' equation for instance, we define an enclosed area, part of the full-field response, as shown in Fig. 14, and sample the measurement data only within such an area. We intend to reconstruct the full-field response beyond the enclosed area and discover the PDE taking advantage of collocation points. More specifically, the enclosed area is meshed by 100ˆ50 spatiotemporal points. We take 30 randomly selected spatial locations as fixed sensors recording the dynamic response of the system, resulting in 1.5ˆ10 3 data points. Additionally, we sample 8ˆ10 4 collocation points from the full spatiotemporal field for evaluating the residual physics loss during model training. Four cases are considered to demonstrate the function of collocation points with measurements sampled in the enclosed area (see Table 2).
Provided with clean measurements from the enclosed area and global collocation points, PINN-SR does an impressive job on both full-field response reconstruction and sparse coefficients identification (see Case 1 in Table 2). When the measurements become noisy (e.g., 10% noise level), despite the response prediction errors increase, the PDE can still be accurately discovered (see Case 2 in Table 2). If we consider removing all collocation points and only train the network with clean measurements, the response prediction errors (even during the training and validation stage) all remain over 50%, meanwhile the PDE is also completely misidentified (see Case 3 in Table 2). Once we double the clean measurement points to 3ˆ10 3 , the trained DNN has strong interpolation and discovery abilities; however, the trained network does a poor job in extrapolating the full-field response (see Case 4 in Table 2). Concluding from this parametric test, we can see that the collocation points can render PINN-SR tolerable to scarce and noisy measurements, making the DNN generalizable.

On the effect of non-Gaussian noise
To explore the effect of non-Gaussian noise, we test our algorithm on discovery of the FN equations based on a series of low-resolution snapshots blurred/smoothed by a 2D Gaussian filter (e.g., the resulting noise in the measurement data is always positive). In particular, the original high-resolution data for the case of IC3 (see Section 2.3.2) is firstly smoothed by two 2D Gaussian filters, of standard deviation 2 and 12.5 and of spatial size 9ˆ9 and 51ˆ51 for u and v, respectively. Afterwards, we uniformly sub-sample 31 low-resolution snapshots (projected into a 31ˆ31 spatial grid) from the blurred dataset and take them as our measurement data. The rest algorithm settings are the same as those in Section 2.3.2. The discovered equations are given as follows: It is seen that the discovered PDEs bear high resemblance to the ground truth. Fig. 15 displays the evolution of 72 library coefficients Λ. We further predict the full field response of u and v at a high-resolution grid 151ˆ151 for 241 snapshots uniformly distributed in r0, T s " r0, 28.8s. The predicted solution field at one unmeasured time instance t " 23.4 is visualized in Fig. 15 which shows the sporadic evolution of this system with sharp propagating fronts. The error contours at the same time instance mainly appear in the propagation fronts. The overall error for all 241 predicted high-resolution images is about 3.71%. We can conclude that the proposed method is also capable of handling non-Gaussian-type noises.

Optimal sensor placement
In fact, optimal placement of sensors is very important in inverse problems, which could provide informative datasets facilitating discovery. Ideally, sensors should be efficiently installed in the most representative region of the system with sharp gradients (such as Fig. 14), even though such a deployment will require a stronger foresight of how the system will behave. There are active learning strategies similar to [17] that can place new sensors in an online manner to minimize large physics residue in some local areas. Nevertheless, such an investigation is out of the scope of this work. In our present study, since we assume that we have little prior knowledge on the system response, it is infeasible to proactively deploy sensors in specific regions (e.g., the shock developing regions). In particular, when multiple I/BCs are considered, these regions might vary case by case. For instance, in the example of Burgers' equation with shock behavior (see Section 2.3.1), the shock developing regions are very different. Therefore, we choose to randomly place the sensors in space to cover various possibility.

Convergence history
As an example, the total loss history for the NS example (see Section 2.1.4) is shown in Fig.  16, where the loss decreases remarkably during pre-training when both DNN's parameters and the PDE candidate term coefficients are trained to model the paramount patterns revealed by the noisy data (e.g., 10% noise). The switch between Adam and L-BFGS occurs at the Epoch 5000. In the ADO stage, the parsimonious equation structure is adaptively extracted while the model carefully maintains prediction ability, where in consequence the loss doesn't drop too much. The

Pre-train ADO
Post-train Figure 16: The history of the total loss for Navier-Stokes vorticity equation. The loss decreases remarkably during pre-training, where both DNN's weights and biases and library coefficients learn the paramount patterns in the data. In ADO stage, the parsimonious equation structure is adaptively extracted while the model carefully maintains prediction ability, in consequence the loss doesn't drop too much. The loss mildly declines in the end due to the refinement of DNN parameters' and non-zero equation coefficients in particular. jump at the beginning of ADO is due to the update and pruning ofΛ in the STRidge. The loss mildly decreases in the post-training stage due to the refinement of DNN parameters' and non-zero equation coefficients.

A parametric study on network size
The network configurations are adopted based on common practices in popular PINN literature [18][19][20][21][22] and our empirical observations. We found that networks with 4-8 hidden layers, each having 20-60 nodes, are generally sufficient for accurate modeling the system behaviors in our discovery problems, depending on the problem complexity. To further show the effect of the network configuration, we perform a parametric studies in one typical example (e.g., the Burgers' equation) on the number of hidden layers {4, 6, 8} with the number of nodes {20, 40} in each hidden layer. In all cases, we use the same hyper-parameters as those in 2.1.1 except that the measurements are noise-free here. The analyses are performed on a workstation with 28 Intel Core i9-7940X CPUs and 2 NVIDIA GTX 1080Ti GPU cards. Results of 6 different network configurations are shown in Table  3, where it is clear that as long as the layers and nodes are sufficient for solution approximation, it ensures a successful discovery. However, a deeper or wider network requires more training effort and results in longer computational time. Furthermore, a bigger network is more susceptible to overfitting, as the full-field prediction error increases with the growth of network width. The proper value of β should be selected in the Optimal Region as shown in the plots. For example, we select κ " 1 and κ " 100 for discovery of the Burgers and Schrödinger equations, respectively, as shown in Table 4. Note that β " κLp`θ0,Λ0; D va c˘.

List of hyper-parameters used in the examples
Following the consistent criteria discussed in Main Text Method and Section 1.1, the list of hyper-parameters used in the above examples is summarized in Table 4. Here, we illustrate how β is selected based on Pareto front analysis. In particular, we take the Burgers' and nonlinear Schrödinger equations as an example. We firstly construct the sparse regression problem solved by STRidge, where 9 U and Φ are evaluated based on the pre-trained DNN (with the trained network parameters denoted by θ 0 ). A grid search for β is then performed by running only the first (or two) ADO iterations to obtain the graphical representation of the Pareto set (e.g., L p pθ 0 , Λ; D c q vs. }Λ} 0 ). The optimal range of β can then be determined. Fig. 17 summarizes the analysis results.

Other limitations
We also observe some other limitations of our proposed method in its current version. For example, we failed to discover the Gray-Scott reaction-diffusion equation [23] (with several falsepositives) whose PDE coefficients have orders of magnitude difference (e.g., diffusion coefficient 10´5 vs. reaction term coefficient 1). While such a magnitude difference can be solved by applying activation functions to equation coefficients (e.g. the FN case and the cell case), such a treatment requires extra prior knowledge. In addition, there are other factors that could bring challenges to our current framework, such as extremely low-quality measurements and pathology in PINN's gradient [21], which will be investigated in our future studies.  Note: a. The values of rσ were estimated based on the measurement data (e.g., with 10% noise) and serve as a reference for determining α in ADO/post-training. In pre-training, we generally reduce the value of α to relax the physics constraint. A special case is the λ-ω equations, where u and v in the spiral dataset bear high resemblance, which require a larger α value. b. To balance the physics residue and the equation sparsity, the initial value for β can be set as the pre-trained physics loss Lp`θ 0 ,Λ 0 ; D va c˘.
However, a Pareto front analysis is typically required for determining this critical hyper-parameter. c. Sparsity threshold increment ∆δ is usually set to 1, since STRidge can adaptively determine a normalized and optimal threshold except the Schrödinger cases. The Schrödinger equation is a complex-value system, therefore a larger initial ∆δ is required to compare with the modulus of complex values.