Investigating molecular transport in the human brain from MRI with physics-informed neural networks

In recent years, a plethora of methods combining neural networks and partial differential equations have been developed. A widely known example is physics-informed neural networks (PINNs), which solve problems involving partial differential equations by training a neural network. We apply physics-informed neural networks and the finite element method to estimate the diffusion coefficient governing the long-term spread of molecules in the human brain from magnetic resonance images. Synthetic test cases are created to demonstrate that the standard formulation of the physics-informed neural network faces challenges with noisy measurements in our application. Our numerical results demonstrate that the residual of the partial differential equation after training needs to be small for accurate parameter recovery. To achieve this, we tune the weights and the norms used in the loss function and use residual-based adaptive refinement of training points. We find that the diffusion coefficient estimated from magnetic resonance images with physics-informed neural networks becomes consistent with results from a finite element based approach when the residual after training becomes small. The observations presented here are an important first step towards solving inverse problems on cohorts of patients in a semi-automated fashion with physics-informed neural networks.


S1.1 MRI Data
The data under consideration in this study is based on MRI scans of a patient imaged at Oslo University Hospital in Oslo, Norway. The patient was diagnosed with normal pressure hydrocephalus and is referred to as "NPH1" in 1.
The imaging protocol starts with the acquisition of a baseline MRI before 0.5 mL of a contrast agent (1 mmol/mL gadobutrol) is injected into the CSF at the spinal canal (intrathecal injection). The pulsating movement of CSF transports the tracer towards the head, where it enters the brain. The contrast agent alters the magnetic properties of tissue and fluid, and in subsequently acquired MRIs, enriched brain regions display changes in MR signal relative to the baseline MRI. From the change in signal we estimate the concentration of tracer per voxel at time points 0, 7, 24 and 46 hours after injection. Further details on the MRI acquisition and tracer concentration estimation can be found in 1.
We next use FreeSurfer 2 to segment the baseline image into anatomical regions and obtain binary masks for white and gray matter. The human brain is heavily folded and represents a highly complex geometry. To limit the intrinsically high computational requirements of inverse problems, we focus on a subregion of the white matter shown in main Fig. 2a.
In the following, we describe how this data is processed further to obtain patient-specific finite element meshes and to generate synthetic test data by simulation.

S1.2 Synthetic data
We use the surface meshes created during the brain segmentation with FreeSurfer 2 to create finite element meshes of the full brain. In detail, the first step in the mesh generation pipeline is to load the FreeSurfer brain surfaces into SVMTK 3, a Python library based on CGAL 4, for semi-automated removal of defects and creation of high-quality finite element meshes. Details on SVMTK and the mesh generation procedure can be found in 3.
Using FEM we then solve the PDE (1) with boundary and initial conditions (8), (9) with a diffusion coefficient D₀ = 0.36 mm² h⁻¹ for the domain Ω being the whole brain. This value for the diffusion coefficient of gadobutrol was estimated in 1 from diffusion tensor imaging (DTI). In detail, we discretize (1) in time using the Crank-Nicolson scheme and use integration by parts to transform (1) into a variational problem that is solved in FEniCS 5 with continuous linear Lagrange elements. We use a high-resolution mesh with 3 × 10⁵ vertices (10⁶ cells) and a small time step of 16 min. In combination with the Crank-Nicolson scheme, this minimizes effects of numerical diffusion. For the initial condition (9) we assume no tracer inside the brain at t = 0, i.e. c₀ = 0. The boundary condition (8) is assumed to be spatially homogeneous while we let it vary in time according to (S1). This choice leads to enrichment of tissue similar to what is observed experimentally over the timespan of T = 46 hours. We then interpolate the finite element solution c(x, t) between mesh vertices, evaluate it at the center coordinates x_ijk of the voxels ijk inside the region of interest Ω, and store the resulting concentration arrays c_ijk at 0, 7, 24 and 46 hours. With this downsampling procedure, we can test the methods at the same temporal and spatial resolution as available from MRI. Finally, we remove all voxels that are not within our white matter subdomain of interest Ω.
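To illustrate the time discretization used in this forward solve, the sketch below implements a minimal 1-D analogue (diffusion equation with a time-dependent Dirichlet boundary value, Crank-Nicolson in time) in plain Python. The grid, time step and boundary function are illustrative stand-ins, not the values of the 3-D FEniCS simulation.

```python
def thomas_solve(sub, diag, sup, rhs):
    """Solve a tridiagonal linear system with the Thomas algorithm."""
    n = len(rhs)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = sup[0] / diag[0]
    dp[0] = rhs[0] / diag[0]
    for i in range(1, n):
        m = diag[i] - sub[i] * cp[i - 1]
        cp[i] = sup[i] / m if i < n - 1 else 0.0
        dp[i] = (rhs[i] - sub[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def crank_nicolson_diffusion(c0, D, dx, dt, n_steps, g):
    """Advance c_t = D c_xx with Dirichlet value g(t) at both ends,
    using the Crank-Nicolson scheme (averaging old/new Laplacians)."""
    n = len(c0)
    r = D * dt / (2.0 * dx * dx)  # factor 2 from the Crank-Nicolson average
    c = list(c0)
    for step in range(n_steps):
        gb = g((step + 1) * dt)  # boundary value at the new time level
        # Tridiagonal system for the n - 2 interior unknowns.
        sub = [-r] * (n - 2)
        diag = [1.0 + 2.0 * r] * (n - 2)
        sup = [-r] * (n - 2)
        rhs = [c[i] + r * (c[i - 1] - 2.0 * c[i] + c[i + 1])
               for i in range(1, n - 1)]
        rhs[0] += r * gb   # known new-time boundary contributions
        rhs[-1] += r * gb
        c = [gb] + thomas_solve(sub, diag, sup, rhs) + [gb]
    return c
```

With a constant boundary value g ≡ 1 and c₀ ≡ 0 the solution relaxes towards the steady state c ≡ 1, mimicking the gradual tissue enrichment driven by the boundary condition (S1).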

S1.3 Synthetic data with artificial noise
We test the susceptibility of the methods with respect to noise by adding to the data perturbations drawn randomly from the normal distribution N(0, σ²). We refer to the standard deviation σ as the noise level hereafter. Since negative values for the concentration c are nonphysical, we threshold negative values to 0, i.e. the noise-corrupted voxel values are computed as c^d_ijk = max(0, c_ijk + ε_ijk) with ε_ijk ~ N(0, σ²). In all the results presented in this work, we choose σ = 0.05. This corresponds to 5 % of the maximum value of c = 1 in the simulated measurements c_ijk and allows us to reproduce some of the characteristic difficulties occurring when applying the PINN to the clinical data considered here.
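The perturbation-and-thresholding step can be sketched in a few lines; the voxel values below are synthetic placeholders.

```python
import random

def add_thresholded_noise(values, sigma=0.05, seed=0):
    """Add N(0, sigma^2) perturbations to voxel concentrations and clip
    negative (nonphysical) values to zero: c_noisy = max(0, c + eps)."""
    rng = random.Random(seed)
    return [max(0.0, v + rng.gauss(0.0, sigma)) for v in values]
```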

S2 Assessing the validity of assuming a constant diffusion coefficient
We here provide arguments to motivate our modeling assumption of a scalar diffusion coefficient that is spatially constant in the region of interest Ω.

S2.1 Using a tensor-valued, spatially varying diffusivity has little influence on forward simulations of tracer
In this section we assess the sensitivity of the simulated tracer concentration with respect to two different modeling choices for the diffusivity D.
To this end, we use the diffusion tensor image (DTI) available for the patient under consideration in this work and use it to estimate the spatially varying diffusion tensor D(x) ∈ ℝ^{3×3} for the gadobutrol tracer as described in 1. An illustration of the mean apparent diffusion coefficient (1/3) tr D(x) can be found in Fig. S1d. We now use this diffusion tensor to perform forward simulations of tracer spreading into the patient's brain using FEM as described in Section S1.2. In detail, instead of solving the scalar diffusion equation (1), we solve for the two distinct choices of D(x) stated below. Here, the domain Ω is set to be the whole brain. The initial condition is zero as in Section S1.3, but here we use a linear interpolation of the data as boundary condition g. In detail, we define the boundary condition as g(t) = c^d_i + (t − t_i)/(t_{i+1} − t_i) (c^d_{i+1} − c^d_i) for t_i ≤ t ≤ t_{i+1}, where t_i ∈ {0, 7, 24, 46} h are the observation times and c^d_i is the tracer estimate from MRI at time t_i.
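The piecewise-linear interpolation of the observed tracer values can be sketched as follows; the observation values are made-up placeholders.

```python
def interpolate_observations(times, values, t):
    """Piecewise-linear interpolation of tracer observations c^d_i taken
    at times t_i, used as boundary condition g(t). `times` must be
    sorted in increasing order."""
    if t <= times[0]:
        return values[0]
    if t >= times[-1]:
        return values[-1]
    for i in range(len(times) - 1):
        if times[i] <= t <= times[i + 1]:
            # Linear blend between the two surrounding observations.
            w = (t - times[i]) / (times[i + 1] - times[i])
            return (1.0 - w) * values[i] + w * values[i + 1]
```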

Model 1: Tensor valued diffusivity.
Here, we directly use the spatially varying diffusion tensor for gadobutrol as obtained from DTI to solve PDE (1).

Model 2: Constant scalar diffusivity in white and gray matter.
For this model, we compute regional mean apparent diffusion coefficients, i.e. the average of (1/3) tr D(x) over a region. We consider two regions, white and gray matter, and obtain the regional means D̄_white and D̄_gray. The PDE (S5) is then solved with D(x) = D̄_white 1 in white matter and D(x) = D̄_gray 1 in gray matter, where 1 denotes the identity matrix. In Fig. S1a and S1b we display visualizations of the simulated tracer after 24 h (when the amount of tracer reaches its maximum in clinical observations) for both models. The difference is barely visible to the eye, and the relative ℓ¹-error between the tracer at 24 h simulated with the two models is only 8.8 %. Hence, it cannot be expected that in an inverse setting with three snapshots of real data one can reliably differentiate by which of these two diffusion models the tracer distribution observed with MRI is governed.
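Computing a regional mean apparent diffusion coefficient from per-voxel tensors and turning it into the isotropic tensor used in Model 2 can be sketched as below; the tensors in the test are made-up examples, not patient data.

```python
def mean_adc(tensors):
    """Regional mean apparent diffusion coefficient: the average of
    (1/3) tr D(x) over all voxels in a region (tensors as 3x3 lists)."""
    return sum((d[0][0] + d[1][1] + d[2][2]) / 3.0 for d in tensors) / len(tensors)

def isotropic_tensor(d_bar):
    """Model 2 diffusivity within a region: D(x) = d_bar * identity."""
    return [[d_bar if i == j else 0.0 for j in range(3)] for i in range(3)]
```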

S2.2 DTI characteristics suggest nearly isotropic diffusion
We next present some details on the diffusion tensor image available for the patient under consideration in this work. Figure S2a illustrates the fractional anisotropy FA = √(3/2) · √((λ₁ − MD)² + (λ₂ − MD)² + (λ₃ − MD)²) / √(λ₁² + λ₂² + λ₃²), where MD = (λ₁ + λ₂ + λ₃)/3 is the mean diffusivity and λᵢ, i = 1, 2, 3, are the eigenvalues of the diffusion tensor. It can be seen that the FA is rather small in most of the brain; in fact, the mean FA ≈ 0.25. The minimum value FA = 0 corresponds to isotropic diffusion, while FA = 1 corresponds to diffusion restricted to only one direction. Hence the value FA ≈ 0.25 is in line with our assumption of a scalar, constant diffusion coefficient. Figure S2b further illustrates the distribution of eigenvalues in our region of interest. It can be seen that the ratio between the largest and smallest eigenvalue is roughly 1.8, further justifying our modeling assumption of isotropic diffusion.
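The FA values quoted above follow from the standard DTI formula; a direct computation from three eigenvalues can be sketched as:

```python
import math

def fractional_anisotropy(l1, l2, l3):
    """Fractional anisotropy of a diffusion tensor with eigenvalues
    l1, l2, l3 (standard DTI definition)."""
    md = (l1 + l2 + l3) / 3.0  # mean diffusivity
    num = (l1 - md) ** 2 + (l2 - md) ** 2 + (l3 - md) ** 2
    den = l1 ** 2 + l2 ** 2 + l3 ** 2
    return math.sqrt(1.5 * num / den)
```

For instance, an eigenvalue ratio of 1.8 (eigenvalues 1.8, 1.0, 1.0 in arbitrary units, an illustrative choice) yields FA ≈ 0.35, still far from the fully anisotropic limit FA = 1.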

S3 Hyperparameter settings
In our PINN approach, we model c : (x, t) → ℝ, x ∈ ℝ³, t ∈ [0, T], by a feedforward neural network with 9 hidden layers of 64 neurons each, hyperbolic tangent activation functions, and Glorot initialization 6 for all results presented here. We have also experimented with larger networks as well as different and adaptive activation functions, but have not observed significant differences in terms of convergence rates or accuracy for a range of choices. We note that since the raw data is already a 3-D array representing grid-structured data, using (physics-informed) convolutional neural networks instead of fully connected networks might yield benefits such as a training speed-up. In this work, however, we decided to focus on tuning the loss function formulation and find that a fully connected network in combination with a properly tuned loss function yields results that are consistent with the FEM approach.
The network furthermore has an input normalization layer with fixed parameters to normalize the inputs to the range [−1, 1]. To set these weights, we first compute the smallest bounding box containing all points x = (x₁, x₂, x₃) ∈ Ω to obtain lower and upper bounds lᵢ, uᵢ, i = 1, 2, 3, such that lᵢ ≤ xᵢ ≤ uᵢ for all x ∈ Ω. The first layer then normalizes the inputs as x̂ᵢ = 2 (xᵢ − lᵢ)/(uᵢ − lᵢ) − 1 for i = 1, 2, 3 and t̂ = 2 t/T − 1 with T = 46 h (last MRI acquisition time point).
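The fixed normalization layer can be written as a plain function; the bounding box in the usage below is a made-up example, and the affine scaling of t is our reading of the normalization, not necessarily the exact weights of the original layer.

```python
def make_input_normalizer(lower, upper, T):
    """Return a map sending each x_i in [l_i, u_i] and t in [0, T]
    affinely to [-1, 1], as in the fixed first network layer."""
    def normalize(x, t):
        xn = [2.0 * (xi - li) / (ui - li) - 1.0
              for xi, li, ui in zip(x, lower, upper)]
        return xn + [2.0 * t / T - 1.0]
    return normalize
```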
If not stated otherwise, we use N_p = 10⁶ space-time points (x₁, x₂, x₃, t) for the evaluation of the PDE loss (4). We found that this number was either sufficiently high to reach accurate recovery of the diffusion coefficient, or more sophisticated refinement techniques like residual-based adaptive refinement (RAR) 7 were needed instead of simply using more PDE points. The samples for the spatial coordinates x₁, x₂, x₃ are generated by first drawing a random voxel i inside Ω. The voxel center is then perturbed to obtain x₁ = x^i₁ + dx, where x^i₁ is the x₁-coordinate of the center of the randomly drawn voxel i, and similarly for x₂ and x₃. The perturbation dx is drawn from the uniform distribution U([−0.5 mm, 0.5 mm]) and ensures that (x₁, x₂, x₃) lies within the voxel i (each voxel corresponds to a volume of 1 mm³). The values for t are chosen from a latin hypercube sampling strategy over the interval [0, T]. We furthermore normalize the input data c^d by its maximum value such that 0 ≤ c^d ≤ 1. In both the simulation dataset and the MRI data considered here, we use four images and the same domain Ω. The binary masks describing Ω consist of roughly 0.75 × 10⁵ voxels, i.e., the four images (at 0, 7, 24 and 46 hours) yield a total of N_d = 3 × 10⁵ data points. Due to the large number of data and PDE points, we use minibatch sampling of the PINN loss function (4) and minimize it using the ADAM optimizer 8. The learning rate η as well as potential learning rate decay schemes are important hyperparameters, and we specify the values used in each section. The training set is divided into 20 batches, corresponding to 1.5 × 10⁴ and 5 × 10⁴ samples per minibatch in the data and PDE loss terms, respectively.
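The sampling of PDE collocation points can be sketched as below. For the time coordinate we use a simple 1-D latin hypercube (one sample per equal-width stratum of [0, T]); the voxel centers in the test are placeholders.

```python
import random

def sample_pde_points(voxel_centers, T, n_points, seed=0):
    """Draw space-time collocation points: pick a random voxel center in
    the region of interest, jitter each coordinate by U(-0.5, 0.5) mm so
    the point stays inside the 1 mm^3 voxel, and latin-hypercube sample
    t over [0, T]."""
    rng = random.Random(seed)
    points = []
    for k in range(n_points):
        cx, cy, cz = rng.choice(voxel_centers)
        x = cx + rng.uniform(-0.5, 0.5)
        y = cy + rng.uniform(-0.5, 0.5)
        z = cz + rng.uniform(-0.5, 0.5)
        t = (k + rng.random()) / n_points * T  # one t per time stratum
        points.append((x, y, z, t))
    rng.shuffle(points)  # decouple stratum order from sampling order
    return points
```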
Details on the implementation of the minibatch sampling strategy are presented in Section S5. It is worth noting here that our main reason to use minibatch sampling is not memory limitations. The graphics processing units (NVIDIA A100-SXM4) that we use to train the PINN have 80 GB of memory, which is enough to minimize the PINN loss function (4) with N_d = 3 × 10⁵ data points and N_p = 10⁶ PDE points in a single batch. Rather, we use minibatch sampling because the stochasticity of minibatch gradient descent helps to avoid local minima, see, e.g., Chapter 8 in 9. In Section S4 below we perform a systematic study using different minibatch sizes and find that smaller batch sizes are preferable in our setting, since they yield more accurate recovery of the diffusion coefficient (for a fixed number of epochs).
As for the finite element approach, we discretize (1) in time using the Crank-Nicolson scheme with 48 time steps. We then formulate the PDE problem as a variational problem and solve it in FEniCS 5 using the finite element method. To limit the required compute times, we use a time step size of 1 h and continuous linear Lagrange elements. We further use a linear interpolation of the data as starting guess for the boundary control g, i.e. g(t) = c^d_i + (t − t_i)/(t_{i+1} − t_i) (c^d_{i+1} − c^d_i) for t_i ∈ T = {0, 7, 24, 46} h and t_i ≤ t ≤ t_{i+1}. We then use dolfin-adjoint 10 to compute gradients of the functional (7) with respect to D and g and optimize using the L-BFGS method.
In terms of degrees of freedom (optimization parameters), these settings result in 33665 weights in the neural network. For the finite element approach, the number of degrees of freedom depends on the number of cells on the boundary of the mesh, since the control is the boundary condition g. Our mesh for Ω has 33398 cells on the boundary. For 48 time steps, this yields 48 × 33398 ≈ 1.6 × 10⁶ degrees of freedom.
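The count of 33665 network weights can be checked directly from the architecture (4 inputs (x₁, x₂, x₃, t), 9 hidden layers of width 64, scalar output), counting weights and biases per fully connected layer:

```python
def mlp_parameter_count(n_in, width, n_hidden_layers, n_out):
    """Total number of weights and biases of a fully connected
    feedforward network with constant hidden width."""
    count = n_in * width + width                              # input -> first hidden
    count += (n_hidden_layers - 1) * (width * width + width)  # hidden -> hidden
    count += width * n_out + n_out                            # last hidden -> output
    return count
```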

S4 Validation on synthetic data
We verify the implementations of the two approaches by considering synthetic data without noise, cf. Fig. 2b, in the white matter subregion Ω depicted in Fig. 2a. For the PINN approach we test different minibatch sizes for three different optimization schemes using the ADAM optimizer: (i) fixed learning rate 10⁻³ and p = 2, (ii) fixed learning rate 10⁻³ while switching from p = 2 to p = 1 after half the epochs, and (iii) an initial learning rate of 10⁻³ that decays exponentially during training to 10⁻⁴, with p = 2. Table S2 tabulates the relative error between the learned diffusion coefficient and the ground truth D₀ for a wide range of parameters. We find that (a) in general, smaller batch sizes result in more accurate results and (b) the results are both most stable and most accurate when using an exponentially decaying learning rate. Notably, the PINN recovers the ground truth diffusion coefficient D₀ to within 1 % accuracy when using the learning rate decay scheme. These results are in line with 11, where increased accuracy in parameter recovery was observed for smaller batch sizes. However, there are also settings where full-batch optimization with L-BFGS improves PINN performance in parameter identification problems 12.
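Scheme (iii) can be implemented as a geometric interpolation of the learning rate over the epoch budget; the budget below is illustrative.

```python
def exponential_decay(eta0, eta_final, epoch, n_epochs):
    """Learning rate interpolating exponentially (geometrically) between
    eta0 at epoch 0 and eta_final at epoch n_epochs."""
    return eta0 * (eta_final / eta0) ** (epoch / n_epochs)
```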

For the finite element approach, Table S3 presents the accuracy of the recovered diffusion coefficient. According to the theoretical results, decreasing the regularization parameters leads to higher accuracy but less well-conditioned optimization problems. This is in line with the results presented in Table S3. The finite element approach with appropriate regularization parameters and the PINN approach yield comparably accurate results.

S4.1 Computational effort
We here list the computing times to estimate the diffusion coefficient with our implementation of the FEM and PINN approaches presented in the main text. It should be noted that neither of these implementations have been optimized to reduce the compute times, and that the times required for convergence vary between datasets.
Our implementation of the FEM approach using dolfin-adjoint 10 with a mesh containing 91849 cells and 23307 vertices (of which 33398 cells and 16693 vertices, respectively, lie on the boundary) requires around 45-48 hours of computing time on noisy synthetic data for the 1,000 iterations until convergence shown in Supplementary Fig. S3a, using a single Intel Xeon Gold CPU. As for the PINN approach, on noisy synthetic data, convergence was achieved when terminating the optimization after 20,000 epochs of ADAM with data and PDE batch sizes of 1.5 × 10⁴ and 5 × 10⁴, respectively. With our implementation in PyTorch 13 this takes about 3 hours on an NVIDIA A100-SXM4.
As for training on real MRI data, the FEM approach usually converged after 1,000 iterations as well, while the PINN approach required a lower learning rate and hence more epochs. The PINN results presented in the text were obtained after 100,000 epochs, which was usually sufficient for convergence. The compute time varies between sampling strategies (for PINNs) and with the workload on the cluster node for the particular simulation (for both approaches). For PINNs, the compute time was around 20 h for RAR and 15 h for RAE. The results are summarized in Table S1.
We have also performed additional numerical experiments where we use residual-based adaptive exchange (RAE) and train with both the ADAM and L-BFGS optimizers after every refinement. This strategy is similar to 14. It emerges that a combination of low learning rates for ADAM, adaptive residual-based refinement and the ℓ¹ norm for the PDE loss allows us to successfully train with a combination of ADAM + L-BFGS. The results after training with ADAM + L-BFGS are consistent with the ADAM results reported in the manuscript, while roughly half the optimization time is required.

Table S2. Average rel. error |D_pinn − D₀|/D₀ in % after 2 × 10⁴ epochs of training on synthetic data without noise, with Algorithm 1. We average over 5 runs; numbers in brackets are standard deviations.

Table S3. Rel. error |D − D₀|/D₀ for the FEM approach (7), different regularization parameters, 3 measurement points, clean data, 1,000 iterations. Convergence of the optimization is demonstrated in Fig. S3a.

Table S4. Rel. error |D − D₀|/D₀ in % for the finite element approach (7), different regularization parameters and parameterizations D(δ). Failure of the algorithm is indicated by the symbol "x". Convergence plots for the optimization are given in Fig. S3b.

S4.3 Finite element solution of the synthetic test cases

S5.2 Residual based refinement
Algorithm 2 Refinement step with the RAR procedure as in "Procedure 2.2" in 7 adapted to the nomenclature in our work.
Input: the set P of N_r PDE points, PDE residual r(x, t), number m of points to add per refinement step, number n of points at which to test the residual
Output: refined set of PDE points P
1: Compute the absolute value of the PDE residual |r(x, t)| at n random samples S = {(x₁, t₁), ..., (x_n, t_n)} from Ω_r × τ
2: Sort S by decreasing residual |r(x, t)| and keep only the first m points S_m
3: return the refined set of N_r + m points P ∪ S_m

Figure S4. (Left) Convergence plots for the PINN losses trained with RAR on MRI data. (Right) Diffusion coefficient during PINN training. Exponential learning rate decay from 10⁻⁴ to 10⁻⁵ with RAR and p = 1.
Algorithm 3 Refinement step with the RAE procedure, a modification of the RAR procedure as described under "Procedure 2.2" in 7 .
Input: the set P of N_r PDE points, PDE residual r(x, t), number m of points to add per refinement step, number n of points at which to test the residual
Output: refined set of PDE points P
1: Compute the absolute value of the PDE residual |r(x, t)| at n random samples S = {(x₁, t₁), ..., (x_n, t_n)} from Ω_P × τ
2: Sort S by decreasing residual |r(x, t)| and keep only the first m points S_m
3: Compute the PDE residual |r(x, t)| at the points in P
4: Sort P by increasing residual |r(x, t)| and keep only the first N_r − m points P_{N_r − m}
5: return the set of N_r refined points P_{N_r − m} ∪ S_m
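Both refinement steps can be sketched as pure functions over point sets; the residual function in the test below is a toy stand-in for the trained network's PDE residual, and the points are scalars rather than space-time tuples.

```python
def rar_refine(points, residual, candidates, m):
    """One RAR step (Algorithm 2): add the m candidate points with the
    largest absolute PDE residual to the training set."""
    ranked = sorted(candidates, key=lambda p: abs(residual(p)), reverse=True)
    return points + ranked[:m]

def rae_refine(points, residual, candidates, m):
    """One RAE step (Algorithm 3, as stated): add the m largest-residual
    candidates and keep only the len(points) - m existing points with the
    smallest residual, so the total number of PDE points stays constant."""
    new = sorted(candidates, key=lambda p: abs(residual(p)), reverse=True)[:m]
    kept = sorted(points, key=lambda p: abs(residual(p)))[:len(points) - m]
    return kept + new
```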