Minimal Linear Networks for Magnetic Resonance Image Reconstruction

Modern sequences for Magnetic Resonance Imaging (MRI) trade off scan time with computational challenges, resulting in ill-posed inverse problems and the requirement to account for more elaborated signal models. Various deep learning techniques have shown potential for image reconstruction from reduced data, outperforming compressed sensing, dictionary learning and other advanced techniques based on regularization, by characterization of the image manifold. In this work we suggest a framework for reducing a “neural” network to the bare minimum required by the MR physics, reducing the network depth and removing all non-linearities. The networks performed well both on benchmark simulated data and on arterial spin labeling perfusion imaging, showing clear images while preserving sensitivity to the minute signal changes. The results indicate that the deep learning framework plays a major role in MR image reconstruction, and suggest a concrete approach for probing into the contribution of additional elements.

instances of that connection (i.e. per neuron in the receiving layer). Thus, FC layers and convolution layers differ in two aspects: connection topology, and weight sharing. Table 1 is organized according to these two aspects. Locally connected layers have the topology of convolutional layers, but not the weight sharing. An FC layer is a matrix multiplication, where each row corresponds to the weights connecting all neurons of one layer to a specific neuron in the next layer. For illustration, consider a layer with MxNxF 1 neurons (e.g. describing an MxN image with F 1 features) connected to the next layer, of size MxNxF 2 . An FC layer will be composed of M 2 ⋅N 2 ⋅F 1 ⋅F 2 weight variables (+M⋅N⋅F 2 additive bias variables). A convolutional layer, with K 1 × K 2 kernel, will have K 1 ⋅K 2 ⋅F 1 ⋅F 2 weight variables (+F 2 bias variables as they are fully connected in the feature dimension). The neighborhood topology helps by limiting the number of variables to optimize, while the shared weights serve to reduce the number of variables to optimize, and to strengthen each optimization step by combining the information from all the data patches.
Axis dependent topology and weight sharing. A 1D FT of a tensor, for example, may be described as a layer with specific topology such that it is fully-connected on one axis, while the weights are shared along the other axes. Thus, we suggest a subspace-connection layer, where some axes are fully connected, the weights shared along some of the other axes and independent along the remaining axes. Next, we demonstrate the applicability of this layer to common MR reconstruction schemes. The proposed layers are depicted in Fig. 1. notation. In the remainder of this paper we will use the following notation to denote use of the proposed connection layers: [Fully connected: Source-> New | Shared | Independent]. In the fully connected case, location information is lost, and sometimes a domain change (e.g. spectral-> spatial: k-space to image space) is indicated. The names of the source and new axis (e.g. k RO -> RO) is noted for clarity.

Results
2D-FT and 2 × 1D-FT. As a first step, we compared two networks performing 2D FT, one using a 2D-FC single layer and one composed of two FC-1D layers with weights sharing in the other dimension (see Supplementary  Fig. S1). For an NxN image, the number of weights is the two networks is thus N 4 and 2N 2 , respectively. The 2 × 1D network showed more rapid convergence (shown in Supplementary Fig. S1C). For the N = 128 case, the 2D-FC network did not fit into the device memory. Notably, for the case of N = 64 and a fast learning rate of 0.002, training the 2D-FC topology did not converge, while the 2 × 1D network was robust to this rate.

Coil compression.
Multi-channel data that are undersampled along the phase-encoding direction can be reconstructed by applying a convolution kernel shared over k-space and FC over the channel dimension to fill-in the missing lines, followed by a 2D (or 2 × 1D) FT. The kernel weights are usually calculated to fit the auto-calibration lines, or by the coil sensitivity maps from a separate acquisition. A network mimicking this reconstruction pipeline is depicted in Fig. 2A. (This toy example borrows elements from GRAPPA 7 as the k x -k y block, and elements from SMASH 22 as the single combined output channel). Training this network resulted in extraction of a kernel which enabled successfully reconstructing images from undersampled data.

Software Coil Compression(SCC) and Geometrical Coil Compression (GCC). Coil compression is
commonly applied as a first step in parallel imaging reconstruction to decrease computational demands 15,[23][24][25] and reduce noise. A network element capable of performing software coil compression is depicted in Fig. 2B. Compressing the example data set from eight to two channels resulted in compressed channels with expected phase variation along the PE dimension (Fig. 2F), allowing for parallel imaging reconstruction.
A network mimicking Geometrical coil compression 25 is depicted in Fig. 2C. GCC achieves more efficient channel reduction by using aligned spatial-location-specific coil compression, and is thus built up using 3 SFCg layers: (i) performing FT: shared weights along PE, channels, matrix multiplication along kRO-> RO axis, before FT back to k-space. In the network implementation, (ii) location-specific coil compression: independent along RO, FC along channels-> compressed channels; shared along kPE; (iii) performing iFT, as (i).
Step (ii), the core of GCC, thus exemplifies a common use of the full versatility of subspace FC operation. As illustrated in Fig. 2C, www.nature.com/scientificreports www.nature.com/scientificreports/ weights (different line widths in the figure) are different along one axis, and shared along the other axis (i.e. the black and blue). Matrix multiplication operation is run on the feature (channels-> compressed channels) axes.
Unlike the analytical implementation of GCC, the network needed no explicit constraint to be added for coil compression matrices to align and produce smooth compressed channels. The defined goal of reconstructing a correct image is sufficient to constrain the network to produce the desired behavior, i.e. smoothness of the resulting compressed channels. Figure 2E shows resulting image reconstructions, showing the network was able to learn more effective compression (than SCC) due to the added GCC flexibility. Figure 2G shows the resulting compressed coils, demonstrating the smoothness achieved without explicit constraints.
Reconstruction from spiral trajectory acquisition. Spiral trajectories enable high-efficiency coverage of k-space. However the spiral readout results in data points not lying on the Cartesian grid, requiring alternative approaches such as location specific regridding kernels 26 or NUFT [27][28][29] for image reconstruction. Here we explore the use of MLN for the reconstruction of such trajectory under field inhomogeneity common in 7 T systems. Figure 3A shows a standard time-segmented 28 image-to-signal operator for iterative reconstruction from non-Cartesian sampling under field inhomogeneity. Two topologies of Minimal Linear Networks were designed: one tightly mimicking the transposed time-segmented NUFT pipeline, and another slightly more relaxed version, composed of a k-domain regridding layer into several "time segments" (TS), a fixed 2D FT and an image-domain side combining the segments into the final image. We will refer to the latter topology as k + I MLN.

MLN for spiral reconstruction.
In the first topology, N Traj ⋅N segments + N 2 ⋅N Neighbors ⋅N Channels + N 2 ⋅N Segments weights are trained by the network, equal to the number of coefficients calculated by a time-segmented NUFT operator, where N segments is the number of time-segments.
The k + I topology, depicted in Fig. 3B, has N 2 ⋅N PE ⋅N Neighbors ⋅N Channelsj ⋅N segments + N 2 ⋅N Segments weights trained by the network. Table 2 compares the resulting number of tunable weights in a specific configuration. In contradiction with linear networks, nonlinear networks are universal approximators 30,31 , and the expressiveness of the network grows exponentially with depth 32,33 . Thus we keep the comparison to comparable topologies. While the number of tunable parameters in MLN is still high compressed to e.g. convolutional networks, the linearity of MLN means the system cannot do anything but projection of the input into a vector space, and only vector spaces can be learned.
The recently suggested dAUTOMAP 34 achieved results comparable to AUTOMAP while much reducing the number of parameters by decomposition, not dissimilar to the one approach taken in this work. However, it does not fit non-Cartesian trajectories. www.nature.com/scientificreports www.nature.com/scientificreports/ Both topologies were also ran with a single segment, mimicking NUFT without B 0 inhomogeneity correction. Note that with N Segments = 1, both topologies became nearly identical. Figure 4B shows the reconstruction with the first topology and N Segments = 7, and with the second topology and N Segments = 1,7,15 segments. The number of segments is in the order of the number of time-segments needed for sufficiently accurate description of the phase evolution due to field inhomogeneity. As expected, the single-segment (non-B 0 corrected) reconstruction showed distortion and blurring in regions of strong inhomogeneity (marked by red arrow). The time-segmented reconstruction with shared-weights resulted in an image with some residual blurring. Comparing the reconstructions with 7 and 15 independent-weights segments did not indicate further gains when using > 7 segments and the variant with 7 independent weights (i.e. ~2.1 ms difference between segments, see Methods) was hence chosen for further exploration.
The k + I MLN was also trained using a database of simulated multi-ellipse (modified Shepp-Logan) phantoms. The result of then applying this to reconstruct the actual in vivo data is shown in Fig. 4, clearly illustrating that the phantom-trained network did not generalize to the real data, indicating that the real data (as MR brain images) do not lie within the vector-space (or, have a high cost in the linear scalar field) learned by the MLN using multi-ellipses. It did perform well in reconstructing similar multi-ellipse phantom data (not shown).
Comparison with reference reconstruction techniques. Figure 5 summarizes the comparison to established regularized CS-based reconstruction technique. Suggestions for direct regridding through interpolation kernels 27 include optimized Kaiser-Bessel, Gaussian, minimal least squares and other approaches. However, the state-of-the-art methods for non-Cartesian image reconstruction combine CS with kernel interpolation for the image-to-signal operations. We have tested BART 35 as reference reconstruction (relying on Kaiser-Bessel kernels, and Toeplitz embedding). SparseMRI 5 with optimized Keiser-Bessel showed comparable results (not shown).  Application to perfusion measurement with arterial spin labelling at 7T. Due to the linearity of the model, we expect it to generalize well. The temporal variations between the label and control images in an Arterial Spin Labelling (ASL) experiment are notoriously low, around 1% 36 . The results shown in Fig. 6 demonstrate the method's sensitivity to these minute changes and are in good agreement with conventionally obtained perfusion results using echo-planar ASL at 7 T 19 . Figure 6E shows the time course signal average over the activated area. Indeed, a 1% signal change was detected, indicating the full difference between labeled and unlabeled timepoints was preserved. While tSNR values are slightly lower using MLN than by the reference reconstruction, the maps are cleaner and fuller, without loss of contrast in areas due to artifacts and regularization.
Interpretability of MLN. The chosen network is composed of trainable weights before and after FT. Figure 7A shows the Frobenius norm of the signal (k) side weight matrix for each k-space location. As expected, the locations which lie on the sampled trajectory enjoy a small norm, while those between sampled locations demand higher weights, indicating stronger use of the different channels' and neighbors' data, and increased noise for these locations. In the image domain, the trained weights can be depicted as maps, and are shown in Fig. 7B. The trained maps contain features that are reminiscent of the field inhomogeneity (B 0 ) map, in similarity with time-segmentation maps, but also of compressed channel sensitivities.  www.nature.com/scientificreports www.nature.com/scientificreports/ Benchmark performance in Cartesian reconstruction. The k + I MLN topology was applied to image reconstruction from standard 2D Cartesian 8-channel example data that was undersampled post-hoc. The SSIM score 37 was used as a quantitative metric of image quality. Figure 8 shows the performance at different undersampling factors on a standard image not within the training database. In each case, the MLN reconstruction outperforms the reference method, notably producing images with acceptable quality even at factor > 12. Fig. S2 shows the performance on additional test-set images which do not belong to the training database, including complex-valued images acquired at our institution, with SSIM scores summarized in Table S1.
Similar trends are seen for newly acquired complex-valued 7 T MR images, publicly available magnitude MR images, simulated phantom images and even general non-MR images. The proposed MLN reconstructions outperformed the reference method in all cases, indicating that the MLN learned a vector subspace (linear scalar field) more general than the training samples brain MR images. This follows the relatively weak descriptive power of the architecture, which restricts in to linear relations, and enjoys generalizability and reduced risk of overfitting.
Comparison to neural network. Figure 9 Shows results in comparison to the CascadeNet on single-channel data. Notably, CascadeNet outperformed MLN, and MLN-Knee outperformed MLN-HCP. In this experiment, networks are challenged to extract information on the image manifold from the database, rather than utilize a coil's channels. As MLN was designed to have a low, linear descriptive power, its ability to reconstruct the data in such settings are expectedly low. The gap between the MLNs trained on Knee and HCP data reflects that those datasets lie in a different vector spaces. The results thus confirm the observation regarding the limited model used by MLN, which is aimed to mimic a linear GRAPPA operation using model-based training on arbitrary trajectories. noise considerations. Fig. S3 shows the noise calculated using the proposed MLN and reference method on simulated data, and Fig. S4 on real 7 T MRI in-vivo acquisitions. In both cases the MLN reconstruction suffered from increased noise compared to the reference method, however with smaller difference in the real data. As the trained MLN does not contain any element effectively applying denoising, such as conv-net, or the regularization used in the reference method, some noise amplification is expected. Several approaches to reduce the noise amplification include the introduction of nonlinear, effectively denoising layers; weights regularization; denoising the training images; and adding realistic noise to the simulated signals.

Discussion
Deep learning brings various benefits over conventional approaches to image reconstruction. Among these are: (a) efficient use of computational resources; (b) learning from large databases; (c) optimizing directly in the result domain; (d) exploiting redundancy for robustness and over-complete representations, and (e) learning nonlinear functions. However, indications for the construction of a successful network remain vague with commonly used general topologies. Evidence presented in this work indicates that even when removing the hallmark elements of neural networks, namely non-linear activation, deep topology and redundancy (overrepresentation), elements a-c alone already outperforms state-of-the-art techniques. Lacking the non-linear activation layers, the resemblance to "neurons" diminishes, and we hence use the term linear nets, or "minimal linear nets", MLN. A concrete criterion for minimality is the inability of any reduced model to fully describe the underlying physics.
We propose connection layers that fit MR and other axis-dependent domains and applied them to design minimal, problem-specific and physics-based networks for the various elements in a typical MR reconstruction pipeline. This enables analysis of each connection layer's role. The use of minimal architectures is analogous to the traditional concepts of model regularization for improved generalization.
Instead of learning features and patterns of the possible image space, the suggested MLN suffice in learning the signal-image relation in the restricted "possible image" vector space, thus decreasing the complexity learned and embedded into the system. As NN reconstruction trains a signal-to-image function by iteratively augmenting incorrect reconstruction of the training set, we explore the power of this element without adding powerful image-based pattern description mechanism.
In application to the real-world example of 7T spiral ASL perfusion measurements, we demonstrated the method's robustness under challenging imaging conditions, and applicability to timeseries imaging.
The robustness of MLN compared to conventional reconstruction approaches can be explained by adopting an interpretation by which the learned network is a robust inversion function that is trained directly as a left-inverse, and thus emphasizes correct rather than "good" reconstruction.
Treating the MR reconstructions with a problem-specific architecture resulted in orders-of-magnitude decrease (~500×) in the computational and memory demands in comparison with the general signal-to-image transform, and allowed reconstruction on a commercial desktop computer in cases where the computational demands of a general approach are too high.
This work focused on inverting fully known but underdetermined linear systems. Thus, and also for clarity of presentation, only pure linear networks were used. However, another domain of MR reconstruction challenges deals with problems where the forward image-to-signal function is not fully known: For instance, in cases where  www.nature.com/scientificreports www.nature.com/scientificreports/ motion or changes in the B 0 field affect the image. In this case, use of common neural network techniques as non-linear activation functions in combination with redundancy is expected to be beneficial; but even then, MR physics will allow us to constrict the expected artifact, for example, to the low-bandwidth (phase-encoding) axis. In such cases, we propose the use of the connection layers introduced here, in combination with problem-specific nonlinear layers, which we will explore in future work.
For real MRI experiments, the choice of a relatively difficult imaging scenario which demands estimation and accounting for the sensitivity maps and the B 0 inhomogeneity, required us to generate training data that is subject and slice specific, resulting in high computational, time-consuming demands. In order to achieve real-time or near real-time reconstruction, a network should be designed with robustness to changes in these properties. In such cases, the inverse, signal-to-image function is highly non-linear and cannot be described using linear models, which will hence require some degree of nonlinearity to be introduced to the system.
In conclusion, this work in the context of MR image reconstruction indicates a greater role for the learning from training sets and utilization of the slow-learning back-propagation optimization algorithm, than advanced description of the MR image manifold.

Methods
training dataset. For training, 10,000 magnitude-only images were randomly chosen from the Human Connectome Project 38 database. The images were randomly taken from axial, sagittal and coronal reslicing, and were resized into a 256 × 256 matrix. Data augmentation included random cropping to 128 × 128, flipping in both dimensions and 90° rotation. A random phase was added, generated using the sum of sinusoidal functions over a randomly selected quadratic plane. Spiral acquisitions at 7T. Experiments were performed on four healthy volunteers (24-39 years old, 1 female) after obtaining informed consent. The study was approved by the Ethics Review Committee for Psychology and Neuroscience (ERCPN #180_03_06_2017) at Maastricht University and all procedures followed the principles expressed in the Declaration of Helsinki. Data was acquired on a Magnetom 7 T whole-body MRI research scanner (70 mT/m amplitude, 200 T/m/s slew rate gradients; Siemens Healthineers, Erlangen, Germany), with a 32-channel receive head coil (Nova Medical Inc, Wilmington, MA).
At ultra-high field, B 0 field inhomogeneity poses significant challenges to fast echo-train imaging such as EPI or spiral trajectories. Moreover, the short T 2 * requires use of short TE in some applications which at the desired resolution, however, may be precluded by the echo-planar readout duration. ASL for perfusion imaging is one method that benefits greatly from shorter echo-time, making a spiral-out readout an attractive solution. An ASL sequence with spiral readout was developed in-house, with the FAIR 39 QUIPSS II 40 labeling scheme with tr-FOCI inversion pulse 41 to obtain quantitative maps of perfusion using a single subtraction approach 42,43 . All ASL measurements followed 19 , i.e. had 12 slices with no interslice gap, FOV 192 mm, echo time (TE)/TI1/TI2/TR1 = 3/700/1,800/2,500 ms, 67°−90° excitation flip angle (according to each subject's Specific Absorption Rate limits), 80 repetitions, total scantime 4 minutes. Dielectric pads were used to improve labelling efficiency 44 . Variable-density spiral trajectories were designed according to 45 , duration 12.5 ms, with effective undersampling factor of ~3.2.
Multi-echo GRE data (TEs 1.5, 3.18 ms) with matched geometry was acquired separately for B 0 field mapping and coil sensitivity estimation. www.nature.com/scientificreports www.nature.com/scientificreports/ Network topologies used for Spiral reconstruction. Two topologies were designed: (i) A minimal linear network (MLN) topology tightly mimicking the transposed time-segmented NUFT pipeline: the trajectory data is first expanded by the number of "segments", using kernels that are trajectory location specific, but shared along channels. These are then rearranged into a grid, with each grid location containing a concatenation of the data from neighboring k-space locations on the trajectory, and the different channels, along a new channel/ neighbor dimension. Those are collapsed into N segments value using learned location specific kernels (shared along segments). The data are then Fourier transformed (using a fixed, non-trained standard FT matrix), and the resulting images from the different segments are collapsed into the final reconstructed image.
(ii) relaxed variant, depicted in Fig. 3B: the data are first arranged into an oversampled k-space grid, collecting data from neighboring k-space locations and channels along a new dimension, which is reduced to length N Segments using k-domain specific kernels, followed by the rest of the network as before.
The networks, specific for subject and slice, were trained to reconstruct images from simulated signal. Signal was simulated from database images (HCP, with data augmentation as detailed above), with slice-specific sensitivity and B 0 maps, obtained in a separate acquisition on the subject, and with the nominal trajectory, according to the image-to-signal transform suggest in 28,46 . Benchmarking. The MLN was trained using the same training samples on 8-channel sensitivity maps Implementation. The networks were implemented in TensorFlow using a standard configuration (Adam optimizer with fixed learning rate (unless mentioned otherwise) of 0.002 and beta = 0.9, L 1 loss) on a desktop workstation with commercial GPU (NVIDIA GTX 1080 Ti with 11GB of memory). The networks were trained by applying the slice-specific B 0 field map and sensitivity maps.
Networks were implemented to perform complex-field operations. No activation layers were used. B 0 maps were calculated using the method of Cusack et al. 48 . Coil sensitivity maps were calculated with BART's ESPIRiT module 49 using default parameters, and software coil compression (SCC) of the experimental MRI data was done using SVD. For reference reconstruction, BART's pics module 35 with manually optimized parameters was expanded with implementation accounting for B 0 correction with time-segmentation 28 , using up to 15 time segments, equally spaced over the acquisition time. NUFT 50 was also calculated using gpuNUFFT 51 . The images were also reconstructed using either the sparseMRI package 5 , or L 1 -CG-ESPIRiT with L 1 -wavelet or split TV and L 1 -wavelet regularization 49 . Parameters were set manually to optimize image quality.
The ASL data was motion corrected using ANTs 52 and brain extracted using FSL BET 53 following which perfusion tSNR maps were computed.
Memory and performance considerations. The memory requirements for training are linear with the size of the two tensors calculated (the k-and I-side tensors). In these experiments, we also loaded the database to the GPU memory for increased performance. For the real-data spiral trajectories, training was run for 2 hours per instance, which achieved convergence. Inference was done by directly applying the calculated kernels.
NUFFT for MRI under field inhomogeneities. For the sake of completeness, we briefly summarize the approach developed in 27,28,46,50 for MR image-to-signal transform under field inhomogeneities and its use for image reconstruction, which was used for reference reconstruction as detailed above and laid the basis for the suggested networks' topologies.
Let I be an image, and F u be the non-uniform Fourier transform according to the given trajectory, acquiring N data points in acquisition duration T. Set L to be a (small) number of "time-segments", separated by duration Δ = T/L. The local phase evolution due to the field inhomogeneity at time-segment l, i.e. after lΔ acquisition time units, is = − Δ P r e ( ) l iB r l ( ) 0 , and the image affected by the phase evolution is =  D I P l l , where  marks element-wise Hadamard multiplication. Let G be an Nx(L + 1) matrix containing interpolation coefficients for each (discrete) timepoint (e.g. linear of Hanning interpolation, or interpolation matrix optimized for the specific case, such as via the min-max approach 28  A NUFFT operation can be effectively and accurately approximated by (sparse) matrix multiplication of the FFT of the image with a precalculated coefficients matrix C, which is non-zero for a row corresponding to spatial frequencies f x ,f y only in a window around those frequencies 50 .