Image reconstruction through a multimode fiber with a simple neural network architecture

Multimode fibers (MMFs) have the potential to carry complex images for endoscopy and related applications, but decoding the complex speckle patterns produced by mode-mixing and modal dispersion in MMFs is a serious open problem. Several groups have recently shown that convolutional neural networks (CNNs) can be trained to perform high-fidelity MMF image reconstruction. We find that a considerably simpler neural network architecture, the single hidden layer dense neural network, outperforms previously-used CNNs in terms of image reconstruction fidelity and training time. The trained networks can accurately reconstruct MMF images collected over a week after the acquisition of the training set, with the dense network continuing to outperform the CNN over the entire period.


Introduction
Optical fibers have proven to be extremely useful for endoscopy and related applications [1,2]. Present commercial methods for transmitting images through fibers are based on single-mode fiber bundles [3,4], consisting of thousands of fibers each transmitting a single pixel. It would be advantageous to instead transmit images in multimode fibers (MMFs), which are easy to fabricate and thinner than single-mode fiber bundles, and could potentially carry much more information. However, there is a serious drawback: due to mode-mixing and modal dispersion, any image coupled into a MMF is transformed into a complex speckle pattern at the output [5]. Researchers have devised various methods for reconstructing the input images from the speckle patterns, based on finding the complex transmission matrix of the MMF [6][7][8][9] or phase retrieval algorithms [10][11][12]. However, such methods generally require extra apparatus for measuring the optical phase, or have difficulty scaling to large image sizes.
Another promising approach is to use a training set of a priori known inputs to teach an artificial neural network (NN) how to map MMF output images to input images. This would not require additional interferometric equipment, and can potentially scale up to large image sizes. The idea was proposed and investigated decades ago [13][14][15], but only in recent years has it been shown to perform well for reconstructing images of reasonable complexity [16][17][18][19][20], aided by improvements in computational power and NN software.
These recent advances in NN-aided MMF image reconstruction have focused on deep convolutional neural networks (CNNs) [16][17][18][19][20][21]. Unlike traditional dense NNs [22], CNNs use convolution operations instead of general matrix multiplication within the NN layers [23], inspired by biological processes in visual perception. CNNs have enjoyed immense recent success in computer vision [24], making it natural to investigate using them for MMF image reconstruction. They have also been applied to the related problem of image reconstruction in scattering media [25][26][27][28]. However, there are grounds to question how well-suited CNNs are for analyzing speckle patterns such as those produced by MMFs, which are very different from the natural images commonly dealt with in computer vision. In MMF images, information is encoded not just locally but in the global distribution of speckles [21,29,30], whereas the localized receptive fields in convolutional layers are designed to extract relevant local features (such as edges) in natural images, rather than long-range spatial structures [31].
This paper investigates the performance of dense NNs and CNNs for MMF image reconstruction. Whereas the earliest papers on NN-aided MMF image reconstruction used dense NNs [13][14][15], most recent studies have concentrated on using CNNs [16][17][18][19][20][21] (one exception to this trend was the study by Turpin et al. of both dense NNs and CNNs for transmission control in scattering media and MMFs [27]). To our knowledge, there has been no direct comparison between the two NN architectures in the context of MMF image reconstruction.
Our principal comparison is between (i) the single hidden layer dense neural network (SHL-DNN), one of the simplest dense NN architectures, and (ii) U-Net, a CNN originally developed for biomedical imaging [32], which has recently been used for MMF image reconstruction [17]. We do not investigate very deep CNNs such as ResNet [17] or generative adversarial networks [33], as these require much greater computational resources and longer training times, and thus seem ill-suited to the MMF image reconstruction problem. After optimizing both types of NNs, we find that the SHL-DNN consistently achieves better reconstructed image fidelity than U-Net; for example, it achieves a 15% higher maximum structural similarity index (SSIM) [34] score (0.77 versus 0.67). The SHL-DNN also exhibits substantially lower training time; for example, achieving SSIM 0.67 takes 2.3 minutes of training compared to 8 minutes for U-Net on the same computer. We also validated both NNs using images collected up to 235 hours after the images in the training set; both NNs continue to perform well in image reconstruction, with the SHL-DNN outperforming U-Net by around the same margin over the entire period. Moreover, we tested a "VGG-type" NN, which combines convolutional and dense layers, and found that it offers no additional performance advantage over the SHL-DNN.

Multimode fiber imaging
The optical setup is shown in Fig. 1(a). A collimated beam from a diode laser with an operating wavelength of 808 nm (Thorlabs LP808-SF30) is expanded and directed onto a spatial light modulator (SLM) (Hamamatsu X13138-02). Along with two orthogonal polarizers, the SLM generates a programmable spatial modulation in the intensity of the light beam.
The modulated beam is coupled into a one meter long multimode fiber (MMF) (Thorlabs FT400EMT) via a matching collimator (NA 0.39). The distal end of the MMF is imaged with a CMOS camera (Thorlabs DC1545M). The camera images consist of complicated speckle patterns, as shown in the left panel of Fig. 1(b), with no apparent relation to the ground truth images from the SLM. The camera images have 1280 × 1080 pixel resolution; to obtain a tractable dataset, we crop and downsample them to 64 × 64 using the Lanczos algorithm [35], as shown in the right panel of Fig. 1(b).
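The crop-and-downsample step can be sketched as follows. This is a minimal illustration using Pillow; the centered square crop window is an assumption, since the paper does not specify the exact crop region:

```python
import numpy as np
from PIL import Image

def preprocess(frame: np.ndarray, out_size: int = 64) -> np.ndarray:
    """Crop a camera frame to a centered square (assumed) and
    downsample to out_size x out_size with the Lanczos filter."""
    h, w = frame.shape                        # e.g. (1080, 1280) grayscale frame
    s = min(h, w)
    top, left = (h - s) // 2, (w - s) // 2    # centered crop window (assumption)
    cropped = frame[top:top + s, left:left + s]
    img = Image.fromarray(cropped)            # uint8 array -> 8-bit grayscale image
    small = img.resize((out_size, out_size), Image.LANCZOS)
    return np.asarray(small)

rng = np.random.default_rng(0)
frame = rng.integers(0, 255, (1080, 1280), dtype=np.uint8)  # stand-in camera frame
speckle_64 = preprocess(frame)
```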
By operating the SLM with a refresh rate of 0.9 Hz (which allows for the generation of stable and distortion-free images), we accumulate one dataset of 61524 MMF images, collected over approximately 19 hours, for training, as well as several additional datasets spanning the subsequent 235 hours. The ground truth images are drawn equally from (i) the MNIST digit dataset containing handwritten digits in various styles [36], and (ii) the Fashion-MNIST dataset containing images of clothing and apparel [37]. The MNIST digit dataset is used for most of the experiments; the Fashion-MNIST dataset is used in Section 3.4. The MNIST and Fashion-MNIST ground truth images are 28 × 28, whereas the MMF-derived images in the dataset are 64 × 64. Conceptually, there is no reason to restrict the MMF images (the NN inputs) to the same size as the ground truth images (the NN outputs), as was the practice in earlier studies [16,17]. Intuitively, a somewhat higher input resolution should be helpful, since the image reconstruction algorithm has more information to work with, subject to the constraints of trainability and computer memory capacity. The effects of varying the input size are studied in Section 3.1.

Neural networks
We investigate and compare two NN architectures for efficacy in MMF image reconstruction: a single hidden layer dense neural network (SHL-DNN) and the convolutional neural network U-Net. (A third architecture, a hybrid convolutional/dense network, is discussed in Section 3.3.) Dense NNs are the most elementary architecture for NN-based machine learning. The earliest papers on NN-aided MMF image reconstruction utilized dense NNs [13][14][15], but were constrained by the lower levels of computational power then available. We implement the SHL-DNN shown in Fig. 1(c), featuring a hidden layer of 4096 nodes sandwiched between input and output layers, with dense interlayer node connections. Each 64 × 64 input image is flattened and inserted into the input layer (which has 64² = 4096 nodes). The result from the output layer (with 28² = 784 nodes) is reshaped into a 28 × 28 image that can be compared to the ground truth image.
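As a concrete, framework-free illustration, the forward pass of this architecture can be sketched in NumPy. The activation functions and weight scales below are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def shl_dnn_forward(speckle: np.ndarray, W1, b1, W2, b2) -> np.ndarray:
    """Forward pass of a single-hidden-layer dense network.
    Activations (ReLU hidden, sigmoid output) are assumed for illustration."""
    x = speckle.reshape(-1)                       # flatten 64x64 -> 4096 input nodes
    h = np.maximum(0.0, W1 @ x + b1)              # dense hidden layer, 4096 nodes
    y = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))      # dense output layer, 784 nodes
    return y.reshape(28, 28)                      # reshape to the reconstructed image

# Random weights, just to exercise the shapes end to end.
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.01, size=(4096, 4096)); b1 = np.zeros(4096)
W2 = rng.normal(scale=0.01, size=(784, 4096));  b2 = np.zeros(784)
recon = shl_dnn_forward(rng.random((64, 64)), W1, b1, W2, b2)
```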
Convolutional neural networks (CNNs) have been applied to the MMF image reconstruction problem by several recent authors [16][17][18][19][20][21]. Here, we employ the U-Net architecture, which Rahmani et al. have previously used for MMF image reconstruction with the MNIST digit dataset [17]. The U-Net implemented here features only minor modifications to accommodate the target image sizes. As shown in Fig. 1(d), the input is 64 × 64 and the output is 28 × 28, the same as for the SHL-DNN. The network consists of a sequence of convolutional and pooling layers leading to a 4 × 4 × 256 intermediary layer, followed by a sequence of convolutional upsampling layers. We follow the typical U-Net architecture design rule [32] wherein a halving of the layer dimensions is accompanied by a doubling of the number of filters (image depth), and vice versa. There are also auxiliary interlayer connections that aid image localization [32].
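A compact Keras sketch of such a U-Net is given below. The base filter count (16, doubling to 256 at the 4 × 4 bottleneck), kernel sizes, activations, and the cropping-based 64 × 64 → 28 × 28 output head are assumptions for illustration; the paper's exact configuration may differ:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet(input_shape=(64, 64, 1)):
    """Sketch of a U-Net: encoder to a 4x4x256 bottleneck, decoder with
    skip connections, and a cropped 28x28 output head (assumed)."""
    inp = layers.Input(shape=input_shape)
    x, skips, filters = inp, [], 16           # base filter count (assumption)
    for _ in range(4):                        # 64 -> 32 -> 16 -> 8 -> 4
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)
        filters *= 2                          # halve dims, double filters
    x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)  # 4x4x256
    for skip in reversed(skips):              # decoder with skip connections
        filters //= 2
        x = layers.Conv2DTranspose(filters, 2, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(1, 1, activation="sigmoid")(x)
    x = layers.Cropping2D(18)(x)              # 64 - 2*18 = 28 output pixels
    return Model(inp, x)
```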
The U-Net implementation depends on numerous hyperparameters such as the number of layers, convolutional filter depths, dropout ratio, batch size, etc. We tested the effects of varying these hyperparameters; the configuration shown in Fig. 1(d) seems to give the best results.
The NNs are trained using Adam optimization with a batch size of 256 images, and an early stopping condition of 500 epochs after the validation loss stops improving. Varying the batch size between 52 and 1024 produces little change in performance, while a much larger batch size (27685) drastically lengthens the training time. For the objective function, the NN output is compared against the original MNIST digit image via the structural similarity index (SSIM), a well-established metric for quantifying the similarity between structured images [34]. All training was performed on the same computer (Intel Xeon Gold 5218 with NVIDIA Quadro RTX 5000 GPU).
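The training configuration can be sketched in Keras as follows. The use of tf.image.ssim as the SSIM implementation, the [0, 1] image scaling, and the restore-best-weights choice are assumptions; the batch size, optimizer, and patience come from the text:

```python
import tensorflow as tf

def ssim_loss(y_true, y_pred):
    """Negative-SSIM objective: training maximizes the SSIM between the
    reconstruction and the ground truth (images assumed scaled to [0, 1])."""
    yt = tf.reshape(y_true, (-1, 28, 28, 1))  # add channel axis for tf.image.ssim
    yp = tf.reshape(y_pred, (-1, 28, 28, 1))
    return 1.0 - tf.reduce_mean(tf.image.ssim(yt, yp, max_val=1.0))

# Early stopping 500 epochs after the validation loss stops improving;
# restore_best_weights is a common choice, assumed here.
stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=500,
                                        restore_best_weights=True)
# model.compile(optimizer="adam", loss=ssim_loss)
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           batch_size=256, epochs=10000, callbacks=[stop])
```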

Image reconstruction fidelity
We train the SHL-DNN and U-Net using 30762 MMF images from the first 19 hours of the data collection run, with all ground truth images drawn from the MNIST digit dataset [36]. We assign 27685 images for training and the remaining 3077 for validation. The training and validation images are drawn randomly from across the collection period; the role of collection time will be investigated later, in Section 3.2. Fig. 2 shows the results of MMF image reconstruction for five representative images from the validation set. The fully-trained SHL-DNN and U-Net both recover the ground truth images with remarkable fidelity [Fig. 2(a) and (c)-(d)], despite the lack of human-discernible patterns in the MMF images [Fig. 2(b)]. The SHL-DNN results have noticeably higher fidelity than the U-Net results, as corroborated by their higher SSIM scores.
The training curves for the SHL-DNN and U-Net are shown in Fig. 3(a). These are plotted against training time to allow for fairer comparisons, since the two networks have different training times per epoch. We find that the SHL-DNN saturates at a higher SSIM (0.77), compared to the U-Net (0.67). To verify that this is not an artifact of the SSIM metric, in Fig. 3(b) and (c) we plot the mean squared error (MSE) and the resulting classification error for the validation set over the course of training (the training still uses SSIM for the objective function). The classification error is intended to be a measure of the overall legibility of the reconstructed digits, and is obtained by passing the NN outputs to an auxiliary digit classifier (mnist_cnn.py from the Keras project [38]) and comparing to the original digit labels. Both alternative measures show the SHL-DNN outperforming U-Net. The SHL-DNN achieves MSE 2.26 × 10⁻² and classification accuracy 0.91, while the U-Net achieves MSE 3.47 × 10⁻² and classification accuracy 0.82.
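The two auxiliary metrics can be computed as sketched below; `digit_classifier` stands in for the pretrained auxiliary classifier (mnist_cnn.py) and is hypothetical here:

```python
import numpy as np

def mse(recon: np.ndarray, truth: np.ndarray) -> float:
    """Mean squared error over a batch of reconstructed images."""
    return float(np.mean((recon - truth) ** 2))

def classification_accuracy(recon: np.ndarray, labels: np.ndarray,
                            digit_classifier) -> float:
    """Pass reconstructions to an auxiliary digit classifier and compare
    its predicted labels to the original digit labels.
    `digit_classifier` is a hypothetical stand-in returning (batch, 10) scores."""
    probs = digit_classifier(recon)
    predicted = np.argmax(probs, axis=1)
    return float(np.mean(predicted == labels))
```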
The metrics for both networks can be further improved by using larger training sets, with the performance lead of the SHL-DNN over U-Net appearing to be persistent. We systematically investigated the effects of various NN settings, and found that no further major performance improvements are achievable without increasing the training set size. (For these hyperparameter studies, a smaller training set of 8709 images was utilized.) For the SHL-DNN, the choice of input image size appears to play an important role. As shown in Fig. 3(d), for a smaller input image size (28 × 28) the SSIM saturates at a lower value, which can be ascribed to the NN having less information available for image reconstruction. However, inputs that are too large, such as 84 × 84, also lead to a lower SSIM, apparently because the network, with its increased number of input nodes, cannot be properly trained with the existing training set size.
The SHL-DNN performance decreases when the number of hidden layer nodes is reduced below the baseline value, as shown by the red curve in Fig. 3(d) for the 512 node case. On the other hand, further increasing the number of hidden layer nodes increases the training time without significantly improving the saturated SSIM. The number of hidden layer nodes does not seem to have much effect on the optimal input image size.
As for the U-Net, one setting that notably affects performance is the number of convolutional filters. Having fewer filters reduces the SSIM, as shown in Fig. 3(e) for the case of halving the number of filters. When the number of filters is slightly increased (e.g., up to around 1.3 times the baseline), little performance improvement is observed; but when we double the number of filters, the U-Net training does not converge. Another relevant setting is the number of convolutional layers. We verified that U-Net structures both deeper and shallower than our baseline adversely affect the performance. One immediate example is removing the middle two layers (the 4 × 4 × 256 layer and the subsequent 8 × 8 × 128 layer); the U-Net with these two layers removed does not train unless the filter sizes are halved, and even then the performance is worse, as shown in Fig. 3(e).

Performance over time
It is interesting to ask whether the image reconstruction ability of the NNs is persistent, or whether it degrades over time due to a drift in the MMF's transmission characteristics. Such temporal changes can be caused by thermal and mechanical perturbations of the environment, which induce minute deformations of the fiber.
To address this question, we validate the NNs (trained using images from the first 19 hours of the dataset) against images collected during the subsequent 235 hours. The results are shown in Fig. 4. The validation data are sorted by collection time and batched into 5 minute intervals.
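The time-sorted batching can be sketched as follows, assuming each validation image carries a timestamp (in seconds, relative to the start of validation); the interval midpoints returned here are one convenient choice for plotting:

```python
import numpy as np

def batch_by_interval(timestamps_s: np.ndarray, scores: np.ndarray,
                      interval_s: float = 300.0):
    """Sort validation results by collection time and average the scores
    within consecutive 5-minute (300 s) intervals."""
    order = np.argsort(timestamps_s)
    t, s = timestamps_s[order], scores[order]
    bins = (t // interval_s).astype(int)          # index of each 5-minute interval
    centers, means = [], []
    for b in np.unique(bins):
        mask = bins == b
        centers.append((b + 0.5) * interval_s)    # interval midpoint, for plotting
        means.append(float(s[mask].mean()))
    return np.array(centers), np.array(means)
```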
In terms of both SSIM and digit classification accuracy, the image reconstruction performance remains high over the entire 235-hour period, with the SHL-DNN continuing to outperform U-Net; we observe short-term fluctuations, but no degradation corresponding to a long-term drift in the fiber's transmission characteristics.

Hybrid neural network
Rahmani et al. [17] studied the use of another type of NN for unscrambling MMF images: a hybrid convolutional and dense network of the type pioneered by Oxford's Visual Geometry Group (VGG). VGG-type networks are typically used for classification [39], and they were used in Ref. [17] for digit classification with the MNIST digit dataset. In this paper, we are mainly interested in image reconstruction rather than classification. Nonetheless, we find it useful to study the performance of a VGG-type network for this purpose, as a further test of the usefulness of convolutional layers for extracting structural information from MMF images. We implement a simple VGG-type network as shown in Fig. 5(a), consisting of two convolutional layers, a hidden dense layer with N_h nodes, and a dense output layer. Fig. 5(b) shows the training curves for VGG-type networks with several choices of N_h, as well as for the baseline SHL-DNN. When N_h is equal to the number of hidden layer nodes in the SHL-DNN, the saturated SSIM is 0.71, comparable to but certainly not better than the SHL-DNN (SSIM 0.77). For smaller values of N_h, the performance is substantially worse. We also investigated reversing the configuration by placing the dense layers at the input and the convolutional layers at the output, but this did not produce any improvement. These results seem to bolster the case that convolutional input layers are not beneficial for MMF image reconstruction, a point that will be further discussed in Section 4.
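To make the structure concrete, a framework-free sketch of such a network is given below. The filter counts, kernel sizes, and activations are illustrative assumptions, and pooling layers are omitted for brevity:

```python
import numpy as np

def conv2d_valid(x, k):
    """Naive 'valid' 2D cross-correlation, for illustration only."""
    H, W = x.shape; kh, kw = k.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def vgg_type_forward(speckle, kernels1, kernels2, W_h, W_out):
    """Two convolutional layers followed by a dense hidden layer (N_h nodes)
    and a dense output layer. All sizes/activations are assumptions."""
    relu = lambda z: np.maximum(0.0, z)
    feats1 = [relu(conv2d_valid(speckle, k)) for k in kernels1]            # conv layer 1
    feats2 = [relu(conv2d_valid(f, k)) for f in feats1 for k in kernels2]  # conv layer 2
    x = np.concatenate([f.ravel() for f in feats2])                        # flatten
    h = relu(W_h @ x)                                                      # dense hidden
    y = 1.0 / (1.0 + np.exp(-(W_out @ h)))                                 # dense output
    return y.reshape(28, 28)

# Exercise the shapes: 64x64 input, two 3x3 filters per conv layer, N_h = 64.
rng = np.random.default_rng(1)
k1 = rng.normal(scale=0.1, size=(2, 3, 3))
k2 = rng.normal(scale=0.1, size=(2, 3, 3))
W_h = rng.normal(scale=0.01, size=(64, 2 * 2 * 60 * 60))   # 4 maps of 60x60
W_out = rng.normal(scale=0.1, size=(784, 64))
out = vgg_type_forward(rng.random((64, 64)), k1, k2, W_h, W_out)
```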

Transfer learning and alternate image set
Transferability is a common concern in machine learning. In the present context, one may ask whether NNs trained using one kind of ground truth image-say, MNIST digits-can successfully reconstruct more general images. In other words, are the networks broadly capable of undoing  the effects of mode mixing in the MMF, or are they merely recognizing patterns that are highly specific to the sort of images in the training set?
To investigate this, we train the SHL-DNN with one digit withheld from the MNIST digit dataset, and validate it against the omitted digit. Fig. 6(a) shows representative results for the case of an omitted digit '9'. Although this SHL-DNN has not seen any examples based on the digit '9', it reconstructs the images reasonably well, albeit with lower SSIM. Here, the training set (with '9' excluded) has 21576 images, and the other network settings are the same as in the baseline network described in Section 3.1. Over 1000 instances of the digit '9', the mean SSIM is 0.70, compared to SSIM 0.81 for a validation set of 5395 images that exclude the digit '9'.
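The withholding split can be sketched as follows; `withhold_digit` is a hypothetical helper name, not from the paper:

```python
import numpy as np

def withhold_digit(images: np.ndarray, labels: np.ndarray, digit: int = 9):
    """Split a labeled dataset into a training part with one digit withheld
    and a held-out part containing only that digit."""
    held_out = labels == digit
    return ((images[~held_out], labels[~held_out]),
            (images[held_out], labels[held_out]))
```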
Thus far, we have performed training and validation solely using the MNIST digit dataset. To verify that the SHL-DNN also works for more complex images, we train it on clothing and apparel images from the Fashion-MNIST dataset [37]. The SHL-DNN has the baseline configuration described in Section 3.1. As shown in Fig. 6(b), the Fashion-MNIST images are distinct from (and more complex than) the MNIST digits used in the preceding experiments. Nonetheless, the trained SHL-DNN reconstructs the images with remarkable fidelity. Over 1000 Fashion-MNIST images, the mean SSIM is 0.75.
When we attempt to reconstruct Fashion-MNIST images using a SHL-DNN trained on MNIST digits, or vice versa, the results are extremely poor (SSIM close to zero). Likewise, when we attempt to reconstruct images consisting of random uncorrelated pixel intensities, all three trained networks (SHL-DNN, U-Net, and VGG-type) give very poor results; over 1000 images, the MSE is in the range of 0.08–0.09 for all three networks, comparable to the nascent training stage of Fig. 3(b).

Discussion
We find that CNNs offer no performance advantage over the traditional dense NN architecture for MMF image reconstruction. In fact, the tested SHL-DNN outperforms U-Net both in terms of image fidelity and training time, and this seems to be robust over various different NN settings; moreover, a VGG-type hybrid convolutional/dense NN offers no obvious improvement over the SHL-DNN. For practical real-time imaging applications, simpler NN architectures may be desirable as they can be trained more quickly and with fewer computational resources.
Our interpretation of the situation is that convolutional layers, though well-suited to extracting local features in natural images, are not suited to analyzing speckle patterns in MMF images [29,30]. It would be interesting to explore modifications to the CNN scheme, or preprocessing of the speckle pattern, to improve performance [21].
The trained NNs can reliably reconstruct images collected hours after the training set; we observe short-term performance fluctuations that can be ascribed to environmental effects, but no degradation corresponding to a long-term drift in the fiber transmission characteristics. The main bottleneck in terms of training is the relatively low refresh rate of the spatial light modulator.
The NNs perform poorly on images that are too different from those in the training set, which is a common problem with NN-based machine learning. Recently, Caramazza et al. have demonstrated using an optimization algorithm to learn the complex transmission matrix for MMF image reconstruction [40], which bypasses the transfer learning limitations of the NN approach. However, this method requires much more computer memory, and the resulting image fidelity is lower; from our testing based on MNIST digits, the SSIM scores are in the range 0.2-0.5, compared to ∼ 0.75 for the SHL-DNN. In the future, it would be interesting to attempt to combine these two approaches in a way that overcomes their individual limitations.