Deep learning autofluorescence-harmonic microscopy

Laser scanning microscopy has inherent tradeoffs between imaging speed, field of view (FOV), and spatial resolution due to the limitations of sophisticated mechanical and optical setups, and deep learning networks have emerged to overcome these limitations without changing the system. Here, we demonstrate deep learning autofluorescence-harmonic microscopy (DLAM) based on self-alignment attention-guided residual-in-residual dense generative adversarial networks to close the gap between speed, FOV, and quality. Using the framework, we demonstrate label-free large-field multimodal imaging of clinicopathological tissues with enhanced spatial resolution and running time advantages. Statistical quality assessments show that the attention-guided residual dense connections minimize the persistent noise, distortions, and scanning fringes that degrade the autofluorescence-harmonic images and avoid reconstruction artifacts in the output images. With the advantages of high contrast, high fidelity, and high speed in image reconstruction, DLAM can act as a powerful tool for the noninvasive evaluation of diseases, neural activity, and embryogenesis.


Note S1 Preregistration for preparation of paired training dataset
Generally, there is a large misalignment between GR imaging and DG imaging due to the difference in scanning paths. Manual alignment is difficult because this misalignment is an affine transformation that includes both rotation and translation. Alignment based on pixel gray values is also difficult because the pixel intensities of the two images differ considerably and the resonant image contains substantial noise. After testing, ORB [1] proved more suitable for our dataset than other feature extraction and matching operators such as the scale-invariant feature transform (SIFT) [2] and speeded-up robust features (SURF) [3]. ORB is insensitive to illumination and offers scale consistency and rotation invariance; it is therefore often used for feature matching in tasks such as finding optical flow between motion frames and image stitching.
(1) For keypoint matching, the Hamming distance was used as the distance metric for the ORB descriptors, and brute-force matching was used to obtain the preregistration feature point pairs, as shown in Fig. S2a. (2) Because there were outliers among the feature point pairs, we used an iterative optimization method to calculate the homography matrix of the affine transformation. Since there was only a rotation transformation $R$ and a translation transformation $t$ between the input and the HR↓4,

$$\begin{pmatrix} x' \\ y' \end{pmatrix} = R \begin{pmatrix} x \\ y \end{pmatrix} + t,$$

where $(x, y)$ and $(x', y')$ are the keypoint coordinates in $\mathrm{KP}_{\mathrm{LR}}$ and $\mathrm{KP}_{\mathrm{HR}\downarrow 4}$ after the preregistration, respectively.
We then removed the interference of outliers using the random sample consensus (RANSAC) optimization algorithm [4] and used the Levenberg-Marquardt optimization method [5] to further improve robustness during the iterative optimization.
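Before the cropping step (3) below, the ORB matching and RANSAC outlier rejection described above can be summarized in a minimal sketch (assuming OpenCV; the file names, feature count, and reprojection threshold are illustrative, not taken from the paper):

```python
import cv2
import numpy as np

# Load the low-resolution (resonant) image and the 4x-downsampled
# high-resolution image; file names are placeholders.
lr = cv2.imread("lr_resonant.png", cv2.IMREAD_GRAYSCALE)
hr_d4 = cv2.imread("hr_down4.png", cv2.IMREAD_GRAYSCALE)

# ORB keypoint detection and description on both images.
orb = cv2.ORB_create(nfeatures=5000)
kp_lr, des_lr = orb.detectAndCompute(lr, None)
kp_hr, des_hr = orb.detectAndCompute(hr_d4, None)

# Brute-force matching with Hamming distance (binary ORB descriptors).
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des_lr, des_hr), key=lambda m: m.distance)

# Matched coordinates: KP_LR and KP_HR(down-4).
pts_lr = np.float32([kp_lr[m.queryIdx].pt for m in matches])
pts_hr = np.float32([kp_hr[m.trainIdx].pt for m in matches])

# Rotation + translation estimated with RANSAC to reject outliers;
# OpenCV refines the inlier fit with Levenberg-Marquardt iterations.
M, inliers = cv2.estimateAffinePartial2D(
    pts_lr, pts_hr, method=cv2.RANSAC, ransacReprojThreshold=3.0)

# Warp the LR image into the HR(down-4) coordinate frame.
aligned = cv2.warpAffine(lr, M, (hr_d4.shape[1], hr_d4.shape[0]))
```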
(3) We cropped the LR image into small image blocks because of graphics memory limitations and because, customarily, the receptive field of a low-level vision network does not need to be large. A large overlap of the regions of interest (ROI) could still be obtained because the pixel misalignment between the LR and HR↓4 was very small after the preregistration.

Note S2 Self-alignment pyramid, cascading, and deformable convolutions
Deformable convolution was proposed to learn irregular convolution positions to obtain more efficient feature expression capabilities [6] and shows good performance in object detection [7], recognition [8], semantic segmentation [6], video super-resolution [9], etc. The advantage of deformable convolution is that pixel calibration can be achieved at the feature level by selecting sampling-point convolutions with position deviations for multi-frame sequential images, without explicit motion estimation such as optical flow. Inspired by the video restoration framework with enhanced deformable convolution (EDVR) [10], which obtained better results than optical-flow warping in video super-resolution, we used the modulated deformable module [11] as the single-image pixel calibration module. In brief, given a convolutional kernel of K sampling locations, there are a weight $w_k$ and a pre-specified offset $p_k$ for the k-th location. For example, for a 3 × 3 kernel, K = 9 and $p_k \in \{(-1, -1), (-1, 0), \ldots, (0, 1), (1, 1)\}$. The aligned feature $F_a$ at each position $p$ is given by the deformable convolution

$$F_a(p) = \sum_{k=1}^{K} w_k \cdot F(p + p_k + \Delta p_k) \cdot \Delta m_k, \tag{7}$$

where $\Delta p_k$ and $\Delta m_k$ denote the learnable location offset and the modulation scalar in the range [0, 1] at the k-th location, respectively, and $F(p)$ is the input feature map at location $p$. Bilinear interpolation is applied in computing $F(p + p_k + \Delta p_k)$, as $p + p_k + \Delta p_k$ is mostly fractional. For both $\Delta p_k$ and $\Delta m_k$, we used a separate convolution layer acting on the same input feature maps with a total of 3K output channels: 2K channels for $\Delta p$ and K channels for $\Delta m$, the latter passed through a sigmoid layer for the range [0, 1].
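As a minimal sketch of such a modulated deformable module (assuming PyTorch and torchvision's deform_conv2d; the channel count and weight initialization are illustrative):

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class ModulatedDeformAlign(nn.Module):
    """Single-image pixel calibration with a modulated deformable conv.

    A separate conv predicts 3K channels from the input features:
    2K for the offsets (delta-p) and K for the modulation mask (delta-m).
    """

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        k2 = kernel_size * kernel_size            # K sampling locations
        self.offset_mask = nn.Conv2d(channels, 3 * k2, kernel_size,
                                     padding=kernel_size // 2)
        self.weight = nn.Parameter(
            torch.randn(channels, channels, kernel_size, kernel_size) * 0.01)
        self.kernel_size = kernel_size

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        out = self.offset_mask(feat)
        o1, o2, mask = torch.chunk(out, 3, dim=1)
        offset = torch.cat((o1, o2), dim=1)       # 2K channels: delta-p
        mask = torch.sigmoid(mask)                # K channels: delta-m in [0, 1]
        return deform_conv2d(feat, offset, self.weight,
                             padding=self.kernel_size // 2, mask=mask)

# Usage: align = ModulatedDeformAlign(64); f_aligned = align(features)
```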
Referring to EDVR and inspired by the Laplacian pyramid super-resolution network (LapSR) [12] and the texture transformer network for image super-resolution (TTSR) [13], we proposed SAPCD (Fig. S2b) based on pyramidal processing and cascading refinement. First, we performed feature extraction at Level 1 (L1, Fig. S3a). The features at level Li (i = 2, 3, ...) are then downsampled from those at level L(i−1) by strided convolution with a stride of 2. At level Li, offsets (red boxes in Fig. S3b,c) are predicted from the Li-level features and the ×2-upsampled offsets from level L(i+1) (purple dashed lines in Fig. S3b,c). Similarly, aligned features (green boxes in Fig. S3b,c) are predicted from the deformable-convolution results and the upsampled aligned features from level L(i+1). That is,

$$\Delta P^{i} = f\left(\left[F^{i}, \left(\Delta P^{i+1}\right)^{\uparrow 2}\right]\right),$$

$$F_a^{i} = g\left(\left[\mathrm{DConv}\left(F^{i}, \Delta P^{i}\right), \left(F_a^{i+1}\right)^{\uparrow 2}\right]\right),$$

where $[\cdot, \cdot]$ denotes concatenation, $f$ and $g$ are different convolutions for the different prediction tasks, $(\bullet)^{\uparrow s}$ stands for upscaling by a factor $s$ (bilinear interpolation), and DConv is the deformable convolution given by Eq. (7).
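A condensed sketch of this pyramidal, cascading alignment under the same assumptions (three levels and layer widths are illustrative; the modulation mask of Eq. (7) is omitted here for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import deform_conv2d

class PyramidCascadeAlign(nn.Module):
    """EDVR-style coarse-to-fine offset prediction and feature alignment."""

    def __init__(self, ch: int = 64, levels: int = 3, k: int = 3):
        super().__init__()
        self.levels, self.k = levels, k
        k2 = k * k
        # Strided convs build the pyramid: level i is half the size of i-1.
        self.down = nn.ModuleList(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1) for _ in range(levels - 1))
        # f: offsets from [own-level features, x2-upsampled coarser offsets].
        self.f = nn.ModuleList(
            nn.Conv2d(ch + 2 * k2, 2 * k2, 3, padding=1) for _ in range(levels))
        # g: fuses the deformable result with upsampled coarser aligned features.
        self.g = nn.ModuleList(
            nn.Conv2d(2 * ch, ch, 3, padding=1) for _ in range(levels))
        self.weight = nn.ParameterList(
            nn.Parameter(torch.randn(ch, ch, k, k) * 0.01) for _ in range(levels))

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        up = lambda x: F.interpolate(x, scale_factor=2, mode="bilinear",
                                     align_corners=False)
        feats = [feat]
        for d in self.down:                      # L1 -> L2 -> L3 features
            feats.append(d(feats[-1]))
        b, _, h, w = feats[-1].shape
        offset = feats[-1].new_zeros(b, 2 * self.k * self.k, h, w)
        aligned = feats[-1]
        for i in reversed(range(self.levels)):   # refine from coarse to fine
            if i < self.levels - 1:
                offset = up(offset) * 2          # offsets scale with resolution
                aligned = up(aligned)
            offset = self.f[i](torch.cat([feats[i], offset], dim=1))
            dconv = deform_conv2d(feats[i], offset, self.weight[i],
                                  padding=self.k // 2)
            aligned = self.g[i](torch.cat([dconv, aligned], dim=1))
        return aligned
```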

Note S3 Residual-in-residual dense attention block for generator
Based on the residual-in-residual dense block (RRDB) module used in enhanced super-resolution generative adversarial networks (ESRGAN) [14], we proposed the residual-in-residual dense attention block (RRDAB) as the dense block to reconstruct the features after the SAPCD alignment (Fig. S4a). To enhance the dense connection that concatenates different receptive fields of the input into channels, we used squeeze-and-excitation (SE) blocks, referring to SENet [15] and the convolutional block attention module (CBAM) [16], adjacent to the dense connection to learn channel and spatial attention. Networks with SE blocks can adaptively weight different fields, i.e., knowing which feature map to learn, and with spatial attention module (SAM) blocks following the SE blocks they can adaptively weight different regions, i.e., knowing which regions of the feature map to learn. Mathematically, we denote $F_{n-1}$ and $F_n$ as the input and output of the n-th RRDAB module, $R_{\mathrm{in}}$ and $R_{\mathrm{out}}$ as the input and output of a residual dense attention block (RDAB), and $D_i$ as the i-th layer in the m-layer dense attention block (DAB), with $D_0$ and $D_m$ the input and output layers, respectively.
LReLU is short for leaky ReLU,

$$\mathrm{LReLU}(x) = \begin{cases} x, & x \geq 0, \\ \alpha x, & x < 0, \end{cases}$$

where the constant $\alpha = 0.01$ in this paper. The SE block first squeezes each feature map by global average pooling,

$$z_c = \frac{1}{h \times w} \sum_{i=1}^{h} \sum_{j=1}^{w} F_c(i, j), \tag{14}$$

where $h$ and $w$ are the height and width of $F$, $F \in \mathbb{R}^{c \times h \times w}$, and $c$ is the channel number of $F$.
The excitation then reweights the channels,

$$s = \sigma\left(W_1 \, \mathrm{LReLU}(W_0 z)\right),$$

where $W_0$ and $W_1$ are the fully connected layers, $W_0 \in \mathbb{R}^{c/r \times c}$ and $W_1 \in \mathbb{R}^{c \times c/r}$, $r$ is the reduction ratio, and $\sigma$ is the sigmoid function.
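A minimal PyTorch sketch of the SE channel attention of Eq. (14) and the CBAM-style spatial attention module that follows it (the reduction ratio and kernel size are illustrative):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation: global average pool, then FC excitation."""

    def __init__(self, c: int, r: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(c, c // r),        # W0 in R^{c/r x c}
            nn.LeakyReLU(0.01),
            nn.Linear(c // r, c),        # W1 in R^{c x c/r}
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        z = x.mean(dim=(2, 3))           # squeeze, Eq. (14)
        s = self.fc(z).view(b, c, 1, 1)  # per-channel excitation weights
        return x * s                     # reweight the feature maps

class SAMBlock(nn.Module):
    """CBAM-style spatial attention: which regions of the map to learn."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Channel-wise average and max summarize each spatial location.
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn
```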
$$R_{\mathrm{out}} = R_{\mathrm{in}} + \beta D_m,$$

where $\beta$ is used for residual scaling in Fig. S4.
$$F_{\mathrm{out}} = \left(\mathrm{RRDAB}_N \circ \cdots \circ \mathrm{RRDAB}_1\right)(F_a),$$

where $F_a$ is the aligned feature from SAPCD, $F_{\mathrm{out}}$ is the output feature, and $N$ is the RRDAB cascading number.
$$I^{\mathrm{SR}} = \mathrm{Up}_{\times s}\left(F_{\mathrm{out}} + F_1\right),$$

where $s$ is the upsampling scale, which equals four in this work, and $F_1$ is the feature map after the first convolution layer in the feature extraction block of SAPCD, whose channel number equals that of $F_{\mathrm{out}}$.
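A compact sketch of how these pieces could chain together in the generator tail, reusing the SEBlock and SAMBlock above (the layer counts, growth rate, and β value are illustrative, not the paper's exact settings):

```python
import torch
import torch.nn as nn

class DAB(nn.Module):
    """m-layer dense attention block: dense convs plus SE and SAM attention."""

    def __init__(self, c: int = 64, growth: int = 32, m: int = 4):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv2d(c + i * growth, growth, 3, padding=1) for i in range(m))
        self.fuse = nn.Conv2d(c + m * growth, c, 3, padding=1)
        self.att = nn.Sequential(SEBlock(c), SAMBlock())
        self.act = nn.LeakyReLU(0.01)

    def forward(self, x):
        feats = [x]
        for conv in self.convs:                    # dense connections
            feats.append(self.act(conv(torch.cat(feats, dim=1))))
        return self.att(self.fuse(torch.cat(feats, dim=1)))

class RRDAB(nn.Module):
    """Residual-in-residual: cascaded DABs, each with scaled residual beta."""

    def __init__(self, c: int = 64, beta: float = 0.2):
        super().__init__()
        self.blocks = nn.ModuleList(DAB(c) for _ in range(3))
        self.beta = beta

    def forward(self, x):
        out = x
        for blk in self.blocks:
            out = out + self.beta * blk(out)       # RDAB residual scaling
        return x + self.beta * out                 # outer residual

class GeneratorTail(nn.Module):
    """Cascade of N RRDABs, global skip from F1, then x4 pixel shuffle."""

    def __init__(self, c: int = 64, n_blocks: int = 8):
        super().__init__()
        self.body = nn.Sequential(*[RRDAB(c) for _ in range(n_blocks)])
        self.up = nn.Sequential(
            nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2),   # x2
            nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2),   # x2 -> x4
            nn.Conv2d(c, 1, 3, padding=1))

    def forward(self, f_aligned, f1):
        return self.up(self.body(f_aligned) + f1)  # global residual + upsample
```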

Note S4 Perceptual loss function for discriminator
Perceptual loss compares high-level image feature representations (instead of pixel differences) extracted from the GT and the output to ensure their similarity in high-level information (content and global structure). We selected pretrained VGG19 [17] as the feature extraction network in the perceptual loss. After trying the feature layers $\varphi_{1,1}$, $\varphi_{2,2}$, $\varphi_{3,4}$, $\varphi_{4,4}$, and $\varphi_{5,4}$ ($\varphi_{i,j}$ is the j-th feature map layer of the block before the i-th max-pooling), which stand for low-to-high-level vision features, we chose $\varphi_{5,4}$ as the perceptual-loss layer for more distinct texture. Referring to the perceptual loss in ESRGAN, we took the feature map before the ReLU activation. We then applied a high-level perceptual loss as

$$L_{\mathrm{percep}} = \frac{1}{W_{5,4} H_{5,4}} \sum_{x=1}^{W_{5,4}} \sum_{y=1}^{H_{5,4}} \left( \varphi_{5,4}\left(I^{\mathrm{GT}}\right)_{x,y} - \varphi_{5,4}\left(I^{\mathrm{SR}}\right)_{x,y} \right)^2, \tag{18}$$

where $W_{5,4}$ and $H_{5,4}$ are the width and height of the feature map $\varphi_{5,4}$ in VGG, $I^{\mathrm{SR}}$ is the generated super-resolution result, and $I^{\mathrm{GT}}$ is the HR GT intensity.
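A minimal sketch of Eq. (18) (slicing torchvision's pretrained VGG19 so the output is the conv5_4 feature map taken before its ReLU; the channel-repeat step for grayscale inputs is an assumption):

```python
import torch
import torch.nn as nn
from torchvision.models import vgg19

class PerceptualLoss(nn.Module):
    """MSE between VGG19 conv5_4 features (pre-ReLU) of SR and GT."""

    def __init__(self):
        super().__init__()
        # features[:35] ends at conv5_4, i.e. phi_{5,4} before activation.
        self.vgg = vgg19(weights="IMAGENET1K_V1").features[:35].eval()
        for p in self.vgg.parameters():
            p.requires_grad = False          # frozen feature extractor
        self.mse = nn.MSELoss()

    def forward(self, sr: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
        # Grayscale microscopy images are repeated to 3 channels for VGG.
        if sr.shape[1] == 1:
            sr, gt = sr.repeat(1, 3, 1, 1), gt.repeat(1, 3, 1, 1)
        return self.mse(self.vgg(sr), self.vgg(gt))
```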
Because of the application of the GAN, the total loss for the generator is

$$L_{\mathrm{Gen}} = L_{\mathrm{percep}} + \lambda L_{\mathrm{GAN\_gen}} + \eta L_1, \tag{19}$$

where Gen and Dis are short for generator and discriminator, and $\theta_{\mathrm{Gen}}$ and $\theta_{\mathrm{Dis}}$ are their network parameters. $L_{\mathrm{GAN\_gen}}$ is the generative loss that encourages the generator to make super-resolution results natural enough to deceive the discriminator, whereas $L_{\mathrm{GAN\_dis}}$ drives the discriminator to classify the generated super-resolution result as false and the GT as true. The two converge after adversarial training, making the generator predict super-resolution images residing on the manifold of natural images [18]. The pixel loss

$$L_1 = \mathbb{E}\left[\left\lVert I^{\mathrm{SR}} - I^{\mathrm{GT}} \right\rVert_1\right] \tag{22}$$

ensures pixel-wise fidelity while avoiding the over-smoothing seen in some super-resolution methods that use the mean-square error (MSE) [19].
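A sketch of how these terms could combine in a training step (standard binary cross-entropy adversarial terms are assumed here, and the weights λ and η are illustrative; perceptual_loss is the module sketched above):

```python
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()
l1 = nn.L1Loss()
lam, eta = 5e-3, 1e-2                      # illustrative loss weights

def generator_loss(disc, sr, gt, perceptual_loss):
    """L_Gen = L_percep + lambda * L_GAN_gen + eta * L_1, Eq. (19)."""
    pred_fake = disc(sr)
    adv = bce(pred_fake, torch.ones_like(pred_fake))   # fool the discriminator
    return perceptual_loss(sr, gt) + lam * adv + eta * l1(sr, gt)

def discriminator_loss(disc, sr, gt):
    """L_GAN_dis: GT classified as true, generated SR classified as false."""
    pred_real, pred_fake = disc(gt), disc(sr.detach())
    return (bce(pred_real, torch.ones_like(pred_real)) +
            bce(pred_fake, torch.zeros_like(pred_fake)))
```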

Note S5 Comparison of image quality enhancement by different networks
Commonly used image quality indicators include MSE, PSNR, and the SSIM index. In most reconstruction tasks, these indicators can effectively evaluate image improvement. However, in super-resolution reconstruction tasks, after the introduction of generative adversarial networks in recent years, researchers found that a high PSNR or SSIM does not necessarily represent better reconstruction quality. This is because a high MSE-based PSNR corresponds to excessive smoothing, while SSIM comprehensively evaluates luminance, contrast, and structure but not the perceived quality of the images. Although SRResNet attained the highest PSNR and SSIM, the reconstructed images tended to be overly smooth (Fig. 5b). ResNet- and RRDB-GAN networks had noise suppression capabilities similar to those of DLAM but provided little small-scale information (Fig. 5d). The texture details in high-PSNR or high-SSIM images do not necessarily meet human perception. Thus, we also used no-reference quality metrics, including BRISQUE, NIQE, and PIQE, to evaluate image quality. The opinion-aware BRISQUE is limited to evaluating the same type of distortion and requires complicated custom differential mean opinion score values obtained through experimentation for the training datastore, so it can hardly reveal reconstruction ambiguity (unless specially trained on such distortion quality-aware features). NIQE is opinion-unaware, related to local image sharpness [20], and refers to the expected statistical features of the GT images. PIQE is opinion-unaware and unsupervised and can estimate block-wise distortion. These scores gave a large value (low quality) for the reconstruction ambiguity produced by SRResNet.
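For reference, the full-reference indicators above can be computed with scikit-image as follows (a sketch; BRISQUE, NIQE, and PIQE are available in other toolboxes, e.g., MATLAB's Image Processing Toolbox, and are not shown):

```python
import numpy as np
from skimage.metrics import (mean_squared_error,
                             peak_signal_noise_ratio,
                             structural_similarity)

def full_reference_scores(sr: np.ndarray, gt: np.ndarray) -> dict:
    """MSE, PSNR, and SSIM between a reconstruction and its GT."""
    rng = gt.max() - gt.min()
    return {
        "MSE": mean_squared_error(gt, sr),
        "PSNR": peak_signal_noise_ratio(gt, sr, data_range=rng),
        "SSIM": structural_similarity(gt, sr, data_range=rng),
    }
```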
Nevertheless, these metrics merely evaluate the noise suppression and natural/perceptual quality improvement of the ResNet- and RRDB-GAN networks while losing sight of reconstruction artifacts. The resolution-scaled Pearson correlation (RSP) scores and resolution-scaled error (RSE) in Table S2, as well as the error maps in Fig. S14, reveal the differences in the reconstruction artifacts produced by the different networks against the GT. Overall, DLAM guided by dual SE or SE-SAM modules attained a much lower RSE and a higher RSP score owing to its prevention of artifacts and blurring.

Note S6 Evaluation of resolution improvement
The embedded super-resolution framework in the generator (Fig. S1) has the capability to enhance the spatial resolution of a microscope. We extracted the PSFs of the label-free SHG microscope images and their downsampled results, as well as the network output images (Fig. S12a), to estimate their Rayleigh resolution [21]. The Gaussian FWHM [22-24] for the input image was calculated to be 501 nm, while the proposed deep network greatly improved it to 290 nm, which approached the value (275 nm) of the GT image (Fig. S12b). Axial resolution was not calibrated because pathological analysis of frozen sections is usually performed laterally and the tissue thickness is only 5 μm. To confirm the overall resolution improvement by DLAM, we performed a statistical evaluation of the spatial resolution for the input, output, and GT images, as shown in Fig. S12c. The mean FWHM for the input images was 481 nm. After deep network learning, this FWHM was improved to 289 nm, providing a very good match to the PSF results of the SHG modality with a mean FWHM of 282 nm. Additionally, Fourier ring correlation (FRC) has been reported to be an efficient "blind" resolution metric [25] that is less prone to prejudice in the selection process. We calculated the FRC on the SHG images (Fig. S12d) to yield an objective spatial resolution. The result demonstrates a resolution improvement of approximately 155 nm (Fig. S12e).
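A minimal sketch of such an FWHM estimation (fitting a 1-D Gaussian to an intensity line profile with SciPy; the profile source and pixel size are assumptions), using FWHM = 2√(2 ln 2)·σ:

```python
import numpy as np
from scipy.optimize import curve_fit

def gaussian(x, a, mu, sigma, offset):
    return a * np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) + offset

def fwhm_nm(profile: np.ndarray, pixel_size_nm: float) -> float:
    """Fit a Gaussian to a 1-D intensity profile; return its FWHM in nm."""
    x = np.arange(profile.size, dtype=float)
    p0 = [profile.max() - profile.min(), profile.argmax(), 2.0, profile.min()]
    (a, mu, sigma, offset), _ = curve_fit(gaussian, x, profile, p0=p0)
    return 2.0 * np.sqrt(2.0 * np.log(2.0)) * abs(sigma) * pixel_size_nm

# Example: fwhm_nm(line_profile, pixel_size_nm=50.0)  # pixel size assumed
```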