Unsupervised content-preserving transformation for optical microscopy

The development of deep learning and open access to a substantial collection of imaging data together provide a potential solution for computational image transformation, which is gradually changing the landscape of optical imaging and biomedical research. However, current implementations of deep learning usually operate in a supervised manner, and their reliance on laborious and error-prone data annotation procedures remains a barrier to more general applicability. Here, we propose an unsupervised image transformation to facilitate the utilization of deep learning for optical microscopy, even in some cases in which supervised models cannot be applied. Through the introduction of a saliency constraint, the unsupervised model, named Unsupervised content-preserving Transformation for Optical Microscopy (UTOM), can learn the mapping between two image domains without requiring paired training data while avoiding distortions of the image content. UTOM shows promising performance in a wide range of biomedical image transformation tasks, including in silico histological staining, fluorescence image restoration, and virtual fluorescence labeling. Quantitative evaluations reveal that UTOM achieves stable and high-fidelity image transformations across different imaging conditions and modalities. We anticipate that our framework will encourage a paradigm shift in training neural networks and enable more applications of artificial intelligence in biomedical imaging.


Supplementary Figures
Figure S1 | UTOM can preserve the image content during transformation. Current unsupervised methods lack this content-preserving ability, and the image content is distorted when transformed to the target domain. With the saliency constraint, UTOM learns content-preserving transformations and the semantic information is well maintained. Adjacent sections stained with haematoxylin and eosin (H&E) are shown in the bottom row for reference.

… (Fig. S4). The saliency constraint was imposed with different constant weights ρ ranging from 0 to 20. For each experiment, NRMSE, PSNR and SSIM were arithmetically averaged over all 144 image patches in the test set. b, Box-dot plots show the distributions of NRMSE, PSNR and SSIM obtained with different ρ. c, Typical results under different ρ. Without the saliency constraint, the network was unstable and sometimes converged to incorrect mappings. The saliency constraint effectively corrects this mapping bias; however, when the constant ρ is too large, the other loss terms are down-weighted and the performance degrades.
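A minimal Python sketch of this evaluation procedure, assuming scikit-image and hypothetical file lists pred_paths and gt_paths (this is an illustration, not the actual evaluation script), could look like the following:

```python
# Sketch: arithmetically average NRMSE / PSNR / SSIM over paired test patches.
# File lists and layout are hypothetical placeholders.
import numpy as np
from skimage import io
from skimage.metrics import (normalized_root_mse,
                             peak_signal_noise_ratio,
                             structural_similarity)

def average_metrics(pred_paths, gt_paths):
    """Return the mean NRMSE, PSNR and SSIM over all test image pairs."""
    nrmse, psnr, ssim = [], [], []
    for p_path, g_path in zip(pred_paths, gt_paths):
        pred = io.imread(p_path).astype(np.float32)
        gt = io.imread(g_path).astype(np.float32)
        data_range = gt.max() - gt.min()
        nrmse.append(normalized_root_mse(gt, pred))
        psnr.append(peak_signal_noise_ratio(gt, pred, data_range=data_range))
        ssim.append(structural_similarity(gt, pred, data_range=data_range))
    return np.mean(nrmse), np.mean(psnr), np.mean(ssim)
```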

Figure S10 | Network architectures.
Here we take a 256×256 input as an example. Each coral rectangle represents a feature map extracted by the corresponding convolutional kernels. a, The generator is a multi-layer residual network with downsampling input layers and upsampling output layers. b, The discriminator (a PatchGAN classifier) uses multiple strided convolutions for abstract representation. It outputs a matrix in which each element corresponds to a patch in the input image; the final output is the loss averaged over all patches.
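A minimal PyTorch sketch of these two architectures is given below. Layer counts, channel widths and normalization choices are assumptions for illustration and are not necessarily those used in UTOM.

```python
# Sketch of a residual generator (downsampling input layers, residual body,
# upsampling output layers) and a PatchGAN discriminator built from strided
# convolutions whose output matrix is later averaged over all patches.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.InstanceNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)  # residual connection

class Generator(nn.Module):
    def __init__(self, in_ch=1, out_ch=1, base=64, n_res=9):
        super().__init__()
        layers = [nn.Conv2d(in_ch, base, 7, padding=3),
                  nn.InstanceNorm2d(base), nn.ReLU(inplace=True)]
        # downsampling input layers: 256 -> 128 -> 64
        layers += [nn.Conv2d(base, base * 2, 3, stride=2, padding=1),
                   nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
                   nn.Conv2d(base * 2, base * 4, 3, stride=2, padding=1),
                   nn.InstanceNorm2d(base * 4), nn.ReLU(inplace=True)]
        # multi-layer residual body
        layers += [ResidualBlock(base * 4) for _ in range(n_res)]
        # upsampling output layers: 64 -> 128 -> 256
        layers += [nn.ConvTranspose2d(base * 4, base * 2, 3, stride=2,
                                      padding=1, output_padding=1),
                   nn.InstanceNorm2d(base * 2), nn.ReLU(inplace=True),
                   nn.ConvTranspose2d(base * 2, base, 3, stride=2,
                                      padding=1, output_padding=1),
                   nn.InstanceNorm2d(base), nn.ReLU(inplace=True),
                   nn.Conv2d(base, out_ch, 7, padding=3), nn.Tanh()]
        self.model = nn.Sequential(*layers)

    def forward(self, x):
        return self.model(x)

class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=1, base=64):
        super().__init__()
        self.model = nn.Sequential(
            nn.Conv2d(in_ch, base, 4, stride=2, padding=1),
            nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base, base * 2, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 2), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 2, base * 4, 4, stride=2, padding=1),
            nn.InstanceNorm2d(base * 4), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(base * 4, 1, 4, padding=1))  # one score per image patch

    def forward(self, x):
        return self.model(x)  # the loss is averaged over this patch matrix
```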
Figure S11 | Data pre-processing pipeline. Some of our training sets were derived from published datasets with paired ground-truth images. We randomly selected one half of each dataset and collected its raw images into domain A, and then took the other half and collected its ground-truth images into domain B.

Figure S12 | Tiling and stitching in pre- and post-processing. In most cases, the images to be transformed are extremely large in pixel size. In our data-processing pipeline, large images were partitioned into multiple overlapping tiles to reduce memory requirements and improve training efficiency. After inference, the edges of the output patches were cropped and the remaining central parts were stitched together to reconstruct the full-size image.
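A minimal sketch of this tiling and stitching idea is shown below. Tile size, overlap and cropping margin are assumed values, and the simplified stitching leaves the outermost margin of the full image empty; the actual pipeline may handle borders differently.

```python
# Sketch: split a large 2-D image into overlapping tiles, then crop each
# output tile's edges and place the central parts back into a full canvas.
import numpy as np

def tile_image(img, tile=256, overlap=32):
    """Return overlapping tiles and their top-left positions."""
    step = tile - overlap
    tiles, positions = [], []
    for y in range(0, max(img.shape[0] - overlap, 1), step):
        for x in range(0, max(img.shape[1] - overlap, 1), step):
            y0 = min(y, img.shape[0] - tile)  # clamp so tiles stay inside
            x0 = min(x, img.shape[1] - tile)
            tiles.append(img[y0:y0 + tile, x0:x0 + tile])
            positions.append((y0, x0))
    return tiles, positions

def stitch_tiles(tiles, positions, shape, margin=16):
    """Crop each tile's edges and stitch the central regions together."""
    out = np.zeros(shape, dtype=tiles[0].dtype)
    for patch, (y0, x0) in zip(tiles, positions):
        t = patch.shape[0]
        out[y0 + margin:y0 + t - margin, x0 + margin:x0 + t - margin] = \
            patch[margin:t - margin, margin:t - margin]
    return out  # outer `margin` pixels of the canvas remain empty here
```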

Supplementary Notes
Network architectures and the loss function. It is worth mentioning that the input and output channel numbers of the two generators should match so that they can form a complete cycle, especially when images in domain A and images in domain B have different channel numbers. In terms of the objective function, the first part is the widely used adversarial loss, which takes the standard form

$\mathcal{L}_{\mathrm{adv}} = \mathbb{E}_{b}\left[\log D_A(b)\right] + \mathbb{E}_{a}\left[\log\left(1 - D_A(G_A(a))\right)\right] + \mathbb{E}_{a}\left[\log D_B(a)\right] + \mathbb{E}_{b}\left[\log\left(1 - D_B(G_B(b))\right)\right],$

where $G_A$ and $G_B$ denote the forward (A→B) and backward (B→A) generators, $D_A$ and $D_B$ represent the discriminators of the forward GAN and the backward GAN, respectively, lowercase $a$ and $b$ are images from the domains denoted by the corresponding uppercase letters, and $\mathbb{E}$ is the expectation operator.
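As an illustration, a minimal PyTorch sketch of this adversarial term is given below. The module names g_ab, g_ba, d_a and d_b are placeholders, and the log-likelihood (binary cross-entropy) form is assumed here; a least-squares variant could be substituted without changing the structure.

```python
# Sketch: adversarial loss for the forward (A->B) and backward (B->A) GANs.
import torch
import torch.nn.functional as F

def adversarial_loss(d_a, d_b, g_ab, g_ba, real_a, real_b):
    """L_adv combining the forward and backward GAN terms."""
    fake_b = g_ab(real_a)  # forward generator output, should look like domain B
    fake_a = g_ba(real_b)  # backward generator output, should look like domain A

    def bce(pred, is_real):
        target = torch.ones_like(pred) if is_real else torch.zeros_like(pred)
        # BCE averages over the PatchGAN output matrix, i.e. over all patches
        return F.binary_cross_entropy_with_logits(pred, target)

    # forward GAN: D_A separates real domain-B images from G_A(a)
    loss_fwd = bce(d_a(real_b), True) + bce(d_a(fake_b), False)
    # backward GAN: D_B separates real domain-A images from G_B(b)
    loss_bwd = bce(d_b(real_a), True) + bce(d_b(fake_a), False)
    return loss_fwd + loss_bwd
```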
The second part of the loss function is the cycle-consistency loss, which is the most essential term for training the two GANs:

$\mathcal{L}_{\mathrm{cyc}} = \mathbb{E}_{a}\left[\lVert G_B(G_A(a)) - a \rVert_1\right] + \mathbb{E}_{b}\left[\lVert G_A(G_B(b)) - b \rVert_1\right].$

The last part is the saliency constraint term, which corrects mapping errors and improves the success rate of training. The full objective function can then be written as a weighted sum of the three terms,

$\mathcal{L} = \mathcal{L}_{\mathrm{adv}} + \lambda\,\mathcal{L}_{\mathrm{cyc}} + \rho\,\mathcal{L}_{\mathrm{sal}},$

where $\lambda$ and $\rho$ control the relative weights of the cycle-consistency loss and the saliency constraint, respectively.
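A minimal PyTorch sketch of the cycle-consistency term and the weighted full objective is given below. The saliency term appears only as a precomputed placeholder value, and the weight names lambda_cyc and rho (with default values) are assumptions for illustration.

```python
# Sketch: cycle-consistency loss and the weighted full objective.
import torch

def cycle_consistency_loss(g_ab, g_ba, real_a, real_b):
    """L1 reconstruction error of the two cycles A->B->A and B->A->B."""
    rec_a = g_ba(g_ab(real_a))  # A -> B -> A
    rec_b = g_ab(g_ba(real_b))  # B -> A -> B
    return torch.mean(torch.abs(rec_a - real_a)) + \
           torch.mean(torch.abs(rec_b - real_b))

def full_objective(loss_adv, loss_cyc, loss_sal, lambda_cyc=10.0, rho=1.0):
    """Weighted sum of adversarial, cycle-consistency and saliency terms."""
    return loss_adv + lambda_cyc * loss_cyc + rho * loss_sal
```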