Compressive confocal microscopy imaging at the single-photon level with ultra-low sampling ratios

Laser-scanning confocal microscopy is a critical instrument for microscopic research in biology, but it suffers from low imaging speed and high phototoxicity. Here we build a deep compressive confocal microscope (DCCM), which employs a digital micromirror device as both a coding mask for single-pixel imaging and a pinhole for confocal microscopic imaging. Combined with a deep learning reconstruction algorithm, our system achieves high-quality confocal microscopic imaging with low phototoxicity. Imaging experiments with fluorescent microspheres demonstrate single-pixel confocal imaging at a sampling ratio of only approximately 0.03% in specific sparse scenarios. Moreover, the DCCM allows single-pixel imaging at the single-photon level, which reduces the excitation light power required for confocal imaging and suppresses phototoxicity. We believe that our system has great potential for long-duration and high-speed microscopic imaging of living cells.

Comparison of DCCM-Net trained with realistic and simulated datasets
To make the network robust against the noise present in real scenes, we train it with real datasets. For comparison, we trained the network with realistic datasets and with simulated datasets respectively and evaluated the reconstruction performance. The simulated compressed data $y$ are obtained by multiplying the ground truth $x$ with the sampling matrix $A$, i.e., $y = Ax$. Fig. S1 shows the test results of DCCM-Net on real compressed data acquired by the DCCM after training with the realistic and the simulated datasets, respectively. The network trained with real data clearly achieves better reconstruction performance.
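To make the comparison concrete, the following is a minimal sketch (not the authors' code) of how a simulated training pair can be generated as $y = Ax$; the image size, sampling ratio, and binary pattern matrix used here are illustrative assumptions.

```python
# Minimal sketch of generating simulated compressed data y = A x for training.
# All shapes and variable names here are illustrative assumptions.
import numpy as np

H, W = 64, 64                      # assumed image size
sampling_ratio = 0.01              # assumed sampling ratio
M = int(sampling_ratio * H * W)    # number of single-pixel measurements

rng = np.random.default_rng(0)
A = rng.integers(0, 2, size=(M, H * W)).astype(np.float32)   # binary DMD-style patterns

def simulate_measurement(x_img: np.ndarray) -> np.ndarray:
    """Simulate noiseless compressed data from a ground-truth image."""
    return A @ x_img.reshape(-1)

x_gt = rng.random((H, W), dtype=np.float32)   # stand-in for a ground-truth image
y_sim = simulate_measurement(x_gt)            # simulated training pair (y_sim, x_gt)
# A realistic dataset instead pairs x_gt with y measured by the DCCM hardware,
# so the network also learns the system's real noise and non-idealities.
```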

Structural details of the recovery subnet (RS) of our proposed DCCM-Net
As illustrated in Fig. S2, given the initial estimation $\hat{x}_{\text{init}} = \mathcal{H}_{\text{IS}}(y, A) = A^{\top} y \in \mathbb{R}^{N}$ from the initialization subnet (IS) $\mathcal{H}_{\text{IS}}$, the sampling ratio $\gamma \in [0, 1]$, and the sampling patterns $A \in \mathbb{R}^{M \times N}$, our RS ($\mathcal{H}_{\text{RS}}$) first forms a channel-wise concatenation of $\hat{x}_{\text{init}}$ and a sampling ratio map $\Gamma \in \mathbb{R}^{H \times W}$ with all elements equal to $\gamma$. Subsequently, a $3 \times 3$ convolution layer and a PixelUnshuffle (ref. 1) of scaling factor $s$ are applied to extract a shallow image feature of $C$ channels, which is then refined in a step-by-step manner by $K$ PGD-unrolled (ref. 2) stages. The final recovered result $\hat{x}$ is obtained from the refined feature by a $3 \times 3$ convolution layer and a PixelShuffle of scaling factor $s$. The structural details of our RS design are given as follows.
Considering the trade-off among the capacity, receptive field, and computation cost of the network, our $\mathcal{H}_{\text{ext}}$ employs a convolution, followed by a PixelUnshuffle $(\cdot)_{\downarrow s}$ with factor $s$, to fuse the reshaped $\hat{x}_{\text{init}}$ of size $1 \times H \times W$ and a uniform sampling ratio map $\Gamma = [\gamma]_{1 \times H \times W}$ into a downscaled feature:

$$\hat{F}^{(0)} = \mathcal{H}_{\text{ext}}(\hat{x}_{\text{init}}, \Gamma) = \big(\mathrm{Conv}([\hat{x}_{\text{init}}, \Gamma])\big)_{\downarrow s} \in \mathbb{R}^{C \times \frac{H}{s} \times \frac{W}{s}},$$

where $[\hat{x}_{\text{init}}, \Gamma] \in \mathbb{R}^{2 \times H \times W}$ is the channel-wise concatenation of $\hat{x}_{\text{init}}$ and $\Gamma$.
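As a concrete illustration of this extraction step, the following PyTorch sketch builds the Conv + PixelUnshuffle fusion under assumed values of $s$ and $C$; the layer layout, channel counts, and names are our assumptions, not the released implementation.

```python
# Minimal sketch of H_ext: concatenate the initial estimate with a constant
# sampling-ratio map, fuse with a 3x3 convolution, and downscale by PixelUnshuffle.
import torch
import torch.nn as nn

s, C = 2, 32                                   # assumed scaling factor and channel count
ext = nn.Sequential(
    nn.Conv2d(2, C // (s * s), 3, padding=1),  # fuse the 2 x H x W concatenation
    nn.PixelUnshuffle(s),                      # -> C x H/s x W/s downscaled feature
)

x_init = torch.rand(1, 1, 64, 64)              # reshaped initial estimate, 1 x H x W
ratio_map = torch.full_like(x_init, 0.01)      # Gamma = [gamma]_{1 x H x W}
feat0 = ext(torch.cat([x_init, ratio_map], dim=1))   # shape (1, 32, 32, 32)
```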
Each unrolled PGD stage includes a gradient descent module and a proximal mapping module in series. The former performs the analytic gradient descent process on the first $s^2$ feature channels while keeping the other $(C - s^2)$ channels unchanged to hold a high-throughput transmission (ref. 3) without information loss. The latter then employs two cascaded nonlinear activation-free blocks (NAFBs) to carry out a denoising process. Specifically, in the $k$-th unrolled stage module $\mathcal{H}_{\text{stage}}^{(k)}$, we map Eqs. (3) and (4) in the Methods to two cascaded refinement steps.
To maintain a maximized information flow and enhance the network's flexibility, we perform the gradient descent of Eq. (3) in the Methods on the front feature part $\hat{F}_{:s^2}^{(k-1)}$ of shape $s^2 \times \frac{H}{s} \times \frac{W}{s}$:

$$\bar{F}_{:s^2}^{(k)} = \Big( \big(\hat{F}_{:s^2}^{(k-1)}\big)_{\uparrow s} - \rho^{(k)} A^{\top}\big( A \big(\hat{F}_{:s^2}^{(k-1)}\big)_{\uparrow s} - y \big) \Big)_{\downarrow s},$$

where $(\cdot)_{\uparrow s}$ is the PixelShuffle of scaling factor $s$, $(\cdot)_{\downarrow s}$ is its inverse PixelUnshuffle, $\rho^{(k)}$ is the step size from Eq. (3), and the remaining $(C - s^2)$ channels are passed through unchanged, i.e., $\bar{F}_{s^2:}^{(k)} = \hat{F}_{s^2:}^{(k-1)}$. The second step of $\mathcal{H}_{\text{stage}}^{(k)}$ is a generalized proximal mapping (or a feature-level denoising step), which is achieved by our developed two cascaded basic NAFBs and can be formulated as:

$$\hat{F}^{(k)} = \mathcal{H}_{\text{NAFB}_2}^{(k)}\big(\mathcal{H}_{\text{NAFB}_1}^{(k)}\big(\bar{F}^{(k)}\big)\big).$$

In each NAFB, the input image feature is first passed through a layer normalization (LN) (ref. 4) and transformed into a $2C$-channel feature space via a $1 \times 1$ convolution layer followed by a $3 \times 3$ depth-wise convolution layer.
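The gradient-descent half of a stage can be sketched in PyTorch as below, assuming the standard PGD update $x - \rho A^{\top}(Ax - y)$ applied in the image domain after a PixelShuffle of the front $s^2$ channels; the function name, shapes, and the explicit matrix form of $A$ are illustrative assumptions.

```python
# Sketch of the feature-level gradient-descent step: the first s^2 channels are
# shuffled back to the image domain, updated with x - rho * A^T (A x - y), and
# unshuffled back, while the remaining C - s^2 channels pass through untouched.
import torch
import torch.nn.functional as F

def gradient_descent_step(feat: torch.Tensor, A: torch.Tensor, y: torch.Tensor,
                          rho: torch.Tensor, s: int) -> torch.Tensor:
    # feat: (B, C, H/s, W/s); A: (M, H*W); y: (B, M); rho: learnable scalar step size
    front, rest = feat[:, :s * s], feat[:, s * s:]
    x = F.pixel_shuffle(front, s)                      # (B, 1, H, W) image-domain estimate
    B, _, H, W = x.shape
    x_flat = x.reshape(B, -1)                          # (B, H*W)
    grad = (x_flat @ A.t() - y) @ A                    # A^T (A x - y), batched
    x_flat = x_flat - rho * grad
    front = F.pixel_unshuffle(x_flat.reshape(B, 1, H, W), s)
    return torch.cat([front, rest], dim=1)
```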
The feature is then evenly split into two $C$-channel parts, which are merged into a single $C$-channel feature by element-wise multiplication. After that, the merged feature is transformed by a $3 \times 3$ convolution layer and scaled by the attentive weights generated from another $3 \times 3$ convolution layer, yielding the final feature residual and the output. To summarize, each NAFB contains a LayerNorm, four convolutions, and two element-wise multiplications to adaptively learn a non-linear residual for the input feature without any activation function. Note that a common alternative to the NAFB is the classic residual block (RB), composed of a ReLU activation sandwiched between two convolutions with an identity skip connection. We empirically find that our NAFB, improved upon this baseline transform block, brings satisfactory quality at a low cost in parameters and computation, and is therefore a better choice than RBs. By merging Eqs. (3) and (4) in the Methods into these two feature-level refinement steps, each unrolled stage performs one complete PGD iteration.
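A self-contained PyTorch sketch of one NAFB, paraphrased from the description above (and not the authors' released code), could look as follows; the channel count and the exact placement of the LayerNorm are assumptions.

```python
# Sketch of an NAFB: LayerNorm, 1x1 conv to 2C channels, 3x3 depth-wise conv,
# split-and-multiply gate, a 3x3 conv scaled by attentive weights from another
# 3x3 conv, and a residual output. No activation function is used.
import torch
import torch.nn as nn

class ChannelLayerNorm(nn.Module):
    """LayerNorm applied over the channel dimension of a (B, C, H, W) tensor."""
    def __init__(self, C: int):
        super().__init__()
        self.norm = nn.LayerNorm(C)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

class NAFB(nn.Module):
    def __init__(self, C: int):
        super().__init__()
        self.ln = ChannelLayerNorm(C)
        self.expand = nn.Conv2d(C, 2 * C, 1)                              # 1x1 conv to 2C channels
        self.dwconv = nn.Conv2d(2 * C, 2 * C, 3, padding=1, groups=2 * C) # 3x3 depth-wise conv
        self.proj = nn.Conv2d(C, C, 3, padding=1)                         # transform merged feature
        self.attn = nn.Conv2d(C, C, 3, padding=1)                         # produce attentive weights

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f = self.dwconv(self.expand(self.ln(x)))
        a, b = f.chunk(2, dim=1)                          # split into two C-channel parts
        gated = a * b                                     # merge by element-wise multiplication
        residual = self.proj(gated) * self.attn(gated)    # scale by attentive weights
        return x + residual                               # activation-free residual update

out = NAFB(C=32)(torch.rand(1, 32, 16, 16))
```

As summarized above, this sketch uses exactly one LayerNorm, four convolutions, and two element-wise multiplications, with no activation function.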
Finally, our recovery module $\mathcal{H}_{\text{rec}}$ employs a convolution, followed by a PixelShuffle $(\cdot)_{\uparrow s}$, to reconstruct the intensity (or image) $\hat{x}$ from the final feature $\hat{F}^{(K)}$ refined and generated by our $K$ unrolled PGD stages:

$$\hat{x} = \mathcal{H}_{\text{rec}}\big(\hat{F}^{(K)}\big) = \big(\mathrm{Conv}(\hat{F}^{(K)})\big)_{\uparrow s}.$$

Compared to other existing deep reconstruction networks, our RS enjoys three significant advantages. First, it is optimization-inspired and has a well-defined architecture based on the traditional PGD, while maintaining a high-capacity feature-level information flow throughout the network trunk, leading to an excellent balance between performance and interpretability. Second, once trained with our training scheme, it adapts to arbitrary sampling ratios. Third, the NAFB design, which incorporates no activation function, proves more powerful and efficient than other traditional basic blocks (ref. 5), such as the classic residual block. Furthermore, it is validated to be effective and suitable for compressive confocal microscopic imaging tasks.
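Putting the pieces together, the following compact sketch (again an illustration under assumed shapes and names, not the published DCCM-Net code) chains $\mathcal{H}_{\text{ext}}$, $K$ unrolled stages with the image-domain gradient step, a simple convolutional stand-in for the two NAFBs, and $\mathcal{H}_{\text{rec}}$.

```python
# Compact, self-contained sketch of the recovery-subnet data flow described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecoverySubnetSketch(nn.Module):
    def __init__(self, A: torch.Tensor, s: int = 2, C: int = 32, K: int = 5):
        super().__init__()
        self.register_buffer("A", A)                     # (M, H*W) sampling matrix
        self.s = s
        self.rho = nn.Parameter(torch.full((K,), 0.1))   # learnable step sizes
        self.ext = nn.Sequential(nn.Conv2d(2, C // (s * s), 3, padding=1),
                                 nn.PixelUnshuffle(s))
        # Stand-ins for the two cascaded NAFBs of each proximal-mapping module
        self.denoisers = nn.ModuleList(
            [nn.Conv2d(C, C, 3, padding=1) for _ in range(K)])
        self.rec = nn.Sequential(nn.Conv2d(C, s * s, 3, padding=1),
                                 nn.PixelShuffle(s))

    def forward(self, x_init, y, gamma):
        s = self.s
        feat = self.ext(torch.cat([x_init, torch.full_like(x_init, gamma)], dim=1))
        for k, denoise in enumerate(self.denoisers):
            front, rest = feat[:, :s * s], feat[:, s * s:]
            x = F.pixel_shuffle(front, s)                # back to the image domain
            B, _, H, W = x.shape
            x_flat = x.reshape(B, -1)
            x_flat = x_flat - self.rho[k] * ((x_flat @ self.A.t() - y) @ self.A)
            front = F.pixel_unshuffle(x_flat.reshape(B, 1, H, W), s)
            feat = torch.cat([front, rest], dim=1)
            feat = feat + denoise(feat)                  # proximal-mapping stand-in
        return self.rec(feat)

H = W = 64
A = torch.rand(40, H * W)                                # ~1% sampling ratio, illustrative patterns
net = RecoverySubnetSketch(A)
x_hat = net(torch.rand(1, 1, H, W), torch.rand(1, 40), gamma=40 / (H * W))
```

In the actual DCCM-Net, the convolutional stand-in would be replaced by the two cascaded NAFBs sketched earlier.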

Figure S1. The reconstruction results of DCCM-Net trained with realistic datasets and simulated datasets. R.D.: realistic datasets; S.D.: simulated datasets. The corresponding scale bar is 20 μm. WF: wide field; GT: ground truth.

Figure S2. Illustration of the structural details of the deep unrolled recovery subnet in our DCCM-Net. s: scaling factor; $\mathcal{H}_{\text{ext}}$: extraction module; $\mathcal{H}_{\text{rec}}$: recovery module; NAFB: nonlinear activation-free block; Conv: convolution; LN: layer normalization.