Towards real-time image deconvolution: application to confocal and STED microscopy

Although deconvolution can improve the quality of images from any type of microscope, the high computational cost has so far limited its widespread adoption. Here we demonstrate the ability of the scaled gradient projection (SGP) method to provide accelerated versions of the algorithms most used in microscopy. To achieve further gains in efficiency, we also consider implementations on graphics processing units (GPUs). We test the proposed algorithms on both synthetic and real data from confocal and STED microscopy. Combining the SGP method with the GPU implementation, we achieve speed-up factors from about 25 to 690 (with respect to the conventional algorithm). The excellent results obtained on STED microscopy images demonstrate the synergy between super-resolution techniques and image deconvolution. Further, real-time processing preserves one of the most important properties of STED microscopy, i.e., the ability to provide fast sub-diffraction-resolution recordings.


Richardson-Lucy (RL) algorithm
Thanks to the PSF normalization of equation 1 (Main Text) and the modification introduced to account for the background contribution [1], the RL algorithm takes the following form: given x^{(0)} > 0, for k = 0, 1, 2, ... compute

(1)  x^{(k+1)} = x^{(k)} H^T( y / (H x^{(k)} + b) ),

where b denotes the background term.
Here, and in the following, the multiplication of two arrays/cubes indicates the component-wise product; similarly the fraction of two arrays/cubes indicates component-wise quotient.
If we remark that the gradient of the KL divergence is given by

∇_x f_0(x; y) = 1 − H^T( y / (H x + b) ),

where 1 indicates the array/cube with all entries equal to one, then equation 1 can be written as follows

x^{(k+1)} = x^{(k)} − x^{(k)} ∇_x f_0(x^{(k)}; y),

and therefore it is a scaled gradient method with scaling x^{(k)} at iteration k.
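As a concrete illustration, the RL iteration above can be sketched in a few lines of NumPy. The FFT-based circular convolution and all function names below are our own illustrative choices, not the paper's actual C++/CUDA implementation; H^T, being correlation with the PSF, is realized by multiplication with the conjugate OTF.

```python
import numpy as np

def convolve(x, otf):
    """Circular convolution via FFT; otf is the FFT of the (unit-volume) PSF."""
    return np.real(np.fft.ifftn(np.fft.fftn(x) * otf))

def rl_step(x, y, otf, b=0.0):
    """One Richardson-Lucy iteration with background b:
    x_{k+1} = x_k * H^T( y / (H x_k + b) ),
    where H^T v is correlation with the PSF, i.e. ifft(fft(v) * conj(otf))."""
    Hx = convolve(x, otf) + b
    return x * convolve(y / Hx, np.conj(otf))
```

With b = 0 and a PSF normalized to unit volume, each iterate conserves the flux of y, and the KL data term decreases monotonically.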

Split-Gradient method (SGM)
As concerns SGM for the minimization of the function f_β(x; y) = f_0(x; y) + β f_1(x) defined by equations 3 and 2 (Main Text), if U_1(x), V_1(x) is a pair of nonnegative arrays/cubes such that

(4)  ∇_x f_1(x) = V_1(x) − U_1(x),

then the algorithm is as follows: given x^{(0)} > 0, for k = 0, 1, 2, ... compute

(5)  x^{(k+1)} = x^{(k)} / (1 + β V_1(x^{(k)})) · [ H^T( y / (H x^{(k)} + b) ) + β U_1(x^{(k)}) ].

It is easy to see that this iteration can be written as a scaled gradient step:

x^{(k+1)} = x^{(k)} − x^{(k)} / (1 + β V_1(x^{(k)})) · ∇_x f_β(x^{(k)}; y).
Convergence of SGM can be obtained by a standard line-search strategy that ensures both the descent of the objective function f_β(x; y) and the nonnegativity of x^{(k+1)}. Expressions of the functions U_1(x), V_1(x) for a number of regularization functions are given, for instance, in [2,3,4]. For the convenience of the reader we provide a very simple argument justifying SGM. From the stationarity condition provided by the second Karush-Kuhn-Tucker (KKT) condition in the case of the nonnegativity constraint, using the decomposition of the gradient of f_1(x) given in equation 4, we obtain for a stationary point x* the condition

x* · ∇_x f_β(x*; y) = x* · [ 1 + β V_1(x*) − H^T( y / (H x* + b) ) − β U_1(x*) ] = 0,

which can be written as the following fixed point equation

x* = x* / (1 + β V_1(x*)) · [ H^T( y / (H x* + b) ) + β U_1(x*) ].

By applying the method of successive approximations we obtain the iterative algorithm of equation 5.
In the case of quadratic (or Tikhonov) regularization, i.e., f_1(x) = ‖x‖²/2, we have ∇_x f_1(x) = x ≥ 0, and the splitting V_1(x) = x, U_1(x) = 0 can be used in SGM. Then, using again the stationarity condition, we obtain for a stationary point x*

x* · [ 1 + β x* − H^T( y / (H x* + b) ) ] = 0,

or also

x* (1 + β x*) = x* H^T( y / (H x* + b) ).
If we write this relationship as the following fixed point equation

(11)  x* = x* / (1 + β x*) · H^T( y / (H x* + b) ),

by applying the method of successive approximations we obtain the SGM iteration for quadratic regularization: given x^{(0)} > 0, for k = 0, 1, 2, ... compute

x^{(k+1)} = x^{(k)} / (1 + β x^{(k)}) · H^T( y / (H x^{(k)} + b) ).
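A minimal NumPy sketch of this quadratically regularized SGM iteration follows; the helper `convolve` and the function names are our own illustrative choices (the FFT-based circular convolution is an assumption, not the paper's implementation). Note that setting beta = 0 recovers the plain RL step.

```python
import numpy as np

def convolve(x, otf):
    """Circular convolution via FFT; otf is the FFT of the (unit-volume) PSF."""
    return np.real(np.fft.ifftn(np.fft.fftn(x) * otf))

def sgm_tikhonov_step(x, y, otf, beta, b=0.0):
    """One SGM iteration for the KL data term plus (beta/2)||x||^2,
    using the splitting V1(x) = x, U1(x) = 0:
    x_{k+1} = x_k / (1 + beta x_k) * H^T( y / (H x_k + b) )."""
    Hx = convolve(x, otf) + b
    return x / (1.0 + beta * x) * convolve(y / Hx, np.conj(otf))
```

Since every factor is nonnegative, the iterates automatically remain nonnegative; descent of f_β is enforced in practice by the line search mentioned above.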
Since SGM and the Conchello-McNally algorithm are simple modifications of the RL algorithm, we expect their convergence to be slow. However, SGM can be accelerated by inserting its scaling into SGP.
It is interesting to show that this method can also be considered as an iterative method for solving the fixed point equation given in equation 11. Indeed, if we write this equation as the following pair of relations

(15)  u* = x* H^T( y / (H x* + b) ),   x* (1 + β x*) = u*,

and we consider the second equation as a second-degree algebraic equation in x* for given u*, we obtain

(16)  u^{(k+1)} = x^{(k)} H^T( y / (H x^{(k)} + b) ),   x^{(k+1)} = ( −1 + √(1 + 4β u^{(k+1)}) ) / (2β).

Therefore the EM method of [5], given the iterate x^{(k)}, consists in first updating u by means of the first relation and then updating x by means of the second relation, with u* replaced by u^{(k+1)}. We conclude by remarking that x*, as given by the second relation in equation 16, is the solution of the following denoising problem

(17)  x* = argmin_{x ≥ 0} { KL(u*; x) + (β/2) ‖x‖² }.

Therefore the Conchello-McNally algorithm can also be viewed as a RL iteration followed by a denoising of the iterate, assuming Poisson noise and quadratic regularization.
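The two relations of equation 16 translate directly into code. In this NumPy sketch (again with our own illustrative `convolve` helper and function names), the RL update is followed by the closed-form nonnegative root of the quadratic β x² + x − u = 0:

```python
import numpy as np

def convolve(x, otf):
    """Circular convolution via FFT; otf is the FFT of the (unit-volume) PSF."""
    return np.real(np.fft.ifftn(np.fft.fftn(x) * otf))

def conchello_mcnally_step(x, y, otf, beta, b=0.0):
    """RL update u = x * H^T( y / (H x + b) ), followed by the nonnegative
    root of beta x^2 + x - u = 0, i.e. the quadratic 'denoising' of the
    RL iterate (equation 16)."""
    Hx = convolve(x, otf) + b
    u = x * convolve(y / Hx, np.conj(otf))
    return (-1.0 + np.sqrt(1.0 + 4.0 * beta * u)) / (2.0 * beta)
```

For small β the root expands as u − β u² + O(β²), so the method smoothly reduces to RL as β → 0.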

Boundary effect correction
If the unknown object x extends up to the boundary of the image domain, so that the effect of the PSF is to generate an image which is not completely contained in the field-of-view of the microscope, then the previous deconvolution methods produce annoying boundary artifacts. Several methods have been proposed for removing this effect, but in this paper we focus on an approach proposed in [6] for application to the RL algorithm; it can also be easily extended to the case of regularized deconvolution. Here we give the equations in such a case, showing the application to SGP. The non-regularized algorithms can be obtained by setting β = 0. The idea is to reconstruct the object over a domain broader than that of the detected image and to merge, by zero padding, the arrays/cubes of the image and of the object into arrays/cubes with a dimension such that their Fourier transforms can be computed by means of the FFT. We denote by S̄ the set of values of the index labeling the pixels/voxels of the broader array/cube, by R that of the object array/cube and by S that of the image array/cube, so that S ⊂ R ⊂ S̄. It is obvious that also the PSF must be defined over S̄, and this can be done in different ways, depending on the specific problem one is considering. In microscopy the extension can in general be performed by zero padding. We point out that the PSF must be normalized to unit volume over S̄. The reconstruction of the object outside S is not reliable in most cases, but its reconstruction inside S is practically free of boundary artifacts, as shown in [6].
If we denote by M_R, M_S the arrays/cubes, defined over S̄, which are 1 over R, S respectively and 0 elsewhere, we introduce the following matrices H and H^T:

H u = M_S (A u),   H^T v = A^T (M_S v),

where A denotes the convolution operator over S̄ and u and v denote generic arrays/cubes defined over S̄. Both matrices can be easily computed by means of the FFT. Then, with these definitions, the data fidelity function is given again by equation 2 (Main Text), with S replaced by S̄, so that its gradient is now given by

∇_x f_0(x; y) = M_R [ α − H^T( y / (H x + b) ) ],

where α is the array/cube defined by

α = H^T 1 = A^T (M_S 1).

We remark that the array α appearing in this equation can be used for defining the reconstruction domain R. Indeed, thanks to the normalization of the PSF, each value α_j is ≤ 1 and is a measure of the fraction of the pixel/voxel j contributing to the image; it can be very small or zero for pixels/voxels of S̄, depending on the behaviour of the PSFs. Therefore, given a thresholding value σ (for instance σ = 10^{−2}), we use the following definition

R = { j ∈ S̄ : α_j ≥ σ }.

Then the SGM algorithm, with boundary effect correction, is given by

x^{(k+1)} = M_R x^{(k)} / (α + β V_1(x^{(k)})) · [ H^T( y / (H x^{(k)} + b) ) + β U_1(x^{(k)}) ],

so that each iterate is zero in the pixels/voxels outside R. It is easy to show that also this algorithm is a scaled gradient method. As concerns SGP, the boundary effect correction is introduced by modifying the scaling matrix as follows

D_k = diag( min{ L_2, max{ L_1, M_R x^{(k)} / (α + β V_1(x^{(k)})) } } ),

while all the other steps remain unchanged.
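The computation of α and of the reconstruction domain R can be sketched as follows. This is a minimal NumPy illustration under our own assumptions (circular FFT convolution over S̄, a PSF array already defined over S̄, hypothetical function names); the paper's actual implementation is not shown.

```python
import numpy as np

def reconstruction_mask(psf_bar, mask_S, sigma=1e-2):
    """Compute alpha = H^T 1 = A^T(M_S 1) over the broad domain S-bar,
    with A the (circular) convolution with the PSF normalized to unit
    volume over S-bar, and define R = { j : alpha_j >= sigma }."""
    otf = np.fft.fftn(psf_bar / psf_bar.sum())        # unit-volume PSF
    # A^T v is correlation with the PSF: multiply by conj(otf) in Fourier space
    alpha = np.real(np.fft.ifftn(np.fft.fftn(mask_S.astype(float)) * np.conj(otf)))
    return alpha, alpha >= sigma
```

By construction 0 ≤ α_j ≤ 1, with α_j ≈ 1 deep inside S and α_j → 0 far from it, which is exactly what makes the threshold σ a sensible way to delimit R.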

SGP implementation

Initialization
As far as the initial point x^{(0)} is concerned, a nonnegative image is required. The possible choices implemented in our code are:
• a constant image with pixel values equal to the background-subtracted flux of the noisy data divided by the number of pixels/voxels;
• any input image provided by the user.
For ensuring a feasible initial point, the given x^{(0)} is projected on the feasible region before starting the iterative minimization procedure. The constant image has been chosen for our numerical experiments; in the case of boundary effect correction, only the pixels/voxels in R are set equal to a constant value, while the remaining values of S̄ are set to zero. The constant value is normalized in such a way that the flux of the constant image over R coincides with the flux of the background-subtracted image over S.
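This normalization can be sketched in a few lines; the function name and array layout are our own illustrative choices:

```python
import numpy as np

def constant_initial_point(y, b, mask_R):
    """Constant initial image over R, zero elsewhere, normalized so that its
    flux over R equals the background-subtracted flux of y over S."""
    flux = float((y - b).sum())          # background-subtracted flux over S
    x0 = np.zeros(mask_R.shape)
    x0[mask_R] = flux / mask_R.sum()     # same constant on every pixel of R
    return x0
```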

Step-length selection
In the case of a non-scaled gradient method (D_k equal to the identity matrix) it is possible to prove that, if s^{(k−1)T} w^{(k−1)} > 0, then the inequality

0 < α_k^{(2)} ≤ α_k^{(1)}

holds for the BB step-lengths. For unconstrained quadratic minimization problems this property is exploited for designing effective alternation strategies between the two BB rules, able to improve significantly the convergence rate of gradient schemes. In particular, the alternation strategies are useful for generating step-lengths that better approximate the inverses of the eigenvalues of the objective Hessian, which is a crucial property in accelerating gradient methods [7,8]. Numerical evidence is available that confirms the efficiency of these alternated BB rules also in the case of non-linear constrained minimization problems [9,10,11,12].
When scaled gradient directions are exploited, the inequality α_k^{(2)} ≤ α_k^{(1)} cannot be proved for the step-lengths of equation 5; nevertheless, in our experiments this inequality has often been observed, and the alternation strategies proposed for the non-scaled methods still provide interesting convergence rate improvements [13,14,15]. For this reason, we have equipped our SGP version with the step-length alternation criterion proposed in [8], which is one of the most effective rules for non-scaled gradient methods. We provide a description of this criterion in the following.
First of all, the values produced by equation 5 (Main Text) are constrained into the interval [α_min, α_max]:

α_k^{(i)} ← min{ α_max, max{ α_min, α_k^{(i)} } },  i = 1, 2.

Then the following adaptive alternation is performed:

if α_k^{(2)} / α_k^{(1)} ≤ τ_k then
    α_k = min{ α_j^{(2)} : j = max{1, k − M_α}, ..., k };  τ_{k+1} = 0.9 τ_k;
else
    α_k = α_k^{(1)};  τ_{k+1} = 1.1 τ_k;
endif

where M_α is a prefixed positive integer and τ_1 ∈ (0, 1). In contrast to the criterion proposed in [8], here a variable threshold τ_k is exploited with the aim of avoiding the selection of the same rule for too many consecutive iterations. Furthermore, in our experience, the use of the BB values provided by equation 25 (which are generally lower than those provided by α_k^{(1)}) in the first iterations slightly improves the reconstruction accuracy; consequently, in the proposed SGP version we start the step-length alternation only after the first 20 iterations.
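The non-scaled BB rules and the alternation logic above can be sketched as follows. All names are our own, the rules shown are the plain (non-scaled) BB formulas, and the 0.9/1.1 threshold-update factors mirror the variable-threshold idea described above rather than any code published with the paper:

```python
import numpy as np

def bb_steplengths(s, w, a_min=1e-5, a_max=1e5):
    """Barzilai-Borwein rules (non-scaled case):
    a1 = s^T s / s^T w,  a2 = s^T w / w^T w, clamped to [a_min, a_max].
    When the curvature s^T w is not positive, fall back to a_max."""
    sw = float(s @ w)
    if sw <= 0:
        return a_max, a_max
    a1 = float(s @ s) / sw
    a2 = sw / float(w @ w)
    clamp = lambda a: min(max(a, a_min), a_max)
    return clamp(a1), clamp(a2)

def alternate(a1, a2, recent_a2, tau):
    """Adaptive alternation: take the smaller (BB2-type) value when
    a2/a1 <= tau, otherwise BB1, adjusting tau so that neither rule
    is selected for too many consecutive iterations."""
    if a2 / a1 <= tau:
        return min(recent_a2 + [a2]), 0.9 * tau
    return a1, 1.1 * tau
```

Note that when s^T w > 0, the Cauchy-Schwarz inequality gives a2 ≤ a1, which is the ordering stated above for the non-scaled case.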

Parameters setting
Even if the number of SGP parameters is certainly higher than that of the RL and SGM approaches, the large number of tests carried out in several applications led to an optimization of these values which provides the user with a robust approach, without the need of a problem-dependent parameter tuning. Some of these values have been fixed according to the original paper [13], such as the line-search parameters β and θ, which have been set equal to 10^{−4} and 0.4, respectively. Also most of the step-length parameters remained unchanged, namely α_0 = 1.3, τ_1 = 0.5, α_max = 10^5 and M_α = 3, while α_min has been set equal to 10^{−5}.
The main change with respect to [13] concerns the choice of the bounds (L_1, L_2) for the scaling matrices. While in the original paper the choice was a couple of fixed values (10^{−10}, 10^{10}), independent of the data, we decided to automatically adapt these bounds to the input image: we perform one step of the RL method and tune the parameters (L_1, L_2) according to the min/max values of the resulting image w, following the rule

if w_max / w_min < 50 then
    L_1 = w_min / 10;  L_2 = 10 · w_max;
else
    L_1 = w_min;  L_2 = w_max;
endif
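The rule above is a one-liner in code; this sketch uses a hypothetical function name and exposes the ratio 50 as a parameter for clarity:

```python
import numpy as np

def scaling_bounds(w, ratio=50.0):
    """Adapt the bounds (L1, L2) of the SGP scaling matrix to the min/max
    values of the image w produced by one RL step (rule described above)."""
    w_min, w_max = float(w.min()), float(w.max())
    if w_max / w_min < ratio:
        return w_min / 10.0, 10.0 * w_max
    return w_min, w_max
```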

Stopping rules
As mentioned in the main text, in the case of ML estimates both RL and SGP must not be pushed to convergence, and an early stopping of the iterations is required to achieve reasonable reconstructions. In our code, we introduced different stopping criteria, which can be adapted by the user according to his/her purposes:
• fixed number of iterations. The user can decide how many iterations of SGP must be done.
• convergence of the algorithm. In such a case, a stopping criterion based on the convergence of the objective function to its minimum value is introduced. Iteration is stopped when

| f_β(x^{(k)}; y) − f_β(x^{(k+1)}; y) | ≤ tol · | f_β(x^{(k)}; y) |,

where tol is a parameter that can be selected by the user.
• minimization of the reconstruction error. This criterion can be used in a simulation study. If one knows the object x̃ used to generate the synthetic image, then one can stop the iterations when the relative reconstruction error ‖x^{(k)} − x̃‖ / ‖x̃‖ reaches a minimum value. A very frequently used measure of error is given by the ℓ2 norm, i.e., ‖·‖ = ‖·‖_2, and this is the criterion implemented in our code.
• minimization of the Kullback-Leibler divergence of the reconstructed image x^{(k)} from the known object x̃. Also this criterion can be applied only in simulation studies.
• the use of a discrepancy criterion. In the case of real data, one can use a given value of some measure of the "distance" between the real data and the data computed by means of the current iterate. A recently proposed criterion consists in defining the "discrepancy function" for an image y of size N × N as follows [16]

(29)  D^{(k)} = (2 / N²) f_0(x^{(k)}; y),

and stopping the iterations when D^{(k)} = η, where η is a given number close to 1. Work is in progress to validate this discrepancy criterion, with the purpose of obtaining rules of thumb for estimating η in real applications.
The last stopping rule deserves a few comments. In [16] it is shown that, if x̃ is the object generating the noisy image y, then the expected value of f_0(x̃; y) is close to N²/2. This property is used to select a value of the regularization parameter when the image reconstruction problem is formulated as the minimization of the KL divergence with the addition of a suitable regularization term. This use is justified by evidence that in some important cases it provides a unique value of the regularization parameter. Moreover, it has also been shown that the quantity D^{(k)}, defined in equation (29), decreases for increasing k, starting from a value greater than 1. Therefore, it can be used as a stopping criterion. Preliminary numerical experiments described in that paper show that it can provide a sensible stopping rule, at least in simulation studies.
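Writing the generalized KL data term explicitly, the discrepancy function of equation (29) can be computed as below; the function name and the argument `model` (standing for H x^{(k)} + b) are our own illustrative choices:

```python
import numpy as np

def discrepancy(y, model):
    """Discrepancy D = (2/N^2) f0, where f0 is the generalized KL divergence
    f0 = sum( y*log(y/model) + model - y ) and model = H x + b."""
    f0 = np.sum(y * np.log(y / model) + model - y)
    return 2.0 * f0 / y.size
```

Since the generalized KL divergence is nonnegative and vanishes only when model = y, D^{(k)} = 0 for a perfect fit, while values near 1 signal a fit consistent with the Poisson noise level, which is what the stopping rule D^{(k)} = η exploits.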

Testing platform
This appendix gives some further details on our testing platform. As we mentioned, each one of the 4 GPUs in the workstation has 14 streaming multiprocessors. All these streaming multiprocessors can access the system's main global memory where, typically, the problem data are stored. The connection bus with the cores has a bandwidth of 144 GB/sec. The whole system's peak computing performance is over 2.5 Tflops in double precision arithmetic. Each GPU board is connected to the CPU with a 16x PCI-Express bus, which grants a 16 GB/sec transfer rate. It should be noticed that the CPU-to-GPU transfer speed is much slower than the internal GPU memory bandwidth: hence, to maximize the GPU benefits, it is very important to reduce the CPU-to-GPU memory communications and keep all the problem data in the GPU memory.
Nvidia provides a programming framework called CUDA (Compute Unified Device Architecture) for its GPUs. For further information see http://www.nvidia.com/cuda. By means of CUDA, it is possible to program a GPU using a C-like programming language. In this paradigm the CPU controls the instruction flow and the communications with the peripherals, and starts the single computing tasks on the GPU. The GPU, composed of a number of streaming multiprocessors, performs the raw computation tasks, using the problem data stored in the graphics memory. In our numerical experiments we used CUDA 4.3.
The full GPU-to-GPU bandwidth can be obtained only if a coalesced memory access scheme is used (see the Nvidia documentation [17]); this is the reason why our GPU computational kernels are implemented using that memory access pattern. Two CUDA kernel libraries are very important in our implementation of the SGP algorithm: CUFFT and CUBLAS. The former provides 1D, 2D and 3D FFTs (complex-to-complex, real-to-complex and complex-to-real versions are all available). The latter provides GPU implementations of the well-known level 1, 2 and 3 BLAS routines. These libraries are strongly recommended for maximizing GPU performance.
In the C ++ implementation, the FFT is provided by the well known FFTW library [18,19], a C library for computing the discrete Fourier transform (DFT) of multidimensional real or complex data of arbitrary size.
The additional component-wise pixel/voxel operations are localized, that is, the result in each pixel/voxel does not depend on that of any other: hence, those operations can be easily distributed on all the available processors, which makes them particularly suited for the GPU implementation. It must be noticed that the scalar products for updating the step-length α_k from (5) are a performance-critical part of the SGP algorithm, because they require a large number of inter-processor communications, and value dependencies prevent a straightforward parallelization. To face this bottleneck, we used the reduce function provided by the Thrust library [20], which generally achieves remarkable performance while retaining sufficient numerical stability.
Upcoming next-generation GPU architectures, such as Nvidia's Maxwell series, could further improve the CUDA computational performance for the largest-scale images, thanks to the improved CPU-GPU connection and the increased double-precision peak performance. However, for very large 3D reconstructions, the overall amount of memory local to each board will likely remain the limiting factor.