Fixed-point iterative linear inverse solver with extended precision

Solving linear systems, often accomplished by iterative algorithms, is a ubiquitous task in science and engineering. To accommodate the dynamic range and precision requirements, these iterative solvers are carried out on floating-point processing units, which are not efficient at handling large-scale matrix multiplications and inversions. Low-precision, fixed-point digital or analog processors consume only a fraction of the energy per operation compared with their floating-point counterparts, yet their current use excludes iterative solvers because of the cumulative computational errors that arise from fixed-point arithmetic. In this work, we show that a simple iterative algorithm, such as Richardson iteration, executed on a fixed-point processor can provide the same convergence rate as its floating-point counterpart and, when combined with residual iteration, achieve solutions beyond the processor's native precision. These results indicate that power-efficient computing platforms built from analog computing devices can solve a broad range of problems without compromising speed or precision.


I. INTRODUCTION
The problem of solving a large-scale linear system occurs in a broad range of science and engineering problems, such as image reconstruction [1]-[3], compressive sensing [4]-[6], and predictive control systems [7]-[9]. These problems aim to deduce the inputs, $x$, that generate the observations, $b$, given the physical model of a linear system, expressed by the matrix $A$. Due to the scale of the system model and the noise in the measurement process, direct inversion of the observations $b$ is usually impractical and, in many cases, impossible. As a result, many inverse algorithms rely on iterative refinements to approximate the solution [10]. Assuming that the measurements are dominated by Gaussian noise, finding the solution $\hat{x}$ given the observations $b$ amounts to minimizing the objective function [11],

$f(x) = \frac{1}{2} |b - A x|_2^2$, (1)

where $|\cdot|_2$ denotes the $\ell_2$ norm. The solver starts with an initial guess, $x_0$, and moves along the gradient direction, $\partial f / \partial x$, in steps of size $\mu$ to update the estimate. Using the objective function in Eq. (1), the update from step $k$ to $k+1$ can be expressed as,

$x_{k+1} = x_k + \mu A^T (b - A x_k)$. (2)

Low-precision matrix calculations can introduce cumulative error into the solution and stall the iterative process. This work studies the properties of an iterative linear inverse algorithm in which the matrix computations are performed with fixed-point arithmetic. Using residual iterations for error correction, the precision of the fixed-point Richardson iteration can in theory be extended to any desired level. Similar methods have previously been proposed for correcting the noise-induced errors in hybrid analog-digital processors [29], [30], but they lack a quantitative analysis of the error and convergence, especially for fixed-point solvers.

This paper is organized as follows. In Section 2, we introduce two theorems describing the convergence criteria and error of a fixed-point Richardson solver. The theorems have two major implications: 1) the fixed-point solver does not compromise the rate of convergence, and 2) using residual iterations, the fixed-point solver can reach solutions beyond its native precision limit. We verify the two theorems with three examples, namely, matrix inversion, image deconvolution, and tomographic reconstruction, in Section 3. Section 4 describes a prototype fixed-point iterative solver implemented on a field-programmable gate array (FPGA), which is more than two times faster than a CPU yet reaches the solution in the same number of iterations as a floating-point processor. Section 5 concludes the paper.

A. Fixed-point format
In the fixed-point data format, a vector, matrix, or tensor is represented by an array of signed mantissas, while a global exponent is shared among all the elements. Each element, $x_i$, of an $N$-element array $x \in \mathbb{R}^N$ is represented in the $b$-bit fixed-point form, $\tilde{x}_i$, as,

$\tilde{x}_i = \mathrm{sign}(x_i) \times \mathrm{mant}_i \times 2^{\mathrm{expo}-(b-1)}$. (4)

Here expo denotes the exponent, indicating the number of bits above the binary point. To avoid overflow, the exponent is determined by the maximum element of the array, $\mathrm{expo} = \lceil \log_2 \max_i |x_i| \rceil$, where $\lceil \cdot \rceil$ denotes rounding to the nearest integer above. The $(b-1)$-bit mantissas, $\mathrm{mant}_i$, are calculated as,

$\mathrm{mant}_i = \lfloor |x_i| \times 2^{(b-1)-\mathrm{expo}} \rfloor$, (5)

where $\lfloor \cdot \rfloor$ denotes rounding to the nearest integer below.
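As a concrete illustration, the conversion of Eqs. (4)-(5) takes only a few lines of NumPy. This is a sketch, not the paper's implementation: the helper name and the clamp guarding the mantissa range are our own additions.

```python
import numpy as np

def to_fixed_point(x, bits=8):
    """Shared-exponent fixed-point quantization sketch after Eqs. (4)-(5).

    Returns integer mantissas, the shared exponent, and the value of one
    mantissa step. The function name and the clamp are illustrative.
    """
    expo = int(np.ceil(np.log2(np.max(np.abs(x)))))   # expo = ceil(log2 max|x_i|)
    scale = 2.0 ** (expo - (bits - 1))                # value of one mantissa LSB
    mant = np.floor(np.abs(x) / scale)                # (bits-1)-bit magnitude
    mant = np.minimum(mant, 2 ** (bits - 1) - 1)      # guard against overflow
    return (np.sign(x) * mant).astype(np.int64), expo, scale

x = np.array([0.7, -1.3, 0.05])
mant, expo, scale = to_fixed_point(x, bits=8)
x_tilde = mant * scale    # reconstructed values; |x - x_tilde| <= scale
```

Because every element shares one exponent, small elements (0.05 here) lose relatively more precision than large ones, which is exactly why the residual iterations later re-center the exponent on the shrinking residual.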
The multiplication between a fixed-point matrix $\tilde{A}$ and a fixed-point vector $\tilde{x}$ can be simplified to integer arithmetic between the mantissas, accompanied by bit-shifting to match the exponent of the result. Compared with the floating-point matrix-vector product $y = A x$, the $\ell_2$ norm of the error, $\delta$, of the fixed-point matrix-vector product $\tilde{y} = \tilde{A} \tilde{x}$ is bounded by,

$|\delta|_2 = |y - \tilde{y}|_2 \le \varepsilon\, |A|_2\, |x|_2$, (7)

where $|A|_2$ denotes the $\ell_2$ operator norm. An upper bound on the coefficient $\varepsilon$, derived in Appendix I, is $\varepsilon_x + \varepsilon_A + 3 \varepsilon_x \varepsilon_A$. The coefficients $\varepsilon_x$ and $\varepsilon_A$, defined as

$\varepsilon_x = |x - \tilde{x}|_\infty / |x|_\infty, \quad \varepsilon_A = |A - \tilde{A}|_\infty / |A|_\infty$, (8)

represent the normalized rounding errors of the fixed-point vector and matrix, respectively. Both $\varepsilon_x$ and $\varepsilon_A$ can be determined from the bit widths and exponents of the arrays. In most cases, $\varepsilon$ is much lower than this upper bound, depending on the distribution of the elements in $A$ and $x$.
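A minimal sketch of the mantissa-only product: the MACs run on integers, and a single rescale by both exponents (the software analogue of the bit shift) restores the magnitude. The quantizer and the 8-bit width are illustrative assumptions, not the paper's code.

```python
import numpy as np

def fxp(x, bits=8):
    """Shared-exponent quantization (illustrative helper, not the paper's code)."""
    expo = int(np.ceil(np.log2(np.max(np.abs(x)))))
    scale = 2.0 ** (expo - (bits - 1))
    mant = np.sign(x) * np.minimum(np.floor(np.abs(x) / scale), 2 ** (bits - 1) - 1)
    return mant.astype(np.int64), scale

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 32))
x = rng.standard_normal(32)

mA, sA = fxp(A)
mx, sx = fxp(x)
y_tilde = (mA @ mx) * (sA * sx)   # integer MACs, then one exponent-matching rescale

# Normalized error of the fixed-point product vs. the floating-point product.
rel = np.linalg.norm(A @ x - y_tilde) / np.linalg.norm(A @ x)
```

For a random 8-bit instance like this, the observed normalized error sits at the percent level, well under the worst-case coefficient of Appendix I.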

B. Error and convergence criteria of fixed-point Richardson iteration
The solution $x^*$ to a linear system $A x = b$ is unique if the matrix $A$ has full rank. To ensure that the intermediate estimates $\{x_k\}$ of the Richardson iteration converge to the solution $x^*$ as the iteration number $k \to \infty$, all the eigenvalues of the matrix $(I - \mu A^T A)$ must fall within $(-1, 1)$. Since the largest and smallest eigenvalues are $1 - \mu |A^T A|_2/\kappa$ and $1 - \mu |A^T A|_2$, respectively, where the operator norm $|A^T A|_2$ equals the largest eigenvalue of $A^T A$ and $\kappa$ is the condition number of $A^T A$, the choice of the step size, $\mu$, is confined by:

$0 < \mu < 2 / |A^T A|_2$. (9)

In practice, to ensure convergence while expediting the iterative solver, the maximum step size, $\mu_{\max}$, is often selected with a safety margin $0 < \eta \ll 1$ as follows:

$\mu_{\max} = (2 - \eta) / |A^T A|_2$. (10)

For fixed-point Richardson iterations, Theorem 1 below summarizes the error and convergence criteria based on the fixed-point matrix multiplication error $\varepsilon$.

Theorem 1: For a linear system $A x = b$ solved using fixed-point Richardson iterations (Eq. (3)) with step size $\mu$, under the convergence conditions

$0 < \mu \le \mu_{\max}, \quad \varepsilon < 2/\kappa$, (11)

the asymptotic error, $E$, between the estimate, $x_k$, and the solution, $x^*$, is bounded by:

$E = \lim_{k \to \infty} |x^* - x_k|_2 / |x^*|_2 \le \kappa\, \varepsilon$, (12)

where $\kappa$ is the condition number of $A^T A$, and $\varepsilon$ is the normalized $\ell_2$ error of the two matrix-vector multiplications in each iteration.

Theorem 1 is proved in Appendix II by examining the $\ell_2$ norm of the error of the intermediate estimates, $|x^* - x_k|_2$, as $k \to \infty$. The proof shows linear convergence with rate $|I - \mu A^T A|_2$, which is independent of the error of the fixed-point matrix-vector product, $\varepsilon$. Because a typical problem involves a value of $\kappa$ on the order of $10^2$ to $10^3$, using the step size $\mu_{\max}$ in Eq. (10) gives the convergence rate

$r = (2 - \eta) / \kappa$. (13)
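The floating-point baseline of the step-size rule with safety margin $\eta$ (Eq. (10)) can be checked with a short experiment on a matrix with a controlled spectrum; the spectrum, size, and $\eta = 0.2$ below are illustrative choices, not the paper's test matrices.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 40
U, _ = np.linalg.qr(rng.standard_normal((n, n)))
V, _ = np.linalg.qr(rng.standard_normal((n, n)))
s = np.linspace(1.0, 5.0, n)          # singular values: kappa(A^T A) = 25
A = U @ np.diag(s) @ V.T
x_true = rng.standard_normal(n)
b = A @ x_true

eta = 0.2
mu = (2.0 - eta) / s[-1] ** 2         # mu_max = (2 - eta) / |A^T A|_2

x = np.zeros(n)
for _ in range(500):                  # contraction factor 1 - (2 - eta)/kappa
    x = x + mu * (A.T @ (b - A @ x))  # Richardson update of Eq. (2)

rel_err = np.linalg.norm(x - x_true) / np.linalg.norm(x_true)
```

With $\kappa = 25$ the error shrinks by a factor of roughly $1 - 1.8/25 \approx 0.93$ per step, so 500 iterations drive the floating-point error down to machine-precision levels.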

C. Fixed-point residual iteration
The asymptotic error $E$ in Theorem 1 is independent of the solution $x^*$, which implies that the fixed-point iterative solver can stagnate at an estimate, $x_\infty$, that is closer to the solution $x^*$ in $\ell_2$ distance if $x^*$ has a lower $\ell_2$ norm. Hence, constructing a sequence of fixed-point iterative solvers whose solutions have decreasing $\ell_2$ norms is the natural way to approach the unbiased solution of the inverse problem.
The proposed residual iteration algorithm, described in Algorithm 1, leverages the adjustable exponent of the fixed-point format. The algorithm contains $m$ sets of fixed-point Richardson iterations as inner loops. The acceptance criterion for an inner-loop solution (Eq. (14)) is that its normalized residual falls below a small tolerance $\tau$ on the order of the asymptotic error $E$. Let $x^{(j)} = \sum_{i=1}^{j} \Delta x^{(i)}$ denote the solutions accumulated from $j$ fixed-point Richardson solvers. The algorithm reduces the norm of the residual $r^{(j)} := b - A x^{(j)}$ at each outer loop $j$ by recursively solving the linear system $A\,\Delta x = r^{(j)}$, with updated exponents, in the next inner loop of fixed-point Richardson iterations.
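Algorithm 1 itself is not reproduced in this excerpt, so the sketch below follows the description in the text: each outer pass re-solves $A\,\Delta x = r$ with a quantized inner Richardson loop whose shared exponent is re-derived from the shrinking residual. The quantizer, loop counts, matrix, and margin are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, bits = 30, 8
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
A = Q @ np.diag(np.linspace(1.0, 3.0, n)) @ Q.T   # SPD test matrix, kappa(A^T A) = 9
x_true = rng.standard_normal(n)
b = A @ x_true

mu = (2.0 - 0.2) / 3.0 ** 2          # step size with safety margin eta = 0.2
x_acc = np.zeros(n)                  # accumulated solution sum_j dx^(j)
for _ in range(5):                   # outer residual loop
    r = b - A @ x_acc                # residual system A dx = r
    # One shared exponent per inner loop, set by the current residual magnitude.
    scale = 2.0 ** (np.ceil(np.log2(np.max(np.abs(r)))) - (bits - 1))
    dx = np.zeros(n)
    for _ in range(300):             # fixed-point Richardson inner loop
        g = A.T @ (r - A @ dx)
        g = np.round(g / scale) * scale   # quantize the gradient to 8-bit steps
        dx = dx + mu * g
    x_acc = x_acc + dx               # each pass extends the effective precision

rel_err = np.linalg.norm(x_acc - x_true) / np.linalg.norm(x_true)
```

Each inner loop stagnates at its own 8-bit floor, but because the next loop re-scales the exponent to the now-smaller residual, the accumulated solution keeps gaining digits, which is the mechanism behind Theorem 2.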
It is worth noting that similar precision-refinement methods [31], [32] have been employed in iterative solvers using mixed full- and half-precision floating-point data formats. Yet, floating-point precision refinement typically requires far fewer residual loops, $m$, than Algorithm 1. In addition, the floating-point format allows a different exponent for every element in the matrix-vector product. As a result, the error analysis in floating-point precision refinement [33] does not apply to fixed-point residual iterations. Here we present the upper bound of the normalized error of the solution $x^{(m)}$ after $m$ fixed-point residual iterations in Theorem 2.
Theorem 2: For a full-rank linear system $A x = b$ with its solution $x^*$ obtained from the residual iteration algorithm (Algorithm 1), the asymptotic error of the estimate after $m$ residue updates, $x^{(m)}$, is bounded by:

$|x^* - x^{(m)}|_2 / |x^*|_2 \le E^m$, (15)

where $E$ is the asymptotic error of the solution obtained from the fixed-point Richardson solver (Eq. (12)).
Theorem 2 is established from the decreasing $\ell_2$ norm of the solution of each fixed-point Richardson inner loop. The upper bound of the solution error after $m$ residue updates can be derived recursively from Eq. (12) in Theorem 1. The detailed proof is presented in Appendix III.
For Eq. (15) to be valid, the exponents of $r^{(j)}$ and $A^T r^{(j)}$ should be updated after every iteration of the inner loop. In the actual implementations shown in Section 3, due to the computational cost of the distribution-based estimation of the infinity norm (Eq. (19)), the exponents are adjusted only after the completion of each inner loop. As a result, the errors may not always satisfy the bound in Eq. (15).
Fig. 1(c) plots the normalized errors, relative to the analytical inverses $A_1^{-1}$ and $A_2^{-1}$, in log scale as the iterations progress. The asymptotic errors, $E$, of the 8-bit iterations are 0.21 and 0.083 for the inversions of $A_1$ and $A_2$, respectively, both of which are consistent with the error bounds given by Eq. (12). The convergence rate, $r$, is obtained from an exponential fit of Fig. 1(c) with a 95% confidence bound. For the inversions of $A_1$ and $A_2$, $r$ is 0.040±0.001 and 0.085±0.004, respectively; the rates are inversely proportional to $\kappa$, as predicted by Eq. (13). For the same condition number, Fig. 2 compares the fixed-point Richardson iterations at 8-bit, 7-bit, and 6-bit precision with the floating-point iterations. The convergence rates are summarized in Table 2, which confirms the implication of Eq. (13) that the fixed-point iterative solver maintains the same convergence rate, $r$, as its floating-point counterpart. The independence of the convergence rate from the precision indicates that fixed-point iterative solvers do not compromise convergence speed. The asymptotic errors of the 8-bit, 7-bit, and 6-bit solutions are 0.083, 0.18, and 0.33, respectively, all below the upper bound established in Eq. (12).

B. Image deconvolution
Deconvolution is a classic linear inverse problem with broad applications in signal and image filtering and processing. For a linear, shift-invariant system, the convolution output, $y$ (i.e., the filtered signal or blurred image), is the discrete convolution between the input signal or image $x$ and the kernel $k$, i.e., $y = k * x$, which is defined elementwise as:

$y_{m,n} = \sum_{p=0}^{P-1} \sum_{q=0}^{Q-1} k_{p,q}\, x_{m-p,\, n-q}$, (16)

where $P$ and $Q$ are the sizes of the kernel along the row and column directions, respectively. Here, we have assumed a two-dimensional convolution, as in the case of image processing. This example demonstrates iterative Richardson-Lucy deconvolution, which recovers images from blurry measurements given the convolution kernel.
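Eq. (16) translates directly into code. The brute-force version below (zero outside the image, which is an assumption about the boundary handling) is only meant to pin down the indexing convention:

```python
import numpy as np

def conv2d(x, k):
    """Direct 2-D discrete convolution per Eq. (16); zero outside the image."""
    M, N = x.shape
    P, Q = k.shape
    y = np.zeros((M, N))
    for m in range(M):
        for n in range(N):
            for p in range(P):
                for q in range(Q):
                    if 0 <= m - p < M and 0 <= n - q < N:
                        y[m, n] += k[p, q] * x[m - p, n - q]
    return y
```

A delta kernel at $(p, q) = (1, 0)$ simply shifts the image down by one row, which is a quick sanity check of the sign convention in Eq. (16).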
The discrete convolution can be formulated as the product between the vectorized input, $x$, and a Toeplitz matrix, $T$, constructed from the kernel [34], producing the output $y = T x$ in vectorized form. Using the convolution theorem, Eq. (16) can be rewritten as

$T = F^\dagger\, \mathrm{diag}(\tilde{k})\, F$, (17)

where $F$ and $F^\dagger$ represent the forward and inverse two-dimensional Fourier transforms, both of which are unitary operators, $\dagger$ denotes the conjugate transpose of a complex matrix, and $\mathrm{diag}(\tilde{k})$ is the diagonal matrix constructed from the Fourier transform of the kernel, $\tilde{k}$. The condition number, $\kappa$, of $T^T T$ can be obtained from the maximum and minimum elements of $\mathrm{diag}(\tilde{k})^\dagger \mathrm{diag}(\tilde{k})$.

Fig. 3 shows an example based on Richardson-Lucy deconvolution. The input, $x^*$, is a test image downsampled from the MATLAB image "Saturn", shown in Fig. 3(a). The image contains 128×102 pixels in signed 4-bit fixed-point format with range (-1, 1). The convolution kernel, $k$, follows the Gaussian profile

$k_{p,q} = k_0 \exp\!\left(-\frac{p^2 + q^2}{2\sigma^2}\right)$, (18)

where the parameter $\sigma$ determines the width of the kernel, and $k_0$ is a normalization constant that ensures $\sum_{p,q} |k_{p,q}| = 1$. If $\sigma$ is increased beyond 0.90, $\kappa$ would violate the criteria in Eq. (11) and lead to a non-converging deconvolution result. A similar phenomenon has been reported in hybrid analog-digital solvers [35].
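Because $T$ is diagonalized by the Fourier transform, $\kappa$ of $T^T T$ comes straight from the power spectrum of the kernel. A sketch under periodic (circulant) boundary conditions, with an illustrative grid size and kernel width:

```python
import numpy as np

N, sigma = 64, 0.8                        # illustrative grid size and kernel width
ax = np.arange(N) - N // 2
xx, yy = np.meshgrid(ax, ax)
k = np.exp(-(xx ** 2 + yy ** 2) / (2 * sigma ** 2))
k /= np.abs(k).sum()                      # normalization: sum |k_pq| = 1

k_hat = np.fft.fft2(np.fft.ifftshift(k))  # eigenvalues of the circulant T
power = np.abs(k_hat) ** 2                # spectrum of T^T T
kappa = power.max() / power.min()         # condition number of T^T T
```

Wider kernels suppress the high-frequency tail of $\tilde{k}$ and inflate $\kappa$, which is exactly the trend across the four kernels in Fig. 3(c).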
The measurements, shown in Fig. 4(a), are calculated by convolving the input image, $x^*$, with the four kernels in Fig. 3(b), and then quantized to 8 bits to simulate the digitization error of a CCD or CMOS camera. The deconvolutions use 8-bit fixed-point iterations, with $\varepsilon$ estimated to be 0.014. Step sizes with the safety margin $\eta = 0.2$ are used for all four iterative deconvolutions. Fig. 4(b) plots the deconvolved images after 200 iterations. The asymptotic errors, $E$, are summarized in Table 3 and are all below the error bound predicted by Eq. (12).

C. Tomographic reconstruction with residual iteration
Theorem 2 indicates that residual iteration can overcome the stagnation of the fixed-point Richardson solver and reach a solution beyond the fixed-point limit. To verify the performance of the residual iterations, we examine the tomographic reconstruction of a 16×16 "Super Mario" pixel art from its pencil-beam projections, shown in Fig. 5(a). The pixels of the input image, $x^*$, are stored in signed 3-bit fixed-point format with range (-1, 1), giving a quantization interval of 0.25 between two adjacent intensity levels. The measurements consist of 60 projections between 0 and 180º at 4º intervals, each containing 31 uniformly spaced beams. The system matrix, $A$, of the tomographic model is constructed using the distance-driven method [36]. Fig. 5(b) plots the eigenvalue spectrum of the tomographic model. The condition number, $\kappa$, of $A^T A$ is 101.80. Assuming that the maximum step size, $\mu_{\max}$, is used in the Richardson iterations, the computation error $\varepsilon$ must be less than 0.020 to ensure convergence, according to Eq. (11). For the size and sparsity of this system matrix, we find that a minimum of 8 bits (with an estimated $\varepsilon = 0.015$) is required for the fixed-point iterations to converge.

Fixed-point Richardson and residual iterations with 8-bit, 9-bit, and 10-bit precision are performed to reconstruct the "Super Mario" pixel art from the tomographic projections. Step sizes with safety margin $\eta = 0.3$ are used in all the solvers. The system matrix $A$ is quantized to the solver precision with expo = 0. The measurement, $b$, is serialized to a vector, multiplied by $A^T$, and quantized to the solver precision with expo = 0 to pre-calculate the fixed-point vector $A^T b$ for the iterations. The fixed-point Richardson iteration sets the exponent of the multiplication between $A^T A$ and $x_k$ to 2. For residual iterations, the exponents of the matrix-vector multiplications are decremented by 1 every 80 steps, from 2 to -2.
Fig. 6 plots the reconstructions from the Richardson and residual iterations with 8-bit, 9-bit, and 10-bit precision. Table 4 summarizes the normalized errors of the fixed-point Richardson reconstructions, all of which are below the error bound given by Eq. (12). The normalized error is plotted as a function of the Richardson iteration number in Fig. 6(c1); note that the same convergence rate is present regardless of the precision of the solver. The normalized errors of the residual iterations are shown in Fig. 6(c2). After $m = 5$ residue updates, the 8-bit, 9-bit, and 10-bit solvers all reach normalized errors below 0.1. At this level of error, the pixelwise differences between $x^{(5)}$ and $x^*$ are below the quantization interval, 0.25, of the true image $x^*$. Therefore, the reconstructions in Fig. 6(b1-b3) would exhibit no visual differences from the true image if displayed in a 3-bit grayscale colormap.

Faster convergence is possible via adaptive adjustment of the exponents of $r^{(j)}$ and $A^T r^{(j)}$. Fig. 7 shows the error of the 8-bit residual iteration executed on the prototype solver detailed in Section 4. The exponent is updated every 5 steps based on the distribution of the elements in a fixed-point array $v$. The exponents are calculated as:

$\mathrm{expo} = \lceil \log_2 \left( |\mu_v| + c\, \sigma_v \right) \rceil$, (19)

where $\mu_v$ and $\sigma_v$ denote the mean and standard deviation, respectively, of all the elements in $v$, and $c$ is a constant that sets the coverage of the element distribution. The adaptive residual iteration achieves the same convergence speed as the floating-point solver, with both methods reaching errors below the quantization interval of the true solution after 28 iterations.
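Eq. (19) estimates the infinity norm of an array from its element statistics instead of an exact max scan. A minimal sketch; the coverage constant $c$ is not fixed in this excerpt, so $c = 3$ below is our assumption (covering nearly all elements of a roughly Gaussian distribution):

```python
import numpy as np

def adaptive_expo(v, c=3.0):
    """Distribution-based exponent estimate after Eq. (19); c = 3 is assumed."""
    span = abs(np.mean(v)) + c * np.std(v)   # statistical proxy for max |v_i|
    return int(np.ceil(np.log2(span)))

v = np.array([0.5, -0.5, 0.25, -0.25])
e = adaptive_expo(v)    # exponent that covers the bulk of the distribution
```

Compared with scanning for the true maximum, the mean-and-deviation estimate is cheap to update incrementally, which is why the prototype applies it only every 5 iterations.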

IV. PROTOTYPE FIXED-POINT ITERATIVE SOLVER
We have built a hardware prototype that performs the fixed-point Richardson iteration and exponent adjustment. The core component of the solver, i.e., the fixed-point matrix multiplication, is implemented on an FPGA development board (Xilinx ZC706) that communicates with the host PC via the PCIe bus. Fig. 8 illustrates the block diagram of the logic design within the FPGA. The systolic multiplier array performs 512 MAC operations in a single clock cycle. The inputs to the multiplier array consist of a 16×16 block $M$ and a 16×2 block $V$; both input blocks are in signed 8-bit precision. The multiplication outputs are cached in signed 32-bit precision and wired to a 16×2 array of accumulators. For a matrix-vector multiplication involving a matrix larger than 16×16, the matrix is zero-padded to an integer number of 16×16 blocks $M$, and the batch vector is zero-padded, along the row dimension only, to an integer number of 16×2 blocks $V$. All the decomposed blocks $M$ and $V$ are stored in cache. The multiplier array reads the $M$ blocks along the rows and the $V$ blocks along the columns. The 16×2 result blocks are summed in the accumulators. After completing one row of $M$ blocks, the multiplier array moves to the next row of $M$ blocks and cycles through all the blocks of $V$ again. The host PC converts the matrix and vector into fixed point, and then decomposes the mantissas into 16×16 $M$ blocks and 16×2 $V$ blocks. The decomposed $M$ and $V$ blocks are streamed to the FPGA through PCIe. Once all the blocks are multiplied and accumulated in the FPGA, the results are streamed back to the host PC in 32-bit integer format. The host PC masks the results back to 8 bits, with the most significant bit (MSB) and least significant bit (LSB) selected by the exponents. Every five iterations, the host PC updates the exponents according to Eq. (19).
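The block dataflow described above can be emulated in software to check that zero-padding and per-block-row accumulation reproduce the exact integer product. The helper name and in-memory block handling are ours; the real design streams blocks over PCIe rather than holding them in arrays.

```python
import numpy as np

def blocked_matvec(M, V, br=16, bc=16):
    """Emulate the systolic dataflow: pad M to 16x16 blocks and V to 16x2
    blocks, multiply block by block, and sum partial products per block row."""
    R = -(-M.shape[0] // br) * br            # padded row count (ceil division)
    C = -(-M.shape[1] // bc) * bc            # padded column count
    Mp = np.zeros((R, C), dtype=np.int64)
    Mp[:M.shape[0], :M.shape[1]] = M
    Vp = np.zeros((C, V.shape[1]), dtype=np.int64)
    Vp[:V.shape[0], :] = V
    out = np.zeros((R, V.shape[1]), dtype=np.int64)  # 32-bit accumulators on chip
    for i in range(0, R, br):                # one row of M blocks at a time
        for j in range(0, C, bc):            # cycle through the V blocks
            out[i:i + br] += Mp[i:i + br, j:j + bc] @ Vp[j:j + bc]
    return out[:M.shape[0]]

rng = np.random.default_rng(3)
M = rng.integers(-128, 128, size=(40, 50))   # signed 8-bit mantissa range
V = rng.integers(-128, 128, size=(50, 2))
y = blocked_matvec(M, V)
```

Since the accumulation is exact integer arithmetic, the blocked result matches the unblocked product bit for bit; quantization enters only in the mantissa conversion on the host side.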
For the FPGA design in Fig. 8 implemented on the ZC706 evaluation board, the Xilinx synthesis and implementation tools estimate a power consumption of 2.047 W, of which 1.253 W is consumed by the transceivers for PCIe communication with the host PC, and the remaining 0.794 W is consumed by the logic, including the clocks, the systolic multiplier array, the BRAM cache, and the control units that generate the addresses of the $M$ and $V$ blocks. The clock rate is 250 MHz. Considering the 512 MACs completed within a single clock cycle, the energy consumption of the fixed-point computation is 6.2 pJ/MAC, which is of the same order as that of in-datacenter TPUs [37].
We have tested the speed of the iterative solver implemented on our hardware prototype and on a CPU (Intel Core i3-4130, dual-core, 3.40 GHz clock rate). The total CPU time of the fixed-point Richardson iterations, excluding exponent adjustment, is measured and then divided by the total number of iterations in all residual updates to obtain the time per iteration, which is indicative of the calculation speed. The fixed-point solver takes 1.7±0.2 ms per iteration on the CPU, while the FPGA prototype takes 0.76±0.05 ms per iteration, more than a twofold improvement in speed. Errors represent the variance of the calculation time over five repeated tests.

V. CONCLUSION
We have demonstrated a fixed-point iterative solver that computes high-precision solutions for linear inverse problems beyond the precision limit of the hardware. We have described the convergence and error of the fixed-point iterations in two theorems. Theorem 1 shows that the convergence rate is independent of the precision of the solver, which implies that fixed-point iteration does not compromise the convergence speed of the solver. Theorem 2 resolves the stagnation of a fixed-point solver with residual iterations, which correct the error and refine the solution to a higher precision level. We have presented three examples of linear inverse problems, namely, matrix inversion, Richardson-Lucy deconvolution, and tomographic reconstruction, to verify both theorems.
The combination of residual iteration with adaptive exponent adjustment achieves the same convergence rate and error as a floating-point Richardson solver. We have prototyped the fixed-point solver with a fixed-point systolic multiplier array on the FPGA and fixed-point arithmetic libraries on the host PC. The prototype solver achieves more than twice the calculation speed of a CPU and takes the same number of iterations to reach the solution as its floating-point counterpart. We envision that a low-precision, analog matrix processing core can supersede the current FPGA systolic array in the next generation of fast, energy-efficient solvers for numerous large-scale, high-performance computing applications, including but not limited to convex optimization, differential equations, and the training of artificial neural networks.

APPENDIX I
We derive the upper bound of the normalized error, $\varepsilon$, of the fixed-point matrix-vector multiplication. Based on Eq. (5), for an $N$-dimensional vector, $x$, the infinity norm falls in the range:

$2^{\mathrm{expo}-1} < |x|_\infty \le 2^{\mathrm{expo}}$. (22)

For an $M \times N$ matrix $A$, its operator norm is used in the error analysis. Using the same fixed-point conversions (Eq. (4)-(6)), the infinity norm of a fixed-point matrix, $|A|_\infty$, falls in an analogous range. The infinity norm of the rounding error is determined by both the bit width and the exponent, and falls in the range:

$0 \le |x - \tilde{x}|_\infty \le 2^{\mathrm{expo}-(b-1)}$. (25)

We introduce the rounding errors of the fixed-point matrix, $\Delta A = A - \tilde{A}$, and of the vector, $\Delta x = x - \tilde{x}$. Using Eq. (22) and Eq. (25), the $\ell_2$ norm of the product error can be bounded term by term, which yields the coefficient $\varepsilon_x + \varepsilon_A + 3 \varepsilon_x \varepsilon_A$ quoted in Section 2.

APPENDIX II

Assuming the initial guess, $x_0$, the $\ell_2$ norm of the error at iteration $k$ (Eq. (30)) consists of two terms: the first term is the asymptotic error, and the second term is a function of the iteration step $k$. To ensure that the error decays as the iteration progresses, the convergence criteria of the Richardson iteration must be satisfied (Eq. (31)). The largest and the smallest eigenvalues of $(I - \mu A^T A)$ are $1 - \mu |A^T A|_2/\kappa$ and $1 - \mu |A^T A|_2$, respectively. Plugging $|I - \mu A^T A|_2$ into the convergence criteria (Eq. (31)) confirms Eq. (11). Using the maximum eigenvalue $|I - \mu A^T A|_2 = 1 - \mu |A^T A|_2/\kappa$ in the first term of Eq. (30) gives the upper bound of the asymptotic error $E$ in Eq. (12). □

The two test matrices, $A_1$ and $A_2$, shown in Fig. 1(a), are constructed from the 4×4 discrete cosine transform matrix by applying different linear scaling factors to its eigenvalues. The condition numbers, $\kappa$, of $A_1$ and $A_2$ are 25.0 and 11.1, respectively. The analytical inverses, $A_1^{-1}$ and $A_2^{-1}$, shown in Fig. 1(b), are calculated with floating-point LU decomposition.

Fig. 1 .
Fig. 1. Two 4×4 matrices (a) and their inverses (b) in the fixed-point iterative matrix-inversion solver. $A_1$ has a $\kappa$ of 25.0, and $A_2$ has a $\kappa$ of 11.1. All matrices are stored in floating-point format but displayed with two decimal places. (c) Normalized error of the iterative solutions vs. the iteration number $k$ for the inversions of $A_1$ ($\kappa$ = 25.0) and $A_2$ ($\kappa$ = 11.1) using 8-bit fixed-point Richardson iterations.

Fig. 2 .
Fig. 2. Normalized errors of the iterative solutions vs. the iteration number $k$ for the inversion of $A_2$ ($\kappa$ = 11.1) using 8-bit, 7-bit, and 6-bit fixed-point iterative solvers. The error of the floating-point iterative solver is plotted for reference.

Four example kernels, with $\sigma$ = 0.70, 0.75, 0.80, and 0.85, are plotted in Fig. 3(b). Fig. 3(c) shows the normalized eigenvalue spectra of the four Toeplitz matrices, calculated from the Fourier transforms of the kernels. The condition numbers, $\kappa$, of the four kernels are 8.38, 16.59, 35.07, and 78.50, respectively.

Fig. 5 .
Fig. 5. Fixed-point iterative solver for a tomographic reconstruction problem. (a) Tomographic projection and measurement of a 16×16 signed 3-bit "Super Mario" pixel art. (b) Eigenvalue spectrum of the tomographic projection model. The condition number $\kappa$ of $A^T A$ is 101.80.

Fig. 7 .
Fig. 7. Normalized error vs. iteration number for the floating-point solver (dashed curve) and fixed-point iterative solver (solid curve), with adaptive exponent adjustments made every 5 steps.

Fig. 8 .
Fig. 8. Block diagram of the fixed-point iterative solver prototype on the FPGA, depicting the communication among the systolic multiplier array, cache, accumulators, and other resources. FIFO: first-in, first-out; BRAM: block random-access memory.

TABLE 1 .
NORMALIZED ERRORS AND CONVERGENCE RATES OF THE 8-BIT RICHARDSON ITERATIONS USED TO OBTAIN THE INVERSES OF A1 AND A2. A 95% CONFIDENCE BOUND IS USED IN THE EXPONENTIAL FITTING.

TABLE 2 .
ASYMPTOTIC ERRORS AND CONVERGENCE RATES OF THE 8-BIT, 7-BIT, 6-BIT, AND FLOATING-POINT ITERATIONS. A 95% CONFIDENCE BOUND IS USED IN THE EXPONENTIAL FITTING.

TABLE 3 .
NORMALIZED ERRORS OF RICHARDSON-LUCY DECONVOLUTION RESULTS.