## Abstract

Solving linear systems, often accomplished by iterative algorithms, is a ubiquitous task in science and engineering. To accommodate the dynamic range and precision requirements, these iterative solvers are carried out on floating-point processing units, which are not efficient in handling large-scale matrix multiplications and inversions. Low-precision, fixed-point digital or analog processors consume only a fraction of the energy per operation compared with their floating-point counterparts, yet their current uses exclude iterative solvers due to the cumulative computational errors arising from fixed-point arithmetic. In this work, we show that for a simple iterative algorithm, such as Richardson iteration, using a fixed-point processor can provide the same convergence rate as floating-point arithmetic and, when combined with residual iteration, achieve solutions beyond its native precision. These results indicate that power-efficient computing platforms consisting of analog computing devices can be used to solve a broad range of problems without compromising speed or precision.

## Introduction

Floating-point digital processing is the standard computing paradigm in virtually all scientific and engineering applications, such as image processing^{1,2,3}, compressive sensing^{4,5,6}, and predictive control systems^{7,8,9}. The cores of these applications often involve large-scale linear systems, which require multiplying and/or inverting matrices with thousands to millions of elements. A conventional digital floating-point processor, such as a single CPU core, processes the matrix multiplication as sequential multiply-accumulate (MAC) operations, which are not efficient in handling large-scale linear systems and could yield minutes to hours of computation time even for a moderate-scale problem^{10,11}.

Driven by the growing demand for data processing power, digital accelerators that implement a large array of parallel, half-precision floating-point processing units (FPUs), such as Google TPU^{12} and the Tensor Cores in NVIDIA Volta GPU^{13}, have been developed, particularly for artificial intelligence applications. Due to the on-chip power consumption limit, fixed-point digital MAC units, which require only a fraction of the energy of FPUs^{14,15}, have become increasingly favorable in large-scale digital accelerators. Recently, analog accelerators that utilize natural physical processes for array multiplications and summations have been shown to be even faster and more energy-efficient for matrix processing^{16}, with implementations on both electronic^{17,18} and photonic^{19,20} platforms. Yet, analog accelerators are susceptible to signal artifacts arising from noise and device imperfections^{21}, and are unable to achieve floating-point precision. As a result, fixed point is the de facto data format in both analog and digital accelerators^{22}.

Currently, the applications of analog or digital fixed-point accelerators are limited to problems robust against computing errors, such as neural network inference^{23}, adaptive filtering^{24}, and binary optimization^{25}. On the other hand, many computing tasks in science and engineering are inverse problems without analytical solutions, which must be solved iteratively. For these problems, floating-point precision is typically required to cache the intermediate results^{26,27}, because low-precision matrix calculations can introduce cumulative errors and stall the iterative process. The precision requirement excludes many resource-demanding computing applications from deployment on analog or digital fixed-point accelerators.

Extending the iterative linear solver to the fixed-point data format can effectively leverage the high speed and efficiency of analog accelerators for a broad range of computing problems in science and engineering. However, the errors of fixed-point arithmetic and the precision of the result in both the forward and inverse processes of a linear system have not been quantitatively studied. This work examines the properties of an iterative linear inverse algorithm (Richardson iteration) in which the matrix computations are performed with fixed-point arithmetic. Using residual iterations for error correction, the precision of the iterative solution can be theoretically extended to any desired level. Similar methods have been previously proposed for correcting the computing errors in reduced-precision floating-point processors^{28,29} or hybrid analog–digital processors^{30,31}, but quantitative analyses of the error and convergence of a fixed-point solver are still absent. Here we introduce two theorems describing the convergence criteria and error of a fixed-point Richardson solver. The theorems have two major implications: (1) the fixed-point solver does not compromise the convergence rate, and (2) using residual iterations, the fixed-point solver can reach solutions beyond its native precision limit. We verify the two theorems with three examples, namely, matrix inversion, image deconvolution, and tomography reconstruction. We also present a prototype fixed-point iterative solver implemented on a field-programmable gate array (FPGA), which reaches the solution in the same number of iterations as a floating-point CPU and reduces the calculation time by more than half.

## Theory

### Iterative solution of linear system

The problem of solving a linear system aims to deduce the inputs, \(\mathbf{x}\), that generate the observations, \(\mathbf{y}\), given the physical model of the system, expressed by a matrix \(\mathbf{A}\). Due to the size of the system matrix and noise in the measurement process, direct inversion of the observations \({\mathbf{y}}\) is usually impractical and, in many cases, impossible. As a result, many inverse algorithms rely on iterative refinements to approximate the solution^{32}. Assuming that the measurements are dominated by Gaussian noise, finding the solution \({\hat{\mathbf{x}}}\) given the observations \({\mathbf{y}}\) amounts to minimizing the objective function^{33},
$$L\left( {\mathbf{x}} \right) = \frac{1}{2}\left| {{\mathbf{y}} - {\mathbf{Ax}}} \right|_{2}^{2} \quad \left( 1 \right)$$
where \(\left| \cdot \right|_{2}\) denotes the *ℓ*^{2} norm. The solver starts with an initial guess, \({\mathbf{x}}_{0}\), and descends along the gradient, \(\partial L/\partial {\mathbf{x}}\), in steps of size \(\tau\) to update the estimate. Given the objective function in Eq. (1), the update from step \(k\) to \(k + 1\) can be expressed as follows,
$${\mathbf{x}}_{k + 1} = {\mathbf{x}}_{k} - \tau \frac{\partial L}{{\partial {\mathbf{x}}}} = {\mathbf{x}}_{k} + \tau {\mathbf{A}}^{T} \left( {{\mathbf{y}} - {\mathbf{Ax}}_{k} } \right) \quad \left( 2 \right)$$
Equation (2) can be rearranged into the form of Richardson iteration^{34} (Eq. (3)) by introducing a pre-calculated vector \({\mathbf{b}} = \tau {\mathbf{A}}^{T} {\mathbf{y}}\),
$${\mathbf{x}}_{k + 1} = \left( {{\mathbf{I}} - \tau {\mathbf{A}}^{T} {\mathbf{A}}} \right){\mathbf{x}}_{k} + {\mathbf{b}} \quad \left( 3 \right)$$
where each iteration comprises two matrix–vector multiplications and two vector additions. In the following sections, we discuss the error and convergence of the iterative linear solver when Eq. (3) is carried out in fixed-point precision available to analog or digital low-precision accelerators.
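In floating point, the iteration of Eq. (3) can be sketched in a few lines of NumPy. The matrix, observations, iteration count, and step size below are placeholders chosen for illustration, not the paper's test cases:

```python
import numpy as np

def richardson_solve(A, y, tau, num_iters=500):
    """Richardson iteration x_{k+1} = (I - tau A^T A) x_k + b, with b = tau A^T y."""
    b = tau * (A.T @ y)                      # pre-calculated vector
    x = np.zeros(A.shape[1])                 # initial guess x_0 = 0
    for _ in range(num_iters):
        # Two matrix-vector multiplications and two vector additions per iteration.
        x = x + b - tau * (A.T @ (A @ x))
    return x

# Small self-check on a well-conditioned system.
rng = np.random.default_rng(0)
A = np.eye(4) + 0.1 * rng.standard_normal((4, 4))
x_true = rng.standard_normal(4)
y = A @ x_true
tau = 1.8 / np.linalg.norm(A.T @ A, 2)       # step size below the 2/|A^T A|_2 limit
x_hat = richardson_solve(A, y, tau)
```

With all eigenvalues of \({\mathbf{I}} - \tau {\mathbf{A}}^{T}{\mathbf{A}}\) inside \((-1,1)\), the estimate contracts geometrically toward the least-squares solution.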

### Fixed-point format

In the fixed-point data format, a vector, matrix, or tensor is represented by an array of signed mantissas, and a global exponent is shared among all the elements. Each element, \(x_{i}\), in the \(L\)-bit fixed-point representation, \({\tilde{\mathbf{x}}}\), of an \(N\)-element array \({\mathbf{x}} \in {\mathbb{R}}^{N}\) can be expressed as,
$$\tilde{x}_{i} = {\text{mant}}_{i} \times 2^{{{\text{expo}} - \left( {L - 1} \right)}} \quad \left( 4 \right)$$
Here \({\text{expo}}\) denotes the exponent, indicating the number of bits above the radix point. To avoid overflow, the exponent is usually determined by the maximum element of the array,
$${\text{expo}} = \left\lceil {\log_{2} \mathop {\max }\limits_{i} \left| {x_{i} } \right|} \right\rceil \quad \left( 5 \right)$$
where \(\left\lceil x \right\rceil\) denotes rounding \(x\) up to the nearest integer. The \(\left( {L - 1} \right)\)-bit mantissas \({\text{mant}}_{i}\) are calculated as,
$${\text{mant}}_{i} = \left\lfloor {x_{i} \times 2^{{\left( {L - 1} \right) - {\text{expo}}}} } \right\rfloor \quad \left( 6 \right)$$
where \(\left\lfloor x \right\rfloor\) denotes rounding \(x\) down to the nearest integer.
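The conversion defined by Eqs. (4)–(6) can be sketched as follows (a minimal NumPy illustration; the rounding conventions follow the floor/ceiling definitions above, while the specific array values are arbitrary):

```python
import numpy as np

def to_fixed_point(x, L=8):
    """Return (mantissas, expo) for the signed L-bit representation of array x."""
    expo = int(np.ceil(np.log2(np.max(np.abs(x)))))   # shared exponent from the max element
    mant = np.floor(x * 2.0 ** (L - 1 - expo)).astype(np.int64)  # (L-1)-bit signed mantissas
    return mant, expo

def from_fixed_point(mant, expo, L=8):
    """Map mantissas back to real values: x_i ~ mant_i * 2^(expo - (L - 1))."""
    return mant * 2.0 ** (expo - (L - 1))

x = np.array([0.7, -1.2, 3.1, 0.05])
mant, expo = to_fixed_point(x, L=8)       # max |x_i| = 3.1, so expo = 2
x_q = from_fixed_point(mant, expo, L=8)
# Each element's rounding error is below one quantization step, 2^(expo - (L - 1)).
```

Because the exponent is shared, small elements of an array with a large maximum lose relatively more precision, which is the root cause of the error coefficients analyzed next.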

The multiplication between a fixed-point matrix \({\tilde{\mathbf{M}}}\) and a fixed-point vector \({\tilde{\mathbf{x}}}\) can be simplified as integer arithmetic between the mantissas, accompanied by bit-shifting to match the exponent of the result. Compared with the floating-point matrix–vector product \({\mathbf{y}} = {\mathbf{Mx}}\), the *ℓ*^{2} norm of the error, \({\updelta }{\mathbf{y}}\), of the fixed-point matrix–vector product \({\tilde{\mathbf{y}}} = {\tilde{\mathbf{M}}}{\tilde{\mathbf{x}}}\) is bounded by,
$$\left| {{\updelta }{\mathbf{y}}} \right|_{2} \le \eta \left| {\mathbf{M}} \right|_{2} \left| {\mathbf{x}} \right|_{2} \quad \left( 7 \right)$$
Here \(\left| {\mathbf{M}} \right|_{2}\) denotes the *ℓ*^{2} norm of the operator. An upper bound of the coefficient \(\eta\) is given by \(\zeta_{v} + \zeta_{m} + 3\zeta_{v} \zeta_{m}\) in Section S1 of the supplementary materials. The coefficients \(\zeta_{v}\) and \(\zeta_{m}\) are defined as:

which represent the normalized rounding errors of the fixed-point vector and matrix, respectively, and are dependent on the bit widths and exponents of the arrays. In most cases, \(\eta\) is much lower than this upper bound, depending on the distribution of elements in \({\mathbf{M}}\) and \({\mathbf{x}}\).
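Since \(\eta\) depends on the element distributions, it is often easier to estimate it empirically by comparing fixed-point and floating-point products over sample vectors. A sketch of such an estimate (function names and the sampling distribution are our illustrative choices, not the paper's procedure):

```python
import numpy as np

def quantize(x, L=8):
    """Emulate signed L-bit fixed point: shared exponent, (L-1)-bit mantissas."""
    expo = int(np.ceil(np.log2(np.max(np.abs(x)))))
    step = 2.0 ** (expo - (L - 1))
    return np.floor(x / step) * step          # dequantized values, one rounding step of error

def estimate_eta(M, samples, L=8):
    """Average normalized l2 error of the fixed-point product vs. the exact product."""
    Mq = quantize(M, L)
    norm_M = np.linalg.norm(M, 2)             # operator l2 norm of M
    errs = []
    for x in samples:
        y_exact = M @ x
        y_fixed = quantize(Mq @ quantize(x, L), L)   # result also stored in fixed point
        errs.append(np.linalg.norm(y_fixed - y_exact) / (norm_M * np.linalg.norm(x)))
    return float(np.mean(errs))

rng = np.random.default_rng(1)
M = rng.standard_normal((16, 16))
samples = rng.standard_normal((200, 16))
eta = estimate_eta(M, samples, L=8)
# eta is a small positive number well below the worst-case bound for 8-bit operands.
```

This mirrors the measurement procedure used for the matrix-inversion experiments below, where 200 sample vectors yield the empirical \(\eta\).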

### Error and convergence criteria of fixed-point Richardson iteration

The solution \({\mathbf{x}}^{*}\) to a linear system \({\mathbf{y}} = {\mathbf{Ax}}\) is unique if the matrix \({\mathbf{A}}\) has full rank. To ensure that the intermediate estimates in the Richardson iteration \(\left\{ {{\mathbf{x}}_{k} } \right\}\) converge to the solution \({\mathbf{x}}^{*}\) as the iteration number \(k \to \infty\), all the eigenvalues of the matrix \(\left( {{\mathbf{I}} - \tau {\mathbf{A}}^{T} {\mathbf{A}}} \right)\) must fall within \(\left( { - 1,1} \right)\). The largest and smallest of these eigenvalues are \(1 - \tau \left| {{\mathbf{A}}^{T} {\mathbf{A}}} \right|_{2} /\kappa\) and \(1 - \tau \left| {{\mathbf{A}}^{T} {\mathbf{A}}} \right|_{2}\), respectively, where the operator norm \(\left| {{\mathbf{A}}^{T} {\mathbf{A}}} \right|_{2}\) equals the largest eigenvalue of \({\mathbf{A}}^{T} {\mathbf{A}}\), and \(\kappa\) is the condition number of \({\mathbf{A}}^{T} {\mathbf{A}}\). The choice of the step size, \(\tau\), is therefore confined by:
$$0 < \tau < \frac{2}{{\left| {{\mathbf{A}}^{T} {\mathbf{A}}} \right|_{2} }} \quad \left( 9 \right)$$
In practice, to ensure convergence while expediting the iterative solver, the maximum step size, \(\tau_{max}\), is often selected with a safety margin \(0 < \chi \ll 1\) as follows:
$$\tau_{max} = \frac{2 - \chi }{{\left| {{\mathbf{A}}^{T} {\mathbf{A}}} \right|_{2} }} \quad \left( {10} \right)$$
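In code, the step-size rule reduces to a single spectral-norm evaluation (a sketch; the diagonal test matrix and \(\chi = 0.2\), the margin used in the experiments below, are illustrative):

```python
import numpy as np

def max_step_size(A, chi=0.2):
    """tau_max = (2 - chi) / |A^T A|_2, keeping eigenvalues of I - tau A^T A in (-1, 1)."""
    lam_max = np.linalg.norm(A.T @ A, 2)   # largest eigenvalue of A^T A
    return (2.0 - chi) / lam_max

A = np.diag([1.0, 0.5, 0.2])
tau = max_step_size(A)
# |A^T A|_2 = 1 here, so tau = 1.8, and the eigenvalues of I - tau A^T A
# are 1 - 1.8 * {1, 0.25, 0.04}, all strictly inside (-1, 1).
```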
For fixed-point Richardson iterations, Theorem 1 below summarizes the error and convergence criteria based on the fixed-point matrix multiplication error \(\eta\).

**Theorem 1**: For a linear system \({\mathbf{Ax}} = {\mathbf{y}}\) solved using fixed-point Richardson iterations (Eq. (3)) with step size \(\tau\), under the conditions in Eq. (11),

the asymptotic error, \(\theta\), between the estimation, \({\mathbf{x}}_{k}\), and the solution, \({\mathbf{x}}^{*}\), is bounded by:

where \(\kappa\) is the condition number of \({\mathbf{A}}^{T} {\mathbf{A}}\), and \(\eta\) is the normalized *ℓ*^{2} error of the two matrix–vector multiplications in each iteration.

Theorem 1 is proved in Section S2 of the supplementary materials by examining the *ℓ*^{2} norm of the error of the intermediate estimates, \(\left| {{\mathbf{x}}^{*} - {\mathbf{x}}_{k} } \right|_{2}\), as \(k \to \infty\). The proof shows a linear convergence with contraction factor \(1 - \tau \left| {{\mathbf{A}}^{T} {\mathbf{A}}} \right|_{2} /\kappa\), which is independent of the error of the fixed-point matrix–vector product, \(\eta\). Because a typical problem involves a value of \(\kappa\) on the order of \(10^{2} \sim 10^{3}\), using the choice of step size \(\tau\) in Eq. (10), \(\tau \left| {{\mathbf{A}}^{T} {\mathbf{A}}} \right|_{2} = 2 - \chi \ll \kappa\), we can approximate

where the convergence rate \(\gamma = \left( {2 - \chi } \right)/\kappa\).

### Fixed-point residual iteration

The asymptotic error \(\theta\) in Theorem 1 is independent of the solution \({\mathbf{x}}^{*}\), which implies that the fixed-point iterative solver stagnates at an estimate, \({\mathbf{x}}_{\infty }\), whose *ℓ*^{2} distance to the solution \({\mathbf{x}}^{*}\) is smaller when \({\mathbf{x}}^{*}\) has a lower *ℓ*^{2} norm. Hence, constructing a sequence of fixed-point iterative solvers whose solutions have decreasing *ℓ*^{2} norms becomes the natural choice to approach the unbiased solution of the inverse problem.

The proposed residual iteration algorithm described in Algorithm 1 leverages the adjustable exponent of the fixed-point format. The algorithm contains \(M\) sets of fixed-point Richardson iterations as the inner loops. The acceptance criterion of the inner loop solution is:

where \(\varepsilon\) is a small number of the order \(\zeta_{v}\). Let \({\mathbf{x}}^{\left( M \right)} = \mathop \sum \limits_{l = 1}^{M} \delta {\mathbf{x}}^{\left( l \right)}\) denote the accumulated solutions from \(M\) fixed-point Richardson iterative solvers. The residual iteration algorithm aims to reduce the norm of \({\mathbf{r}}^{\left( l \right)} : = {\mathbf{y}} - {\mathbf{Ax}}^{\left( l \right)}\) at each outer loop \(l\) by recursively solving the linear system \({\mathbf{r}}^{\left( l \right)} = {\mathbf{A}}\delta {\mathbf{x}}^{\left( l \right)}\) with the updated exponents in the next inner loop of the fixed-point Richardson iterations.
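A floating-point emulation of this outer/inner loop structure can be sketched as follows. Quantization stands in for the fixed-point hardware, and the loop counts and exponent handling are illustrative assumptions, not the paper's Algorithm 1 verbatim:

```python
import numpy as np

def quantize(x, L=8):
    """Emulate signed L-bit fixed-point storage with a shared, adaptive exponent."""
    expo = int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-30)))   # guard against all-zero input
    step = 2.0 ** (expo - (L - 1))
    return np.floor(x / step) * step

def residual_solve(A, y, tau, M=5, inner_iters=200, L=8):
    """Accumulate corrections dx^(l) from successive fixed-point Richardson solves."""
    x_acc = np.zeros(A.shape[1])
    for _ in range(M):                       # outer (residual) loop
        r = y - A @ x_acc                    # residual of the accumulated solution
        b = quantize(tau * (A.T @ r), L)     # re-quantizing r rescales the exponent
        dx = np.zeros_like(x_acc)
        for _ in range(inner_iters):         # fixed-point Richardson inner loop
            dx = quantize(dx + b - tau * quantize(A.T @ (A @ dx), L), L)
        x_acc = x_acc + dx                   # each outer loop shrinks |r|, so the
    return x_acc                             # quantization step shrinks with it

rng = np.random.default_rng(2)
A = np.eye(8) + 0.05 * rng.standard_normal((8, 8))
x_true = rng.standard_normal(8)
y = A @ x_true
tau = 1.8 / np.linalg.norm(A.T @ A, 2)
x_hat = residual_solve(A, y, tau)
```

Each inner solve stagnates at a relative error of order \(\theta\), but because the residual (and hence the exponent) shrinks between outer loops, the accumulated estimate refines beyond the native 8-bit precision.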

It is worth noting that similar precision refinement methods ^{28,29} have been employed in iterative solvers using mixed full- and half-precision floating-point data formats. Yet, the floating-point precision refinement typically requires far fewer residual loops, \(M\), than Algorithm 1. In addition, floating-point format allows a different exponent for every element in the matrix–vector product. As a result, the error analysis in floating-point precision refinement ^{35} does not apply to fixed-point residual iterations. Here we present the upper bound of the normalized error of the solution \({\mathbf{x}}^{\left( M \right)}\) after \(M\) fixed-point residual iterations in Theorem 2.

**Theorem 2**: For a full-rank linear system \({\mathbf{Ax}} = {\mathbf{y}}\) with its solution \({\mathbf{x}}^{*}\) obtained from the residual iteration algorithm (Algorithm 1), the asymptotic error of the estimation after \(M\) residue updates, \({\mathbf{x}}^{\left( M \right)}\), is bounded by:
$$\frac{{\left| {{\mathbf{x}}^{*} - {\mathbf{x}}^{\left( M \right)} } \right|_{2} }}{{\left| {{\mathbf{x}}^{*} } \right|_{2} }} \le \theta^{M} \quad \left( {15} \right)$$
where \(\theta\) is the asymptotic error of the solution obtained from fixed-point Richardson solver (Eq. (12)).

Theorem 2 is established based on the decreasing *ℓ*^{2} norm of the solution \({\mathbf{x}}^{*\left( l \right)}\) in each fixed-point Richardson iteration. The upper bound of the solution error after \(M\) residue updates can be derived recursively from Eq. (12) in Theorem 1. Detailed proof is presented in Section S3 of the supplementary materials.

For Eq. (15) to be valid, the exponents of \({\mathbf{b}}^{\left( l \right)}\) and \({\mathbf{A}}^{T} {\mathbf{A}}\delta {\mathbf{x}}_{k}\) should be updated after every iteration of the inner loop. In practice, due to the computational cost of distribution-based estimation of the infinity norm (Eq. (19)), the exponents are adjusted after the completion of each inner loop. As a result, the errors may not always satisfy the bound predicted by Eq. (15).

## Fixed-point iterative solver experiments

### Matrix inversion

We first verify the convergence rate and the asymptotic error in Theorem 1 with a fixed-point matrix inversion solver. The two matrices to be inverted, \({\mathbf{A}}_{1}\) and \({\mathbf{A}}_{2}\), shown in Fig. 1a, are constructed from the 4 \(\times\) 4 discrete cosine transform matrix by applying different linear scaling factors to its eigenvalues. The condition numbers \(\kappa\) of \({\mathbf{A}}_{1}\) and \({\mathbf{A}}_{2}\) are 25.0 and 11.1, respectively. The analytical inverses \({\mathbf{A}}_{1}^{ - 1}\) and \({\mathbf{A}}_{2}^{ - 1}\), shown in Fig. 1b, are calculated with floating-point LU decomposition.

The solver uses the signed 8-bit fixed-point format. The step sizes, \(\tau \), are calculated from Eq. (10) with a safety margin, \(\chi \) = 0.2, for the inversions of \({\mathbf{A}}_{1}\) and \({\mathbf{A}}_{2}\). To estimate the normalized *ℓ*^{2} error \(\eta \) of the matrix–vector product, \({\mathbf{A}}^{T}\mathbf{A}{\mathbf{x}}_{k}\), the arrays \({\mathbf{A}}^{T}\mathbf{A}\) and \(\mathbf{b}\) in Eq. (3) are pre-calculated and converted into fixed point. We draw 200 normally distributed vectors \({\mathbf{x}}\) with mean \(\mathbf{b}\) and standard deviation \({\left|\mathbf{b}-{\mathbf{x}}^{*}\right|}_{2}/2\), where \({\mathbf{x}}^{*}\) is the solution. The sampled vectors are converted to fixed-point format and multiplied with the fixed-point matrix \({\widetilde{\mathbf{A}}}^{T}\widetilde{\mathbf{A}}\). We then obtain \(\eta \) = 0.019 from the average normalized root-mean-square error (RMSE) of the multiplication results.

Figure 1c plots the normalized errors, \({\theta }_{k}\), in log scale relative to the analytical inverses, \({\mathbf{A}}_{1}^{-1}\) and \({\mathbf{A}}_{2}^{-1}\), as the iterations progress. The asymptotic errors, \(\theta \), of the 8-bit iterations are 0.21 and 0.083 for the inversions of \({\mathbf{A}}_{1}\) and \({\mathbf{A}}_{2}\), respectively, both of which are consistent with the error bounds given by Eq. (12). The convergence rate, \(\gamma\), is obtained from an exponential fit of the curves in Fig. 1c at a 95% confidence level. For the inversions of \({\mathbf{A}}_{1}\) and \({\mathbf{A}}_{2}\), \(\gamma\) is 0.040 ± 0.001 and 0.085 ± 0.004, respectively; the rates are inversely proportional to \(\kappa\), as predicted by Eq. (13). The asymptotic errors and convergence rates are summarized in Table 1.

For the same condition number, Fig. 2 compares fixed-point Richardson iterations at 8-bit, 7-bit, and 6-bit precisions with their floating-point counterpart. The convergence rates are summarized in Table 2, which confirms the implication of Eq. (13) that the fixed-point iterative solver maintains the same convergence rate, \(\gamma\), as the floating-point solver. The independence of the convergence rate from the precision suggests that fixed-point iterative solvers do not compromise the convergence speed. The asymptotic errors of the 8-bit, 7-bit, and 6-bit solutions are 0.083, 0.18, and 0.33, respectively, all below the upper bound established in Eq. (12).

### Image deconvolution

Deconvolution is a classic linear inverse problem that has broad applications in signal/image filtering and processing. For a linear, shift-invariant system, the convolution output, \({\mathbf{y}}\) (i.e., filtered signal, or blurred image) is the discrete convolution between the input signal or image \({\mathbf{x}}\) and the kernel \({\mathbf{K}}\), i.e., \({\mathbf{y}} = {\mathbf{K}}*{\mathbf{x}}\), which is defined elementwise as:
$$y_{i,j} = \mathop \sum \limits_{p = 1}^{M} \mathop \sum \limits_{q = 1}^{N} K_{p,q} x_{i - p,j - q} \quad \left( {16} \right)$$
where \(M\) and \(N\) are the sizes of the kernel along row and column directions, respectively. Here, we have assumed a two-dimensional convolution, as in the case of image processing. This example demonstrates the iterative Richardson-Lucy deconvolution, which recovers images from blurry measurements given the convolution kernel.

The discrete convolution can be formulated as the product between the vectorized input, \({\mathbf{x}}\), and a Toeplitz matrix, \({\mathbf{A}}\), constructed from the kernel ^{36}, producing the output \({\mathbf{y}} = {\mathbf{Ax}}\) in the vectorized form. To estimate the condition number, we rewrite Eq. (16) as Eq. (17) following the convolution theorem,
$${\mathbf{y}} = {\mathbf{\mathcal{F}}}^{\dag } {\text{diag}}\left( {{\tilde{\mathbf{K}}}} \right){\mathbf{\mathcal{F}}}{\mathbf{x}} \quad \left( {17} \right)$$
where \({\mathbf{\mathcal{F}}}\) and \({\mathbf{\mathcal{F}}}^{\dag }\) represent the forward and inverse two-dimensional Fourier transforms, both of which are unitary operators, \(\dag\) denotes the conjugate transpose of a complex matrix, and \({\text{diag}}\left( {{\tilde{\mathbf{K}}}} \right)\) is the diagonal matrix constructed from the Fourier transform of the kernel \({\tilde{\mathbf{K}}}\). The condition number, \(\kappa\), of \({\mathbf{A}}^{T} {\mathbf{A}}\) can be obtained from the maximum and minimum elements in \({\text{diag}}\left( {{\tilde{\mathbf{K}}}} \right)^{\dag } {\text{diag}}\left( {{\tilde{\mathbf{K}}}} \right)\).
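This spectral route avoids forming the Toeplitz matrix at all: the condition number can be read off the kernel's Fourier magnitudes. A sketch (the small Gaussian kernel, image size, and circular boundary handling are illustrative assumptions):

```python
import numpy as np

def convolution_condition_number(kernel, shape):
    """kappa of A^T A for circular convolution, from the kernel's 2D Fourier transform."""
    K = np.fft.fft2(kernel, s=shape)          # eigenvalues of A are the DFT of the kernel
    power = np.abs(K) ** 2                    # diagonal of diag(K~)^dagger diag(K~)
    return float(power.max() / power.min())

# Normalized Gaussian kernels: wider kernels attenuate high spatial frequencies
# more strongly, which raises the condition number.
p, q = np.mgrid[-2:3, -2:3]
kappas = []
for sigma in (0.7, 0.8):
    k = np.exp(-(p ** 2 + q ** 2) / (2 * sigma ** 2))
    k = k / np.linalg.norm(k)                 # enforce sum of |K_pq|^2 = 1
    kappas.append(convolution_condition_number(k, (64, 64)))
```

The monotonic growth of \(\kappa\) with the kernel width mirrors the trend reported for the four kernels below.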

Figure 3 shows an example based on Richardson-Lucy deconvolution. The input, \({\mathbf{x}}^{*}\), is a test image downsampled from the MATLAB image “Saturn”, shown in Fig. 3a. The image contains 128 × 102 pixels in signed 4-bit fixed-point format with range (− 1, 1). The convolution kernel, \({\mathbf{K}}\), follows the Gaussian profile,
$$K_{p,q} = K_{0} \exp \left( { - \frac{{p^{2} + q^{2} }}{{2\sigma^{2} }}} \right) \quad \left( {18} \right)$$
where the parameter \(\sigma\) determines the width of the kernel, and \(K_{0}\) is a normalization constant that ensures that \(\mathop \sum \limits_{p,q} \left| {K_{p,q} } \right|^{2} = 1\). Four example kernels, with \(\sigma\) = 0.70, 0.75, 0.80, and 0.85, are plotted in Fig. 3b. Figure 3c shows the normalized eigenvalue spectra of the four Toeplitz matrices, calculated from the Fourier transforms of the kernels. The condition numbers, \(\kappa\), of the four kernels are 8.38, 16.59, 35.07, and 78.50, respectively. If \(\sigma\) were increased beyond 0.90, \(\kappa\) would violate the criteria in Eq. (11) and lead to a non-converging deconvolution result. A similar phenomenon has been reported in hybrid analog–digital solvers^{37}.

The measurements, shown in Fig. 4a, are calculated from convolving the input image, \({\mathbf{x}}^{*}\), with the four kernels in Fig. 3b, and then quantized to 8-bit to simulate the digitization error of an ordinary CCD or CMOS camera. The deconvolution problems are solved using 8-bit fixed-point iterations, with \(\eta\) estimated to be 0.014. Step sizes with the safety margin \(\chi\) = 0.2 are used for all four iterative deconvolutions. Figure 4b plots the deconvolved images after 200 iterations. The normalized errors as functions of iteration steps are plotted in Fig. 4c. The asymptotic errors, \(\theta\), are summarized in Table 3, and are all below the error bound predicted by Eq. (12).

### Tomographic reconstruction with residual iteration

Theorem 2 indicates that the residual iteration can overcome the stagnation of the fixed-point Richardson solver and reach a solution beyond the fixed-point limit. To verify the performance of the residual iterations, we examine a tomographic reconstruction of a 16 × 16 “Super Mario” pixel art from its pencil-beam projections, shown in Fig. 5a. The pixels of the input image, \({\mathbf{x}}^{*}\), are stored in the signed 3-bit fixed-point format with range (− 1, 1), giving a quantization interval of 0.25 between two adjacent intensity levels. The measurements consist of 60 projections between 0 and 180º at 4º intervals, each containing 31 uniformly spaced beams. The system matrix, \({\mathbf{A}}\), of the tomographic model is constructed from the distance-driven method^{38}. Figure 5b plots the eigenvalue spectrum of the tomographic model. The condition number, \(\kappa\), of \({\mathbf{A}}^{T} {\mathbf{A}}\) is 101.80. Assuming the maximum step size, \(\tau\), is used in the Richardson iterations, the computation error \(\eta\) must be less than 0.020 to ensure convergence, according to Eq. (11). For the size and sparsity of this system matrix, we find that a minimum of \(L\) = 8 bits (with an estimated \(\eta\) = 0.015) is required for the fixed-point iterations to converge.

Fixed-point Richardson and residual iterations with 8-bit, 9-bit, and 10-bit precisions are performed to reconstruct the “Super Mario” pixel art from the tomographic projections. The step sizes, \(\tau\), with safety margin \(\chi\) = 0.3 are used in all the solvers. The system matrix \({\mathbf{A}}\) is quantized to the solver precision with expo = 0. The measurement, \({\mathbf{y}}\), is serialized to a vector, multiplied by \(\tau {\mathbf{A}}^{T}\), and quantized to the solver precision with expo = 0 to pre-calculate the fixed-point vector \({\mathbf{b}}\) for the iterations. The fixed-point Richardson iteration sets the exponent of the multiplication between \({\mathbf{A}}^{T} {\mathbf{A}}\) and \({\mathbf{x}}_{k}\) to 2. For residual iterations, the exponents of matrix–vector multiplications are decremented by 1 every 80 steps from 2 to − 2.

Figure 6 plots the reconstructions from Richardson and residual iterations with 8-bit, 9-bit, and 10-bit precisions. Table 4 summarizes the normalized errors of the fixed-point Richardson reconstructions, all of which are below the error bound given by Eq. (12). The normalized error, \(\theta\), is plotted as a function of the Richardson iteration number in Fig. 6c1. Note that the same convergence rate is present regardless of the precision of the solver. The normalized errors of the residual iterations are shown in Fig. 6c2. After \(M\) = 5 residue updates, the 8-bit, 9-bit, and 10-bit solvers all reach normalized errors below 0.1. At this level of \(\theta\), the pixelwise differences between \({\mathbf{x}}^{\left( 5 \right)}\) and \({\mathbf{x}}^{*}\) are below the quantization interval, 0.25, of the true image \({\mathbf{x}}^{*}\). Therefore, the reconstructions in Fig. 6b1–b3 would exhibit no visual differences from the true image if displayed with a 3-bit grayscale colormap.

Faster convergence is possible via adaptive adjustments of the exponents of \({\mathbf{b}}^{\left( l \right)}\) and \({\mathbf{A}}^{T} {\mathbf{A}}\delta {\mathbf{x}}_{k}\). Figure 7 shows the error of the 8-bit residual iteration executed on the hardware prototype solver (Sec. "Hardware prototype for a fixed-point iterative solver"). The exponent is updated every 5 steps based on the distribution of the elements in a fixed-point array \({\mathbf{x}}\). The exponents are calculated using:

where \(\mu_{{\mathbf{x}}}\) and \(\sigma_{{\mathbf{x}}}\) denote the mean and standard deviation, respectively, of all the elements in \({\mathbf{x}}\). The adaptive residual iteration achieves the same convergence speed as the floating-point solver, with both methods achieving errors below the quantization interval of the true solution after 28 iterations.
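A distribution-based exponent update of this kind can be sketched as follows. The three-standard-deviation margin is our assumption for illustration, since the exact constant in Eq. (19) is implementation-specific:

```python
import numpy as np

def adaptive_exponent(x, n_sigma=3.0):
    """Estimate the shared exponent from the element distribution of array x.

    The n_sigma margin is an assumed choice: |mean| + 3 std covers nearly all
    elements of a roughly Gaussian array without scanning for the exact maximum;
    rare outliers beyond the estimate simply saturate.
    """
    mu = x.mean()
    sigma = x.std()
    bound = abs(mu) + n_sigma * sigma          # distribution-based infinity-norm estimate
    return int(np.ceil(np.log2(bound)))

rng = np.random.default_rng(3)
x = rng.standard_normal(10_000)
expo = adaptive_exponent(x)                    # standard normal: bound ~ 3, so expo = 2
```

Replacing the exact maximum with two running statistics is what makes the per-few-steps exponent update cheap enough for the hardware loop.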

## Hardware prototype for a fixed-point iterative solver

We have built a hardware prototype to perform the fixed-point Richardson iteration and the adjustment of exponent. The core component of the solver, i.e., the fixed-point matrix multiplication, is implemented on an FPGA development board (Xilinx ZC706) and communicates with the host PC via PCIe bus. Figure 8 illustrates the block diagram of the logic design within the FPGA. The systolic multiplier array performs 512 MAC operations in a single clock cycle. The inputs of the multiplier array consist of a 16 × 16 \({\mathcal{W}}\) block and a 16 × 2 \({\mathcal{X}}\) block. The precisions of both input blocks are signed 8-bit. The multiplication outputs are cached in signed 32-bit and wired to a 16 × 2 array of accumulators. For the matrix–vector multiplication involving a matrix \({\mathbf{W}}\) larger than 16 × 16, the matrix \({\mathbf{W}}\) is zero-padded to an integer number of 16 × 16 \({\mathcal{W}}\) blocks, and the batch vector \({\mathbf{X}}\) is zero-padded to an integer number of 16 × 2 \({\mathcal{X}}\) blocks, along the row dimension only. All the decomposed \({\mathcal{W}}\) and \({\mathcal{X}}\) blocks are cached in on-chip block random access memory (BRAM). The multiplier array reads the \({\mathcal{W}}\) blocks along the rows and the \({\mathcal{X}}\) blocks along the columns. The blocks of 16 × 2 results are summed in the accumulators. After completion of one row of \({\mathcal{W}}\) blocks, the multiplier array reads the \({\mathcal{W}}\) blocks in the next row and cycles through all the blocks of \({\mathcal{X}}\) again.

The host PC converts \({\mathbf{W}}\) and \({\mathbf{X}}\) into fixed-point, and then decomposes the mantissas into 16 × 16 \({\mathcal{W}}\) blocks and 16 × 2 \({\mathcal{X}}\) blocks. The decomposed \({\mathcal{W}}\) and \({\mathcal{X}}\) blocks are streamed to the FPGA through PCIe. Once all the blocks are multiplied and accumulated in FPGA, the results \({\mathbf{Y}}\) are streamed back to host PC in 32-bit integer format. The host PC masks the results back to 8-bit, with the most significant bit (MSB) and least significant bit (LSB) selected by the exponents. The host PC updates the exponents according to Eq. (19) every five iterations.
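The host-side block decomposition and 32-bit accumulation can be emulated in software (a sketch of the data flow only, not the FPGA logic; the tiling follows the 16 × 16 and 16 × 2 block sizes described above):

```python
import numpy as np

def tiled_matmul_int8(W, X, tile=16, cols=2):
    """Multiply int8 mantissa arrays by 16x16 / 16x2 blocks, accumulating in int32."""
    # Zero-pad to whole blocks, as the host PC does before streaming to the FPGA.
    R = -(-W.shape[0] // tile) * tile
    C = -(-W.shape[1] // tile) * tile
    B = -(-X.shape[1] // cols) * cols
    Wp = np.zeros((R, C), dtype=np.int8)
    Wp[:W.shape[0], :W.shape[1]] = W
    Xp = np.zeros((C, B), dtype=np.int8)
    Xp[:X.shape[0], :X.shape[1]] = X
    Y = np.zeros((R, B), dtype=np.int32)       # 32-bit accumulators
    for i in range(0, R, tile):                # one row of W blocks at a time
        for j in range(0, B, cols):            # cycle through the X blocks
            for k in range(0, C, tile):        # accumulate along the shared dimension
                Y[i:i + tile, j:j + cols] += (
                    Wp[i:i + tile, k:k + tile].astype(np.int32) @
                    Xp[k:k + tile, j:j + cols].astype(np.int32)
                )
    return Y[:W.shape[0], :X.shape[1]]         # trim the zero-padding

rng = np.random.default_rng(4)
W = rng.integers(-128, 128, size=(40, 24), dtype=np.int8)
X = rng.integers(-128, 128, size=(24, 3), dtype=np.int8)
Y = tiled_matmul_int8(W, X)                    # identical to the full int32 product
```

Because the per-block products stay in 32-bit integers, the tiled result is bit-exact with the untiled integer product; only the final masking back to 8-bit introduces rounding.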

For the FPGA design in Fig. 8 implemented on the ZC706 evaluation board, the Xilinx synthesis and implementation tools estimate a power consumption of 2.047 W, of which 1.253 W is consumed by the transceivers for PCIe communication with the host PC, and the remaining 0.794 W is consumed by the logic, including the clocks, systolic multiplier array, BRAM cache, and control units that generate the addresses of the \({\mathcal{W}}\) and \({\mathcal{X}}\) blocks. The clock rate is 250 MHz. Considering the total of 512 MACs within a single clock cycle, the energy consumption of the fixed-point computation is 6.2 pJ/MAC, which is on the same order as that of datacenter TPUs^{39}.
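The energy figure follows directly from the reported numbers (a quick check of the arithmetic):

```python
# 512 MACs per clock cycle at 250 MHz, with 0.794 W consumed by the computing logic.
logic_power_w = 0.794
macs_per_second = 512 * 250e6                  # 1.28e11 MAC/s throughput
energy_per_mac_pj = logic_power_w / macs_per_second * 1e12
# energy_per_mac_pj is about 6.2 pJ/MAC, matching the figure quoted above.
```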

We have tested the speed of the iterative solver implemented on our hardware prototype and on a CPU (Intel Core i3-4130, dual-core, 3.40 GHz clock rate). The total CPU time of the fixed-point Richardson iterations, excluding exponent adjustment, is measured and then divided by the total number of iterations in all residual updates to obtain the time per iteration, which is indicative of the calculation speed. The fixed-point solver on the CPU takes 1.7 ± 0.2 ms per iteration, while that on the FPGA prototype takes 0.76 ± 0.05 ms per iteration, a more than twofold improvement in speed. Errors represent the variation of the calculation time across five repeated tests.

## Conclusion

We have demonstrated a fixed-point iterative solver that computes high-precision solutions for linear inverse problems beyond the precision limit of the hardware. We have described the convergence and error of the fixed-point iterations in two theorems. Theorem 1 shows that the convergence rate is independent of the precision of the solver, which implies that fixed-point iteration does not compromise the convergence speed of the solver. Theorem 2 resolves the stagnation of a fixed-point solver with residual iterations, which correct the error and refine the solution to a higher precision level. We have presented three examples of linear inverse problems, namely, matrix inversion, Richardson-Lucy deconvolution, and tomographic reconstruction, to verify both theorems.

The combination of residual iteration with adaptive exponent adjustment achieves the same convergence rate and error as a floating-point Richardson solver. We have prototyped the fixed-point solver with a fixed-point systolic multiplier array on the FPGA and fixed-point arithmetic libraries on the host PC. The prototype solver achieves more than twice the calculation speed of a CPU and takes the same number of iterations to reach the solution as its floating-point counterpart. We envision that an array of low-precision, analog matrix processing cores can supersede the current FPGA systolic array in solving numerous large-scale, high-performance computing applications, including but not limited to convex optimization, solving differential equations, and training of artificial neural networks. Certain nonlinear problems^{40,41} could also be incorporated into our iterative solver framework in conjunction with digital co-processing to calculate the local gradient on the host PC.

## Data availability

The data and software that support the findings of this study are available from the corresponding author upon reasonable request.

## References

1. Biemond, J., Lagendijk, R. L. & Mersereau, R. M. Iterative methods for image deblurring. *Proc. IEEE* **78**(5), 856–883. https://doi.org/10.1109/5.53403 (1990).
2. Kak, A. C. & Slaney, M. *Principles of Computerized Tomographic Imaging* (IEEE Press, New York, 2001).
3. Harmany, Z. T., Marcia, R. F. & Willett, R. M. This is SPIRAL-TAP: Sparse Poisson intensity reconstruction algorithms—theory and practice. *IEEE Trans. Image Process.* **21**(3), 1084–1096. https://doi.org/10.1109/TIP.2011.2168410 (2012).
4. Donoho, D. L. Compressed sensing. *IEEE Trans. Inf. Theory* **52**(4), 1289–1306. https://doi.org/10.1109/TIT.2006.871582 (2006).
5. Baraniuk, R. & Steeghs, P. Compressive radar imaging. In *2007 IEEE Radar Conference* 128–133. https://doi.org/10.1109/RADAR.2007.374203 (2007).
6. Tsai, T.-H., Llull, P., Yuan, X., Carin, L. & Brady, D. J. Spectral-temporal compressive imaging. *Opt. Lett.* **40**(17), 4054–4057. https://doi.org/10.1364/OL.40.004054 (2015).
7. Wang, Y. & Boyd, S. Fast model predictive control using online optimization. *IEEE Trans. Control Syst. Technol.* **18**(2), 267–278. https://doi.org/10.1109/TCST.2009.2017934 (2010).
8. Morari, M., Garcia, C. E. & Prett, D. M. Model predictive control: Theory and practice. *IFAC Proc.* **21**(4), 1–12. https://doi.org/10.1016/B978-0-08-035735-5.50006-1 (1988).
9. Jerez, J. L., Constantinides, G. A. & Kerrigan, E. C. An FPGA implementation of a sparse quadratic programming solver for constrained predictive control. In *Proceedings of the 19th ACM/SIGDA International Symposium on Field Programmable Gate Arrays (FPGA '11)* 209. https://doi.org/10.1145/1950413.1950454 (2011).
10. Choi, Y. & Cong, J. Acceleration of EM-based 3D CT reconstruction using FPGA. *IEEE Trans. Biomed. Circuits Syst.* **10**(3), 754–767. https://doi.org/10.1109/TBCAS.2015.2471813 (2016).
11. Kim, J. K., Zhang, Z. & Fessler, J. A. Hardware acceleration of iterative image reconstruction for X-ray computed tomography. In *2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)* 1697–1700. https://doi.org/10.1109/ICASSP.2011.5946827 (2011).
12. Wang, Y. E., Wei, G.-Y. & Brooks, D. Benchmarking TPU, GPU, and CPU platforms for deep learning. Preprint at http://arxiv.org/abs/1907.10701 (2019).
13. Markidis, S., Der Chien, S. W., Laure, E., Peng, I. B. & Vetter, J. S. NVIDIA tensor core programmability, performance & precision. In *2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)* 522–531. https://doi.org/10.1109/IPDPSW.2018.00091 (2018).
14. Sze, V., Chen, Y.-H., Yang, T.-J. & Emer, J. S. Efficient processing of deep neural networks: A tutorial and survey. *Proc. IEEE* **105**(12), 2295–2329. https://doi.org/10.1109/JPROC.2017.2761740 (2017).
15. Horowitz, M. Computing's energy problem. In *2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers* 10–14. https://doi.org/10.1109/JSSC.2014.2361354 (2014).
16. Haensch, W., Gokmen, T. & Puri, R. The next generation of deep learning hardware: Analog computing. *Proc. IEEE* **107**(1), 108–122. https://doi.org/10.1109/JPROC.2018.2871057 (2019).
17. Li, C. *et al.* Efficient and self-adaptive in-situ learning in multilayer memristor neural networks. *Nat. Commun.* **9**(1), 7–14. https://doi.org/10.1038/s41467-018-04484-2 (2018).
18. Yao, P. *et al.* Fully hardware-implemented memristor convolutional neural network. *Nature* **577**(7792), 641–646. https://doi.org/10.1038/s41586-020-1942-4 (2020).
19. Hamerly, R., Bernstein, L., Sludds, A., Soljačić, M. & Englund, D. Large-scale optical neural networks based on photoelectric multiplication. *Phys. Rev. X* **9**(2), 1–12. https://doi.org/10.1103/physrevx.9.021032 (2019).
20. Shen, Y. *et al.* Deep learning with coherent nanophotonic circuits. *Nat. Photonics* **11**(7), 441–446. https://doi.org/10.1038/nphoton.2017.93 (2017).
21. Tsai, H., Ambrogio, S., Narayanan, P., Shelby, R. M. & Burr, G. W. Recent progress in analog memory-based accelerators for deep learning. *J. Phys. D Appl. Phys.* **51**(28), 415. https://doi.org/10.1088/1361-6463/aac8a5 (2018).
22. Gupta, S., Agrawal, A., Gopalakrishnan, K. & Narayanan, P. Deep learning with limited numerical precision. In *Proceedings of the 32nd International Conference on Machine Learning (ICML 2015)* 1737–1746. http://proceedings.mlr.press/v37/gupta15.html (2015).
23. Merolla, P., Appuswamy, R., Arthur, J., Esser, S. K. & Modha, D. Deep neural networks are robust to weight binarization and other non-linear distortions. Preprint at http://arxiv.org/abs/1606.01981 (2016).
24. Rehman, S., Khan, M. F., Kim, H.-D. & Kim, S. Analog–digital hybrid computing with SnS2 memtransistor for low-powered sensor fusion. *Nat. Commun.* **13**(1), 2804. https://doi.org/10.1038/s41467-022-30564-5 (2022).
25. Roques-Carmes, C. *et al.* Heuristic recurrent algorithms for photonic Ising machines. *Nat. Commun.* **11**(1), 249. https://doi.org/10.1038/s41467-019-14096-z (2020).
26. Golub, G. H. & Van Loan, C. F. *Matrix Computations*, Vol. 3 (JHU Press, 2013).
27. Stewart, G. W. *Matrix Algorithms: Volume 1: Basic Decompositions* (SIAM, 1998).
28. Abdelfattah, A. *et al.* A survey of numerical methods utilizing mixed precision arithmetic. Preprint at http://arxiv.org/abs/2007.06674 (2020).
29. Haidar, A., Bayraktar, H., Tomov, S., Dongarra, J. & Higham, N. J. Mixed-precision iterative refinement using tensor cores on GPUs to accelerate solution of linear systems. *Proc. R. Soc. A* **476**(2243), 20200110 (2020).
30. Cowan, G. E. R., Melville, R. C. & Tsividis, Y. P. A VLSI analog computer/digital computer accelerator. *IEEE J. Solid-State Circuits* **41**(1), 42–53. https://doi.org/10.1109/JSSC.2005.858618 (2006).
31. Huang, Y., Guo, N., Seok, M., Tsividis, Y., Mandli, K. & Sethumadhavan, S. Hybrid analog-digital solution of nonlinear partial differential equations. In *Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture* 665–678. https://doi.org/10.1145/3123939.3124550 (2017).
32. Vogel, C. R. *Computational Methods for Inverse Problems*. https://doi.org/10.1137/1.9780898717570 (SIAM, 2002).
33. Zhu, Z., Sun, Y., White, J., Chang, Z. & Pang, S. Signal retrieval with measurement system knowledge using variational generative model. *IEEE Access* **8**, 47963–47972. https://doi.org/10.1109/ACCESS.2020.2978435 (2020).
34. Richardson, L. F. & Glazebrook, R. T. IX. The approximate arithmetical solution by finite differences of physical problems involving differential equations, with an application to the stresses in a masonry dam. *Philos. Trans. R. Soc. Lond. A* **210**(459–470), 307–357. https://doi.org/10.1098/rsta.1911.0009 (1911).
35. Buttari, A. *et al.* Mixed precision iterative refinement techniques for the solution of dense linear systems. *Int. J. High Perform. Comput. Appl.* **21**(4), 457–466. https://doi.org/10.1177/1094342007084026 (2007).
36. Hansen, P. C., Nagy, J. G. & O'Leary, D. P. *Deblurring Images: Matrices, Spectra, and Filtering*, Vol. 3 (SIAM, 2006).
37. Mandel, J. & Miranker, W. L. New techniques for fast hybrid solutions of systems of equations. *Int. J. Numer. Methods Eng.* **27**(3), 455–467. https://doi.org/10.1002/nme.1620270303 (1989).
38. Siddon, R. L. Fast calculation of the exact radiological path for a three-dimensional CT array. *Med. Phys.* **12**(2), 252–255. https://doi.org/10.1118/1.595715 (1985).
39. Jouppi, N. P. *et al.* In-datacenter performance analysis of a tensor processing unit. In *Proceedings of the 44th Annual International Symposium on Computer Architecture (ISCA '17)* 1–12. https://doi.org/10.1145/3079856.3080246 (2017).
40. Naseem, A., Rehman, M. A. & Abdeljawad, T. Some new iterative algorithms for solving one-dimensional non-linear equations and their graphical representation. *IEEE Access* **9**, 8615–8624. https://doi.org/10.1109/ACCESS.2021.3049428 (2021).
41. Li, F., Hu, G., Abdeljawad, T. & Abbas, M. A finite point algorithm for soil water-salt movement equation. *Adv. Differ. Equ.* **2021**(1), 179. https://doi.org/10.1186/s13662-021-03281-4 (2021).

## Acknowledgements

The authors would like to thank Dr. Stephen Becker (Department of Applied Mathematics, University of Colorado Boulder) for helpful discussions. This work is supported in part by the National Science Foundation (1932858), in part by the Army Research Office (W911NF2110321), and in part by the Office of Naval Research (N00014-20-1-2441).

## Author information

### Contributions

G. L. and S. P. initiated and supervised the project. Z. Z. and S. P. conceived the idea of fixed-point residual iteration. Z. Z. and A. B. K. built the hardware prototype and performed the numerical experiments. All authors participated in the discussion and writing of the manuscript.

## Ethics declarations

### Competing interests

The authors declare no competing interests.

## Additional information

### Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

## Rights and permissions

**Open Access** This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

## About this article

### Cite this article

Zhu, Z., Klein, A.B., Li, G. *et al.* Fixed-point iterative linear inverse solver with extended precision.
*Sci Rep* **13**, 5198 (2023). https://doi.org/10.1038/s41598-023-32338-5

