A blueprint for precise and fault-tolerant analog neural networks

Analog computing has reemerged as a promising avenue for accelerating deep neural networks (DNNs) to overcome the scalability challenges posed by traditional digital architectures. However, achieving high precision using analog technologies is challenging, as high-precision data converters are costly and impractical. In this work, we address this challenge by using the residue number system (RNS) and composing high-precision operations from multiple low-precision operations, thereby eliminating the need for high-precision data converters and avoiding information loss. Our study demonstrates that the RNS-based approach can achieve ≥99% FP32 accuracy with 6-bit integer arithmetic for DNN inference and 7-bit for DNN training. The reduced precision requirements imply that using RNS can achieve several orders of magnitude higher energy efficiency, at the same throughput, compared to conventional analog hardware with the same precision. We also present a fault-tolerant dataflow using redundant RNS to protect the computation against noise and errors inherent in analog hardware.


INTRODUCTION
Deep Neural Networks (DNNs) are widely employed across various applications today. Unfortunately, their compute, memory, and communication demands are continuously on the rise. The slow-down in CMOS technology scaling, along with these increasing demands, has led analog DNN accelerators to gain significant research interest. Recent research has been focused on using various analog technologies such as photonic cores [1][2][3][4][5][6][7], resistive arrays [8][9][10][11][12], switched capacitor arrays 13,14, Phase Change Materials (PCM) 15, Spin-Transfer Torque (STT)-RAM 16,17, etc., to enable highly parallel, fast, and efficient matrix-vector multiplications (MVMs) in the analog domain. These MVMs are fundamental components used to build larger general matrix-matrix multiplication (GEMM) operations, which make up more than 90% of the operations in DNN inference and training 18.
The success of this approach, however, is constrained by the limited precision of the digital-to-analog and analog-to-digital data converters (i.e., DACs and ADCs). In an analog accelerator, the data is converted between the analog and digital domains using data converters before and after every analog operation. Typically, a complete GEMM operation cannot be performed at once in the analog domain due to the fixed size of the analog core. Instead, the GEMM operation is tiled into smaller MVM operations. As a result, each MVM operation produces a partial output that must be accumulated with other partial outputs to obtain the final GEMM result. Concretely, an MVM operation consists of parallel dot products between b_w-bit signed weight vectors and b_in-bit signed input vectors, each with h elements, resulting in a partial output containing b_out bits of information, where b_out = b_in + b_w + log2(h) − 1. An ADC with a precision of at least b_out (i.e., b_ADC ≥ b_out) is required to ensure no loss of information when capturing these partial outputs. Unfortunately, the energy consumption of ADCs increases exponentially with bit precision (often quantified as the effective number of bits (ENOB)). This increase is roughly 4× for each additional output bit 19.
As a result, energy-efficient analog accelerator designs typically employ ADCs with lower precision than b_out and only capture the b_ADC most significant bits (MSBs) from the b_out bits of each partial output 20. Reading only MSBs causes information loss in each partial output, leading to accuracy degradation in DNNs, as pointed out by Rekhi et al. 20. This degradation is most pronounced in large DNNs and large datasets. Fig. 1 shows the impact of this approach on DNN accuracy in two tasks: (1) a two-layer convolutional neural network (CNN) for classifying the MNIST dataset 21, a simple task with only 10 classes, and (2) the ResNet50 CNN 22 for classifying the ImageNet dataset 23, a more challenging task with 1000 classes. As the vector size h increases, higher precision is needed at the output to maintain the accuracy in both DNNs. Moreover, ResNet50 experiences accuracy degradation at smaller values of h compared to the two-layer CNN. While using a higher-precision ADC can help recover from this accuracy degradation, it significantly reduces the energy efficiency of the analog hardware. Essentially, to efficiently execute large DNNs using analog accelerators, it is crucial to find a better way to achieve high accuracy than simply increasing the bit precision of the data converters.
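To make the information loss concrete, the following sketch (our illustration, not code from the paper) truncates a dot-product output to its b_adc most significant bits, mimicking the LP fixed-point readout described above; unsigned operands are used for simplicity:

```python
import numpy as np

def dot_product_msb_truncated(x, w, b_in=8, b_w=8, b_adc=8):
    """Dot product where the ADC captures only the b_adc MSBs of the
    b_out-bit partial output (a sketch of the LP fixed-point behavior)."""
    h = len(x)
    b_out = b_in + b_w + int(np.log2(h)) - 1   # full-precision output width
    full = int(np.dot(x, w))                   # exact integer dot product
    shift = max(0, b_out - b_adc)              # LSBs discarded by the ADC
    return (full >> shift) << shift            # keep only the MSBs

rng = np.random.default_rng(0)
h = 128
x = rng.integers(0, 2**8, size=h)
w = rng.integers(0, 2**8, size=h)
exact = int(np.dot(x, w))
approx = dot_product_msb_truncated(x, w)
# the discarded LSBs bound the error: 0 <= exact - approx < 2**(b_out - b_adc)
```

The truncation error is bounded by 2^(b_out − b_adc), which grows with the vector size h, matching the trend shown in Fig. 1.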
In this work, we present a universal residue number system (RNS)-based framework to overcome the abovementioned challenge in analog DNN inference as well as DNN training. RNS represents high-precision values using multiple low-precision integer residues for a selected set of moduli. As such, RNS enables high-precision arithmetic without any information loss on the partial products, even when using low-precision DACs and ADCs. Utilization of RNS leads to a significant reduction in the data converter energy consumption, which is the primary contributor to energy usage in analog accelerators. This reduction can reach up to six orders of magnitude compared to a conventional fixed-point analog core with the same output bit precision.
Our study shows that the RNS-based approach enables ≥99% FP32 inference accuracy by using only 6-bit data converters for state-of-the-art MLPerf (Inference: Datacenters) benchmarks 24 and Large Language Models (LLMs). We also demonstrate the applicability of this approach in training and fine-tuning state-of-the-art DNNs using low-precision analog hardware. The RNS approach, however, is susceptible to noise, as small errors in the residues scale up during output reconstruction, leading to larger errors in the standard representation. To address this issue, we incorporate the Redundant RNS (RRNS) error-correcting code [25][26][27] to introduce fault-tolerance capabilities into the dataflow.
As RNS is closed under multiplication and addition, no significant changes are required in the design of the analog core or in how GEMM operations are performed. Unlike a conventional analog core design, performing RNS operations necessitates an analog modulo operation. This operation can be implemented by using ring oscillators 28 in an analog electrical core or by using optical phase shifters in an analog optical core. Our proposed framework, however, remains agnostic to the underlying technology. Importantly, arbitrary fixed-point precision can be achieved by combining the positional number system (PNS) and RNS in analog hardware. Overall, our presented RNS-based methodology offers a solution combining high accuracy, high energy efficiency, and fault tolerance in analog DNN inference and training.

DNN Inference and Training Using RNS
The RNS represents an integer as a set of smaller (integer) residues. These residues are calculated by performing a modulo operation on the said integer using a selected set of n co-prime moduli. Let A be an integer. A can be represented in the RNS with n residues as {a_1, ..., a_n} for a set of co-prime moduli M = {m_1, ..., m_n}, where a_i = |A|_{m_i} ≡ A mod m_i for i ∈ {1, ..., n}. A can be uniquely reconstructed using the Chinese Remainder Theorem (CRT),

A = | Σ_{i=1}^{n} a_i · M_i · |M_i^{−1}|_{m_i} |_M,  (1)

where M_i = M/m_i and |M_i^{−1}|_{m_i} is the multiplicative inverse of M_i with respect to m_i, provided that A is within the range [0, M) where M = ∏_i m_i. Hereinafter, we refer to the integer A as the standard representation, while we refer to the set of integers {a_1, ..., a_n} simply as the residues.
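The forward conversion and CRT reconstruction can be sketched in a few lines (the co-prime moduli below are illustrative):

```python
from math import prod

def to_rns(A, moduli):
    """Forward conversion: residues a_i = A mod m_i."""
    return [A % m for m in moduli]

def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction, valid for A in [0, M)."""
    M = prod(moduli)
    A = 0
    for a_i, m_i in zip(residues, moduli):
        M_i = M // m_i
        # modular inverse of M_i modulo m_i (three-argument pow, Python 3.8+)
        A += a_i * M_i * pow(M_i, -1, m_i)
    return A % M

moduli = [63, 62, 61]     # pairwise co-prime 6-bit moduli, M = 238266
A = 123456
assert from_rns(to_rns(A, moduli), moduli) == A
```

Note that the reconstruction is unique only while the represented value stays below M = ∏ m_i, which is exactly why the moduli set must be sized to the dot-product range.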
A DNN consists of a sequence of L layers. During inference, where the DNN is previously trained and its parameters are fixed, only a forward pass is performed. Generically, the input X^{(ℓ+1)} to the (ℓ+1)-th layer of a DNN during the forward pass is the output generated by the previous ℓ-th layer:

X^{(ℓ+1)} = f(W^{(ℓ)} X^{(ℓ)}),  (2)

where W^{(ℓ)} is the weight matrix of the ℓ-th layer and f is a non-linear activation function. DNN training requires both forward and backward passes as well as weight updates. The forward pass in the training is performed the same way as in Eq. (2). After the forward pass, a loss value L is calculated using the output of the last layer and the ground truth. The gradients of the DNN activations and parameters with respect to L for each layer are calculated by performing a backward pass after each forward pass:

∂L/∂X^{(ℓ)} = W^{(ℓ)⊤} ∂L/∂X^{(ℓ+1)},  (3)

∂L/∂W^{(ℓ)} = ∂L/∂X^{(ℓ+1)} X^{(ℓ)⊤},  (4)

with the derivative of f folded into the propagated gradient. Using these gradients ΔW^{(ℓ)} = ∂L/∂W^{(ℓ)}, the DNN parameters are updated in each iteration i:

W^{(ℓ)}_{i+1} = W^{(ℓ)}_{i} − η ΔW^{(ℓ)}_{i},  (5)

with a step size η for a simple stochastic gradient descent (SGD) optimization algorithm.
Essentially, for each layer, one GEMM operation is performed in the forward pass and two GEMM operations are performed in the backward pass. Because RNS is closed under addition and multiplication operations, GEMM operations can be performed in the RNS space. Using the RNS, Eq. (2) can be rewritten as

X^{(ℓ+1)} = f( CRT( { | |W^{(ℓ)}|_{m_i} |X^{(ℓ)}|_{m_i} |_{m_i} }_{i=1}^{n} ) ),  (6)

where the matrix-vector products are computed independently on the residues for each modulus m_i, reduced modulo m_i, and converted back to the standard representation via the CRT before the non-linear function f is applied. The same approach applies for Eqs. (3) and (4) in the backward pass.
The moduli set M must be chosen to ensure that the outputs of the RNS operations are smaller than M, which means that

M = ∏_{i=1}^{n} m_i ≥ 2^{b_in + b_w + log2(h) − 1}  (7)

should be guaranteed for a dot product between b_in-bit input and b_w-bit weight vectors with h elements. This constraint prevents overflow in the computation.
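A small helper can check this overflow condition; it assumes the natural form of the constraint, namely that M must cover the b_out = b_in + b_w + log2(h) − 1 bits of dot-product information:

```python
from math import prod, ceil, log2

def no_overflow(moduli, b_in, b_w, h):
    """Check that the RNS range M covers the full dot-product output width
    (one plausible reading of the overflow constraint in Eq. (7))."""
    M = prod(moduli)
    b_out = b_in + b_w + ceil(log2(h)) - 1
    return M >= 2**b_out

# three co-prime 6-bit moduli give roughly 18 bits of range
assert no_overflow([63, 62, 61], b_in=6, b_w=6, h=64)
assert not no_overflow([63, 62, 61], b_in=6, b_w=6, h=512)
```

When the check fails, either more moduli must be added or the tile size h must be reduced.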

Precision and Energy Efficiency in the RNS-based Analog Core
The selection of the moduli set M, which is constrained by Eq. (7), impacts the achievable precision at the output as well as the energy efficiency of the RNS-based analog core. Table I compares RNS-based analog GEMM cores with example moduli sets and regular fixed-point analog GEMM cores. Here, we show two cases for the regular fixed-point representation: (1) the low-precision (LP) case where b_out > b_ADC = b_DAC, and (2) the high-precision (HP) case where b_out = b_ADC > b_DAC. It should be noted that all three analog cores represent data as fixed-point numbers. We use the term 'regular fixed-point core' to refer to a typical analog core that performs computations in the standard representation (without RNS). 'RNS-based core' refers to an analog core that performs computations on the fixed-point residues.
While the LP approach introduces b_out − b_ADC bits of information loss in every dot product, the HP approach uses high-precision ADCs to prevent this loss. For the RNS-based core, we picked b_in = b_w = b_ADC = b_DAC = ⌈log2 m_i⌉ ≡ b for ease of comparison against the fixed-point cores. Table I shows example moduli sets that are chosen to guarantee Eq. (7) for h = 128 while keeping the moduli under the chosen bit-width b. In this case, for n moduli with a bit-width of b, M covers ≈ n · b bits of range at the output. h is chosen to be 128 as an example considering the common layer sizes in the evaluated MLPerf (Inference: Datacenter) benchmarks. The chosen h provides high throughput with high utilization of the GEMM core.
Fig. 2a compares the error (with respect to FP32 results) observed when performing dot products with the RNS-based core and the LP fixed-point core with the same bit precision. Both cores use the configurations described in Table I for the example vector size h = 128. The larger absolute error observed in the LP fixed-point case illustrates the effect of the information loss mentioned above. The HP fixed-point case is not shown as it is equivalent to the RNS case.
Fig. 2b shows the energy consumption of DACs and ADCs per dot product for the three aforementioned analog hardware configurations. To achieve the same MVM throughput as the (LP/HP) fixed-point cores, the RNS-based core with n moduli must use n distinct MVM units and n sets of DACs and ADCs. This makes the energy consumption of the RNS-based core n× larger compared to the LP fixed-point approach. However, the LP fixed-point approach with low-precision ADCs experiences information loss in the partial outputs and hence has lower accuracy.
The RNS-based approach and the HP fixed-point approach provide the same bit precision (i.e., the same DNN accuracy). Yet, using the RNS-based approach is orders of magnitude more energy-efficient than the HP fixed-point approach. This is mainly because of the high cost of the high-precision ADCs required to capture the full output in the HP fixed-point approach. ADCs dominate the energy consumption, with approximately three orders of magnitude higher energy usage than DACs with the same bit precision. In addition, energy consumption in ADCs increases exponentially with increasing bit precision 19. This favors using multiple DACs and ADCs with lower precision in the RNS-based approach over using a single high-precision ADC. Briefly, the RNS-based approach provides a sweet spot between the LP and HP fixed-point approaches, achieving both high accuracy and high energy efficiency.

Accuracy in the RNS-based Analog Core
Fig. 3a compares the inference accuracy of MLPerf (Inference: Datacenters) benchmarks 24 and OPT 29 (a transformer-based LLM) when run on an RNS-based analog core and a fixed-point (LP) analog core. The HP fixed-point analog core results are not shown as they are equivalent to the RNS-based results. The evaluated DNNs, their corresponding tasks, and the datasets are listed in Table II. Fig. 3a shows that the RNS-based approach significantly ameliorates the accuracy drop caused by the low-precision ADCs used in the LP fixed-point approach for all the networks. By using the RNS-based approach, it is possible to achieve ≥99% of FP32 accuracy (this cut-off is defined in the MLPerf benchmarks 24) for all evaluated benchmarks when using residues with as low as 6 bits. This number can be lowered to 5 bits for BERT-large and RNN-T and to 4 bits for DLRM.

Fig. 2 | RNS-based versus regular fixed-point analog approaches. a The distribution of average error observed at the output of a dot product performed with the RNS-based analog approach (pink) and the LP regular fixed-point analog approach (cyan). Error is defined as the distance from the result calculated in FP32. The experiments are repeated for 10,000 randomly generated vector pairs with vector size h = 128. b Energy consumption of data converters (i.e., DACs and ADCs) per dot product for the RNS-based analog approach (pink) and the LP (cyan) and HP (dark blue) regular fixed-point analog approaches. See Methods for the energy estimation methodology.
Besides inference, the RNS-based approach opens the door for analog computing to be used in tasks that require higher precision than inference, such as DNN training. Figure 3b shows the loss during DNN training/fine-tuning. Table III reports the validation accuracies after FP32 and RNS-based low-precision training. Here, the GEMM operations during forward and backward passes of training follow the same methodology as inference, with weight updates carried out in FP32. Our experiments show that ≥99% FP32 validation accuracy is achievable after training ResNet50 from scratch using the RNS-based approach with only 6-bit moduli. Similarly, fine-tuning BERT-large and OPT-125M by using 5-bit and 7-bit moduli, respectively, can reach ≥99% FP32 validation accuracy. The results are noticeably promising, as previous efforts on analog DNN hardware that adopted the LP fixed-point approach never successfully demonstrated the training of state-of-the-art DNNs due to the limited precision of this approach.

Fig. 4 illustrates the dataflow of the RNS-based analog core when performing an MVM as part of DNN inference/training. An input vector X and a weight matrix W to be multiplied in the MVM unit are first mapped to signed integers. To mitigate the quantization effects, X and each row in W are scaled by an FP32 scaling factor that is unique to the vector (see Methods). The signed integers are then converted into RNS residues through a modulo operation (i.e., forward conversion). By construction, each residue is within the range [0, m_i). To achieve the same throughput as a fixed-point analog core, the RNS-based analog core with n moduli requires using n analog MVM units, one for each modulus, and running them in parallel. Each analog MVM unit requires a set of DACs for converting the associated input and weight residues into the analog domain. The MVM operations are followed by an analog modulo operation on each output residue vector. Thanks to the modulo operation, the output residues, to be captured by ADCs, are reduced back to the [0, m_i) range. Therefore, a bit precision of ⌈log2 m_i⌉ is adequate for both DACs and ADCs to perform input and output conversions without any information loss. The output residues are then converted back to the standard representation in the digital domain using Eq. (1) to generate the signed-integer output vector, which is then mapped back to an FP32 final output Y. The non-linear function f (e.g., ReLU, sigmoid, etc.) is then performed digitally in FP32.
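The end-to-end dataflow (forward conversion, one modular MVM per modulus, CRT reconstruction, and the signed mapping) can be emulated numerically. This sketch omits the FP32 scaling and the DAC/ADC quantization steps:

```python
import numpy as np
from math import prod

def rns_mvm(W, X, moduli):
    """Emulate the RNS MVM dataflow on signed integers: per-modulus MVM with
    an analog-style modulo, then CRT reconstruction and the signed mapping."""
    M = prod(moduli)
    outs = []
    for m in moduli:
        Wr, Xr = W % m, X % m            # forward conversion (handles negatives)
        outs.append((Wr @ Xr) % m)       # MVM + modulo, one unit per modulus
    Y = np.zeros(W.shape[0], dtype=object)
    for y_res, m in zip(outs, moduli):   # element-wise CRT reconstruction
        M_i = M // m
        Y = (Y + y_res * M_i * pow(M_i, -1, m)) % M
    # map back from [0, M) to signed integers
    return np.where(Y > M // 2, Y - M, Y)

rng = np.random.default_rng(1)
moduli = [63, 62, 61]
W = rng.integers(-31, 32, (8, 16))
X = rng.integers(-31, 32, 16)
assert np.array_equal(rns_mvm(W, X, moduli), W @ X)
```

The result matches the plain integer MVM exactly because the dot products stay within half of the dynamic range M, as required by the overflow constraint.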

Redundant RNS for Fault Tolerance
Analog compute cores are sensitive to noise. In the case of RNS, even small errors in the residues can result in a large error in the corresponding integer they represent. The Redundant Residue Number System (RRNS) [25][26][27] can detect and correct errors, making the RNS-based analog core fault tolerant. RRNS uses a total of n + k moduli: n non-redundant and k redundant. An RRNS(n + k, n) code can detect up to k errors and can correct up to ⌊k/2⌋ errors. In particular, the error in the codeword (i.e., the n + k residues representing an integer in the RRNS space) can be one of the following cases:

• Case 1: At most ⌊k/2⌋ residues have errors; the errors are correctable.

• Case 2: Between ⌊k/2⌋ and k residues have errors, or the codeword with more than k errors does not overlap with another codeword in the RRNS space; the error is detectable.

• Case 3: More than k residues have errors and the erroneous codeword overlaps with another codeword in the RRNS space; the error goes undetected.
Errors are detected by using majority logic decoding, wherein we divide the total n + k output residues into groups with n residues per group. One simple way of majority logic decoding in this context is to convert the residues in each group back to the standard representation via the CRT to generate an output value for each group and compare the results of the (n+k choose n) groups. If more than 50% of the groups have the same result in the standard representation, then the generated codeword is correct. This corresponds to Case 1. In contrast, not having a majority indicates that the generated codeword is erroneous and cannot be corrected. This corresponds to Case 2. In this case, the detected errors can be eliminated by repeating the entire calculation. In Case 3, the erroneous codeword generated by the majority of the groups overlaps with another codeword. As a result, more than 50% of the groups have the same incorrect result and the error goes undetected. To optimize the hardware performance of this process, more efficient base-extension-based algorithms 30 can be used for error detection instead of the CRT.
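A toy implementation of this majority-logic decoding (reconstructing from every group of n residues and taking the majority, a sketch rather than the hardware algorithm; the moduli are illustrative) behaves as described:

```python
from math import prod
from itertools import combinations
from collections import Counter

def crt(residues, moduli):
    M = prod(moduli)
    return sum(a * (M // m) * pow(M // m, -1, m)
               for a, m in zip(residues, moduli)) % M

def majority_decode(residues, moduli, n):
    """Majority-logic decoding sketch for RRNS(n+k, n): reconstruct from every
    group of n residues, keep values inside the legitimate range [0, M), and
    accept the majority (None means detectable but not correctable)."""
    M = prod(moduli[:n])                 # legitimate range of the code
    candidates = []
    for idx in combinations(range(len(moduli)), n):
        val = crt([residues[i] for i in idx], [moduli[i] for i in idx])
        if val < M:                      # values outside [0, M) are illegitimate
            candidates.append(val)
    value, count = Counter(candidates).most_common(1)[0]
    return value if count > len(candidates) // 2 else None

# RRNS(4, 2): non-redundant moduli 7, 9; redundant moduli 11, 13
rrns_moduli = [7, 9, 11, 13]
A = 50                                   # must lie in [0, 7*9) = [0, 63)
noisy = [A % m for m in rrns_moduli]
noisy[1] = (noisy[1] + 3) % 9            # inject a single residue error
assert majority_decode(noisy, rrns_moduli, n=2) == A
```

With k = 2 redundant moduli, a single erroneous residue (⌊k/2⌋ = 1) is corrected because the majority of the groups still reconstruct the original value.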
The final error probability in an RRNS code depends on the percentage of non-correctable errors observed in the residues. This probability is influenced by the chosen moduli set and the number of error correction iterations (see Methods). Let p_c, p_d, and p_u be the probabilities that Cases 1, 2, and 3 occur, respectively, when computing a single output. Overall, p_c + p_d + p_u = 1. For a single attempt (i.e., R = 1), the probability of producing an incorrect output integer is p_err(R = 1) = 1 − p_c = p_u + p_d. Generally, it is possible to repeat the calculations R times until no detectable error is found, at the expense of increased compute latency. In this case, the probability of having an incorrect output after R attempts of error correction is

p_err(R) = p_u (1 − p_d^R)/(1 − p_d) + p_d^R,  (8)

i.e., an undetectable error occurs in one of the R attempts or all R attempts end with a detectable error. As the number of attempts increases, the output error probability decreases and converges to lim_{R→∞} p_err(R) = p_u/(p_u + p_c). Fig. 5 shows p_err for different numbers of redundant moduli (k), attempts (R), and moduli sets with different bit-widths. Broadly, as the probability of a single residue error p increases, the output error probability tends to 1. For a given number of attempts, increasing the bit precision and the number of redundant moduli decreases p_err. For a fixed number of redundant moduli and a fixed number of bits per modulus, p_err decreases as the number of attempts increases.
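The repeated-attempt error probability can be computed directly; the closed form below follows from retrying whenever a detectable error occurs, and it reproduces the stated R = 1 value and the R → ∞ limit:

```python
def p_err(p_c, p_d, p_u, R):
    """Probability of an incorrect output after up to R attempts: either an
    undetectable error occurs at some attempt, or all R attempts are
    detectable-but-uncorrectable."""
    assert abs(p_c + p_d + p_u - 1.0) < 1e-12
    return p_u * (1 - p_d**R) / (1 - p_d) + p_d**R

p_c, p_d, p_u = 0.90, 0.09, 0.01
one_shot = p_err(p_c, p_d, p_u, R=1)        # equals p_u + p_d = 0.10
floor = p_u / (p_u + p_c)                   # limit as R grows
```

Each extra attempt multiplies the remaining detectable-error term by p_d, so p_err converges geometrically to the floor set by undetectable errors.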
Fig. 6 investigates the impact of noise on the accuracy of two large and important MLPerf benchmarks, ResNet50 and BERT-Large, when using RRNS. The two networks show similar behavior: adding extra moduli and increasing the number of attempts decrease p_err at the same value of p. ResNet50 requires ∼3.9 GigaMAC operations (GOp) for one inference on a single input image. For a 128 × 128 MVM unit, inferring an ImageNet image through the entire network involves computing ∼29.4M partial output elements. Therefore, we expect the transition point from an accurate network to an inaccurate network to occur at p_err ≤ 1/29.4M = 3.4 × 10^−8. This p_err transition point is ≤ 1/358.6M = 2.8 × 10^−9 for BERT-Large. Fig. 6, however, shows that the evaluated DNNs are more resilient to noise than expected: they are able to tolerate higher p_err while maintaining good accuracy. The accuracy of ResNet50 only starts degrading (below 99% FP32) when p_err ≈ 4.5 × 10^−5 (1000× higher than the estimated value) on average among the experiments shown in Fig. 6. This transition probability is p_err ≈ 4 × 10^−4 for BERT-Large (on average 100,000× higher than the estimated value).
The analog modulo operation can be performed electrically or optically. In the electronic domain, one can use ring oscillators, circuits that generate a continuous waveform by cycling through a series of inverters 28, to perform modulo operations. By carefully designing the parameters of the ring oscillator, it is possible to create an output frequency that corresponds to the desired modulus value. Alternatively, the phase of a signal can be used for performing modulo due to the periodicity of phases in optical systems. Optical phase is inherently modular with respect to 2π. By modulating the phase of an optical signal, one can achieve modulo operations in the analog domain. Using RNS requires forward and reverse conversion circuits to switch between the RNS and the standard number system. The forward conversion is a modulo operation, while the reverse conversion can be done using the CRT, mixed-radix conversion, or look-up tables. The (digital) hardware costs of these circuits can be reduced by choosing special moduli sets 31,32.
The RNS framework can be extended with the PNS to work with arbitrary precision, despite having DACs and ADCs with limited precision. For applications requiring higher-precision arithmetic than the example cases in this study (e.g., some high-performance computing applications, homomorphic encryption, etc.), a higher M value, and therefore moduli with higher bit-widths, might be necessary, which would be bound by the same limitations discussed in this paper. Instead, one can represent an integer value as D separate digits where each digit is represented as a set of residues in the RNS domain and has an RNS range of M. This hybrid scheme can achieve D log2 M bits of precision where D can be increased freely without increasing the bit precision of the data converters. Different from the RNS-only scheme, the hybrid scheme requires overflow detection and carry propagation from lower digits to higher digits. The overflow detection can be achieved using two sets of residues: primary and secondary. While the operations are performed with both sets of residues, base extension between the two sets helps detect any overflow and propagate the carry to the higher digits if required (see Methods).
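The digit-splitting idea can be sketched as follows; carry propagation and the primary/secondary residue mechanism are omitted, and the moduli are illustrative:

```python
from math import prod

def crt(residues, moduli):
    M = prod(moduli)
    return sum(a * (M // m) * pow(M // m, -1, m)
               for a, m in zip(residues, moduli)) % M

def to_hybrid(A, moduli, D):
    """Split a non-negative integer into D base-M digits, each digit stored
    as a set of RNS residues (a sketch of the PNS+RNS hybrid scheme)."""
    M = prod(moduli)
    digits = []
    for _ in range(D):
        digits.append([(A % M) % m for m in moduli])
        A //= M
    return digits                        # least-significant digit first

def from_hybrid(digits, moduli):
    """Reconstruct each digit via the CRT and recombine positionally."""
    M = prod(moduli)
    return sum(crt(d, moduli) * M**i for i, d in enumerate(digits))

hyb_moduli = [7, 11, 13]                 # M = 1001; two digits cover ~20 bits
value = 987_654
assert from_hybrid(to_hybrid(value, hyb_moduli, D=2), hyb_moduli) == value
```

Each digit only ever needs ⌈log2 m_i⌉-bit converters, while the representable range grows as M^D, which is the point of the hybrid scheme.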
In conclusion, our work provides a methodology for precise, energy-efficient, and fault-tolerant analog DNN acceleration. Overall, we believe that RNS is a crucial numeral system for the development of next-generation analog hardware capable of both inference and training of state-of-the-art neural networks for advanced applications such as generative artificial intelligence.

METHODS

Handling Negative Numbers with RNS
An RNS with dynamic range M allows representing values within the range [0, M). This range can be shifted to [−ψ, ψ], where ψ = ⌊(M − 1)/2⌋, to represent negative values. This is achieved by reassigning the values in (0, ψ] to be positive, 0 to be zero, and the numbers in (ψ, 2ψ] to be negative (i.e., [−ψ, 0)). Then, the values can be recovered uniquely by using the CRT with a slight modification:

A = CRT({a_i}) if CRT({a_i}) ≤ ψ, and A = CRT({a_i}) − M otherwise,  (9)

where CRT({a_i}) denotes the reconstruction in Eq. (1).
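A short sketch of this signed mapping (the moduli are illustrative):

```python
from math import prod

def to_signed_rns(A, moduli):
    """Residues of a possibly negative integer; Python's % operator already
    maps negative values into [0, m_i)."""
    return [A % m for m in moduli]

def from_signed_rns(residues, moduli):
    """CRT reconstruction followed by the shift described above: values
    greater than psi = (M-1)//2 are reinterpreted as negatives."""
    M = prod(moduli)
    A = sum(a * (M // m) * pow(M // m, -1, m)
            for a, m in zip(residues, moduli)) % M
    psi = (M - 1) // 2
    return A - M if A > psi else A

signed_moduli = [7, 11, 13]   # M = 1001, representable range [-500, 500]
for A in (-450, -1, 0, 1, 450):
    assert from_signed_rns(to_signed_rns(A, signed_moduli), signed_moduli) == A
```

This is the same two's-complement-like trick used in modular arithmetic generally: the upper half of [0, M) encodes the negative values.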

Data Converter Energy Estimation
The DAC and ADC energy numbers in Fig. 2b are estimated by using the equations formulated by Murmann 19,33. The energy consumption of a DAC per conversion is

E_DAC = 2^{b_DAC} · C_u · V_DD^2,  (10)

where C_u = 0.5 fF is a typical unit capacitance and V_DD = 1 V is the supply voltage 19. The energy consumption of an ADC per conversion is estimated as

E_ADC = k_1 · ENOB + k_2 · 4^{ENOB},  (11)

where k_1 ≈ 100 fJ and k_2 ≈ 1 aJ. E_ADC is dominated by the exponential term (i.e., k_2 · 4^{ENOB}) at large ENOB (≥ 10 bits).
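These estimates are easy to reproduce. The DAC expression below assumes a binary-weighted capacitive form, E ~ 2^b · C_u · V_DD^2, which should be checked against Murmann's formulation; the ADC expression follows the linear-plus-exponential model described above:

```python
def e_dac(b, Cu=0.5e-15, VDD=1.0):
    """DAC energy per conversion in joules (assumed binary-weighted
    capacitive form; Cu = 0.5 fF, VDD = 1 V)."""
    return (2**b) * Cu * VDD**2

def e_adc(enob, k1=100e-15, k2=1e-18):
    """ADC energy per conversion in joules: linear term plus the 4^ENOB
    term that dominates above roughly 10 bits."""
    return k1 * enob + k2 * 4**enob

# the exponential term makes each extra ADC bit roughly 4x more costly
high = e_adc(16)
low = e_adc(6)
```

At 6-bit precision the ADC energy is dominated by the linear term, while at 16 bits the 4^ENOB term dominates, which is why replacing one high-precision ADC with several low-precision ones pays off.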

Accuracy Modeling
Both RNS-based and regular fixed-point analog cores are modeled using PyTorch for estimating inference and training accuracy. Convolution, linear, and batched matrix multiplication (BMM) layers are performed as tiled-GEMM operations, which are computed tile-by-tile as a set of tiled-MVM operations. Each input, weight, and output of the tiled MVM is quantized with the desired bit precision.
Before quantization, the input vectors and weight tiles are first dynamically scaled, i.e., scaled at runtime, to mitigate the quantization effects as follows. For an h × h weight tile W_t, we denote each row vector as W_rt, where the subscript r stands for the row and t for the tile. Similarly, an input vector of length h is denoted as X_t, where t indicates the tile. Each weight row W_rt shares a single FP32 scale s^w_rt = max(|W_rt|) and each input vector X_t shares a single FP32 scale s^x_t = max(|X_t|). h scales per h × h weight tile and 1 scale per input vector, in total h + 1 scales, are stored for each tiled-MVM operation. The tiled MVM is performed between the scaled weight and input vectors, W̄_rt = W_rt/s^w_rt and X̄_t = X_t/s^x_t, to produce Ȳ_rt = W̄_rt X̄_t. The output Ȳ_rt is then quantized (if required) to resemble the output ADCs and multiplied back with the appropriate scales so that the actual output elements Y_rt = Ȳ_rt · s^w_rt · s^x_t are obtained. Here, the methodology is the same for RNS-based and regular fixed-point cores. For the RNS-based case, in addition to the description above, the quantized input and weight integers are converted into the RNS space before the tiled-MVM operations. MVMs are performed separately for each set of residues and are followed by a modulo operation before the quantization step. The output residues for each tiled MVM are converted back to the standard representation using the CRT.
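The dynamic scaling procedure can be sketched as follows (per-row weight scales, a per-vector input scale, and symmetric quantization; the RNS conversion step is omitted):

```python
import numpy as np

def quantize(v, bits):
    """Symmetric quantization of values in [-1, 1] to signed integer levels."""
    qmax = 2**(bits - 1) - 1
    return np.round(v * qmax), qmax

def scaled_tiled_mvm(Wt, Xt, bits=8):
    """Dynamic scaling around a quantized tiled MVM: one FP32 scale per
    weight row and one per input vector, as described above."""
    s_w = np.abs(Wt).max(axis=1, keepdims=True)   # h scales per h x h tile
    s_x = np.abs(Xt).max()                        # 1 scale per input vector
    Wq, qw = quantize(Wt / s_w, bits)
    Xq, qx = quantize(Xt / s_x, bits)
    Yq = Wq @ Xq                                  # integer MVM (the analog part)
    return (Yq / (qw * qx)) * s_w.ravel() * s_x   # undo quantization and scaling

rng = np.random.default_rng(2)
Wt = rng.standard_normal((16, 16))
Xt = rng.standard_normal(16)
Y = scaled_tiled_mvm(Wt, Xt)
# Y closely tracks the FP32 result Wt @ Xt
```

The scales keep the quantizer operating over the full [-1, 1] range of each row and vector, which is what limits the quantization error for a given bit-width.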
The GEMM operations (i.e., convolution, linear, and BMM layers) are sandwiched between an input operation O_in and an output operation O_out. This makes the operation order O_in-GEMM-O_out during the forward pass, and O_out-GEMM-O_in in the backward pass. O_in quantizes the input and weight tensors in the forward pass and is a null operation in the backward pass. In contrast, O_out is a null operation in the forward pass and quantizes the activation gradients in the backward pass. In this way, the quantization is always performed before the GEMM operation. The optimizer (i.e., SGD or Adam) is modified to keep a copy of the FP32 weights to use during the weight updates. Before each forward pass, the FP32 weights are copied and stored. After the forward pass, the quantized model weights are replaced by the previously stored FP32 weights before the step function so that the weight updates are performed in FP32. After the weight update, the model parameters are quantized again for the next forward pass. This high-precision weight update step is crucial for achieving high accuracy in training.
In Fig. 3b, all the convolution, linear, and BMM layers in the models were replaced by their quantized versions. We trained ResNet50 from scratch by using the SGD optimizer for 90 epochs with a momentum of 0.9 and a learning rate starting from 0.1. The learning rate was scaled down by 10× at epochs 30, 60, and 80. We fine-tuned BERT-large and OPT-125M from the implementations available in the Huggingface transformers repository 34. We used the Adam optimizer for both models with the default settings. The script uses a linear learning rate scheduler. The learning rate starts at 3e-05 and 5e-05 and the models are trained for 2 and 3 epochs, for BERT-Large and OPT-125M, respectively.

Error distribution in the RRNS code space
For an RRNS(n + k, n) with n non-redundant moduli (m_1, m_2, ..., m_n) and k redundant moduli (m_{n+1}, m_{n+2}, ..., m_{n+k}), the probability distributions (p_c, p_d, and p_u) of the different types of errors (Cases 1, 2, and 3 described in Redundant RNS for Fault Tolerance) are related to the Hamming distance distribution of the RRNS code space. In an RRNS(n + k, n), every integer is represented as n + k residues (r_i where i ∈ {1, ..., n + k}) and this vector of n + k residues is considered an RRNS codeword. A Hamming distance of η ∈ {0, 1, ..., n + k} between the original codeword and the erroneous codeword indicates that η out of the n + k residues are erroneous. The erroneous codewords create a new vector space of (n + k)-long vectors where at least one r_i is replaced with r′_i ≠ r_i, with i ∈ {1, ..., n + k} and r′_i < m_i. This vector space includes all the RRNS(n + k, n) codewords as well as other possible (n + k)-long vectors that do not overlap with any codeword in the RRNS code space. A vector represents a codeword and is in the RRNS code space if and only if it can be converted into a value within the legitimate range [0, M) of the RRNS(n + k, n) by using the CRT. The number of all vectors that have a Hamming distance of η from a codeword in RRNS(n + k, n) can be expressed as

V_η = Σ_{(i_1, ..., i_η)} (m_{i_1} − 1)(m_{i_2} − 1) · · · (m_{i_η} − 1),  (12)

where the summation runs over all (n+k choose η) distinct selections of η moduli. The number of codewords in the RRNS code space with a Hamming distance of η ∈ {0, 1, ..., n + k}, denoted D_η, is determined by the common-divisor count ζ(n + k, η) defined below (Eq. (13)) for k + 1 ≤ η ≤ n + k. For 1 ≤ η ≤ k, D_η = 0, and D_0 = 1.
ζ(n + k, η) represents the total number of non-zero common divisors within the legitimate range [0, M) for any n + k − η moduli out of the n + k moduli of the RRNS(n + k, n) code, where (m_{i_1}, m_{i_2}, ..., m_{i_λ}) with 1 ≤ λ ≤ n + k denotes a subset of the RRNS(n + k, n) moduli set. An undetectable error occurs only if a codeword with errors overlaps with another codeword in the same RRNS space. Given the distance distributions for the vector space (V_η, Eq. (12)) and the code space (D_η, Eq. (13)), the probability of observing an undetectable error (p_u) for RRNS(n + k, n) can be computed as

p_u = Σ_{η=k+1}^{n+k} (D_η / V_η) · p_E(η),  (15)

where p_E(η) is the probability of having η erroneous residues in a codeword, which can be calculated as

p_E(η) = (n+k choose η) · p^η · (1 − p)^{n+k−η},  (16)

for an error probability of p in a single residue. Eq. (13) indicates that D_η = 0 for up to η = k erroneous residues, so it is not possible for an erroneous codeword to overlap with another codeword in the RRNS code space. This guarantees the successful detection of the observed error. If the Hamming distance of the erroneous codeword is η ≤ ⌊k/2⌋, the error can be corrected by the majority logic decoding mechanism. In other words, the probability of observing a correctable error equals the probability of observing at most ⌊k/2⌋ errors in the residues:

p_c = Σ_{η=0}^{⌊k/2⌋} p_E(η).  (17)

All the errors that fall under neither the undetectable nor the correctable category are referred to as detectable but non-correctable errors, with probability p_d = 1 − (p_c + p_u). The equations in this section were obtained from the work conducted by Yang 27.
To model the error in the RNS core for the analysis shown in Fig. 6, p_c, p_d, and p_u are computed for a given RRNS(n + k, n) using Eqs. (15) and (17). Given the number of error correction attempts, the output error probability (p_err) is calculated according to Eq. (8). Random noise is injected at the output of every tiled-MVM operation using a Bernoulli distribution with probability p_err.

RNS Operations
The proposed analog RNS-based approach requires modular arithmetic. In this section, we discuss two ways of performing modular arithmetic in the analog domain: one electrical and one optical.

Modular Arithmetic with Ring Oscillators
In a ring oscillator, where each inverter has a propagation delay t_prop > 0, exactly one inverter has the same input and output (either both 1 or both 0) at any given time while the oscillator is on. The location of this inverter propagates along with the signal every t_prop and rotates due to the ring structure. Tracking the location of this inverter therefore reveals a modular behavior in the ring.
Let S_RO(t) be the state of a ring oscillator with N inverters. Here, S_RO(t) ∈ {0, ..., N − 1}, and S_RO(t) = k means that the (k + 1)-th inverter's input and output have the same value at time t. S_RO(t) keeps rotating between 0 and N − 1 as long as the oscillator is on. Figure 7a shows a simple example where N = 3. In the first t_prop time interval, the input and output of the first inverter are both 0; therefore, the state S_RO(t < t_prop) = 0. Similarly, when t_prop < t < 2t_prop, the input and output of the second inverter are 1, so S_RO(t_prop < t < 2t_prop) = 1. Here, the time between two consecutive states (i.e., t_prop) is fixed, and S_RO(t) rotates (0, 1, 2, 0, 1, ...). Assume the state of the ring oscillator is sampled periodically with a sampling period of T_s = A · t_prop. Then, the observed change in the state of the ring oscillator between two samples (S_RO(t = T_s) − S_RO(t = 0)) is equivalent to |A|_N, where A is a positive integer. Therefore, to perform a modulo operation against a modulus m, the number of inverters N should be equal to m. The dividend A and the sampling period can be adjusted by changing the analog input voltage to a voltage-to-time converter (VTC).
In a modular dot product or MVM operation, the dividend A is replaced by the output of the dot product. The analog dot products themselves can be performed using traditional methods, without modification, in any desired analog technology whose output can be represented as an analog electrical signal (e.g., a current or voltage) before the analog modulo.
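The rotation described above can be captured in a short discrete-time sketch: the index of the matching-I/O inverter advances one stage per propagation delay and wraps at N, so sampling after A · t_prop yields A mod N. The function name and the unit-step time model are illustrative, not part of the hardware description.

```python
def ring_oscillator_mod(A, N):
    """Discrete-time model of an N-inverter ring oscillator used as a
    modulo unit: the index of the inverter whose input and output match
    advances one stage per propagation delay and wraps around the ring."""
    state = 0                    # S_RO(t = 0)
    for _ in range(A):           # A propagation delays elapse before sampling
        state = (state + 1) % N  # the matching-I/O inverter moves one stage
    return state                 # S_RO(T_s) - S_RO(0) = |A|_N
```

For instance, with N = 3 inverters and a sampling period of 7 · t_prop, the sampled state change is 7 mod 3 = 1.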

Modular Arithmetic with Phase Shifters
The amount of phase shift introduced by a single dual-rail phase shifter of length L, when v and −v voltages are applied on the upper and bottom arms, respectively, is ∆Φ = 2π v L / V_{π·cm}, where V_{π·cm} is the modulation efficiency of the phase shifter and is a constant. ∆Φ is proportional to both the length L of the shifter and the applied voltage v. Figure 7b shows an example modular dot-product operation between two vectors, x and w, using cascaded dual-rail phase shifters. w is encoded digit by digit using phase shifters with lengths proportional to 2^j, where j represents the digit number. In the example, each element (i.e., w_0 and w_1) of the 2-element vector w consists of 3 digits and uses 3 phase shifters, with lengths L, 2L, and 4L. If the j-th digit of the i-th element of w, w_i^j = 1, a voltage v_i is applied to the phase-shifter pair (top and bottom) of length 2^j L. If the digit w_i^j = 0, no voltage is applied and therefore no phase shift is introduced to the input signal. To encode the second operand x, a voltage v_i proportional to x_i is applied to all non-zero digits of w_i. To take the modulo with a modulus m instead of 2π, the input x, and therefore the applied voltage v, should be multiplied by the constant 2π/m. For encoding x_i, a voltage v_i = x_i V_{π·cm} / (m L) should be applied so that the total phase shift at the end of the optical path is ∆Φ_total = |(2π/m) Σ_i x_i w_i|_{2π}. The resulting output values are collected at the end of the optical path in the form of the phase difference between input and output. These outputs are then re-multiplied by m/2π to obtain the outputs of the modular dot products for each residue.
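The phase accumulation described above can be emulated numerically as follows. The function name and the digit encoding (LSB-first binary digits of each w_i) are our own conventions, and the final rounding stands in for the phase read-out; this is a sketch of the arithmetic, not of the optics.

```python
import math

def phase_shifter_mod_dot(x, w_digits, m):
    """Emulate the cascaded dual-rail phase-shifter modular dot product.
    x: input vector; w_digits: per-element binary digits of w (LSB first);
    m: modulus. Each set digit j of w_i contributes a phase proportional
    to 2**j * x_i scaled by 2*pi/m; the optical phase wraps modulo 2*pi."""
    total_phase = 0.0
    for xi, digits in zip(x, w_digits):
        for j, bit in enumerate(digits):
            if bit:  # switch on: the phase shifter of length 2**j * L is driven
                total_phase += (2 * math.pi / m) * xi * (2 ** j)
    total_phase %= 2 * math.pi                          # phase is inherently mod 2*pi
    return round(total_phase * m / (2 * math.pi)) % m   # rescale back to mod m
```

For example, with x = (2, 3) and w = (5, 3) encoded as 3-digit binaries, the accumulated phase wraps to (2·5 + 3·3) mod 7 = 5 for m = 7.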

Extended RNS
By combining RNS and PNS, an integer value Z can be represented as D separate digits z_d, where d ∈ {0, 1, ..., D − 1} and 0 ≤ z_d < M, with M being the RNS range: Z = Σ_{d=0}^{D−1} z_d · M^d. This representation can provide up to D log_2 M bits of precision. Unlike the RNS-only scheme, this hybrid scheme requires carry propagation from lower digits to higher digits. For this purpose, one can use two sets of moduli, primary and secondary, where every operation is performed for both sets of residues. After every operation, overflow is detected in each digit and carried over to the higher-order digits.
Let us pick n_p primary moduli m_i, where i ∈ {1, ..., n_p}, and n_s secondary moduli m_j, where j ∈ {1, ..., n_s}, with m_i ≠ m_j for all {i, j}. Here, M = M_p · M_s = Π_{i=1}^{n_p} m_i · Π_{j=1}^{n_s} m_j is large enough to represent the largest possible output of the operations performed in this number representation, and M_p and M_s are co-prime.
To execute an operation in this hybrid number system, the operation is performed separately for each digit of the output. These per-digit operations are independent of one another and can be parallelized, except for the overflow detection and carry propagation. Assume z_d = z_d|p;s consists of primary and secondary residues and is a calculated output digit of an operation before overflow detection.
z_d can be decomposed as z_d|p = Q_d|p M_p + R_d|p, where Q_d|p and R_d|p are the quotient and the remainder of the digit with respect to the primary RNS. To detect a potential overflow in the digit z_d, a base extension from the primary to the secondary RNS is performed on z_d|p, and the base-extended residues are compared with the original secondary residues of the digit, z_d|s. If the residues are the same, there is no overflow (i.e., Q_d|p;s = 0), and both primary and secondary residues are kept without any carry moved to the next higher digit. Conversely, if the base-extended secondary residues and the original secondary residues differ, an overflow exists (i.e., Q_d|p;s ≠ 0). In the case of overflow, the remainder in the secondary RNS, R_d|s, is calculated through a base extension from the primary to the secondary RNS on R_d|p. The full quotient Q_d|p;s is then propagated to the higher-order digit. Algorithm 1 shows the pseudocode for handling an operation □ using the extended RNS representation. The operation can be replaced by any operation that is closed under RNS.
It should be noted that z_d|p;s is not always computed as x_d|p;s □ y_d|p;s. For operations such as addition, each digit before carry propagation is computed by simply adding the corresponding digits of the two inputs.

Algorithm 1: Pseudocode for performing the operation □ using the hybrid number system. Here, x and y are the inputs of the operation □ and z is the output with D digits. z_d represents the digits of the output, where z_d|p are the primary residues and z_d|s are the secondary residues. Primary and secondary residues together are referred to as z′_d|p;s. Q is the quotient and R is the remainder, where z_d = Q_d M_p + R_d. p2s() and s2p() refer to the base extension algorithms from primary to secondary residues and from secondary to primary residues, respectively.
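The digit-wise addition with overflow detection can be sketched as below, assuming small pairwise co-prime moduli. The helper names `crt`, `to_digit`, and `add_extended` are ours, and the primary-to-secondary base extension is emulated here by direct CRT reconstruction rather than a hardware-friendly base extension algorithm.

```python
from math import prod

def crt(residues, moduli):
    """Reconstruct the unique integer in [0, prod(moduli)) from its residues."""
    M = prod(moduli)
    x = 0
    for r, m in zip(residues, moduli):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # pow(..., -1, m) is the modular inverse
    return x % M

def to_digit(v, mp, ms):
    """Encode a digit value as (primary residues, secondary residues)."""
    return ([v % m for m in mp], [v % m for m in ms])

def add_extended(xd, yd, mp, ms):
    """Add two extended-RNS numbers digit by digit (digits LSB first).
    Overflow past Mp = prod(mp) is detected by comparing the base-extended
    primary residues with the secondary residues; the quotient becomes a
    carry into the next higher digit."""
    Mp = prod(mp)
    out, carry = [], 0
    for (xp, xs), (yp, ys) in zip(xd, yd):
        zp = [(a + b + carry) % m for a, b, m in zip(xp, yp, mp)]
        zs = [(a + b + carry) % m for a, b, m in zip(xs, ys, ms)]
        Rp = crt(zp, mp)                    # remainder with respect to Mp
        if [Rp % m for m in ms] == zs:
            carry = 0                       # residues agree: no overflow
        else:
            z = crt(zp + zs, mp + ms)       # full value in [0, Mp*Ms)
            carry = z // Mp                 # quotient propagates as carry
        out.append(to_digit(Rp, mp, ms))
    return out
```

For example, with primary moduli {5, 7} (M_p = 35) and secondary moduli {9, 11}, adding 30 and 10 as two-digit numbers overflows the first digit and produces digits (5, 1), i.e., 5 + 1·35 = 40.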

Figure 1: Inference accuracy versus vector size for varying data bit-widths in a conventional analog core. a Inference accuracy for a two-layer CNN classifying handwritten digits from the MNIST dataset. b Inference accuracy for ResNet50 classifying images from the ImageNet dataset, evaluated in an analog core with varying precision b and vector sizes h. For both a and b, b-bit precision means b = b_DAC = b_ADC < b_out, where b varies between 2 and 8.

Figure 2: Comparison of the RNS-based and regular

Figure 3: Accuracy performance of the RNS-based analog core. a Inference accuracy of regular fixed-point (LP) and RNS-based cores (see Table I) on the MLPerf (Inference: Datacenters) benchmarks. The accuracy numbers are normalized to the FP32 accuracy. b-d Loss during training for FP32 and the RNS-based approach with varying moduli bit-widths. ResNet50 (b) is trained from scratch for 90 epochs using the SGD optimizer with momentum. BERT-Large (c) and OPT-125M (d) are fine-tuned from pre-trained models, using the Adam optimizer with a linear learning-rate scheduler for 2 and 3 epochs, respectively. All inference and training experiments use FP32 for all non-GEMM operations.

Figure 4:

Figure 5: Calculated output error probability (p_err) versus single-residue error probability (p). a-c p_err for one (a), two (b), and infinite (c) error-correction attempts and a varying number of redundant moduli (k).

Figure 6:

Here, R_d|p = z_d|p. Q_d|s can then be computed as Q_d|s = |(z_d|s − R_d|s) · M_p^{−1}|_{M_s}, where |M_p · M_p^{−1}|_{M_s} ≡ 1. Q_d|p is calculated through a base extension from the secondary to the primary RNS on the computed Q_d|s.

Figure 7: Analog modulo implementations. a Modulo operation performed using a ring oscillator. A ring oscillator with N = 3 inverters is shown performing modulo against a modulus m = 3. This operation is performed after every analog dot product to obtain a modular dot product. b Modular dot product performed using phase shifters. A modular dot-product operation between two 2-element vectors x and w, each with 3 digits, is shown using a dual-rail set of cascaded phase shifters. The transistor switch turns on and supplies voltage to the phase shifter when the corresponding digit of w is 1.

Table I: Data and data converter precision in RNS-based, LP fixed-point, and HP fixed-point analog cores.
RNS-based core (this work): b_in, b_w | b_DAC | log_2 M | b_ADC | Moduli Set (M) | RNS Range (M)
LP fixed-point core: b_DAC | b_out | b_ADC | Lost Bits
HP fixed-point core: b_DAC | b_out | b_ADC

Table III: Validation accuracy results after training/fine-tuning.