Radix-4 CORDIC algorithm based low-latency and hardware efficient VLSI architecture for Nth root and Nth power computations

In this article, a low-complexity VLSI architecture based on a radix-4 hyperbolic COordinate Rotation DIgital Computer (CORDIC) is proposed to compute the Nth root and Nth power of a fixed-point number. The most recent techniques use the radix-2 CORDIC algorithm to compute the root and power, and the high computation latency of radix-2 CORDIC is the primary concern for designers. The Nth root and Nth power computations are divided into three phases, and each phase is performed by a different class of the proposed modified radix-4 CORDIC algorithms. Although radix-4 CORDIC converges faster with fewer recurrences, it demands more hardware resources and computational steps due to its intricate angle selection logic and variable scale factor. We employ the modified radix-4 hyperbolic vectoring (R4HV) CORDIC to compute logarithms, the radix-4 linear vectoring (R4LV) CORDIC to perform division, and the modified scaling-free radix-4 hyperbolic rotation (R4HR) CORDIC to compute exponentials. The criteria to select the amount of rotation in R4HV CORDIC are complicated and depend on the coordinates $X^j$ and $Y^j$ of the rotating vector. In the proposed modified R4HV CORDIC, we derive simple selection criteria based on the fact that the inputs to R4HV CORDIC are related. The proposed criteria depend only on the coordinate $Y^j$, which reduces the hardware complexity of the R4HV CORDIC. The R4HR CORDIC has a complex scale factor, and compensating for it requires complex hardware. The complexity of R4HR CORDIC is reduced by pre-computing the scale factor for the initial iterations and by employing scaling-free rotations for the later iterations. Quantitative hardware analysis suggests better hardware utilization than the recent approaches. The proposed architecture is implemented on a Virtex-6 FPGA, and the FPGA implementation demonstrates 19% less hardware utilization with better error performance than the approach based on the radix-2 CORDIC algorithm.


Radix-2 CORDIC algorithm
The CORDIC is well known for calculating complex mathematical functions using very simple hardware. The various classes of the CORDIC algorithm can be created by choosing an appropriate operating mode (vectoring or rotation) and coordinate system (circular, hyperbolic, or linear). The generalized form is given in Eq. (1), where the parameters $q$, $\beta_j$, and $\alpha_j$ indicate the coordinate system, rotation angle, and direction of the micro-rotation, respectively. By choosing the appropriate value of $q$ and $\alpha_j$, six different classes of the CORDIC algorithm can be generated. For the root and power calculations, the circular classes are not required and are not discussed here. The output of the other classes of the CORDIC algorithm after convergence and the initial values used to achieve the output are listed in Table 1. The coordinate equations for HV-CORDIC and HR-CORDIC can be derived from Eq. (1) by taking $q = -1$. For hyperbolic CORDIC to achieve convergence, the iterations with indexes $j = 4, 13, 40, \ldots$ (each repeated index $k$ is followed by $3k + 1$) need to be repeated. The convergence criterion of HV-CORDIC is $|\tanh^{-1}(Y^0/X^0)| \le 1.1182$. Similarly, the convergence criterion of HR-CORDIC is $|Z^0| \le 1.1182$. Among all six classes of the CORDIC algorithm, LV and LR have the simplest convergence, and they are very similar to the shift-and-accumulate architecture of a conventional multiplier. The aforementioned hyperbolic computation scales the coordinates by $K_h = \prod_{j=1}^{n} \sqrt{1 - 2^{-2j}}$. However, this scale factor can be ignored for HV-CORDIC, as only the value of the Z coordinate is required after convergence. For HR-CORDIC, the scale factor can be compensated by choosing the initial value of the X coordinate as $X^0 = 1/K_h$. The implementation of root and power computations using these classes of the CORDIC algorithm is discussed next.
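For reference, the unified CORDIC iteration that the parameters $q$, $\beta_j$, and $\alpha_j$ describe is commonly written in the form sketched below. The sign convention and the explicit values of $\beta_j$ are the usual textbook ones and are assumed here; they are chosen to agree with the statement that $q = -1$ selects the hyperbolic classes.

```latex
\begin{aligned}
X^{j+1} &= X^{j} - q\,\alpha_j\,2^{-j}\,Y^{j},\\
Y^{j+1} &= Y^{j} + \alpha_j\,2^{-j}\,X^{j},\\
Z^{j+1} &= Z^{j} - \alpha_j\,\beta_j,
\end{aligned}
\qquad
\beta_j =
\begin{cases}
\tan^{-1}\!\left(2^{-j}\right), & q = 1 \quad \text{(circular)}\\
2^{-j}, & q = 0 \quad \text{(linear)}\\
\tanh^{-1}\!\left(2^{-j}\right), & q = -1 \quad \text{(hyperbolic)}
\end{cases}
```

Under this convention, $\alpha_j \in \{-1, +1\}$ is chosen to drive $Z^j$ towards zero in rotation mode and $Y^j$ towards zero in vectoring mode.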

Conventional architecture to compute Nth root and Nth power
The conventional way to determine the root and power is based on the identities $P^{1/N} = e^{(\ln P)/N}$ and $P^{N} = e^{N \ln P}$. A specific CORDIC class may be used to implement the logarithm and exponential operations needed for these identities. In the classical approach, the entire computation is separated into three phases. First, $\ln P$ is computed using HV-CORDIC. Next, multiplication is performed to compute the Nth power using linear rotation mode CORDIC (LR-CORDIC), and division is performed to compute the Nth root using linear vectoring mode CORDIC (LV-CORDIC). Finally, the exponential is computed using the hyperbolic rotation mode CORDIC (HR-CORDIC). Figure 1 demonstrates this approach. If the HV-CORDIC is initialized with the inputs $Y^0 = P - 1$, $X^0 = P + 1$, and $Z^0 = 0$, then the logarithm follows from its Z output as $\ln P = 2\tanh^{-1}\!\left(\frac{P-1}{P+1}\right) = 2Z^N$.
From this discussion, it is clear that the outputs $X^N$ and $Y^N$ of the HV-CORDIC are not required for further calculation, and hence scale factor compensation is not required for HV-CORDIC. As shown in Fig. 1, the multiplication and division are performed for power and root computing using LR-CORDIC and LV-CORDIC, respectively. HR-CORDIC computes the final exponential.
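The logarithm phase described above can be sketched in a few lines of floating-point Python. This is a minimal illustration of radix-2 hyperbolic vectoring, not the hardware design: the function name `hv_cordic_ln`, the iteration count, and the use of `math.atanh` in place of a ROM of pre-computed angles are all assumptions made for the example.

```python
import math

def hv_cordic_ln(P, n=30):
    """Radix-2 hyperbolic vectoring CORDIC sketch returning an estimate of ln(P)."""
    x, y, z = P + 1.0, P - 1.0, 0.0      # X0 = P + 1, Y0 = P - 1, Z0 = 0
    repeats = {4, 13, 40}                # indexes executed twice for convergence
    for j in range(1, n + 1):
        for _ in range(2 if j in repeats else 1):
            alpha = -1.0 if y >= 0 else 1.0   # vectoring: drive y towards zero
            t = 2.0 ** -j
            x, y = x + alpha * t * y, y + alpha * t * x
            z -= alpha * math.atanh(t)        # accumulate the rotation angle
    # z approximates atanh((P - 1) / (P + 1)) = 0.5 * ln(P)
    return 2.0 * z
```

The X and Y outputs are never rescaled, which mirrors the remark that scale factor compensation is unnecessary when only Z is used. The sketch is only expected to converge for P inside the HV-CORDIC convergence range.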
The problem with this architecture is that the values of P and N are limited by the convergence criteria of the various classes of CORDIC algorithms. The range of P can be derived using the convergence criterion of HV-CORDIC, i.e. $|\tanh^{-1}(Y^0/X^0)| \le 1.1182$, together with the requirement that the input $X^0$ be positive. Based on the inputs ($Y^0 = P - 1$ and $X^0 = P + 1$) of HV-CORDIC, the range of P can be derived using the constraints $\left|\tanh^{-1}\!\left(\frac{P-1}{P+1}\right)\right| \le 1.1182$ and $P + 1 > 0$.
Since $\tanh^{-1}\!\left(\frac{P-1}{P+1}\right) = \frac{1}{2}\ln P$, these constraints give $|\ln P| \le 2.2364$, i.e. approximately $P \in [0.107, 9.36]$. Such a small range of P limits the real-time applications of this standard architecture. From Eq. (5), it is clear that the input range of the HV class has to be increased to extend the range of P. For example, if HV-CORDIC can converge in the range $|\tanh^{-1}(Y^0/X^0)| \le 3$, then the range of P can be extended to $P \in \left[\frac{1}{403.43}, 403.43\right]$.
Two recent approaches have been proposed to expand the range of P. In the first approach, negative-indexed iterations were proposed for the HV and HR CORDICs. However, the additional negative-indexed iterations increase the number of iterative stages, which requires additional computational resources.
In another study (ref. 26), binary logarithms ($\log_2(\cdot)$) and binary exponentials ($2^{(\cdot)}$) are used to compute the Nth root and Nth power, as illustrated in Eq. (8): $P^{1/N} = 2^{(\log_2 P)/N}$ and $P^{N} = 2^{N \log_2 P}$. The first step of this approach is to bring P into the range that can be processed by the binary HV-CORDIC (BHV-CORDIC) by normalizing P. The normalization factor is always an integer power of two. As a result, this approach does not require additional negative-indexed iterations. The value of P is normalized as $P = 2^{q} \times p$, where $p \in [1, 2]$ (Eq. 9). Later, the binary logarithm is calculated using a simple adder as $\log_2 P = q + \log_2 p$. In the architecture presented in ref. 26, the authors use BHV-CORDIC to compute $\log_2 p$. Similarly, the binary exponential $2^{(\cdot)}$ of the real number V is computed by decomposing V into integer ($V_I$) and fraction ($V_F$) parts as $2^{V} = 2^{V_I} \times 2^{V_F}$. Here, $V_I$ is an integer, and the multiplication by $2^{V_I}$ can be computed using a left shift by $V_I$ bits. The $2^{V_F}$ is computed with a BHR-CORDIC. This method requires only a small convergence range (i.e., $|Z^0| \le 1$) of BHR-CORDIC, as $V_F$ is a fraction. As a result, this approach does not require the negative-indexed iterations. However, both architectures suffer from very high hardware utilization, as radix-2 CORDIC generates only one bit of precision per iteration. Radix-4 CORDIC needs fewer iterations, but the selection criteria of R4HV-CORDIC to choose the amount of rotation are complicated. Also, the scale factor of R4HR-CORDIC is variable, and its compensation requires dedicated hardware. In this article, we modify the architectures of the R4HV and R4HR CORDICs to simplify the selection criteria and the re-scaling of the scale factor for root and power computations. The proposed methodology brings the complexity of radix-4 CORDIC below that of the standard algorithm.

Proposed methodology
The high computation latency and hardware utilization of the existing designs are the primary concerns, as radix-2 CORDIC produces 1-bit precision in each iteration. In a pipelined architecture, the insertion of parallelism between two iterations costs a lot of pipeline resources. The total computational latency of the architectures presented in refs. 25 and 26 is 81 and 73, respectively. In the proposed design, we attempt to reduce the latency and hardware utilization by introducing the modified R4HV-CORDIC to compute the logarithm and the R4HR-CORDIC to compute the exponential. The computational complexity of high-radix CORDIC algorithms other than radix-4 is very high, because not all values of the selection function are integer powers of two. For example, the radix-8 CORDIC algorithm has a selection function ranging from -4 to +4, and the multiplication of the selection function with the coordinates requires four extra adders in each iteration. As a result, we use the radix-4 CORDIC algorithm in the proposed design.
In the proposed methodology, the computation of $P^{1/N}$ and $P^{N}$ is based on the base-4 logarithm and exponential: $P^{1/N} = 4^{(\log_4 P)/N}$ and $P^{N} = 4^{N \log_4 P}$.
We use the modified R4HV-CORDIC to compute $\log_4(\cdot)$ and the R4HR-CORDIC to compute $4^{(\cdot)}$. The properties of natural hyperbolic rotation can also be proved for hyperbolic rotation in base 4, as given in ref. 26. For base-4 hyperbolic rotation, $\tanh_4(a)$ can be defined as $\tanh_4(a) = \frac{4^{a} - 4^{-a}}{4^{a} + 4^{-a}}$. From this definition, the relation between the inverse hyperbolic function and the logarithm for base 4 is $\tanh_4^{-1}(a) = \frac{1}{2}\log_4\!\left(\frac{1+a}{1-a}\right)$.
Figure 2a,b demonstrate the proposed root and power computation methodology, respectively. The range of the variables at the different stages is also shown in Fig. 2. The input range of P is taken as $P \in [10^{-6}, 10^{6}]$ and $P \in [10^{-2}, 10^{2}]$ for root and power computation, respectively. The input range of R4HV-CORDIC is only $\left[\frac{1}{4.19}, 4.19\right]$, as discussed in the next section. Hence, normalization is used to bring the range of P down to the input range of R4HV-CORDIC. The normalization is performed using the relation $P = 4^{q} \times p$; hence, q lies in $[-9, 9]$ and $[-3, 3]$ after normalization for root and power computation, respectively. We use the modified R4HV-CORDIC to compute $\log_4 p$, and $\log_4 P$ is computed by adding $\log_4 p$ to q using a simple adder in both computations. In the next phase, R4LV-CORDIC is used to divide $\log_4 P$ by N for root computation, and a simple multiplier is used to multiply $\log_4 P$ by N for power computation. Finally, the exponential required to compute $P^{1/N}$ and $P^{N}$ is computed using the R4HR-CORDIC. Normalization is used to bring the input of R4HR-CORDIC within its convergence range. The exponential $4^{(\cdot)}$ of the real number V is computed by decomposing V into integer ($V_I$) and fraction ($V_F$) parts as $4^{V} = 4^{V_I} \times 4^{V_F}$. In the proposed methodology, the multiplication by $4^{V_I}$ is computed using a left shift by $2V_I$ bits. The $4^{V_F}$ is computed with the R4HR-CORDIC. In the following section, the modified R4HV and R4HR CORDICs are discussed.
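The data flow of the root computation described above (normalize, take the base-4 logarithm, divide by N, split the exponent, exponentiate, shift) can be checked with a short floating-point sketch. The function names are hypothetical, and each CORDIC block is replaced by a call to the `math` library, so this only illustrates the arithmetic of the methodology and says nothing about the hardware.

```python
import math

def normalize_base4(P):
    """Return (q, p) with P = 4**q * p and 1 <= p < 4."""
    q, p = 0, P
    while p >= 4.0:
        p, q = p / 4.0, q + 1
    while p < 1.0:
        p, q = p * 4.0, q - 1
    return q, p

def nth_root_base4(P, N):
    """Compute P**(1/N) through base-4 logarithm and exponential."""
    q, p = normalize_base4(P)
    log4_P = q + math.log(p, 4)        # stand-in for R4HV-CORDIC plus adder
    V = log4_P / N                     # stand-in for the R4LV-CORDIC division
    V_I = math.floor(V)                # integer part of the exponent
    V_F = V - V_I                      # fractional part, 0 <= V_F < 1
    frac = 4.0 ** V_F                  # stand-in for R4HR-CORDIC
    return math.ldexp(frac, 2 * V_I)   # multiply by 4**V_I, i.e. shift by 2*V_I bits
```

For example, `nth_root_base4(1000.0, 3)` follows the path q = 4, p = 1000/256, and returns a value close to 10. Replacing the division by a multiplication by N gives the corresponding sketch for the Nth power.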

Modified radix-4 CORDIC
The various classes of the radix-4 CORDIC algorithm that have been utilized in the proposed methodology are discussed here.

Modified radix-4 HV-CORDIC
The R4HV-CORDIC for base-4 logarithm computation is defined in Eq. (16), where j is an integer starting with 1 and the selection function $\sigma_j \in \{-2, -1, 0, 1, 2\}$. The radix-4 CORDIC does not require repeating any iteration for convergence. The rotation introduces a scale factor $K = \prod_{j} \sqrt{1 - \sigma_j^{2}\,4^{-2j}}$, which depends on the selected $\sigma_j$. The problems with the R4HV-CORDIC are the selection criteria used to choose $\sigma_j$ and the complex scale factor K. Since only the value of the Z variable is used after convergence, re-scaling of the rotated vector is not required. However, in the R4HV-CORDIC algorithm, the selection criteria to choose $\sigma_j$ are complex and depend on both coordinate values $X^j$ and $Y^j$. The convergence of the R4HV-CORDIC can be derived using the SRT-division method as given in refs. 27, 28. According to the SRT division, the variable $Y^j$ is converted into a new variable $W^j = 4^{j} Y^j$, and Eq. (16) is rewritten in terms of $W^j$. To guarantee the convergence of the algorithm, the variable $W^j$ must be bounded between the lower (L) and upper (U) limits, which are defined as $L = \left(a - \frac{p}{r-1}\right) X^j$ and $U = \left(a + \frac{p}{r-1}\right) X^j$ for radix-r SRT division. According to the SRT division method, to achieve maximum overlap between the intervals used for selecting different values of $\sigma_j$, and for minimal redundancy, we choose $p = r/2$ (ref. 27). For radix-4 SRT division, these limits become $L = \left(a - \frac{2}{3}\right) X^j$ and $U = \left(a + \frac{2}{3}\right) X^j$. We choose $\sigma_j = a$ according to the criteria given in Eq. (18) to guarantee convergence.
The intervals to select $\sigma_j$ can be derived using the criteria given in Eq. (18). The value of the variable $W^j$ should be bounded within this interval in each iteration to ensure convergence. For example, to select $\sigma_j = 2$, $W^j$ must fall within the interval $I_2 : \left[\frac{4}{3}X^j, \frac{8}{3}X^j\right]$. Similarly, to select $\sigma_j = 1$, $W^j$ must fall within the interval $I_1 : \left[\frac{1}{3}X^j, \frac{5}{3}X^j\right]$. The overlap between these two intervals is $\left[\frac{4}{3}X^j, \frac{5}{3}X^j\right]$. Later, any value within this overlap can be selected as the comparison point. The criteria and overlapping intervals for a particular selection function are listed in Table 3.
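The interval arithmetic in the example above is easy to tabulate. The following sketch uses exact fractions and expresses every bound as a multiple of $X^j$; the helper names `interval` and `overlap` are hypothetical, and the redundancy factor 2/3 is the radix-4 value of $p/(r-1)$ with $p = r/2$.

```python
from fractions import Fraction

RHO = Fraction(2, 3)  # p / (r - 1) with r = 4 and p = r / 2

def interval(a):
    """Bounds of W^j / X^j for which sigma_j = a may be selected."""
    return (a - RHO, a + RHO)

def overlap(a, b):
    """Common part of the selection intervals of two digits, or None."""
    lo = max(interval(a)[0], interval(b)[0])
    hi = min(interval(a)[1], interval(b)[1])
    return (lo, hi) if lo <= hi else None

# Overlap between each pair of neighbouring digits
for a in range(-2, 2):
    print(a, a + 1, overlap(a, a + 1))
```

Each pair of neighbouring digits shares an overlap of width 1/3 (in units of $X^j$), which is the freedom used to pick hardware-friendly comparison points.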
The convergence of R4HV-CORDIC bounds the angle $\tanh_4^{-1}(Y^0/X^0)$ that the iterations can accumulate. The inputs to R4HV-CORDIC are $Y^0 = p - 1$ and $X^0 = p + 1$. The range of p is limited by this convergence range, in the same way as discussed for Eq. (5), and the resulting constraints give $p \in \left[\frac{1}{4.19}, 4.19\right]$. As discussed earlier, this small convergence range is sufficient, as the output of the normalizer lies between 1 and 4 in the proposed architecture.
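A floating-point sketch of radix-4 hyperbolic vectoring shows how the Z accumulator yields the base-4 logarithm of a normalized input. Several choices here are assumptions made for illustration: rotation j uses the shift $4^{-j}$ with j starting at 1, the digit is chosen by rounding $W/X$ (a software shortcut for the comparison-based selection tables), and `math.atanh` stands in for a ROM of $\tanh_4^{-1}$ values. The function names are hypothetical.

```python
import math

LN4 = math.log(4.0)

def atanh4(a):
    """Base-4 inverse hyperbolic tangent: 0.5 * log4((1 + a) / (1 - a))."""
    return math.atanh(a) / LN4

def r4hv_log4(p, n=16):
    """Radix-4 hyperbolic vectoring sketch returning an estimate of log4(p), 1 <= p <= 4."""
    x, y, z = p + 1.0, p - 1.0, 0.0
    for j in range(1, n + 1):
        s = 4.0 ** -j
        # Software shortcut: nearest digit to W/X, clamped to {-2, ..., 2}
        sigma = max(-2, min(2, round(y / (s * x))))
        x, y = x - sigma * s * y, y - sigma * s * x
        z += atanh4(sigma * s)
    # z approximates atanh4((p - 1) / (p + 1)) = 0.5 * log4(p)
    return 2.0 * z
```

With 16 iterations the sketch resolves roughly 32 bits, in line with the two bits per iteration expected from a radix-4 recurrence. No scale factor compensation appears because only z is read out.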
The problem with the R4HV-CORDIC is that $\sigma_j$ depends on both the coordinates $X^j$ and $W^j$. The computation of the selection function is very complex, as in each iteration $W^j$ needs to be compared with the complex selection criteria given in Table 3. The computation of $0.5X^j$ can be achieved with a simple binary shift, and additional hardware is not required. However, to compute $1.5X^j$, an additional adder may be required. In this section, we discuss the methodology to simplify the selection criteria for the application of root and power computations.
Since the inputs to the radix-4 HV CORDIC are related ($Y^0 = p - 1$ and $X^0 = p + 1$), we can derive selection criteria to choose $\sigma_j$ that depend only on the variable $Y^j$ for any iteration index j. Because of this relation between the inputs, the variables $Y^j$ and $X^j$ can also be represented in terms of $Y^0$ for any iteration index j. For example, $Y^1$ and $X^1$ can be represented in terms of $Y^0$ using the identity $X^0 = Y^0 + 2$ and Eq. (16), which gives the identities in Eq. (21). From the identities given in Eq. (21), the relation between $Y^1$ and $X^1$ can be derived as $X^1 = Y^1 + 2 + \frac{\sigma_0}{2}$.
Table 3. Overlapping intervals and criteria.
Similarly, the range of $Y^0$ can be found for all possible combinations of $\sigma_0$ and $\sigma_1$ by iterating Eq. (25). The range of $Y^0$ to select the various values of $\sigma_0$ and $\sigma_1$ is summarized in Table 4, and the criteria derived from this range to choose the selection function are shown in the adjacent column of Table 4. All the values of the criteria can be represented using 10 bits, and as a result, a 10-bit comparator is required for the comparison. The problem with this method is that the number of comparison points increases exponentially with the iteration index. For example, the variable $Y^0$ has to be compared with 125 selection criteria to select $\sigma_0$, $\sigma_1$, and $\sigma_2$. In the proposed architecture, the values of $\sigma_0$ and $\sigma_1$ are selected by comparing the value of $Y^0$ with the criteria given in Table 4. The proposed method to select $\sigma_j$ for the iteration index $j \ge 2$ is discussed next.
As given in ref. 29, the criteria to select $\sigma_2$ can be used to select $\sigma_j$ for all the iterations with iteration index $j \ge 2$. In the proposed architecture, the criteria to select $\sigma_2$ are stored in a look-up table, and they are used to decide the value of $\sigma_j$ for the rest of the iterations. As discussed earlier, the relation between the variables $X^2$ and $Y^2$ can be derived using the identity $X^1 = Y^1 + 2 + \frac{\sigma_0}{2}$ and the iteration equation for $j = 2$. The range of $Y^2$ needed to select $\sigma_2$ can then be derived using the condition $aX^2 \le 64Y^2 \le bX^2$. This condition can be iterated over the various values of $\sigma_0$ and $\sigma_1$ to find the five criteria points $A_i$ for $\sigma_j = i$. The range of $Y^2$ to select $\sigma_j$ for the various values of $\sigma_0$ and $\sigma_1$ is summarised in Table 5. The last five columns of Table 5 show the criteria to select $\sigma_j$ for the various values of $\sigma_0$ and $\sigma_1$. All the values of the comparison points can be represented using 8 bits, so only an 8-bit comparator is needed. Since each comparison point can be represented using 8 bits, a look-up table with a size of 125 × 8 bits is required to store all the criteria. Once the values of $\sigma_0$ and $\sigma_1$ are known, the comparison points from the look-up table can be loaded into registers. Later, these comparison points are used to select the value of $\sigma_j$ for the rest of the iterations.
The computation flow of the proposed modified R4HV-CORDIC is presented in Table 6, which states the computation performed by the X, Y, and Z datapaths in each iteration. In the first step, normalization is performed by evaluating the values of q and p using the condition $4^{q} \le P \le 4^{q+1}$, where q is an integer. By performing the normalization, P is converted to the convergence range of R4HV-CORDIC as $p \in [1, 4]$, and the logarithm of P is computed as $\log_4 P = q + \log_4 p$. Later, $X^0$ and $Y^0$ are initialised as $X^0 = p + 1$ and $Y^0 = p - 1$. The Z-datapath computes $\sigma_0$ and $\sigma_1$ by comparing $Y^0$ with the selection criteria given in Table 4. Since $\sigma_0$ and $\sigma_1$ are already known, the next two stages compute the R4HV-CORDIC iterations with indexes $j = 1, 2$. Stage 2 also loads the values of the comparison points from the look-up table based on the values of $\sigma_0$ and $\sigma_1$. These comparison points are used to decide the value of $\sigma_j$ for the following iterations. In the third stage of the proposed algorithm, the Z-datapath first evaluates the value of $\sigma_j$ by comparing $Y^2$ with the comparison points retrieved from the look-up table in the previous stage. The rest of the stages follow this process until convergence. The architecture of the proposed algorithm is discussed next.
Table 4. Selection criteria.
www.nature.com/scientificreports/
The modified R4HV-CORDIC architecture to compute $\log_4 P$
The process of calculating $\log_4 P$ is divided into three stages. The initial stage is pre-processing, where the range of P is transformed to the input range of R4HV-CORDIC. The second stage employs the proposed modified R4HV-CORDIC to calculate $\log_4 p$. Finally, the post-processing stage computes $\log_4 P$ by adding q to $\log_4 p$.
Table 6. Datapath operations to be performed.
Compute $X^0$ and $Y^0$. Evaluate $\sigma_0$ and $\sigma_1$ by comparing p with the criteria given in Table 4.
Compute $X^2$, $Y^2$, and $Z^2$. Load the comparison points $A_i$ from the look-up table.
Evaluate $\sigma_j$ by comparing $Y^j$ with the comparison points loaded from the look-up table in the second iteration; compute $X^j$, $Y^j$, and $Z^j$.

The following section discusses each step in detail. The input range of the R4HV-CORDIC is $p \in \left[\frac{1}{4.19}, 4.19\right]$. In the pre-processing stage, the value of P is normalized with the factor q in such a way that the normalized p is in the range $p \in [1, 4]$. The normalization can be achieved by right-shifting P by 2q bits for $4^{q} \le P \le 4^{q+1}$, and q can be found using simple combinational logic. The relation between the actual value of P and the normalized p can be expressed as $P = p \times 4^{q}$. In addition to normalizing P, the pre-processing stage calculates the values of $X^0$ and $Y^0$ by adding 1 to and subtracting 1 from the normalized p. The pre-processing stage also compares the normalized p with the conditions specified in Table 4 to determine the values of $\sigma_0$ and $\sigma_1$. This comparison requires a 10-bit comparator, as discussed earlier. The normalization can be achieved by a binary shift using a fixed number of bits in fixed-point representation, and its delay can be ignored.
The R4HV-CORDIC receives $X^0$, $Y^0$, $\sigma_0$, and $\sigma_1$ from the pre-processing stage and computes $\log_4 p$. The generalized architecture of the X-Y datapaths of the R4HV-CORDIC is shown in Fig. 3. The adder/subtractor and shifter are the basic components of the X and Y datapaths, and the architecture of the X and Y datapaths is the same for all stages except for the shift value. The shifter can be implemented using a simple 3-to-1 multiplexer, and it multiplies $\sigma_j$ with $X^j$ (in the Y-datapath) and $Y^j$ (in the X-datapath). The multiplication can be achieved using a binary shift, as the magnitude of $\sigma_j$ is always 0, 1, or 2. The Z-datapath of the first stage first accesses $\tanh_4^{-1}\!\left(\frac{\sigma_0}{4}\right)$ from the ROM table and adds it to or subtracts it from $Z^0$ to generate $Z^1$. Since there are three magnitudes of $\sigma_j$ and negative rotation angles can be applied by performing a subtraction, only three values of $\tanh_4^{-1}\!\left(\frac{\sigma_0}{4}\right)$ need to be pre-computed. The Z-datapath of the second stage pre-loads the comparison points into pipeline registers from the ROM table based on the values of $\sigma_0$ and $\sigma_1$. The Z-datapath of the third stage compares the comparison points ($A_i$) received from the previous computation with $64Y^2$ to derive $\sigma_2$. Since the X and Y datapaths can only compute after $\sigma_2$ is available, the critical path of the X-Y datapath has the additional delay of a comparator compared with the first and second stages. In the proposed architecture, we use a pipelined structure in which the stages are separated by pipeline registers so that they can compute in parallel. Table 7 summarizes the critical path delay of the X-Y and Z rotators of each stage.
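The pre-processing step can be sketched on integers to show why it reduces to finding the leading one and shifting by an even number of bits. The 16-bit fractional format, the constant names, and the function `preprocess` are hypothetical choices for this example; the comparison against Table 4 is not modelled.

```python
FRAC_BITS = 16            # hypothetical fixed-point format: 16 fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0

def preprocess(P_raw):
    """Normalize an unsigned fixed-point P to p in [1, 4) and form X0 and Y0."""
    if P_raw <= 0:
        raise ValueError("P must be positive")
    msb = P_raw.bit_length() - 1 - FRAC_BITS   # floor(log2(P))
    q = msb // 2                               # floor(log4(P)), floors for negative msb too
    # Shift by 2q bits: right for large P, left for small P
    p_raw = P_raw >> (2 * q) if q >= 0 else P_raw << (-2 * q)
    return q, p_raw, p_raw + ONE, p_raw - ONE  # q, p, X0 = p + 1, Y0 = p - 1
```

For P = 100 the sketch returns q = 3 and p = 1.5625 (102400 in this format). A right shift discards low-order bits, so a real design would choose the word length with that truncation in mind.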
The critical path delays of all rotators are approximately the same, as they use an adder/subtractor and a comparator, as shown in Table 7. Radix-4 CORDIC rotation requires half the iterations of radix-2 for the same N-bit precision (ref. 29). The proposed modified R4HV-CORDIC algorithm introduces only a minimal overhead of three 3-to-1 multiplexers in each stage. Therefore, implementing the modified R4HV-CORDIC algorithm requires $\frac{3N}{2}$ adders and $\frac{3N}{2}$ 3-to-1 multiplexers, compared with the $3N$ adders required by the radix-2 CORDIC algorithm. The proposed R4HV-CORDIC algorithm has better hardware utilization than the radix-2 CORDIC algorithm, since the complexity of an adder is approximately twice that of a 3-to-1 multiplexer (ref. 30). In the post-processing stage, the output of the Z-datapath of the last stage of the R4HV-CORDIC is shifted right by 1 bit to generate $\log_4 p$. Later, the adder adds $\log_4 p$ to the normalization shift value q to generate $\log_4 P$, which has a delay of one adder.
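The hardware comparison above can be put into a single rough figure by expressing the multiplexers in adder-equivalent units. This is only a back-of-the-envelope estimate under the stated assumption that one adder costs about two 3-to-1 multiplexers; comparators, the look-up table, and pipeline registers are ignored, and the symbols $C_{\text{add}}$ and $C_{\text{mux}}$ are introduced here for illustration.

```latex
\begin{aligned}
C_{\text{radix-4}} &\approx \tfrac{3N}{2}\,C_{\text{add}} + \tfrac{3N}{2}\,C_{\text{mux}}
  = \tfrac{3N}{2}\,C_{\text{add}} + \tfrac{3N}{4}\,C_{\text{add}}
  = \tfrac{9N}{4}\,C_{\text{add}},\\
C_{\text{radix-2}} &\approx 3N\,C_{\text{add}},
\qquad
\frac{C_{\text{radix-4}}}{C_{\text{radix-2}}} \approx \frac{9N/4}{3N} = 0.75 .
\end{aligned}
```

Under these assumptions the adder-and-multiplexer cost of the radix-4 datapath is about three quarters of the radix-2 cost.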

Radix-4 LV-CORDIC
The radix-4 LV-CORDIC has the most straightforward architecture among all the versions of CORDIC. The computational equations of R4LV-CORDIC are given in Eq. (28). For R4LV-CORDIC, the iteration index j starts from 0. The input range of the R4LV-CORDIC can be extended by performing non-positive index iterations with the same architecture, and additional hardware is not required. The implementation of R4LV-CORDIC requires one multiplexer and one adder each for the Y and Z datapaths, as depicted in Fig. 4. The critical path delay of the Y and Z rotators includes the delay of one adder and one multiplexer. Compared with one stage of the radix-2 CORDIC algorithm, which uses two adders, one stage of R4LV-CORDIC uses two adders and two 3-to-1 multiplexers. However, the R4LV-CORDIC algorithm achieves convergence in half the iterations. The total hardware complexity of R4LV-CORDIC is N adders and N multiplexers, compared with the 2N adders used by the radix-2 algorithm. Also, the LV mode of the CORDIC does not introduce scaling, so compensation of the scale factor is not required in the LV class of the CORDIC.
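The division performed by the linear vectoring class can be illustrated with a short floating-point sketch. The function name `r4lv_divide` is hypothetical, and the digit is chosen by rounding the scaled remainder, which is a software shortcut rather than the selection logic of the actual datapath.

```python
def r4lv_divide(y, x, n=16):
    """Radix-4 linear vectoring sketch: returns an estimate of y / x for x > 0."""
    z = 0.0
    for j in range(n):                      # iteration index starts from 0
        s = 4.0 ** -j
        # Software shortcut: nearest digit to the scaled remainder, clamped to {-2, ..., 2}
        sigma = max(-2, min(2, round(y / (s * x))))
        y -= sigma * s * x                  # Y datapath: reduce the remainder
        z += sigma * s                      # Z datapath: accumulate the quotient
    return z
```

Starting at j = 0, the sketch handles quotients with magnitude up to about 8/3; larger quotients would need the non-positive index iterations mentioned above. Only the Y and Z recurrences appear, and no scale factor arises.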

Modified radix-4 HR-CORDIC
In this section, the radix-4 hyperbolic rotation CORDIC is discussed. The R4HR-CORDIC is used to determine the exponential ($4^{(\cdot)}$) in the proposed method. In the R4HR-CORDIC iteration, $\sigma_j \in \{-2, -1, 0, 1, 2\}$ and $j = 1, 2, \ldots, \frac{n}{2}$; $(X^j, Y^j)$ is the input vector, and $(X^{j+1}, Y^{j+1})$ is the output vector after the jth rotation. After convergence, the final coordinates $X^n$ and $Y^n$ of the rotated vector are given by Eq. (30), where $K_h$ is the scale factor, and the exponential $4^{Z^1}$ can be computed from them. However, the computation of the final exponential requires scale-factor compensation. The scale factor is $K_h = \prod_{j} \sqrt{1 - \sigma_j^{2}\,4^{-2j}}$, which varies with the selected $\sigma_j$. This variable scale factor is the disadvantage of the R4HR-CORDIC. Another problem with R4HR-CORDIC is the convergence range. According to the exponent decomposition given earlier, the minimum convergence range required is $|Z^1| \le 1$. However, the convergence range of the radix-4 HR CORDIC is only $|Z^1| \le 0.501$. The proposed architecture attempts to address these issues. In the next section, the convergence of the proposed CORDIC algorithm, its range of convergence, and scale factor compensation are discussed.
The high-radix CORDIC algorithm helps achieve convergence faster than the standard CORDIC algorithm. However, the scale factor of the high-radix CORDIC algorithm is complex, and its compensation may require significant hardware. In the proposed architecture, we use the Taylor series approximation of hyperbolic sine and cosine for the high-radix CORDIC algorithm to achieve scaling-free rotation. The Taylor series of $\sinh(\theta)$ and $\cosh(\theta)$ with angle $\theta = \sigma_j 4^{-j}$ are $\sinh\theta = \theta + \frac{\theta^{3}}{3!} + \frac{\theta^{5}}{5!} + \cdots$ and $\cosh\theta = 1 + \frac{\theta^{2}}{2!} + \frac{\theta^{4}}{4!} + \cdots$. The computation of a high-order Taylor approximation requires substantial hardware, and it is advisable to use a low-complexity Taylor approximation. The Taylor estimation is therefore constrained to two terms in the proposed design. The effect of the iteration index on accuracy, in terms of binary bits (n), has to be studied for the potential error in the representation of the rotation vector. From Eq. (32), the $5!$ term of the Taylor approximation of sinh can be ignored if it does not exceed $2^{-n}$. From this relation, we can conclude that this term can be ignored for iteration index $j \ge \frac{n-2}{10}$. Similarly, the $4!$ term can be ignored for iteration index $j \ge \frac{n-1}{8}$. For example, if 32-bit precision is targeted, then the $5!$ and $4!$ terms can be ignored for iteration index $j \ge 3$ and $j \ge 4$, respectively, without introducing any quantization error. The effective word length ($WL_E$) is another measure to check the error performance in two-dimensional rotation. As given in refs. 31, 32, the $WL_E$ for two-dimensional rotation is defined in terms of the error $\epsilon = \sqrt{\epsilon_C^{2} + \epsilon_S^{2}}$, where $\epsilon_C$ and $\epsilon_S$ are the absolute errors generated by the Taylor approximation in the cosine and sine components, respectively.
The $WL_E$ of the hyperbolic rotation for the proposed Taylor approximation at iteration index $j = 4$ is 38 bits. This indicates that the rotation may generate an error in the 38th bit. The Taylor approximation is more accurate for smaller values of the rotation angle, and as a result, the $WL_E$ improves for higher iteration indices. The two terms of the Taylor approximation of hyperbolic sine and cosine are used for iteration index $j \ge 4$. The $3!$ term of the sine approximation and the $2!$ term of the cosine approximation can be ignored without any error for iteration index $j \ge 6$ and $j \ge 12$, respectively. In this way, the hardware required to compute the scaling-free rotation is reduced for higher values of j. However, the scale factor generated by the first three iterations needs to be compensated. The scale factor of the high-radix CORDIC algorithm depends on $\sigma_j$. In the proposed algorithm, the $\sigma_j$ of the first three iterations are pre-computed by comparing the rotation angle ($Z^0$) with the selection criteria. Once the values of $\sigma_j$ are known, the scale factor can be pre-computed and stored in a ROM table. Initializing the coordinate values with the pre-computed scale factor compensates for the scale factor. This method is discussed in more detail in the next section.

Convergence of the proposed CORDIC algorithm
The minimum convergence range required for exponential computation is $0 \le Z^0 \le 1$. The convergence range of the R4HR-CORDIC is smaller and can be defined as $|Z^1| \le \sum_{j=1}^{\infty} \tanh_4^{-1}(2 \times 4^{-j}) = 0.502$. To increase the convergence range to the required value, we propose to rotate the vector through one additional rotation angle, $\tanh_4^{-1}(0.625)$, controlled by the selection function $\sigma_0$, where $\sigma_0 = 1$ and $\sigma_0 = 0$ indicate the rotation of the vector through $\tanh_4^{-1}(0.625)$ and no rotation, respectively. This additional rotation has the scale factor $K_0 = \sqrt{1 - (\sigma_0 \times 0.625)^2}$, which depends on $\sigma_0$. As discussed earlier, in the proposed CORDIC algorithm, the selection functions $\sigma_j$ for the first three iterations need to be pre-computed for scale factor compensation. The parameter $\sigma_0$ of the additional rotation also needs to be pre-computed as it has a variable scale factor. Once the selection functions are known, the scale factor can be pre-computed and stored in a ROM table. Later, $X^0$ is initialized with $1/K_h$, and the scale factor can be compensated without any additional hardware. The concept of high-radix SRT division is used to derive the convergence and selection criteria. Lower ($L$) and upper ($U$) limits are defined to select $\sigma_0$, $\sigma_1$, $\sigma_2$, and $\sigma_3$. These limits are pre-computed for all possible combinations of $\sigma_0$, $\sigma_1$, $\sigma_2$, and $\sigma_3$ to find the intervals $(L, U)$. The overlap between two intervals is then found to choose the selection criteria. For example, the intervals are $[0.4906, 0.5057]$ and $[0.4794, 0.4944]$ for $(\sigma_0, \sigma_1, \sigma_2, \sigma_3) = (0, 2, 2, 1)$ and $(0, 2, 2, 0)$, respectively. The overlap between these two intervals is $[0.4906, 0.4944]$. As a result, any value from this interval can be chosen to select $\sigma_0 = 0$, $\sigma_1 = 2$, $\sigma_2 = 2$, and $\sigma_3 = 1$. Each value of the selection criterion is represented using nine bits in the proposed method. The criteria to choose $\sigma_0$, $\sigma_1$, $\sigma_2$, and $\sigma_3$, along with the scale factors, are listed in Table 8.
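The interval-overlap step in the example above can be sketched in a few lines of Python. This is only an illustration: the helper names `overlap` and `grid_points` are hypothetical, and the second helper assumes that all nine bits of a selection criterion are fractional bits, which the text does not state.

```python
import math

def overlap(a, b):
    """Return the intersection of two closed intervals, or None if they are disjoint."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def grid_points(interval, frac_bits=9):
    """List the integers k for which k / 2**frac_bits lies inside the interval.

    Assumption: the criterion is stored with frac_bits fractional bits.
    """
    scale = 1 << frac_bits
    first = math.ceil(interval[0] * scale)
    last = math.floor(interval[1] * scale)
    return list(range(first, last + 1))

# Intervals quoted in the text for (0, 2, 2, 1) and (0, 2, 2, 0).
iv_0221 = (0.4906, 0.5057)
iv_0220 = (0.4794, 0.4944)

common = overlap(iv_0221, iv_0220)
candidates = grid_points(common)
print(common, [k / 512 for k in candidates])
```

Under that assumption, any of the listed grid values could serve as the stored threshold separating the two combinations.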
The selection criteria to choose $\sigma_j$ for iterations $j \ge 4$ can be made independent of the iteration index. According to the method of 30,33, we define the new variable $W^j$ as $W^j = 4^j Z^j$, which has to be bounded by lower and upper limits $L_j[q]$ and $U_j[q]$. Since $L_j[q]$ and $U_j[q]$ are monotonic functions, i.e. $L_j[q] \le L_j[q+1]$ and $U_j[q] \le U_j[q+1]$, the selection criteria can be made independent of the iteration index. As a result, the largest value of the lower limit (i.e. $L_\infty[q]$) and the smallest value of the upper limit (i.e. $U_4[q]$) are chosen as the selection criteria for all iterations $j \ge 4$. The scale factor of the pre-computed iterations can be compensated by taking the initial value of the X-coordinate of the rotating vector as $1/K_h$. The pre-computed scale factor is stored in a ROM table along with the selection functions. When $j \ge 4$, the algorithm executes the scaling-free computation, so scale factor compensation is not required for these iterations. The pre-computed scale factors and selection functions are accessed from the ROM table by comparing the initial angle $Z^0$ with the selection criteria listed in Table 8. The proposed CORDIC algorithm is summarized in Table 9.
Table 9 provides the operations carried out by the various rotators in each stage. In the pre-processing stage, the integer and fraction parts of the input angle are derived, and the pre-computed scale factor and selection functions are retrieved from the ROM table. In the next stage, $X^1$ is initialized with the pre-computed scale factor, and $Y^1$ is computed using the relation given in Table 9. The Z rotator of this stage rotates the two-dimensional vector by an angle $\tanh_4^{-1}(0.625)$ if $\sigma_0 = 1$; otherwise, it does not perform the rotation. The next three stages compute the standard radix-4 HR CORDIC iterations based on the $\sigma_j$ received from the previous stage. The three stages after that compute scaling-free iterations in which hyperbolic sine and cosine are approximated using two terms. In the Taylor approximation of hyperbolic sine, $3!$ is replaced with $8$ ($2^3$) so that the computation can be achieved using binary shifts only. The absolute error introduced by this approximation is $1.4 \times 10^{-17}$ for $j = 4$ and $\sigma_4 = 2$. The second term of the hyperbolic sine approximation can be ignored for the iterations $7 \le j \le 12$, and the second terms of both hyperbolic sine and cosine can be ignored for $13 \le j \le n/2$. As a result, the remaining stages compute the standard radix-4 HR CORDIC for $j \ge 13$. The architecture and hardware required to compute these iterations are discussed in the next section.

The architecture of the proposed CORDIC algorithm
The architecture, timing analysis, and hardware complexity of the proposed CORDIC algorithm are discussed in this section. The first stage of the proposed algorithm is the normalizer, which separates the integer ($V_I$) and fractional ($V_Z$) parts of the input angle. At the end of the computation, the result is shifted by $2V_I$ bits to obtain the actual result. This stage accesses the scale factor and selection functions from the ROM table based on the value of the input angle after the normalization. The X rotator of this stage is simple: it only initializes $X^1$ with $1/K_h$. The Y rotator computes $0.625/K_h$ by adding the two partial products $0.5/K_h$ and $0.125/K_h$ using an adder. If the selection function $\sigma_0$ is zero, $Y^1$ is initialized with zero to indicate that no rotation has occurred. If $\sigma_0$ is one, $Y^1$ is set to $0.625/K_h$. Similarly, the Z rotator computes $Z^1$ based on the value of $\sigma_0$. The critical path delays of the Y and Z rotators are equal and are given as $T_0 = T_{ROM} + T_{MUX21} + T_{ADD}$, where $T_{ROM}$ is the delay of the read-only memory (ROM), $T_{MUX21}$ is the delay of the 2-to-1 multiplexer, and $T_{ADD}$ is the delay of the adder.
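The Y-rotator initialization described above (two shifted partial products, one adder, and a 2-to-1 selection on $\sigma_0$) can be mimicked with integer arithmetic. This is a minimal sketch, not the authors' RTL: the function name `init_xy` and the choice of 30 fractional bits for the example value are assumptions made for illustration.

```python
FRAC_BITS = 30  # assumed number of fractional bits for the example value

def init_xy(inv_kh_fixed, sigma0):
    """Return (X1, Y1) given 1/K_h as a fixed-point integer.

    0.625 = 0.5 + 0.125, so 0.625/K_h is formed from two right shifts and
    one addition. Right shifts truncate, so for general inputs the result
    can differ from the exact product in the least significant bits.
    """
    x1 = inv_kh_fixed
    scaled = (inv_kh_fixed >> 1) + (inv_kh_fixed >> 3)  # 0.5/K_h + 0.125/K_h
    y1 = scaled if sigma0 == 1 else 0                   # 2-to-1 multiplexer on sigma0
    return x1, y1

one = 1 << FRAC_BITS          # stands in for 1/K_h = 1.0 in this toy example
print(init_xy(one, 1), init_xy(one, 0))
```

With the toy input 1.0 the shifted sum is exact, which makes the mapping between the shifts and the constant 0.625 easy to see.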
Since fixed-point representation is used, the normalizer computes the normalized value by shifting the input angle by a fixed number of bits, and it does not add delay. The VLSI implementation of this stage requires two adders and two 2-to-1 multiplexers each for Y and Z rotators. As discussed in the section on word length analysis, the scale factor is represented using 30 bits, and the selection functions $\sigma_0$ and $\sigma_1$ to $\sigma_3$ can be represented using one and three bits, respectively. As a result, the ROM table size of this stage is 90x40 bits. The next three stages compute the standard radix-4 HR CORDIC iterations based on the pre-computed selection functions received from the previous stage. The architecture of the X and Y rotators of these stages is similar to the architecture of the R4HV-CORDIC. The critical path delay of this stage includes the delay of the adder ($T_{ADD}$) and the 3-to-1 multiplexer ($T_{MUX31}$), and it is given as $T_1 = T_{MUX31} + T_{ADD}$. The total hardware complexity of this stage is three 3-to-1 multiplexers and three adders.

Table 9. Computational flow of the R4HR-CORDIC (columns: X, Y, Z).
Prescaler: pre-scale the input angle; add selection function.
Compute conventional radix-4 hyperbolic rotation.
Compute scaling-free hyperbolic rotation with two terms of hyperbolic sine and cosine.
Compute scaling-free hyperbolic rotation with two terms of hyperbolic cosine and one term of hyperbolic sine.
Compute scaling-free hyperbolic rotation with one term of hyperbolic sine and cosine.
The next three stages compute the scaling-free iterations. These stages use two terms of the Taylor approximation of hyperbolic sine and cosine to make the computation scaling-free. The architecture of the X and Y rotators is similar, so only the architecture of the X rotator is shown in Fig. 5. All the terms of the Taylor approximation can be multiplied with $X^j$ and $Y^j$ using binary shifts only, and each can be implemented using a single 3-to-1 multiplexer. For example, for iteration index $j = 4$, the second term of the cosine approximation simplifies to $2^{-15}$, $2^{-17}$, and $0$ for $\sigma_4 = 2$, $1$, and $0$, respectively, which can be implemented using a 3-to-1 multiplexer. The X and Y datapaths of these stages each require three 3-to-1 multiplexers to generate three partial products and a 4-to-2 carry-save adder (CSA) to add four partial products, as shown in Fig. 5. The VLSI implementation of a 4-to-2 CSA requires two full adders and one adder. The Z rotator of this stage compares $Z^j$ with the criteria given in Table 8 to derive $\sigma_j$. Then, $Z^{j+1}$ is computed based on the value of $\sigma_j$ by adding or subtracting $\tanh_4^{-1}(\sigma_j 4^{-j})$ from $Z^j$. The critical path delay of the X and Y rotators of this stage dominates that of the Z rotator and is given as $T_2 = T_{COMP} + T_{ADD} + 2T_{FA} + T_{MUX31}$, where $T_{COMP}$ is the delay of the comparator and $T_{FA}$ is the delay of the full adder. The timing and hardware complexity of these stages are the highest among all stages. The VLSI implementation of this stage requires seven 3-to-1 multiplexers, four full adders, and three adders.
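The power-of-two values quoted for $j = 4$ can be reproduced with exact rational arithmetic. The sketch below assumes that the rotation argument of the scaling-free stage is $x = \sigma_j 4^{-j}$, that the second cosine term is $x^2/2$, and that the second sine term uses the shift-friendly divisor 8 in place of $3!$; the function names are hypothetical.

```python
from fractions import Fraction

def cosh_second_term(j, sigma):
    """Second Taylor term x**2 / 2 of cosh for x = sigma * 4**(-j) (assumed form)."""
    x = Fraction(sigma, 4 ** j)
    return x * x / 2

def sinh_second_term_shift_only(j, sigma):
    """Second sine term with 3! replaced by 8, i.e. x**3 / 8 (assumed form)."""
    x = Fraction(sigma, 4 ** j)
    return x ** 3 / 8

for sigma in (2, 1, 0):
    print(sigma, cosh_second_term(4, sigma), sinh_second_term_shift_only(4, sigma))
```

Because every non-zero result is a pure power of two, multiplying a coordinate by it reduces to a fixed right shift, and the three cases for $\sigma_j$ map onto the three inputs of a 3-to-1 multiplexer.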
The stages with iteration index $7 \le j \le 12$ perform the scaling-free rotation with one term and two terms of the Taylor approximation of sine and cosine, respectively. As shown in Fig. 6, the VLSI implementation of this stage uses five 3-to-1 multiplexers, two full adders, and three adders. The delay of this computation is $T_3 = T_{COMP} + T_{ADD} + T_{FA} + T_{MUX31}$. The scale factor can be assumed to be one for the remaining stages. For example, for $j = 13$, the absolute error is $4.44 \times 10^{-15}$ if the scale factor is assumed to be one. The VLSI implementation of the rest of the stages requires three adders and three 3-to-1 multiplexers. The modified R4HR CORDIC has a total hardware complexity of 84 adders and 62 3-to-1 multiplexers for 40-bit precision, which is a reasonable improvement over the 126 adders of the radix-2 CORDIC algorithm.

Experimental results and discussion
This section reports the experimental results and evaluates them in terms of data width, accuracy, and hardware complexity.

Datawidth analysis
The data width of the X, Y, and Z rotators is a crucial factor to consider before implementing the hardware of the proposed methodology. The data width determines the number of bits used to represent the input and output variables of the various stages of the proposed method. A fixed-point representation is used for the variables. The format FXP(a,b) indicates that a-1 integer bits, b fraction bits, and one sign bit are used to represent the number. As per the methodology presented in 25,26, the input range of P is assumed to be $P \in [10^{-6}, 10^{6}]$ and $N \in [2, 1002]$. For comparison purposes, 27 bits are used to represent the fractional part of the input number P. The maximum value of P is $10^6$, which can be represented using 20 integer bits. As a result, the total number of bits required to represent P is 48 (FXP(21,27)).
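The FXP(a,b) bookkeeping above can be checked with a short sketch. The helper names are hypothetical; the only inputs are the format definition and the value range stated in the text.

```python
def integer_bits(max_value):
    """Number of bits needed for the integer part of a non-negative value."""
    return int(max_value).bit_length()

def fxp_total_bits(a, b):
    """Total width of FXP(a, b): a - 1 integer bits, one sign bit, b fraction bits."""
    return (a - 1) + 1 + b

int_bits = integer_bits(10 ** 6)   # integer bits for P_max = 10**6
fmt = (int_bits + 1, 27)           # FXP(a, b) with 27 fraction bits
print(int_bits, fmt, fxp_total_bits(*fmt))
```

The same helper gives the integer widths used later for other variables, for example `integer_bits(1002)` for the maximum value of N.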
The first step in the logarithm computation is the normalizer. The convergence range of the proposed radix-4 HV CORDIC algorithm is $p \in [1/4.19, 4.19]$. The normalization process in the proposed method rearranges the fixed-point representation of the input number P as FXP(4,44) to bring P into the convergence range of the proposed R4HV-CORDIC. As a result, FXP(4,44) precision is used to represent the X and Y coordinates of the R4HV-CORDIC algorithm. The integer data width required for the Z rotator of the R4HV-CORDIC algorithm depends on the value of $\log_4(P_{max})$. Since the maximum value of P is $10^6$, four integer bits are required to represent the logarithm of the maximum of P. Also, the normalization factor (q) is 9 for $P_{max}$. Hence, FXP(5,27) precision is used to represent q and $\log_4(P_{max})$.
The number of bits required to represent the integer part of the input of the R4LV-CORDIC depends on the maximum value of N. Since $N_{max} = 1002$, ten bits are required to represent the integer part of N. As a result, FXP(11,27) precision is used to represent N, and $\log_4 P_{max}$ is extended to the same precision. The maximum input to the R4HR-CORDIC is $Z^0 = \log_4(2^{20})/2 = 5$. Three bits are required to represent the integer part of the $Z^0$ of the R4HR-CORDIC. As a result, FXP(4,27) precision is used to represent $Z^0$. After the normalization process, the R4HR-CORDIC only rotates the vector through the fractional part of $\log_4(P)/2$, which is bounded by approximately 1. Since $\cosh_4(1) = 2.125$, the integer parts of the X and Y inputs of the R4HR-CORDIC are represented using two bits. The final output is shifted by $Z_I$ bits to obtain the actual exponential value. An additional 9 bits are used to represent the fractional part of the X and Y inputs. Hence, FXP(3,36) precision is used to represent the X and Y inputs.
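The two numerical values used in this paragraph can be verified directly. The sketch assumes that the base-4 hyperbolic cosine is defined as $\cosh_4(x) = (4^x + 4^{-x})/2$, which is consistent with the quoted value $\cosh_4(1) = 2.125$; the function names are illustrative.

```python
import math

def log4(x):
    """Base-4 logarithm, computed as log2(x) / 2."""
    return math.log2(x) / 2

def cosh4(x):
    """Base-4 hyperbolic cosine (assumed definition): (4**x + 4**-x) / 2."""
    return (4.0 ** x + 4.0 ** (-x)) / 2

z0_max = log4(2 ** 20) / 2          # maximum R4HR input for the smallest N = 2
print(z0_max, int(z0_max).bit_length())
print(cosh4(1), int(cosh4(1)).bit_length())
```

The bit lengths printed by the sketch correspond to the three integer bits of $Z^0$ and the two integer bits of the X and Y inputs.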
Next, we analyze the data width required for the computation of $P^N$. As given in 26, we assume the ranges of P and N to be limited to the intervals $[10^{-2}, 10^{2}]$ and $[1, 5]$, respectively. The maximum input to the R4HV-CORDIC is 100. Hence, 7 bits are used for the integer part of the input P. For an average precision of $10^{-7}$, 27 bits are used for the fraction part of the input P. As a result, in the proposed methodology, the input P is represented using the precision FXP(8,27). The first step in the logarithm computation is pre-log normalization. After the normalization process, the normalized P (p) can be defined with the precision FXP(4,31). The maximum output of the logarithm computation is $\log_4(100) = 3.32$, and hence, 2 bits are used to represent the integer part of $\log_4(100)$. Hence, the Z datapath and $\log_4(p)$ are defined using the precision FXP(3,27). The inputs to the multiplier are $\log_4(p)$ and N. The integer part of the output of the multiplier can be represented using 5 bits, as $\log_4(p) \times N_{max} = 16.61$. Hence, the output of the multiplier is defined using the precision FXP(6,27). The next step is to calculate the exponential. The first step in the exponential calculation is normalization, in which the integer and fractional parts of the input angle of the radix-4 HR CORDIC are separated. The fractional part of the input angle can be represented using FXP(3,27). As discussed earlier, the integer part of the X and Y datapaths is represented using 2 bits. Hence, the X and Y inputs of the R4HR-CORDIC are represented using FXP(3,27). The data widths required to represent the various variables at different stages are summarized in Table 10.

Accuracy and number of iterations
To verify the proposed methodology, the RMSE and the maximum absolute error (max(AE)) are measured using Eq. (38), where $A_i$ and $B_i$ indicate the actual and calculated values, respectively. In the conventional CORDIC algorithm, the precision of the output coordinates can be adjusted by changing the number of iterations. The accuracy of the coordinates increases with the number of executed iterations. However, adding iterations increases the cost of the computation. Also, the high-radix CORDIC algorithm achieves the desired accuracy faster than the conventional CORDIC algorithm. The impact of the number of iterations on accuracy should therefore be studied for the various approaches. Figure 7 plots the accuracy against the number of iterations performed for different approaches. The accuracy is measured from the RMSE of the output coordinates as $-\log_2(\mathrm{RMSE})$. Figure 7a-c show the iteration versus accuracy graphs for hyperbolic rotation, base-2 rotation, and the proposed rotation. From Fig. 7, it can be concluded that the proposed method achieves higher accuracy than the other approaches for the same number of iterations. However, the hardware required to carry out the radix-4 CORDIC rotation is slightly higher than that of the standard radix-2 CORDIC. The hardware required to implement the proposed approach is analyzed in the next section.
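The error metrics named above can be sketched as follows. The code uses the standard definitions of RMSE and maximum absolute error, which are assumed here to match Eq. (38); the function names and the sample values are illustrative only and are not taken from the paper's experiments.

```python
import math

def rmse(actual, calculated):
    """Root-mean-square error between actual values A_i and calculated values B_i."""
    n = len(actual)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, calculated)) / n)

def max_abs_error(actual, calculated):
    """Maximum absolute error max|A_i - B_i|."""
    return max(abs(a - b) for a, b in zip(actual, calculated))

def accuracy_bits(rmse_value):
    """Accuracy expressed as -log2(RMSE), the measure plotted against iterations."""
    return -math.log2(rmse_value)

actual = [1.0, 2.0]          # toy reference values
calculated = [1.25, 1.75]    # toy outputs with an error of 0.25 each
r = rmse(actual, calculated)
print(r, max_abs_error(actual, calculated), accuracy_bits(r))
```

Expressing the RMSE as $-\log_2(\mathrm{RMSE})$ turns it into an approximate count of correct fractional bits, which is why it pairs naturally with the number of iterations.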
To measure the error performance, a total of $10^5$ samples of the input P are generated in the range from $10^{-6}$ to $10^{6}$ using logarithmic steps to cover the entire range with a minimum number of samples. The error is measured for the Nth root computation with N=5. Table 11 compares the RMSE and max(AE) of the proposed method with those of the approaches presented in 25 and 26. The number of stages used to measure the error is also given in Table 11. Hyperbolic and binary logarithmic CORDIC repeat the iterations with indexes j=4, 13, 40, ..., and each repeated iteration is counted as a stage in Table 11. For example, approach 26 goes through the iterations with indexes j=1 to 14, and iterations 4 and 13 are repeated, resulting in a total of 16 iterations (stages). However, the proposed method uses radix-4 computations in which iterations do not need to be repeated. Hence, the proposed approach goes through the iterations j=1 to 8, resulting in 8 iterations (stages). From Table 11, it is apparent that the proposed algorithm uses half the stages of approaches 25 and 26 and has better error performance. The hardware complexity of the proposed approach is compared with that of approaches 25,26 in the next section.
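The stage count in the example above can be reproduced with a small sketch. It assumes the usual repetition rule of hyperbolic CORDIC, in which the repeated indexes follow $k \rightarrow 3k + 1$ starting from 4 (giving 4, 13, 40, ...); the function names are hypothetical.

```python
def repeated_indices(last_index):
    """Indexes in 1..last_index that are executed twice (4, 13, 40, ...)."""
    repeats, k = [], 4
    while k <= last_index:
        repeats.append(k)
        k = 3 * k + 1
    return repeats

def radix2_stages(last_index):
    """Stages of a radix-2 hyperbolic CORDIC running indexes 1..last_index."""
    return last_index + len(repeated_indices(last_index))

print(repeated_indices(40), radix2_stages(14))
```

For indexes 1 to 14 the count is 16 stages, twice the 8 stages quoted for the proposed radix-4 flow.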

Hardware complexity analysis
Hardware analysis is carried out quantitatively by computing the number of transistors required to implement the proposed architecture for two configurations. In the first configuration, the number of iterations of each CORDIC configuration is taken as given in 25,26. As discussed earlier, the standard CORDIC produces 1-bit precision in each iteration, whereas the radix-4 CORDIC generates 2-bit precision. Hence, in the second configuration, we choose the number of iterations based on the data width of the different variables of the three CORDIC configurations.
The adder/subtractor, multiplexer, ROM, and comparator are the basic building blocks of the proposed algorithm. In the proposed design, the R4HV, R4LV, and R4HR CORDICs compute the logarithm $\log_4(P)$, the division $\log_4(P)/N$, and the exponential $4^{\log_4(P)/N}$, respectively. For a b-bit datapath, 24b and 48b transistors are required to implement the simple adder and the adder/subtractor, respectively 25,26. Similarly, 6b, 10b, and 24b transistors are required to implement a b-bit ROM, multiplexer, and comparator 34. Each stage of the proposed R4HV CORDIC uses one adder/subtractor and one multiplexer for each of the X, Y, and Z datapaths, resulting in 174b transistors. In addition, the R4HV uses 1250 bits of memory to store pre-computed selection functions and criteria. Except for the first two stages, all the stages of the R4HV use 8-bit comparators. Hence, the transistors required to implement the R4HV are summed up in the equation below.
where n indicates the number of stages. Similarly, each stage of the R4LV CORDIC requires an adder/subtractor and a multiplexer for each of the Y and Z datapaths, resulting in 116bn transistors. However, the number of transistors needed to implement the R4HR CORDIC depends on the target precision, which determines the order of the Taylor series approximation. The first two stages of the R4HR CORDIC require two simple adders. The following three stages use the pre-computed selection functions, and their hardware complexity is the same as that of the R4HV CORDIC. The rest of the stages perform the scaling-free computation, which requires seven adders and five multiplexers. The R4HR stores the pre-computed scale factors and selection functions in a memory of 49x90 bits. A total of 2096b + 26460 transistors are needed to implement the R4HR CORDIC. Using these estimates, the transistors needed to implement the log, division/multiplication, and exponential computations are listed in Tables 12 and 13 for the first configuration to compute the Nth root and power. Similarly, Tables 14 and 15 summarize the transistors for the second configuration. The RMSE achieved by each computation is also given in the tables. From Table 12, it is apparent that the proposed Nth root implementation has 37% and 22% less hardware utilization than approaches 25 and 26, respectively. Table 13 shows that the proposed Nth power computation uses 51% and 17% less hardware than approaches 25 and 26, respectively.
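The per-block transistor costs quoted above can be turned into a small estimator. This is a sketch of the arithmetic only: the function names are hypothetical, and it covers just the terms that the text states explicitly (the per-stage R4HV and R4LV datapath costs and the ROM contribution to the R4HR total).

```python
# Transistors per bit, as quoted in the text.
ADDER, ADD_SUB, ROM_BIT, MUX, COMPARATOR = 24, 48, 6, 10, 24

def r4hv_stage(b):
    """One R4HV stage: adder/subtractor plus multiplexer on X, Y, and Z."""
    return 3 * (ADD_SUB + MUX) * b

def r4lv_total(b, n):
    """n R4LV stages: adder/subtractor plus multiplexer on Y and Z."""
    return 2 * (ADD_SUB + MUX) * b * n

def rom_transistors(words, width):
    """ROM cost for a table of words x width bits."""
    return ROM_BIT * words * width

print(r4hv_stage(1), r4lv_total(1, 1), rom_transistors(49, 90))
```

The three printed values correspond to the 174b and 116bn terms and to the constant 26460 in the R4HR total, which is the cost of the 49x90-bit table.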
The proposed design is implemented on a Virtex-6 FPGA to check the actual hardware utilization. Table 16 summarizes the resource utilization in terms of slice LUTs for approaches 25 and 26 and the proposed design for root and power computations. It is apparent from Table 16 that the proposed implementation uses 47% and 36% fewer FPGA resources than 25 and 26 for root computation. Further, the proposed implementation uses 52% and 34% fewer FPGA resources than 25 and 26 for power computation. Multiplexers and ROM are the additional resources required to implement the proposed design, and the FPGA implements these components more efficiently. For example, a LUT6 can work as a 32-bit distributed ROM, and an adder/subtractor and a simple adder consume similar hardware.

Conclusion
The computation of the Nth root and Nth power plays a crucial role in many real-time applications. These functions help provide valuable solutions to complex equations. The real-time hardware implementation of such functions demands a high clock rate with low hardware utilization. The Newton-Raphson-based method is a traditional way to compute such functions. However, a real-time realization of this method consumes a lot of hardware, resulting in a slow clock rate. Another way to implement these functions is to use various CORDIC configurations to compute the mathematical operations. However, the standard CORDIC algorithm suffers from high latency due to its iterative process. In the proposed method, we have used various radix-4-based CORDIC configurations to compute the log, division/multiplication, and exponential operations needed to implement the Nth root and power computations. The main objective of the proposed work is to carry out the FPGA implementation. Therefore, we have conducted a quantitative analysis and an FPGA implementation of the proposed approach. The quantitative analysis suggests that the proposed Nth root implementation has 37% and 22% less hardware utilization than approaches 25 and 26, respectively. The FPGA implementation indicates that the proposed method has 36% and 34% less hardware utilization than the recent approach 26 for root and power computations, respectively. We decided to begin with the FPGA implementation due to its quick implementation and validation capabilities. This approach allows us to validate our design and make necessary improvements before the ASIC implementation. In the future, we will carry out the ASIC implementation of the proposed methodology using commercial CMOS libraries.

Figure 1. Standard approach to compute root and power.

Table 1. Various classes of CORDIC algorithm and their outputs. HR: hyperbolic rotation, LR: linear rotation, LV: linear vectoring, and HV: hyperbolic vectoring.

Table 2. Impact of m on range of P.

Table 7. Critical path delay of R4HV-CORDIC. $T_{NORM}$: delay of normalizer; $T_{add}$: delay of adder/subtractor; $T_{comp}$: delay of comparator; $T_{ROM}$: delay time to access ROM table.
... $T_{add} + T_{comp}$
First and second stages of radix-4 HV CORDIC: $T_{mux} + T_{add}$; $T_{ROM} + T_{add}$
Rest of the stages of radix-4 HV CORDIC: $T_{comp} + T_{add} + T_{mux}$; $T_{comp} + T_{add} + T_{ROM}$
Post-processing: $T_{add}$

Scientific Reports | (2023) 13:20918 | https://doi.org/10.1038/s41598-023-47890-3

In the proposed algorithm, the criteria given in Eq. (37) are used to decide the selection function for any iteration index $j \ge 4$. The selection functions for the first four iterations are pre-computed and stored in a ROM table. These rotations have an associated scale factor, which is also pre-computed.

Table 12. Hardware complexity comparison to compute $P^{1/N}$.

Table 13. Hardware complexity comparison to compute $P^{N}$.

Table 14. Hardware complexity comparison to compute $P^{1/N}$.

Table 15. Hardware complexity comparison to compute $P^{N}$.