


Contents lists available at ScienceDirect

# Microelectronics Journal

journal homepage: www.elsevier.com/locate/mejo

# Implementation of a low power 16-bit radix-4 pipelined SRT divider using a modified Split-Path Data Driven Dynamic Logic (SPD<sup>3</sup>L) structure



## Shirin Pourashraf\*, Sayed Masoud Sayedi

Department of Electrical and Computer Engineering, Isfahan University of Technology, Isfahan 84156-83111, Iran.

#### ARTICLE INFO

Article history: Received 17 September 2012; Received in revised form 27 July 2013; Accepted 5 August 2013; Available online 5 October 2013

Keywords: SRT divider; Latency; Data Driven Dynamic Logic; SPCD<sup>3</sup>L; Energy reduction; Speed

#### ABSTRACT

In this paper a 16-bit radix-4 pipelined divider implemented in a modified version of the SPD<sup>3</sup>L family structure (SPCD<sup>3</sup>L: Split-Path Clock-Data driven Dynamic Logic) is presented. In the modified structure, the clock signal is also used to pre-charge some critical parts of the circuit. The performance of the circuit is evaluated at different simulation corners. The results show that, compared with the domino structure, the proposed circuit has lower power consumption and higher speed. The latency of the divider is equal to 10 half clock cycles. The design is simulated using HSPICE in a 1.8-V TSMC 180 nm CMOS process.

© 2013 Elsevier Ltd. All rights reserved.

#### 1. Introduction

Even with today's advanced fabrication technologies, many challenges remain in the optimized implementation of sub-circuits such as arithmetic units, and especially multipliers and dividers, because of their high complexity, high transistor count, and frequent use [1-4]. Using dynamic logic families is an effective way to obtain units with higher speed and smaller area than static logic families. However, in dynamic circuits the complexity of clock routing and the loading of the clock signal, which increases power consumption especially at high frequencies, are the main obstacles. In many applications the clock network consumes 20% to 45% of total chip power [5]. To reduce power consumption, the D<sup>3</sup>L<sup>1</sup> logic family was introduced. In this method a subset of the input signals, instead of the clock signal, controls the pre-charge and evaluation phases. As a result, compared to other dynamic logic families, the clock distribution network is reduced significantly. This not only eases clock buffering and routing but also reduces power losses in the circuit [6]. However, compared to other dynamic logic families, the D<sup>3</sup>L family has a slower pre-charge phase and often a slower evaluation

*E-mail addresses:* shirinpourashraf@gmail.com (S. Pourashraf), m\_sayedi@cc.iut.ac.ir (S. Masoud Sayedi).

<sup>1</sup> Data Driven Dynamic Logic.

phase. The structure needs some modifications to reduce this problem. To that end, various circuit topologies in  $D^3L$  style have been proposed. All of the proposed topologies reduce power consumption.

In [6–8], a 16-bit barrel shifter is implemented in the  $D^3L$  structure. It consumes less power than its domino and NP-CMOS counterparts, and its operating frequency is higher than that of the domino structure. In [9] a 17-bit multiplier is presented in the  $D^4L^2$  structure; it consumes less power than the domino structure while operating at the same frequency. In [10,11] the SPD<sup>3</sup>L<sup>3</sup> technique is applied to 64-bit and 32-bit Kogge–Stone adders, resulting in lower power consumption. In [12] a 16-bit multiplier is designed in the  $D^3L$  structure with lower power consumption and higher speed than both its domino counterpart and the structure presented in [9]. A reconfigurable processor in the  $D^3L$  structure, with lower power consumption and higher speed than the static and domino structures, is presented in [13].

In every general-purpose microprocessor, a part of the hardware is allocated to the divider unit. In many current applications, such as three-dimensional graphics, high-speed divider units are necessary and the demand for them is increasing [14,15]. In general, sequential execution of the division operation leads to high latencies, reducing the overall

<sup>\*</sup> Corresponding author. Tel.: +98 936 818 2262.

0026-2692/\$ - see front matter © 2013 Elsevier Ltd. All rights reserved. http://dx.doi.org/10.1016/j.mejo.2013.08.001

<sup>2</sup> Dual Rail Data Driven Dynamic Logic.

<sup>3</sup> Split Path Data Driven Dynamic Logic.

performance of the system. In many divider structures, the operation is performed by doing many repetitive subtract and multiply operations. This causes high area and power consumptions especially in pipelined structures.

In this paper a new structure based on a modified version of the SPD<sup>3</sup>L logic family is presented [16]. The aim is to reduce the number of steps required by the division algorithm, the latency, the delay, and the power consumption. In Section 2, the basic structure of the Data Driven Dynamic Logic (D<sup>3</sup>L) family is introduced. In Section 3, the parallel SRT divider and its circuit blocks are described. In Section 4, the implementation of a 16-bit radix-4 pipelined divider in the proposed modified SPD<sup>3</sup>L structure (SPCD<sup>3</sup>L) is presented. Finally, Section 5 concludes the paper.

#### 2. Data Driven Dynamic Logic (D<sup>3</sup>L)

In the  $D^{3}L$  logic family the clock distribution network is removed and the pre-charge phase is controlled by some of the inputs [6]. These inputs, called pre-charge inputs, are a subset of the inputs in the pull-down network (PDN) of an equivalent domino circuit that satisfy the following conditions [10]:



**Fig. 1.** Implementation of  $(A+B) \times G$  function in D<sup>3</sup>L structure [6].



**Fig. 2.** A sample SPD<sup>3</sup>L circuit with m=2 [12].

- During each pre-charge phase the PDN is off, the pull-up network (PUN) is on, and the output is zero, similar to the output of domino circuit.
- During evaluation phase there is no contention between PUN and PDN; in other words, the PUN is turned off.

The first condition requires all pre-charge inputs be zero during pre-charge phases. Considering that for each stage the pre-charge inputs are the outputs of previous stages, it means that contrary to the domino circuits, the pre-charge phases cannot be performed simultaneously in all the stages, and there must be a pre-charge cycle wave that starts from first stage and moves toward last stage. Usually the first stage has domino structure and a clock signal is used in the pre-charge phase of this stage. By this, the first zero output is produced. Also, if the inputs of some middle stages are




Fig. 3. Schematic diagram of the 16-bit radix-4 pipeline divider [19].



Fig. 4. Different parts of the scaling unit.

partly independent and if it is needed, the zero outputs of these stages in the pre-charge phases are also produced by the clock signal. To implement a function expressed as a product of sums  $(\prod_{k=1}^{n} S_k)$  in the D<sup>3</sup>L structure, the term  $S_k$  that has the minimum number of inputs and satisfies the two conditions stated above is selected to replace the clock. This selection minimizes the number of series PMOS transistors in the PUN. The best case occurs when  $S_k$  has only one input. Fig. 1 shows the implementation of the  $(A+B) \times G$  function in the D<sup>3</sup>L structure [6].
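The selection rule and the two pre-charge conditions can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the function names `choose_precharge_term` and `d3l_gate` are hypothetical, the gate is modeled at the logic level (no timing, no transistor sizing), and the keeper is represented simply by holding the previous state of the dynamic node.

```python
def choose_precharge_term(sum_terms):
    """Pick the sum term with the fewest inputs to stand in for the clock."""
    return min(sum_terms, key=len)


def d3l_gate(inputs, sum_terms, dyn_prev=True):
    """Logic-level model of one D3L gate computing a product of sums.

    inputs    : dict mapping input name -> bool
    sum_terms : list of tuples of input names, e.g. [("A", "B"), ("G",)]
    dyn_prev  : previous state of the dynamic node (held by the keeper)
    Returns (dynamic_node, output); the output is the inverted dynamic node.
    """
    precharge = choose_precharge_term(sum_terms)
    # PUN: series PMOS devices driven by the pre-charge inputs,
    # conducting only when all of those inputs are low.
    pun_on = not any(inputs[name] for name in precharge)
    # PDN: conducts when every sum term has at least one high input.
    pdn_on = all(any(inputs[name] for name in term) for term in sum_terms)
    # Second condition from the text: PUN and PDN never conduct together.
    assert not (pun_on and pdn_on)
    if pun_on:
        dyn = True          # pre-charge: dynamic node high, output zero
    elif pdn_on:
        dyn = False         # evaluation: node discharged, output one
    else:
        dyn = dyn_prev      # neither network conducts: keeper holds the node
    return dyn, not dyn


# The (A+B) x G example of Fig. 1: G is the single-input term, so it replaces the clock.
TERMS = [("A", "B"), ("G",)]
```

With `G` low the model returns a zero output regardless of `A` and `B`, which is the pre-charge state; with `G` high the output follows `A + B`.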

The main advantage of the  $D^3L$  structure, compared to domino logic, is its much smaller clock network. This can result in a substantial decrease in power consumption. In addition, the clocked NMOS transistor that is connected to the end of the PDN in the domino structure is removed, giving a shorter evaluation path and consequently a lower evaluation delay. The  $D^3L$  structure has some disadvantages too. For example, since some input lines are used in both PUN and PDN blocks,



Fig. 5. 3:1 MUX to produce divisor (dividend) fraction.











Fig. 8. Different parts of dividing unit.

higher load capacitance compared to domino structure is seen by these lines. This reduces the advantage of shorter evaluation path.

Another problem of the D<sup>3</sup>L structure is the need for asynchronous execution of the pre-charge phases in different stages. As mentioned before, a pre-charge wave propagates sequentially through the stages. This makes the pre-charge phase slower than the evaluation phase. To mitigate this problem, larger PMOS transistors are needed. However, larger PMOS transistors lead to a longer evaluation phase and higher power dissipation. The problem is especially pronounced when series PMOS transistors exist in the PUN.

Network splitting is an effective method to reduce many of the problems associated with the D<sup>3</sup>L structure without any serious impact on its advantages. An example of such a structure, called SPD<sup>3</sup>L, is shown in Fig. 2. The figure shows a structure with two sub-networks (m=2), in which the PUN of each D<sup>3</sup>L sub-network contains a single PMOS transistor. Compared to the D<sup>3</sup>L structure, the absence of series PMOS transistors reduces the load capacitance of the input lines that are connected to both networks, decreases the power consumption, and increases the speed of the evaluation and pre-charge phases. In addition, the keeper transistors in each sub-network are narrower, so there is less contention between the PDNs and the keeper transistors and less parasitic capacitance, which further improves speed and energy consumption. In Fig. 2 the static inverter at the output of the D<sup>3</sup>L structure is replaced with a static NAND gate whose output is connected to the keeper transistors.

#### 3. The parallel SRT divider

SRT divider is a well known structure commonly used for the implementation of division operation. In this structure, dividend (X) and divisor (D) are normalized within the following ranges:

$$1/2 \le X \le 1, \quad X \le D \tag{1}$$

Also, the recurrence relation used to calculate the remainder is as follows:

$$R_{i} = \beta R_{i-1} - q_{i}D, \quad R_{0} = X, \quad (i = 1, \ldots, k), \quad q_{i} \in \{-(\beta - 1), \ldots, 0, \ldots, (\beta - 1)\} \tag{2}$$

$$k = n/\log_2 \beta = n/\log_2 2^{m} = n/m \tag{3}$$

where  $\beta$, k and i are the radix, the number of stages needed for the full execution of the division operation, and the iteration step of the operation, respectively. Also  $\beta R_i$  and  $q_i$  are the shifted partial remainder and the quotient fraction in each step of the division. The final quotient (Q) and the final remainder (R) are calculated as follows [17]:

$$Q = \sum_{i=1}^{k} \beta^{-i} \times q_i \tag{4}$$

$$R = \beta^{-k} \cdot R_k \tag{5}$$
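To make Eqs. (2)–(5) concrete, the recurrence can be run with exact rational arithmetic. The following is a minimal sketch, not the circuit described in this paper: the function name `srt_divide` is hypothetical, and the quotient digit is chosen by an idealized rule (nearest integer to the shifted remainder divided by the divisor, clamped to the digit set) rather than by a hardware quotient selection unit.

```python
from fractions import Fraction


def srt_divide(x, d, n=16, m=2):
    """Run the recurrence of Eq. (2) for k = n/m steps (Eq. (3)).

    Returns (digits, Q, R) where Q follows Eq. (4) and R follows Eq. (5).
    Digit selection here is an idealized stand-in, not a real selection table.
    """
    beta = 2 ** m                      # radix, 4 for m = 2
    k = n // m                         # number of iterations, 8 for n = 16
    x, d = Fraction(x), Fraction(d)
    r = x                              # R_0 = X
    q_total = Fraction(0)
    digits = []
    for i in range(1, k + 1):
        shifted = beta * r             # shifted partial remainder
        q = round(shifted / d)         # idealized digit choice (assumption)
        q = max(-(beta - 1), min(beta - 1, q))   # clamp to {-(beta-1), ..., beta-1}
        r = shifted - q * d            # Eq. (2)
        q_total += Fraction(q, beta ** i)        # accumulate Eq. (4)
        digits.append(q)
    return digits, q_total, r / beta ** k        # Eq. (5)


digits, quotient, remainder = srt_divide(Fraction(5, 8), Fraction(3, 4))
```

Whatever digits are selected, the results satisfy X = Q·D + R exactly; the choice of digits only determines how small the final remainder is.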

Using high radices effectively reduces the number of division steps and the latency. On the other hand, using a look-up table for quotient selection increases the complexity and the power consumption of SRT dividers. Techniques that reduce the size of the look-up tables, or eliminate them, can increase the speed of the divider, decrease its complexity, and in many cases reduce its power consumption. Therefore, SRT dividers with high radix and no look-up table are desirable. In SRT dividers the integer part of  $\beta R_i$  can be used to determine the value of the quotient digit  $(q_{i+1})$. To implement the algorithm without a look-up table, the dividend and the divisor (X, D) must be pre-scaled before the division starts. To do that, the divisor range is changed from its initial value to the range of [1,  $D_{\text{max}}$ ], in which:

$$D_{\max} = (\beta - 1 - 2^{-t})/(\beta - 1) \tag{6}$$

In above equation, t is the number of most significant bits of the fraction bits of shifted partial remainder which is considered 2 or 3. The scaling does not change the result of division operation,



Fig. 9. 4-bit and 5-bit Kogge-Stone subtractors: (a) before removal of the not needed cells, (b) after removal of the cells.



**Fig. 10.** Implementation of  $-q_i \times MD_i^+$  generator.

which is:

$$Q = X/D = (MX)/(MD) \tag{7}$$

*M* is considered as the scaling factor in the division. To have high efficiency and low complexity in the division circuit, the main parameters of the algorithm ( $\beta$ , *t*) are chosen as  $\beta$ =4 and *t*=2. The required hardware for the implementation of radix-4 ( $\beta$ =4) algorithm is only slightly more than the required hardware for radix-2 divider. The performance of SRT dividers can be increased by employing pipeline structure [18].

#### 3.1. 16-bit radix-4 pipeline divider

Fig. 3 shows the schematic diagram of the proposed 16-bit radix-4 pipeline divider. First, the multiples of the divisor ( $\pm$ 3*MD*,  $\pm$ 2*MD*,  $\pm$ *MD*) are produced in the scaling unit. Then a fixed number of quotient bits is obtained in each dividing unit, that is, in each division iteration. The quotient converter unit converts the signed-digit format to binary format. Its inputs are the quotient fraction bits produced in each step. In the following, the three units shown in Fig. 3 are explained in more detail.

#### 3.1.1. Scaling unit

Fig. 4 shows different parts of the scaling unit of the divider. It consists of scaling factor generator and  $\pm MX$  and  $\pm MD$  coefficients generator:

• Scaling factor generator:

The factor *M* required for changing the range of  $D = (0.1d_2d_3d_4...d_{15})$  and *X* is obtained from three bits of the divisor,  $d_2$,  $d_3$  and  $d_4$. As shown in Figs. 4 and 5, signals a, b and c (or equivalently signals aa, bb, and cc, which are produced from these bits) are connected to the select inputs of a 3:1 multiplexer. The outputs of the multiplexer are the required fractions of signals *D* and *X*. Connecting signals aa, bb and cc directly to many 3:1 MUXs would greatly increase the load on these signals. To avoid this, a buffer unit is used, as shown in Fig. 4.

•  $\pm$  MX and  $\pm$  MD coefficients generator:

In this unit to produce multiples of divisor ( $\pm MD$ ,  $\pm 2MD$ ,  $\pm 3MD$ ), redundant bit adders (RBA) which are free of carry

propagation delay are used. These adders do not need the final carry of the adjacent circuit and their delay is independent of the input size. Fig. 6 shows the gate-level RBA structure, which is designed by modifying a fast 4-2 compressor structure. In this unit the inputs and outputs are in redundant binary form, and each output is expressed by two bits  $(S_i = S_i^{-}S_i^{+})$.  $\pm MD$ (or  $\pm MX$) in redundant binary form is obtained by adding *KD* and *D* (or *KX* and *X*). All bits of KD (KX) are positive because D (X) is in the range of [1/2, 1). Therefore, to produce  $\pm MD$ ($\pm MX$) a simplified version of RBA (SRBA<sup>4</sup>) can be used. Compared to RBA, the advantages of SRBA are lower delay, area, and power dissipation. Fig. 7(a) shows the gate-level structure of SRBA. In the figure, first  $S_i^-$  and then  $C_i^+(S_{i+1}^+)$  are produced. In a modified version of the structure, shown in Fig. 7(b), the two signals are produced simultaneously. This decreases the overall delay of the scaling unit. SRBA cannot be used to generate the  $\pm 3MD$  signals. These signals are constructed from  $\pm MD$  and  $\pm 2MD$, which are in redundant binary form, meaning that their negative positions are not necessarily zero. Thus, to generate  $\pm 3MD$, the RBA structure is used.
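The two-bit digit encoding can be illustrated with a short sketch. It assumes the usual redundant binary convention in which each digit has the value $S_i^{+} - S_i^{-}$; the text does not spell this convention out, so it is an assumption here, and the function names are hypothetical. The sketch covers only the number representation, not the carry-free adder itself.

```python
def rb_value(neg_bits, pos_bits):
    """Integer value of a redundant binary number.

    Bit lists are MSB first; digit i is pos_bits[i] - neg_bits[i],
    so each digit is -1, 0 or +1 (assumed convention).
    """
    value = 0
    for n, p in zip(neg_bits, pos_bits):
        value = 2 * value + (p - n)
    return value


def rb_from_unsigned(bits):
    """An ordinary positive binary operand has all negative positions zero."""
    return [0] * len(bits), list(bits)


def rb_negate(neg_bits, pos_bits):
    """Negation swaps the positive and negative bit vectors."""
    return pos_bits, neg_bits


neg, pos = rb_from_unsigned([1, 0, 1, 1])     # plain binary 1011
mixed = ([0, 1, 0, 0], [1, 0, 0, 1])          # digits 1, -1, 0, 1
```

Under this convention an operand with all-zero negative positions is the case a simplified adder can exploit, while a general operand such as `mixed` needs the full adder.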

#### 3.1.2. Dividing unit

In dividing unit which is the main unit of the pipeline divider, the quotient fraction and partial remainder of the corresponding iteration are calculated. Usually delay of the unit is more than that of the scaling unit, and the operation speed of the divider is mainly determined by this unit. As shown in Fig. 8 the unit consists of three following subunits:

• Quotient selection unit (QSU):

In this unit, which is the first block of the dividing unit, the (m+t+1) most significant bits of the shifted partial remainder of each iteration are used to generate the (m+1)-bit quotient fraction in signed format. The value of 't' is selected arbitrarily and the value of 'm' is determined by the radix of the divider. For t=2 and  $\beta=4$  (m=2), the 5 most significant bits of the shifted partial remainder ( $4R_i$ ) are used to determine

<sup>4</sup> Simple RBA.

the three-bit quotient fraction in each iteration. In the present work, 4-bit and 5-bit radix-2 Kogge–Stone subtractors are used to implement this function. The Kogge–Stone structure, which is usually used in high-speed applications, is a parallel form of the carry look-ahead adder. In this structure the carry generation time is proportional to log(*n*), where *n* is the number of input bits of the adder. Due to its low fan-out and regular structure, Kogge–Stone is faster and more commonly used than other logarithmic parallel adders [20]. In the QSU, some cells of the 4-bit and 5-bit Kogge–Stone subtractors play no role in generating the quotient, so they can be eliminated. This significantly reduces the consumed power and the area. Fig. 9 shows the subtractors before and after removal of the unnecessary cells.
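The logarithmic depth of the Kogge–Stone structure can be seen in a bit-level sketch of the prefix network. This is a generic textbook-style model under stated assumptions, not the pruned subtractors of Fig. 9: the function names are hypothetical, and subtraction is modeled as addition of the one's complement with a carry-in of one.

```python
def kogge_stone_add(a, b, n, carry_in=0):
    """Add two n-bit integers with a Kogge-Stone prefix network.

    Returns (sum modulo 2**n, carry_out, number_of_prefix_levels).
    """
    g = [(a >> i) & (b >> i) & 1 for i in range(n)]      # bit generate
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]    # bit propagate
    g[0] |= p[0] & carry_in          # fold the carry-in into bit 0
    G, P = g[:], p[:]
    dist, levels = 1, 0
    while dist < n:
        G_new, P_new = G[:], P[:]
        for i in range(dist, n):
            # Combine the group ending at i with the group ending at i - dist.
            G_new[i] = G[i] | (P[i] & G[i - dist])
            P_new[i] = P[i] & P[i - dist]
        G, P = G_new, P_new
        dist *= 2                    # span doubles at every level
        levels += 1
    carries = [carry_in] + G[:-1]    # carry into each bit position
    total = sum((p[i] ^ carries[i]) << i for i in range(n))
    return total, G[n - 1], levels


def kogge_stone_sub(a, b, n):
    """a - b modulo 2**n, computed as a + ~b + 1 on the same network."""
    mask = (1 << n) - 1
    return kogge_stone_add(a, ~b & mask, n, carry_in=1)
```

For the 4-bit and 5-bit widths mentioned above, this model uses two and three prefix levels respectively.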

•  $\pm q_i \times MD_i$  generator unit:

In each iteration of the radix-4 SRT divider, the value  $R_i=4R_{i-1}+(-q_i \times MD_i)$  must be computed, so  $\pm q_i \times MD_i$  is needed. Since the radix of the divider is 4, the partial quotient set is  $\{-3, -2, -1, 0, 1, 2, 3\}$, and the partial quotient  $q_i$  is expressed by three bits  $(q_i = s q_1 q_0)$. Here, the first bit (s) gives the sign and the next two bits  $(q_1 q_0)$  give the magnitude of the quotient. The  $\pm q_i \times MD_i$  generator unit is a set of 8:1 MUXs that use the three bits of  $q_i$  as their control signals. Fig. 10 shows the implementation of the  $-q_i \times MD_i^+$  generator unit. The unit needs a buffer at its output.

• Partial remainder (R<sub>i</sub>) generator:

The function of this block in the divider unit is to produce partial remainder. Using equation  $R_i=4R_{i-1}+(-q_i \times MD_i)$ , the value of partial remainder is calculated in each iteration of division. RBA structure is used for this purpose.
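The last two sub-units can be summarized as a selection followed by an addition. The sketch below models them on plain integers and is hedged accordingly: the function names are hypothetical, the polarity of the sign bit (s = 1 meaning a negative quotient digit) is an assumption, and the redundant binary adder is replaced by ordinary integer addition.

```python
def select_neg_q_md(s, q1, q0, md):
    """Model of the 8:1 selection of -q_i * MD from the bits (s, q1, q0).

    Assumed encoding: sign-magnitude, s = 1 for a negative digit,
    and both encodings of zero select 0.
    """
    table = {
        (0, 0, 0): 0,
        (0, 0, 1): -md,
        (0, 1, 0): -2 * md,
        (0, 1, 1): -3 * md,
        (1, 0, 0): 0,
        (1, 0, 1): md,
        (1, 1, 0): 2 * md,
        (1, 1, 1): 3 * md,
    }
    return table[(s, q1, q0)]


def next_remainder(r_prev, s, q1, q0, md):
    """R_i = 4 * R_(i-1) + (-q_i * MD_i), with the adder modeled as integer +."""
    return 4 * r_prev + select_neg_q_md(s, q1, q0, md)
```

For example, with a previous remainder of 5, a scaled divisor of 7 and quotient digit +3, the model gives 4·5 − 21 = −1.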

#### 3.1.3. Quotient converter

To convert the signed quotient to the binary quotient, the algorithm proposed in [21] is employed. In this algorithm a separate block (the quotient converter unit), which operates synchronously with the iterations of the division, converts the quotient fraction



Fig. 11. Final schematic diagram of the proposed 16-bit radix-4 pipelined divider.

to the binary format. It is done without any major impact on the total delay, area and power consumption.
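The idea behind on-the-fly conversion is to keep two running values, the converted quotient and the same value minus one unit in the last place, so that a negative digit never triggers a borrow chain. The following is a minimal integer sketch of that general technique, not the converter circuit of this paper; the function name and the integer (rather than fractional) scaling are illustrative assumptions.

```python
def on_the_fly_convert(digits, radix=4):
    """Convert signed digits (MSB first) to an ordinary integer.

    q  holds the value of the digits seen so far;
    qm holds q - 1, so a negative digit is absorbed without borrow propagation.
    """
    q, qm = 0, -1
    for d in digits:
        if not -(radix - 1) <= d <= radix - 1:
            raise ValueError("digit outside the signed-digit set")
        # New q: append d to q, or append (radix + d) to qm when d is negative.
        q_next = radix * q + d if d >= 0 else radix * qm + (radix + d)
        # New qm: append (d - 1) to q, or append (radix - 1 + d) to qm when d <= 0.
        qm_next = radix * q + (d - 1) if d > 0 else radix * qm + (radix - 1 + d)
        q, qm = q_next, qm_next
    return q


converted = on_the_fly_convert([1, -2, 3, 0, -1])
```

In every branch the appended digit lies between 0 and radix − 1, which is why in hardware each update reduces to a register selection and a digit concatenation.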

#### 3.2. Using latches to implement the pipeline structure

The designed divider operates sequentially and at each step provides only some of the final quotient bits. Therefore the final result is ready after a certain latency. Employing a pipeline structure increases the latency of the divider but also increases the operating frequency of the circuit. To implement the pipeline structure, latches are used between the divider stages. For any two adjacent stages, when one is in the pre-charge phase the other is in the evaluation phase. Fig. 11 shows the schematic diagram of the proposed 16-bit radix-4 divider.

#### 4. Implementation and simulation results of the divider for both domino and the modified version of the SPD<sup>3</sup>L structure

Using HSPICE and a TSMC 180 nm, 1.8 V CMOS technology, the divider structure of Fig. 11 was designed and implemented at transistor level in both the domino structure and the modified version of the SPD<sup>3</sup>L structure (SPCD<sup>3</sup>L). Accurate transistor sizing was necessary to balance delay, power, and area. To allow a fair comparison, the sizes of the PDNs in both designs were chosen so that the generic evaluation path in both circuits was equivalent to the width of one NMOS transistor (Wn). Since the pre-charge phase occurs simultaneously in all domino gates, the width of the clocked PMOS transistors in the domino circuit was set close to the minimum transistor size. In SPD<sup>3</sup>L circuits, even though series PMOS transistors are not present, the pre-charge phase does not occur simultaneously but propagates through the cascaded stages, so it is longer than that of domino circuits. To reduce this delay, the PMOS transistors in the SPD<sup>3</sup>L circuit are chosen larger than those in the domino circuit. The static inverter and static NAND gates in both structures are high-skewed for better performance, and in both designs a keeper transistor is employed to reduce the effect of leakage current. As an example, Fig. 12 shows the transistor-level SPD<sup>3</sup>L structure of the  $\pm q_i \times MD_i$  generator unit.

As mentioned before, by enlarging pre-charge PMOS transistors in SPD<sup>3</sup>L circuit, pre-charge time is reduced, but at the same



**Fig. 12.** Transistor level using SPD<sup>3</sup>L structure for  $\pm q_i \times MD_i$  generator unit.

time, the evaluation time increases. Also, because of the bigger transistors and higher parasitic capacitances, the power consumption and the occupied area increase. To eliminate these disadvantages, which are caused by the long pre-charge propagation wave, in the present modified version of SPD<sup>3</sup>L the pre-charge operation of the sections that most influence the delay is performed by the clock signal instead of a data signal. With this change, which gives a Split Path Clock and Data Driven Dynamic Logic structure (SPCD<sup>3</sup>L), long pre-charge propagation waves are broken into shorter waves and the need for very large PMOS transistors is eliminated. The critical sections are some small parts of the dividing unit. The two sections changed to domino topology are the quotient buffer network and the first section of the RBA block in the dividing unit. After the modification, less than one-third of each dividing unit is in domino topology and the remaining two-thirds are in SPD<sup>3</sup>L topology. On the one hand, the modification enlarges the clock distribution network and increases the load on the clock signal path, which has a negative effect on area, power, and delay. On the other hand, it eliminates the need for very large PMOS transistors, so the PMOS sizes (Wp (Scaling Unit) and Wp (Dividing Unit)) are chosen very close to the minimum transistor size. Only a small percentage of the circuit is changed from data-driven to clock-driven topology, and the overall effect on area, delay, and power was positive. The final transistor sizing, obtained by running many parametric simulations, is as follows:

- Wp<sub>clk(domino)</sub> = 0.22 μm
- Wn = 0.4 μm (in both domino and SPCD<sup>3</sup>L designs)
- Wp<sub>SPCD3L(scaling unit)</sub> = 0.5 μm and Wp<sub>SPCD3L(dividing unit)</sub> = 0.3 μm
- High-skewed static inverter and static NAND gates: Wp<sub>NAND, inverter</sub> = 2 μm, Wn<sub>NAND, inverter</sub> = 0.4 μm

For the clock distribution networks of the dynamic gates in the domino and SPCD<sup>3</sup>L implementations of divider similar structures are used. In this structure the clock lines of input registers and scaling unit are separated from each other, and the even and odd dividing units use clock and reverse clock signals, respectively. Also, the latches between the stages are connected to clock and reverse clock. The clock network is shown in Fig. 13. Different clock buffering for different units adjusts different required timing for different stages. Logical effort method [22] was used for sizing the inverter chains of the clock buffer. To do that, first an optimal effort delay (*f*) was chosen for the first line, and then according to following equations, the path delay of that line (*D*) was calculated.

$$f = F^{1/N} \tag{8}$$

$$D = N \times F^{1/N} + \sum P_{inv} \tag{9}$$

In above equations, *F*, *D*, *N*, and  $P_{inv}$  are the path effort, the path delay, the number of stages in the chain, and the parasitic delay of inverter, respectively. In the employed 180 nm technology,  $P_{inv}$  was considered 15 ps. After calculating *D*, the path delay of other lines was set equal to it, and the effort delay of each line calculated accordingly. As a result, despite different loadings, the five clock signals (clk, clk1,2, clk3,4, clk5,6, and clk7,8) and four reverse clock signals (clkn1,2, clkn3,4, clkn5,6, and clkn7,8) have same amount of delay. To have correct operation of the circuit and adjust arrival time of input signals,



Fig. 13. The clock buffer distribution.

only for the clock signal of the input registers (clkk) the delay ( $D_0$ ) is chosen smaller than the others ( $D_0 = D/3$ ).
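The sizing procedure based on Eqs. (8) and (9) can be sketched numerically. The sketch makes several assumptions that are not taken from the paper: the function names are hypothetical, every stage of a chain is treated as an identical inverter so that the sum of parasitic delays becomes N times `p_inv`, and all delays are expressed in one consistent normalized unit.

```python
import math


def stage_effort(path_effort, n_stages):
    """Eq. (8): effort per stage for a chain of n_stages equal-effort stages."""
    return path_effort ** (1.0 / n_stages)


def path_delay(path_effort, n_stages, p_inv=1.0):
    """Eq. (9), assuming n_stages identical inverters each with parasitic p_inv."""
    return n_stages * stage_effort(path_effort, n_stages) + n_stages * p_inv


def stage_effort_for_delay(target_delay, n_stages, p_inv=1.0):
    """Invert Eq. (9): stage effort another clock line needs to match target_delay."""
    return (target_delay - n_stages * p_inv) / n_stages


# Reference line: path effort 64 driven through 3 inverters (illustrative numbers).
reference_delay = path_delay(64, 3)
# A second line with 4 inverters is then given the stage effort that matches it.
matched_effort = stage_effort_for_delay(reference_delay, 4)
```

Matching the path delay of every line to the reference in this way is what equalizes the arrival times of the buffered clock signals despite their different loads.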

The divider performance was evaluated through simulation by applying many systematic and random inputs. As an example, Fig. 14 shows the second bit of the quotient fraction  $(q_1)$  when D is (0.101100010100111)<sub>2</sub> and X is (0.10011001111100)<sub>2</sub>. It shows how  $q_1$  is produced in the pipelined structure. Simulation results for both the domino and the proposed SPCD<sup>3</sup>L structures at different corners are shown in Table 1. The results show that the consumed energy and the evaluation delay of the SPCD<sup>3</sup>L structure are lower than those of the domino structure. As expected, the pre-charge delay of the SPCD<sup>3</sup>L structure is larger than that of the domino structure. However, since this delay is smaller than the evaluation delay, it does not determine the operating frequency of the circuit. The clock network of the SPCD<sup>3</sup>L circuit is smaller than that of the domino circuit. Accordingly, simulation results show that the clock network of SPCD<sup>3</sup>L consumes about 55% less energy than that of the domino circuit. At the typical corner (tt, 27 °C), the clock networks of the domino and SPCD<sup>3</sup>L structures consume 16.77% and 9.2% of the total dissipated energy, respectively. Simulation results at different process corners and temperatures show that the average energy consumption and delay of the SPCD<sup>3</sup>L structure are lower than those of the domino structure. Accordingly, the energy-delay product of the SPCD<sup>3</sup>L structure is improved compared to the domino structure.
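The percentages quoted above can be re-derived from the typical-corner (tt, 27 °C) row of Table 1. The short script below is only a consistency check on the published numbers; the dictionary layout and helper names are illustrative.

```python
# Energies in pJ, copied from the tt 27 C rows of Table 1.
TT = {
    "domino": {"clock_pj": 75.0, "total_pj": 447.28},
    "spcd3l": {"clock_pj": 33.5, "total_pj": 363.9},
}


def percent_reduction(old, new):
    """Relative saving of `new` with respect to `old`, in percent."""
    return 100.0 * (old - new) / old


def percent_share(part, whole):
    """Share of `part` in `whole`, in percent."""
    return 100.0 * part / whole


clock_saving = percent_reduction(TT["domino"]["clock_pj"], TT["spcd3l"]["clock_pj"])
domino_share = percent_share(TT["domino"]["clock_pj"], TT["domino"]["total_pj"])
spcd3l_share = percent_share(TT["spcd3l"]["clock_pj"], TT["spcd3l"]["total_pj"])
total_saving = percent_reduction(TT["domino"]["total_pj"], TT["spcd3l"]["total_pj"])
```

The first three values reproduce the clock-energy saving and the two clock-network shares stated in the text; `total_saving` gives the corresponding reduction in total energy at the same corner.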

#### 5. Conclusion

Using HSPICE and a TSMC 180 nm technology, a 16-bit radix-4 pipelined divider in a modified version of SPD<sup>3</sup>L structure (SPCD<sup>3</sup>L) was designed. In the proposed circuit, to improve the performance of the data driven structure, the time of pre-charge phase is decreased by replacing some of the data signals used as pre-charge control signals, by clock signal. This is done only at critical nodes of the circuit so that its advantages overcome its drawbacks. Simulation results at different process corners revealed the superiority of the circuit in terms of speed, area, and power consumption, compared to its domino counterpart.





Table 1. Simulation results.

| Corner    | Design            | Pre-charge delay (ps) | Evaluation delay (ps) | Clock buffer energy (pJ) | Total energy (pJ) | EDP (pJ·ns) |
|-----------|-------------------|-----------------------|-----------------------|--------------------------|-------------------|-------------|
| tt 27 °C  | Dynamic domino    | 450                   | 950                   | 75                       | 447.28            | 425.92      |
|           | SPCD<sup>3</sup>L | 750                   | 880                   | 33.5                     | 363.9             | 322.05      |
| ss 125 °C | Dynamic domino    | 710                   | 1440                  | 77.74                    | 436.38            | 628.4       |
|           | SPCD<sup>3</sup>L | 1100                  | 1350                  | 33.17                    | 341.95            | 461.64      |
| ff −55 °C | Dynamic domino    | 350                   | 660                   | 75.42                    | 487.92            | 322.1       |
|           | SPCD<sup>3</sup>L | 580                   | 620                   | 33.97                    | 386.13            | 239.5       |
| fs 27 °C  | Dynamic domino    | 470                   | 890                   | 79.67                    | 497.68            | 442.93      |
|           | SPCD<sup>3</sup>L | 760                   | 830                   | 34.19                    | 375.71            | 311.84      |
| sf 27 °C  | Dynamic domino    | 520                   | 1060                  | 76.83                    | 423.65            | 449.07      |
|           | SPCD<sup>3</sup>L | 820                   | 990                   | 32.97                    | 337.05            | 333.68      |

#### References

- [1] D. Wang, M.D. Ercegovac, Z. Nanning, Design and analysis of high radix complex dividers, in: Second International Conference on Computer Engineering and Technology, ICCET, 2010, pp. 84–88.
- [2] A. Alaaeldin, S. Waleed, High-radix multiplier-dividers: theory, design, and hardware, IEEE Transactions on Computers 59 (8) (2010) 1009–1022.
- [3] J. Fan, L. Batina, I. Verbauwhede, Design and design methods for unified multiplier and inverter and its application for HECC, Integration, the VLSI Journal 44 (4) (2011) 280–289.
- [4] N.S. Chang, C.H. Kim, Y.-H. Park, S. Hong, New bit parallel multiplier with low space complexity for all irreducible trinomials over, Transactions on Very Large Scale Integration Systems (VLSI) 20 (10) (2012), (1903–190).
- [5] H. Kawaguchi, T. Sakurai, A reduced clock-swing flip-flop (RCSFF) for 63% power reduction, IEEE Journal of Solid-State Circuits 33 (5) (1998) 807–811.
- [6] R. Rafati, S.M. Fakhraie, K.C. Smith, Low-power Data-Driven Dynamic Logic, in: IEEE International Symposium on Circuits and Systems. ISCAS. 2000, pp. 752–755.
- [7] R. Rafati, A.Z. Charaki, S.M. Fakhraie, K.C. Smith, Data-Driven Dynamic Logic versus NP-CMOS Logic, a Comparison, in: The 12th International Conference on Microelectronics Tehran, ICM, 31 October–2 November 2000, pp. 57–60.
- [8] R. Rafati, S.M. Fakhraie, K.C. Smith, A 16-bit barrel-shifter implemented in data-driven dynamic logic (D<sup>3</sup>L), IEEE Transactions on Circuits and Systems 53 (10) (2006) 2194–2202.
- [9] R. Rafati, A.Z. Charaki, R.Z. Chaji, S.M. Fakhraie, K.C. Smith, Comparison of a 17b multiplier in dual-rail Domino and in dual-rail D<sup>3</sup>L (D<sup>4</sup>L) logic styles, in: IEEE International Symposium on Circuits and Systems, 257–260, 2002.
- [10] F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, P. Corsonello, Designing high speed adders in power-constrained environments, IEEE Transactions on Circuits and Systems II: Express Briefs 56 (2) (2009) 172–176.
- [11] F. Frustaci, M. Lanuzza, A New Optimized High-speed Low-power Data Driven Dynamic (D<sup>3</sup>L) 32-bit Kogge–Stone Adder, PATMOS, 357–366, 2010.

- [12] F. Frustaci, M. Lanuzza, P. Zicari, S. Perri, P. Corsonello, Low-power split-path data-driven dynamic logic, IET Circuits Devices and Systems 3 (6) (2010) 303–312.
- [13] S. Purohit, M. Lanuzza, S. Perri, M. Margala, Design-space exploration of energy-delay-area efficient coarse-grain reconfigurable data path, in: 22nd International Conference on VLSI Design, 2009, pp. 45–50.
- [14] T. Aoki, K. Nakazawa, T. Higuchi, High-radix parallel VLSI dividers without using quotient digit selection tables, in: Proceedings of the 30th IEEE International Symposium of Multiple-Valued Logic, ISMVL, 2000, pp. 345–352.
- [15] J.P. Deschamps, G.H.A. Bioul, G.D. Sutter, Synthesis of Arithmetic Circuits, FPGA, ASIC and Embedded System, first ed., John Wiley & Sons, New York, NY, 2006.
- [16] S. Pourashraf, S. Sayedi, A low power D<sup>3</sup>L 16-bit radix- 4 pipelined SRT divider, in: Canadian Conference on Electrical and Computer Engineering, 29 April-2 May 2012.
- [17] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, second ed., Oxford University Press, New York, 2010.
- [18] T. Aoki, K. Nakazawa, T. Higuchi, Design of high-radix VLSI dividers without quotient selection tables, IEICE Trans. Fundamentals of Electronics, Communications and Computer Sciences E84-A (11) (2001) 2623–2631.
- [19] X. Guo, C. Sechen, High speed redundant adder and divider in output prediction logic, in: Proceedings of the IEEE Computer Society Annual Symposium on VLSI, 2005, pp. 34 –41.
- [20] P.M. Kogge, H.S. Stone, A parallel algorithm for the efficient solution of a general class of recurrence equations, IEEE Transactions on Computers C-22 (8) (1973) 786–793.
- [21] M.D. Ercegovac, T. Lang, On-the-fly conversion of redundant into conventional representations, IEEE Transactions on Computers C-36 (7) (1987) 895–897.
- [22] I. Sutherland, R. Sproull, D. Harris, Logical Effort: Designing Fast CMOS Circuits, Morgan Kaufmann Publishers Inc, San Francisco, CA, 1999.