# Low-Power and Fast Full Adder by Exploring New XOR and XNOR Gates

Hamed Naseri and Somayeh Timarchi, Member, IEEE

Abstract: In this paper, novel circuits for XOR/XNOR and simultaneous XOR-XNOR functions are proposed. The proposed circuits are highly optimized in terms of power consumption and delay, owing to their low output capacitance and low short-circuit power dissipation. We also propose six new hybrid 1-bit full-adder (FA) circuits based on the novel full-swing XOR-XNOR or XOR/XNOR gates. Each of the proposed circuits has its own merits in terms of speed, power consumption, power-delay product (PDP), driving ability, and so on. To investigate the performance of the proposed designs, extensive HSPICE and Cadence Virtuoso simulations are performed. The simulation results, based on the 65-nm CMOS process technology model, indicate that the proposed designs are faster and consume less power than other FA designs. A new transistor sizing method is presented to optimize the PDP of the circuits. In the proposed method, the numerical computation particle swarm optimization algorithm is used to achieve the desired value for optimum PDP with fewer iterations. The proposed circuits are investigated in terms of variations of the supply and threshold voltages, output capacitance, input noise immunity, and the size of transistors.

Index Terms: Full adder (FA), noise, particle swarm optimization (PSO), transistor sizing method, XOR-XNOR.

### I. INTRODUCTION

TODAY, ubiquitous electronic systems are an inseparable part of everyday life. Digital circuits, e.g., microprocessors, digital communication devices, and digital signal processors, comprise a large part of electronic systems. As the scale of integration increases, the usability of circuits is restricted by the growing power [1] and area consumption.
Therefore, with the growing popularity of and demand for battery-operated portable devices such as mobile phones, tablets, and laptops, designers try to reduce the power consumption and area of such systems while preserving their speed. Optimizing the W/L ratio of transistors is one approach to decreasing the power-delay product (PDP) of the circuit while avoiding the problems resulting from a reduced supply voltage [2]. The efficiency of many digital applications depends on the performance of the arithmetic circuits, such as adders, multipliers, and dividers. Due to the fundamental role of addition in all arithmetic operations, many efforts have been made to explore efficient adder structures, e.g., carry select, carry skip, conditional sum, and carry look-ahead adders. The full adder (FA), as the fundamental block of these structures, is at the center of attention [3]–[5].

Manuscript received October 16, 2017; revised January 31, 2018; accepted March 6, 2018. (Corresponding author: Somayeh Timarchi.) The authors are with the Faculty of Electrical Engineering, Shahid Beheshti University, Tehran 1983963113, Iran (e-mail: h\_naseri@sbu.ac.ir; s\_timarchi@sbu.ac.ir). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TVLSI.2018.2820999

Based on the output voltage level, FA circuits can be divided into full-swing and nonfull-swing categories. Standard CMOS [2], [6], complementary pass-transistor logic (CPL) [7], [8], transmission gate (TG) [9]–[11], transmission function [2], [10], [12], 14T (14 transistors) [7], [13], 16T [10], [12], [14], [15], and hybrid pass logic with static CMOS output drive full adder (HPSC) [3], [12], [16]–[20] FAs are the most important full-swing families. The nonfull-swing category comprises 10T [4], 9T [21], and 8T [22].
In this paper, we evaluate several circuits for the XOR or XNOR (XOR/XNOR) and simultaneous XOR and XNOR (XOR-XNOR) gates and offer new circuits for each of them. We also try to remove the problems existing in the investigated circuits. Afterward, with these new XOR/XNOR and XOR-XNOR circuits, we propose six new FA structures.

The rest of this paper is organized as follows. In Section II, the circuits for XOR/XNOR and simultaneous XOR-XNOR are reviewed. In Section III, novel XOR/XNOR and XOR-XNOR circuits are proposed and the simulation results of these structures are presented. Furthermore, based on the introduced XOR/XNOR and XOR-XNOR gates, six new FA circuits are proposed and their advantages and disadvantages are investigated. In Section IV, the transistor sizing methods are first investigated, and then, by providing an appropriate method for transistor sizing, the circuits are simulated for power, delay, and PDP parameters. The simulation results are analyzed and compared in Section V. Section VI concludes this paper.

#### II. REVIEW OF XOR AND XNOR GATES

# A. XOR-XNOR Circuits

Hybrid FAs are made of two modules: a 2-input XOR/XNOR (or simultaneous XOR-XNOR) gate and a 2-to-1 multiplexer (2-1-MUX) gate [3]. The XOR/XNOR gate is the major consumer of power in the FA cell. Therefore, the power consumption of the FA cell can be reduced by an optimum design of the XOR/XNOR gate. The XOR/XNOR gate also has many applications in digital circuit design. Many circuits have been proposed to implement the XOR/XNOR gate [11], [12], [16], [24]; a few of the most efficient ones are shown in Fig. 1.

Fig. 1. (a) and (b) Full-swing XOR/XNOR and (c)-(g) XOR-XNOR circuits. (a) [16]. (b) [11]. (c) [16]. (d) [3]. (e) [7], [13]. (f) [18]. (g) [23].

Fig. 1(a) shows the full-swing XOR/XNOR gate circuit [16] designed in the double pass-transistor logic (DPL) style. This structure has eight transistors.
The main problem of this circuit is the use of two high-power NOT gates on the critical path of the circuit, because the NOT gates must drive the output capacitance. Therefore, the size of the transistors in the NOT gates should be increased to obtain a lower critical path delay. Furthermore, it creates an intermediate node with a large capacitance. That is, the NOT gates drive the output of the circuit through, for example, a pass transistor or TG. Therefore, the short-circuit power and, thus, the total power dissipation of this circuit increase considerably. Moreover, in the optimum PDP situation, the critical path delay also increases slightly.

Fig. 1(b) shows another example of full-swing XOR and XNOR gates [11], each made of six transistors. This circuit is based on the pass-transistor logic (PTL) style, and its delay and power consumption are better than those of the circuit depicted in Fig. 1(a). The only problem of this structure is the use of a NOT gate on the critical path of the circuit. The XOR circuit of Fig. 1(b) has a lower delay than its XNOR circuit, because the critical path of the XOR circuit consists of a NOT gate and an nMOS transistor (*N*3), whereas the critical path of the XNOR circuit consists of a NOT gate and a pMOS transistor (*P*5) (a pMOS transistor is slower than an nMOS transistor). Therefore, to improve the speed of the XNOR circuit, the sizes of the pMOS transistor (*P*5) and the NOT gate should be increased.

#### B. Simultaneous XOR-XNOR Circuits

In recent years, the simultaneous XOR-XNOR circuit has been widely used in hybrid FA structures [3], [9], [16], [18]. Commonly, in hybrid FAs, the XOR-XNOR signals are connected to the inputs of the 2-1-MUX as select lines. Therefore, two simultaneous signals with the same delay are necessary to avoid glitches on the output nodes of the FA.

Fig. 1(c) shows an example of the simultaneous XOR-XNOR circuit [16]. This circuit is based on the CPL logic style and is designed with ten transistors.
In this structure, the outputs are driven only by nMOS transistors, and thus, two cross-coupled pMOS transistors are connected to the outputs (XOR and XNOR) to restore the output voltage levels. One problem of this XOR-XNOR circuit is the feedback (cross-coupled structure) on the outputs, which increases the delay and short-circuit power of this structure. Therefore, to mitigate the imposed delay, the size of the transistors should be increased. Another disadvantage of this structure is the existence of two NOT gates on the critical path.

Goel *et al.* [3] removed two transistors (a NOT gate) from the XOR-XNOR circuit of Fig. 1(c) to reduce the power dissipation of the circuit. In Fig. 1(d), when the inputs are AB = 00, the transistors N3, N4, and N5 are turned OFF and logic "0" is passed through the transistor N2 to the XOR output. This "0" on XOR charges the XNOR output to $V_{\rm DD}$ through transistor P3. Therefore, the critical path of this circuit is longer than that of the circuit of Fig. 1(c).

Also, in this structure, a short-circuit current passes through the circuit when the input changes from AB = 01 to AB = 00. When the inputs are in state AB = 01, logic "1" is passed through the transistors N2, N3, and P2 to the XOR output and logic "0" is passed through the transistor N4 to the XNOR output. When the inputs change to AB = 00, all transistors are turned OFF except transistors N2 (through the input A) and P2 (through the XNOR output, which has not yet changed). Therefore, a short-circuit current flows through the transistors P2 and N2. If the current sourced by the transistor P2 is larger than the current sunk by the transistor N2, the short-circuit current will continue to be drawn from $V_{\rm DD}$ and the XOR and XNOR outputs will never switch. This situation also occurs when the input changes from AB = 11 to AB = 10, and it impairs the proper functioning of the circuit.
To guarantee the proper operation of this circuit, the ON-state resistances of transistors P2 and P3 should not be smaller than those of transistors N2 and N5 $(R_{P2} > R_{N2}, R_{P3} > R_{N5})$, respectively. Furthermore, this structure is very sensitive to process variation; if the size of the transistors changes, the circuit may not operate properly.

Fig. 2. (a) Nonfull-swing XOR/XNOR gate [24]. (b) Proposed full-swing XOR/XNOR gate. (c) RC model of proposed XOR for AB = 10. (d) RC model of proposed XOR for AB = 11. (e) Proposed XOR-XNOR gate.

In [7] and [13], a full-swing XOR-XNOR gate with only six transistors is proposed [shown in Fig. 1(e)]. The two complementary feedback transistors (N3 and P3) restore the weak logic levels on the output nodes (XOR and XNOR) when the inputs are AB = 00 or 11. However, this circuit suffers from a high worst case delay, because when the inputs change from AB = 01 or 10 to AB = 11 or 00, the outputs reach their final voltage values in two steps. To clarify the issue, when the inputs are AB = 10, logic "1" and logic "0" are passed through the N2 (XOR output) and P2 (XNOR output) transistors, respectively. When the inputs change to AB = 11, the transistors P1 and P2 are turned OFF (the XOR node is initially high impedance) and a weak logic "1" $(V_{DD} - V_{th_n})$ is passed through the transistors N1 and N2 to the XNOR output. The weak logic "1" on XNOR turns ON the feedback transistor N3, so that the XOR output is pulled down to a weak logic "0," and this weak logic "0" in turn switches ON the feedback transistor P3. Eventually, positive feedback is established and the XNOR and XOR outputs reach strong logic "1" and logic "0," respectively. This slow response problem is worse in low-voltage operation and also increases the short-circuit current [when one of the outputs (XOR or XNOR) is high impedance and the circuit feedback has not yet acted completely, the short-circuit current is passing through the circuit].
Also, if the size of the transistors in this circuit is not properly selected, the circuit may not operate correctly. Thus, this structure is very sensitive to process-voltage-temperature (PVT) variations.

Chang et al. [18] have proposed a new structure of the simultaneous XOR-XNOR gate [shown in Fig. 1(f)] by improving the six-transistor XOR-XNOR circuit of Fig. 1(e). In the circuit of Fig. 1(f), to solve the slow response problem and to allow operation at low supply voltages, two nMOS transistors (for AB = 11) and two pMOS transistors (for AB = 00) have been added to the XOR and XNOR outputs, respectively. The advantages of this structure are good driving capability, full-swing output, and robustness against transistor sizing and supply voltage scaling. The main problem of this circuit is the feedback structure, which imposes extra parasitic capacitance on the XOR and XNOR output nodes. Thus, the delay and power consumption increase significantly.

Fig. 1(g) [23] shows another circuit that improves the structure of Fig. 1(e). In this structure, a NOT gate is used to improve the circuit speed. This circuit is faster than that of Fig. 1(e), because in Fig. 1(g), the transistors N5 and P5 provide a path from GND or $V_{\rm DD}$ to the output nodes in two input states (AB = X1 for N5 and AB = X0 for P5). But in Fig. 1(e), the transistors N4 and P5 provide the same path for only one input state (AB = 11 for N4 and AB = 00 for P5). Also, the addition of a NOT gate creates an intermediate node with a large capacitance, which increases the power consumption of the circuit. Therefore, Fig. 1(g) consumes more power than Fig. 1(e).

Combining the XOR and XNOR circuits of Fig. 1(a), and likewise those of Fig. 1(b), results in two simultaneous XOR-XNOR gates. These new structures have all the advantages and disadvantages of their XOR/XNOR circuits.

# III. PROPOSED CIRCUITS

# A. Proposed XOR-XNOR Circuit

The nonfull-swing XOR/XNOR circuit of Fig.
2(a) [24] is efficient in terms of power and delay. Moreover, this structure has an output voltage drop problem for only one input combination. To solve this problem and provide an optimum structure for the XOR/XNOR gate, we propose the circuit shown in Fig. 2(b). For all possible input combinations, the output of this structure is full swing. The proposed XOR/XNOR gate does not have NOT gates on the critical path of the circuit. Thus, it has a lower delay and good driving capability in comparison with the structures of Fig. 1(a) and (b). Although the proposed XOR/XNOR gate has one more transistor than the structure of Fig. 1(b), it demonstrates lower power dissipation and higher speed.

The capacitances of inputs A and B of the XOR circuit shown in Fig. 2(b) are not symmetric, because one of these two inputs must be connected to the input of the NOT gate and the other to the diffusion of an nMOS transistor. Furthermore, the input capacitances of transistors N2 and N3 are not equal in the optimal situation (minimum PDP). Also, the order of the input connections to transistors N2 and N3 does not affect the function of the circuit. Thus, it is better to connect the input A, which is also connected to the NOT gate, to the transistor with the smaller input capacitance. By doing this, the input capacitances become more symmetrical, and thus, the delay and power consumption of the circuit are reduced.

To clarify which transistor (N2 or N3) has the larger input capacitance, let us consider the condition that the inputs change from AB = 00 to AB = 10. The RC models of the XOR circuit are shown in Fig. 2(c) and (d). In this condition, the transistor N2 drives only the capacitance of node X from GND to $V_{\rm DD} - V_{{\rm th}_n}$ [Fig. 2(c)], so it does not require a lower $R_{\rm N2}$. But, when the inputs change from AB = 10 to AB = 11, according to Fig.
2(d), we have

$$k_{N2} = \frac{W_{N2}}{W_{\min}}, \quad k_{N3} = \frac{W_{N3}}{W_{\min}}, \dots, k_{P3} = \frac{W_{P3}}{W_{\min}}$$

$$R_{N2} = \frac{R_{\min}}{k_{N2}}, \quad R_{N3} = \frac{R_{\min}}{k_{N3}}, \quad a = k_{N4} + k_{P2} + k_{P3}$$

$$C_X = C_{d_{\min}} \times k_{N2} + C_{d_{\min}} \times k_{N3} = C_{d_{\min}}(k_{N2} + k_{N3})$$

$$C_{\text{out}} = C_{d_{N4}} + C_{d_{P2}} + C_{d_{P3}} + C_{d_{\min}} \times k_{N2} = a \times C_{d_{\min}} + C_{d_{\min}} \times k_{N2} = C_{d_{\min}}(a + k_{N2}) \tag{1}$$

where $W_{\min}$ is the minimum transistor width, $R_{\min}$ is the ON-state resistance of the nMOS transistor with width $W_{\min}$, $C_{d_{\min}}$ is the diffusion capacitance of a transistor with width $W_{\min}$, and $a$ is the total size of the transistors P2, P3, and N4. The Elmore delay [25] $(T_{d_{AB=10\rightarrow11}})$ of Fig. 2(c) and (d) is equal to

$$T_{d_{AB=10\to 11}} = C_{\text{out}} \left( \frac{R_{\min}}{k_{N2}} + \frac{R_{\min}}{k_{N3}} \right) + C_X \left( \frac{R_{\min}}{k_{N3}} \right) = C_{d_{\min}} R_{\min} \left[ a \left( \frac{1}{k_{N2}} + \frac{1}{k_{N3}} \right) + 2 \left( 1 + \frac{k_{N2}}{k_{N3}} \right) \right] \tag{2}$$

Now, the average dynamic power dissipation (for the condition that the inputs change from AB = 10 to AB = 11) can be written as [2]

$$P_{AB=10\to 11} = C_{\text{total}} V_{\text{DD}}^{2} = \left( C_{d_{\min}} (k_{N2} + k_{N3}) + C_{d_{\min}} (a + k_{N2}) + k_{N3} C_{g_{\min}} + k_{P2} C_{d_{\min}} + k_{P3} C_{g_{\min}} + k_{N4} C_{d_{\min}} \right) V_{\text{DD}}^{2} \tag{3}$$

where $C_{g_{\min}}$ is the gate capacitance of a transistor with width $W_{\min}$, and $C_{\text{total}}$ is the sum of all capacitances that are switched.

Fig. 3. Normalized PDP with a = 3 for $1 \le k_{N2}, k_{N3} \le 4$.

Fig. 4. Circuit layout of proposed XOR/XNOR. (a) Circuit layout of proposed XOR. (b) Circuit layout of proposed XNOR.
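As a numerical illustration of Eqs. (1)-(3), the following Python sketch evaluates the Elmore delay both term by term and from the closed form of Eq. (2), and then scans the product of the delay and the switched capacitance over the range $1 \le k_{N2}, k_{N3} \le 4$ with $a = 3$. The unit values `C = R_MIN = 1`, the simplification $C_{d_{\min}} = C_{g_{\min}}$, the integer grid, and all function names are assumptions made here for illustration; they are not taken from the simulations.

```python
from itertools import product
from math import isclose

C = 1.0      # assumed unit capacitance, with C_d_min = C_g_min = C
R_MIN = 1.0  # assumed unit ON resistance of a minimum-width nMOS


def elmore_delay_terms(k_n2, k_n3, a):
    """Delay for AB = 10 -> 11 built from the RC terms of Eq. (1)."""
    r_n2, r_n3 = R_MIN / k_n2, R_MIN / k_n3
    c_x = C * (k_n2 + k_n3)
    c_out = C * (a + k_n2)
    return c_out * (r_n2 + r_n3) + c_x * r_n3


def elmore_delay_closed(k_n2, k_n3, a):
    """Closed form of Eq. (2)."""
    return C * R_MIN * (a * (1 / k_n2 + 1 / k_n3) + 2 * (1 + k_n2 / k_n3))


def switched_capacitance(k_n2, k_n3, a):
    """Total switched capacitance of Eq. (3), with a = k_N4 + k_P2 + k_P3."""
    return C * (k_n2 + k_n3) + C * (a + k_n2) + C * k_n3 + C * a


def normalized_pdp(k_n2, k_n3, a=3):
    """Delay times power, normalized by C*R_MIN and C*V_DD^2 (V_DD cancels)."""
    delay = elmore_delay_closed(k_n2, k_n3, a) / (C * R_MIN)
    return delay * switched_capacitance(k_n2, k_n3, a) / C


def best_integer_sizing(a=3, k_max=4):
    """Integer grid point (k_N2, k_N3) with the smallest normalized PDP."""
    grid = product(range(1, k_max + 1), repeat=2)
    return min(grid, key=lambda k: normalized_pdp(k[0], k[1], a))


if __name__ == "__main__":
    assert isclose(elmore_delay_terms(2, 3, 3), elmore_delay_closed(2, 3, 3))
    k_n2, k_n3 = best_integer_sizing()
    print(k_n2, k_n3, normalized_pdp(k_n2, k_n3))
```

On this coarse integer grid, the smallest value occurs at $k_{N2} = 1$, $k_{N3} = 2$, that is, with N3 wider than N2; a finer grid or different capacitance ratios may shift the exact location.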
By assuming $C_{d_{\min}} \approx C_{g_{\min}} = C$ and a = 3 (the sizes of transistors P2, P3, and N4 are equal to $W_{\min}$), we obtain

$$P_{AB=10\to 11} = ((k_{N2} + k_{N3})C + (3 + k_{N2})C + k_{N3}C + 3C)V_{DD}^{2} = CV_{DD}^{2}(2k_{N2} + 2k_{N3} + 6). \tag{4}$$

Finally, with the values of delay and power dissipation, the PDP of the circuit can be obtained. For a better comparison, the normalized PDP (PDP<sub>n</sub>) is considered

$$PDP_{n} = \frac{T_{d_{AB=10\to11}} \times P_{AB=10\to11}}{CR_{\min} \times CV_{DD}^{2}} = \left[3\left(\frac{1}{k_{N2}} + \frac{1}{k_{N3}}\right) + 2\left(1 + \frac{k_{N2}}{k_{N3}}\right)\right] (2k_{N2} + 2k_{N3} + 6). \tag{5}$$

Fig. 3 shows the value of the normalized PDP with a = 3 for $1 \le k_{N2}, k_{N3} \le 4$. Fig. 3 also shows that, in the optimal condition, the value of $k_{N3}$ is larger than that of $k_{N2}$. Therefore, the W/L ratio of the transistor N3 is larger than that of the transistor N2. Thus, the input capacitance of transistor N3 is higher than that of transistor N2 and, to obtain the optimal circuit, it is better to connect input A to the transistor N2.

The advantages of the proposed XOR/XNOR circuits are full-swing output, good driving capability, a smaller number of interconnecting wires, and a straightforward circuit layout. Fig. 4(a) and (b) show the circuit layouts of the proposed XOR and XNOR gates, respectively, designed for minimum power consumption [26].

#### TABLE I SIMULATION RESULTS (OPTIMUM SIZE OF TRANSISTORS IN nm, POWER IN µW, DELAY IN ps, AND PDP IN aJ) FOR XOR/XNOR AND SIMULTANEOUS XOR-XNOR CIRCUITS IN 65-nm TECHNOLOGY WITH 1.2-V POWER SUPPLY VOLTAGE AT 1 GHz

| Design | Output | N1 P1 N2 P2 | N3 | P3 | N4 | P4 | N5 | P5 | N6 | P6 | Delay | Power | PDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Fig. 1(a) [16] | XOR | 130 610 180 130 | 130 | 130 1 | 130 | 262 | | | | | 26.1 | 2.48 | 64.7 |
| Fig. 1(a) [16] | XNOR | 195 130 130 640 | | | | | 130 | 130 | 155 | 240 | 25.8 | 2.50 | 64.5 |
| Fig. 1(b) [11] | XOR | 342 130 130 190 | 166 2 | 250 | | | | | | | 23.6 | 2.14 | 50.5 |
| Fig. 1(b) [11] | XNOR | 130 793 | | 1 | 130 | 130 | 130 | 456 | | | 25.6 | 2.47 | 63.2 |
| Fig. 2(b)\* | XOR | 130 130 330 245 | 170 3 | 344 1 | 130 | | | | | | 21.9 | 2.22 | 48.6 |
| Fig. 2(b)\* | XNOR | 130 130 204 732 | 130 5 | 578 | | 130 | | | | | 21.5 | 2.46 | 52.9 |
| Fig. 1(a) [16]\*\* | XOR-XNOR | 223 588 191 561 | 130 | 130 1 | 130 | 130 | 130 | 130 | 130 | 130 | 33.6 | 4.30 | 144.5 |
| Fig. 1(b) [11]\*\* | XOR-XNOR | 514 876 130 130 | 130 2 | 205 1 | 130 | 130 | 130 | 527 | | | 29 | 4.50 | 130.5 |
| Fig. 1(c) [16] | XOR-XNOR | 362 720 403 709 | 249 | 130 3 | 357 | 130 | 357 | | 273 | | 39.6 | 5.43 | 215.2 |
| Fig. 1(d) [3] | XOR-XNOR | 41 154 | 130 | 178 1 | 130 | | 430 | | | | 62.7 | 5.31 | 332.9 |
| Fig. 1(e) [13] | XOR-XNOR | 190 404 190 404 | 138 4 | 167 | | | | | | | 157.2 | 4.89 | 768.7 |
| Fig. 1(f) [18] | XOR-XNOR | 130 273 187 309 | 130 1 | 130 1 | 130 | 677 | 373 | 405 | | | 38.6 | 4.71 | 181.8 |
| Fig. 1(g) [23] | XOR-XNOR | 281 999 375 130 | 130 4 | 126 1 | 130 | 130 | 130 | 506 | | | 36.0 | 5.25 | 189.0 |
| Fig. 2(e)\* | XOR-XNOR | 130 183 114 577 | 130 3 | 373 1 | 130 | 130 | 130 | 258 | 242 | 232 | 26.4 | 4.14 | 109.3 |

\*\*These two simultaneous XOR-XNOR gates are obtained by combining the XOR and XNOR circuits of Fig. 1(a) and of Fig. 1(b), respectively.

# B. Proposed XOR-XNOR Circuit

Fig. 2(e) shows the proposed structure of the simultaneous XOR-XNOR gate consisting of 12 transistors. This structure is obtained by combining the two proposed XOR and XNOR circuits of Fig. 2(b).
If the inputs of this circuit are connected as described in Section III-A, the capacitances of inputs A and B are not equal (the inputs A and B are connected to the same number of transistors). Thus, to equalize the input capacitances, the inputs are connected to the circuit as shown in Fig. 2(e). In this case, the input capacitances are approximately equal and the power and delay are optimized. This structure does not have any NOT gates on the critical path and its output capacitance is very small. For this reason, it is very fast and consumes low power. The delays of the XOR and XNOR outputs of this circuit are almost identical, which reduces glitches in the next stage. Other advantages of this circuit are good driving capability, full-swing output, and robustness against transistor sizing and supply voltage scaling.

The proposed XOR/XNOR and simultaneous XOR-XNOR structures were compared with all the above-mentioned structures (Fig. 1). The simulation results in TSMC 65-nm technology with a 1.2-V power supply voltage $(V_{DD})$ are shown in Table I. The input pattern covers all possible input combinations [Fig. 5(a)]. The maximum frequency of the inputs was 1 GHz and a 4× unit-size inverter (FO4) was connected to the output as a load. The transistor sizes have been selected for optimum PDP by using the proposed transistor sizing method, which is described in Section IV. The optimum transistor sizes for each XOR/XNOR and XOR-XNOR circuit are listed in Table I. In the output rise and fall transitions, the delay is calculated from 50% of the input voltage level to 50% of the output voltage level. The PDP is calculated by multiplying the worst case delay by the average power consumption of the main circuit.

The results indicate that the performance of the proposed XOR/XNOR and simultaneous XOR-XNOR structures is better than that of the compared structures. The proposed XOR and XNOR circuits [Fig. 2(b)] have the lowest PDP and delay, respectively, compared with the other XOR/XNOR circuits. Also, the delays of these two proposed circuits are very close to each other, which prevents glitches in the next stage. The delay, power consumption, and PDP of the XOR and XNOR circuits of Fig. 1(a) are almost equal, because they have the same structure. As mentioned earlier and according to the obtained results, the XOR circuit of Fig. 1(b) performs better than its XNOR circuit.

The proposed circuit for simultaneous XOR-XNOR is more efficient in all three calculated parameters (delay, power dissipation, and PDP) than the other XOR-XNOR gates. The proposed XOR-XNOR circuit saves almost 16.2%-85.8% in PDP, and it is 9%-83.2% faster than the other circuits. The circuits of Fig. 1(d) and (e) have a very high delay due to their output feedback (which causes the slow response problem). As can be seen in Table I, the efficiency of Fig. 1(e) is much worse and its delay is about four times that of the other circuits. Table I indicates that the structures with the fewest NOT gates on the critical path and without feedback on the outputs to correct the output voltage level perform better.

To better evaluate the XOR-XNOR circuits, they are simulated at different power supply voltages from 0.6 to 1.5 V and also at different output loads from FO1 to FO16. The results of these two simulations are shown in Fig. 5(b) and (c). As seen in Fig. 5(b) and (c), the proposed XOR-XNOR circuit has the best performance in both simulations when compared with the other structures.

Fig. 5. Simulation results of XOR-XNOR circuits. (a) Time-domain simulation results (waveform) of the proposed XOR-XNOR. (b) Simulation results of XOR-XNOR circuits versus $V_{DD}$. (c) Simulation results of XOR-XNOR circuits versus output load.

# C. Proposed FAs

We propose six new FA circuits for various applications, which are shown in Fig. 6. Also, Fig. 7 shows the circuit layout of the proposed FA cell shown in Fig. 6(a). These new FAs employ the hybrid logic style, and all of them are designed by using the proposed XOR/XNOR or XOR-XNOR circuit. The well-known four-transistor 2-1-MUX structure [Fig. 8(a)] is used to implement the proposed hybrid FA cells. This 2-1-MUX is built in the TG logic style, which has no static or short-circuit power dissipation.

Fig. 6. Proposed six new hybrid FA circuits. (a) HFA-20T. (b) HFA-17T. (c) HFA-B-26T. (d) HFA-NB-26T. (e) HFA-22T. (f) HFA-19T.

Fig. 6(a) shows the circuit of the first proposed hybrid FA (HFA-20T), which is made of two 2-to-1 MUX gates and the XOR-XNOR gate of Fig. 2(e). HFA-20T has no high-power NOT gates on its critical path and consists of 20 transistors. The advantages of this structure are full-swing output, low power dissipation, very high speed, and robustness against supply voltage scaling and transistor sizing. If $A \odot B = 1$, then the output $C_{\text{out}}$ signal equals the input signal A or B. To equalize the input capacitances, both of the input signals A and B are used in the implementation and are connected to the transistors N9 and P10 [in Fig. 6(a)], respectively. The only problem of HFA-20T is the reduction of the output driving capability when it is used in chain structures, such as a ripple carry adder. Of course, this problem exists in all circuits that use the transmission function theory in their implementation without output buffering. Fig. 7 shows the circuit layout of the proposed HFA-20T, which is designed for minimum power consumption [26].

Fig. 7. Circuit layout of proposed HFA-20T.

One way to reduce the power consumption of the FA structures is to use an XOR/XNOR gate and a NOT gate to generate the other XOR or XNOR signal. The proposed hybrid FA cell (HFA-17T) shown in Fig. 6(b) is designed by using the XOR gate of Fig. 2(b). This structure is made of 17 transistors, three fewer than HFA-20T. The delay of HFA-17T is higher than that of HFA-20T due to the addition of a NOT gate on the critical path of HFA-17T (for making the XNOR signal from the XOR signal). It may be expected that the power consumption of HFA-17T is less than that of HFA-20T due to the reduction in the number of transistors. But the NOT gate on the critical path of the circuit increases the short-circuit power, so there is no significant reduction in the total power dissipation of HFA-17T. Also, the NOT gate slightly improves the output driving capability of the circuit.

As mentioned earlier, using a buffer on the output of a circuit is almost mandatory, especially in applications where the output capacitance of each stage is high. In practice, the driving capability of VLSI circuits is degraded by the parasitic capacitors and resistors created during fabrication, as well as by the increase of the threshold voltage of transistors over time, but an output buffer improves this situation. Fig. 6(c) presents the third proposed hybrid FA with buffers on the Sum and $C_{out}$ outputs (HFA-B-26T), which is made of 26 transistors. The critical path of HFA-B-26T contains an XOR-XNOR gate, one 2-1-MUX gate, and NOT gates. The output NOT gates are used to prevent the output nodes from being driven by the inputs of the circuit and also to reduce the resistance from the output node of the circuit to the sources ( $V_{\rm DD}$ and GND). The power consumption and delay of HFA-B-26T are higher than those of the HFA-20T and HFA-17T FAs.

Fig. 6(d) shows another proposed hybrid FA with new buffers (HFA-NB-26T), in which the buffers are placed at the data inputs of the 2-1-MUX gates instead of at the outputs.
If the input signals A and C are produced by buffers, then for all possible input combinations, the Sum and $C_{\rm out}$ outputs are not driven by the inputs of the circuit. Three additional NOT gates are enough for this, because the $\overline{A}$ signal already exists and the buffered A signal can be made from it with one extra NOT gate. So the HFA-NB-26T FA circuit is made of 26 transistors. The data input nodes of the 2-1-MUXs reach their final values (GND or $V_{DD}$) before the XOR and XNOR signals are produced. Thus, the critical path of HFA-NB-26T consists of an XOR-XNOR gate and a 2-1-MUX gate, and its delay is reduced compared with HFA-B-26T. The driving capability of HFA-NB-26T is slightly lower than that of HFA-B-26T because of the 2-1-MUX gate between the buffer and the output node [which increases the resistance from the output node to the sources ( $V_{\rm DD}$ and GND)].

Fig. 8. (a) 2-1-MUX. (b) and (c) Simulation test bench to carry out the circuit parameters.

The circuits of HFA-20T and HFA-17T have been designed to use as few transistors as possible. To produce the Sum output, only the XOR, XNOR, and C signals are used, so no additional NOT gate is needed to generate the $\overline{C}$ signal. If, instead, the $\overline{C}$ signal is also used to produce the Sum output, then the XOR and XNOR signals do not drive the Sum output through the TG multiplexer but are connected only to the select lines of the 2-1-MUX. So the capacitance of the XOR and XNOR nodes becomes smaller, and the delay of the circuit is improved. The circuits of Fig. 6(e) and (f) (named HFA-22T and HFA-19T, respectively) have been created by applying this idea to HFA-20T and HFA-17T, respectively. It is expected that the power consumption and delay of the HFA-22T and HFA-19T FA circuits are less than those of HFA-20T and HFA-17T, respectively (despite having two more transistors), due to the lower capacitance of the XOR and XNOR nodes.
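The two ways of forming the Sum output described above can be checked with a small behavioral model. The Python sketch below is purely logical: it ignores transistor-level effects such as delay, voltage swing, and drive strength. The function names and the exact multiplexer wiring are assumptions inferred from the description, not a transcription of Fig. 6.

```python
from itertools import product


def mux2(select, in0, in1):
    """Ideal 2-to-1 multiplexer: returns in1 when select is 1, otherwise in0."""
    return in1 if select else in0


def xor_xnor(a, b):
    """Simultaneous XOR and XNOR of two bits."""
    x = a ^ b
    return x, 1 - x


def fa_c_as_select(a, b, c):
    """Sum formed from XOR, XNOR, and C only (assumed HFA-20T/HFA-17T style)."""
    x, xn = xor_xnor(a, b)
    total = mux2(c, x, xn)   # C passes XOR when C = 0 and XNOR when C = 1
    carry = mux2(xn, c, a)   # A XNOR B = 1: carry equals A; otherwise carry equals C
    return total, carry


def fa_xor_as_select(a, b, c):
    """Sum formed with C and its complement as data (assumed HFA-22T/HFA-19T style)."""
    x, xn = xor_xnor(a, b)
    total = mux2(x, c, 1 - c)  # XOR/XNOR are used only as select lines
    carry = mux2(xn, c, a)
    return total, carry


def reference(a, b, c):
    """Arithmetic definition of a 1-bit full adder: (sum bit, carry bit)."""
    s = a + b + c
    return s % 2, s // 2


if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        assert fa_c_as_select(*bits) == reference(*bits)
        assert fa_xor_as_select(*bits) == reference(*bits)
    print("both Sum formulations match a + b + c for all 8 input combinations")
```

Both formulations compute the same function; the difference discussed in the text lies in which nodes are loaded, which a purely logical model such as this one cannot capture.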
Also, by adding the $\overline{C}$ signal, the driving capability of HFA-22T and HFA-19T will be better than that of HFA-20T and HFA-17T, respectively.

#### IV. SIMULATION ENVIRONMENT

# A. Simulation Setup

All the circuits have been simulated using HSPICE in the 65-nm TSMC CMOS process technology with a 1.2-V supply, and the maximum frequency of the inputs was 1 GHz. Fig. 8(b) and (c) show the typical simulation test bench used to obtain the circuit parameters. There are two NOT gates on the input of the structure shown in Fig. 8(b), with two separate power supplies $(V_{DD_1}$ and $V_{DD_2})$. As can be seen in Fig. 8(b), the main circuit and the NOT gates connected to it have the same power supply $(V_{DD_1})$. By subtracting the power consumption of $V_{\rm DD_1}$ in Fig. 8(c) from the power consumption of $V_{\rm DD_1}$ in Fig. 8(b), the power consumption of the main circuit is obtained. The input pattern for both structures of Fig. 8(b) and (c) is exactly the same. With this method, the calculated power consumption of the main circuit is much more accurate, and the power consumed in all the input capacitances is also considered. An output load of FO4, which has a different power supply from the main circuit, is used for delay and power dissipation measurements. The sizes of the input buffers are selected as in [3] and [27]. In the output rise and fall transitions, the delay is calculated from 50% of the input voltage level to 50% of the output voltage level.

Fig. 9. Time-domain simulation results (waveform) of the proposed FA.

# TABLE II SIMULATION RESULTS (POWER IN µW, DELAY IN ps, PDP IN aJ, AND EDP IN 10<sup>-29</sup> J·s) FOR FA CIRCUITS IN 65-nm TECHNOLOGY WITH 1.2-V POWER SUPPLY VOLTAGE AT 1 GHz

| Design | Min. power: Power | Min. power: Delay | Min. power: PDP | Min. PDP: Power | Min. PDP: Delay | Min. PDP: PDP | Min. PDP: EDP | Improvement: PDP% | Improvement: EDP% |
|---|---|---|---|---|---|---|---|---|---|
| HFA-20T\* | 3.90 | 85.5 | 333.5 | 4.44 | 51.8 | 230 | 1191.4 | 12.9 | 31.3 |
| HFA-17T\* | 3.78 | 94.3 | 356.5 | 4.40 | 59 | 259.6 | 1531.6 | 1.7 | 11.7 |
| HFA-B-26T\* | 4.52 | 73.8 | 333.6 | 4.66 | 63.1 | 294 | 1855.4 | -11.3 | -6.9 |
| HFA-NB-26T\* | 4.28 | 82.7 | 354 | 4.52 | 57.7 | 260.8 | 1504.8 | 1.2 | 13.3 |
| HFA-22T\* | 4.08 | 59.1 | 241.1 | 4.17 | 48.5 | 202.2 | 980.9 | 23.4 | 43.5 |
| HFA-19T\* | 3.96 | 74.3 | 294.2 | 4.11 | 59.4 | 244.1 | 1450.2 | 7.6 | 16.4 |
| CMOS [11] | 3.98 | 119.2 | 474.4 | 4.25 | 95.4 | 405.4 | 3868 | -53.5 | -122.9 |
| M-CMOS [6] | 3.93 | 103.7 | 407.5 | 4.08 | 92.7 | 378.2 | 3506 | -43.2 | -102.1 |
| CPL [8] | 6.88 | 63.7 | 438.3 | 7.04 | 60.9 | 428.7 | 2611 | -62.3 | -50.5 |
| New-14T [13] | 3.61 | 212 | 765.3 | 4.33 | 142.7 | 617.9 | 8817.3 | -134 | -408.1 |
| 16T [15] | 3.52 | 90.7 | 319.3 | 4.02 | 65.7 | 264.1 | 1735.2 | 0 | 0 |
| DPL [16] | 4.89 | 98.8 | 483.1 | 5.32 | 66.3 | 352.7 | 2338.5 | -33.5 | -34.8 |
| Hybrid-FA [12] | 3.71 | 116.8 | 433.3 | 4.5 | 64.1 | 288.4 | 1849 | -9.2 | -6.6 |
| SR-CPL [16] | 4.78 | 88.3 | 422.1 | 5.01 | 69.4 | 347.7 | 2413 | -31.7 | -39.1 |
| TFA [10] | 3.81 | 93.8 | 357.4 | 4.21 | 66.7 | 280.8 | 1873 | -6.3 | -7.9 |
| TGA [11] | 4.23 | 96.8 | 409.5 | 4.48 | 65.8 | 294.8 | 1939.7 | -11.6 | -11.8 |
| HPSC [18] | 4.60 | 89.2 | 410.3 | 4.82 | 79.9 | 385.1 | 3077.1 | -45.8 | -77.3 |
| New-HPSC [3] | 4.97 | 111.5 | 554.2 | 5.02 | 95 | 476.9 | 4530.6 | -80.6 | -161.1 |

\*Proposed design.
The PDP is calculated by multiplying the worst case delay by the average power consumption of the main circuit. Fig. 9 shows the time-domain simulation results (waveform) of the proposed FA. The performance of the FA circuits is evaluated in terms of power consumption, worst case delay, and PDP for a range of supply voltages (from 0.65 to 1.5 V) at 1-GHz frequency. Furthermore, their performance is evaluated while the output load is varied from FO4 to FO64 at the 1.2-V power supply voltage and 1-GHz frequency. The lowest power consumption of a circuit is achieved when the width of its transistors is as small as possible [2]. However, the lowest PDP cannot be guaranteed in this case, because the delay of the circuit is not at its optimum and increases the PDP. For a better analysis, the values of the delay, power consumption, and PDP are also presented in Table II for a minimum feature size $(W_{1,2,\dots,n} = W_{\min} = 4\lambda = 130 \text{ nm})$.

#### B. Transistor Sizing

Optimal implementation (lower PDP [14]) of arithmetic circuits in VLSI systems is very important. Optimization methods, such as choosing the optimal circuit structure for the intended purpose, choosing the appropriate logic style, and transistor sizing, have been utilized to improve the performance of circuits. Transistor sizing, which consists of reducing or enlarging the width of transistors, is an effective and powerful tool for optimizing VLSI circuits and should be used in the design process of high-performance circuits. With the advancement of technology and reduced transistor sizes, the behavior and performance of a circuit cannot be investigated without transistor sizing, since a small change in the size of transistors may considerably change the performance of the circuit. Therefore, an appropriate transistor sizing method must be chosen before the important parameters of the circuit are obtained.
There are several methods for transistor sizing [9], [18].

1) Review on Transistor Sizing Methods: Shams et al. [9] present a method for transistor sizing. Since this method is very simple and fast, the simulation time for optimizing the circuit is much reduced. The main problem of this method is that only the transistors on the critical path are considered, whereas in a circuit, all transistors (OFF or ON) influence the critical path delay, because all of them affect the power dissipation and node capacitances of the circuit (and therefore the PDP). A more appropriate method is to consider all transistors of the circuit at the same time, even if they are OFF. Another problem of this method is that it tries to reduce the delay rather than the PDP of the circuit, while our main goal is to reduce the PDP. In [18], another method for transistor sizing is proposed, which is almost the easiest method, and its performance is better than that of the previous method. The method of [18], like the method of [9], does not consider all transistors of the circuit and the dependences between them at the same time. Also, the final result is highly dependent on the initial size of the transistors. In short, these methods trade precision for reduced simulation time.

2) Proposed Method for Transistor Sizing: The above-mentioned methods do not consider the dependence between transistors and, therefore, do not have good accuracy. There are different ways to optimize a function, one of which is particle swarm optimization (PSO) [28]–[30]. In computer science, the PSO algorithm is a numerical computation method that optimizes a function iteratively. It tries to suggest a better solution by taking the already obtained values of the function into account.
It solves a problem by keeping a set of candidate solutions, known as particles, and moving these particles toward the best known position in the search space according to the simple mathematical formulas in (7). If a better position is found in the search space by other particles, it becomes the new best known position. Eventually, the swarm moves toward the best solution. The PSO algorithm is becoming popular due to its simplicity of implementation and its ability to rapidly converge to a good solution.

Method: Proposed Method for Transistor Sizing

- 1: Initialize width $\vec{\mathbf{W}}_{m \times n}^{(1)}$ $\triangleright$ $n$ is the number of the circuit's transistors, and $m$ is the number of particles.
- 2: $i = 1$
- 3: **do**
- 4: Simulate the circuit with $\vec{\mathbf{W}}^{(i)}$ and compute the PDP of the circuit
- 5: $(\boldsymbol{\Theta}_{\min}, \vec{\mathbf{W}}_{\boldsymbol{\Theta}_{\min}, 1\times n}, \vec{\mathbf{W}}_{m\times n}^{(i+1)}) = \mathbf{PSO}(\vec{\mathbf{W}}^{(i)}, \vec{\boldsymbol{\Theta}}^{(i)})$
- 6: $i = i + 1$
- 7: **while** $(i \le t)$ $\triangleright$ $t$ is the number of desired iterations.
- 8: Outputs $\vec{\mathbf{W}}_{\boldsymbol{\Theta}_{\min}}$ and $\boldsymbol{\Theta}_{\min}$

The position vector and the velocity vector of the $i$th iteration for $m$ particles in the $n$-dimensional search space can be represented as $\vec{\mathbf{X}}_{m \times n}^{(i)}$ and $\vec{\mathbf{V}}_{m \times n}^{(i)}$, respectively

$$\begin{aligned}
\vec{\mathbf{X}}_{m\times n}^{(i)} &= \begin{bmatrix} \vec{\mathbf{x}_1}_{1\times n}^{(i)^T} & \vec{\mathbf{x}_2}_{1\times n}^{(i)^T} & \cdots & \vec{\mathbf{x}_m}_{1\times n}^{(i)^T} \end{bmatrix}^T \\
\vec{\mathbf{V}}_{m\times n}^{(i)} &= \begin{bmatrix} \vec{\mathbf{v}_1}_{1\times n}^{(i)^T} & \vec{\mathbf{v}_2}_{1\times n}^{(i)^T} & \cdots & \vec{\mathbf{v}_m}_{1\times n}^{(i)^T} \end{bmatrix}^T.
\end{aligned}$$ (6)

$\vec{\mathbf{x}}_{1\times n}^{(i)}$ is the position of one particle in the $i$th iteration.
The best position of particles in the $i$th iteration (which corresponds to the best fitness value obtained by that particle at the $i$th iteration) is $\vec{\mathbf{p}}_{1\times n}^{(i)}$, and the fittest particle found so far at the $i$th iteration is $\vec{\mathbf{g}}_{1\times n}^{(i)}$. Therefore, the new positions and velocities of the particles for the next evaluation are calculated by the following equations [28], [30]:

$$\vec{\mathbf{V}}_{m \times n}^{(i)} = w \cdot \vec{\mathbf{V}}_{m \times n}^{(i-1)} + c_1 \cdot \text{rand}_1 \cdot (\vec{\mathbf{P}}_{m \times n}^{(i-1)} - \vec{\mathbf{X}}_{m \times n}^{(i-1)}) + c_2 \cdot \text{rand}_2 \cdot (\vec{\mathbf{G}}_{m \times n}^{(i-1)} - \vec{\mathbf{X}}_{m \times n}^{(i-1)})$$ (7a)

$$\vec{\mathbf{X}}_{m \times n}^{(i)} = \vec{\mathbf{X}}_{m \times n}^{(i-1)} + \vec{\mathbf{V}}_{m \times n}^{(i)}$$ (7b)

where $c_1$ and $c_2$ are two positive numbers, $\text{rand}_1$ and $\text{rand}_2$ are two separately generated and uniformly distributed random numbers in the range [0, 1], and $\vec{\mathbf{P}}_{m\times n}^{(i)}$ and $\vec{\mathbf{G}}_{m\times n}^{(i)}$ are as follows:

$$\begin{aligned}
\vec{\mathbf{P}}_{m\times n}^{(i)} &= \begin{bmatrix} \vec{\mathbf{p}}_{1\times n}^{(i)^T} & \vec{\mathbf{p}}_{1\times n}^{(i)^T} & \cdots & \vec{\mathbf{p}}_{1\times n}^{(i)^T} \end{bmatrix}^T \\
\vec{\mathbf{G}}_{m\times n}^{(i)} &= \begin{bmatrix} \vec{\mathbf{g}}_{1\times n}^{(i)^T} & \vec{\mathbf{g}}_{1\times n}^{(i)^T} & \cdots & \vec{\mathbf{g}}_{1\times n}^{(i)^T} \end{bmatrix}^T.
\end{aligned}$$ (8)

Also, $w$ (an inertia weight) can be a positive constant or even a positive linear or nonlinear function of time, and it balances the global search against the local search [31].
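The update rules (7a) and (7b) can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the authors' implementation: `surrogate_pdp` is a hypothetical toy cost that stands in for the HSPICE simulation, widths are expressed in multiples of $W_{\min}$, positions are clamped to a hypothetical range `[w_min, w_max]`, the values of `w`, `c1`, and `c2` are arbitrary, and the personal-best term uses the textbook per-particle best, which may differ from how (8) stacks $\vec{\mathbf{p}}$.

```python
import random


def surrogate_pdp(widths):
    """Hypothetical toy cost standing in for a circuit simulation.

    Power is assumed to grow with total width and delay to shrink with it,
    so the product has a nontrivial minimum. This is NOT a circuit model.
    """
    power = sum(widths)
    delay = sum(1.0 / w for w in widths)
    return power * delay


def pso_transistor_sizing(n, m, t, w_min=1.0, w_max=8.0,
                          w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimize surrogate_pdp over n widths with m particles and t iterations."""
    rng = random.Random(seed)
    # X: particle positions (candidate width vectors), V: velocities.
    X = [[rng.uniform(w_min, w_max) for _ in range(n)] for _ in range(m)]
    V = [[0.0] * n for _ in range(m)]
    # P: best position seen by each particle; g: best position seen by the swarm.
    P = [row[:] for row in X]
    P_cost = [surrogate_pdp(row) for row in X]
    best = min(range(m), key=lambda k: P_cost[k])
    g, g_cost = P[best][:], P_cost[best]

    for _ in range(t):
        # rand_1 and rand_2: two separately drawn uniform numbers in [0, 1].
        r1, r2 = rng.random(), rng.random()
        for k in range(m):
            for j in range(n):
                # Velocity update, as in (7a).
                V[k][j] = (w * V[k][j]
                           + c1 * r1 * (P[k][j] - X[k][j])
                           + c2 * r2 * (g[j] - X[k][j]))
                # Position update, as in (7b), clamped to the allowed widths.
                X[k][j] = min(max(X[k][j] + V[k][j], w_min), w_max)
        # Evaluate all particles, then refresh the personal and global bests.
        for k in range(m):
            cost = surrogate_pdp(X[k])
            if cost < P_cost[k]:
                P[k], P_cost[k] = X[k][:], cost
        best = min(range(m), key=lambda k: P_cost[k])
        if P_cost[best] < g_cost:
            g, g_cost = P[best][:], P_cost[best]
    return g, g_cost


if __name__ == "__main__":
    widths, cost = pso_transistor_sizing(n=4, m=10, t=50)
    print(widths, cost)
```

In a real flow, the call to `surrogate_pdp` would be replaced by a circuit simulation that returns the measured PDP for the given width vector; the rest of the loop would stay the same.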
In the proposed method, the initial width $(\vec{\mathbf{W}})$, the number of particles ($m$), and the number of iterations ($t$) strongly affect the final result and should be chosen carefully. After the optimum transistor sizes for the intended circuit are obtained, the simulations can be run and their results analyzed.

# C. Feasible Environment Test Setup

To investigate the performance of FA cells, they are used in a feasible structure [9], as shown in Fig. 10. It emulates circuits such as binary adders and regular multipliers and is made of $n$ cascaded FA (n-CFA) cells. The inputs are driven from buffers, and the outputs are loaded with FO4 inverters. The delay for this structure is measured from the input signals of the first FA to the output signals of the last FA. The power consumption value is the average power of all n-CFA cells, and it does not include the power dissipation of the input and output buffers. To evaluate the performance of the FAs more accurately, all 56 possible input transitions are applied to the circuit. Of course, in this structure, all cells except the first FA cell see only 12 input transitions. Simulation is performed at 100-MHz input frequency, 1.2-V supply voltage, and room temperature.

Fig. 10. n-CFA circuit.

# D. PVT Variation Test Setup

PVT variations cause changes in the parameters of the transistor, such as threshold voltage, capacitances, and ON- and OFF-state currents. These changes affect the performance of the circuit. Therefore, the robustness of different FA cells should be investigated in unfavorable and unknown conditions. For this purpose, Monte Carlo transient analysis for W, L, and $V_{th}$ variations, as well as process corner simulation, is performed. The distribution of the W and L dimensions is assumed to be Gaussian, and a standard deviation of $\pm 10$ nm and $\pm 5$ nm from nominal values is considered.
To get sufficiently accurate results, we have performed N = 1000 simulations for each condition and evaluated the delay, power dissipation, and PDP of the FA cells.

# E. Noise Immunity Test Setup

Noise in VLSI circuits is defined as any disturbance that changes the voltage of circuit nodes from their nominal value. Noise sources that have a significant impact on the performance of digital circuits include crosstalk, alpha particles, skin effect, electron migration, electromagnetic interference, IR drop, charge sharing, charge leakage, ground bounce, power supply noise, and so on [2]. Digital circuits inherently have low sensitivity to noise, and they filter out noise pulses of high amplitude if the pulses are sufficiently narrow. The noise immunity curve (NIC) [32] is used to measure the noise-tolerance performance of FAs. It is the locus of points (T, V), where T and V are the noise pulsewidth and the noise pulse amplitude, respectively. Each point on the NIC indicates that if a noise pulse with (T, V) (or higher amplitude) is applied to the input of a digital gate, then it will cause a logic error at the output. To get a numerical value of noise immunity, a metric called average noise threshold energy (ANTE) [33], derived from the NIC, is used. It is equal to the energy of the noise pulse and is obtained as

$$\mathrm{ANTE} = E(V^2 \cdot T) \tag{9}$$

where $E(\cdot)$ denotes the expectation operator. A higher ANTE value for a digital circuit indicates higher immunity to input noise pulses. The NIC and ANTE metrics are widely used for the evaluation of noise immunity in [3], [27], and [33]. The ANTE metric of a given structure is reduced when the size of its transistors is scaled, and structures that have good speed and performance yield a small ANTE. Therefore, the ANTE metric cannot be used to compare different structures accurately.

Fig. 11. Simulation results of FAs versus $V_{DD}$. (a) Delay. (b) Average power consumption. (c) PDP.
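As a rough numerical illustration of (9), the sketch below estimates ANTE from a sampled NIC. It assumes the expectation is approximated by the arithmetic mean over the sampled points; the function name `ante` and the sample values are hypothetical and are not taken from the paper's simulations.

```python
def ante(nic_points):
    """Estimate ANTE = E(V^2 * T) from sampled NIC points.

    nic_points: list of (T, V) pairs, with T the noise pulsewidth and V the
    noise pulse amplitude. The expectation is approximated by a plain average
    over the samples, which is an assumption of this sketch.
    """
    if not nic_points:
        raise ValueError("at least one NIC point is required")
    return sum(v * v * t for t, v in nic_points) / len(nic_points)


if __name__ == "__main__":
    # Hypothetical NIC samples: (pulsewidth, amplitude) in arbitrary units.
    samples = [(1.0, 2.0), (2.0, 1.0)]
    print(ante(samples))
```

The units of the result follow from the units chosen for T and V, so values are only comparable between circuits characterized with the same sampling of the NIC.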
However, the value of ANTE for a circuit still reveals useful information about its input noise immunity. Another metric, which can be used to compare structures and does not change when the size of transistors is scaled, is the ANTE normalized by the PDP of the circuit (NANTE) [33]

$$\mathrm{NANTE} = \frac{\mathrm{ANTE}}{\mathrm{PDP}}.$$ (10)

A threshold level of 50% of the output swing is used.

# V. SIMULATIONS AND PERFORMANCE ANALYSIS

In this section, the simulation results are discussed, and the performance of the various designs is compared. In all simulations, the size of transistors is chosen such that the minimum PDP is achieved for the circuit. The proposed method for transistor sizing is used for this purpose. Table II shows the simulation results of the various FA circuits. In Table II, we have reported the average power consumption, critical path delay, PDP, and energy-delay product (EDP) for the different designs. Also, for better comparison, the PDP and EDP improvements of the designs compared with the 16T structure are given.

We first discuss the results for the minimum power conditions (MPCs), labeled minimum power in Table II. For a specified input frequency, output load, and supply voltage, the minimum power consumption of a circuit depends on its structure and number of transistors ($n$), with $W_{1,2,\dots,n} = W_{\min}$. The 16T circuit has the lowest power compared with the other circuits. This circuit produces full-swing $C_{\text{out}}$ and Sum outputs despite having nonfull-swing XOR-XNOR signals. The CPL FA cell has the highest power because of its high number of transistors compared with the other designs. However, it has good speed and good driving capability. In the MPC, the proposed HFA-22T saves about 24%, 32%, 40%, 41%, 42%, and 50% of PDP compared with 16T, TFA, M-CMOS, TGA, New-HPSC, and DPL, respectively.
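The derived columns of Table II can be reproduced from the power and delay columns. The sketch below is a plain unit-conversion exercise that uses the minimum-PDP entries of HFA-22T and 16T from Table II; the function names are hypothetical, and the improvement is taken relative to the 16T reference, as stated above.

```python
def pdp_aj(power_uw, delay_ps):
    """PDP in aJ: 1 uW * 1 ps = 1e-6 W * 1e-12 s = 1e-18 J."""
    return power_uw * delay_ps


def edp_e29_js(power_uw, delay_ps):
    """EDP in units of 1e-29 J*s: (aJ * ps) = 1e-30 J*s, hence the factor 1/10."""
    return pdp_aj(power_uw, delay_ps) * delay_ps / 10.0


def improvement_pct(reference, value):
    """Relative saving of `value` with respect to `reference`, in percent."""
    return (reference - value) / reference * 100.0


if __name__ == "__main__":
    # Minimum-PDP columns of Table II: (power in uW, delay in ps).
    hfa_22t = (4.17, 48.5)
    ref_16t = (4.02, 65.7)
    print(pdp_aj(*hfa_22t), edp_e29_js(*hfa_22t))
    print(improvement_pct(pdp_aj(*ref_16t), pdp_aj(*hfa_22t)))
    print(improvement_pct(edp_e29_js(*ref_16t), edp_e29_js(*hfa_22t)))
```

Small differences from the tabulated values are expected, because the table entries are rounded.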
Comparing the results obtained for the MPC and the minimum PDP conditions (MPDPCs) makes the efficiency of the transistor sizing method, which is used to improve the performance of the circuits, apparent. Between the MPC and the MPDPC, the maximum improvement in PDP is achieved for the Hybrid-FA circuit and is equal to 33%. The CPL FA circuit shows a 2% improvement in PDP, which is lower than that of the other structures. Generally, for the CPL logic style, the sizes of transistors in the MPC and the MPDPC are very close to each other [26].

In the following, we discuss the simulation results for the MPDPC. The proposed FAs have superior speed, PDP, and EDP against other FA designs. The 16T circuit consumes less power than the other FA cells. It also shows better PDP and EDP than the other circuits, except for the FAs presented in this paper. The proposed HFA-22T circuit has the best delay, PDP, and EDP among the FA cells. The structures of HFA-B-26T, HFA-NB-26T, CMOS, M-CMOS, CPL, HPSC, and New-HPSC have buffers at their outputs. The proposed HFA-NB-26T circuit saves PDP by up to 35%, 31%, 39%, 32%, and 45% compared with CMOS, M-CMOS, CPL, HPSC, and New-HPSC, respectively.

# A. Performance Analysis Against V<sub>DD</sub>

The delay, power, and PDP of the FA cells for supply voltages from 0.65 to 1.5 V are shown in Fig. 11(a)-(c), respectively. The nominal supply voltage for the 65-nm TSMC CMOS process technology is 1.2 V. The transistor sizes optimized at 1.2 V are used for the simulations at all supply voltages. The simulation results confirm that the proposed designs have superior speed, power, and PDP compared with other FA designs. The 14T, 16T, DPL, and New-HPSC FAs, due to the threshold voltage drop problem, can only work at and above 0.95, 0.75, 0.7, and 0.7 V, respectively. As the supply voltage decreases, the delay of the 14T increases faster than that of the other FAs.
For all supply voltages from 0.65 to 1.5 V, the proposed HFA-22T has the lowest delay and PDP compared with the other FA circuits. The simulation results show that all proposed FA cells can work reliably at supply voltages as low as 0.65 V. Despite its good speed, the CPL [8] adder circuit consumes much more power than the other FAs because of its dual-rail structure and high transistor count. Generally, as shown in Fig. 11(c), the 14T, New-HPSC, and 16T circuits do not perform well under supply voltage variations and are not recommended for use in VLSI circuits. The minimum PDP of the CMOS, M-CMOS, HFA-17T, HFA-NB-26T, HPSC, SR-CPL, TFA, and CPL has been achieved at the supply voltage of 1.05 V, and the minimum PDP of the DPL, Hybrid-FA, HFA-20T, HFA-19T, HFA-22T, HFA-B-26T, and TGA has been achieved at the supply voltage of 1.1 V.

TABLE III. SIMULATION RESULTS (POWER IN e-6 W, DELAY IN ps, AND PDP IN fJ) FOR FA CIRCUITS BY DIFFERENT WORD LENGTHS OF n-CFA IN 65-nm TECHNOLOGY WITH 1.2-V POWER SUPPLY VOLTAGE AT 100 MHz

| Design | 2-Bit Power | 2-Bit Delay | 2-Bit PDP | 4-Bit Power | 4-Bit Delay | 4-Bit PDP | 8-Bit Power | 8-Bit Delay | 8-Bit PDP | 16-Bit Power | 16-Bit Delay | 16-Bit PDP | 32-Bit Power | 32-Bit Delay | 32-Bit PDP | 64-Bit Power | 64-Bit Delay | 64-Bit PDP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HFA-20T* | 0.8215 | 195.4 | 0.1606 | 1.6155 | 829.2 | 1.3396 | 3.4852 | 3463.4 | 12.071 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| HFA-17T* | 0.8087 | 220.3 | 0.1782 | 1.6223 | 849.6 | 1.3783 | 3.7891 | 3528.2 | 13.368 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| HFA-B-26T* | 0.8508 | 156.9 | 0.1335 | 1.5864 | 289.9 | 0.4599 | 3.0586 | 571.6 | 1.7482 | 6.0028 | 1135.3 | 6.8150 | 11.891 | 2262.8 | 26.907 | 23.668 | 4517.7 | 106.92 |
| HFA-NB-26T* | 0.8582 | 147.1 | 0.1263 | 1.6379 | 291.7 | 0.4778 | 3.1975 | 576.6 | 1.8437 | 6.3167 | 1146.4 | 7.2415 | 12.555 | 2285.9 | 28.689 | 25.031 | 4565.1 | 114.27 |
| HFA-22T* | 0.7745 | 155.5 | 0.1205 | 1.4795 | 398.5 | 0.5895 | 2.9643 | 1423.3 | 4.219 | 6.7159 | 5528.7 | 37.13 | Failed | Failed | Failed | Failed | Failed | Failed |
| HFA-19T* | 0.7586 | 165.6 | 0.1257 | 1.4741 | 361.4 | 0.5327 | 3.0315 | 1231.7 | 3.7339 | 6.8326 | 4693.3 | 32.067 | Failed | Failed | Failed | Failed | Failed | Failed |
| CMOS [11] | 0.8035 | 231.6 | 0.1861 | 1.5173 | 483 | 0.7328 | 2.9451 | 985.6 | 2.9026 | 5.8006 | 1990.8 | 11.548 | 11.511 | 4002 | 46.067 | 22.931 | 8028.8 | 184.11 |
| M-CMOS [6] | 0.7481 | 217.4 | 0.1627 | 1.3944 | 452.8 | 0.6314 | 2.6872 | 923.5 | 2.4815 | 5.2731 | 1864.8 | 9.8332 | 10.444 | 3747.4 | 39.14 | 20.787 | 7520.6 | 156.33 |
| CPL [8] | 1.1873 | 269.9 | 0.3205 | 2.2857 | 934.2 | 2.1353 | 4.7655 | 3333.3 | 15.884 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| 16T [15] | 0.7221 | 343.6 | 0.2481 | 1.369 | 1405.5 | 1.9241 | 2.8345 | 5917.9 | 16.774 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| DPL [16] | 0.9335 | 335.3 | 0.313 | 1.9008 | 1580 | 3.0033 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| Hybrid-FA [12] | 0.794 | 204.3 | 0.1622 | 1.5667 | 844.8 | 1.3236 | 3.5833 | 3726 | 13.351 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| SR-CPL [16] | 0.9056 | 292.7 | 0.2651 | 1.7736 | 1291.1 | 2.2899 | 3.7451 | 5235.9 | 19.609 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| TFA [10] | 0.7884 | 240 | 0.1892 | 1.6014 | 789.3 | 1.2641 | 3.877 | 4029.5 | 15.622 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| TGA [11] | 0.8043 | 164.3 | 0.1322 | 1.5509 | 488.07 | 0.7569 | 3.3786 | 1485.6 | 5.0193 | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed | Failed |
| HPSC [18] | 0.9877 | 264 | 0.2607 | 1.9225 | 557.4 | 1.0716 | 3.7569 | 1204.8 | 4.5263 | 7.458 | 2500.5 | 18.543 | 14.733 | 5091.60 | 75.0130 | Failed | Failed | Failed |
| New-HPSC [3] | 0.9714 | 269.4 | 0.2617 | 1.9033 | 625.4 | 1.1903 | 3.7673 | 1338.3 | 5.0418 | 7.4951 | 2778.6 | 20.826 | 14.95 | 5672.9 | 84.809 | Failed | Failed | Failed |

\* Means proposed design.

Fig. 12. Simulation results of FAs versus load. (a) Delay. (b) Average power consumption. (c) PDP.

### B. Performance Analysis Against Output Load

In this section, we analyze the performance of the FA cells against output load variations ranging from FO4 to FO64. Fig. 12(a)–(c) shows the simulation results for the delay, power, and PDP of the FA circuits, respectively, at FO4, FO8, FO16, FO32, and FO64 output loads. At the load of FO64, the speed of the proposed HFA-B-26T FA is 39%, 41%, 15%, 14%, 10%, 5%, 25%, and 8% higher than that of the 16T, DPL, New-HPSC, M-CMOS, HFA-20T, HFA-22T, HFA-NB-26T, and CPL, respectively. The CPL and 16T consume the highest and the lowest power, respectively. The PDP of the CPL FA at the load of FO64 is equal to 20 fJ; however, for better illustration, the maximum value of the PDP on the vertical axis of Fig. 12(c) has been limited to 16 fJ. At loads of FO4, FO8, FO16, and FO32, the proposed HFA-22T cell has the lowest PDP, whereas the proposed HFA-B-26T has the lowest PDP at the load of FO64. Among the six proposed FA cells, the HFA-NB-26T and HFA-22T have the lowest and highest average PDP at various output loads, respectively.

Fig. 13. Six different modes for connecting two FA cells.

In conclusion, the proposed FA cells have superior speed, power, and PDP against other cells and are highly suitable for low-power and high-speed applications.

# C. Feasible Environment

In applications where the FA cells are used in cascaded stages, the output driving capability of the circuit is very important. To investigate the performance of the FA circuits in a larger structure (a real and feasible structure), all the considered FA cells are embedded in an n-CFA with a word length of n = 2, 4, 8, 16, 32, and 64 bits. In the simulation of the n-CFA, no buffers have been used at the intermediate cascaded stages. A circuit may perform well on its own but lose its advantages when placed in a larger structure. Therefore, to make the merits and demerits of a circuit clear, its performance must be analyzed under different conditions.

Fig. 14. Simulation results of FAs. (a) PDP of FAs for 2-CFA simulation in six different connection modes. (b) PDP of FAs versus W, L, and $V_{th}$ variations. (c) PDP of FAs in different process corners.

Fig. 15. Simulation results for noise immunity of the FAs. (a) NIC for the FAs. (b) ANTE for the FAs. (c) NANTE for the FAs.

Each FA cell has three inputs ($A$, $B$, and $C_{in}$) and two outputs (Sum and $C_{\text{out}}$). In most papers [9], [12], for the n-CFA simulation, the Sum output is connected to one of the inputs (usually A) of the next stage FA, and the $C_{\text{out}}$ output is connected to both remaining inputs of the next stage FA. However, each circuit has a different delay and power consumption for each input state, and this should be taken into account in the n-CFA simulation. Two FA cells can be connected to each other in six different modes, as shown in Fig. 13. Fig. 14(a) shows the PDP of the FAs obtained for the 2-CFA simulation in the six connection modes. The 14T FA circuit fails in four connection modes and works properly in only two. Fig. 14(a) shows that the PDP of the 2-CFA varies widely between connection modes. For example, in the 16T FA, the PDP of the SCC and CCS modes is 247.8 and 63.8 aJ, respectively, a difference of 3.9 times. Therefore, the input connection mode between two FA cells should be considered when the n-CFA is used to simulate FA circuits. Table III displays the results obtained in the n-CFA simulation for the various word lengths. Each FA cell is placed in the n-CFA structure such that [according to Fig. 14(a)] its PDP is maximized. For example, the 16T, CMOS, and DPL FA cells are connected in the SCC, SSC, and CCS modes, respectively. The 14T FA circuit does not operate properly in the n-CFA structure even for a low number of bits (n = 2, 4), and it is therefore omitted from Table III.
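The count of six connection modes can be checked by enumeration. The sketch below rests on an assumption inferred from the mode names used in the text (SCC, CCS, and SSC), since Fig. 13 is not reproduced here: each letter says whether the corresponding input of the next stage ($A$, $B$, $C_{in}$, in that order) is driven by Sum (S) or by $C_{\text{out}}$ (C), and both outputs must be used at least once. The function name is hypothetical.

```python
from itertools import product


def connection_modes():
    """Enumerate the ways to wire Sum (S) and Cout (C) to three next-stage inputs.

    Assumed naming: letter k names the driver of the k-th input (A, B, Cin).
    Modes that leave one of the two outputs unused (SSS, CCC) are excluded.
    """
    modes = []
    for combo in product("SC", repeat=3):
        if "S" in combo and "C" in combo:
            modes.append("".join(combo))
    return modes


if __name__ == "__main__":
    print(connection_modes())  # six modes, including SCC, SSC, and CCS
```

Under this reading, there are $2^3 - 2 = 6$ modes, which matches the six modes of Fig. 13.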
1) Results for 2-CFA: Despite having the lowest power consumption, the 16T FA cell does not have good speed and PDP (it has the worst delay). This FA also does not have suitable output driving capability because of the nonfull-swing XOR-XNOR gate in its structure. The proposed HFA-NB-26T cell has the highest speed compared with the other FA cells. This FA is 36.5%, 32.3%, 45.5%, 44.2%, and 45.4% faster than the CMOS, M-CMOS, CPL, HPSC, and New-HPSC, respectively. Although HFA-22T has no output buffer in its structure, it has the best performance in terms of PDP among all the evaluated FA circuits.

2) Results for 4-CFA: The proposed HFA-B-26T has the best performance in terms of speed and PDP compared with the other FAs. The results of the two proposed circuits HFA-B-26T and HFA-NB-26T are very close to each other. As the number of bits in the n-CFA increases, the difference in power consumption between these two structures increases, which is due to the use of a new buffer structure at the output. HFA-B-26T is 1.6, 3.2, 5.5, 1.9, and 2.2 times faster than the M-CMOS, CPL, DPL, HPSC, and New-HPSC, respectively.

3) Results for 8-CFA: The DPL FA cell is the first structure whose delay exceeds $1/f_{\max} = 10$ ns ($f_{\max} = 100$ MHz); it generates full-swing outputs only up to the 4-CFA. The proposed HFA-B-26T FA circuit has the best performance in terms of speed and PDP in the n-CFA simulation for word lengths of n = 4, 8, 16, 32, and 64 bits when compared with the other FAs. Notably, the HFA-19T and HFA-22T, which do not have an output buffer in their structures, perform better in the 8-CFA than the CPL, HPSC, and New-HPSC, which do have buffers on their outputs.

4) Results for 16-CFA: In this case (16-CFA), only eight FAs can produce full-swing outputs, and two of them (HFA-22T and HFA-19T) do not have buffers on their outputs.
These results indicate that the driving ability of HFA-22T and HFA-19T is very good, and they can be used in various applications. The HFA-B-26T offers about 41%, 31%, 63%, and 67% lower PDP compared with CMOS, M-CMOS, HPSC, and New-HPSC, respectively.

5) Results for 32-CFA: Only the HFA-B-26T, HFA-NB-26T, CMOS, M-CMOS, HPSC, and New-HPSC have a full-swing output for the 32-CFA at the 100-MHz input frequency. All six of these FAs have an output buffer. The M-CMOS consumes less power than the rest of the compared FAs in the n-CFA simulation for n = 8, 16, 32, and 64.

6) Results for 64-CFA: In the 64-CFA, only four FA cells (HFA-B-26T, HFA-NB-26T, CMOS, and M-CMOS) were able to operate correctly. Dividing the delay by the number of cells used (n = 64) gives an average delay per cell of 70.6, 71.3, 125.5, and 117.5 ps for the HFA-B-26T, HFA-NB-26T, CMOS, and M-CMOS, respectively. The results of Table III show that the structures without an output buffer, which are composed of CPL or TG logic styles, do not have good drive capability, and that cascading the stages of the circuit increases its delay dramatically.

# D. PVT Variation

Fig. 14(b) shows the simulation results for the FAs against W, L, and $V_{\rm th}$ variations of the transistors. All of these variations have been applied to the circuit at the same time, and the PDP of the circuits has been extracted. For better comparison, the normalized PDP is calculated. Fig. 14(b) shows that the 14T, 16T, and New-HPSC FA cells are very sensitive to process variations. For example, the PDP of the 14T changes from -10% to +14%. The rest of the FA circuits have lower sensitivity to process variations. Fig. 14(c) also shows the PDP of the FAs simulated in different process corners. In all corners, the proposed HFA-22T FA has the best performance compared with the rest of the FAs.

# E. Noise

The NIC, ANTE, and NANTE results for the FA circuits are shown in Fig. 15(a)–(c), respectively. As can be seen in Fig. 15(a), the CPL, M-CMOS, DPL, SR-CPL, and CMOS FAs have good noise immunity at small pulsewidths. All cells except the New-HPSC, HFA-B-26T, HFA-NB-26T, and TGA have suitable ANTE. The CPL has the highest ANTE because of its high input capacitance and because only nMOS transistors are used in its structure (an nMOS transistor does not pass a strong "1," so the noise pulse is not transmitted well). The proposed HFA-22T has the highest NANTE among all the compared FAs. The proposed FA circuits have very good immunity against input noise.

## VI. CONCLUSION

In this paper, we first evaluated the XOR/XNOR and XOR-XNOR circuits. The evaluation revealed that using NOT gates on the critical path of a circuit is a drawback. Another disadvantage is positive feedback on the outputs of the XOR-XNOR gate to compensate the output voltage level. This feedback increases the delay, the output capacitance, and, as a result, the energy consumption of the circuit. Then, we proposed new XOR/XNOR and XOR-XNOR gates that do not have these disadvantages. Finally, by using the proposed XOR and XOR-XNOR gates, we offered six new FA cells for various applications. Also, a modified method for transistor sizing in digital circuits was proposed. The new method utilizes the numerical computation PSO algorithm to select the appropriate size for the transistors of a circuit, and it has very good speed, accuracy, and convergence. After simulating the FA cells in different conditions, the results demonstrated that the proposed circuits have very good performance in all simulated conditions. Simulation results show that the proposed HFA-22T cell saves PDP and EDP by up to 23.4% and 43.5%, respectively, compared with its best counterpart. Also, this cell has better speed and energy at all supply voltages from 0.65 to 1.5 V when compared with the other FA cells. The proposed HFA-22T has superior speed and energy against other FA designs at all process corners.
All proposed FAs have normal sensitivity to PVT variations. ### REFERENCES - [1] N. S. Kim *et al.*, "Leakage current: Moore's law meets static power," *Computer*, vol. 36, no. 12, pp. 68–75, Dec. 2003. - [2] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 4th ed. Boston, MA, USA: Addison-Wesley, 2010. - [3] S. Goel, A. Kumar, and M. Bayoumi, "Design of robust, energy-efficient full adders for deep-submicrometer design using hybrid-CMOS logic style," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 14, no. 12, pp. 1309–1321, Dec. 2006. - [4] H. T. Bui, Y. Wang, and Y. Jiang, "Design and analysis of low-power 10-transistor full adders using novel XOR-XNOR gates," *IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.*, vol. 49, no. 1, pp. 25–30, Jan. 2002. - [5] S. Timarchi and K. Navi, "Arithmetic circuits of redundant SUT-RNS," IEEE Trans. Instrum. Meas., vol. 58, no. 9, pp. 2959–2968, Sep. 2009. - [6] J. M. Rabaey, A. P. Chandrakasan, and B. Nikolic, *Digital Integrated Circuits*, vol. 2. Englewood Cliffs, NJ, USA: Prentice-Hall, 2002. - [7] D. Radhakrishnan, "Low-voltage low-power CMOS full adder," *IEE Proc.-Circuits, Devices Syst.*, vol. 148, no. 1, pp. 19–24, Feb. 2001. - [8] K. Yano, A. Shimizu, T. Nishida, M. Saito, and K. Shimohigashi, "A 3.8-ns CMOS 16×16-b multiplier using complementary pass-transistor logic," *IEEE J. Solid-State Circuits*, vol. 25, no. 2, pp. 388–395, Apr. 1990. - [9] A. M. Shams, T. K. Darwish, and M. A. Bayoumi, "Performance analysis of low-power 1-bit CMOS full adder cells," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 10, no. 1, pp. 20–29, Feb. 2002. - [10] N. Zhuang and H. Wu, "A new design of the CMOS full adder," *IEEE J. Solid-State Circuits*, vol. 27, no. 5, pp. 840–844, May 1992. - [11] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design. New York, NY, USA: Addison-Wesley, 1985. - [12] P. Bhattacharyya, B. Kundu, S. Ghosh, V. Kumar, and A. 
Dandapat, "Performance analysis of a low-power high-speed hybrid 1-bit full adder circuit," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 23, no. 10, pp. 2001–2008, Oct. 2015.
- [13] M. Vesterbacka, "A 14-transistor CMOS full adder with full voltage-swing nodes," in *Proc. IEEE Workshop Signal Process. Syst. (SiPS)*, Oct. 1999, pp. 713–722.
- [14] M. Alioto, G. Di Cataldo, and G. Palumbo, "Mixed full adder topologies for high-performance low-power arithmetic circuits," *Microelectron. J.*, vol. 38, no. 1, pp. 130–139, 2007.
- [15] A. M. Shams and M. A. Bayoumi, "A novel high-performance CMOS 1-bit full-adder cell," *IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process.*, vol. 47, no. 5, pp. 478–481, May 2000.
- [16] M. Aguirre-Hernandez and M. Linares-Aranda, "CMOS full-adders for energy-efficient arithmetic applications," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 19, no. 4, pp. 718–721, Apr. 2011.
- [17] I. Hassoune, D. Flandre, I. O'Connor, and J. D. Legat, "ULPFA: A new efficient design of a power-aware full adder," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 57, no. 8, pp. 2066–2074, Aug. 2010.
- [18] C.-H. Chang, J. Gu, and M. Zhang, "A review of 0.18-μm full adder performances for tree structured arithmetic circuits," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 13, no. 6, pp. 686–695, Jun. 2005.
- [19] P. Kumar and R. K. Sharma, "Low voltage high performance hybrid full adder," *Eng. Sci. Technol., Int. J.*, vol. 19, no. 1, pp. 559–565, 2016. [Online]. Available: http://dx.doi.org/10.1016/j.jestch.2015.10.001
- [20] M. Ghadiry, M. Nadi, and A. K. A. Ain, "DLPA: Discrepant low PDP 8-bit adder," *Circuits Syst. Signal Process.*, vol. 32, no. 1, pp. 1–14, 2013.
- [21] S. Wairya, R. K. Nagaria, and S. Tiwari, "New design methodologies for high-speed low-voltage 1 bit CMOS Full Adder circuits," *Int. J. Comput. Technol. Appl.*, vol. 2, no. 2, pp. 190–198, 2011.
- [22] S. R. Chowdhury, A. Banerjee, A. Roy, and H.
Saha, "A high speed 8 transistor full adder design using novel 3 transistor XOR gates," *Int. J. Electron., Circuits Syst.*, vol. 2, no. 4, pp. 217–223, 2008.
- [23] M. A. Valashani and S. Mirzakuchaki, "A novel fast, low-power and high-performance XOR-XNOR cell," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, vol. 1. May 2016, pp. 694–697.
- [24] J.-M. Wang, S.-C. Fang, and W.-S. Feng, "New efficient designs for XOR and XNOR functions on the transistor level," *IEEE J. Solid-State Circuits*, vol. 29, no. 7, pp. 780–786, Jul. 1994.
- [25] W. C. Elmore, "The transient response of damped linear networks with particular regard to wideband amplifiers," *J. Appl. Phys.*, vol. 19, no. 1, pp. 55–63, 1948.
- [26] M. Alioto and G. Palumbo, "Analysis and comparison on full adder block in submicron technology," *IEEE Trans. Very Large Scale Integr. (VLSI) Syst.*, vol. 10, no. 6, pp. 806–823, Dec. 2002.
- [27] S. Goel, M. A. Elgamel, M. A. Bayoumi, and Y. Hanafy, "Design methodologies for high-performance noise-tolerant XOR-XNOR circuits," *IEEE Trans. Circuits Syst. I, Reg. Papers*, vol. 53, no. 4, pp. 867–878, Apr. 2006.
- [28] R. Eberhart and Y. Shi, "Particle swarm optimization: Developments, applications and resources," in *Proc. Congr. Evol. Comput.*, vol. 1. May 2001, pp. 81–86.
- [29] J. J. Liang, A. K. Qin, P. N. Suganthan, and S. Baskar, "Comprehensive learning particle swarm optimizer for global optimization of multimodal functions," *IEEE Trans. Evol. Comput.*, vol. 10, no. 3, pp. 281–295, Jun. 2006.
- [30] A. Ratnaweera, S. K. Halgamuge, and H. C. Watson, "Self-organizing hierarchical particle swarm optimizer with time-varying acceleration coefficients," *IEEE Trans. Evol. Comput.*, vol. 8, no. 3, pp. 240–255, Jun. 2004.
- [31] Y. Shi and R. Eberhart, "A modified particle swarm optimizer," in *Proc. IEEE Int. Conf. Evol. Comput. IEEE World Congr. Comput. Intell.*, May 1998, vol. 189. no. 5, pp. 69–73.
- [32] G. A.
Katopis, "Delta-I noise specification for a high-performance computing machine," *Proc. IEEE*, vol. 73, no. 9, pp. 1405–1415, Sep. 1985.
- [33] L. Wang and N. R. Shanbhag, "Noise-tolerant dynamic circuit design," in *Proc. IEEE Int. Symp. Circuits Syst. (ISCAS)*, vol. 1. Jul. 1999, pp. 549–552.

Hamed Naseri received the B.Sc. degree in electronic engineering from Zanjan University, Zanjan, Iran, in 2012 and the M.S. degree in digital electronic engineering from Shahid Beheshti University, Tehran, Iran, in 2015, where he is currently working toward the Ph.D. degree at the Faculty of Electrical Engineering. His current research interests include digital systems, computer arithmetic, VLSI design, low-power digital circuits, error control coding, fault-tolerant system design, computer architecture, and residue number systems.

Somayeh Timarchi (M'17) received the B.Sc. degree in computer engineering (hardware) from Shahid Beheshti University, Tehran, Iran, in 2002, the M.Sc. degree in computer system architecture from the Sharif University of Technology, Tehran, Iran, in 2004, and the Ph.D. degree in computer system architecture from Shahid Beheshti University in 2010. She performed studies on computer arithmetic at the Computer Engineering Laboratory, Delft University of Technology, Delft, The Netherlands. In 2010, she joined the Department of Electrical Engineering, Shahid Beheshti University, as an Assistant Professor. She has authored or co-authored more than 40 publications in journals and conference proceedings. Her current research interests include computer arithmetic, residue and redundant number systems, VLSI design, low-power and ultralow-power digital circuits, and computer architecture.