# Robust Circuit Design for Low-Voltage VLSI 

by

Yejoong Kim

A dissertation submitted in partial fulfillment
of the requirements for the degree of
Doctor of Philosophy
(Electrical Engineering)
in The University of Michigan
2015

Doctoral Committee:
Professor David Blaauw, Chair
Associate Professor Kenn Richard Oldham
Professor Dennis Michael Sylvester
Assistant Professor Zhengya Zhang

## TABLE OF CONTENTS

LIST OF FIGURES
LIST OF TABLES
CHAPTER
1 Introduction
2 Robust Level Converter Circuits for Wide-Range Voltage Conversion
2.1 Introduction
2.2 LC$^{2}$: Limited-Contention Level Converter
2.2.1 DCVS Level Converter and Its Current Margin
2.2.2 Operation of LC$^{2}$
2.2.3 Measurements
2.3 SLC: Split-Control Level Converter
2.3.1 Previous Level Converters
2.3.2 Operation of SLC
2.3.3 Measurements
2.4 Conclusions
3 A Robust 7T SRAM Design
3.1 Introduction
3.2 Ultra Low-Leakage 7T SRAM
3.2.1 Auto-Shut-Off Sensing
3.2.2 Quasi-Static READ
3.2.3 Bit-Interleaving with PMOS Pass-Gate
3.3 Conclusions
4 A Static Single-Phase Contention-Free Flip-Flop
4.1 Introduction
4.2 Previous Flip-Flops
4.3 S$^{2}$CFF (Static Single-phase Contention-free Flip-Flop)
4.3.1 Schematic and Operation Details
4.3.2 Hold Time Path
4.4 On-Chip Testing Circuits
4.4.1 Setup/Hold Time
4.4.2 C-Q Delay
4.4.3 Power
4.5 Measurements
4.6 Conclusions
5 A Testing Harness for Low-Voltage Flip-Flop Timing Characterization
5.1 Introduction
5.2 Issues in Low $V_{DD}$ Flip-Flop On-Chip Measurements
5.3 A New Phase Detection Circuit for Low $V_{DD}$ Operation
5.4 A Setup/Hold-Time Measurement Circuit for Wide Voltage-Range Operation
5.5 Measurements
6 Conclusion
6.1 Future Works
6.2 Related Publications and Patents
BIBLIOGRAPHY

## LIST OF FIGURES

Figure
1.1 A cubic-millimeter intraocular pressure monitoring system [4]
1.2 A modular $1\,\mathrm{mm}^{3}$ sensing platform [7]
1.3 A typical architecture of low-voltage VLSI systems
1.4 Bitcell size comparison between commercial 6T and 8T
1.5 Power breakdown of SPARC T4 processor [32]
1.6 Normalized unit-FO4 delay measurement in 45 nm
2.1 DCVS LC and its current margin plots
2.2 LC$^{2}$ operation
2.3 LC$^{2}$ schematic and its waveforms
2.4 LC$^{2}$ current margin plot
2.5 Simulation results of LC$^{2}$ and DCVS LC
2.6 Measured delay compared to DCVS LC
2.7 Measured power consumptions (freq = 5 kHz, $\alpha=2$)
2.8 Measured delay variations
2.9 Impact of voltage fluctuations
2.10 Number of operating LCs over temperature
2.11 (a) Conventional DCVS LC with Monte Carlo simulation result, (b) Interrupted DCVS LC with Monte Carlo simulation results
2.12 Level converter in [19]
2.13 SLC schematic
2.14 (a)(b) Comparisons between LC of [19] and SLC, (c) Monte Carlo simulations of SLC
2.15 Measured result comparisons
2.16 Yield comparison at very low temperature ($-25^{\circ}\mathrm{C}$)
2.17 (a) Die photo of the test chip, (b) Die photos of low voltage timer designs [42][7]
3.1 Bitcell size and standby power
3.2 7T bitcell schematic and the L-shaped layout
3.3 Auto-Shut-Off sensing and the measured improvement in READ energy
3.4 Circuit implementation of Auto-Shut-Off sensing
3.5 Quasi-Static READ
3.6 Measured improvement in read error rate due to Quasi-Static READ
3.7 Bit-interleaving with PMOS pass-gate
3.8 Effects of body biasing
3.9 Shmoo plot
3.10 Die photo
4.1 Schematics of TGFF and ACFF [35]
4.2 Schematics of TGPL [36] and TSPC [37]
4.3 Waveforms in TSPC when D stays 0 for consecutive cycles
4.4 Schematic of S$^{2}$CFF
4.5 Operation of S$^{2}$CFF
4.6 Hold time paths in TGFF and S$^{2}$CFF
4.7 Setup/hold time measurement circuit
4.8 C-Q delay measurement circuit
4.9 Power measurement circuit
4.10 Measured total power
4.11 Measured energy
4.12 Measured C-Q delay
4.13 Measured leakage power
4.14 Die photo of the test chip fabricated in 45 nm SOI
5.1 Mismatch sources in a setup/hold-time measurement circuit
5.2 A simplified diagram of the mismatch sources in a setup/hold-time measurement circuit
5.3 Edge alignment and offset ($\Delta T_{L}+T_{OFF}$) measurement when D rises
5.4 Dynamic NAND/NOR structures for edge alignment
5.5 Phase detector circuit diagram
5.6 Setup/hold-time measurement circuit
5.7 (a) Clock Buffer schematic (b) Current-starved buffer for delay tuning
5.8 Hold-time distribution of TGFF and S$^{2}$CFF at 1.0 V and 0.4 V (172 flip-flops of each type)
5.9 Hold-time distribution of TGFF and S$^{2}$CFF at 0.35 V and 0.32 V (172 flip-flops of each type)
5.10 Hold-time distribution of TGFF and S$^{2}$CFF at 1.0 V and 0.4 V (43 chips)
5.11 Hold-time distribution of TGFF and S$^{2}$CFF at 0.35 V and 0.32 V (43 chips)
5.12 Maximum hold-time value from the measured 172 flip-flops of each type
5.13 Die photo of the test chip fabricated in 45 nm SOI

## LIST OF TABLES

Table
2.1 Comparison of wide-range LCs at $25^{\circ}\mathrm{C}$
3.1 Comparison of low-power SRAMs
4.1 Comparison of conventional flip-flops
4.2 Setting activity ratio in power measurement circuit
4.3 Measurement and topology comparison of flip-flops
5.1 Comparison of the hold-time variations of TGFF and S$^{2}$CFF (172 flip-flops of each type)
5.2 Comparison of the hold-time variations of TGFF and S$^{2}$CFF (43 chips)

## CHAPTER 1

## Introduction

The insatiable demand for more integration and performance recently resulted in a 15-core, 30-thread commercial processor with 4.31 billion transistors [1]. The clock frequency, one of the key indicators of chip performance, reached 4 GHz in a 90 nm CMOS process in 2004 [3]. However, it has not followed the trend observed in the transistor count and has remained nearly constant over the years [2], apparently saturating in the range of 5 to 6 GHz. The main reason is the "power wall": excessive power density significantly limits chip reliability, yield, and performance, and raises cooling expense [6]. This requires chip designers to consider power consumption at all design levels.

At the other end of the spectrum lie portable hand-held devices and wireless sensor nodes. Their low power consumption requirement comes from the small form factor, where only a limited-size battery is available. For example, the intraocular pressure monitoring system [4] shown in Figure 1.1 measures $1.5\,\mathrm{mm} \times 2\,\mathrm{mm} \times 0.5\,\mathrm{mm}$ and includes a $1\,\mu\mathrm{Ah}$ thin-film battery. Due to this small battery capacity, every part of the system has been specifically designed for the target application.

A more general and modular approach to wireless sensor nodes was introduced in [7]; the system photo is shown in Figure 1.2. It is a $1\,\mathrm{mm}^{3}$ wireless sensor node platform that limits its total volume to within $1.4\,\mathrm{mm} \times 2.8\,\mathrm{mm} \times 1.6\,\mathrm{mm}$, hence only allowing a $0.6\,\mu\mathrm{Ah}$ thin-film battery on which two ARM Cortex$^{\mathrm{TM}}$-M0 processors and other digital/analog circuits, including sensors, have to operate reliably. Although this system allows stacking many different IC layers fabricated in different processes using a low-power inter-layer communication bus [5], making it easier to expand system functionality, the severe power constraint requires the entire system to consume less than $40\,\mu\mathrm{W}$ of active power while utilizing duty-cycled operation with extremely low sleep power (11 nW). Therefore, every circuit component in this system must take low-power concerns into account while still guaranteeing robust system functionality.

Figure 1.1: A cubic-millimeter intraocular pressure monitoring system [4]

Figure 1.2: A modular $1\,\mathrm{mm}^{3}$ sensing platform [7]

Generally, the dynamic power consumption of typical digital circuits is given by

$$
\begin{equation*}
P_{dyn}=C_{eff} V_{DD}^{2} f_{clk} \tag{1.1}
\end{equation*}
$$

where $C_{eff}$ is the effective switching capacitance, and $V_{DD}$ and $f_{clk}$ are the supply voltage and the operating clock frequency, respectively. While technology scaling helps reduce the intrinsic capacitance, many circuit techniques have been developed to exploit the quadratic dependence on $V_{DD}$ for effective power reduction.
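As a quick illustration of the quadratic dependence in Eq. (1.1), the following sketch evaluates the dynamic power at two supply voltages. The capacitance and frequency values are hypothetical placeholders chosen for readability, not figures from this work.

```python
def dynamic_power(c_eff, v_dd, f_clk):
    """Dynamic power per Eq. (1.1): P_dyn = C_eff * V_DD^2 * f_clk."""
    return c_eff * v_dd ** 2 * f_clk


# Hypothetical operating point (illustrative values only).
C_EFF = 1e-9    # effective switching capacitance in farads (assumed)
F_CLK = 100e6   # clock frequency in hertz (assumed)

p_nominal = dynamic_power(C_EFF, 1.0, F_CLK)
p_scaled = dynamic_power(C_EFF, 0.5, F_CLK)
reduction = p_nominal / p_scaled

# At a fixed clock frequency, halving V_DD divides dynamic power by four.
print(f"P_dyn at 1.0 V: {p_nominal:.3f} W")
print(f"P_dyn at 0.5 V: {p_scaled:.3f} W")
print(f"reduction: {reduction:.1f}x")
```

In practice the achievable clock frequency also falls with the supply voltage, so the comparison at a fixed frequency isolates only the $V_{DD}^{2}$ term.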

One widely used technique is dynamic voltage and frequency scaling (DVFS) [8], where the supply voltage and the clock frequency are dynamically adjusted depending on load conditions or operation modes. Its effectiveness has made DVFS quite popular, and many leading institutions and companies have applied it in various types of designs [1][9][10][11][12], where the processors aim to achieve power savings without degrading critical performance. In extremely power-constrained systems, further voltage scaling down to the near- or sub-threshold level has been applied. The FFT processor in [13] performs FFT operations at 90 nW by lowering the supply voltage to 180 mV, which is at the sub-threshold level in the standard $0.18\,\mu\mathrm{m}$ CMOS logic process used in that work. A lower supply voltage yields lower power consumption, as shown in Eq. (1.1). However, lower power does not necessarily mean lower energy. As the supply voltage is lowered, the maximum achievable clock frequency also degrades due to the reduced device on-current ($I_{ON}$). The slower operating frequency (i.e., a longer clock period) increases the leakage energy per cycle, hence reducing the ratio of the dynamic energy to the leakage energy. Therefore, there exists a minimum energy point below which further voltage scaling does not reduce the overall energy consumption because leakage energy dominates. As a result, the FFT processor above reaches its minimum energy point at 350 mV with $155\,\mathrm{nJ}/\mathrm{FFT}$, whereas the minimum voltage point at 180 mV consumes more than $1\,\mu\mathrm{J}/\mathrm{FFT}$.
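The existence of a minimum energy point can be reproduced with a very small first-order model. The sketch below is not the model used in [13] and does not attempt to match its 350 mV result; the slope factor, activity factor, and logic depth are assumed values, and the sub-threshold current expression is applied over the whole sweep, which is only reasonable because the leakage term is negligible well above threshold.

```python
import math

# All parameters are assumptions chosen for illustration.
N_VT = 1.5 * 0.02585   # slope factor n times thermal voltage kT/q (V)
ALPHA = 0.1            # switching activity factor
LOGIC_DEPTH = 50       # gate delays per clock cycle


def energy_per_cycle(v_dd):
    """Normalized energy per gate per cycle (units of C * 1 V^2).

    Dynamic part: ALPHA * V^2.
    Leakage part: I_off * V * T_cycle, with T_cycle = LOGIC_DEPTH * C * V / I_on
    and I_on / I_off = exp(V / N_VT), giving LOGIC_DEPTH * V^2 * exp(-V / N_VT).
    """
    dynamic = ALPHA * v_dd ** 2
    leakage = LOGIC_DEPTH * v_dd ** 2 * math.exp(-v_dd / N_VT)
    return dynamic + leakage


# Sweep the supply from 0.15 V to 1.00 V in 1 mV steps.
voltages = [0.15 + 0.001 * i for i in range(851)]
v_min = min(voltages, key=energy_per_cycle)

print(f"minimum-energy supply in this toy model: {v_min:.3f} V")
print(f"energy at 0.15 V relative to minimum: "
      f"{energy_per_cycle(0.15) / energy_per_cycle(v_min):.2f}x")
print(f"energy at 1.00 V relative to minimum: "
      f"{energy_per_cycle(1.0) / energy_per_cycle(v_min):.2f}x")
```

Below the minimum, the exponentially growing cycle time makes leakage energy dominate; above it, the quadratic dynamic term dominates, which is the trade-off described in the text.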

This minimum energy point typically occurs at a voltage slightly lower than the device threshold voltage (hence, sub-threshold). However, researchers found that the energy reduction is only $\sim 2\times$ when $V_{DD}$ is scaled from the near-threshold regime to the sub-threshold regime, whereas delay increases by $50-100\times$ over the same region [14]. Thus, for many applications, the near-threshold regime can be a better choice than the sub-threshold regime in terms of the energy-delay trade-off, and near-threshold computing (NTC) has become an attractive solution for low-power VLSI systems [15][16][17].

However, there are several issues in NTC operation [14]. First, the lower supply voltage significantly degrades performance, although this can be compensated by parallelism to some extent. Second, NTC exhibits increased sensitivity to process/voltage/temperature (PVT) variations. In the NTC region, the MOSFET drive current has an exponential dependency on the supply voltage ($V_{DD}$), device threshold voltage ($V_{TH}$), and temperature. Thus, even a small amount of variation can lead to a severe yield reduction, especially in ratioed designs in which the circuit functionality depends on relative device sizing. Therefore, proper circuit-level techniques have to be applied for low-voltage VLSI.
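To give a feel for the exponential dependency on $V_{TH}$, the sketch below uses the textbook sub-threshold relation $I \propto \exp\left((V_{GS}-V_{TH})/(n\,kT/q)\right)$. The slope factor and the threshold shifts are assumed values for illustration, not measurements from this work.

```python
import math

THERMAL_VOLTAGE = 0.02585   # kT/q near room temperature (V)
SLOPE_FACTOR = 1.5          # sub-threshold slope factor n (assumed)


def subthreshold_current_ratio(delta_vth):
    """Factor by which sub-threshold current changes for a V_TH shift.

    Based on I ~ exp((V_GS - V_TH) / (n * kT/q)); a positive delta_vth
    reduces the current by the returned factor.
    """
    return math.exp(delta_vth / (SLOPE_FACTOR * THERMAL_VOLTAGE))


for shift_mv in (10, 30, 50):
    ratio = subthreshold_current_ratio(shift_mv / 1000)
    print(f"V_TH shift of {shift_mv} mV -> current changes by {ratio:.2f}x")
```

Under these assumptions a few tens of millivolts of threshold shift already changes the drive current by more than a factor of two, which is why ratioed circuits lose yield quickly at low voltage.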

In this dissertation, we identify several circuit components that are critical to low-voltage VLSI operation and propose new and advanced techniques to improve their robustness and performance. A typical architecture of low-voltage VLSI systems is shown in Figure 1.3; level converter circuits, SRAM, and clocked sequential elements are highlighted, and each will be discussed in detail in the following chapters.

Level converters are one of the main concerns, especially in aggressively voltage-scaled systems. Typically, digital cores operate at low supply voltages to save power, but other peripherals cannot always run at such low voltages. For example, it is hard to apply voltage scaling to analog circuits due to the reduced voltage headroom (hence reduced margins/offsets). Also, I/O voltages do not scale well due to noise concerns. Thus, level converters are required at the interface between the low-voltage digital core and the high-voltage analog circuits and peripherals. However, as the cores become deeply voltage-scaled, the difference between the low voltage ($V_{DDL}$) and the high voltage ($V_{DDH}$) becomes larger. Especially for a core running in the NTC region, the reduced $I_{ON}/I_{OFF}$ ratio makes it extremely difficult to achieve robust level conversion. The use of native-$V_{TH}$ (or zero-$V_{TH}$) devices in [18] improves robustness by allowing thin gate-oxide devices (i.e., stronger devices) to be used for pull-down, but other techniques are still required to achieve good performance, lower energy consumption, and good yield. A well-known approach to improving robustness is weakening the pull-up strength or strengthening the pull-down. For example, [19] uses PMOS diodes to weaken the pull-up strength, and [20] and [22] include reduced-swing inverters. A dynamic level converter can improve speed and robustness at the cost of extra power and a complicated synchronization circuit [21]. In Chapter 2, we will propose new static level converter circuits and a quantitative design method to guarantee robustness.


Figure 1.3: A typical architecture of low-voltage VLSI systems


Figure 1.4: Bitcell size comparison between commercial 6T and 8T

SRAMs are one of the major bottlenecks in voltage scaling [23]; the standard 6T bitcell requires ratioed device sizing, and the two-sided constraint (READ and WRITE) significantly degrades robustness in the low-voltage regime. Using 8T bitcells decouples READ and WRITE operations, making it possible to optimize the two operations separately at the cost of a larger bitcell area [24][25][26]. Generally, 8T bitcells have a $30 \sim 55\%$ area penalty compared to the standard 6T bitcell; an example in an advanced technology node is shown in Figure 1.4. This significant area overhead makes the 8T bitcell unacceptable in severely area-constrained applications. In the NTC region, the functionality of the bitcell is further impacted by the aggravated PVT variations. Thus, in this case, even the 8T bitcell requires assistance from extra peripheral circuits for correct functionality [26][27], or a bitcell with more devices is preferred, such as the 10T bitcells in [7] and [28]. Recently, 7T bitcells have been proposed in [29] and [30]; they are intended to have a smaller bitcell size than the 8T bitcell while still providing similar robustness (i.e., decoupled READ and WRITE). In Chapter 3, we will address issues in the 7T structure and propose a new solution that still fully utilizes the inherent advantages of the 7T bitcell.

The next key component is the clocked sequential element, called a flip-flop for short. Flip-flops are among the critical components in today's digital processors. For example, both the POWER7$^{\mathrm{TM}}$ and SPARC T4 processors have more than 2 million flip-flops, taking up to $20\%$ of the total core power [31][32], as shown in Figure 1.5. Because of their importance in digital circuits, numerous flip-flop designs have been investigated and proposed [33][33]. The main issue with conventional flip-flops in the NTC region is the degraded hold-time variation [38], which requires excessive buffer insertion to meet the hold-time margin under severe PVT variations. In Chapter 4, we will further discuss issues in conventional flip-flops in the literature [35][36][37] and propose a new flip-flop that is static, single-phase, and contention-free, and that also provides a $\sim 40\%$ power reduction compared to the conventional flip-flop.

Figure 1.5: Power breakdown of SPARC T4 processor [32]

Figure 1.6: Normalized unit-FO4 delay measurement in 45 nm

The last topic in this dissertation is a testing harness for flip-flop timing characterization. Representative timing parameters of flip-flops are setup-time ($T_{SETUP}$), hold-time ($T_{HOLD}$), and C-Q delay ($T_{CQ}$). These parameters are usually in the range of $1 \sim 5$ FO4 delays, so an accurate Time-to-Digital Converter (TDC) is required to measure such short delays. In addition, a more difficult problem arises in that those parameters are usually determined by mismatches in the devices used to implement the flip-flops. At full $V_{DD}$, those mismatches can be minimized by upsizing transistors and careful layout techniques, but it is almost impossible to achieve the same measurement accuracy at low $V_{DD}$ due to the severe variations mentioned earlier in this chapter. For example, Figure 1.6 shows that the standard deviation of measured unit-FO4 delays in 45 nm degrades by $118\times$ when going from 1.0 V to 0.32 V, while the mean increases by only $29\times$. These variations have more severe effects in complicated circuits, and in flip-flop timing characterization they often cause a large offset in measurements. In Chapter 5, we will propose effective techniques to eliminate the measurement offsets incurred by the mismatches and provide setup/hold-time measurements at near-$V_{TH}$ to demonstrate the benefit of the new flip-flop introduced in Chapter 4.
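The two FO4 ratios quoted above can be combined into a single measure of relative spread. As a worked step (simple arithmetic on the reported ratios, not an additional measurement), the coefficient of variation $\sigma/\mu$ of the unit-FO4 delay grows by

$$
\frac{(\sigma/\mu)_{0.32\,\mathrm{V}}}{(\sigma/\mu)_{1.0\,\mathrm{V}}}=\frac{118}{29}\approx 4.1
$$

so the spread relative to the mean is roughly four times larger at 0.32 V than at 1.0 V.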

Finally, in Chapter 6, we will conclude this dissertation by summarizing the proposed circuits and discussing possible future work.

## CHAPTER 2

## Robust Level Converter Circuits for Wide-Range Voltage Conversion

### 2.1 Introduction

Low-voltage circuit design has been widely investigated for ultra-low power applications, reaching as low as 230 mV in a recent multi-pipelined processor [39], and requiring wide-range level conversion for communication with I/O pads and high-voltage circuit blocks. In addition, cores on a chip multiprocessor are increasingly voltage scaled independently [9], necessitating level conversion between core voltage domains in high performance applications. Another example is a multi-core system in [41], which suggests an optimal voltage/frequency mapping among the cores and requires thousands of level converters (LCs).

LCs become more critical as the voltage difference grows, for instance, between aggressively voltage-scaled DSP accelerators [13] and I/O. An extreme case is the wireless sensor node platform in [7], where the core is operated at a sub-threshold level while sensors and radio use the battery voltage (3.6 V). Due to such significant voltage differences, these applications require wide-range LCs with fast and low-power operation. However, level conversion is challenging at reduced voltages since conventional approaches suffer from severe contention between weak pull-down devices and strong pull-up devices, making them vulnerable to process/voltage/temperature (PVT) variations. Also, LCs in many sensing applications, such as environmental monitoring, will be exposed to extreme conditions, exacerbating robustness challenges in the LCs.


Figure 2.1: DCVS LC and its current margin plots

In this chapter, we will present two robust level converters, called the Limited-Contention Level Converter (LC$^{2}$) and the Split-Control Level Converter (SLC), respectively. Operation details and measurement comparisons follow.

### 2.2 LC$^{2}$: Limited-Contention Level Converter

### 2.2.1 DCVS Level Converter and Its Current Margin

Figure 2.1 shows the operation of a conventional Differential Cascode Voltage Switch (DCVS) approach. A zero-$V_{TH}$ device prevents oxide breakdown in the thin-oxide devices, making it possible to use a fast standard-$V_{TH}$ (SVT) pull-down device [18]. The DCVS LC suffers from a two-sided constraint on the PMOS device: if the PMOS is too weak, the pull-up transition becomes slow and the node may not be kept high, giving rise to performance and robustness issues; if the PMOS is too strong, the NMOS cannot overcome it and the circuit fails. The current margin plots in Figure 2.1 show that severe variations at low voltages exacerbate this two-sided constraint. Although the circuit is designed such that $I_{NMOS} \gg I_{PMOS}$ to discharge node $n1$ or $n2$, as little as $2\sigma$ of $V_{TH}$ variation causes failure due to $I_{NMOS}<I_{PMOS}$. Increasing the NMOS size by $3.5\times$ guarantees $3\sigma$ robustness, but results in very large devices ($W_{NMOS}=105\,\mu\mathrm{m}$) with undesirable leakage (9 nA). In addition, the increased diffusion capacitance slows the pull-up transition. This two-sided constraint severely limits DCVS LC robustness under PVT variation. Multiple LC stages can improve robustness but introduce overhead due to intermediate supplies and increased latency. Other static LCs [19][20] have similar two-sided constraints and require precise transistor sizing. A recently proposed dynamic LC [21] uses a high-voltage clock, which improves robustness but increases layout size and power consumption. Furthermore, none of the previous LCs has demonstrated robustness through comprehensive silicon measurements.

Figure 2.2: LC$^{2}$ operation
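The pull-down side of this constraint can be illustrated with a toy Monte Carlo experiment. The sketch below is not the simulation setup behind Figure 2.1: the nominal current margin, the $V_{TH}$ sigma, and the slope factor are assumed values, PMOS variation is neglected, and the leakage and capacitance penalties of upsizing are not modeled.

```python
import math
import random

N_VT = 1.5 * 0.02585   # slope factor n times kT/q in volts (assumed)
SIGMA_VTH = 0.030      # 1-sigma NMOS V_TH variation in volts (assumed)


def dcvs_yield(nominal_margin, trials=20000, seed=1):
    """Fraction of samples in which the pull-down still beats the pull-up.

    nominal_margin is I_NMOS / I_PMOS without variation. The NMOS gate is
    driven only to V_DDL, so it is modeled in sub-threshold and a V_TH shift
    scales its current exponentially. I_PMOS is normalized to 1.
    """
    rng = random.Random(seed)
    passed = 0
    for _ in range(trials):
        delta_vth = rng.gauss(0.0, SIGMA_VTH)
        i_nmos = nominal_margin * math.exp(-delta_vth / N_VT)
        if i_nmos > 1.0:
            passed += 1
    return passed / trials


baseline = dcvs_yield(nominal_margin=4.0)        # hypothetical 4x margin
upsized = dcvs_yield(nominal_margin=4.0 * 3.5)   # NMOS width scaled by 3.5x

print(f"baseline yield: {baseline:.4f}")
print(f"upsized yield:  {upsized:.4f}")
```

Because both runs reuse the same random samples, the upsized case can only pass more samples than the baseline; the sketch shows why a nominal $I_{NMOS} \gg I_{PMOS}$ margin is eroded quickly when the pull-down current depends exponentially on $V_{TH}$.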

### 2.2.2 Operation of $\mathrm{LC}^{2}$

We propose a new approach called the Limited-Contention Level Converter (LCLC or LC$^{2}$) that eliminates the two-sided constraint without the use of high-voltage clocks. Figure 2.2 shows the conceptual operation of LC$^{2}$. Before the rising transition, node $n1$ is held high by the weak keeper, which is sub-threshold-biased, while all other switches are off; hence $V_{n1}=V_{DDH}$ and $V_{n2}=0$. Once $V_{IN}$ rises to $V_{DDL}$, the pull-down driver starts to discharge $n1$ and easily overcomes the weak keeper. This transition on $n1$ causes "Pull-Up Control" to activate both the weak keeper and the strong switch on the other side, which quickly charges up $n2$. "Pull-Down Control" is then triggered to directly connect $n1$ to ground, rapidly discharging it and completing the transition. Finally, a delay element turns off all switches (except the appropriate keeper) after all transitions are finalized. The next transition can then proceed such that the only contention is with the weak keeper. The use of separate pull-up devices of different strengths for holding state and for charging/discharging $n1$ and $n2$ substantially improves design robustness and performance.
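The sequence described above can be written down as a purely behavioral trace. This is a bookkeeping sketch, not a circuit simulation: node voltages are abstracted to 0 or 1, the class and field names are hypothetical, and the exact instant at which the keeper on $n1$ is released is an assumption (it is simply turned off in the final phase).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LC2State:
    n1: int                    # 1 = V_DDH, 0 = ground
    n2: int
    keeper_n1: bool            # weak keeper holding n1
    keeper_n2: bool            # weak keeper holding n2
    strong_pullup_n2: bool     # strong switch charging n2
    direct_pulldown_n1: bool   # switch tying n1 directly to ground


def rising_transition():
    """Return the (phase, state) sequence for a rising input edge."""
    idle = LC2State(n1=1, n2=0, keeper_n1=True, keeper_n2=False,
                    strong_pullup_n2=False, direct_pulldown_n1=False)
    # V_IN rises to V_DDL: the pull-down driver overcomes the weak keeper.
    driver = replace(idle, n1=0)
    # Pull-Up Control: weak keeper and strong switch on the n2 side turn on.
    pull_up = replace(driver, n2=1, keeper_n2=True, strong_pullup_n2=True)
    # Pull-Down Control: n1 is connected directly to ground.
    pull_down = replace(pull_up, direct_pulldown_n1=True)
    # Delay element: every switch except the keeper on n2 is turned off.
    settled = replace(pull_down, keeper_n1=False, strong_pullup_n2=False,
                      direct_pulldown_n1=False)
    return [("idle", idle), ("driver", driver), ("pull-up control", pull_up),
            ("pull-down control", pull_down), ("settled", settled)]


trace = rising_transition()
for phase, state in trace:
    print(f"{phase:18s} {state}")
```

The final state leaves only a weak keeper active, so the next (falling) transition again contends with nothing stronger than a keeper.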

Figure 2.3 shows the schematic of LC$^{2}$ with detailed timing waveforms. At the beginning of a rising transition, $V_{n1}=V_{n3}=V_{DDH}$ and $V_{n2}=V_{n4}=0$, hence M6 and M11 are off and M1 contends only with the weak keeper Mx. Once M1 and M3 start to discharge $n1$, positive feedback from M10 and M7 boosts transition speed by pulling the gate of M7 to $V_{DDH}$. Thus, M10 can be sized for fast rising transitions on $n2$ (using a minimum-length device). In contrast, this transistor must remain weak in the conventional approach to minimize contention, making it slower and less robust. Once the transition completes, M5 and M12 are turned off after an inverter-chain delay to prepare for the next transition. Devices M5-M12 use minimum width, and the inverter chains simply require sufficient delay to fully charge $n1$ or $n2$, simplifying device sizing. Although the pull-down drivers (M1 and M2) and keepers should be carefully sized, keeper size can be easily determined using known techniques [40] after determining M1 and M2 sizes based on the desired speed-power trade-off. A simple diode chain is used to generate the keeper voltage ($V_{KEEPER}$), setting the current supplied by the keeper. The current margin plot in Figure 2.4 shows that this design is robust to $>3\sigma$ variation in simulation. Simulation results in Figure 2.5 indicate that DCVS is highly vulnerable to $V_{TH}$ shifts, while LC$^{2}$ functions correctly across the entire process corner space without significant delay change. Note that the vertices of the polygons represent the pre-defined process corners (FF, FS, SF, SS) of the devices specified in the figure. White regions indicate a delay larger than 10 FO4 or functional failure.

Figure 2.3: LC$^{2}$ schematic and its waveforms

Figure 2.4: LC$^{2}$ current margin plot

### 2.2.3 Measurements

We measured 40 dies in 130 nm CMOS; each die has two LC$^{2}$s and two DCVS LCs designed for 0.3 V to 2.5 V conversion ($V_{DDL}=0.3\,\mathrm{V}$, $V_{DDH}=2.5\,\mathrm{V}$) with a minimum-sized inverter as an output load. Figure 2.6 shows measured delay across temperature. LC$^{2}$ is $3.2\times$ faster than DCVS, with a 2.38 FO4 delay at $25^{\circ}\mathrm{C}$ (FO4 measured at the $V_{DDL}$ supply and the corresponding temperature). In addition, DCVS shows a $10.4\times$ delay change across $10 \sim 100^{\circ}\mathrm{C}$, while LC$^{2}$ changes by only $4.3\times$. Normalized to FO4 delays, LC$^{2}$ delay increases by $18\%$ from 10 to $100^{\circ}\mathrm{C}$, while DCVS worsens by $104\%$. This is due to the much reduced contention in LC$^{2}$. Figure 2.7 shows measured power consumption across temperature. While DCVS consumes 7.15 nW of static power, LC$^{2}$ consumes $15\times$ less (475 pW) at $25^{\circ}\mathrm{C}$, mainly due to the smaller pull-down device ($1.5\,\mu\mathrm{m}$). LC$^{2}$ consumes 2.29 nW of active power at $25^{\circ}\mathrm{C}$, which is $4.9\times$ less than DCVS (11.21 nW), and its active power is nearly constant over a wide temperature range. Due to the lack of contention, its active energy is dominated by the charging of capacitances rather than by short-circuit current as in DCVS, making it temperature-insensitive. Active power changes by only $2\%$ (from 2.27 nW to 2.32 nW) in the $10 \sim 100^{\circ}\mathrm{C}$ range, while DCVS shows a $7.7\times$ change (from 4.15 nW to 31.88 nW) and high power consumption at low temperature. Unlike LC$^{2}$, not all 80 DCVS LCs function below $10^{\circ}\mathrm{C}$, since the low temperature increases $V_{TH}$, weakening the NMOS exponentially and the PMOS linearly, exacerbating contention.

Figure 2.5: Simulation results of LC$^{2}$ and DCVS LC

Figure 2.6: Measured delay compared to DCVS LC

Figure 2.7: Measured power consumptions (freq = 5 kHz, $\alpha=2$)

Figure 2.8: Measured delay variations

To show the impact of process variations, Figure 2.8 displays measured delay distributions for the LCs at $25^{\circ}\mathrm{C}$. LC$^{2}$ shows a $6\times$ smaller standard deviation than DCVS. For voltage variations, Figure 2.9 shows performance degradation under supply voltage drop. While DCVS delay increases by $7.7\times$ with a $10\%$ $V_{DDL}$ drop, LC$^{2}$ slows by only $6\%$ (normalized to FO4 delays at the corresponding voltages), indicating that the keeper sizing strategy is sufficiently robust to handle expected voltage variations. Figure 2.10 shows the number of operating LCs at 1 MHz across temperature. DCVS was designed to operate as fast as 20 MHz at $25^{\circ}\mathrm{C}$, so the 1 MHz clock allows $20\times$ delay degradation. While all LC$^{2}$s operate reliably in the $-20 \sim 100^{\circ}\mathrm{C}$ range, the first DCVS fails at $20^{\circ}\mathrm{C}$, and only 5 of 80 work at $-20^{\circ}\mathrm{C}$, showing the robustness of LC$^{2}$ to PVT variations.
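The power ratios reported in this section follow directly from the quoted measurements. The short check below only restates numbers from the text; the variable names are chosen here for readability.

```python
# Power values quoted in this section, in nW.
dcvs_static, lc2_static = 7.15, 0.475      # static power at 25 C
dcvs_active, lc2_active = 11.21, 2.29      # active power at 25 C
lc2_active_min, lc2_active_max = 2.27, 2.32      # LC2 over 10-100 C
dcvs_active_min, dcvs_active_max = 4.15, 31.88   # DCVS over 10-100 C

static_ratio = dcvs_static / lc2_static
active_ratio = dcvs_active / lc2_active
lc2_active_change = (lc2_active_max - lc2_active_min) / lc2_active_min
dcvs_active_change = dcvs_active_max / dcvs_active_min

print(f"static power ratio (DCVS / LC2): {static_ratio:.1f}x")
print(f"active power ratio (DCVS / LC2): {active_ratio:.1f}x")
print(f"LC2 active power change:  {100 * lc2_active_change:.1f}%")
print(f"DCVS active power change: {dcvs_active_change:.1f}x")
```

The results agree with the rounded figures in the text ($15\times$, $4.9\times$, $2\%$, and $7.7\times$).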


Figure 2.9: Impact of voltage fluctuations


Figure 2.10: Number of operating LCs over temperature

### 2.3 SLC: Split-Control Level Converter

### 2.3.1 Previous Level Converters

$\mathrm{LC}^{2}$ introduced in the previous section shows robust level conversion with superior performance and power. However, in systems requiring thousands of LCs, the area of $\mathrm{LC}^{2}$, which is comparable to the conventional DCVS LC, could become a limiting factor. Hence, a smaller (and probably simpler) LC can be beneficial in those applications.

As already discussed, the DCVS LC shows poor robustness; in Figure 2.11(a), its yield is only $64.72\%$ over 100,000 Monte Carlo simulations at $25^{\circ}\mathrm{C}$, even with very large pull-down devices of $(W/L)_{M1,M2}=30\,\mu\mathrm{m}/0.12\,\mu\mathrm{m}$. The interrupted DCVS LC in Figure 2.11(b) has an additional PMOS M7 (or M8) that is expected to be weakened when $V_{INB}=V_{DDL}$ (or $V_{IN}=V_{DDL}$), thus reducing $I_{PMOS,ON}$. However, this is not effective for $V_{DDL} \ll V_{DDH}$ since $\left|V_{GS}\right|$ of M7 (or M8) remains close to $V_{DDH}$. Monte Carlo simulations show only marginal improvement over conventional DCVS in this case. Previously proposed LCs use either a sensitive sub-threshold analog circuit (i.e., a reduced-swing inverter) that has not been fully demonstrated in silicon [20][22], or a high-voltage clock ($V_{CLK}=V_{DDH}=2.5\,\mathrm{V}$) that results in high power consumption and a complex synchronization circuit [21], causing a $1016 \times$ larger layout size than the conventional DCVS LC.
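With 100,000 Monte Carlo runs, the statistical uncertainty of the quoted DCVS yield is small. The sketch below applies the standard binomial standard-error formula and assumes the runs are independent; it is a sanity check added here, not a figure reported in the simulations.

```python
import math


def yield_standard_error(yield_fraction, runs):
    """Binomial standard error of a yield estimated from independent runs."""
    return math.sqrt(yield_fraction * (1.0 - yield_fraction) / runs)


se = yield_standard_error(0.6472, 100_000)
print(f"standard error of the 64.72% yield: {100 * se:.2f} percentage points")
```

Under that assumption, the sampling error is far smaller than the yield gap to a robust design, so the low DCVS yield cannot be attributed to Monte Carlo noise.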

The LC in [19] is shown in Figure 2.12 and includes zero-$V_{TH}$ devices and additional PMOS diodes to tolerate 0.3 V to 2.5 V conversion in 130 nm CMOS. The diodes (M9-M12) serve as current limiters, effectively reducing $I_{PMOS,ON}$ and hence improving robustness. However, they also prevent nodes $n3$ and $n4$ from fully discharging to ground; hence this design requires additional pull-down devices (M5-M8) that add internal node capacitance. Thus, the discharge speed at $n4$ (or $n3$) is slow, causing short-circuit current in the output inverter. Also, $n1$ (or $n2$) is never fully charged to $V_{DDH}$ due to the diode voltage drop ($V_{D}$), which causes static near-threshold current as depicted in the figure.

### 2.3.2 Operation of SLC

Figure 2.13 shows the proposed LC, named the Split-Control Level Converter (SLC). It includes a new output structure (M11 and M12) to avoid the aforementioned problems. At the beginning of a rising transition at IN, $V_{n1}=0$ and $V_{n2}=V_{DDH}-V_{D}$, where $V_{D}$ represents the diode voltage drop through M6/M8 (or M5/M7). Once $V_{IN}$ goes high to $V_{DDL}$, M2 can easily discharge node $n2$ because of the current-limiting diodes. Node $n4$ is also discharged to $V_{D}$, and M11 is strongly on with a large $\left|V_{GS}\right|$, quickly charging up the output node while M12 is completely off. The circuit does not require the additional pull-down paths that contain the largest devices in the circuit, which results in at least $1.8\times$ lower static power across process corners, as shown in Figure 2.14(a). This also reduces the internal loading at $n4$ and $n3$, speeding transitions at these nodes. In addition, the gate voltages of M11 and M12 are separately controlled in the output buffer (hence the name Split-Control LC). This configuration ensures that the transistor turning off in the M11-M12 stack always leads the transistor turning on, significantly reducing short-circuit current and also improving the charging (or discharging) speed. Overall, Figure 2.14(b) shows that the circuit provides a $3.8-12.9\times$ reduction in short-circuit energy consumption across process corners. Monte Carlo simulations show high yield ($98.93\%$) with much lower delay variability (Figure 2.14(c)). Compared to the LC in [19], which has $\mu=2.02$ FO4 and $\sigma=0.79$ FO4, SLC improves the delay because of the output buffer.

Figure 2.11: (a) Conventional DCVS LC with Monte Carlo simulation result, (b) Interrupted DCVS LC with Monte Carlo simulation results. Legend: thick-oxide (zero-$V_{TH}$) devices and thin-oxide (SVT) devices; $V_{DDL}=0.3\,\mathrm{V}$, $V_{DDH}=2.5\,\mathrm{V}$.

Figure 2.12: Level converter in [19] (annotated with 4 pull-down devices and 4 ZVT devices)

Figure 2.13: SLC schematic

### 2.3.3 Measurements

We compare SLC to the conventional DCVS rather than the design in [19], since the four zero-$V_{T H}$ devices in the LC of [19] make it slower than conventional DCVS at $>25^{\circ} \mathrm{C}$ due to increased internal loading. The minimum size requirement of zero-$V_{T H}$ devices also makes it comparable to the size of the large pull-down devices in DCVS, such that the LC in [19] has only $17 \%$ smaller


Figure 2.14: (a)(b) Comparisons between LC of [19] and SLC, (c) Monte Carlo simulations of SLC
layout size than DCVS despite the use of $15 \times$ smaller pull-down devices. Hence, DCVS provides a more challenging comparison point. We measured 40 dies in 130 nm CMOS; each die had two DCVS LCs and two SLCs, providing 80 LCs for each type. The LCs were designed for 0.3 V to 2.5 V conversion. Also, we used the simulated unit-FO4 delay to convert measured delays into FO4 delays. The unit-FO4 delay was simulated at $V_{D D L}$ and the corresponding temperature.
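The FO4 normalization described above amounts to dividing each measured delay by the simulated unit-FO4 delay at the same $V_{DDL}$ and temperature. A minimal sketch follows; the function name and the unit-FO4 value are illustrative assumptions (the unit-FO4 delay is not reported in the text, so the number below is back-calculated from the reported SLC delay and is not a simulated result).

```python
def to_fo4(delay_ns: float, unit_fo4_ns: float) -> float:
    """Express a measured delay in units of the simulated FO4 delay."""
    if unit_fo4_ns <= 0:
        raise ValueError("unit FO4 delay must be positive")
    return delay_ns / unit_fo4_ns


# Hypothetical unit-FO4 delay at V_DDL = 0.3 V and 25 C (assumed, not from the source).
UNIT_FO4_NS_25C = 17.44

# 58.78 ns is the SLC delay reported in Table 2.1.
slc_delay_fo4 = to_fo4(58.78, UNIT_FO4_NS_25C)
print(f"SLC delay: {slc_delay_fo4:.2f} FO4")
```

Because the unit-FO4 delay is taken at the corresponding temperature, the same helper would be called with a different `unit_fo4_ns` for each measurement temperature.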

Figure 2.15(a) shows that SLC has a delay of 3.37 FO4 at $25^{\circ} \mathrm{C}$, $2.3 \times$ faster than DCVS. Normalized to FO4 delay, SLC delay varies by only $9.5 \%$ over $10-100^{\circ} \mathrm{C}$, while DCVS changes by more than $2 \times$. In Figure 2.15(b), the new design has $9.9 \times$ lower static power at $25^{\circ} \mathrm{C}$, mainly due to the smaller pull-down devices. Also, active power is $5.9 \times$ lower than DCVS, demonstrating the benefits of reduced contention. Across $10-100^{\circ} \mathrm{C}$, the active power of SLC varies by $33 \%$, while DCVS exhibits $7.7 \times$ variation over the same range.

Figure 2.15(c) shows that SLC has a $5.2 \times$ smaller standard deviation in measured delay at $25^{\circ} \mathrm{C}$. The measured delay-power scatter plot in Figure 2.15(d) demonstrates much better robustness to process variations, especially at low temperature, since the exponential dependency of $I_{N M O S, O N}$ exacerbates the direct contention in DCVS.


Figure 2.15: Measured result comparisons


Figure 2.16: Yield comparison at very low temperature $\left(-25^{\circ} \mathrm{C}\right)$

Figure 2.15(e) and (f) show the effects of voltage/temperature variations. For a $10 \% V_{D D L}$ drop, DCVS LC delay degrades by $7.7 \times$, while SLC speed reduces by only $5.6 \%$. Although the DCVS LC is designed to operate at up to 20 MHz at $25^{\circ} \mathrm{C}$, some measured DCVS LCs fail to achieve 1 MHz operation at $20^{\circ} \mathrm{C}$ and overall its functionality severely degrades as temperature is lowered. In contrast, SLC operates reliably over the full temperature range of -20 to $100^{\circ} \mathrm{C}$. SLC robustness becomes more pronounced in severe conditions, as Figure 2.16 demonstrates all measured devices are functional even with $>10 \% V_{D D L}$ drop at very low temperature $\left(-25^{\circ} \mathrm{C}\right)$, whereas DCVS LC is essentially non-functional at this condition. For sensor node applications, it is critical to work in a range of environments to enable true 'ubiquitous' networks; hence the robustness of SLC is a key advantage for such systems.

|  | LC $^{2}$ | SLC | TVLSI'11 [21] | ESSCIRC'07 [19] |
| :---: | :---: | :---: | :---: | :---: |
| Technology | 130 nm | 130 nm | 130 nm | 180 nm |
| Conversion | 0.3 V to 2.5 V | 0.3 V to 2.5 V | 0.3 V to 2.5 V | 0.3 V to 1.8 V |
| Type | Static | Static | Dynamic <br> (w/2.5V clock) | Static |
| Delay | 41.51 ns | 58.78 ns | 125 ns | $\sim 600 \mathrm{~ns}$ |
| Static Power | 475 pW | 724 pW | N/A | N/A |
| Energy per <br> Transition | 229 fJ | 191 fJ | 1.7 pJ | $\sim 20 \mathrm{pJ}$ |
| Area | $102.26 \mu \mathrm{~m}^{2}$ <br> (including the diode chain) | $71.94 \mu \mathrm{~m}^{2}$ | $0.1118 \mathrm{~mm}^{2}$ | No silicon <br> implementation |

Table 2.1: Comparison of wide-range LCs at $25^{\circ} \mathrm{C}$

### 2.4 Conclusions

In this chapter, we presented new level converters and their measurements. Figure 2.17(a) shows the die photo and Table 2.1 shows comparisons to recent wide-range LCs.

Despite having more transistors than DCVS, LC ${ }^{2}$ is smaller than DCVS in layout even including the extra diode chain, which can be shared among multiple $\mathrm{LC}^{2} \mathrm{~s}$.

The static nature of LC ${ }^{2}$ and SLC does not require clocks or complex synchronizing schemes, enabling $1093 \times$ and $1554 \times$ smaller area, respectively, compared to [21], which is also fabricated in 130 nm CMOS. Compared to [21], LC ${ }^{2}$ shows $7.4 \times$ lower energy per transition and $3 \times$ faster speed, while SLC has $8.9 \times$ lower energy per transition.
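The ratios quoted against [21] follow directly from the entries of Table 2.1. The short check below reproduces them as a sketch; the dictionary keys and the helper name are illustrative, and the only inputs are the table values.

```python
# Values taken from Table 2.1 (measured at 25 C).
AREA_UM2 = {"LC2": 102.26, "SLC": 71.94, "ref21": 0.1118e6}  # 0.1118 mm^2 = 111,800 um^2
ENERGY_FJ = {"LC2": 229.0, "SLC": 191.0, "ref21": 1700.0}    # 1.7 pJ = 1700 fJ
DELAY_NS = {"LC2": 41.51, "SLC": 58.78, "ref21": 125.0}


def improvement(table: dict, design: str, baseline: str = "ref21") -> float:
    """Return how many times smaller the design's value is than the baseline's."""
    return table[baseline] / table[design]


for design in ("LC2", "SLC"):
    print(
        design,
        f"area x{improvement(AREA_UM2, design):.0f}",
        f"energy x{improvement(ENERGY_FJ, design):.1f}",
        f"delay x{improvement(DELAY_NS, design):.1f}",
    )
```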

SLC is $35 \%$ smaller than the conventional DCVS, making it the smallest LC reported for wide-range ( 0.3 V to 2.5 V ) conversions. We incorporated SLC in a previously reported low-power timer [42] and observed $15.8 \%$ reduction in switching energy; this improvement is conservative as the new timer includes overhead from an LDO regulator, which was not included in the previous design. Figure 2.17(b) shows the die photos of both timers. The new timer including SLC was successfully incorporated into the wireless sensor node system in the 130nm layer of [7]. This system also uses SLC (ported to 180 nm CMOS) for its CPU, memory, and power management unit (PMU) interfaces. This SLC consists of thick-oxide I/O devices ( $V_{T H}>700 \mathrm{mV}$ ) and successfully operates for a $0.6 \mathrm{~V}-3.6 \mathrm{~V}$ conversion range.
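The switching-energy comparison between the two timers can be reproduced from the power and frequency annotations in Figure 2.17(b). The sketch below assumes that energy per switching event is simply average power divided by switching frequency, and that the quoted reduction is computed from the per-event energies rounded to two decimals, as annotated in the figure; the function name is illustrative.

```python
def energy_per_switch_nj(power_w: float, freq_hz: float) -> float:
    """Average power divided by switching frequency, in nJ per switching event."""
    return power_w / freq_hz * 1e9


# Annotations from Figure 2.17(b).
old_timer_nj = round(energy_per_switch_nj(660e-12, 0.36), 2)  # previous timer, no LDO
new_timer_nj = round(energy_per_switch_nj(8.6e-9, 5.6), 2)    # new timer with SLC, incl. LDO

reduction_pct = (old_timer_nj - new_timer_nj) / old_timer_nj * 100
print(f"{old_timer_nj} nJ -> {new_timer_nj} nJ ({reduction_pct:.1f}% lower)")
```

Using the unrounded per-event energies instead gives a slightly larger reduction, so the rounding step matters when matching the quoted figure.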

(a)


Previously reported timer in 130 nm CMOS (NOT Including LDO) $660 \mathrm{pW} / 0.36 \mathrm{~Hz}=1.83 \mathrm{~nJ} /$ switching


New timer with SLC in 130 nm CMOS (including LDO) $8.6 \mathrm{nW} / 5.6 \mathrm{~Hz}=1.54 \mathrm{~nJ} /$ switching
(b)

Figure 2.17: (a) Die photo of the test chip, (b) Die photos of low voltage timer designs [42][7]

## CHAPTER 3

## A Robust 7T SRAM Design

### 3.1 Introduction

SRAM suffers from reduced robustness due to severe process variation in nanoscale CMOS. In particular, it is challenging to jointly ensure reliable READ and WRITE operation in conventional 6T SRAM. As a result, 8T and even larger bitcells are widely used, particularly for low-voltage memories; they isolate READ and WRITE operations, so it is possible to separately optimize their robustness. However, this added robustness comes at the expense of density; 8 T bitcells incur $\sim 30 \%$ area overhead compared to minimum achievable 6T bitcells [24][26][25]. In addition, 8T bitcells exhibit the so-called "Half-Select" problem making it difficult to apply column-muxing, as necessary for high array efficiency and SER robustness [25]. These issues are further complicated in emerging low power sensor systems due to ultra-low leakage requirements. For instance, the modular sensing system in [7] requires fW/bit standby power, necessitating the use of a 10T HVT bitcell (marked as ' $K$ ' in Figure 3.1) that is $4 \times$ larger than a commercial 6T SVT bitcell. Such area penalties are often not acceptable and hence there is a need for low leakage, low voltage tolerant designs that also achieve reasonable density.

### 3.2 Ultra Low-Leakage 7T SRAM

In this chapter, we propose a novel 7T SRAM that has decoupled READ/WRITE operation, similar to an 8T SRAM. It achieves robust operation at low voltage with $3.35 \mathrm{fW} / \mathrm{bit}$ standby


Figure 3.1: Bitcell size and standby power
power and reduces the area penalty of an 8T bitcell by $47 \%$. It features a new dynamic read completion detection technique to avoid short-circuit current during READ and uses a PMOS Pass Gate (PG) combined with dual supply voltages to mitigate the Half-Select problem and enable bit-interleaving. Prior 7T bitcells, using an L-shape layout, were presented in [29][30]. However, [29] uses tunneling FETs while [30] does not address the power overhead incurred by substantial short-circuit current during READ. Furthermore, [30] depends on a Write-Back scheme to enable bit-interleaving, causing area/power overhead. The proposed 7T SRAM (8kB macro, 32-bit I/O with 2-way column-muxing) was fabricated in 180 nm CMOS and addresses these issues while also providing extremely low leakage, making this SRAM applicable to low power applications without sacrificing area efficiency (Figure 3.1).

### 3.2.1 Auto-Shut-Off Sensing

Figure 3.2 shows the proposed 7T bitcell, which includes an HVT 6T portion and a single SVT READ Device (RD). As depicted in Figure 3.3, conventional READ in a 7T topology causes


* N-WELL connected to $\mathrm{V}_{\mathrm{DDH}}$

Figure 3.2: 7T bitcell schematic and the L-shaped layout
large short-circuit current from unselected cells ( $I_{U N S E L}$ ) once $V_{R B L}$ drops below $V_{D D}-V_{T H}$, turning on READ Devices (RD) along the column in bitcells storing Data1. This $I_{U N S E L}$ limits the BL swing and incurs a large power penalty. The proposed 7T SRAM introduces an Auto-Shut-Off mechanism in which the selected READ wordline (RWLB) is automatically disabled during READ, thereby maintaining $V_{R B L}$ above $V_{D D}-V_{T H}$ and cutting off $I_{U N S E L}$. The READ wordline is not disabled if all selected bitcells store Data0. The proposed 7T SRAM uses dual voltages ( $V_{D D}=0.6 \mathrm{~V}, V_{D D H}=0.95 \mathrm{~V}$ ) to provide a wider BL swing with negligible $I_{U N S E L}$. As shown in Figure 3.3, Auto-Shut-Off sensing with dual voltages reduces 7T READ energy by $6.8 \times$ (measured). The Auto-Shut-Off technique employs two sense amplifiers: coarse and fine (Figure 3.4). Once the fastest column discharges RBL sufficiently to trigger the coarse sense amp, RSTB (Reset Bar) is discharged, lowering RWL_EN to deactivate all wordlines so that all RBLs stop discharging and become floating. RSTB also asserts SAE, which fires the fine sense amp and isolates it


Figure 3.3: Auto-Shut-Off sensing and the measured improvement in READ energy
from RBL. Since the operation is stopped by the fastest column, the slowest column may have discharged a much smaller amount due to variations. To address this, the coarse sense amp must be margined to guarantee sufficient voltage differential for the fine sense amp to correctly detect the slowest RBL discharge. The fine sense amp is a biased topology designed to correctly detect voltage swings as small as 60 mV . In the All-Data0 case, RBL remains high at $V_{D D H}$, as does RSTB. In this case, RWLB and RBL are reset at the falling edge of PULSE (Figure 3.4).


Figure 3.4: Circuit implementation of Auto-Shut-Off sensing

### 3.2.2 Quasi-Static READ

The proposed dual-$V_{D D}$ 7T SRAM exhibits an innate bitline leakage suppression effect in unselected bitcells resulting from negative $V_{G S}$ on their READ devices. When reading Data0 as in Figure 3.5, the worst-case scenario in an 8T occurs when all unselected bitcells on a column have Data1, maximizing bitline leakage current. In contrast, $I_{\text {LEAK }}$ from unselected cells in the 7T topology flows in the opposite direction, and therefore can help keep RBL high. Thus, the worst case in a 7T occurs when all unselected bitcells also have Data0, creating a larger negative $V_{G S}$ in unselected bitcells and thus reducing the beneficial $I_{\text {LEAK }}$ while increasing $I_{G A T E}$. However, $I_{G A T E}$ is significantly smaller than $I_{L E A K}$, especially at high temperature. Also, due to the negative $V_{G S}$ ( $=V_{D D L}-V_{D D H}$ or $-V_{D D H}$, depending on cell data), $I_{L E A K}$ is greatly suppressed and becomes negligible. Simulation shows that 7T bitline leakage is $113 \times$ smaller than in an 8T, such that the design shows quasi-static READ behavior. 8T SRAM generally requires a bitline keeper at low


Figure 3.5: Quasi-Static READ
frequencies, which creates additional complexity, requires margining, and reduces robustness at low $V_{D D}$. The proposed 7T maintains robust operation without the keeper across supply voltages, as shown by measured results in Figure 3.6.

### 3.2.3 Bit-Interleaving with PMOS Pass-Gate

The use of conventional NMOS PG devices makes bit-interleaving difficult in low-voltage memories. As shown in Figure 3.7, $V_{G S}=V_{W W L}$, which is $V_{D D H}$ for both the written and the half-selected cells. Reducing $V_{W B L}$ in the half-selected cells does not substantially improve the margin between $I_{P G}$ (WRITE) and $I_{P G}$ (Half-Select), causing the PG device to fully transfer $V_{W B L(B)}\left(=V_{D D}\right)$ to the internal node during Half-Select. Several techniques [30][44] have been proposed to address this problem, resulting in significant complexity and area overhead. The proposed dual-voltage 7T


Figure 3.6: Measured improvement in read error rate due to Quasi-Static READ
instead uses a PMOS PG such that $\left|V_{G S}\right|=V_{W B L(B)}$ and PG strength can be modulated by applying different $V_{W B L(B)}$ in WRITE and Half-Select columns. Also, the PMOS PG is reverse body-biased during Half-Select ( $V_{B S}=V_{D D H}-V_{D D}$ ), increasing $V_{T H}$ of these HVT devices such that the PG operates in the near-$V_{T H}$ regime. This increases sensitivity of the PG to $V_{G S}$ through $V_{W B L(B)}$ modulation, allowing us to further separate the Half-Select and WRITE PG currents as shown in Figure 3.7, in which a 0.35 V change in $V_{W B L}$ between WRITE $\left(V_{W B L}=V_{D D H}\right)$ and Half-Select $\left(V_{W B L}=V_{D D}\right)$ changes drain current by $\sim 10^{4} \times$ at the TT corner. This controllability enables true column multiplexing without area overhead. Measurements in Figure 3.8 show that $V_{D D H}-V_{D D}>100 \mathrm{mV}$ is sufficient to create enough $V_{G S}$ sensitivity of the PG, resulting in no READ error from Half-Select columns. Since NWELL is biased at $V_{D D H}$, this reverse body-biasing also reduces standby power, which is minimized at $V_{D D H}-V_{D D}=200 \mathrm{mV}$.


Figure 3.7: Bit-interleaving with PMOS pass-gate


Figure 3.8: Effects of body biasing

### 3.3 Conclusions

A new 7T SRAM was fabricated in 180 nm CMOS, and the 8 kB macro shows the benefits from the novel Auto-Shut-Off sensing, Quasi-Static READ, and the bit-interleaving with PMOS PG devices. This 7T cell is $2.3 \times$ smaller than the 10T bitcell in [7], but still enables fW/bit standby power ( $3.35 \mathrm{fW} / \mathrm{bit}$ ). It shows $>3500 \times$ reduction in standby power compared to a commercial 6 T bitcell. Figure 3.9 is a Shmoo plot showing $V_{\text {MIN }}$ of 320 mV . Table 3.1 shows a comparison with other low-power SRAMs, where the lowest bitcell leakage power and the column-muxing without extra circuit overhead (e.g., Write-Back) of the proposed 7T are clearly noticeable. The proposed bitcell is $20 \%$ larger than the 6 T bitcell, while the 8 T in [27] and the 10T in [43] have more than $60 \%$ increase in bitcell size. The die photo is shown in Figure 3.10.


Figure 3.9: Shmoo plot

|  | This Work | JSSC'13 [30] | ISSCC'06 [43] | JSSC'09 [27] |
| :---: | :---: | :---: | :---: | :---: |
| Devices | $7 \mathrm{~T}(\mathrm{HVT})$ | $7 \mathrm{~T}(\mathrm{SVT})$ | $10 \mathrm{~T}(\mathrm{SVT})$ | $8 \mathrm{~T}(\mathrm{SVT})$ |
| Process | 180 nm | 65 nm | 65 nm | 130 nm |
| Voltage | Nominal $V_{D D}=0.6 \mathrm{~V}$ <br> $V_{M I N}=0.32 \mathrm{~V}$ | $V_{M I N}=0.26 \mathrm{~V}$ | 0.4 V | Nominal $V_{D D}=1.2 \mathrm{~V}$ <br> $V_{M I N}=0.23 \mathrm{~V}$ |
| Bitcell Size <br> (Normalized by 6T) | $7.75 \mu \mathrm{~m}^{2}\left(239 \mathrm{~F}^{2}\right)$ <br> $=1.20 \times 6 \mathrm{~T}(\mathrm{HVT})$ <br> $=1.66 \times 6 \mathrm{~T}(\mathrm{SVT})$ | $<1.15 \times 6 \mathrm{~T}(\mathrm{SVT})$ | $1.66 \times 6 \mathrm{~T}(\mathrm{SVT})$ | $6.36 \mu \mathrm{~m}^{2}\left(442 \mathrm{~F}^{2}\right)$ <br> $=3.12 \times 6 \mathrm{~T}(\mathrm{SVT})$ |
| \#Bitcells/bitline | 128 | 256 | 256 | 512 |
| Column Muxing | $2: 1(\mathrm{w} / \mathrm{o} \mathrm{assist)}$ | $8: 1(\mathrm{w} / \mathrm{Write-Back)}$ | No | Not Reported |
| Energy | $390 \mathrm{fJ} / \mathrm{bit} \mathrm{@} 0.6 \mathrm{~V}$ | $350 \mathrm{fJ} / \mathrm{bit} @ 0.26 \mathrm{~V}$ | $54 \mathrm{fJ} / \mathrm{bit}$ | $2.69 \mathrm{pJ} / \mathrm{bit} \mathrm{@} \mathrm{0.23V}$ |
| Leakage/Bit | $3.35 \mathrm{fW} / \mathrm{bit}$ | Not Reported | $7.6 \mathrm{pW} / \mathrm{bit}$ | $45 \mathrm{pW} / \mathrm{bit}$ |

Table 3.1: Comparison of low-power SRAMs
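The normalized bitcell size reported for this work in Table 3.1 can be cross-checked with a short calculation. This is a sketch under the assumption that $F$ is the 0.18 µm feature size of the 180 nm process; the implied 6T areas are back-calculated from the table's ratios and are not reported values.

```python
def area_in_f2(area_um2: float, feature_um: float) -> float:
    """Normalize a bitcell area to squared feature size (F^2)."""
    return area_um2 / feature_um ** 2


BITCELL_UM2 = 7.75   # proposed 7T bitcell, Table 3.1
FEATURE_UM = 0.18    # assumed feature size for the 180 nm process

bitcell_f2 = area_in_f2(BITCELL_UM2, FEATURE_UM)

# 6T areas implied by the 1.20x (HVT) and 1.66x (SVT) ratios in Table 3.1.
implied_6t_hvt_um2 = BITCELL_UM2 / 1.20
implied_6t_svt_um2 = BITCELL_UM2 / 1.66

print(f"{bitcell_f2:.0f} F^2")
print(f"implied 6T HVT: {implied_6t_hvt_um2:.2f} um^2, 6T SVT: {implied_6t_svt_um2:.2f} um^2")
```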


Figure 3.10: Die photo

## CHAPTER 4

## A Static Single-Phase Contention-Free Flip-Flop

### 4.1 Introduction

Near-threshold computing (NTC) is an attractive solution to stagnating energy efficiencies in digital integrated circuits, arising from slowed voltage scaling in nanometer CMOS [15][45]. However, the design of sequential elements for NTC, as well as for voltage-scaled systems operating at both near-threshold and super-threshold, has not been extensively studied; a recent study analyzes and compares many existing flip-flop topologies [33][34], but it is limited to full $V_{D D}$ (i.e., super-threshold) operation and does not take into account process/voltage/temperature (PVT) variations. In NTC, these variations become a critical concern for circuit robustness, and correct operation at one PVT corner does not necessarily guarantee functional correctness at other PVT corners. Sequential elements are no exception, and it is well known that they have a strong sensitivity to process variations in NTC [45], which can have a significant impact on system yield and power consumption. In order to achieve reliable energy-efficient operation across a wide operating voltage range, a flip-flop should have the following attributes: a) static operation, since dynamic nodes are highly susceptible to PVT variations at low voltage; b) contention-free transitions, since ratioed logic has poor robustness across the wide range of device $I_{\text {ON }} / I_{\text {OFF }}$ ratios incurred with voltage scaling; c) single-phase clocking, which avoids toggling of internal clock inverters and the corresponding power penalty; d) minimum or no area penalty compared to conventional flip-flops.

While many flip-flops have been proposed, no prior design meets all these requirements for an

|  | TGFF | ACFF | TGPL | TSPC |
| :---: | :---: | :---: | :---: | :---: |
| Static Operation | YES | YES | YES | NO |
| Single-Phase Clock | NO | YES | NO | YES |
| Contention-Free | YES | NO | YES | YES |
| Device Count | 24 | 22 | 28 | 11 |

Table 4.1: Comparison of conventional flip-flops
energy-efficient, highly voltage-scalable sequential element [33][34][35][36][37]. In the following sections, we briefly discuss the issues with conventional flip-flops, and then present a new flip-flop that has all the above-mentioned characteristics. Details of its operation and a beneficial "simple hold time path" are presented, followed by measured data and comparisons with conventional flip-flops.

### 4.2 Previous Flip-Flops

Figures 4.1 and 4.2 show schematics of several common flip-flop designs: transmission-gate flip-flop (TGFF), which is widely used in commercial standard-cell libraries; adaptive-coupling flip-flop (ACFF) [35]; transmission-gate pulsed-latch (TGPL) [36]; and true single-phase clock flip-flop (TSPC) [37]. Shortcomings of these flip-flops are summarized in Table 4.1.

The conventional TGFF is completely static and contention-free thus showing robust operations with voltage scaling. Its robustness and a highly-optimized cell layout with 24 transistors make it a de facto standard in commercial standard-cell libraries. However, it exhibits high power consumption due to a large number of clocked nodes (i.e., not single-phase clocked). It is possible to remove the two clock inverters from TGFF and distribute both CK and CKB through a clock tree design; this reduces the number of the always-toggling clock nodes in the flip-flop, but handling both polarities with ever-increasing clock skew is not an attractive option for voltage-scaled designs.

ACFF [35] is a static flip-flop which also incorporates single-phase clocking and has fewer devices than TGFF. The single-phase clock and the lower device count result in lower energy consumption at low activity ratios in the super-threshold regime. However, it has degraded state-holding in the master latch. For example, suppose that $F N=0$ and $F=1$ right before the positive


Figure 4.1: Schematics of TGFF and ACFF [35]


Figure 4.2: Schematics of TGPL [36] and TSPC [37]


Figure 4.3: Waveforms in TSPC when D stays 0 for consecutive cycles
edge of CK, which also means $B N=0, B=1, G N=0$, and $G=1$. With the CK rising transition, M1 and M3 turn off, and $F N$ is held low by the $G N$ node through M6, while $F$ is held high by the $G$ node through M7. If D changes during the $\mathrm{CK}=1$ phase, $B N$ and $B$ change their values (i.e., $B N=1$ and $B=0$ ), turning off M6/M7 and turning on M5/M8. As a result, $F N$ is kept low through a PMOS (M5) and $F$ is kept high through an NMOS (M8), which is undesirable for low-voltage operation. ACFF also experiences current contention in the slave latch when updating the $H$ and $H N$ nodes through M2 and M4; this causes active power to increase rapidly with activity ratio and leads to functional failures at low voltage. This contention can be suppressed at the expense of additional devices, which brings the total to 26 transistors.

TGPL [36] is based on pulsed operation and achieves high performance at full $V_{D D}$ but has poor robustness at low $V_{D D}$ due to increased process variation sensitivity in pulse generation. Its hold time requirement is determined by the pulse width, hence the hold time of TGPL is positive, unlike the above-mentioned flip-flops. At low $V_{D D}$, the pulse width becomes unpredictable, as does the hold time, because the delay element used for pulse generation becomes quite susceptible to PVT variations. This often results in excessive hold time margining at design time, which causes power and area overhead.

TSPC [37] employs single-phase clock operation and uses only 11 devices. However, its dynamic operation degrades robustness, especially at low $V_{D D}$. In addition, Figure 4.3 illustrates a non-negligible glitch at node $Q N$ in TSPC whenever CK goes high while D remains 0. This arises because the precharged net2 begins to discharge $Q N$ before M5/M6 can pull net2 low. Although $Q N$ eventually recovers to the correct state (=high) through the discharged net2 and M7, this glitch results in unnecessary power consumption or even system malfunction. From Monte Carlo simulations in 45 nm SOI, the glitch at Q exceeds $V_{D D} / 2$ with $\sim 1 \%$ probability ( $92 / 10,000$ Monte Carlo simulations, $V_{D D}=1.0 \mathrm{~V}$ ), potentially allowing for propagation to subsequent logic.

### 4.3 S$^{2}$CFF (Static Single-phase Contention-free Flip-Flop)

### 4.3.1 Schematic and Operation Details

This work presents a new flip-flop, referred to as $\mathrm{S}^{2} \mathrm{CFF}$ (Static Single-phase Contention-free Flip-Flop) that meets all the requirements mentioned in the introduction; it is static, completely contention-free, and uses single-phase clocking. It has the same device count as a TGFF, with only a $7 \%$ increase in layout size that corresponds to one poly-pitch increase in 45 nm technology where fixed poly-pitch is enforced. Figure 4.4 shows the $S^{2} \mathrm{CFF}$ schematic, and the detailed operations are described in Figure 4.5 where grayed-out devices indicate OFF devices while others are ON.

In the schematic, M1~M4 form an inverter during the $\mathrm{CK}=0$ phase. Hence, net1 holds the inverted D value when $\mathrm{CK}=0$. Since M3 is fully turned on by the precharged net2 (precharged through M8), any change in D can propagate to net1, i.e., the stage is transparent, and both net1 and net2 are static during the $\mathrm{CK}=0$ phase. At the positive edge of CK, depending on the net1 value, net2 either stays high or is discharged through M9~M10. This updates the slave latch (M17~M22): $Q N$ is charged up by M13 if net2 goes low; otherwise, $Q N$ is discharged through M14~M16 if net2 stays high. In this $\mathrm{CK}=1$ phase, M22 is conditionally turned on/off depending on the net2 value (data-dependent), while M19 is always off. M3 is an isolation device that prevents a change in D from affecting net1. M5~M7 are keeper devices that make net1/net2 fully static. M11~M12 generate the net1b signal, which controls the keeper (M7) as well as the glitch prevention device (M15), as explained later.


Figure 4.4: Schematic of $S^{2} C F F$

If $\mathrm{D}=0$, net1 holds the inverted D value (=high) and net2 precharges through M8 while $\mathrm{CK}=0$. In this state, no keeper is needed; the keepers M5 and M6 are off because both net1 and net2 are high, and the keeper M7 is also off since net1b is low. The slave latch (M17~M22) stores the previous data and is isolated from the previous stage because M13 and M14 are turned off. At the positive edge of CK, the high net1 starts discharging net2 through M9 and M10. Then, the discharged net2 turns off M3, completely isolating the circuit from changes in D. Also, the low net2 charges $Q N$ through M13, updating the data in the slave latch. The low net2 activates the keeper M5, which holds net1 high. M9 and M10 keep net2 low during the $\mathrm{CK}=1$ phase.

If $\mathrm{D}=1$, net1 holds the inverted D value (=low) and net2 precharges through M8 while $\mathrm{CK}=0$, the same as before. However, the positive edge of CK does not generate any dynamic transitions at net1 and net2, since the low net1 turns off M9 so that net2 simply stays in the precharged state (=high) after the clock rising transition. Note that net1 is kept low by M7/M10, and M6 holds net2 high during the $\mathrm{CK}=1$ phase. If the previous Q value is the same as the current D input (i.e., $\mathrm{Q}=1, Q N=0$ ), there is also no transition at $Q N$. Otherwise, $Q N$ discharges through M14~M16. Although M3 stays on during the $\mathrm{CK}=1$ phase due to the high net2, it does not affect the net1 state (=low). If D changes from 1 to 0 during the $\mathrm{CK}=1$ phase, it cuts off the discharging path (M3~M4) by turning off M4; however, net1 is still held low by M7 and M10, so it remains static.

Signal net1b is also used to control M15 to prevent glitches; without this sub-circuit, $Q N$ will


Figure 4.5: Operation of $S^{2} \mathrm{CFF}$
glitch when CK rises with D staying low in consecutive cycles, similar to TSPC. M15 eliminates this glitch by cutting off the discharge path (M14~M16) depending on the net1b value; M15 turns off if net1 is high (i.e., $\mathrm{D}=0$, net1b=0), hence $Q N$ can stay high without a glitch. M15 stays on if net1 is low (i.e., $\mathrm{D}=1$, net1b=1), so $Q N$ can be discharged as intended through M14~M16 in this case.

It should be noted that there is no contention throughout the operation, all internal nodes are fully static, and only one clock phase (CK) is used. Moreover, all of this is achieved with 24 transistors, the same as in TGFF. This implies that the area penalty is negligible, if not zero.
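The operation described in this section can be summarized as a small cycle-level logic model. This is only an illustrative sketch: the class and method names are assumptions, it tracks logic levels of net1, net2, net1b, and $QN$ only, and it ignores delays, keeper strengths, and all analog effects.

```python
class S2CFFModel:
    """Logic-level sketch of the S2CFF behavior described in Section 4.3.1."""

    def __init__(self, q: int = 0) -> None:
        self.ck = 0
        self.net1 = 1      # inverted D while CK = 0
        self.net2 = 1      # precharged high while CK = 0
        self.qn = 1 - q    # slave latch state

    @property
    def net1b(self) -> int:
        # Complement of net1, generated by M11~M12.
        return 1 - self.net1

    def step(self, ck: int, d: int) -> int:
        """Apply clock level ck and data d; return the resulting Q."""
        if ck == 0:
            self.net2 = 1          # M8 precharges net2
            self.net1 = 1 - d      # M1~M4 act as a transparent inverter
        elif self.ck == 0:         # rising edge of CK
            if self.net1 == 1:     # D was 0 before the edge
                self.net2 = 0      # discharged via M9~M10; M3 now isolates D
                self.qn = 1        # M13 charges QN
            else:                  # D was 1: net1b = 1 keeps M15 on
                self.qn = 0        # QN discharges via M14~M16
        # CK staying high: keepers hold net1/net2, D is ignored.
        self.ck = ck
        return 1 - self.qn


ff = S2CFFModel(q=0)
stimulus = [(0, 1), (1, 1), (1, 0), (0, 0), (1, 0), (1, 1)]  # (CK, D) pairs
trace = [ff.step(ck, d) for ck, d in stimulus]
print(trace)
```

In this model Q changes only on a rising clock edge and always takes the D value seen during the preceding $\mathrm{CK}=0$ phase, which is the positive-edge-triggered behavior the text describes.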

### 4.3.2 Hold Time Path

An additional benefit of the $S^{2} \mathrm{CFF}$ topology is that it simplifies the "hold time path" compared to a regular TGFF. Figure 4.6 shows the hold time paths of TGFF and $S^{2} \mathrm{CFF}$. As described in [38], the worst-case hold time in a TGFF occurs when D changes from 1 to 0 just after the CK rising transition. Due to clock inversion in I4, the PMOS in I2 always turns off later than its NMOS. The 0-to-1 transition at node $D N$ (1-to-0 at D) has more time to propagate through I2 compared


Figure 4.6: Hold time paths in TGFF and $\mathrm{S}^{2} \mathrm{CFF}$
to the 1-to-0 transition at node $D N$ (0-to-1 at D). Also, the clocked PMOS in I5 always turns on earlier than its NMOS counterpart, thereby weakening the pull-down strength at node $M N$. Hence, node $M N$ becomes more vulnerable to the 0-to-1 transition (1-to-0 at D) around the positive edge of CK. In addition, the data arrival time at $D N$ is dictated by I1, while the clock arrival time at I2 is determined by I3 and I4. Thus, in sum, TGFF hold time is dictated by the mismatch among the clock/data inverters (I1, I3, I4), causing severe hold time degradation at low $V_{D D}$ where mismatch is accentuated.

On the contrary, the worst-case hold time in $S^{2} \mathrm{CFF}$ occurs when D changes from 0 to 1 just after the CK rising transition. The high net1 starts discharging net2, and the discharged net2 turns off M3, isolating the D input. A hold failure may occur if D becomes 1 before net2 shuts off M3,


Figure 4.7: Setup/hold time measurement circuit
and thus discharges net1. Only the discharging speed of net2 through $\mathrm{PATH}_{\mathrm{CK}}$ (M9 and M10) dictates the hold time. Note that the $\mathrm{PATH}_{\mathrm{D}}$ (M3 and M4) delay does not affect the worst-case hold time mentioned above: if $\mathrm{PATH}_{\mathrm{D}}$ is faster than $\mathrm{PATH}_{\mathrm{CK}}$, there is always a hold violation, so the (required) hold time must be the $\mathrm{PATH}_{\mathrm{CK}}$ delay (or less); if $\mathrm{PATH}_{\mathrm{D}}$ is slower than $\mathrm{PATH}_{\mathrm{CK}}$, there is no hold violation at all. As a result, the hold time of $\mathrm{S}^{2} \mathrm{CFF}$ is determined solely by the discharging speed through $\mathrm{PATH}_{\mathrm{CK}}$ and is thus much less prone to variability than that of a TGFF, which involves the time difference of several gate delays. The plot in Figure 4.6 shows a substantial reduction ( $3.4 \times$ ) in hold time at the $3\sigma$ value at 0.32 V for $\mathrm{S}^{2} \mathrm{CFF}$ (Monte Carlo simulations). This suggests a large potential benefit for NTC, since small hold time variation reduces buffer-insertion overhead, reducing power and improving system yield.

### 4.4 On-Chip Testing Circuits

On-chip testing circuits are required to accurately measure the timing characteristics of sequential elements, such as setup/hold time and C-Q delay. It is also important to measure flip-flop power under various conditions. The following sub-sections discuss each testing circuit.

### 4.4.1 Setup/Hold Time

The on-chip setup/hold time measurement circuit is shown in Figure 4.7, which is based on the structure in [46]. The fast main clock ( $\sim 1.5 \mathrm{GHz}$ ) is divided by 32768 to generate a sufficiently slow periodic signal. The Coarse Control block generates two periodic signals based on the divided clock, and one signal can be made to lag or lead the other using the COARSE_TUNE bits. One signal becomes the data input to the DUT (net_data), while the other becomes the clock input (net_clk). The Coarse Control block is essentially a counter clocked by the fast main clock, so its control resolution is determined by the main clock frequency. The Fine Control block is a long inverter chain. The data path and the clock path each have their own Fine Control block, so the delays are controlled separately using tuning bits (FINE_TUNE_DATA and FINE_TUNE_CLK), with a control resolution of one FO1 delay. Finally, Analog Control consists of current-starved inverter chains whose delay is controlled by analog voltages ( $V_{\text {BIAS\_DATA }}$ and $V_{\text {BIAS\_CLK }}$ ), providing an even finer resolution ( $<1 \mathrm{ps}$ ). This Delay Control Block creates a delay difference ( $T_{D}$ ), and the two signals are delivered to the DUT through buffers. The Phase Detector is used to align the edges of the data/clock signals on net_data and net_clk. Starting from this alignment, which indicates $T_{D U T}=0$, a slight time difference can be introduced by changing the tuning bits or the bias voltages in the Delay Control Block, while the Error Counter determines whether there is a setup/hold failure by checking the DUT output. Pulse Gen generates a pulse whose width corresponds to $T_{D U T}+T_{\text {OFFSET }}$. This pulse width is then measured using the sub-1ps resolution TDC [47]. At full $V_{D D}$, buffer mismatch ( $\Delta T_{B 1}$ and $\Delta T_{B 2}$ ) is negligible compared to $T_{D}$, and setup/hold times can be accurately measured.
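To give a sense of scale for the coarse stage, the sketch below computes the divided-clock rate and the coarse delay step. It assumes a main clock of exactly 1.5 GHz (the text gives only an approximate value) and that one coarse step equals one main-clock period, which is an interpretation of "determined by the main clock frequency" rather than a stated figure.

```python
MAIN_CLOCK_HZ = 1.5e9   # assumed exact value of the ~1.5 GHz main clock
DIVIDE_RATIO = 32768    # division ratio stated in the text

# Frequency and period of the slow periodic signal fed to the DUT.
divided_clock_hz = MAIN_CLOCK_HZ / DIVIDE_RATIO
divided_period_us = 1e6 / divided_clock_hz

# Assumed coarse resolution: one period of the main clock.
coarse_step_ps = 1e12 / MAIN_CLOCK_HZ

print(f"divided clock: {divided_clock_hz / 1e3:.2f} kHz ({divided_period_us:.2f} us period)")
print(f"coarse step:   {coarse_step_ps:.1f} ps")
```

Under these assumptions the coarse step is several hundred picoseconds, which is consistent with the need for the FO1-resolution Fine Control and the sub-picosecond Analog Control stages that follow it.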

### 4.4.2 C-Q Delay

The C-Q delay measurement circuit is shown in Figure 4.8. It incorporates a new flip-flop ring, where a short pulse at EN input triggers the oscillation of DUT Ring with a period that is proportional to $T_{C Q}$ with an offset value.

$$
\begin{equation*}
T_{P, O S C}=2 N \times T_{C Q}+2 N \times T_{M}+N \times\left(T_{B}+T_{I}\right) \tag{4.1}
\end{equation*}
$$

where $N$ is the number of Unit Cells in a ring and $T_{C Q}$ is the C-Q delay of the DUT. $T_{M}, T_{B}$, and $T_{I}$ represent the mux, buffer, and inverter delays in a Unit Cell, respectively. The offset value can be measured using the Reference Ring. The period of the oscillation in the Reference Ring is:

$$
\begin{equation*}
T_{P, R E F_{-} O S C}=2 N \times T_{M}+N \times\left(T_{B}+T_{I}\right) \tag{4.2}
\end{equation*}
$$



Figure 4.8: C-Q delay measurement circuit

Thus, the average C-Q delay can be obtained by subtracting $T_{P, R E F_{-} O S C}$ from $T_{P, O S C}$ and dividing by $2 N$.

$$
\begin{equation*}
T_{C Q}=\left(T_{P, O S C}-T_{P, R E F_{-} O S C}\right) / 2 N \tag{4.3}
\end{equation*}
$$

With a large $N$ value, local mismatch is effectively cancelled out making it possible to obtain accurate C-Q delays. While only 4 unit cells are shown in the figure for simplicity, the actual test chip implementation includes 100 Unit Cells in DUT Ring ( $N=100$ ). Reference Ring alternates Unit Cell A and Unit Cell B, with 50 of each in the full ring. The DUT Ring also gives insight on DUT yield, since oscillation stops unless all 100 DUTs in the ring are functional.
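The extraction of $T_{CQ}$ from the two ring periods in Eqs. (4.1) to (4.3) can be sketched as follows. The per-stage delays used here are hypothetical placeholders rather than measured values; they only demonstrate that the mux, buffer, and inverter terms cancel when the Reference Ring period is subtracted.

```python
def ring_period(n_cells, t_cq, t_mux, t_buf, t_inv):
    """DUT Ring oscillation period, Eq. (4.1)."""
    return 2 * n_cells * t_cq + 2 * n_cells * t_mux + n_cells * (t_buf + t_inv)


def reference_period(n_cells, t_mux, t_buf, t_inv):
    """Reference Ring oscillation period, Eq. (4.2)."""
    return 2 * n_cells * t_mux + n_cells * (t_buf + t_inv)


def cq_delay(t_osc, t_ref_osc, n_cells):
    """Average C-Q delay, Eq. (4.3)."""
    if n_cells <= 0:
        raise ValueError("n_cells must be positive")
    return (t_osc - t_ref_osc) / (2 * n_cells)


# Hypothetical per-stage delays in picoseconds (illustration only).
N = 100
t_osc = ring_period(N, t_cq=34.0, t_mux=20.0, t_buf=15.0, t_inv=10.0)
t_ref = reference_period(N, t_mux=20.0, t_buf=15.0, t_inv=10.0)
recovered = cq_delay(t_osc, t_ref, N)
print(f"recovered T_CQ = {recovered:.1f} ps")
```

The recovered value equals the `t_cq` that was fed in, regardless of the values chosen for the other stage delays.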

### 4.4.3 Power

Figure 4.9 shows the power measurement circuit, where the activity ratio is controlled from $0 \%$ to $100 \%$ by loading the 20-bit INITIAL_PATTERN, as shown in Table 4.2. In order to mimic a realistic scenario, one clock buffer drives 10 DUTs. The current flowing into 'CLKBUF + 10 DUTs' is measured and then divided by 10. Hence, the measured power consumption reported in this chapter also takes the clock driving power into account.


Figure 4.9: Power measurement circuit

| INITIAL_PATTERN[19:0] | Activity Ratio |
| :---: | :---: |
| 00000000000000000000 | $0 \%$ |
| 10000000000000000000 | $10 \%$ |
| 10100000000000000000 | $20 \%$ |
| 10101000000000000000 | $30 \%$ |
| 10101010000000000000 | $40 \%$ |
| 10101010100000000000 | $50 \%$ |
| 10101010101000000000 | $60 \%$ |
| 10101010101010000000 | $70 \%$ |
| 10101010101010100000 | $80 \%$ |
| 10101010101010101000 | $90 \%$ |
| 10101010101010101010 | $100 \%$ |

Table 4.2: Setting activity ratio in power measurement circuit
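The mapping in Table 4.2 follows a regular rule: each additional 10% of activity adds one more "10" pair at the most significant end of the pattern. A minimal sketch of that rule, together with the per-flip-flop power normalization described above, is given below. The function names and the example current are hypothetical.

```python
def initial_pattern(activity_percent: int, width: int = 20) -> str:
    """Return the INITIAL_PATTERN bit string (MSB first) per Table 4.2."""
    if activity_percent % 10 != 0 or not 0 <= activity_percent <= 100:
        raise ValueError("activity ratio must be a multiple of 10 in 0..100")
    pairs = activity_percent // 10
    # Leading "10" pairs, padded with zeros to the full pattern width.
    return ("10" * pairs).ljust(width, "0")


def power_per_flip_flop(total_current_a: float, vdd_v: float, n_duts: int = 10) -> float:
    """Power per DUT, including its share of the clock buffer power."""
    return total_current_a * vdd_v / n_duts


for ratio in (0, 20, 50, 100):
    print(f"{ratio:3d}% -> {initial_pattern(ratio)}")

# Hypothetical reading: 100 uA drawn by 'CLKBUF + 10 DUTs' at 1.0 V.
print(f"{power_per_flip_flop(100e-6, 1.0) * 1e6:.2f} uW per flip-flop")
```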

### 4.5 Measurements

$S^{2}$ CFF was characterized in a 45 nm SOI test chip, and TGFF, ACFF, and TGPL were also implemented in the same test chip for fair comparisons; 50 dies were fabricated and measured.

Figures 4.10 and 4.11 show the measured total power and energy. $\mathrm{S}^{2} \mathrm{CFF}$ does not require internal clock inverters, and this enables a clock power reduction, where the clock power is defined as the total power at $0 \%$ activity ratio with $\mathrm{D}=0$. From the power measurement, $\mathrm{S}^{2} \mathrm{CFF}$ shows a clock power reduction of $41 \%$ and $40 \%$ at $1 \mathrm{~V} / 1 \mathrm{GHz}$ and $0.4 \mathrm{~V} / 200 \mathrm{MHz}$ operation, respectively, compared to TGFF. Assuming that flip-flops in a typical system have a $20 \%$ activity ratio, $\mathrm{S}^{2} \mathrm{CFF}$ provides $39 \%$ and $38 \%$ improvement in total sequential power at $1 \mathrm{~V} / 1 \mathrm{GHz}$ and $0.4 \mathrm{~V} / 200 \mathrm{MHz}$, respectively, compared to TGFF. ACFF also uses single-phase clocking and thus shows a clock power similarly low to that of $\mathrm{S}^{2} \mathrm{CFF}$. However, the total power of ACFF increases rapidly as activity rises due to contention in the slave latch; this makes $\mathrm{S}^{2} \mathrm{CFF}$ the lowest-power flip-flop at any activity ratio. TGPL has a delay element, which leads to higher total power consumption even at $0 \%$ activity ratio. In terms of active energy consumption, $\mathrm{S}^{2} \mathrm{CFF}$ shows $32 \%$ and $34 \%$ reduction at 1.0 V and 0.4 V, respectively, compared to TGFF. $\mathrm{S}^{2} \mathrm{CFF}$ is the lowest-energy flip-flop due to its static, single-phase clock, and contention-free operation.

Figures 4.12 and 4.13 show the measured C-Q delays and leakage power. The C-Q delay in $\mathrm{S}^{2} \mathrm{CFF}$ is determined by net2 either staying precharged or being discharged, depending on the net1 value at the positive edge of CK, followed by the update of the QN (and thus Q) node. Compared to TGFF, where the C-Q delay is determined by the delay through one transmission gate and two inverters, $\mathrm{S}^{2} \mathrm{CFF}$ shows a modest improvement across $V_{D D}$, with a $14.8 \%$ faster C-Q delay at 1.0 V. ACFF has the fastest C-Q delay because it places the output inverter right after the pass-gate (M4 in Figure 4.1). However, it should be noted that the missing points in the plot indicate that ACFF fails to achieve $100 \%$ yield at 0.4 V. This is due to the current contention in the slave latch as well as the degraded state-holding in the master latch, as described earlier. Similarly, TGPL fails at $V_{D D} \leq 0.6 \mathrm{~V}$, mainly due to hold time failures; it has a positive hold time constraint because of the pulsed operation, and the pulse width becomes very sensitive to PVT variations, especially at low $V_{D D}$. This illustrates the importance of static and contention-free operation at low $V_{D D}$, since only TGFF and $\mathrm{S}^{2} \mathrm{CFF}$ show $100 \%$ yield across the wide $V_{D D}$ range. From the leakage measurement, $\mathrm{S}^{2} \mathrm{CFF}$ has $35 \%$ and $37 \%$ lower leakage power


Figure 4.10: Measured total power


Figure 4.11: Measured energy


Figure 4.12: Measured C-Q delay


Figure 4.13: Measured leakage power

|  | $\mathbf{S}^{2} \mathbf{C F F}$ <br> (This Work) | TGFF <br> Standard Cell Lib. | ACFF Teh, ISSCC' 11 | TGPL Naffziger, JSSC'02 | CSP $^{3} \mathbf{L}^{*}$ <br> Consoli, ISSCC' 12 | DMFF ${ }^{*}$ Nomura ISSCC' 08 | CPSA <br> Ueda, ISSCC'06 | CCFF <br> Kong, JSSC' 01 | HLFF* <br> Partovi, ISSCC'96 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Type | Static | Static | Static | Pulsed | Pulsed | Pulsed | Static | Pulsed | Pulsed |
| Contention-Free | Yes | Yes | No | Yes | Yes | No | No | No | No |
| Single Phase Clock | Yes | No | Yes | No | No | No | Yes | No | No |
| Number of Transistors | 24 | 24 | $22^{1)}$ | $28^{2)}$ | $42^{3)}$ | $24^{4)}$ | 28 | 35 | 20 |
| Normalized Layout Size | 1.07 | 1.00 | 1.13 | 1.40 |  |  |  |  |  |
| Measured C-Q Delay @ 1.0V | 33.9ps | 39.8ps | 27.1ps | 37.9ps |  |  |  |  |  |
| Measured Setup Time @ 1.0V | 34.0ps | 40.6ps | 77.8ps | 8.5ps |  |  |  |  |  |
| Measured Hold Time @ 1.0V | -25.7ps | -31.4ps | -66.1ps | 1.28ps |  |  |  |  |  |
| Measured Total Power <br> @ $1.0 \mathrm{~V}, 1 \mathrm{GHz}, 20 \%$ Activity | $10.02 \mu \mathrm{~W}$ | $16.36 \mu \mathrm{~W}$ | $13.45 \mu \mathrm{~W}$ | $24.57 \mu \mathrm{~W}$ |  |  |  |  |  |
| Measured Leakage @ 1.0V | 592nW | 909nW | 967nW | 1283nW |  |  |  |  |  |

1) It becomes 26 if ACE (Adaptive-Coupling Element) is added to the slave latch for low-voltage robustness.

2) Delay element has 5 inverters to generate a pulse.

3) 16 transistors (pulse generator) can be shared among multiple flip-flops.

4) Assuming 3 inverters are used for delay generation.

\* $\mathrm{CSP}^{3} \mathrm{L}$, DMFF, CPSA, CCFF, and HLFF are not implemented in this test chip.

Table 4.3: Measurement and topology comparison of flip-flops
than TGFF at 1.0 V and 0.4 V, respectively. This is because $\mathrm{S}^{2} \mathrm{CFF}$ has fewer leakage paths than TGFF.

Finally, Table 4.3 includes the measured setup/hold time as well as comparisons with other recently proposed flip-flops. $\mathrm{S}^{2} \mathrm{CFF}$ has a $15.5 \%$ faster 'setup time + C-Q delay' at 1.0 V compared to TGFF, with the lowest power consumption among the compared flip-flops. The table also shows that $\mathrm{S}^{2} \mathrm{CFF}$ is the only flip-flop that provides static, contention-free, and single-phase clock operation without increasing the device count compared to the conventional TGFF. While TGFF, ACFF, and TGPL have already been discussed in detail in the previous sections, the other flip-flops also fail to meet these requirements: $\mathrm{CSP}^{3} \mathrm{L}$ [48] is based on pulsed operation and does not provide single-phase clocking, while its device count exceeds that of TGFF; DMFF [49] has the same device count as TGFF, but it requires a clock inverter and its Q node suffers from contention; CPSA [50] is a static, single-phase clocking flip-flop, but its internal nodes suffer from contention; CCFF [51] also suffers from contention and an area penalty (35 devices), and it is based on pulsed operation; HLFF [52] also has pulsed operation and requires clock inverters, and its output is not contention-free. The $\mathrm{S}^{2} \mathrm{CFF}$ layout size is only $7 \%$ larger than that of TGFF, which corresponds to a one poly-pitch increase in 45 nm technology. The die photo of the test chip is shown in Figure 4.14 with the locations of the testing circuits annotated.
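The percentage improvements quoted in this section can be cross-checked against the 1.0 V entries of Table 4.3. The short sketch below recomputes them; the dictionary keys and the helper name are illustrative, while the numbers are taken directly from the table.

```python
def reduction_percent(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (baseline - proposed) / baseline * 100.0


# Measured values at 1.0 V from Table 4.3.
tgff = {"cq_ps": 39.8, "setup_ps": 40.6, "power_uw": 16.36, "leak_nw": 909.0}
s2cff = {"cq_ps": 33.9, "setup_ps": 34.0, "power_uw": 10.02, "leak_nw": 592.0}

cq_gain = reduction_percent(tgff["cq_ps"], s2cff["cq_ps"])
setup_cq_gain = reduction_percent(
    tgff["setup_ps"] + tgff["cq_ps"], s2cff["setup_ps"] + s2cff["cq_ps"]
)
power_gain = reduction_percent(tgff["power_uw"], s2cff["power_uw"])
leak_gain = reduction_percent(tgff["leak_nw"], s2cff["leak_nw"])

print(f"C-Q delay:           {cq_gain:.1f}% faster")
print(f"setup + C-Q:         {setup_cq_gain:.1f}% faster")
print(f"total power (20%):   {power_gain:.1f}% lower")
print(f"leakage:             {leak_gain:.1f}% lower")
```

The results agree with the 14.8%, 15.5%, 39%, and 35% figures in the text after rounding.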


Figure 4.14: Die photo of the test chip fabricated in 45 nm SOI

### 4.6 Conclusions

We presented a new flip-flop named $\mathrm{S}^{2} \mathrm{CFF}$, which incorporates all the characteristics that an energy-efficient, highly voltage-scalable sequential element requires: static operation, contention-free transitions, single-phase clocking, and minimal or no area penalty compared to conventional designs. Robust operation with the lowest power consumption is demonstrated by silicon measurements of test chips fabricated in 45 nm SOI. $\mathrm{S}^{2} \mathrm{CFF}$ operates reliably at near-threshold voltage $(0.4 \mathrm{~V})$ and is one of only two flip-flops that show $100 \%$ yield across the wide $V_{D D}$ range. The other flip-flop with $100 \%$ yield is TGFF, but $\mathrm{S}^{2} \mathrm{CFF}$ further reduces the power and energy consumption, demonstrating $32 \%$ less active energy, $41 \%$ less clock power, and $35 \%$ less leakage power. It also improves 'setup time + C-Q delay' by $15.5 \%$, and more importantly, all of these are achieved using the same device count as TGFF, which implies that the area penalty is negligible, if not zero. In this implementation, compared to the commercial TGFF, $\mathrm{S}^{2} \mathrm{CFF}$ has only a $7 \%$ larger layout size, which corresponds to a one poly-pitch increase in 45 nm SOI. It is also shown that the simple hold time path in $\mathrm{S}^{2} \mathrm{CFF}$ results in a $3.4 \times$ reduction in hold time at the $3 \sigma$ value at near-threshold voltage $(0.32 \mathrm{~V})$. All of these suggest that $\mathrm{S}^{2} \mathrm{CFF}$ is an attractive candidate for sequential elements in low-power and highly voltage-scalable systems.

## CHAPTER 5

## A Testing Harness for Low-Voltage Flip-Flop Timing Characterization

### 5.1 Introduction

Electronic design automation (EDA) tools are indispensable in today's VLSI designs. The reliability of these tools depends on how accurately the devices and gates have been modeled. For example, more accurate MOSFET I-V characteristics in a SPICE model file can lead to more accurate simulation results.

If the modeling is not accurate, automatic place-and-route (APR) tools, for example, could insert unnecessarily many buffers to fix the hold-time margin of flip-flops. While the functionality of the system remains the same, in a large system where millions of flip-flops are used, these extra buffers would take up a significant portion of the total power. In addition, as the supply voltage is scaled down, the effects of any kind of variation, including the hold-time variation, can negatively impact the system yield and performance. Therefore, these variations must be addressed with special care at low $V_{D D}$. Although conventional flip-flops at low $V_{D D}$ have been studied through simulations [38], it is hard to find on-chip testing circuits aimed at actual silicon measurements at low $V_{D D}$. On-chip testing circuits have been proposed for accurate flip-flop measurements [53], but they are limited to full (i.e., nominal) $V_{D D}$ measurements.

In this chapter, we will discuss the issues in on-chip testing circuits for flip-flop timing characterization, mainly focused on wide-range $V_{D D}$ measurements, and then we will propose a new testing harness for accurate low-voltage measurements. This technique will be demonstrated through silicon measurements.

Figure 5.1: Mismatch sources in a setup/hold-time measurement circuit

### 5.2 Issues in Low $V_{D D}$ Flip-Flop On-Chip Measurements

Figure 5.1 shows possible mismatch sources in the setup/hold-time measurement circuit used in Chapter 4 for timing characterization. The basic operation is explained in Section 4.4.1. The Delay Control runs with the Main Clock, and it generates the CK and D signals depending on the tuning bits and bias voltages. The time difference between CK and D at the Delay Control output is $T_{D}$. Note that the Delay Control runs at the full voltage ( $V_{D D}$ ). However, the DUT must be in a separate voltage domain $\left(V_{D D L}\right)$, and this voltage can be lower than $V_{D D}$ in order to measure the voltage dependency of the setup/hold-time. Thus, there must be down-conversion buffers between the Delay Control and the DUT. Since there are two separate paths (CK and D), and each path has its own down-conversion buffer, there is a mismatch between those buffer delays. In Figure 5.1, the buffer delays are $T_{B C}$ and $T_{B D}$ in the clock path and the data path, respectively, but each has its own delay variation at a lower voltage, which is indicated by $\Delta T_{B C}$ and $\Delta T_{B D}$, respectively. These variations combine to generate the relative mismatch, $\Delta T_{B}=\Delta T_{B C}-\Delta T_{B D}$, and this mismatch appears at the DUT input (i.e., $T_{D U T}=T_{D}+\Delta T_{B}$ ) on net_clk and net_data as shown in the figure.

Figure 5.2: A simplified diagram of the mismatch sources in a setup/hold-time measurement circuit

There are other inaccuracies involved in this testing circuit as well. Since the DUT runs at $V_{D D L}$, there must be level converters to generate the pulses to be measured with the TDC. For accurate measurements, the TDC must operate at the full voltage ( $V_{D D}$ ). These level converters themselves also have mismatches, as indicated by $\Delta T_{L}$ in the figure. However, the sum of this mismatch and the offset from the Pulse Gen $\left(\Delta T_{L}+T_{O F F}\right)$ can be measured as long as the edges of net_clk and net_data are accurately aligned. Perfect alignment of net_clk and net_data means that the pulse width at the TDC input is just the sum of the level converter mismatch and the Pulse Gen offset (i.e., $T_{M}=\Delta T_{L}+T_{O F F}$ ), and this can be measured using the TDC. The Phase Detector shown in the figure is used to align those edges. However, this Phase Detector is not ideal either. It can be modeled as an 'Ideal Phase Detector' with 'non-ideal input buffers', as shown in the figure. The 'Ideal Phase Detector' is assumed to have zero mismatch, but the 'non-ideal input buffers' have a mismatch $\Delta T_{P}$ that causes imperfect alignment of the net_clk and net_data signals. Thus, in real measurements, $T_{0}$ can be made zero by tuning the delays in the Delay Control, but this does not necessarily mean $T_{D U T}=0$ due to the non-ideality $\left(\Delta T_{P}\right)$ of the Phase Detector.

All of these mismatch components are summarized and shown in Figure 5.2. Note that the mismatches can be effectively alleviated at the full voltage ( $V_{D D}$ ) through device up-sizing and a careful layout. However, this becomes almost impossible at lower $V_{D D}$ due to the severe variations.

### 5.3 A New Phase Detection Circuit for Low $V_{D D}$ Operation

We discussed that the mismatches in the Phase Detector can result in inaccurate measurement. In other words, if the perfect alignment between the CK and D edges is guaranteed, the sum of the level converter mismatch $\left(\Delta T_{L}\right)$ and the Pulse Gen offset ( $T_{O F F}$ ) can be measured and subtracted out from the final $T_{M}$ value since $T_{M}$ is given by the following equation:

$$
\begin{equation*}
T_{M}=T_{D}+\Delta T_{B}+\Delta T_{L}+T_{O F F} \tag{5.1}
\end{equation*}
$$

Since $T_{D U T}$ is a sum of $T_{D}$ and $\Delta T_{B}$,

$$
\begin{equation*}
T_{D U T}=T_{D}+\Delta T_{B} \tag{5.2}
\end{equation*}
$$

Eq. (5.1) can be written as follows:

$$
\begin{equation*}
T_{M}=T_{D U T}+\Delta T_{L}+T_{O F F} \tag{5.3}
\end{equation*}
$$

We are interested in finding out $T_{D U T}$ at which the DUT starts having setup/hold failures.

$$
\begin{equation*}
T_{D U T}=T_{M}-\left(\Delta T_{L}+T_{O F F}\right) \tag{5.4}
\end{equation*}
$$

Since $T_{M}$ can be measured using the TDC, the remaining unknown is $\left(\Delta T_{L}+T_{O F F}\right)$. The only way to measure this is to perfectly align the CK and D edges on net_clk and net_data. This must also be done over a wide voltage range to provide an accurate setup/hold-time measurement at low voltages. Therefore, the problem reduces to the design of an accurate phase detector for a wide voltage range.

Note that if D changes from 0 to 1 around the CK rising edge, there is no need for an accurate phase detector, since simply shorting net_clk and net_data provides perfect alignment, as suggested in Figure 5.3. However, it is difficult to align a D falling edge with a CK rising edge, and this is where the accurate phase detector is required. Since CK and D transition in opposite directions, the traditional D-flip-flop or SR-latch approach incurs inaccuracies because at least one of the two paths (CK or D) must have an additional inverter, hence causing imbalanced delays.

Figure 5.3: Edge alignment and offset $\left(\Delta T_{L}+T_{O F F}\right)$ measurement when D rises

In order to solve this issue, we adopt an alternate approach, where one circuit detects "non-overlapping" of CK and D while the other circuit detects "overlapping". The key components of this approach are shown in Figure 5.4, where the non-overlapping detector and the overlapping detector are shown. They are based on dynamic NOR/NAND structures, and example waveforms are shown in the figure. The reason for using dynamic structures is that, if there is only a slight non-overlap (or a slight overlap), net0 sees just a small glitch, but the phase detector should still be able to detect it. The conventional static approaches (D-flip-flop or SR-latch) cannot do this because they require a voltage rise larger than their trip point, which is usually around half of $V_{D D}$. From corner simulations, the worst-case error of these dynamic structures is 0.061 FO4 and 0.057 FO4 at 1.0 V and 0.3 V, respectively. In addition, by using periodic CK and D signals and running this circuit for many cycles, it can tolerate more non-idealities.

Figure 5.5 shows the whole circuit diagram of the phase detector. The Non-Overlapping and Overlapping detection circuits are at the core of this circuit. The SR-latches and the flip-flops sample the outputs of the detection circuits, which are then fed into a controller circuit that counts the number of CK_LEAD and D_LEAD events. There are also static NOR and AND gates at the bottom of the circuit diagram for trivial cases where the non-overlapping (or overlapping) duration is long enough to trip the static gate's output. The Disable Control block resets the Non-Overlapping/Overlapping detection circuits after some delay from the CK rising edge; this prevents a false trigger of the dynamic circuits due to leakage current. This operation is repeated many times, and the outputs from the detection circuits increment counters, which can then be used to determine the edge alignment.

Figure 5.4: Dynamic NAND/NOR structures for edge alignment
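How the two counters might be turned into an alignment decision can be sketched as follows. The text does not specify the decision rule, so this example assumes that the edges are considered best aligned at the tuning setting where the CK_LEAD and D_LEAD counts are closest to each other over the same number of cycles. The function name and the sweep data are hypothetical.

```python
def best_aligned_setting(counts):
    """Pick the tuning setting whose CK_LEAD and D_LEAD counts are most balanced.

    counts maps a tuning setting to a (ck_lead_count, d_lead_count) tuple
    collected over a fixed number of CK/D cycles. Balanced counts are taken
    as the alignment criterion, which is an assumption of this sketch.
    """
    if not counts:
        raise ValueError("no measurements provided")
    return min(counts, key=lambda s: abs(counts[s][0] - counts[s][1]))


# Hypothetical sweep: 1000 cycles per setting, setting values are arbitrary codes.
sweep = {
    10: (1000, 0),   # CK clearly leads
    11: (870, 130),
    12: (520, 480),  # counts nearly balanced
    13: (90, 910),
    14: (0, 1000),   # D clearly leads
}
print("best setting:", best_aligned_setting(sweep))
```

Averaging over many cycles in this way is one reason the repeated operation can tolerate cycle-to-cycle non-idealities that a single-shot comparison could not.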

### 5.4 A Setup/Hold-Time Measurement Circuit for Wide Voltage-Range Operation

Figure 5.6 shows an overall circuit diagram of the proposed setup/hold-time measurement circuit. The Delay Control, Down-Conversion Buffers, Level Converters, and the Pulse Gen are the same as in the previous circuit shown in Figure 5.1. The Phase Detector is the one described in Section 5.3. There are four pairs of switches (transmission gates), and they are controlled at the full voltage $\left(V_{D D}\right)$ to minimize their channel resistance. When measuring the offset value $\left(\Delta T_{L}+T_{O F F}\right)$ for a D-rising edge, SW B and SW C are on, while SW A and SW D are off. This shorts the inputs of the two level converters, so perfect alignment of net_clk and net_data is guaranteed. Then, the Main Clock provides a periodic signal, and the corresponding pulse width ( $T_{M}=\Delta T_{L}+T_{O F F}$ ) is measured. When measuring the offset value for a D-falling edge, SW B and SW D are on, while SW A and SW C are off. Then, the Main Clock provides a periodic signal to the Clock Buffer. The schematic of this Clock Buffer is shown in Figure 5.7; it generates the CK and D signals as well as the RESET signal for the Phase Detector. The analog bias voltage ( $V_{\text {BIAS }}$ ) in Figure 5.7(b) is adjusted until the Phase Detector outputs indicate a good alignment between the CK and D edges. At this point, the corresponding offset value ( $T_{M}=\Delta T_{L}+T_{O F F}$ ) can be measured. Finally, in order to check for setup/hold-time failures, SW A is on, while the others are off. The Delay Control tuning bits and voltages are adjusted until the DUT fails, and then the corresponding pulse width ( $T_{M}=T_{D U T}+\Delta T_{L}+T_{O F F}$ ) can be measured. Once $\Delta T_{L}+T_{O F F}$ is subtracted from $T_{M}$, the remaining $T_{D U T}$ is the final setup- (or hold-) time.

Figure 5.5: Phase detector circuit diagram

Figure 5.6: Setup/hold-time measurement circuit

Figure 5.7: (a) Clock Buffer schematic (b) Current-starved buffer for delay tuning

It should be noted that all the important mismatch values, such as $\Delta T_{L}$, can be subtracted out using the provided switches. Also, the reliable operation of the Phase Detector allows a wide voltage-range timing characterization of flip-flops.
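The three measurement modes and the final offset subtraction of Eq. (5.4) can be summarized in a short sketch. The switch states follow the description above; the mode names, the function name, and the example pulse widths are hypothetical.

```python
# Switch states (True = on) for each measurement mode, as described in the text.
SWITCH_CONFIG = {
    "offset_d_rising":  {"A": False, "B": True,  "C": True,  "D": False},
    "offset_d_falling": {"A": False, "B": True,  "C": False, "D": True},
    "setup_hold_test":  {"A": True,  "B": False, "C": False, "D": False},
}


def dut_timing_ps(t_m_at_failure_ps: float, t_m_offset_ps: float) -> float:
    """Eq. (5.4): T_DUT = T_M - (dT_L + T_OFF).

    t_m_at_failure_ps is the TDC reading when the DUT starts to fail;
    t_m_offset_ps is the TDC reading taken with the edges aligned.
    """
    return t_m_at_failure_ps - t_m_offset_ps


# Hypothetical TDC readings in picoseconds (illustration only).
offset_reading = 30.0    # measured in an offset mode, equals dT_L + T_OFF
failure_reading = 50.0   # measured in setup_hold_test mode at the failure point
print(f"T_DUT = {dut_timing_ps(failure_reading, offset_reading):.1f} ps")
```

Because both readings pass through the same level converters and Pulse Gen, their common offset cancels in the subtraction.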

| $V_{D D}$ |  | TGFF | $\mathrm{S}^{2} \mathrm{CFF}$ | Improvement |
| :---: | :---: | :---: | :---: | :---: |
| 1.00 V | Mean | Sigma | 4.30 ps | 5.62 ps |
|  | Maximum | 24.38 ps | 2.11 ps | - |
|  | Minimum | -3.28 ps | 0.33 ps | $2.2 \times$ |
| 0.40 V | Mean | 3.66 ps | 22.63 ps | - |
|  | Sigma | 40.72 ps | 23.23 ps | - |
|  | Maximum | 155.27 ps | 82.49 ps | $1.9 \times$ |
|  | Minimum | -84.91 ps | -35.61 ps | - |
| 0.35 V | Mean | 31.42 ps | 37.46 ps | - |
|  | Sigma | 97.69 ps | 46.33 ps | $2.1 \times$ |
|  | Maximum | 351.21 ps | 167.10 ps | $2.1 \times$ |
|  | Minimum | -184.17 ps | -74.87 ps | - |
| 0.32 V | Mean | 11.38 ps | 51.48 ps | - |
|  | Sigma | 111.69 ps | 74.86 ps | $1.5 \times$ |
|  | Maximum | 486.52 ps | 276.03 ps | $1.8 \times$ |
|  | Minimum | -217.53 ps | -130.53 ps | - |

Table 5.1: Comparison of the hold-time variations of TGFF and $\mathrm{S}^{2} \mathrm{CFF}$ (172 flip-flops of each type)

### 5.5 Measurements

Test chips were fabricated in a 45 nm SOI technology. Each test chip contains 4 TGFFs and 4 $\mathrm{S}^{2} \mathrm{CFF}$s. 43 chips have been measured using the proposed timing characterization circuit, so the sample size is 172 for each flip-flop. Hold-time distributions are measured at the full $V_{D D}(=1.0 \mathrm{~V})$, $0.40 \mathrm{~V}, 0.35 \mathrm{~V}$, and 0.32 V, where $\sim 0.35 \mathrm{~V}$ corresponds to near-$V_{T H}$. Histograms from the 172 flip-flops of each type are shown in Figure 5.8 and Figure 5.9, measured at each specified voltage, and the statistical results are summarized in Table 5.1. In addition, an average value is calculated for each chip (i.e., the average hold-time of the 4 flip-flops of each type in the same chip), giving 43 average values in total, which are shown as histograms in Figure 5.10 and Figure 5.11 at each specified voltage. This makes it possible to observe chip-to-chip variations while reducing the effects of within-die variations. Statistical results from these distributions are summarized in Table 5.2.
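The two levels of statistics described above can be sketched as follows. The hold-time values in the example are hypothetical, and the use of the sample standard deviation for 'Sigma' is an assumption, since the text does not say which estimator was used.

```python
import statistics


def summarize(values):
    """Mean, sigma (sample standard deviation), maximum, and minimum."""
    return {
        "mean": statistics.mean(values),
        "sigma": statistics.stdev(values),
        "max": max(values),
        "min": min(values),
    }


def per_chip_averages(hold_times_by_chip):
    """Average the flip-flops of one type within each chip."""
    return [statistics.mean(chip) for chip in hold_times_by_chip]


# Hypothetical hold-times in ps: 3 chips with 4 flip-flops each.
chips = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 2.0, 2.0, 2.0],
    [5.0, 7.0, 6.0, 6.0],
]
all_flip_flops = [t for chip in chips for t in chip]

per_ff_stats = summarize(all_flip_flops)               # analogous to Table 5.1
per_chip_stats = summarize(per_chip_averages(chips))   # analogous to Table 5.2
print(per_ff_stats)
print(per_chip_stats)
```

Since every chip contributes the same number of flip-flops, the mean of the per-chip averages equals the overall mean, so only the sigma, maximum, and minimum values differ between the two views.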

From these measurements, it is clear that $\mathrm{S}^{2} \mathrm{CFF}$ exhibits much less hold-time variation. Figure 5.8 and Figure 5.9, summarized in Table 5.1, show $2.3 \times$ and $2.1 \times$ smaller sigma values at 1.0 V and 0.35 V, respectively, mainly because of the simple hold-time path described in Section 4.3.2. The most critical measurement is the 'Maximum' value of the hold-time, since a hold-time fix process in a system design must take the worst-case value of the hold-time into account, adding buffers in order to make the shortest path delay exceed the worst-case hold-time. It is clearly shown that $\mathrm{S}^{2} \mathrm{CFF}$ provides $2.2 \times$ and $2.1 \times$ reduction in the maximum hold-time at 1.0 V and 0.35 V, respectively, implying that it can reduce the number of hold-time fixing buffers by $>2 \times$, followed by overall system power reduction and yield improvement.

Figure 5.8: Hold-time distribution of TGFF and $\mathrm{S}^{2} \mathrm{CFF}$ at 1.0 V and 0.4 V (172 flip-flops of each type)

Figure 5.9: Hold-time distribution of TGFF and $\mathrm{S}^{2} \mathrm{CFF}$ at 0.35 V and 0.32 V (172 flip-flops of each type)

Figure 5.10: Hold-time distribution of TGFF and $\mathrm{S}^{2} \mathrm{CFF}$ at 1.0 V and 0.4 V (43 chips)

Figure 5.11: Hold-time distribution of TGFF and $\mathrm{S}^{2} \mathrm{CFF}$ at 0.35 V and 0.32 V (43 chips)

| $V_{D D}$ |  | TGFF | $\mathrm{S}^{2} \mathrm{CFF}$ | Improvement |
| :---: | :---: | :---: | :---: | :---: |
| 1.00 V | Mean | 6.30 ps | 5.62 ps | - |
|  | Sigma | 4.53 ps | 1.65 ps | $2.7 \times$ |
|  | Maximum | 22.89 ps | 8.81 ps | $2.6 \times$ |
|  | Minimum | -1.20 ps | 1.93 ps | - |
| 0.40 V | Mean | 3.66 ps | 22.63 ps | - |
|  | Sigma | 31.40 ps | 12.43 ps | $2.5 \times$ |
|  | Maximum | 88.71 ps | 46.44 ps | $1.9 \times$ |
|  | Minimum | -48.34 ps | -4.54 ps | - |
| 0.35 V | Mean | 31.42 ps | 37.46 ps | - |
|  | Sigma | 83.07 ps | 24.68 ps | $3.4 \times$ |
|  | Maximum | 218.98 ps | 85.44 ps | $2.6 \times$ |
|  | Minimum | -103.07 ps | -6.39 ps | - |
| 0.32 V | Mean | 11.38 ps | 51.48 ps | - |
|  | Sigma | 76.75 ps | 40.27 ps | $1.9 \times$ |
|  | Maximum | 202.51 ps | 124.96 ps | $1.6 \times$ |
|  | Minimum | -138.15 ps | -23.36 ps | - |

Table 5.2: Comparison of the hold-time variations of TGFF and $\mathrm{S}^{2} \mathrm{CFF}$ (43 chips)

Figure 5.12: Maximum hold-time value from the measured 172 flip-flops of each type

$\mathrm{S}^{2} \mathrm{CFF}$ shows an even larger improvement in the hold-time variation when it comes to chip-to-chip variations. In Figure 5.10 and Figure 5.11, summarized in Table 5.2, $\mathrm{S}^{2} \mathrm{CFF}$ shows $2.7 \times$ and $3.4 \times$ smaller sigma values at 1.0 V and 0.35 V, respectively. The figures suggest that the variation of TGFF degrades significantly, especially at low voltages, whereas $\mathrm{S}^{2} \mathrm{CFF}$ still maintains a tight spread at low voltages. As explained in Section 4.3.2, TGFF's hold-time is mainly determined by the mismatches among several gates. Since it is prone to any kind of variation, it is not unexpected that the global variations (i.e., chip-to-chip variations) have a larger effect than the local variations (i.e., within-die variations). In contrast, the hold-time of $\mathrm{S}^{2} \mathrm{CFF}$ is mainly determined by the discharging speed through $\mathrm{PATH}_{\mathrm{CK}}$ (Figure 4.6), so it shows smooth bell-shaped distributions in all the measurements, even at near-$V_{T H}$.

The maximum hold-time values from the 172 flip-flops of each type are also plotted in Figure 5.12 to show the trend. The maximum hold-time value of $\mathrm{S}^{2} \mathrm{CFF}$ at 0.32 V is even shorter than the maximum hold-time of TGFF at a higher voltage $(0.35 \mathrm{~V})$. Therefore, $\mathrm{S}^{2} \mathrm{CFF}$ can provide either 1) a smaller number of buffers added for the hold-time fix, or 2) a lower $V_{M I N}$. Both benefits can lead to an overall system power reduction, while still guaranteeing the system robustness (i.e., no hold-time failure).
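The low-voltage comparison can be reproduced from the 'Maximum' rows of Table 5.1. The sketch below recomputes the improvement ratios at 0.40 V, 0.35 V, and 0.32 V and checks the cross-voltage claim; the variable names are illustrative, and the 1.0 V row is omitted here.

```python
# Maximum hold-time in ps from Table 5.1: (TGFF, S2CFF), 172 flip-flops each.
MAX_HOLD_PS = {
    "0.40 V": (155.27, 82.49),
    "0.35 V": (351.21, 167.10),
    "0.32 V": (486.52, 276.03),
}

# Improvement ratio = TGFF maximum / S2CFF maximum, rounded as in the table.
ratios = {vdd: round(tgff / s2cff, 1) for vdd, (tgff, s2cff) in MAX_HOLD_PS.items()}

# S2CFF at 0.32 V versus TGFF at the higher 0.35 V supply.
s2cff_032 = MAX_HOLD_PS["0.32 V"][1]
tgff_035 = MAX_HOLD_PS["0.35 V"][0]
s2cff_wins_across_voltage = s2cff_032 < tgff_035

for vdd, ratio in ratios.items():
    print(f"{vdd}: {ratio}x lower maximum hold-time")
print("S2CFF @ 0.32 V shorter than TGFF @ 0.35 V:", s2cff_wins_across_voltage)
```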

A die photo is provided in Figure 5.13.


Figure 5.13: Die photo of the test chip fabricated in 45 nm SOI

## CHAPTER 6

## Conclusion

The ongoing demand for faster computing speed has met a major hurdle: the clock frequency can no longer be increased freely because of excessive power consumption. Thus, in recent years, low-power design is no longer optional; it has become one of the most important design criteria that virtually all digital/analog circuits should meet. Voltage scaling is an effective way to reduce the overall power consumption, but the major challenges in sub- or near-$V_{T H}$ operation include performance degradation and reliability issues due to PVT variations. Although the performance degradation can be compensated by utilizing more parallelism (e.g., multi-core systems), the reliability concerns must be correctly addressed during the design phase in order to avoid serious system failure.

In this dissertation, we identified several important circuit components that are prone to such variations in NTC, proposed new techniques to improve robustness, and demonstrated the effectiveness through silicon measurements.

Level converters are critical components in voltage-scaled VLSI systems in that they must provide a reliable interface between two different voltage domains. Digital cores tend to run in severely voltage-scaled domains, while other analog/peripheral circuits still require a high voltage, and especially in the NTC region, the reduced $I_{\text {ON }} / I_{\text {OFF }}$ ratio makes it extremely difficult to achieve robust level conversion. In Chapter 2, we proposed two static level converter designs called $\mathrm{LC}^{2}$ and SLC. $\mathrm{LC}^{2}$ adopts a novel thyristor and pulsed operation and modulates its pull-up strength depending on its state. During the idle state, where there is no input change, it holds the internal state through the weak keepers, whereas the strong devices running at $V_{D D H}$ participate in actual signal transitions when the input changes. The device sizing of the keepers is the most important design criterion in $\mathrm{LC}^{2}$. We demonstrated that it can easily meet the $3 \sigma$ robustness requirement through a systematic approach using the current margin plot. Because the actual transitions are handled by the strong devices, $\mathrm{LC}^{2}$ provides the fastest performance compared to other designs, demonstrating a $3.2 \times$ speed improvement over DCVS. SLC inherently reduces the contention by incorporating diodes in the stack, so that the pull-down devices fight against a diode whose $\left|V_{G S}\right|$ corresponds to the diode voltage drop $\left(V_{D}\right)$. Compared to other designs, where the pull-down devices contend with a strong PMOS device whose $\left|V_{G S}\right|$ is usually $\sim V_{D D H}$, SLC provides a great improvement in robustness, resulting in $98.93 \%$ yield in Monte-Carlo simulations as well as no failures over a wide temperature range during silicon measurements. Moreover, the simple schematic and the small layout size of SLC make it suitable for standard-cell libraries and could streamline the system design process.

SRAMs exist in virtually all processors. However, they are also a major bottleneck in voltage scaling due to their inherently ratioed bitcell design. Widely-used 8T bitcells decouple READ and WRITE operations, eliminating the two-sided constraint, at the cost of a larger bitcell size. Usually, the area overhead is in the $30 \sim 55 \%$ range, which sometimes prevents their use in severely area-constrained applications. In Chapter 3, we proposed a novel 7T SRAM bitcell and its peripherals, in order to alleviate the area overhead and provide robust operation. The Auto-Shut-Off sensing effectively eliminates the short-circuit current from unselected cells, resulting in a $6.8 \times$ READ energy reduction. Also, the 7T bitcell's innate bitline leakage suppression in unselected bitcells, resulting from the negative $V_{G S}$ on their READ device, provides $113 \times$ lower bitline leakage than the conventional 8T memory in simulation. This Quasi-Static READ has also been demonstrated through silicon measurements, which show a much improved READ error rate. In addition, the use of PMOS transistors as Pass-Gate devices improves the half-select robustness by directly modulating the transistor $\left|V_{G S}\right|$ through the WRITE bitline voltage. The silicon measurements show robust bit-interleaved operation and a leakage power of $3.35 \mathrm{fW} / \mathrm{bit}$.

The clocked sequential element, or flip-flop for short, is ubiquitous in today's digital systems. While many flip-flop designs have been proposed, the main issue has remained the same: hold-time variation. This often causes excessive buffer insertion to meet the hold-time margin under severe PVT variations. Also, in terms of robustness and design overhead, it is very hard to find a flip-flop that is static and contention-free with negligible or no area overhead compared to the widely-used TGFF. In Chapter 4, we proposed a new flip-flop called $\mathrm{S}^{2} \mathrm{CFF}$. It is single-phase, meaning that it does not require the inverted clock signal. It is static and contention-free, and it has the same number of devices (24 transistors) as the TGFF, which makes the area overhead of $\mathrm{S}^{2} \mathrm{CFF}$ negligible. Among the compared baseline designs, it is the only flip-flop that meets all of these requirements (single-phase, static, contention-free, same device count). Mainly due to the single-phase operation, $\mathrm{S}^{2} \mathrm{CFF}$ shows a $\sim 40 \%$ power reduction compared to the TGFF in silicon measurements. In addition, due to its static and contention-free operation, it demonstrates robust low-voltage operation similar to the TGFF, reliably running at 0.4 V while other designs fail. Another benefit of $\mathrm{S}^{2} \mathrm{CFF}$ is its simple hold-time path. This reduces the mismatches that determine the hold time, yielding a $3.4 \times$ improvement in $3 \sigma$ hold-time compared to the TGFF.

The flip-flop testing harness for timing characterization was also discussed and demonstrated through silicon measurements. This testing harness incorporates dynamic NAND/NOR structures and many-cycle operation in order to align the CK and D edges more accurately. This makes it easy to measure the offset caused by the severe mismatches in low-$V_{D D}$ operation, so the offset can be subtracted out through a simple calculation. Test-chip measurements demonstrated that $\mathrm{S}^{2} \mathrm{CFF}$ has up to a $3.4 \times$ reduction in the standard deviation of the measured hold-time at 0.35 V compared to the TGFF. It was also shown that $\mathrm{S}^{2} \mathrm{CFF}$ at 0.32 V has a better worst-case hold-time even when compared to the TGFF at a higher voltage $(0.35 \mathrm{~V})$.
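
The offset subtraction mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the harness yields one raw hold-time reading and one separately measured CK-to-D alignment offset per flip-flop instance, both in picoseconds, and that the offset is removed by plain subtraction. The function names, sign convention, and sample values are hypothetical and are not taken from the measurements.

```python
import statistics


def corrected_hold_times(raw_ps, offset_ps):
    """Subtract the per-instance alignment offset from each raw reading.

    Assumption: raw reading = true hold time + harness offset, so the
    correction is a plain element-wise subtraction.
    """
    if len(raw_ps) != len(offset_ps):
        raise ValueError("need one offset per raw reading")
    return [raw - off for raw, off in zip(raw_ps, offset_ps)]


def hold_time_sigma(hold_ps):
    """Sample standard deviation of the corrected hold times."""
    return statistics.stdev(hold_ps)


# Made-up readings for four flip-flop instances (picoseconds).
raw_readings = [120.0, 95.0, 140.0, 110.0]
harness_offsets = [20.0, -5.0, 30.0, 10.0]

corrected = corrected_hold_times(raw_readings, harness_offsets)
sigma = hold_time_sigma(corrected)
print(corrected, sigma)
```

Comparing the sigma obtained this way for two flip-flop types at the same supply voltage would give the kind of ratio quoted in the text, under the stated assumptions.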

All of the circuit techniques proposed in this dissertation can be used in most VLSI systems. NTC operation in particular could benefit from the proposed techniques, since the new circuits target much improved robustness while still providing excellent performance and low power consumption. The wireless sensor node platform [7] mentioned in Chapter 1 already uses SLC as its standard level conversion circuit and demonstrates robust and power-efficient operation with three different voltage domains $(0.6 \mathrm{~V} / 1.2 \mathrm{~V} / 3.6 \mathrm{~V})$, while the 7T SRAM and $\mathrm{S}^{2} \mathrm{CFF}$ are planned to be implemented in a future version of the system. We expect that these robust circuit designs for low-voltage VLSI can foster the development of future low-power system designs.

### 6.1 Future Works

Based on the circuit techniques presented in this dissertation, there are further opportunities to improve circuit robustness and performance in low-voltage VLSI. As mentioned before, the 7T SRAM is planned to be implemented in the wireless sensor node platform [7], which currently has only a 3 kB SRAM; this SRAM capacity is a limiting factor in achieving more flexible system functionality. The 10T bitcell used in the current version of the sensor node is almost $2 \times$ larger than the 7T bitcell, so a simple estimate suggests that the future version could have at least $\sim 6 \mathrm{kB}$ of SRAM capacity by adopting the 7T SRAM. One more advantage of the 7T SRAM is that it provides a much more robust bit-interleaving capability, which will further improve the array efficiency. There are other concerns specifically related to this sensor node platform; for example, its extremely low sleep power requirement enforces the use of HVT (I/O) devices in the bitcells. In order to achieve reliable operation with these HVT devices, decoupled READ and WRITE is a must. This necessitates the use of bitcells that have $>6$ devices, unless a peripheral assist circuit is also implemented. Most conventional peripheral assist techniques, such as [54][55][56][57], incur a non-negligible area/power overhead. In addition, it is hard to find an assist circuit that is effective under severe variations at such a low voltage [58]; note that the supply voltage used in the sensor node system [7] is 0.6 V, which is in the sub-$V_{T H}$ regime since the HVT devices' threshold voltage is in the range of $0.7 \mathrm{~V} \sim 0.8 \mathrm{~V}$. One way to assist the HVT bitcells is to utilize their extremely low leakage. For example, even if the supply voltage is lost, these HVT bitcells could retain their data for a limited amount of time, and they are expected to survive a longer power-loss duration than standard SVT bitcells. A similar approach has been presented in an advanced process node (hence with more device variations) [59]. It is interesting that the most advanced process nodes and sub-$V_{T H}$ operation in old (hence mature) process nodes are similar in that both are prone to variations.
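
The capacity estimate in this section is simple area scaling, sketched below. The sketch assumes that the area given to the bitcell array stays fixed, that peripheral area is ignored, and that the "almost $2 \times$" bitcell size ratio is rounded to exactly 2. The function and variable names are hypothetical.

```python
def estimated_capacity_kb(current_capacity_kb, bitcell_area_ratio):
    """Capacity that fits in the same array area after shrinking the bitcell.

    bitcell_area_ratio is (old bitcell area) / (new bitcell area).
    Assumption: array area is fixed and peripheral overhead is ignored.
    """
    if current_capacity_kb <= 0 or bitcell_area_ratio <= 0:
        raise ValueError("capacity and area ratio must be positive")
    return current_capacity_kb * bitcell_area_ratio


current_sram_kb = 3.0     # SRAM capacity of the current sensor node (from the text)
ratio_10t_to_7t = 2.0     # "almost 2x" from the text, rounded up (assumption)

future_sram_kb = estimated_capacity_kb(current_sram_kb, ratio_10t_to_7t)
print(f"estimated 7T SRAM capacity: about {future_sram_kb:g} kB")
```

The improved array efficiency from bit-interleaving is not modeled here.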

The small size and low power consumption of the sensor node platform will enable many new applications, most of which have been regarded as impossible due to size and power limitations. Some examples found in recent literature include glucose monitoring systems [60][61] and other bio applications [62][63][64]. However, most of them still suffer from limited battery life and system functionality. Developing a robust and flexible sensing node platform through further circuit innovations is the primary future goal of this dissertation.

### 6.2 Related Publications and Patents

- Yejoong Kim, Dennis Sylvester, and David Blaauw, "$\mathrm{LC}^{2}$: Limited Contention Level Converter for Robust Wide-Range Voltage Conversion," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2011, pp. 188-189.
- Yejoong Kim, Yoonmyung Lee, Dennis Sylvester, and David Blaauw, "SLC: Split-Control Level Converter for Dense and Stable Wide-Range Voltage Conversion," in Proc. European Solid-State Circuits Conf., Sep. 2012, pp. 478-481.
- Yejoong Kim, Dennis Sylvester, and David Blaauw, "A 3.35fW/bit Bit-Interleaved 7T SRAM with Quasi-Static Read and Auto-Shut-Off Sensing," planned to be submitted to IEEE J. Solid-State Circuits, 2015.
- Yejoong Kim, Wanyeong Jung, Inhee Lee, Qing Dong, Michael Henry, Dennis Sylvester, and David Blaauw, "A Static Contention-Free Single-Phase-Clocked 24T Flip-Flop in 45nm for Low Power Applications," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2014, pp. 466-467.
- Yejoong Kim, Michael Brewer Henry, Dennis Michael Sylvester, David Theodore Blaauw, "Static Signal Value Storage Circuitry Using a Single Clock Signal," US Patent 13/860,756, filed on April 11, 2013.
- Yejoong Kim, Dennis Michael Sylvester, David Theodore Blaauw, Brian Tracy Cline, "Measurement Circuitry and Method for Measuring a Clock Node to Output Node Delay of a Flip-Flop," US Patent 14/175,015, filed on February 3, 2014.


## BIBLIOGRAPHY

[1] S. Rusu, H. Muljono, D. Ayers, S. Tam, W. Chen, A. Martin, S. Li, S. Vora, R. Varada, and E. Wang, "Ivytown: A 22nm 15-Core Enterprise Xeon® Processor Family," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2014, pp. 102-103.
[2] A. Wang, K. C. Smith, and L. C. Fujino. (2013, Nov. 1). ISSCC 2014 Trends [Online]. Available: http://www.isscc.org/doc/2014/2014_Trends.pdf
[3] S. Mathew, M. Anders, B. Bloechel, T. Nguyen, R. Krishnamurthy, and S. Borkar, "A 4GHz 300mW 64b Integer Execution ALU with Dual Supply Voltages in 90nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2004, pp. 162-163.
[4] G. Chen, H. Ghaed, R. Haque, M. Wieckowski, Y. Kim, G. Kim, D. Fick, D. Kim, M. Seok, K. Wise, D. Blaauw, and D. Sylvester, "A Cubic-Millimeter Energy-Autonomous Wireless Intraocular Pressure Monitor," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2011, pp. 310-311.
[5] Y.-S. Kuo, P. Pannuto, G. Kim, Z. Foo, I. Lee, B. Kempke, P. Dutta, D. Blaauw, and Y. Lee, "MBus: A 17.5pJ/bit/chip Portable Interconnect Bus for Millimeter-Scale Sensor Systems with 8nW Standby Power," in Proc. IEEE Custom Integrated Circuits Conference, Sep. 2014.
[6] R. Viswanath, V. Wakharkar, A. Watwe, and V. Lebonheur, "Thermal Performance Challenges from Silicon to Systems," Intel Tech. J., Q3 2000.
[7] Y. Lee, S. Bang, I. Lee, Y. Kim, G. Kim, M. H. Ghaed, P. Pannuto, P. Dutta, D. Sylvester, and D. Blaauw, "A Modular $1 \mathrm{~mm}^{3}$ Die-Stacked Sensing Platform With Low Power I$^{2}$C Inter-Die Communication and Multi-Modal Energy Harvesting," IEEE J. Solid-State Circuits, vol. 48, no. 1, pp. 229-243, Jan. 2013.
[8] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen, "A Dynamic Voltage Scaled Microprocessor System," IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1571-1580, Nov. 2000.
[9] J. Howard, S. Dighe, Y. Hoskote, S. Vangal, D. Finan, G. Ruhl, D. Jenkins, H. Wilson, N. Borkar, G. Schrom, F. Pailet, S. Jain, T. Jacob, S. Yada, S. Marella, P. Salihundam, V. Erraguntla, M. Konow, M. Riepen, G. Droege, J. Lindemann, M. Gries, T. Apel, K. Henriss, T. Lund-Larsen, S. Steibl, S. Borkar, V. De, R. Van Der Wijngaart, and T. Mattson, "A 48-Core IA-32 Message-Passing Processor with DVFS in 45nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2010, pp. 108-109.
[10] E. J. Fluhr, J. Friedrich, D. Dreps, V. Zyuban, G. Still, C. Gonzalez, A. Hall, D. Hogenmiller, F. Malgioglio, R. Nett, J. Paredes, J. Pille, D. Plass, R. Puri, P. Restle, D. Shan, K. Stawiasz, Z. T. Deniz, D. Wendel, and M. Ziegler, "POWER8™: A 12-Core Server-Class Processor in 22nm SOI with 7.6Tb/s Off-Chip Bandwidth," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2014, pp. 96-97.
[11] M. Putic, L. Di, B. H. Calhoun, and J. Lach, "Panoptic DVS: A Fine-Grained Dynamic Voltage Scaling Framework for Energy Scalable CMOS Design," in Proc. IEEE Int. Conf. Computer Design, Oct. 2009, pp. 491-47.
[12] A. Muramatsu, T. Yasufuku, M. Nomura, M. Takamiya, H. Shinohara, and T. Sakurai, "12% Power Reduction by Within-Functional-Block Fine-Grained Adaptive Dual Supply Voltage Control in Logic Circuits with 42 Voltage Domains," in Proc. European Solid-State Circuits Conf., Sep. 2011, pp. 191-194.
[13] A. Wang and A. Chandrakasan, "A 180-mV Subthreshold FFT Processor Using a Minimum Energy Design Methodology," IEEE J. Solid-State Circuits, vol. 40, no. 1, pp. 310-319, Jan. 2005.
[14] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, "Near-Threshold Computing: Reclaiming Moore's Law through Energy Efficient Integrated Circuits," Proc. IEEE, vol. 98, no. 2, pp. 253-266, Feb. 2010.
[15] B. Zhai, R. G. Dreslinski, D. Blaauw, T. Mudge, and D. Sylvester, "Energy Efficient Near-threshold Chip Multi-processing," in Proc. ACM/IEEE Int. Symp. Low Power Electronics and Design, Aug. 2007, pp. 32-37.
[16] S. Hanson, B. Zhai, M. Seok, B. Cline, K. Zhou, M. Singhal, M. Minuth, J. Olson, L. Nazhandali, T. Austin, D. Sylvester, and D. Blaauw, "Performance and Variability Optimization Strategies in a Sub-200mV, 3.5pJ/inst, 11nW Subthreshold Processor," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 152-153.
[17] M. Seok, S. Hanson, Y. Lin, Z. Foo, D. Kim, Y. Lee, N. Liu, D. Sylvester, and D. Blaauw, "The Phoenix Processor: A 30pW Platform for Sensor Applications," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2008, pp. 188-189.
[18] W.-T. Wang, M.-D. Ker, M.-C. Chiang, C.-H. Chen, "Level Shifters for High-Speed 1V to 3.3V Interfaces in a 0.13µm Cu-Interconnection/Low-k CMOS Technology," in Proc. VLSI Technology, Systems, and Applications, Apr. 2001, pp. 307-310.
[19] H. Shao and C.-Y. Tsui, "A Robust, Input Voltage Adaptive and Low Energy Consumption Level Converter for Sub-threshold Logic," in Proc. European Solid-State Circuits Conf., Sep. 2007, pp. 312-315.
[20] I. J. Chang, J.-J. Kim, and K. Roy, "Robust Level Converter Design for Sub-threshold Logic," in Proc. Int. Low Power Electronics and Design, Oct. 2006, pp. 14-19.
[21] I. J. Chang, J.-J. Kim, K. Kim, and K. Roy, "Robust Level Converter for Sub-Threshold/Super-Threshold Operation: 100mV to 2.5V," IEEE Trans. Very Large Scale Integration Systems, vol. 19, no. 8, pp. 1429-1437, Aug. 2011.
[22] Y. Lin and D. Sylvester, "Single Stage Static Level Shifter Design for Subthreshold to I/O Voltage Conversion," in Proc. ACM/IEEE Int. Symp. Low Power Electronics and Design, Aug. 2008, pp. 197-200.
[23] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar, "Near-threshold Voltage (NTV) Design - Opportunities and Challenges," in Proc. ACM/IEEE Design Automation Conference, Jun. 2012, pp. 1149-1154.
[24] L. Chang, D. Fried, J. Hergenrother, J. Sleight, R. Dennard, R. Montoye, L. Sekaric, S. McNab, A. Topol, C. Adams, K. Guarini, and W. Haensch, "Stable SRAM Cell Design for the 32nm Node and Beyond," in Symp. VLSI Technology Dig. Tech. Papers, Jun. 2005, pp. 128-129.
[25] L. Chang, R. Montoye, Y. Nakamura, K. Batson, R. Eickemeyer, R. Dennard, W. Haensch, and D. Jamsek, "An 8T-SRAM for Variability Tolerance and Low-Voltage Operation in High-Performance Caches," IEEE J. Solid-State Circuits, vol. 43, no. 4, pp. 956-963, Apr. 2008.
[26] J. Kulkarni, B. Geuskens, T. Karnik, M. Khellah, J. Tschanz, and V. De, "Capacitive-Coupling Wordline Boosting with Self-Induced $V_{C C}$ Collapse for Write $V_{M I N}$ Reduction in 22-nm 8T SRAM," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2012, pp. 234-235.
[27] T. Kim, J. Liu, and C. Kim, "A Voltage Scalable 0.26V, 64kb 8T SRAM with $V_{M I N}$ Lowering Techniques and Deep Sleep Mode," IEEE J. Solid-State Circuits, vol. 44, no. 6, pp. 1785-1795, Jun. 2009.
[28] B. H. Calhoun, A. P. Chandrakasan, "A 256-kb 65-nm Sub-threshold SRAM Design for Ultra-Low-Voltage Operation," IEEE J. Solid-State Circuits, vol. 42, no. 3, pp. 680-688, Mar. 2007.
[29] Y. Lee, D. Kim, J. Cai, I. Lauer, L. Chang, S. J. Koester, D. Blaauw, D. Sylvester, "Low-Power Circuit Analysis and Design Based on Heterojunction Tunneling Transistors (HETTs)," IEEE Trans. Very Large Scale Integration Systems, vol. 21, no. 9, pp. 1632-1643, Sep. 2013.
[30] M. Chang, M. Chen, L. Chen, S. Yang, Y. Kuo, J. Wu, H. Su, Y. Chu, W. Wu, T. Yang, and H. Yamauchi, "A Sub-0.3V Area-Efficient L-shaped 7T SRAM with Read Bitline Swing Expansion Schemes Based on Boosted Read-Bitline, Asymmetric- $V_{T H}$ Read-Port, and Offset Cell $V_{D D}$ Biasing Techniques," IEEE J. Solid-State Circuits, vol. 48, no. 10, pp. 2558-2569, Oct. 2013.
[31] D. F. Wendel, R. Kalla, J. Warnock, R. Cargnoni, S. G. Chu, J. G. Clabes, D. Dreps, D. Hrusecky, J. Friedrich, S. Islam, J. Kahle, J. Leenstra, G. Mittal, J. Paredes, J. Pille, P. J. Restle, B. Sinharoy, G. Smith, W. J. Starke, S. Taylor, J. Van Norstrand, S. Weitzel, P. G. Williams, and V. Zyuban, "POWER7™, a Highly Parallel, Scalable Multi-Core High End Server Processor," IEEE J. Solid-State Circuits, vol. 46, no. 1, pp. 145-161, Jan. 2011.
[32] J. L. Shin, R. Golla, H. Li, S. Dash, Y. Choi, A. Smith, H. Sathianathan, M. Joshi, H. Park, M. Elgebaly, S. Turullols, S. Kim, R. Masleid, G. K. Konstadinidis, M. J. Doherty, G. Grohoski, and C. McAllister, "The Next Generation 64b SPARC Core in a T4 SoC Processor," IEEE J. Solid-State Circuits, vol. 48, no. 1, pp. 82-90, Jan. 2013.
[33] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and Comparison in the Energy-Delay-Area Domain of Nanometer CMOS Flip-Flops: Part I - Methodology and Design Strategies," IEEE Trans. Very Large Scale Integration Systems, vol. 19, no. 5, pp. 725-736, May 2011.
[34] M. Alioto, E. Consoli, and G. Palumbo, "Analysis and Comparison in the Energy-Delay-Area Domain of Nanometer CMOS Flip-Flops: Part II - Results and Figures of Merit," IEEE Trans. Very Large Scale Integration Systems, vol. 19, no. 5, pp. 737-750, May 2011.
[35] C. K. Teh, T. Fujita, H. Hara, and M. Hamada, "A 77\% Energy-Saving 22-Transistor Single-Phase Clocking D-Flip-Flop with Adaptive-Coupling Configuration in 40nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2011, pp. 338-339.
[36] S. D. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. J. Sullivan, and T. Grutkowski, "The Implementation of the Itanium 2 Microprocessor," IEEE J. Solid-State Circuits, vol. 37, no. 11, pp. 1448-1460, 2002.
[37] J. Yuan and C. Svensson, "High-Speed CMOS Circuit Technique," IEEE J. Solid-State Circuits, vol. 24, no. 1, pp. 62-70, 1989.
[38] C.-H. Chen, K. Bowman, C. Augustine, Z. Zhang, and J. Tschanz, "Minimum Supply Voltage for Sequential Logic Circuits in a 22nm Technology," in Proc. ACM/IEEE Int. Symp. Low Power Electronics and Design, Sep. 2013, pp. 181-186.
[39] H. Kaul, M. A. Anders, S. K. Mathew, S. K. Hsu, A. Agarwal, R. K. Krishnamurthy, and S. Borkar, "A 300mV 494GOPS/W Reconfigurable Dual-Supply 4-Way SIMD Vector Processing Accelerator in 45nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2009, pp. 260-261.
[40] M. Seok, S. Hanson, J. Seo, D. Sylvester, and D. Blaauw, "Robust Ultra-Low Voltage ROM Design," in Proc. IEEE Custom Integrated Circuits Conference, Sep. 2008, pp. 423-426.
[41] S. Dighe, S. Gupta, V. De, S. Vangal, N. Borkar, S. Borkar, and K. Roy, "A 45nm 48-Core IA Processor with Variation-Aware Scheduling and Optimal Core Mapping," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2011, pp. 250-251.
[42] Y. Lee, B. Giridhar, Z. Foo, D. Sylvester, and D. Blaauw, "A 660pW Multi-Stage Temperature-Compensated Timer for Ultra-Low-Power Wireless Sensor Node Synchronization," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2011, pp. 46-47.
[43] B. Calhoun and A. Chandrakasan, "A 256kb Sub-threshold SRAM in 65nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2006, pp. 2592-2593.
[44] M. Chang, J. Wu, K. Chen, Y. Chen, Y. Chen, R. Lee, H. Liao, and H. Yamauchi, "A Differential Data-Aware Power-Supplied (D$^{2}$AP) 8T SRAM Cell with Expanded Write/Read Stabilities for Lower VDDmin Applications," IEEE J. Solid-State Circuits, vol. 45, no. 6, pp. 1234-1245, Jun. 2010.
[45] S. Jain, S. Khare, S. Yada, V. Ambili, P. Salihundam, S. Ramani, S. Muthukumar, M. Srinivasan, A. Kumar, S. K. Gb, R. Ramanarayanan, V. Erraguntla, J. Howard, S. Vangal, S. Dighe, G. Ruhl, P. Aseron, H. Wilson, N. Borkar, V. De, and S. Borkar, "A 280mV-to-1.2V Wide-Operating-Range IA-32 Processor in 32nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2012, pp. 66-67.
[46] B. Giridhar, M. Fojtik, D. Fick, D. Sylvester, and D. Blaauw, "Pulse Amplification Based Dynamic Synchronizers with Metastability Measurement Using Capacitance De-rating," in Proc. IEEE Custom Integrated Circuits Conference, Sep. 2013.
[47] D. Fick, N. Liu, Z. Foo, M. Fojtik, J. Seo, D. Sylvester, and D. Blaauw, "In Situ Delay-Slack Monitor for High-Performance Processors Using an All-Digital Self-Calibrating 5ps Resolution Time-to-Digital Converter," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2010, pp. 188-189.
[48] E. Consoli, M. Alioto, G. Palumbo, and J. Rabaey, "Conditional Push-Pull Pulsed Latches with 726fJ.ps Energy-Delay Product in 65nm CMOS," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2012, pp. 482-483.
[49] S. Nomura, F. Tachibana, T. Fujita, C. K. Teh, H. Usui, F. Yamane, Y. Miyamoto, C. Kumtornkittikul, H. Hara, T. Yamashita, J. Tanabe, M. Uchiyama, Y. Tsuboi, T. Miyamori, T. Kitahara, H. Sato, Y. Homma, S. Matsumoto, K. Seki, Y. Watanabe, M. Hamada, and M. Takahashi, "A 9.7mW AAC-Decoding, 620mW H.264 720p 60fps Decoding, 8-Core Media Processor with Embedded Forward-Body-Biasing and Power-Gating Circuit in 65nm CMOS Technology," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2008, pp. 262-263.
[50] Y. Ueda, H. Yamauchi, M. Mukuno, S. Furuichi, M. Fujisawa, F. Qiao, and H. Yang, "6.33mW MPEG Audio Decoding on a Multimedia Processor," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2006, pp. 1636-1637.
[51] B.-S. Kong, S.-S. Kim, and Y.-H. Jun, "Conditional-Capture Flip-Flop for Statistical Power Reduction," IEEE J. Solid-State Circuits, vol. 36, no. 8, pp. 1263-1271, Aug. 2001.
[52] H. Partovi, R. Burd, U. Salim, F. Weber, L. DiGregorio, and D. Draper, "Flow-through Latch and Edge-Triggered Flip-Flop Hybrid Elements," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 1996, pp. 138-139.
[53] N. Nedovic, W. W. Walker, and V. G. Oklobdzija, "A Test Circuit for Measurement of Clocked Storage Element Characteristics," IEEE J. Solid-State Circuits, vol. 39, no. 8, pp. 1294-1304, Aug. 2004.
[54] M. Yabuuchi, K. Nii, Y. Tsukamoto, S. Ohbayachi, Y. Nakase, and H. Shinohara, "A 45nm 0.6V Cross-Point 8T SRAM with Negative Biased Read/Write Assist," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2009, pp. 158-159.
[55] M. Sinangil, H. Mair, and A. P. Chandrakasan, "A 28nm High-Density 6T SRAM with Optimized Peripheral-Assist Circuits for Operation Down to 0.6V," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2011, pp. 260-262.
[56] A. Bhavnagarwala, S. Kosonocky, C. Radens, Y. Chan, K. Stawiasz, U. Srinivasan, S. P. Kowalczyk, and M. M. Ziegler, "A sub-600mV, Fluctuation Tolerant 65nm CMOS SRAM Array with Dynamic Cell Biasing," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2007, pp. 78-79.
[57] H. Pilo, I. Arsovski, K. Batson, G. Braceras, J. Gabric, R. Houle, S. Lamphier, C. Radens, and A. Seferagic, "A 64 Mb SRAM in 32 nm High-k Metal-Gate SOI Technology with 0.7 V Operation Enabled by Stability, Write-Ability and Read-Ability Enhancements," IEEE J. Solid-State Circuits, vol. 47, no. 1, pp. 97-106, Jan. 2012.
[58] B. Zimmer, S. O. Toh, H. Vo, Y. Lee, O. Thomas, K. Asanovic, and B. Nikolic, "SRAM Assist Techniques for Operation in a Wide Voltage Range in 28nm CMOS," IEEE Trans. Circuits and Systems - II: Express Briefs, vol. 59, no. 12, pp. 853-857, Dec. 2012.
[59] E. Karl, Y. Wang, Y.-G. Ng, Z. Guo, F. Hamzaoglu, U. Bhattacharya, K. Zhang, K. Mistry, and M. Bohr, "A 4.6 GHz 162 Mb SRAM Design in 22 nm Tri-Gate CMOS Technology with Integrated Active $V_{M I N}$-Enhancing Assist Circuitry," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2012, pp. 230-231.
[60] A. D. Dehennis, M. Mailand, D. Grice, S. Getzlaff, and A. E. Colvin, "A Near-Field-Communication (NFC) Enabled Wireless Fluorimeter for Fully Implantable Biosensing Applications," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2013, pp. 298-299.
[61] S. Tankiewics, J. Schaefer, and A. Dehennis, "A Co-Planar, Near Field Communication Telemetry Link for a Fully-Implantable Glucose Sensor Using High Permeability Ferrites," in Proc. IEEE Sensors, Nov. 2013.
[62] D. Jeon, Y.-P. Chen, Y. Lee, Y. Kim, Z. Foo, G. Kruger, H. Oral, O. Berenfeld, Z. Zhang, D. Blaauw, and D. Sylvester, "An Implantable 64nW ECG-Monitoring Mixed-Signal SoC for Arrhythmia Diagnosis," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2014, pp. 416-417.
[63] S.-Y. Hsu, Y. Ho, Y. Tseng, T.-Y. Lin, P.-Y. Chang, J.-W. Lee, J.-H. Hsiao, S.-M. Chuang, T.-Z. Yang, P.-C. Liu, T.-F. Yang, R.-J. Chen, C. Su, and C.-Y. Lee, "A Sub-100µW Multi-Functional Cardiac Signal Processor for Mobile Healthcare Applications," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2012, pp. 156-157.
[64] S. Kim, L. Yan, S. Mitra, M. Osawa, Y. Harada, K. Tamiya, C. van Hoof, and R. F. Yazicioglu, "A 20µW Intra-Cardiac Signal-Processing IC with 82dB Bio-Impedance Measurement Dynamic Range and Analog Feature Extraction for Ventricular Fibrillation Detection," in IEEE Int. Solid-State Circuits Conf. Dig. Tech. Papers, Feb. 2013, pp. 302-303.

