# Design of VLSI Integrated Circuits: A (very) deep dive into processors...

**Olivier Sentieys** 

IRISA/INRIA – Cairn team, University of Rennes 1

olivier.sentieys@inria.fr



http://people.rennes.inria.fr/Olivier.Sentieys/?page\_id=95



#### VLSI Design

• Chips, logic gates and transistors



Intel's Xeon Chip




### **Key Questions**

- A deep dive into processors... (I hope not too deep)
- What is CMOS? How are basic logic gates, registers and memory designed?
- How are the delay and the maximum frequency calculated?
- How much power does my processor consume?
- What can advanced semiconductor technology bring?
- Are (homogeneous) multicores the right solution for performance or energy efficiency?

# Outline

- The Fundamental Element: MOSFET Transistor
- Design of CMOS Cells: Combinatorial Logic
- Memory Cells
- Delay
- Power Consumption
- Synchronous Design
- Technology Scaling (Moore's Law revisited)
- Multicore: power and utilization walls

# Fundamental Building Block: MOSFET Transistor



#### The Basic Element: Transistor

Transistor as a switch





- Vgs > Vt: NMOS on
  - Resistance R<sub>DS</sub>
- Vgs < Vt: NMOS off
  - Leakage $I_{off}$

- Gate: capacitance C<sub>G</sub>
- Switch: resistance R<sub>DS</sub>

#### The Basic Element: Transistor

- Cutoff or sub-threshold mode: $V_{GS} < V_t$, $R_{DS} \approx +\infty$

- Saturation mode: $V_{GS} > V_t \text{ and } V_{DS} > V_{GS} - V_t$
  - A channel is created which allows current to flow between the drain and the source



#### **MOS Transistor Models**



- W: gate width
- L: gate length
- *tox*: oxide thickness (≈ L/10)
- $K = \mu \cdot \varepsilon \cdot W/(tox \cdot L) = \mu \cdot Cox \cdot W/L = k \cdot W/L$
- *Cox*: gate oxide capacitance per unit area
- $\mu$: charge-carrier effective mobility
  - NMOS (electrons): $\mu_N = 500 \text{ cm}^2/(\text{V} \cdot \text{s}) \approx 2 \mu_P$
  - PMOS (holes): $\mu_P = 270 \text{ cm}^2/(\text{V} \cdot \text{s})$
- $\varepsilon$: oxide permittivity ≈ $4 \varepsilon_0 = 3.5 \times 10^{-13}$ F/cm

 $I_{ds} = \begin{cases} 0 & \text{off} & Vgs - Vth < 0\\ K\left[(Vgs - Vth)Vds - \frac{Vds^2}{2}\right] & \text{linear} & 0 < Vds < Vgs - Vth\\ \frac{K}{2}(Vgs - Vth)^2 & \text{saturated} & 0 < Vgs - Vth < Vds \end{cases}$ 

- K defines transistor speed,  $K \propto W/L$ ,  $K_{NMOS} \sim 2.K_{PMOS}$
- Temperature increases  $\rightarrow \mu$  decreases
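
The three regions of the $I_{ds}$ model can be sketched in a few lines of Python. This is only an illustration of the first-order equations above: the function name `ids` and the numeric values of K, Vth and the bias voltages are assumptions chosen for the example, not values from the course.

```python
def ids(vgs, vds, k, vth):
    """First-order NMOS drain current (A) for the off, linear and saturated regions."""
    vov = vgs - vth                          # overdrive voltage Vgs - Vth
    if vov <= 0:
        return 0.0                           # off
    if vds < vov:
        return k * (vov * vds - vds**2 / 2)  # linear region
    return k / 2 * vov**2                    # saturated region


# Assumed illustrative values: K = 200 uA/V^2, Vth = 0.4 V
K, VTH = 200e-6, 0.4
i_off = ids(0.2, 1.0, K, VTH)   # Vgs below threshold
i_lin = ids(1.0, 0.2, K, VTH)   # Vds < Vgs - Vth
i_sat = ids(1.0, 1.0, K, VTH)   # Vds > Vgs - Vth
print(i_off, i_lin, i_sat)
```

The linear and saturated expressions meet at Vds = Vgs - Vth, so the modelled current is continuous across that boundary.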

#### **NMOS** Parasitic Elements




#### Transistors

- Bulk CMOS
- Ultra Thin Body (FD) SOI
  - Total dielectric isolation
    - Lower S/D capacitances & leakages
    - Latch-up immunity
  - Improved VT variation
  - Promoted by STMicroelectronics







#### Transistors

• Intel FinFET: transistors go 3D

#### 22 nm Tri-Gate Transistor

Tri-Gate transistors can have multiple fins connected together to increase total drive strength for higher performance.

Source: Intel





# **NMOS/PMOS Transistors**

- NMOS
  - A '0' is well transmitted
  - A degraded '1' is transmitted (Vdd-Vtn)
  - Vgs < Vtn: off
  - Vgs > Vtn: on

- PMOS
  - A '1' is well transmitted
  - A degraded '0' is transmitted (Vss+|Vtp|)
  - |Vgs| < |Vtp|: off
  - |Vgs| > |Vtp|: on


#### **Combinatorial Logic Cells**

- Complementary Logic (CMOS)
  - CMOS Static Logic



#### NAND and NOR



#### Complex gates

One CMOS stage can generate the complement of any sum-of-products or product-of-sums:

 $S = f(E1, E2, ..., EN) = \overline{SUM [PROD]} = \overline{PROD [SUM]}$ 



# General rules for constructing F(X)




### Static Logic

• Examples

Direct application of the design rules

- Example:  $S = \overline{A.B + C.D}$ 
  - AOI (And-Or-Invert) gate



- Multiple-Stage Complex Functions
  - Optimisation of the logic equation
  - Trade-off between speed and area
  - S3 = A.B.C.D
  - S4 = !A.B+A.!B (XOR)
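
The AOI example $S = \overline{A.B + C.D}$ can be checked with a small switch-level sketch. It assumes the usual static CMOS construction (series NMOS for AND, parallel NMOS for OR, and the dual arrangement for the PMOS network); the function names are hypothetical and chosen for this illustration.

```python
from itertools import product


def pull_down(a, b, c, d):
    # NMOS network: (A in series with B) in parallel with (C in series with D)
    return (a and b) or (c and d)


def pull_up(a, b, c, d):
    # PMOS network (dual): (A parallel B) in series with (C parallel D).
    # A PMOS transistor conducts when its gate input is 0.
    return ((not a) or (not b)) and ((not c) or (not d))


def aoi22(a, b, c, d):
    # Output is 1 when the pull-up network conducts, 0 otherwise
    return int(pull_up(a, b, c, d))


rows = list(product([0, 1], repeat=4))
# Exactly one of the two networks conducts for every input combination
complementary = all(bool(pull_up(*r)) != bool(pull_down(*r)) for r in rows)
# The gate output matches S = not(A.B + C.D)
matches = all(aoi22(*r) == int(not (r[0] and r[1] or r[2] and r[3])) for r in rows)
print(complementary, matches)
```

Because the two networks are never on together, there is no static path from Vdd to Vss, which is the property that makes the single-stage gate a valid static CMOS cell.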

### Pass-Transistor Logic

• Switch or Transmission Gate



• Example: 2-input multiplexer

$$S = \begin{cases} A & \text{if } C = 0 \\ B & \text{if } C = 1 \end{cases}$$

• Example: XOR


# **Elementary Memory Cells**

• Static Memory Basic Cell: Latch









### **Elementary Memory Cells**

- Dynamic Memory Cell
  - MOS Capacitor: Cl = f(area)



- State ('1') (voltage level) is stored for a few ms
  - Leakage current
  - Need for refreshing state
- Ex. Shift Register



# Sequential Logic Circuits

- D Flip-Flop (edge-triggered)
  - Two latches in series





- D is sampled by inverter (1) when clk = 0
- Latch (1) and (2) keeps the D value when clk = 1 until !D is transferred to the second latch (3) and (4)
- Asynchronous clear signal: replace inverters (1) and (4) with NAND gates
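
The "two latches in series" behaviour can be sketched as a small behavioural model. This is a simplification under stated assumptions: the signal inversions of the real inverter-based circuit are ignored, and the class name `DFlipFlop` and its `step` method are hypothetical names used only for this example.

```python
class DFlipFlop:
    """Behavioural master-slave D flip-flop built from two level-sensitive latches."""

    def __init__(self):
        self.master = 0
        self.slave = 0

    def step(self, d, clk):
        if clk == 0:
            self.master = d            # first latch samples D, second latch holds
        else:
            self.slave = self.master   # first latch holds, second latch is updated
        return self.slave


ff = DFlipFlop()
stimulus = [(1, 0), (1, 1), (0, 1), (0, 0), (0, 1)]   # (D, clk) pairs
trace = [ff.step(d, clk) for d, clk in stimulus]
print(trace)
```

In the third step D changes while clk = 1 and the output does not follow it, which is what makes the cell edge-triggered rather than level-sensitive.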

# Memory

- L2 cache contains 4 million SRAM cells
  - Rows/columns of 2000 cells







#### 6-Transistor CMOS SRAM Cell

- Latch where WL replaces clock
- Dual-rail bit-lines required to increase noise margin during R/W
- WL selection: WL[i] = 1
- Write 0: BL=0 and !BL=1 ⇔ Reset of Latch
- Read: BL and !BL pre-charged to 1, WL selection -> BL=Q and !BL=!Q
  - Sense amplifiers act as comparators to speed up the transfer of the latch value to the output





#### **3-Transistor DRAM**

- 2 lines WL and BL: read and write
- No amplification






# Simplified Delay Model



- Rising Output:  $t_{plh} = Rp.C_L$
- Falling Output:  $t_{phl} = Rn.C_L$
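
As a quick numerical illustration of the simplified model, the sketch below evaluates both delays for an inverter. The resistance and capacitance values are assumptions picked for the example (with Rp twice Rn, in line with $K_{NMOS} \sim 2.K_{PMOS}$ for equal transistor sizes); they are not taken from the course.

```python
def rc_delay(r_ohm, c_farad):
    """Simplified gate delay: equivalent switch resistance times load capacitance."""
    return r_ohm * c_farad


# Assumed values: Rn = 10 kOhm, Rp = 20 kOhm, CL = 10 fF
RN, RP, CL = 10e3, 20e3, 10e-15

t_plh = rc_delay(RP, CL)   # rising output, load charged through the PMOS
t_phl = rc_delay(RN, CL)   # falling output, load discharged through the NMOS
print(t_plh, t_phl)
```

With these numbers the rising transition takes about 200 ps and the falling one about 100 ps.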

#### **Delay of Complex Gates**

#### k-input NAND



 $t_{plh} =$  $t_{phl} =$ 

#### k-input NOR



 $t_{plh} =$  $t_{phl} =$ 

# **Transistor Sizing**

Complex function

— F =

- Tplh =
- Tphl =
- Indicate critical path
- Which input values give the best/worst case delay?



#### Logic-Level Delay Model

- Fan-In (or Drive): relative to size of transistors
  - Basic inverter is 1x
- Fan-Out: ratio between load capacitance and drive
- Relative Fan-Out (RF): ratio between fan-out and next-stage fan-in



#### Logic-Level Delay Model

- Tp = transport delay + inertial delay = TD + ID
- ID = RF.UD
- Equivalent to  $Tp = R_{DS}[C_{int} + C_{ext}]$




#### **Power Consumption**

- Dynamic power: *Pdyn* 
  - Charge and discharge of node capacitance
- Short-circuit power: Psc
  - Short-circuit path in logic cells (Vdd → Vss) during switching
  - Strongly depends on rise time and on Vth (NMOS/PMOS)
- Static power: Ps
  - Sub-threshold leakage current (~OFF)
  - Source/Drain-Bulk junction leakage

P = Pdyn + Psc + Ps



#### Dynamic power

• Energy per transition =  $C_L V dd^2$ 

**Power** 

• Power = Energy per transition x rate of transition

$$Pc = C_{L} Vdd^{2} f_{0 \rightarrow 1}$$

$$Pc = C_{L} Vdd^{2} f Prob_{0 \rightarrow 1}$$

$$Pc = \alpha C_{L} Vdd^{2} f$$

$$Pc = \alpha.f.C_{L}.Vdd^{2}$$

 $\alpha$ : activity, C<sub>L</sub>: total load capacitance, f : frequency

Data dependent

**Activity dependent** 
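
A short numerical sketch of $Pc = \alpha.f.C_{L}.Vdd^{2}$ follows. The operating point (activity, frequency, total switched capacitance and supply voltage) is assumed for illustration only.

```python
def dynamic_power(alpha, f_hz, c_load, vdd):
    """Dynamic power in watts: alpha * f * C_L * Vdd^2."""
    return alpha * f_hz * c_load * vdd**2


# Assumed values: alpha = 0.1, f = 1 GHz, C_L = 1 nF (whole chip), Vdd = 1.2 V
p_nominal = dynamic_power(0.1, 1e9, 1e-9, 1.2)
# Same circuit with Vdd halved: the quadratic term gives a 4x reduction
p_half_vdd = dynamic_power(0.1, 1e9, 1e-9, 0.6)
print(p_nominal, p_half_vdd)
```

The supply voltage is the strongest lever because it enters squared, while activity, frequency and capacitance only act linearly.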

# Activity

- Activity  $\alpha_i$  is the probability of a  $0 \rightarrow 1$  transition at the output of a gate
- Example: AND gate
  - $P_S = P(S=1) = P_A P_B$
  - $\alpha_i = P_S(1 - P_S)$
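
The AND-gate activity can be cross-checked by brute force. The sketch assumes that inputs A and B are independent of each other and uncorrelated from one clock cycle to the next, each being 1 with probability 1/2; the function name is hypothetical.

```python
from itertools import product


def and_gate_activity(p_a, p_b):
    """Probability of a 0 -> 1 output transition: Ps * (1 - Ps) with Ps = Pa * Pb."""
    p_s = p_a * p_b
    return p_s * (1 - p_s)


alpha = and_gate_activity(0.5, 0.5)

# Cross-check: enumerate two consecutive input vectors (a0, b0) then (a1, b1)
# and count the cases where the output goes from 0 to 1.
count = sum(
    1
    for a0, b0, a1, b1 in product([0, 1], repeat=4)
    if (a0 & b0) == 0 and (a1 & b1) == 1
)
alpha_enum = count / 16
print(alpha, alpha_enum)
```

Both approaches give 3/16. The agreement relies on the independence assumption, which no longer holds when the inputs of a gate share a common source.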



Activity propagation


# Propagating Activity is not So Simple

Conditional probabilities


- Glitches: gate delay
  - Significant in arithmetic



## Example: Adder



### Static Power: Leakage

• High performance



Low leakage



#### I<sub>off</sub>: Sub-threshold Leakage Current

- Exponential in -Vt (grows exponentially as Vt is lowered)
- Exponential in temperature
- ~Linear in device count

 $\mathbf{P_{stat_i}} = \mathbf{N}.\mathbf{I_{off}}.\mathbf{Vdd}$ 

#### Sum-up: Power at Gate Level





 $\mathbf{P_i} = \alpha_i \cdot \mathbf{f_i} \cdot \mathbf{C_i} \cdot \mathbf{Vdd^2} + \mathbf{I_{leak_i}} \cdot \mathbf{Vdd}$ 

$$\mathbf{P} = \sum_{\mathbf{i}} \left[ \alpha_{\mathbf{i}}.\mathbf{f_i}.\mathbf{C_i}.\mathbf{Vdd^2} + \mathbf{I_{leak_i}}.\mathbf{Vdd} \right]$$
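
The sum over gates translates directly into code. The gate list below is made up for the example (two gates with assumed activity, frequency, capacitance and leakage current), and the function name `total_power` is hypothetical.

```python
def total_power(gates, vdd):
    """Sum of dynamic and leakage power over all gates, in watts."""
    return sum(
        g["alpha"] * g["f"] * g["c"] * vdd**2 + g["i_leak"] * vdd
        for g in gates
    )


# Assumed per-gate parameters: activity, frequency (Hz), capacitance (F), leakage (A)
gates = [
    {"alpha": 0.10, "f": 1e9, "c": 2e-15, "i_leak": 1e-9},
    {"alpha": 0.25, "f": 1e9, "c": 4e-15, "i_leak": 2e-9},
]
p_total = total_power(gates, vdd=1.0)
print(p_total)
```

With these values the dynamic terms (0.2 µW and 1 µW) dominate the leakage terms (1 nW and 2 nW), giving about 1.203 µW in total.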

#### Power vs. Performance



- Delay of a gate
- Dynamic power  $P_{dyn_i} = \alpha_i . f_{clk} . C_i . Vdd^2$
- Leakage power  $P_{stat_i} = N.I_{off}.Vdd$

#### Dynamic Power vs. Performance

Decreasing Vdd reduces power but increases delay

 $\mathbf{P_{dyn_i}} = \alpha_i . \mathbf{f_{clk}} . \mathbf{C_i} . \mathbf{Vdd^2}$ 



## Minimum Energy per Operation

• Putting all together



#### **Conclusion:** Power

$$P = \alpha f C_L V_{DD}^2 + V_{DD} I_{peak} (P_{0 \rightarrow 1} + P_{1 \rightarrow 0}) + V_{DD} I_{leak}$$

- Dynamic power (≈ 40-70% today and decreasing relatively)
- Short-circuit power (≈ 10% today and decreasing absolutely)
- Leakage power (≈ 20-50% today and increasing)

$$P = \frac{energy}{operation} \times rate + static \ power$$

# **Reducing Power**

- Power gating, multi-Vt
- Clock gating
- Vdd scaling
  - Parallel, pipeline
- Activity reduction
  - Pre-computation, correlation, encoding
- Glitch Power Reduction



#### **Dynamic Power Management**

- Dynamic Voltage and Frequency Scaling (DVFS)
- Reduce speed (clock freq.) and Vdd depending on processor activity




# **Timing Parameters**

- D Flip-Flop
  - Setup Time: Tsetup
  - Hold Time: Thold
  - Propagation Time: Tp
  - on Clock and Reset

#### Propagation Delay

In nanoseconds, as a function of C (load in pF) and Tr (input transition time in ns)

| Cell     | Path | Event       | Best (1.32V, -40C)          | Worst (1.08V, 125C)         | Nominal (1.2V, 25C)         |
|----------|------|-------------|-----------------------------|-----------------------------|-----------------------------|
| FD1QLL   | CP-Q | CP_Q (fall) | 0.082 + 0.119*Tr + 1.221*C  | 0.195 + 0.179*Tr + 2.777*C  | 0.125 + 0.148*Tr + 1.731*C  |
| FD1QLL   | CP-Q | CP_Q (rise) | 0.075 + 0.118*Tr + 1.672*C  | 0.178 + 0.180*Tr + 3.473*C  | 0.113 + 0.148*Tr + 2.408*C  |
| FD1QLLP  | CP-Q | CP_Q (fall) | 0.087 + 0.121*Tr + 0.644*C  | 0.205 + 0.182*Tr + 1.428*C  | 0.133 + 0.150*Tr + 0.903*C  |
| FD1QLLP  | CP-Q | CP_Q (rise) | 0.079 + 0.120*Tr + 0.836*C  | 0.189 + 0.182*Tr + 1.727*C  | 0.120 + 0.150*Tr + 1.198*C  |
| FD1QLLX4 | CP-Q | CP_Q (fall) | 0.111 + 0.122*Tr + 0.342*C  | 0.267 + 0.183*Tr + 0.760*C  | 0.173 + 0.152*Tr + 0.482*C  |
| FD1QLLX4 | CP-Q | CP_Q (rise) | 0.093 + 0.121*Tr + 0.425*C  | 0.224 + 0.184*Tr + 0.891*C  | 0.141 + 0.151*Tr + 0.612*C  |
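
The linear delay expressions in the table can be evaluated directly. The sketch below uses the nominal-corner CP_Q (rise) coefficients copied from the table; the chosen transition time and load values are assumptions, and the function name is hypothetical.

```python
# Nominal (1.2 V, 25 C) CP_Q (rise) coefficients from the table:
# (intrinsic delay in ns, ns per ns of Tr, ns per pF of C)
NOMINAL_RISE = {
    "FD1QLL":   (0.113, 0.148, 2.408),
    "FD1QLLP":  (0.120, 0.150, 1.198),
    "FD1QLLX4": (0.141, 0.151, 0.612),
}


def cp_to_q_delay(cell, tr_ns, c_pf):
    """Clock-to-Q propagation delay in ns for a given input slope and load."""
    t0, k_tr, k_c = NOMINAL_RISE[cell]
    return t0 + k_tr * tr_ns + k_c * c_pf


# Assumed operating points: Tr = 0.1 ns, heavy load 0.1 pF and light load 0.01 pF
heavy_x1 = cp_to_q_delay("FD1QLL", 0.1, 0.1)
heavy_x4 = cp_to_q_delay("FD1QLLX4", 0.1, 0.1)
light_x1 = cp_to_q_delay("FD1QLL", 0.1, 0.01)
light_x4 = cp_to_q_delay("FD1QLLX4", 0.1, 0.01)
print(heavy_x1, heavy_x4, light_x1, light_x4)
```

At the heavy load the X4 drive is clearly faster (about 0.217 ns against 0.369 ns), while at the light load its larger intrinsic delay makes it slightly slower, which is the usual trade-off when choosing a drive strength.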





#### Truth Table

| IQ | Q  |
|----|----|
| IQ | IQ |

#### **Truth Table**

| D | CP | IQ | IQ |
|---|----|----|----|
| D | /  | -  | D  |
| - | -  | IQ | IQ |

#### **Physical Dimensions**

| Property  | FD1QLL | FD1QLLP | FD1QLLX4 |
|-----------|--------|---------|----------|
| Area(um2) | 28.241 | 28.241  | 30.258   |

#### Capacitance

picoFarads

| Cell     | Property      | Best (1.32V, -40C) | Worst (1.08V, 125C) | Nominal (1.2V, 25C) |
|----------|---------------|--------------------|---------------------|---------------------|
| FD1QLL   | CP Input Cap. | 0.0032             | 0.0028              | 0.0030              |
| FD1QLL   | Q Max Load    | 0.160              | 0.160               | 0.160               |
| FD1QLL   | D Input Cap.  | 0.0023             | 0.0020              | 0.0021              |
| FD1QLLP  | Q Max Load    | 0.320              | 0.320               | 0.320               |
| FD1QLLP  | D Input Cap.  | 0.0022             | 0.0019              | 0.0021              |
| FD1QLLP  | CP Input Cap. | 0.0032             | 0.0027              | 0.0029              |
| FD1QLLX4 | CP Input Cap. | 0.0032             | 0.0027              | 0.0029              |
| FD1QLLX4 | Q Max Load    | 0.640              | 0.640               | 0.640               |
| FD1QLLX4 | D Input Cap.  | 0.0022             | 0.0019              | 0.0020              |

## Synchronous Circuits






## **Critical Path**

- All circuits have a maximal frequency, which is given by finding its critical path
  - Data must be stable when sampled by the clock
- Tcp: critical path delay of the logic

$$Tcp = \max_{\forall i}(D_i)$$

with $D_i$ the delay of path *i*

Maximal Frequency

$$Fclk_{max} = \frac{1}{Tcp + Tp + Tsetup}$$
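
A minimal sketch of the maximal frequency calculation: the critical path is the slowest register-to-register logic path, and the clock period must cover it plus the flip-flop propagation and setup times. The path delays and flip-flop timings below are assumed values, and the function name is hypothetical.

```python
def fclk_max_ghz(path_delays_ns, tp_ns, tsetup_ns):
    """Maximal clock frequency in GHz, with all delays given in ns."""
    tcp = max(path_delays_ns)              # Tcp = max over all paths of D_i
    return 1.0 / (tcp + tp_ns + tsetup_ns)


# Assumed values: three logic paths, Tp = 0.2 ns, Tsetup = 0.1 ns
paths_ns = [0.8, 1.5, 1.2]
f_max = fclk_max_ghz(paths_ns, tp_ns=0.2, tsetup_ns=0.1)
print(f_max)
```

Here the 1.5 ns path sets the period at 1.8 ns, so the circuit cannot run faster than roughly 556 MHz however short the other paths are.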

## **Critical Path in Processor Pipelines**

• A typical (yet simple) processor pipeline



## **Clock Skew**

• Every FF receives the clock edge at a different time



- Light speed:  $300 \mu m/ps$
- Chip diagonal: 30 mm (21 mm side)
  - 100 ps at light speed
  - 1 clock cycle @ 10 GHz
- Real on-chip wires: 5-10 clock cycles @ 1-2 GHz
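
The light-speed bound can be reproduced with a few lines of arithmetic. The sketch only uses the figures from the list above (300 µm/ps, a 21 mm die side, a 10 GHz clock); variable names are chosen for the example.

```python
import math

LIGHT_SPEED_UM_PER_PS = 300.0

side_mm = 21.0
diagonal_mm = side_mm * math.sqrt(2)   # about 29.7 mm, rounded to 30 mm in the list

flight_time_ps = 30.0 * 1000 / LIGHT_SPEED_UM_PER_PS   # 30 mm at 300 um/ps
period_10ghz_ps = 1e12 / 10e9                          # clock period at 10 GHz
cycles_at_10ghz = flight_time_ps / period_10ghz_ps
print(diagonal_mm, flight_time_ps, cycles_at_10ghz)
```

Even this ideal bound costs a full cycle at 10 GHz, so a single clock edge cannot reach every flip-flop of a large die at the same instant.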

#### Clock Skew: problems

- Skew  $\delta$  can be negative or positive
  - Reduction of maximal frequency
  - Maximal skew for circuit operation
    - Worst case is when receiving edge arrives late
      - Edge f ' of CLK2 should not violate hold time of D2
      - Race between data and clock





$$Fclk_{max} = \frac{1}{Tcp + Tp + Tsetup + \delta}$$

# **Clock Distribution**

• Geometric buffering

Tree-based

**H-tree**: constant skew in each block with equivalent number of flip-flops



Buffering: local reduction of skew




# **Technology Evolution and Scaling**


- 10 μm 1971
- 6 μm 1974
- 3 μm 1977
- 1.5 μm 1982
- 1 µm 1985
- 800 nm 1989
- 600 nm 1994
- 350 nm 1995
- 250 nm 1997
- 180 nm 1999
- 130 nm 2001
- 90 nm 2004
- 65 nm 2006
- 45 nm 2008
- 32 nm 2010
- 22 nm 2012
- 14 nm 2014
- 10 nm 2017
- 7 nm ~2018
- 5 nm ~2020
- and then?




# The First Microprocessor

- Intel 4004
- 1971
- 400 kHz
- 4 bits
- 200 US\$ (1200FF)
- 0.06 MOPS
- 10 microns
- 2300 transistors
- 640 addressable bytes




## **Microprocessor Gallery**

#### INTEL 4004 (1971)

4-bit data

2300 transistors, 10 microns, 0.06 MOPS, 108 kHz





#### **INTEL Pentium II (1996)**

32-bit data

5.5M transistors, 0.35 μm, 2 cm<sup>2</sup>, 200 MHz, 200 MOPS, 3.3V, 35W

## **Microprocessor Gallery**

2000 : Intel® Pentium® 4 Processor 42M Tr, 0.18um, 1.5GHz – 3.6GHz



2010: NVIDIA Tegra 2 SoC 260M Tr, 40nm, 49mm<sup>2</sup>, 2 Cortex A9 1 GHz, 300 MHz (rest of the chip)



- Scaling factor: s
- Between two successive generations: *s* # 0.7



- Device dimensions W, L, tox: s
- Transistor density: 1/s<sup>2</sup>
- Speed (before power wall...)
  - Vdd, Vt: *s*
  - delay: s
  - frequency: 1/s



- Energy
  - $E = C.Vdd^2$
  - Capacitances C=W.L.Cox: s
  - Energy: s<sup>3</sup>
- Power is decreased by 50%
  - $P = f.C.Vdd^2$
  - Power: s<sup>2</sup>
  - Activity is supposed constant
- But this is for a constant transistor count!
  - But... transistor density (#Tr/cm<sup>2</sup>): 1/s<sup>2</sup>
    - Power density
  - And power supply current increases a lot
    - 100 W at 1 V equals ...?

# Technology (Dennard's) Scaling

- Scaling factor: s
- Between two successive generations: s # 0.7

| Parameter                                    | Scaling factor |
|----------------------------------------------|----------------|
| Device dimensions: W, L, tox, junction depth | s              |
| Transistor area (W.L)                        | s<sup>2</sup>  |
| Capacitance per unit area: Cox               | 1/s            |
| Capacitances: C=W.L.Cox                      | s              |
| Vdd, Vt                                      | s              |
| Gate delay                                   | s              |
| Power/gate                                   | s<sup>2</sup>  |
| Power.delay product                          | s<sup>3</sup>  |
| Power density                                | 1              |
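
The entries of the scaling table follow from one another, which a short sketch makes explicit. It applies the ideal scaling rules for s = 0.7; the function name and the dictionary keys are hypothetical.

```python
def dennard_scaling(s):
    """Relative change of each quantity when dimensions and voltages scale by s."""
    capacitance = s                          # C = W.L.Cox scales as s * s * (1/s)
    vdd = s
    delay = s
    frequency = 1 / delay
    energy = capacitance * vdd**2            # power.delay product, s^3
    power_per_gate = frequency * energy      # s^2
    density = 1 / s**2                       # gates per unit area
    return {
        "frequency": frequency,
        "energy": energy,
        "power_per_gate": power_per_gate,
        "density": density,
        "power_density": power_per_gate * density,
    }


r = dennard_scaling(0.7)
print(r)
```

For s = 0.7 each gate consumes about half the power (0.49) while about twice as many gates fit in the same area, so the power density stays at 1.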


# And then came the "Power Wall"



Source: C. Batten, Cornell

# and the "Multicore Era"

Increasing performance by increasing # of cores



# Moving to multicore

- 1 core@2GHz@1.2V@1W
- 1 core@1GHz@0.8V@0.25W
- 2 cores@1GHz@0.8V@0.5W
- But... twice area (and not so simple)
- Advanced technology nodes?
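
The numbers in this list can be approximated from the dynamic power relation $P \propto f.Vdd^2$. The sketch below assumes dynamic power only (leakage ignored) and constant activity and capacitance; the function name is hypothetical.

```python
def scaled_power(p_ref, f_ref, v_ref, f, v, cores=1):
    """Dynamic power after frequency/voltage scaling, relative to a reference core."""
    return cores * p_ref * (f / f_ref) * (v / v_ref) ** 2


# Reference: 1 core @ 2 GHz @ 1.2 V @ 1 W
one_slow_core = scaled_power(1.0, 2e9, 1.2, 1e9, 0.8)
two_slow_cores = scaled_power(1.0, 2e9, 1.2, 1e9, 0.8, cores=2)
print(one_slow_core, two_slow_cores)
```

The model gives about 0.22 W and 0.44 W, close to the rounded 0.25 W and 0.5 W of the list: two slow cores offer the same aggregate clock rate as the fast core for roughly half the power, provided the workload can be parallelized.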












#### **Classical (Dennard's) scaling**

| Parameter        | Scaling factor   |
|------------------|------------------|
| Device count     | S<sup>2</sup>    |
| Device frequency | S                |
| Capacitance, Vdd | 1/S              |
| Device power     | 1/S<sup>2</sup>  |
| Utilization      | 1                |





# End of Dennard's Scaling

 Energy efficiency is not scaling along with integration capacity

#### Leakage limited scaling








# **Multicore and Dark Silicon**



• Replace dark cores with specialized cores (10-100x more energy efficient)



#### **Energy Cost in a Processor**



Fetching operands costs more than computing



#### **Energy Savings in Specialized HW**



### An example: Bitcoin Mining



| Type | Model                   | Mhash/s   | Mhash/J | Power (W) |
|------|-------------------------|-----------|---------|-----------|
| GPP  | Intel Xeon X5355 (dual) | 22.76     | 0.09    | 120       |
| GPP  | ARM Cortex-A9           | 0.57      | 1.14    | 1.5       |
| GPP  | Intel Core i7 3930k     | 66.6      | 0.51    | 130       |
| GPU  | AMD 7970x3              | 2050      | 2.41    | 850       |
| GPU  | Nvidia GTX460           | 158       | 0.66    | 240       |
| ASIC | AntMiner S1             | 180,000   | 500     | 360       |
| ASIC | AntMiner S5             | 1,155,000 | 1957    | 590       |
| FPGA | Bitcoin Dominator X5000 | 100       | 14.7    | 6.8       |
| FPGA | Butterflylabs Mini Rig  | 25,200    | 20.16   | 1250      |
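
Energy efficiency in Mhash/J is throughput divided by power, so the table can be re-derived from its own columns. The sketch below copies the table values as-is; the 5% tolerance used to flag rows is an arbitrary choice made for this example.

```python
# (type, model, Mhash/s, Mhash/J as listed, power in W), copied from the table
rows = [
    ("GPP",  "Intel Xeon X5355 (dual)", 22.76,     0.09,   120),
    ("GPP",  "ARM Cortex-A9",           0.57,      1.14,   1.5),
    ("GPP",  "Intel Core i7 3930k",     66.6,      0.51,   130),
    ("GPU",  "AMD 7970x3",              2050,      2.41,   850),
    ("GPU",  "Nvidia GTX460",           158,       0.66,   240),
    ("ASIC", "AntMiner S1",             180_000,   500,    360),
    ("ASIC", "AntMiner S5",             1_155_000, 1957,   590),
    ("FPGA", "Bitcoin Dominator X5000", 100,       14.7,   6.8),
    ("FPGA", "Butterflylabs Mini Rig",  25_200,    20.16,  1250),
]

# Mhash/J recomputed as (Mhash/s) / W
computed = {model: mhash_s / watts for _, model, mhash_s, _, watts in rows}

# Rows whose listed Mhash/J differs from the recomputed value by more than 5%
mismatch = [
    model
    for _, model, mhash_s, listed, watts in rows
    if abs(mhash_s / watts - listed) > 0.05 * listed
]

asic_vs_cpu = computed["AntMiner S5"] / computed["Intel Core i7 3930k"]
print(mismatch, round(asic_vs_cpu))
```

Seven rows agree with the quotient. The Xeon and Cortex-A9 rows do not, and the table does not say how their power or efficiency figures were measured, so they are best read with caution. On the consistent rows, the AntMiner S5 ASIC is several thousand times more energy efficient than the Core i7.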



## Time has Come for Specialization

• Microsoft Unveils Catapult to Accelerate Bing!



- One FPGA per blade
- 6 × 8 2-D torus topology
- High-end Stra
- Running Bing l extraction and



- Increase ranking throughput by 95% at comparable latency to software-only
- Increase power consumption by 10%
- Increase total cost of ownership by less than 30%
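
The three figures above combine into simple efficiency ratios. The sketch below only rearranges the reported percentages; the 30% cost increase is an upper bound, so the throughput-per-cost ratio is a lower bound.

```python
throughput_gain = 1.95   # ranking throughput increased by 95%
power_gain = 1.10        # power consumption increased by 10%
tco_gain_max = 1.30      # total cost of ownership increased by less than 30%

throughput_per_watt = throughput_gain / power_gain        # about 1.77x
throughput_per_cost_min = throughput_gain / tco_gain_max  # at least 1.5x
print(throughput_per_watt, throughput_per_cost_min)
```

Under these figures, adding one FPGA per blade improves throughput per watt by roughly 1.8x and throughput per unit of cost by at least 1.5x.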



#### **Towards Heterogeneous Multicores**

Embedded and High-Performance Computing



 C to hardware high-level synthesis boosts hardware designer productivity

#### Conclusions

- A not too deep dive into processors?
- Transistors, logic gates, registers and memory
- Delay and maximal frequency
- Power is data dependent and dominated by data transfers
- Energy efficiency is no longer scaling along with integration density
- Efficiency of hardware specialization
- Dark Silicon is an opportunity
  - Heterogeneous manycore architectures
  - Bring a new demand for genuinely high level synthesis tools and (JIT) compilers that map programs to accelerators

## **On-Chip Interconnect?**

- Gate delay decreases but... wire delay increases
- Crossing chip in 5-10 clock cycles
- Also affected by noise...



- Metal layers to reduce wire delay
- Repeaters

 Towards network-on-chip

# Chips go 3D!

- 3D Integrated Circuits
  - Stack Multiple Dies
- Wire Length Reduction
  - Replace long, high capacitance wires by Through Silicon Vias (TSVs)
  - Low latency, low energy, high bandwidth
- Heterogeneous Integration
  - Image Sensors, Sensor Network Nodes
  - Processor + Memory



[Figure: 3D integrated circuit stack (Tier 1 to Tier 4) showing heatsink, bulk and metal layers, TSVs (I/Os + power), micro-bumps, bumps, balls, package and printed circuit board]

#### **3D Heterogeneous Multicores**

• 3D Optical Manycore Project

