EECS 151: Complete Study Guide
Introduction to Digital Design and ICs — UC Berkeley, Spring 2026
EECS 151是UC Berkeley的数字集成电路设计入门课。涵盖:布尔逻辑、SystemVerilog硬件描述语言、有限状态机、RISC-V处理器设计、流水线、CMOS晶体管电路、延迟分析、逻辑努力、导线与能耗、加法器与乘法器、时序分析与存储器、功耗分析、移位器与交叉开关、除法器、FPGA与ASIC比较。本网站覆盖全部讲义内容加上61C背景知识复习,提供三份考试速查表、交互式动画和练习题。
Course Map
61C Background / Prerequisites
Review of CS 61C concepts needed for EECS 151
这一节回顾61C的核心基础:数的表示(无符号、二进制补码、符号扩展),RISC-V指令集基础(6种格式),内存模型(字节寻址、大小端、对齐),以及C语言到汇编的映射。就像盖房子之前要先了解砖块和水泥一样,这些是设计数字电路的"原材料"。
Number Representation
- Unsigned N-bit: range [0, 2N-1]. Simply binary place values.
- Two's complement N-bit: range [-2N-1, 2N-1-1]. MSB has weight -2N-1.
- Negation: flip all bits, add 1. Example: -5 in 8-bit = ~(00000101)+1 = 11111011
- Sign extension: replicate the MSB to fill wider register. E.g., 1011 (4-bit, -5) → 11111011 (8-bit, still -5)
- Zero extension: pad with 0s (for unsigned values)
二进制补码像一个"环形数轴":从0往上数到最大正数(01...1),再往上就跳到最小负数(10...0)。符号扩展就是复制最高位来保持数值不变,就像把"-5"从4位搬到8位时,前面都填1。
Convert -13 to 8-bit two's complement:
- 13 in binary = 00001101
- Flip bits = 11110010
- Add 1 = 11110011
- Check: -128+64+32+16+0+0+2+1 = -128+115 = -13
RISC-V ISA Quick Reference
| Format | Key Fields | Examples |
|---|---|---|
| R-type | funct7 | rs2 | rs1 | funct3 | rd | opcode | add, sub, and, or, xor, slt |
| I-type | imm[11:0] | rs1 | funct3 | rd | opcode | addi, lw, jalr, slli |
| S-type | imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode | sw, sb, sh |
| B-type | imm bits | rs2 | rs1 | funct3 | imm bits | opcode | beq, bne, blt, bge |
| U-type | imm[31:12] | rd | opcode | lui, auipc |
| J-type | imm bits | rd | opcode | jal |
All instructions are 32 bits. x0 is hardwired to 0. rd = destination register.
RISC-V的6种指令格式就像6种不同的"信封"模板。R型用于寄存器间运算(如加减),I型用于带立即数的操作(如addi、加载),S型用于存储(把数据写入内存),B型用于条件跳转,U/J型用于大跳转和上半立即数。
Memory Model
- Byte addressing: each memory address points to one byte (8 bits)
- Word: 4 bytes (32 bits) in RV32.
lwloads 4 bytes starting at address. - Little-endian (RISC-V): least significant byte at lowest address
- Alignment: word access should be at addresses divisible by 4; half-word by 2
- Misaligned access may cause exceptions or be slower
内存就像一排编了号的"小格子",每格存1字节。小端序(Little-endian)就是低位字节放在低地址,像把数字"从右往左"排列。对齐要求4字节数据必须放在4的倍数地址上,否则硬件访问会出问题。
C to Assembly Mapping
// C: int x = a + b;
// ASM: add x0_reg, a_reg, b_reg
// C: if (a == b) { ... }
// ASM: beq a_reg, b_reg, label
// C: int arr[10]; x = arr[3];
// ASM: lw x_reg, 12(base_reg) # 3*4=12 byte offset
// C: for (int i=0; i
Boolean Logic & Combinational Circuits
Lectures 2-3, 7-8 | Midterm 1 scope
布尔代数是数字设计的数学基础。变量只有0和1两个值。基本运算:AND(与)、OR(或)、NOT(非)。关键定律:德摩根定律(NOT(AB)=A'+B')、分配律、吸收律。组合逻辑的输出仅取决于当前输入。可用真值表、卡诺图、SOP/POS形式表达和化简。噪声容限衡量信号在传输中抗干扰的能力。
Boolean Algebra Laws
- X + 0 = X, X · 1 = X
- X + 1 = 1, X · 0 = 0
- X + X = X, X · X = X
- X + X' = 1, X · X' = 0
- De Morgan's: (AB)' = A' + B', (A+B)' = A'B'
- Distributive: A(B+C) = AB+AC, also A+BC = (A+B)(A+C)
- Absorptive: A+AB = A, A(A+B) = A
- Consensus: AB+A'C+BC = AB+A'C
德摩根定律是最重要的化简工具,可以记为"打破长杠,变换运算":AND变OR,OR变AND,各变量取反。吸收律(A+AB=A)的直觉是:如果A已经是1,加上AB不改变结果。共识定律说多余项BC可以去掉,因为它被AB和A'C覆盖。
Canonical Forms
- SOP (Sum of Products): OR of minterms where f=1. Each minterm is an AND of all variables.
- POS (Product of Sums): AND of maxterms where f=0. Each maxterm is an OR of all variables.
SOP就是把真值表中输出为1的行写成"乘积之和",每一项包含所有变量(取值为0的变量加非号)。POS则相反,取输出为0的行写成"和之积"。两者互为对偶。
Noise Margins
噪声容限就像信号的"安全边距"。NM_H是高电平输出和下一级高电平输入阈值之间的差距,NM_L是低电平的对应差距。越大越抗干扰,就像围墙越高越安全。
Design Metrics
- Propagation delay (t_p): time from input 50% to output 50%. t_p = (t_pHL + t_pLH)/2
- Rise time (t_r): output 10% to 90%. Fall time (t_f): 90% to 10%.
- Throughput: operations per unit time. Latency: time from start to finish of one task.
传播延迟是信号从输入到输出经过50%点的时间。吞吐量是单位时间能完成多少操作(像流水线工厂每秒出几个产品),延迟是一个操作从头到尾的时间(一个产品走完整条流水线要多久)。
SystemVerilog
Lectures 4-6 | Midterm 1 scope
SystemVerilog是硬件描述语言,分为结构化(门级连接)和行为化(算法描述)两种风格。关键区别:assign是连续赋值(组合逻辑),always_ff用于时序逻辑(触发器),always_comb用于组合逻辑。阻塞赋值(=)按顺序执行,非阻塞赋值(<=)并行更新(用于触发器)。模块可参数化、可用generate循环实例化。
Two Styles
module mux2(input a,b,sel, output out);
logic s0, w0, w1;
not (s0, sel);
and (w0, s0, a);
and (w1, sel, b);
or (out, w0, w1);
endmodule
Instantiates gates explicitly. Output is always first argument.
module mux2(input a,b,sel, output out);
assign out = sel ? b : a;
endmodule
assign = continuous assignment (combinational logic). Changes on RHS immediately reflected.
结构化描述就像画电路图,一个一个门连起来。行为化描述更像写代码,描述"做什么"而不是"怎么连"。两种方式综合后可以得到一样的电路。assign相当于"永远保持这个等式成立的导线连接"。
Always Blocks
| Block | Use For | Assignment | Sensitivity |
|---|---|---|---|
always_comb | Combinational logic | = (blocking) | Auto (all inputs) |
always_ff | Flip-flops | <= (non-blocking) | @(posedge clk) |
always_latch | Latches (avoid!) | <= | Level-sensitive |
always_comb像一个"随时更新的计算器",输入一变输出立刻变。always_ff像一个"快门",只在时钟上升沿那一刻拍一张"快照"存起来。记住:组合逻辑用=(阻塞),触发器用<=(非阻塞)。
Blocking vs Non-Blocking
Non-blocking (<=) - for FF
always_ff @(posedge clk) begin
q1 <= in; // all evaluate
q2 <= q1; // using OLD values
q3 <= q2; // = shift register!
endBlocking (=) - WRONG for FF
always_ff @(posedge clk) begin
q1 = in; // updates immediately
q2 = q1; // uses NEW q1 = in
q3 = q2; // all get same value!
endParameterized Modules & Generate
module counter #(parameter N = 4) (
input clk, rst, enb,
output logic [N-1:0] count
);
always_ff @(posedge clk)
count <= rst ? '0 : (enb ? count+1 : count);
endmodule
参数化模块像一个"可调尺寸的模具"。parameter定义可调参数(如位宽N),generate/for可以批量生成重复硬件结构(如N个加法器)。实例化时用 #(.N(8)) 指定参数值。
Sequential Logic & FSMs
Lectures 6-7 | Midterm 1 scope
时序逻辑的输出取决于当前输入和历史状态(通过触发器存储)。有限状态机(FSM)是常见的时序电路。Moore机输出仅取决于当前状态,Mealy机输出取决于当前状态和输入。FSM用两段式always block实现:一个always_ff更新状态寄存器,一个always_comb计算下一状态和输出。同步/异步复位的区别在于复位是否依赖时钟。
Moore vs Mealy
| Moore | Mealy | |
|---|---|---|
| Output depends on | State only | State + Input |
| Output changes | On clock edge | Asynchronously with input |
| Typically needs | More states | Fewer states |
Moore机像一个"只看当前房间决定行为"的机器人,Mealy机还会"听门外的声音来决定"。Moore更安全(输出同步),Mealy更省状态但输出可能有毛刺。
FSM in SystemVerilog (Two-Block Style)
typedef enum logic [1:0] {IDLE, RUN, DONE} state_t;
state_t state, next_state;
always_ff @(posedge clk)
state <= rst ? IDLE : next_state;
always_comb begin
next_state = state; // default: stay
case (state)
IDLE: if (start) next_state = RUN;
RUN: if (finish) next_state = DONE;
DONE: next_state = IDLE;
endcase
end
两段式FSM:第一段(always_ff)是"记忆",每个时钟存一次当前状态;第二段(always_comb)是"决策",根据当前状态和输入决定下一步。注意always_comb里要有默认赋值(next_state=state),否则会综合出latch。
Sync vs Async Reset
Synchronous Reset
always_ff @(posedge clk)
if (!reset_n) q <= 0;
else q <= d;
Reset only on clock edge.
Asynchronous Reset
always_ff @(posedge clk or
negedge reset_n)
if (!reset_n) q <= 0;
else q <= d;
Reset immediately, no clock needed.
RISC-V & Datapath
Lectures 8-10 | Midterm 1 scope
RISC-V是开源指令集架构,有6种指令格式(R/I/S/B/U/J)。处理器是一个大型状态机:每个时钟周期取指、译码、执行、访存、写回。数据通路包括PC、指令存储器、寄存器文件、ALU、数据存储器、立即数生成器。控制信号(PCSel, ALUSel, RegWEn等)由指令的opcode/funct字段决定。JAL/JALR用于跳转,Branch用于条件分支。
Instruction Formats
| Format | Fields (31..0) | Used By |
|---|---|---|
| R-type | funct7 | rs2 | rs1 | funct3 | rd | opcode | ADD, SUB, AND, OR, XOR, SLT... |
| I-type | imm[11:0] | rs1 | funct3 | rd | opcode | ADDI, LW, JALR, SLLI... |
| S-type | imm[11:5] | rs2 | rs1 | funct3 | imm[4:0] | opcode | SW, SB, SH |
| B-type | imm[12|10:5] | rs2 | rs1 | funct3 | imm[4:1|11] | opcode | BEQ, BNE, BLT, BGE... |
| U-type | imm[31:12] | rd | opcode | LUI, AUIPC |
| J-type | imm[20|10:1|11|19:12] | rd | opcode | JAL |
RISC-V指令格式里,立即数的位分布看起来很乱,但其实是为了让不同格式的相同位尽量对齐(简化硬件)。记住关键:R型用于寄存器运算,I型用于立即数和加载,S型用于存储,B型用于分支。
Single-Cycle Datapath
Data flows: PC → IMEM → Decode/RegRead → ALU → DMEM → WriteBack
单周期数据通路就像一条"流水线工厂":PC提供地址 → 指令存储器取出指令 → 译码并读寄存器 → ALU计算 → 访问数据存储器 → 结果写回寄存器。控制逻辑就像"车间主任",根据指令类型发出不同控制信号。
Pipelining & Hazards
Lectures 10-11 | Midterm 1 scope
流水线将数据通路分成多个阶段(取指F、译码D、执行X、访存M、写回W),用寄存器分隔,使多条指令可以同时在不同阶段执行,提高吞吐量。但引入三种冒险:结构冒险(资源冲突)、数据冒险(数据依赖)、控制冒险(分支)。解决方法:转发(forwarding)、停顿(stalling)、分支预测。最大频率由最慢阶段决定。
5-Stage Pipeline
Hazards
| Type | Cause | Solution |
|---|---|---|
| Structural | Two stages need same resource | Separate IMEM/DMEM, multi-port regfile |
| Data | Instruction depends on result of prior instr | Forwarding, stalling |
| Control | Branch/jump changes PC | Branch prediction, flush, delay slots |
流水线冒险就像高速公路上的"堵车":结构冒险是两辆车抢同一车道(加车道解决);数据冒险是后车需要前车的结果(转发=直接传递,停顿=等一等);控制冒险是不知道该走哪个出口(分支预测=猜一个方向先走着)。
Timing
- Max frequency: f = 1/Tmin
- Tmin = tclk-q + tlogic,max + tsetup
- Hold constraint: tclk-q + tlogic,min > thold
- Retiming: move registers to balance critical path across pipeline stages
CMOS Fundamentals
Lectures 13-14 | Midterm 2 scope
CMOS是现代芯片的基础。NMOS在栅极高电平时导通(拉低输出),PMOS在栅极低电平时导通(拉高输出)。每个门有互补的上拉(PUN,PMOS)和下拉(PDN,NMOS)网络,是对偶关系:串联变并联。静态CMOS门只能实现反相功能(NAND/NOR)。晶体管尺寸设计:串联时加倍宽度以保持等效驱动能力。
NMOS vs PMOS
NMOS就像一个"高电平才开的水龙头",导通时把输出拉到地(输出0)。PMOS相反,"低电平才开",导通时把输出拉到电源(输出1)。NMOS传递强0弱1,PMOS传递强1弱0,所以CMOS两者配合使用。
CMOS Gate Structure
- PDN (NMOS, to GND): Series = AND, Parallel = OR
- PUN (PMOS, to VDD): Dual of PDN (Series↔Parallel)
- Output = complement of PDN expression
- Static CMOS is always inverting: can build NAND, NOR, not AND/OR directly
PDN决定什么时候输出拉低(接地),所以如果PDN中两个NMOS串联(都导通才接地=AND条件),输出就是(AB)的反=NAND。对偶规则:PDN串联↔PUN并联,像镜子里的倒影。
Interactive: Build a CMOS Gate
Transistor Sizing
Goal: equal pull-up/pull-down strength as reference inverter (Wp/Wn = β)
- NMOS in series × N: each width = N × Wn,min
- PMOS in series × N: each width = N × Wp,min
- Parallel: keep minimum width
| Gate | NMOS config | PMOS config | Each NMOS W | Each PMOS W | Total (β=2) |
|---|---|---|---|---|---|
| NAND2 | 2 series | 2 parallel | 2 | 2 | 8 |
| NAND3 | 3 series | 3 parallel | 3 | 2 | 15 |
| NOR2 | 2 parallel | 2 series | 1 | 4 | 10 |
| NOR3 | 3 parallel | 3 series | 1 | 6 | 21 |
NAND gates are smaller than NOR gates! (because PMOS in parallel, not series)
晶体管串联时电阻加倍,所以要加倍宽度来补偿。NAND中NMOS串联(需要加倍)而PMOS并联(不用加倍),NOR则相反。因为PMOS本身就比NMOS大(beta=2),所以NOR中PMOS串联会变得很大,这就是为什么NAND比NOR更受欢迎。
CMOS Delay
Lecture 15 | Midterm 2 scope
CMOS延迟用RC模型分析。每个晶体管导通时有电阻R(与宽度成反比),栅极和漏极有电容。传播延迟:t = ln(2) × R × (寄生电容 + 负载电容)。Elmore延迟用于复杂RC网络:对每个电容,乘以从源到该电容的共享电阻,求和。关键:包含非通路电容,不包含非通路电阻。反相器链最优扇出约为e≈2.7。
RC Model
- On-resistance: R = Ron / W (wider = lower resistance)
- Gate capacitance: Cg = W × Ci
- Drain capacitance: Cd = γ × Cg (γ ≈ 1 in this class)
- Inverter delay: tp = ln(2) × Req × (Cparasitic + Cload)
RC模型把晶体管简化为"水管"(电阻R)和"水桶"(电容C)。宽度大=水管粗=电阻小=充电快。延迟就是给电容充到50%所需的时间=ln(2)*RC。就像用水管往桶里灌水,管越粗(R小)桶越小(C小)就越快。
Elmore Delay
tD = ln(2) × ∑i Rshared,i × Ci
For each capacitor: multiply by total resistance from source to that cap. Sum all.
Include off-path CAPACITANCES. Exclude off-path RESISTANCES.
Elmore延迟就像算"每个水桶被上游所有水管拖慢了多少"。对每个电容C_i,找从源到它经过的所有电阻加起来(共享路径),乘以C_i,最后全部求和。非通路上的电容要算(因为也要充电),但非通路上的电阻不算(因为电流不经过那里)。
Interactive: RC Charging
Inverter Chain Optimization
- Overall fanout: F = CL/Cin
- N stages, optimal fanout per stage: f = F1/N
- Optimal f ≈ e ≈ 2.718 (practical: 3-4)
- Chain delay: D = N × tp0(1 + γ/f)
如果要驱动一个很大的负载(比如C_L是输入的1000倍),直接用一个反相器会太慢。解决方案:用一串逐渐变大的反相器,每级放大约e倍(2.7倍)。级数N=ln(F)/ln(e),总延迟最小。就像搬重物时用多个滑轮逐级放大力量。
Logical Effort
Lecture 16 | Midterm 2 scope
逻辑努力是比较不同门速度并优化尺寸的方法。门延迟 d = g×h + p(逻辑努力×电气努力 + 寄生延迟)。路径总努力 F = G×B×H,最优每级努力 f = F^(1/N)。从输出往回算每个门的输入电容来确定尺寸。NAND比NOR快(逻辑努力更小)。
Gate Delay Formula
d = g × h + p
- g = logical effort (gate type property)
- h = Cout/Cin = electrical effort (fanout)
- p = parasitic delay
- f = g×h = stage effort
逻辑努力g衡量"这个门比反相器慢多少"。电气努力h衡量"负载有多重"。寄生延迟p是门自身的固有延迟。总延迟d就像做一件事的难度=事情本身的难度(g) × 负担大小(h) + 准备时间(p)。
Logical Effort Table
| Gate | g (logical effort) | p (parasitic delay) |
|---|---|---|
| Inverter | 1 | 1 |
| NAND2 | 4/3 | 2 |
| NAND3 | 5/3 | 3 |
| NOR2 | 5/3 | 2 |
| NOR3 | 7/3 | 3 |
Path Optimization
- Compute: G = ∏gi, B = ∏bi (branching), H = Cout/Cin
- Path effort: F = G × B × H
- Optimal stage effort: f̂ = F1/N
- Min delay: D = N × f̂ + ∑pi
- Size backwards: Cin,i = gi × Cout,i / f̂
路径优化的步骤:(1)算总努力F=G*B*H;(2)求最优每级努力f=F^(1/N);(3)从输出往回推每个门的输入电容C_in = g*C_out/f。就像规划旅程:先算总路程,再平均分配每天的行程,然后从目的地往回安排住宿点。
Wires & Energy
Lecture 17-18 | Midterm 2 scope
导线有电阻和电容,先进工艺中导线延迟可超过门延迟。总功耗 = 动态 + 短路 + 漏电。动态功耗 P = αCV²f,其中α是活动因子=2p(1-p)。降压可大幅降低功耗(P∝V³)但频率也降(f∝V)。每次开关能量=CV²。短路功耗在PUN/PDN同时导通时产生,峰值在VDD/2。
Wire Delay
- Wire R: R = ρ × L / (W × t) = Rsq × L/W
- Wire C: C = ε × A / d
- Lumped: t = 0.7 RC
- Distributed (N→∞): t ≈ 0.35 RC
导线不是理想的,有电阻(越长越细越大)和电容(与邻近导线的耦合)。集中式模型把整条导线的RC集中在一点,分布式模型更准确(延迟减半)。在先进工艺中导线延迟已经超过门延迟,成为瓶颈。
Dynamic Power
Pdynamic = α × CL × VDD² × fclk
- Activity factor: α = 2p(1-p) where p = P(output=1)
- Energy per switch: E = CL × VDD²
- Short-circuit: Psc = Isc × VDD, peaks at Vin = VDD/2
- Static: Pleak = Ileak × VDD
动态功耗P=alpha*C*V^2*f是最重要的公式。alpha是"活动因子"=信号翻转的概率(对于随机信号,p=0.5时alpha最大=0.5)。降压效果显著:电压降一半,功耗降到1/8(因为f也降一半)。但不能无限降压,低于阈值电压晶体管就不工作了。
Interactive: Voltage Scaling
Base: C=10nF, f=200MHz@1.0V, α=0.1. Assumes ID ∝ VDD.
Adders
Lectures 18-19 | Midterm 2 scope
加法器核心信号:生成G=AB,传播P=A⊕B,Cout=G+P·Cin。行波进位O(N)最慢但最省面积和能量。进位跳跃O(√N)利用分组跳过全传播段。进位选择O(√N)并行计算两种进位假设再选择。进位前瞻O(logN)最快但面积最大。
Full Adder
- S = A ⊕ B ⊕ Cin
- Cout = G + P · Cin where G = AB, P = A⊕B
- Kill: K = ĀB̄ (carry always killed)
全加器是加法器的"砖块"。G(生成)=两个输入都是1时自己产生进位。P(传播)=恰好一个是1时把进位传递下去。K(杀死)=两个都是0时进位被吃掉。C_out = G + P*C_in 就是"要么自己生成,要么传递进来的"。
Adder Comparison
| Type | Delay | Area | Energy | Key Idea |
|---|---|---|---|---|
| Ripple-Carry | O(N) | O(N) | Lowest | Chain carries |
| Carry-Bypass | O(√N) | O(N) | Low | Skip groups if all P=1; b=√(N/2) |
| Carry-Select | O(√N) | ~2N | Med | Compute Cin=0,1 in parallel |
| Carry-Lookahead | O(log N) | O(NlogN) | High | Parallel G/P tree |
Interactive: Carry Propagation
Multipliers
Lectures 19-20 | Midterm 2 scope
N位乘法三步:部分积生成(AND)、部分积压缩(CSA)、最终加法。CSA将3个数变2个数仅需O(1)。Wallace树用多层CSA,O(log N)层压缩。Booth编码(基4)把N个部分积减半为N/2:3A=4A-A,避免困难乘法。Baugh-Wooley用于有符号乘法的符号扩展优化。
Three Stages
乘法器就像手算竖式乘法的硬件版:(1)算出所有部分积(每一位乘以被乘数);(2)用CSA(进位保留加法器)快速压缩,每次3个数变2个数;(3)最后用快速加法器把剩下的2个数加起来。Wallace树就是一层层CSA嵌套。
Booth Recoding (Radix-4)
| Bk+1BkBk-1 | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111 |
|---|---|---|---|---|---|---|---|---|
| Partial Product | 0 | +A | +A | +2A | -2A | -A | -A | 0 |
N partial products → N/2. Key insight: 3A = 4A - A (defer 4A to next group).
Booth编码的核心思想:把连续的1变成"高位+1、低位-1"。比如0111=1000-0001=8-1。基4版本每次看3位(重叠1位),把部分积数量减半。硬件只需要做0、+A、-A、+2A、-2A五种操作,其中2A就是左移一位。
Timing & Latches
Lecture 21 | Final scope
时序分析:建立时间(setup)是数据在时钟边沿前必须稳定的时间,保持时间(hold)是之后必须保持的时间。最小周期T ≥ tclk-q + tlogic,max + tsetup + tskew + tjitter。保持约束:tclk-q + tlogic,min > thold + tskew。锁存器(latch)是电平敏感的,透明时D直通到Q;触发器(flip-flop)是边沿触发的,由两个锁存器组成。时钟偏移(skew)是空间变化,抖动(jitter)是时间变化。
Setup and Hold
- Setup (max delay): T ≥ tclk-q + tlogic,max + tsetup + tskew + tjitter
- Hold (min delay): tclk-q + tlogic,min ≥ thold + tskew + tjitter
- Setup slack: T - (tclk-q + tlogic,max + tsetup - tskew + tjitter)
- Hold slack: (tclk-q + tlogic,min - tskew - tjitter) - thold
Setup too slow? Reduce logic, retime, pipeline. Hold too fast? Add buffers.
建立时间就像"提前到机场的时间"——数据必须在时钟来之前就稳定好。保持时间就像"不能提前离开考场"——时钟来了之后数据还要保持一段时间。建立违例可以通过降频解决(给更多时间),但保持违例不行(与频率无关),只能加缓冲器。
Flip-Flop vs Latch
| Flip-Flop | Latch | |
|---|---|---|
| Triggered by | Clock edge | Clock level |
| Transparent when | Never (samples at edge) | CLK high (or low) |
| Built from | Two latches (primary+secondary) | Transmission gate + inverter |
| Verilog | always_ff @(posedge clk) | always_latch if(clk) q<=d; |
锁存器像一扇"电动门"——时钟高电平时门开着,输入直接穿过到输出(透明);时钟低电平时门关闭,保持最后的值。触发器像"快门相机"——只在时钟上升沿那一瞬间"拍照"(采样)。触发器由两个锁存器组成(主从结构)。
Clock Skew & Jitter
- Clock skew (tSK): spatial variation in clock arrival. Deterministic + random.
- Clock jitter (tJ): temporal variation in consecutive clock edges. Random noise.
- Positive skew (receiver late) helps setup, hurts hold
- Negative skew (receiver early) hurts setup, helps hold
时钟偏移(skew)是"空间上的不同步"——芯片不同角落的时钟到达时间不同(像广播信号到达不同城市有延迟)。抖动(jitter)是"时间上的不稳定"——同一个点的时钟周期忽长忽短。正偏移(接收器晚收到时钟)给了数据更多时间到达(利于建立),但也给了旧数据更多时间残留(不利于保持)。
SRAM & Memory
Final scope (from FA25 notes)
SRAM用6个晶体管存储1位数据。读操作:预充电位线到VDD,开启字线,感应放大器检测位线差异。读稳定性要求存储节点不被翻转(拉下>拉起)。写操作:拉低一条位线迫使单元翻转。写能力要求访问管能覆盖存储反相器。行译码器用NAND门选中字线。列电路包括位线调节、写驱动、感应放大器。
6T SRAM Cell
6T SRAM像两个互相"掐架"的反相器(交叉耦合),稳定在两个状态之一(存0或存1)。读的时候先把两条"天平秤"(位线)都充到高电平,然后打开门(字线),存储的值会让一边稍微降低,感应放大器放大这个微小差异。写的时候强行把一边拉到低,迫使状态翻转。
Read vs Write
| Read | Write | |
|---|---|---|
| Requirement | Stability (don't flip the cell) | Writability (must flip the cell) |
| Pull-down vs Access | PD > Access (W/L1 > W/L5) | Access > PU (W/L5 > W/L2) |
| Typical ratio | 1.5:1 (planar), 2:1 (finFET) | 1:1 (planar), 2:1 (finFET) |
Power & Energy Analysis
Lectures 17, 19 | Final scope
功耗分析是芯片设计的核心挑战之一。总功耗=动态功耗+静态功耗。动态功耗来自充放电(开关活动),静态功耗来自漏电流。就像一辆汽车:动态功耗是踩油门时消耗的油(取决于踩了多少次和踩多深),静态功耗是停车时发动机空转消耗的油(一直在漏)。现代芯片两者都很重要,需要用各种技术来管理。
Power Breakdown
Ptotal = Pdynamic + Pshort-circuit + Pstatic
- Dynamic: Pdyn = α × CL × VDD2 × f — from charging/discharging load capacitances
- Short-circuit: Psc = Isc × VDD — when PMOS and NMOS both on during transition
- Static (leakage): Pstatic = Ileak × VDD — subthreshold and gate leakage
动态功耗:每次信号翻转,给电容充电(0→1)从电源取能量CV^2,放电(1→0)能量消散到地。短路功耗:输入翻转过程中PMOS和NMOS同时短暂导通,产生直通电流(V_in在V_DD/2时最大)。静态功耗:即使不翻转,晶体管也有微小漏电流。
Activity Factor
The activity factor α represents the probability of a 0→1 transition per clock cycle.
- α = 2 × p × (1-p) where p = P(output = 1)
- Maximum when p = 0.5: αmax = 0.5
- Clock signal: α = 1 (transitions every cycle)
- For an AND2 with independent random inputs: p = 0.25, α = 2(0.25)(0.75) = 0.375
- For an OR2 with independent random inputs: p = 0.75, α = 2(0.75)(0.25) = 0.375
活动因子alpha衡量"信号有多忙"。alpha=2p(1-p)中p是输出为1的概率。如果信号总是0或总是1(p=0或1),alpha=0(没有翻转)。如果p=0.5(随机信号),alpha=0.5(最大翻转率)。时钟信号每个周期都翻转,所以alpha=1。
Energy Per Operation
- Energy per switching event: Eswitch = CL × VDD2
- Energy per operation (considering activity): Eop = α × CL × VDD2
- Note: energy is independent of frequency! Lowering f reduces power but not energy/op.
- Lowering VDD reduces energy/op quadratically (E ∝ VDD2)
能量和功率的区别:功率P=能量/时间。降低频率可以降低功率(单位时间做的事少了),但每次操作的能量不变。降低电压才能真正降低每次操作的能量(E正比于V^2)。这就是为什么降压是最有效的省电手段。
Power Management Techniques
| Technique | What It Does | Saves |
|---|---|---|
| Clock Gating | Disable clock to idle blocks (alpha → 0) | Dynamic power |
| Voltage Scaling | Lower VDD (P ∝ V3 since f ∝ V) | Dynamic + Energy |
| Power Gating | Cut VDD to idle blocks entirely | Static (leakage) |
| Multi-Vth | High Vth on non-critical paths (less leakage) | Static power |
| Body Biasing | Adjust Vth dynamically via substrate voltage | Static power |
时钟门控:给不用的模块"关掉闹钟",不翻转就不耗动态功耗。电压缩放:降低供电电压,功耗按V^3下降(因为频率也按V下降)。电源门控:给不用的模块彻底"断电",消除漏电。多阈值:关键路径用快(低阈值)晶体管,其他用省电(高阈值)的。
DVFS (Dynamic Voltage and Frequency Scaling)
- Adjust VDD and f together based on workload
- High load → high V/f for performance
- Low load → low V/f to save power
- Key relationship: fmax ∝ (VDD - Vth)2 / VDD (simplified: f ∝ VDD)
- Power scales as P ∝ VDD3 (because both V2 and f scale)
DVFS就像汽车的"巡航模式":高速公路上加大油门(高电压高频率)跑得快,城市里降低油门(低电压低频率)省油。手机CPU就大量使用DVFS:刷微博时降频省电,玩游戏时全速运行。功耗按电压的三次方下降,所以降压10%可以省电约27%。
Q: A chip has Ctotal=20nF, VDD=1.0V, f=1GHz, alpha=0.15, Ileak=100mA. Find total power.
- Pdyn = 0.15 × 20×10-9 × 1.02 × 109 = 3.0 W
- Pstatic = 0.1 × 1.0 = 0.1 W
- Ptotal ≈ 3.0 + 0.1 = 3.1 W (ignoring short-circuit)
If we scale VDD to 0.8V: f scales to 0.8GHz, Pdyn = 0.15 × 20e-9 × 0.64 × 0.8e9 = 1.536 W (49% reduction!)
Shifters & Crossbars
Lecture 21 | Final scope
移位器用于将二进制数左移或右移若干位(乘以或除以2的幂)。桶形移位器用多级MUX实现对数级延迟的任意移位。漏斗移位器把左移和右移统一成一个操作。交叉开关是N个输入到M个输出的全连接开关网络,用于路由器和网络芯片。就像火车站的"道岔"系统,可以把任意一条入轨连到任意一条出轨。
Barrel Shifter
An N-bit barrel shifter uses log2(N) stages of 2:1 muxes.
- Stage k shifts by 2k positions if shift_amount[k] = 1
- For 8-bit: 3 stages (shift by 1, 2, 4)
- For 32-bit: 5 stages (shift by 1, 2, 4, 8, 16)
- Each stage: N muxes selecting between shifted and unshifted
- Delay: O(log N) mux delays
- Area: O(N log N) muxes total
桶形移位器的巧妙之处:把移位量分解成二进制位。比如移位5=101(二进制),就先移1位(第0级),跳过移2位(第1级),再移4位(第2级)。每一级是一排MUX,选择"移还是不移"。就像调音量旋钮,每个刻度都是上一个的两倍。
// Shift amount = 5 = 101 in binary
// Input: A B C D E F G H
// Stage 0 (shift by 1, enabled): B C D E F G H 0
// Stage 1 (shift by 2, skip): B C D E F G H 0
// Stage 2 (shift by 4, enabled): F G H 0 0 0 0 0
// Result: left shift by 5
Funnel Shifter
- Combines left shift, right shift, and rotate into one structure
- Place two copies of input side-by-side: [A | A] (2N bits)
- Select N consecutive bits from this 2N-bit word
- Left shift by k: select bits starting at position N-k
- Right shift by k: select bits starting at position k (with sign/zero extension for upper copy)
- Rotate by k: select bits starting at position k (upper copy = lower copy)
漏斗移位器的思路很优雅:把输入复制一份拼成2N位的"长条",然后从中选取连续的N位。左移就是选靠后的位置,右移就是选靠前的位置,旋转就是两份都用原数据。就像一个"取景框"在两倍长的胶片上滑动。
Crossbar Switch
- Connects any of N inputs to any of M outputs
- Each output has an N:1 mux
- Total: M muxes, each N-to-1
- Area: O(N × M) — grows quadratically
- Delay: O(1) mux delay per output (single stage)
- Used in: network switches, NoC routers, register file read ports
交叉开关就像电话交换机:N个输入和M个输出之间可以任意连接。每个输出端有一个N选1的MUX。面积是N*M(每个交叉点一个开关),所以规模大了很贵。在片上网络(NoC)和路由器中广泛使用。
Binary Division
Lecture 21 | Final scope
二进制除法比乘法更复杂,因为每一步都依赖上一步的结果(无法像乘法那样并行)。恢复除法最直观:试减,如果不够就加回来。不恢复除法更快:不加回去,而是下一步补偿。SRT除法用更高基数(基4/8)每步产生多位商,是现代处理器的标准做法。就像手算长除法,但每一步可以"猜"多位。
Restoring Division
- Initialize: remainder R = dividend (2N bits), divisor D, quotient Q = 0
- For each bit position (N iterations):
- Left-shift R by 1
- Trial subtract: R' = R - D
- If R' ≥ 0: set Q bit = 1, R = R'
- If R' < 0: set Q bit = 0, restore R (don't update)
- After N steps: Q = quotient, R = remainder
Delay: O(N) iterations, each with an N-bit subtraction
Hardware: N-bit adder/subtractor + shift registers
恢复除法就像手算除法:每次"试探"能不能减去除数。能减(够减)就商1,不能减(不够减)就商0并把减掉的加回来(恢复)。缺点是"加回来"这一步浪费时间。
Dividend = 0111 (7), Divisor = 011 (3)
Step 1: R=0111, shift left -> 01110
Trial: 01110 - 01100 = 00010 >= 0, Q[1]=1, R=00010
Step 2: R=00010, shift left -> 00100
Trial: 00100 - 01100 = -01000 < 0, Q[0]=0, R=00100(restore)
Result: Q = 10 (2), R = 01 (1)
Check: 7 = 3*2 + 1 correct!
Non-Restoring Division
- Instead of restoring, add D in the next step if previous step was negative
- If previous remainder ≥ 0: shift left, subtract D, Q bit = 1
- If previous remainder < 0: shift left, add D, Q bit = 0
- At the end: if remainder is negative, add D to correct
- Advantage: only one add/subtract per step (no restore step)
- Quotient bits are in {0,1} but may need final correction
不恢复除法的思路:如果这一步减多了(结果为负),下一步不是"加回来再减",而是直接"加"来补偿。省去了恢复那一步。就像在天平上称重:放错了砝码不用拿下来重放,直接在另一边加砝码补偿。
SRT Division
- Named after Sweeney, Robertson, Tocher
- Uses radix-4 or higher: produces 2+ quotient bits per iteration
- Quotient digits from redundant set {-2, -1, 0, 1, 2} for radix-4
- Uses a lookup table (PLA) indexed by top bits of remainder and divisor
- Redundancy allows imprecise remainder estimation → faster
- Delay: O(N/2) for radix-4, O(N/k) for radix-2k
- Used in modern Intel/AMD/ARM processors
SRT除法是现代CPU的标准。基4版本每步产生2位商,速度翻倍。关键技巧是使用"冗余表示":商的每一位可以是-2到+2(不只是0和1),这样不需要精确知道余数的值就能选择商,容错性更强。就像去超市不用精确到分地算钱,四舍五入也能结账。(著名的Pentium FDIV bug就是SRT表格里少了几个条目。)
Division is inherently sequential (each step depends on previous remainder). Unlike multiplication, you cannot parallelize the iterations. This is why division is 10-40x slower than multiplication in hardware.
FPGA vs ASIC
Project-relevant | Final scope
FPGA(现场可编程门阵列)就像"乐高积木"——买回来后可以反复重新编程配置。ASIC(专用集成电路)就像"定制模具"——一旦制造出来就不能改了。FPGA适合原型验证和小批量,ASIC适合大批量生产(手机芯片等)。我们151的项目就是在FPGA上实现RISC-V处理器。
FPGA Architecture
- LUT (Look-Up Table): Small SRAM that can implement any N-input boolean function (typically 4-6 inputs). Acts as a universal gate.
- CLB (Configurable Logic Block): Contains multiple LUTs + flip-flops + carry chains + muxes
- Routing: Programmable interconnect (switch boxes + connection boxes) that connects CLBs
- IOBs: Input/Output blocks at chip periphery
- Block RAM: Dedicated SRAM blocks (e.g., 36Kbit)
- DSP slices: Hardened multiply-accumulate units
FPGA的核心是LUT(查找表)——一小块SRAM存储真值表,可以实现任何N输入布尔函数。比如4输入LUT有16个存储单元(2^4),可以实现任何4输入逻辑。CLB把多个LUT、触发器、进位链打包在一起。可编程路由网络把它们连接起来。就像一块"万能面包板"。
FPGA vs ASIC Comparison
| Aspect | FPGA | ASIC |
|---|---|---|
| Reconfigurable | Yes (reprogram anytime) | No (fixed after fabrication) |
| NRE Cost | Low ($0 - $10K) | Very high ($1M - $100M+) |
| Unit Cost | High ($10 - $10K per chip) | Low ($0.10 - $100 per chip) |
| Performance | ~100-500 MHz typical | ~1-5 GHz |
| Power Efficiency | 10-100x worse | Best (optimized transistors) |
| Area Efficiency | 10-40x worse | Best (custom layout) |
| Time to Market | Days to weeks | Months to years |
| Best For | Prototyping, low volume, reconfigurable | High volume, max performance |
选择FPGA还是ASIC就像选择"3D打印"还是"开模具":3D打印(FPGA)灵活快速但单件贵,开模具(ASIC)前期投入大但大量生产便宜。FPGA因为用通用LUT和可编程路由,面积和功耗都比ASIC大很多,但开发周期短,适合我们做项目验证。
FPGA Design Flow
- RTL Design: Write SystemVerilog (behavioral)
- Simulation: Verify with testbenches (iverilog, VCS)
- Synthesis: Convert RTL to LUT/FF netlist (Vivado)
- Place & Route: Map netlist to physical FPGA resources
- Bitstream Generation: Create configuration file
- Program FPGA: Load bitstream onto device
ASIC Design Flow
- RTL Design: Write SystemVerilog
- Simulation: Functional verification
- Logic Synthesis: Map to standard cell library (Design Compiler)
- Floor Planning: Arrange major blocks on die
- Place & Route: Place cells and connect wires
- Timing Closure: Meet setup/hold across all paths (STA)
- DRC/LVS: Design rule check / layout vs schematic
- Tapeout: Send to foundry (TSMC, Samsung, etc.)
- Fabrication: Weeks to months
FPGA流程:写代码 → 仿真 → 综合(变成LUT网表) → 布局布线 → 生成比特流 → 下载到板子上。几小时就能完成。ASIC流程多了很多步:综合到标准单元库、时序收敛、物理验证(DRC/LVS)、然后送去代工厂流片,整个过程几个月到一年。
In EECS 151, your RISC-V processor targets the Xilinx PYNQ-Z1 FPGA. Use Vivado for synthesis. Key FPGA-specific considerations: LUT utilization, BRAM usage, timing constraints (100MHz target), and using always @(*) for iverilog compatibility.
Cheatsheet: Midterm 1
Boolean Algebra
- De Morgan: (AB)'=A'+B', (A+B)'=A'B'
- Absorption: A+AB=A
- Consensus: AB+A'C+BC=AB+A'C
- SOP: OR of AND minterms
- POS: AND of OR maxterms
Noise Margins
- NM_H = V_OH - V_IH
- NM_L = V_IL - V_OL
- t_p = (t_pHL + t_pLH)/2
SystemVerilog
- assign = continuous (comb)
- always_comb: blocking (=)
- always_ff: non-blocking (<=)
- Module instantiation: .port(signal)
- generate/for: parameterized hardware
FSMs
- Moore: output = f(state)
- Mealy: output = f(state, input)
- Two-block style: 1 always_ff + 1 always_comb
RISC-V Formats
- R: funct7|rs2|rs1|f3|rd|op (ADD,SUB,AND)
- I: imm[11:0]|rs1|f3|rd|op (ADDI,LW,JALR)
- S: imm[11:5]|rs2|rs1|f3|imm[4:0]|op (SW)
- B: imm|rs2|rs1|f3|imm|op (BEQ,BNE)
- U: imm[31:12]|rd|op (LUI,AUIPC)
- J: imm|rd|op (JAL)
Pipelining
- 5 stages: F D X M W
- T_min = t_cq + t_logic + t_su
- Hold: t_cq + t_min > t_hold
- Hazards: structural, data, control
- Fix: forwarding, stalling, prediction
Control Signals
- PCSel: pc+4 or branch target
- ImmSel: I/S/B/U/J format
- ALUSel: ADD/SUB/AND/OR/XOR/SLT
- MemRW: Read/Write
- WBSel: ALU/mem/PC+4
- RegWEn: write enable
Key Constants
- x0 always = 0
- JAL: rd=PC+4, PC=PC+imm
- JALR: rd=PC+4, PC=rs1+imm
- LUI: rd=imm<<12
- AUIPC: rd=PC+(imm<<12)
Cheatsheet: Midterm 2
CMOS Basics
- NMOS ON: V_GS > V_th (strong 0)
- PMOS ON: V_GS < -|V_th| (strong 1)
- PDN: series=AND, parallel=OR
- PUN = dual of PDN
- Always inverting: NAND, NOR only
- Sizing: N in series => width x N
CMOS Delay
- R = R_on/W, C_g = W*C_i, C_d = gamma*C_g
- t_p = ln(2)*R*(C_p + C_L)
- Elmore: t = ln(2)*sum(R_shared * C_i)
- Off-path caps: YES. Off-path R: NO
- Chain: f_opt = F^(1/N), f ~= e ~= 2.7
Logical Effort
- d = g*h + p
- INV: g=1,p=1. NAND2: g=4/3,p=2
- NAND3: g=5/3,p=3. NOR2: g=5/3,p=2
- F = G*B*H, f_hat = F^(1/N)
- D_min = N*f_hat + sum(p_i)
- Size: C_in,i = g_i * C_out,i / f_hat
Power & Energy
- P_dyn = alpha * C * V_DD^2 * f
- alpha = 2*p*(1-p)
- E_switch = C * V_DD^2
- P_sc = I_sc * V_DD (peak at V_DD/2)
- P_leak = I_leak * V_DD
- V scaling: f prop V, P prop V^3
- DVFS: adjust V and f with workload
Wires
- Lumped: t = 0.7RC
- Distributed: t = 0.35RC
- R = rho*L/(W*t), C = eps*A/d
- Use Elmore delay for RC trees
Adders
- G=AB, P=A xor B, C_out=G+P*Cin
- RCA: O(N). Bypass: O(sqrt N), b=sqrt(N/2)
- Select: O(sqrt N). CLA: O(log N)
- Bypass delay: 2b + N/b - 2
Multipliers
- CSA: 3 nums -> 2 in O(1)
- Wallace: O(log_1.5 N) CSA levels
- Booth radix-4: N->N/2 partial products
- 3A = 4A - A trick
Constants
- ln(2) = 0.693
- e = 2.718 (optimal fanout)
- gamma = C_d/C_g (usually 1)
- NAND preferred over NOR
Cheatsheet: Final Exam
Cumulative: includes all MT1 + MT2 content plus the following.
Timing Constraints
- T >= t_cq + t_logic,max + t_su + t_sk + t_j
- Hold: t_cq + t_min >= t_hold + t_sk + t_j
- Setup slack = T - (t_cq+t_max+t_su-t_sk+t_j)
- Hold slack = (t_cq+t_min-t_sk-t_j) - t_hold
- Setup fail: reduce logic, retime, pipeline
- Hold fail: add buffers (cannot fix with freq)
Clock Non-Idealities
- Skew (t_SK): spatial (deterministic+random)
- Jitter (t_J): temporal (cycle-to-cycle)
- +skew: helps setup, hurts hold
- -skew: hurts setup, helps hold
- Both reduce effective cycle time
Latch vs Flip-Flop
- Latch: level-sensitive (transparent)
- FF: edge-triggered (two latches)
- Latch has t_cq, t_dq, t_su, t_hold
- FF has t_cq, t_su, t_hold
- Transmission gate latch: NMOS+PMOS parallel
SRAM
- 6T cell: 2 access + 2 inverters
- Read: precharge BL/BLbar, assert WL
- Read stability: PD > Access transistor
- Write: force BL low, assert WL
- Writability: Access > PU
- Ratio: PD:Access:PU = 3:2:1 (approx)
Memory Periphery
- Row decoder: NAND gates -> wordline
- Predecoder: smaller groups first
- Bitline conditioning: precharge high
- Write driver: pull one BL to 0
- Sense amplifier: detect BL vs BLbar diff
Power Analysis
- P_total = P_dyn + P_sc + P_leak
- P_dyn = alpha*C*V^2*f
- E_op = alpha*C*V^2 (freq independent)
- Clock gating: reduce alpha
- DVFS: adjust V,f with workload
- Power gating: cut V_DD to idle blocks
Division Algorithms
- Restoring: trial subtract, restore if neg
- Non-restoring: add next step if neg
- SRT: radix-4, 2 bits/iter, redundant digits
- Division is sequential: O(N) iterations
- ~10-40x slower than multiplication
Shifters & FPGA
- Barrel: log2(N) mux stages, O(NlogN)
- Funnel: 2N concat, select N bits
- Crossbar: N*M muxes, O(1) delay
- FPGA: LUT + CLB + routing + BRAM
- ASIC: standard cells, tapeout, months
- FPGA 10-100x less efficient than ASIC
Flashcards
Click to flip. Covers all exam material.