Model	Total Params	Active Params	MMLU Score	Speed vs 70B
Llama 2 13B	13B	13B	55.0	5× faster
Mixtral 8x7B	47B	13B	70.6	5× faster
Llama 2 70B	70B	70B	69.7	1× (baseline)
GPT-3.5	~175B	~175B	70.0	N/A (API)

PSYC 51.07: Models of Language and Communication

Inference Optimizations

Making generation faster:

1. KV Cache (Essential)


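# Without cache: recompute attention over the whole prefix at every step
# Token 100 attends to tokens 1-99
# = O(n²) attention work per generated token!

# With cache: store each token's K,V once and reuse them
cache = {}
for pos, token in enumerate(sequence):
    k, v = compute_kv(token)   # compute_kv: placeholder for the K,V projections
    cache[pos] = (k, v)        # Store!
    # K,V are computed only once per token; the new query
    # attends over cache[0..pos], which is O(n) work per token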
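A minimal runnable sketch of the same idea, assuming a single attention head and random toy weights. The names project, attend, decode_without_cache, and decode_with_cache are illustrative, not from any library. It counts how many K,V projections each strategy performs and checks that both give the same outputs.

```python
import math
import random

def project(x, w):
    """Multiply vector x by matrix w (rows = input dims, columns = output dims)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def attend(q, keys, values):
    """Single-head softmax attention of one query over the given keys and values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def decode_without_cache(tokens, wq, wk, wv):
    outputs, kv_projections = [], 0
    for t in range(len(tokens)):
        # Recompute K and V for the entire prefix at every step.
        keys = [project(x, wk) for x in tokens[: t + 1]]
        values = [project(x, wv) for x in tokens[: t + 1]]
        kv_projections += t + 1
        outputs.append(attend(project(tokens[t], wq), keys, values))
    return outputs, kv_projections

def decode_with_cache(tokens, wq, wk, wv):
    outputs, kv_projections = [], 0
    k_cache, v_cache = [], []
    for x in tokens:
        # Project only the newest token; earlier K,V come from the cache.
        k_cache.append(project(x, wk))
        v_cache.append(project(x, wv))
        kv_projections += 1
        outputs.append(attend(project(x, wq), k_cache, v_cache))
    return outputs, kv_projections

random.seed(0)
dim, n = 4, 6

def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]

wq, wk, wv = rand_matrix(), rand_matrix(), rand_matrix()
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

slow_out, slow_count = decode_without_cache(tokens, wq, wk, wv)
fast_out, fast_count = decode_with_cache(tokens, wq, wk, wv)
print(slow_count, fast_count)  # K,V projection counts: 1+2+...+6 versus 6
```

The cached version trades memory for compute: the cache grows by one K,V pair per token, per layer, per head.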

2. Flash Attention


Standard: materializes the full n×n attention matrix in GPU memory
FlashAttention: computes attention tile by tile, never storing the full matrix
→ 2-4× faster, fits longer sequences
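The tiling idea can be sketched in plain Python for a single query. This is only an illustration of the running-softmax bookkeeping that makes tiling possible; the real FlashAttention is a fused GPU kernel, and its speedup comes from reduced memory traffic, which this sketch does not model. The function names and tile_size parameter are assumptions for the example.

```python
import math
import random

def standard_attention(q, keys, values):
    """Reference: builds the full row of scores before the softmax."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def tiled_attention(q, keys, values, tile_size=4):
    """Processes keys/values one tile at a time with a running softmax."""
    scale = 1.0 / math.sqrt(len(q))
    dim = len(values[0])
    running_max = -math.inf   # largest score seen so far
    running_sum = 0.0         # softmax denominator, relative to running_max
    acc = [0.0] * dim         # unnormalized weighted sum of values
    for start in range(0, len(keys), tile_size):
        k_tile = keys[start:start + tile_size]
        v_tile = values[start:start + tile_size]
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_tile]
        new_max = max(running_max, max(scores))
        # Rescale earlier partial results to the new maximum
        # (exp(-inf) is 0.0, so the first tile starts from zero).
        correction = math.exp(running_max - new_max)
        weights = [math.exp(s - new_max) for s in scores]
        running_sum = running_sum * correction + sum(weights)
        acc = [a * correction + sum(w * v[j] for w, v in zip(weights, v_tile))
               for j, a in enumerate(acc)]
        running_max = new_max
    return [a / running_sum for a in acc]

random.seed(1)
dim, n = 4, 10
q = [random.gauss(0, 1) for _ in range(dim)]
keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]

reference = standard_attention(q, keys, values)
tiled = tiled_attention(q, keys, values, tile_size=3)
```

Only one tile of scores exists at a time, so working memory depends on tile_size rather than on sequence length.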

3. Speculative Decoding


# Draft model (fast, small): 7B
draft_tokens = small_model.generate(5)
# ["The", "cat", "sat", "on", "mat"]

# Target model (slow, large): 70B
verified = large_model.verify(draft_tokens)
# ["The", "cat", "sat", "on", "the"]
#    ✓      ✓      ✓     ✓      ✗

# Accept 4/5 drafted tokens in one target forward pass;
# the target's own "the" replaces the rejected "mat"
# 2-3× speedup, same output as the target model alone
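A toy sketch of the greedy variant, assuming both models are stand-in lookup tables keyed on the previous token. DRAFT_TABLE, TARGET_TABLE, and all function names are hypothetical. The sampling-based acceptance rule used with stochastic decoding, and the bonus token a real system emits when every draft is accepted, are left out. In a real system the verification loop is a single batched forward pass of the target model.

```python
TARGET_TABLE = {"<bos>": "The", "The": "cat", "cat": "sat", "sat": "on",
                "on": "the", "the": "mat", "mat": "<eos>"}
# The draft model agrees with the target everywhere except after "on".
DRAFT_TABLE = dict(TARGET_TABLE, on="mat")

def next_token(table, context):
    """Greedy next token for a toy model that only looks at the last token."""
    last = context[-1] if context else "<bos>"
    return table.get(last, "<eos>")

def speculative_step(context, k):
    """Draft k tokens, then keep the longest prefix the target agrees with."""
    draft, ctx = [], list(context)
    for _ in range(k):
        tok = next_token(DRAFT_TABLE, ctx)
        draft.append(tok)
        ctx.append(tok)
    accepted = []
    for tok in draft:
        expected = next_token(TARGET_TABLE, list(context) + accepted)
        if tok == expected:
            accepted.append(tok)
        else:
            # First mismatch: take the target's token and stop.
            accepted.append(expected)
            break
    return draft, accepted

def generate_speculative(k, max_tokens=20):
    out, target_passes = [], 0
    while (not out or out[-1] != "<eos>") and len(out) < max_tokens:
        _, accepted = speculative_step(out, k)
        out.extend(accepted)
        target_passes += 1
        if "<eos>" in out:
            out = out[: out.index("<eos>") + 1]  # drop anything after end-of-sequence
    return out, target_passes

def generate_greedy(max_tokens=20):
    out = []
    while (not out or out[-1] != "<eos>") and len(out) < max_tokens:
        out.append(next_token(TARGET_TABLE, out))
    return out

draft, accepted = speculative_step([], k=5)
spec_out, spec_passes = generate_speculative(k=5)
greedy_out = generate_greedy()
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding with the target; only the number of target passes changes.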

4. Continuous Batching (vLLM)

New requests join the running batch as soon as a slot frees up
Finished sequences leave immediately: no waiting for the longest sequence
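A small simulation can make the difference concrete. It assumes every request is already queued, each request needs a known number of decode steps, and one step advances every running request by one token. The function names are illustrative, and this is not how the vLLM scheduler is implemented.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        total += max(lengths[i:i + batch_size])
    return total

def continuous_batching_steps(lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    waiting = deque(lengths)
    running = []  # remaining decode steps for each active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Advance every active request by one token and drop finished ones.
        running = [r - 1 for r in running if r - 1 > 0]
    return steps

lengths = [2, 8, 3, 2, 2, 3]  # decode steps needed by each request
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)  # 14 versus 11 decode steps
```

In the static schedule the short request paired with the 8-step request leaves its slot idle for six steps; the continuous schedule hands that slot to the next request in the queue.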

Technique	Best For	Trade-off
Dense Large	Maximum quality	Expensive, slow
MoE	Quality + speed	High memory
Quantization	Edge deployment	Slight quality loss
Distillation	Fixed tasks	Requires teacher
Pruning	Latency-critical	Irreversible

Lecture 25: Mixture of Experts & Efficiency

Scaling Efficiently with Sparse Models

Today's Journey

The Scaling Dilemma

Dense vs Sparse Models

What is Mixture of Experts?

MoE Architecture

The Router Mechanism

MoE in PyTorch (Simplified)

MoE Forward Pass (cont.)

Load Balancing Problem

Load Balancing Solutions

Training Instability

Memory and Communication

Mixtral 8x7B

Mixtral Performance

What Do Experts Learn?

Model Compression Methods

Inference Optimizations

State Space Models: Mamba

Efficiency Trade-off Landscape

Environmental Impact

Democratizing Access

Open vs Closed Models

Future of Efficient LLMs

Key Takeaways

Readings