Models of Language and Conversation

Lecture 21: GPT Architecture

Generative Pre-Training for Language

Week 7


Today's Journey

  1. The GPT Revolution - Pre-training + fine-tuning paradigm
  2. Transformer Decoder - Architecture walkthrough
  3. Masked Self-Attention - The core innovation
  4. Two-Stage Training - Pre-training and fine-tuning
  5. Hands-on Code - Building GPT blocks in PyTorch
  6. Key Results - What made GPT successful

The GPT Revolution

Discussion

What made GPT different from previous language models?

Key Innovation: Pre-training + Fine-tuning

  • Pre-train on massive unlabeled text (unsupervised)
  • Fine-tune on specific tasks (supervised)
  • Transfer learning for NLP
  • One model, many tasks
Paradigm Shift

Before GPT: Task-specific models trained from scratch

After GPT: General-purpose models adapted to tasks

Reference: Radford et al. (2018) - "Improving Language Understanding by Generative Pre-Training"


From BERT to GPT: Different Approaches

BERT (2018):

  • Masked Language Model
  • Bidirectional context
  • Fill in the blank
  • Great for understanding
  • Not designed for generation

Example:
"The [MASK] sat on the mat"

GPT (2018):

  • Autoregressive Model
  • Left-to-right (causal)
  • Predict next token
  • Natural for generation
  • Can also understand

Example:
"The cat sat" -> predict "on"

Think about it!

Why is autoregressive modeling more natural for text generation?


The Transformer Decoder

GPT uses the Transformer decoder architecture

Data Flow:

  1. Input tokens
  2. Token + Position embeddings
  3. N x Transformer blocks:
    • Masked self-attention
    • Feed-forward network
  4. Output logits

Key Components:

  • Embeddings: Map tokens to vectors
  • Masked Attention: Only see past tokens
  • Feed-Forward: Process each position
  • Layer Norm: Stabilize training
Key Insight

Masked attention prevents "peeking" at future tokens - essential for autoregressive generation!


Transformer Decoder: Worked Example

Example: Processing "The cat sat"


Step 1: Token Embeddings
"The" -> [0.2, -0.1, 0.8, ...]  (768-dim vector)
"cat" -> [0.5, 0.3, -0.2, ...]
"sat" -> [-0.1, 0.7, 0.4, ...]

Step 2: Add Position Embeddings
Position 0 embedding + "The" embedding
Position 1 embedding + "cat" embedding
Position 2 embedding + "sat" embedding

Step 3: Masked Self-Attention (for each layer)
"The" attends to: [The]
"cat" attends to: [The, cat]
"sat" attends to: [The, cat, sat]

Step 4: Predict Next Token
Output logits -> softmax -> "on" (highest probability)

Masked Self-Attention

The Core Innovation: Causal Masking

Attention Matrix for "The cat sat on":

        The    cat    sat    on
The     1.0    -inf   -inf   -inf
cat     0.3    0.7    -inf   -inf
sat     0.2    0.3    0.5    -inf
on      0.1    0.2    0.3    0.4

Masked positions are set to -inf before the softmax, so their attention weight becomes 0; the remaining entries are the post-softmax attention weights (each row sums to 1).

Each token can only attend to itself and previous tokens

  • Ensures autoregressive property
  • No information leakage from future
  • Allows parallel training

Causal Mask: Code Implementation


import torch

def create_causal_mask(seq_len):
    """Create lower-triangular mask for causal attention."""
    # Create a matrix of ones
    mask = torch.ones(seq_len, seq_len)
    # Keep only lower triangle (including diagonal)
    mask = torch.tril(mask)
    return mask

# Example for sequence length 4
mask = create_causal_mask(4)
print(mask)
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])

# In attention: scores.masked_fill(mask == 0, float('-inf'))

This mask is applied before softmax to prevent attending to future tokens!
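To make "applied before softmax" concrete, here is a minimal sketch (illustrative shapes, not the full GPT attention code) of masking raw attention scores:

import torch
import torch.nn.functional as F

seq_len = 4
scores = torch.randn(seq_len, seq_len)                  # raw query-key scores (illustrative)
mask = create_causal_mask(seq_len)                      # lower-triangular mask from above
scores = scores.masked_fill(mask == 0, float('-inf'))   # block future positions
weights = F.softmax(scores, dim=-1)                     # masked positions become exactly 0
# Row i of `weights` now distributes attention only over positions 0..i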


Position Encoding in GPT

Why we need position information:

The Problem

Self-attention is permutation invariant - it doesn't inherently know word order!

GPT's Solution: Learned Position Embeddings

  • Each position gets a learnable embedding
  • Added to token embeddings: embedding = token_emb + pos_emb
  • Model learns position patterns during training
  • Maximum sequence length: 512 tokens (GPT-1)

Alternative: Sinusoidal
(Used in original Transformer)

  • Fixed function
  • Generalizes to longer sequences

GPT: Learned
(More flexible)

  • Learned from data
  • Task-specific patterns
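For contrast with GPT's learned embeddings, a minimal sketch of the fixed sinusoidal encoding from the original Transformer (standard formulation, assuming an even d_model; not GPT-1 code):

import math
import torch

def sinusoidal_positions(max_len, d_model):
    # Fixed (non-learned) encodings: sine on even dimensions, cosine on odd ones
    positions = torch.arange(max_len).unsqueeze(1).float()        # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(positions * div_term)
    pe[:, 1::2] = torch.cos(positions * div_term)
    return pe                                                     # added to token embeddings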

Position Embedding: Code Example


import torch
import torch.nn as nn

class GPTEmbedding(nn.Module):
    def __init__(self, vocab_size, d_model, max_seq_len):
        super().__init__()
        # Token embeddings: word -> vector
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # Position embeddings: position -> vector
        self.pos_embed = nn.Embedding(max_seq_len, d_model)

    def forward(self, x):
        seq_len = x.size(1)
        # Create position indices [0, 1, 2, ..., seq_len-1]
        positions = torch.arange(seq_len, device=x.device)
        # Combine: token embedding + position embedding
        return self.token_embed(x) + self.pos_embed(positions)

# Example usage
embed = GPTEmbedding(vocab_size=50000, d_model=768, max_seq_len=512)
tokens = torch.tensor([[101, 2054, 2003]])  # "The cat sat" (illustrative token IDs)
embeddings = embed(tokens)  # Shape: (1, 3, 768)

GPT-1 Model Specifications

Component | Value
Layers | 12
Hidden size | 768
Attention heads | 12
Max sequence length | 512 tokens
Vocabulary size | 40,000 (BPE)
Training data | BooksCorpus (7,000 books)
Training tokens | ~5 billion
Total parameters | ~117 million

Training objective:

Maximize likelihood of next token given previous context.
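In the notation of Radford et al. (2018), this objective maximizes the log-likelihood of each token given a context window of k preceding tokens:

L_1(\mathcal{U}) = \sum_i \log P(u_i \mid u_{i-k}, \ldots, u_{i-1}; \Theta)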


GPT Architecture in Code


import torch.nn as nn

class GPTBlock(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attention = MaskedMultiHeadAttention(d_model, n_heads)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),  # GPT uses GELU, not ReLU
            nn.Linear(4 * d_model, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Pre-norm architecture (LayerNorm before sublayer)
        # Self-attention with residual connection
        x = x + self.attention(self.norm1(x))
        # Feed-forward with residual connection
        x = x + self.ffn(self.norm2(x))
        return x
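The GPTBlock above references a MaskedMultiHeadAttention module that is not defined on the slide. Here is a minimal sketch of what it could look like, combining the causal mask from earlier with standard multi-head attention (illustrative, not the exact GPT-1 implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMultiHeadAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)   # joint projection to queries, keys, values
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x):
        batch, seq_len, d_model = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Split into heads: (batch, n_heads, seq_len, d_head)
        q = q.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(batch, seq_len, self.n_heads, self.d_head).transpose(1, 2)
        # Scaled dot-product scores with the causal mask applied before softmax
        scores = q @ k.transpose(-2, -1) / (self.d_head ** 0.5)
        mask = torch.tril(torch.ones(seq_len, seq_len, device=x.device))
        scores = scores.masked_fill(mask == 0, float('-inf'))
        weights = F.softmax(scores, dim=-1)
        out = weights @ v                            # (batch, n_heads, seq_len, d_head)
        out = out.transpose(1, 2).reshape(batch, seq_len, d_model)
        return self.out(out)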

Complete GPT Model Structure


class GPT(nn.Module):
    def __init__(self, vocab_size, d_model, n_layers, n_heads, max_len):
        super().__init__()
        self.token_embed = nn.Embedding(vocab_size, d_model)
        self.pos_embed = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList([
            GPTBlock(d_model, n_heads) for _ in range(n_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):
        seq_len = x.size(1)
        positions = torch.arange(seq_len, device=x.device)
        x = self.token_embed(x) + self.pos_embed(positions)
        for block in self.blocks:
            x = block(x)
        x = self.norm(x)
        logits = self.head(x)  # (batch, seq_len, vocab_size)
        return logits
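Hypothetical usage of the model above for greedy next-token generation (token IDs are illustrative, not from a real tokenizer):

import torch

model = GPT(vocab_size=50000, d_model=768, n_layers=12, n_heads=12, max_len=512)
tokens = torch.tensor([[101, 2054, 2003]])                      # e.g. "The cat sat"
for _ in range(5):
    logits = model(tokens)                                      # (1, seq_len, vocab_size)
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most probable next token
    tokens = torch.cat([tokens, next_token], dim=1)             # append and predict again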

Two-Stage Training

Stage 1: Pre-training

  • Dataset: BooksCorpus (5B tokens)
  • Objective: Next token prediction
  • Duration: Weeks of training
  • Result: General language model

Stage 2: Fine-tuning

  • Dataset: Task-specific (1K-100K examples)
  • Objective: Task loss + LM loss (written out after this list)
  • Duration: Hours of training
  • Result: Task-specialized model

Why this works:

  • Pre-training learns general language understanding
  • Fine-tuning adapts to specific task format
  • Much less task data needed!
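The "Task loss + LM loss" objective above is, in the paper's notation, the supervised loss L_2 plus the pre-training language-modeling loss L_1 kept as an auxiliary term (GPT-1 uses λ = 0.5):

L_3(\mathcal{C}) = L_2(\mathcal{C}) + \lambda \, L_1(\mathcal{C})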

Pre-training: Language Modeling

Objective: Predict next token

Example Training Sequence

Text: "The quick brown fox jumps over the lazy dog"

Training examples (teacher forcing):

  • Input: "The" -> Target: "quick"
  • Input: "The quick" -> Target: "brown"
  • Input: "The quick brown" -> Target: "fox"
  • Input: "The quick brown fox" -> Target: "jumps"
  • ...and so on

Self-supervised learning:

  • No manual labels needed!
  • Every text provides training signal
  • Can use web-scale data
  • Model learns: syntax, semantics, facts, reasoning

Pre-training: Worked Example

How loss is computed for one sequence:


# Input sequence: "The cat sat on the"
tokens = [101, 2054, 2003, 2006, 1996]  # Token IDs

# Model predicts probability distribution for each position
# Position 0: P(next | "The") = {"cat": 0.3, "dog": 0.2, ...}
# Position 1: P(next | "The cat") = {"sat": 0.4, "ran": 0.1, ...}
# etc.

# Target tokens (shifted by 1)
targets = [2054, 2003, 2006, 1996, 2282]  # "cat sat on the mat"

# Cross-entropy loss at each position
loss_0 = -log(0.3)   # P("cat" | "The")
loss_1 = -log(0.4)   # P("sat" | "The cat")
loss_2 = -log(0.35)  # P("on" | "The cat sat")
# ...

total_loss = mean(loss_0, loss_1, loss_2, ...)
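A hedged sketch of the same computation using PyTorch's cross_entropy, assuming `model` is the GPT module defined earlier (token IDs illustrative):

import torch
import torch.nn.functional as F

tokens = torch.tensor([[101, 2054, 2003, 2006, 1996, 2282]])   # "The cat sat on the mat"
inputs, targets = tokens[:, :-1], tokens[:, 1:]                # predict token t+1 from tokens 0..t
logits = model(inputs)                                         # (1, 5, vocab_size)
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))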

Fine-tuning for Downstream Tasks

Task Format: Input + Delimiter + Label

Text Classification:


[START] This movie is amazing! [DELIM]

Model predicts: "Positive"

Entailment:


[START] Premise [DELIM] Hypothesis [DELIM]

Model predicts: "Entailment" / "Contradiction"

Question Answering:


[START] Context [DELIM] Question [DELIM]

Model predicts answer span

Similarity:


[START] Text1 [DELIM] Text2 [DELIM]

Model predicts similarity score

Key Insight

All tasks can be framed as text completion!
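A hypothetical helper (delimiters and field names illustrative) showing how these input transformations reduce to plain string formatting:

def format_for_gpt(task, **fields):
    # Hypothetical input transformations in the style sketched above
    if task == "classification":
        return f"[START] {fields['text']} [DELIM]"
    if task == "entailment":
        return f"[START] {fields['premise']} [DELIM] {fields['hypothesis']} [DELIM]"
    if task == "qa":
        return f"[START] {fields['context']} [DELIM] {fields['question']} [DELIM]"
    if task == "similarity":
        return f"[START] {fields['text1']} [DELIM] {fields['text2']} [DELIM]"
    raise ValueError(f"unknown task: {task}")

# e.g. format_for_gpt("classification", text="This movie is amazing!")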


Fine-tuning: Concrete Example

Sentiment Classification Example:


# Training example
text = "This movie was absolutely fantastic!"
label = "positive"

# Format for GPT
input_text = "[START] This movie was absolutely fantastic! [DELIM]"
# Tokenize: [50256, 1212, 3807, 373, 5765, 12779, 0, 50257]

# During fine-tuning:
# 1. GPT processes the input
# 2. Take the hidden state at [DELIM] position
# 3. Pass through classification head
logits = classification_head(hidden_state)  # [pos_score, neg_score]
# 4. Compute cross-entropy with label
loss = cross_entropy(logits, label_id)

# Fine-tuning hyperparameters
learning_rate = 6.25e-5  # Much lower than pre-training!
batch_size = 32
epochs = 3

Fine-tuning Implementation Details

What changes during fine-tuning?

  1. Add task-specific input transformations

    • Format inputs with delimiters
    • Add special tokens
  2. Add classification head (optional)

    • Linear layer for classification
    • Or use language modeling head
  3. Train with supervised objective

    • Cross-entropy loss
    • Much lower learning rate
    • Few epochs (3-5)

Hyperparameters:

  • Learning rate: 6.25e-5 (much lower than pre-training!)
  • Batch size: 32
  • Max epochs: 3
  • Linear learning rate decay to 0
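Putting the pieces together, a hedged sketch of one fine-tuning step with these hyperparameters; `gpt` (a backbone returning hidden states), `lm_head`, and `classification_head` are hypothetical stand-ins, not the paper's code:

import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(
    list(gpt.parameters()) + list(classification_head.parameters()),
    lr=6.25e-5,                                   # fine-tuning learning rate
)

def fine_tune_step(tokens, label_ids, lam=0.5):   # lam weights the auxiliary LM loss
    hidden = gpt(tokens)                          # (batch, seq_len, d_model)
    # Task loss: classify from the hidden state at the final ([DELIM]) position
    task_logits = classification_head(hidden[:, -1, :])
    task_loss = F.cross_entropy(task_logits, label_ids)
    # Auxiliary language-modeling loss on the same input
    lm_logits = lm_head(hidden)                   # (batch, seq_len, vocab_size)
    lm_loss = F.cross_entropy(
        lm_logits[:, :-1, :].reshape(-1, lm_logits.size(-1)),
        tokens[:, 1:].reshape(-1),
    )
    loss = task_loss + lam * lm_loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()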

GPT-1 Results

Performance on GLUE Benchmark:

Task | Previous SOTA | GPT-1
Question Answering | 86.7 | 88.1
Semantic Similarity | 85.0 | 85.8
Text Classification | 93.0 | 94.2
Natural Language Inference | 80.6 | 82.1
Key Findings
  • GPT achieved SOTA on 9 out of 12 tasks studied
  • Large improvements on tasks with less training data
  • Transfer learning works for NLP!

Reference: Radford et al. (2018) - "Improving Language Understanding by Generative Pre-Training"


Zero-Shot and Few-Shot Learning

Discussion

What if we could use the model without fine-tuning?

Definitions:

  • Zero-shot: No task-specific training examples
  • Few-shot: A few examples (1-10) as context
  • Fine-tuning: Full supervised training

GPT-1 observations:

  • Fine-tuning works best
  • Zero-shot performance is weak
  • Model has learned a lot, but needs task formatting
Think about it!

This limitation motivated GPT-2 and GPT-3: Can we make models that work zero-shot?


Zero-Shot vs Fine-tuning: Example

Task: Sentiment Classification

Zero-shot (GPT-1):


1Input: "Review: Great movie!
2Sentiment:"
3Output: ??? (unreliable)

GPT-1 might simply continue the review text rather than produce a sentiment label.

After Fine-tuning:


1Input: "[START] Great movie! [DELIM]"
2Output: "Positive" (reliable)

Model learned task format from examples.

Why the difference?

  • Pre-training teaches language, not task formats
  • Fine-tuning teaches: "when you see [DELIM], output a label"
  • GPT-2/3 would see enough examples in training to do this zero-shot

Visualizing What GPT Learns

Layer-by-layer analysis of GPT representations:

Layer | What it learns | Example
1-3 | Word order, grammar | "The cat" vs "cat The"
4-6 | Word meanings, syntax | Subject-verb agreement
7-9 | Semantic relationships | "bank" in different contexts
10-12 | World knowledge, task-specific reasoning | "Paris is the capital of..."

Layer analysis shows:

  • Early layers: Syntactic patterns
  • Middle layers: Semantic understanding
  • Later layers: Task-specific reasoning
  • Progressive abstraction!

The Generative Pre-training Paradigm

Why "Generative Pre-Training" was revolutionary:

  1. Unified Architecture

    • Same model for all tasks
    • vs. task-specific architectures
  2. Transfer Learning

    • Learn once, apply many times
    • vs. training from scratch
  3. Scalability

    • More data -> Better performance
    • Clear path to improvement
  4. Simplicity

    • Just predict next token
    • No complex objectives
This paradigm enabled GPT-2, GPT-3, and beyond!

Limitations of GPT-1

What GPT-1 couldn't do well:

  • Zero-shot performance: Needed fine-tuning for each task
  • Model size: 117M parameters (small by today's standards)
  • Training data: Limited to ~5B tokens
  • Context length: Only 512 tokens
  • Generation quality: Sometimes incoherent
  • Factual accuracy: Prone to hallucinations
Discussion

How would you address these limitations?

(Spoiler: GPT-2 and GPT-3 tried to solve these!)


Comparison with BERT

Aspect | BERT | GPT
Training objective | Masked LM | Autoregressive LM
Context | Bidirectional | Left-to-right
Best for | Understanding | Generation
Fine-tuning | Task-specific heads | Unified format
Parameters (base) | 110M | 117M
Think about it!

BERT dominated understanding tasks (2018-2019), but GPT's approach proved more scalable and versatile. Why?

Answer: Autoregressive modeling naturally supports generation, which unlocks zero-shot and few-shot capabilities!


Key Takeaways

  1. GPT introduced generative pre-training

    • Transformer decoder architecture
    • Autoregressive language modeling
  2. Two-stage training paradigm

    • Unsupervised pre-training on massive text
    • Supervised fine-tuning on task data
  3. Masked self-attention is key

    • Enables autoregressive generation
    • Prevents information leakage
  4. Transfer learning for NLP

    • Learn general patterns once
    • Apply to many tasks
  5. Foundation for modern LLMs

    • GPT-2, GPT-3, ChatGPT all build on this

Readings

Required Reading

Radford et al. (2018) - "Improving Language Understanding by Generative Pre-Training"

[PDF]

Recommended Readings
  • Vaswani et al. (2017) - "Attention is All You Need" [ArXiv]
  • Devlin et al. (2018) - "BERT: Pre-training of Deep Bidirectional Transformers" [ArXiv]
  • The Illustrated GPT-2 by Jay Alammar [Blog]

Next Lecture Preview

Lecture 22: Scaling Up to GPT-3 and Beyond
  • GPT-2: Language models as unsupervised multitask learners
  • GPT-3: Few-shot learning at scale
  • Scaling laws: Why bigger is better
  • Emergent abilities of large language models
  • From GPT-3 to ChatGPT

Questions?
