* 2. Include this script: * 3. Create charts with minimal configuration - colors are auto-applied! */ (function() { 'use strict'; // ========================================================================== // READ COLORS FROM CSS CUSTOM PROPERTIES // This ensures chart colors stay in sync with the theme // ========================================================================== /** * Get a CSS custom property value from :root */ function getCSSVar(name, fallback = '') { if (typeof getComputedStyle === 'undefined') return fallback; const value = getComputedStyle(document.documentElement).getPropertyValue(name).trim(); return value || fallback; } /** * Build palette from CSS custom properties (with fallbacks) */ function buildPaletteFromCSS() { return { // Primary brand colors dartmouthGreen: getCSSVar('--dartmouth-green', '#00693e'), textPrimary: getCSSVar('--text-primary', '#0a2518'), textSecondary: getCSSVar('--text-secondary', '#0a3d23'), // Chart colors (from CSS --chart-color-N variables) chartColors: [ getCSSVar('--chart-color-1', '#00693e'), getCSSVar('--chart-color-2', '#267aba'), getCSSVar('--chart-color-3', '#ffa00f'), getCSSVar('--chart-color-4', '#9d162e'), getCSSVar('--chart-color-5', '#8a6996'), getCSSVar('--chart-color-6', '#a5d75f'), getCSSVar('--chart-color-7', '#003c73'), getCSSVar('--chart-color-8', '#d94415'), getCSSVar('--chart-color-9', '#643c20'), getCSSVar('--chart-color-10', '#c4dd88'), getCSSVar('--chart-color-11', '#f5dc69'), getCSSVar('--chart-color-12', '#424141'), ], // Background colors (semi-transparent versions) chartBgColors: [ getCSSVar('--chart-bg-1', 'rgba(0, 105, 62, 0.5)'), getCSSVar('--chart-bg-2', 'rgba(38, 122, 186, 0.5)'), getCSSVar('--chart-bg-3', 'rgba(255, 160, 15, 0.5)'), getCSSVar('--chart-bg-4', 'rgba(157, 22, 46, 0.5)'), getCSSVar('--chart-bg-5', 'rgba(138, 105, 150, 0.5)'), getCSSVar('--chart-bg-6', 'rgba(165, 215, 95, 0.5)'), ], // Semantic colors positive: getCSSVar('--chart-positive', '#00693e'), negative: 
getCSSVar('--chart-negative', '#9d162e'), neutral: getCSSVar('--chart-neutral', '#424141'), highlight: getCSSVar('--chart-highlight', '#ffa00f'), // Grid and axis colors gridLight: getCSSVar('--chart-grid-light', 'rgba(0, 105, 62, 0.1)'), gridMedium: getCSSVar('--chart-grid-medium', 'rgba(0, 105, 62, 0.15)'), gridDark: getCSSVar('--chart-grid-dark', 'rgba(0, 105, 62, 0.2)'), axisColor: getCSSVar('--chart-axis-color', '#0a2518'), // Font fontFamily: getCSSVar('--chart-font-family', "'Avenir LT Std', 'Avenir', 'Avenir Next', -apple-system, BlinkMacSystemFont, sans-serif"), }; } // Initialize palette (will be populated when DOM is ready) let CDL_PALETTE = null; // For convenience, expose primary chart colors array let CHART_COLORS = null; // ========================================================================== // FONT CONFIGURATION // Responsive font sizes based on typical Marp slide dimensions (1280x720) // ========================================================================== const FONT_CONFIG = { sizes: { title: 22, // Chart title subtitle: 18, // Subtitle legend: 16, // Legend labels axisTitle: 18, // Axis titles axisTicks: 16, // Axis tick labels tooltip: 14, // Tooltip text dataLabels: 14, // Data labels on charts }, weight: { normal: 400, medium: 500, bold: 600, }, }; // ========================================================================== // HELPER FUNCTIONS // ========================================================================== /** * Ensure palette is initialized */ function ensurePalette() { if (!CDL_PALETTE) { CDL_PALETTE = buildPaletteFromCSS(); CHART_COLORS = CDL_PALETTE.chartColors; } return CDL_PALETTE; } /** * Get color for a dataset at given index * Cycles through palette if more datasets than colors */ function getColor(index) { ensurePalette(); return CHART_COLORS[index % CHART_COLORS.length]; } /** * Get color with alpha transparency */ function getColorWithAlpha(color, alpha) { // Handle hex colors if (color.startsWith('#')) { 
const r = parseInt(color.slice(1, 3), 16); const g = parseInt(color.slice(3, 5), 16); const b = parseInt(color.slice(5, 7), 16); return `rgba(${r}, ${g}, ${b}, ${alpha})`; } // Handle rgba colors if (color.startsWith('rgba')) { return color.replace(/[\d.]+\)$/, `${alpha})`); } return color; } /** * Generate colors for all datasets in chart data * Automatically assigns colors if not specified */ function autoAssignColors(data, chartType) { if (!data || !data.datasets) return data; data.datasets.forEach((dataset, index) => { const baseColor = getColor(index); // Only assign colors if not already specified switch (chartType) { case 'bar': case 'horizontalBar': if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; } if (!dataset.borderColor) { dataset.borderColor = baseColor; } if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; } break; case 'line': if (!dataset.borderColor) { dataset.borderColor = baseColor; } if (!dataset.backgroundColor) { dataset.backgroundColor = getColorWithAlpha(baseColor, 0.1); } if (dataset.borderWidth === undefined) { dataset.borderWidth = 3; } if (dataset.pointRadius === undefined) { dataset.pointRadius = 6; } if (!dataset.pointBackgroundColor) { dataset.pointBackgroundColor = baseColor; } if (dataset.tension === undefined) { dataset.tension = 0.3; } break; case 'scatter': case 'bubble': if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; } if (!dataset.borderColor) { dataset.borderColor = baseColor; } if (dataset.pointRadius === undefined) { dataset.pointRadius = 15; } if (dataset.pointHoverRadius === undefined) { dataset.pointHoverRadius = 18; } break; case 'pie': case 'doughnut': case 'polarArea': // For pie charts, we need multiple colors for one dataset if (!dataset.backgroundColor) { const numItems = dataset.data ? 
dataset.data.length : 6; dataset.backgroundColor = []; for (let i = 0; i < numItems; i++) { dataset.backgroundColor.push(getColor(i)); } } if (!dataset.borderColor) { dataset.borderColor = '#d8d8d8'; // Slide background } if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; } break; case 'radar': if (!dataset.borderColor) { dataset.borderColor = baseColor; } if (!dataset.backgroundColor) { dataset.backgroundColor = getColorWithAlpha(baseColor, 0.2); } if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; } if (dataset.pointRadius === undefined) { dataset.pointRadius = 4; } if (!dataset.pointBackgroundColor) { dataset.pointBackgroundColor = baseColor; } break; default: // Generic color assignment if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; } if (!dataset.borderColor) { dataset.borderColor = baseColor; } } }); return data; } // ========================================================================== // CHART.JS GLOBAL DEFAULTS // ========================================================================== function applyGlobalDefaults() { if (typeof Chart === 'undefined') { console.warn('Chart.js not loaded. 
chart-defaults.js requires Chart.js to be loaded first.'); return false; } // Ensure palette is loaded from CSS const palette = ensurePalette(); // Font defaults Chart.defaults.font.family = palette.fontFamily; Chart.defaults.font.size = FONT_CONFIG.sizes.axisTicks; Chart.defaults.color = palette.textPrimary; // Responsive defaults Chart.defaults.responsive = true; Chart.defaults.maintainAspectRatio = false; // Animation (subtle) Chart.defaults.animation.duration = 400; // Plugin defaults // Legend Chart.defaults.plugins.legend.labels.font = { family: palette.fontFamily, size: FONT_CONFIG.sizes.legend, weight: FONT_CONFIG.weight.normal, }; Chart.defaults.plugins.legend.labels.color = palette.textPrimary; Chart.defaults.plugins.legend.labels.usePointStyle = true; Chart.defaults.plugins.legend.labels.padding = 20; // Title Chart.defaults.plugins.title.font = { family: palette.fontFamily, size: FONT_CONFIG.sizes.title, weight: FONT_CONFIG.weight.medium, }; Chart.defaults.plugins.title.color = palette.textPrimary; // Tooltip Chart.defaults.plugins.tooltip.backgroundColor = palette.textPrimary; Chart.defaults.plugins.tooltip.titleFont = { family: palette.fontFamily, size: FONT_CONFIG.sizes.tooltip, weight: FONT_CONFIG.weight.medium, }; Chart.defaults.plugins.tooltip.bodyFont = { family: palette.fontFamily, size: FONT_CONFIG.sizes.tooltip, }; Chart.defaults.plugins.tooltip.cornerRadius = 4; Chart.defaults.plugins.tooltip.padding = 10; // Scale defaults (for cartesian charts) // These need to be applied per-scale type const scaleDefaults = { grid: { color: palette.gridLight, lineWidth: 1, }, border: { color: palette.gridDark, width: 1, }, ticks: { font: { family: palette.fontFamily, size: FONT_CONFIG.sizes.axisTicks, }, color: palette.textPrimary, }, title: { font: { family: palette.fontFamily, size: FONT_CONFIG.sizes.axisTitle, weight: FONT_CONFIG.weight.normal, }, color: palette.textPrimary, }, }; // Apply scale defaults to linear scale if (Chart.defaults.scales && 
Chart.defaults.scales.linear) { if (Chart.defaults.scales.linear.grid) Object.assign(Chart.defaults.scales.linear.grid, scaleDefaults.grid); if (Chart.defaults.scales.linear.border) Object.assign(Chart.defaults.scales.linear.border, scaleDefaults.border); if (Chart.defaults.scales.linear.ticks) Object.assign(Chart.defaults.scales.linear.ticks, scaleDefaults.ticks); if (Chart.defaults.scales.linear.title) Object.assign(Chart.defaults.scales.linear.title, scaleDefaults.title); } // Apply scale defaults to category scale if (Chart.defaults.scales && Chart.defaults.scales.category) { if (Chart.defaults.scales.category.grid) Object.assign(Chart.defaults.scales.category.grid, scaleDefaults.grid); if (Chart.defaults.scales.category.border) Object.assign(Chart.defaults.scales.category.border, scaleDefaults.border); if (Chart.defaults.scales.category.ticks) Object.assign(Chart.defaults.scales.category.ticks, scaleDefaults.ticks); if (Chart.defaults.scales.category.title) Object.assign(Chart.defaults.scales.category.title, scaleDefaults.title); } // Apply scale defaults to logarithmic scale if (Chart.defaults.scales && Chart.defaults.scales.logarithmic) { if (Chart.defaults.scales.logarithmic.grid) Object.assign(Chart.defaults.scales.logarithmic.grid, scaleDefaults.grid); if (Chart.defaults.scales.logarithmic.border) Object.assign(Chart.defaults.scales.logarithmic.border, scaleDefaults.border); if (Chart.defaults.scales.logarithmic.ticks) Object.assign(Chart.defaults.scales.logarithmic.ticks, scaleDefaults.ticks); if (Chart.defaults.scales.logarithmic.title) Object.assign(Chart.defaults.scales.logarithmic.title, scaleDefaults.title); } // Apply scale defaults to radial scale (for radar charts) if (Chart.defaults.scales && Chart.defaults.scales.radialLinear) { if (Chart.defaults.scales.radialLinear.grid) Chart.defaults.scales.radialLinear.grid.color = palette.gridLight; if (Chart.defaults.scales.radialLinear.angleLines) Chart.defaults.scales.radialLinear.angleLines.color = 
palette.gridMedium; if (Chart.defaults.scales.radialLinear.pointLabels) { Chart.defaults.scales.radialLinear.pointLabels.font = { family: palette.fontFamily, size: FONT_CONFIG.sizes.axisTicks, }; Chart.defaults.scales.radialLinear.pointLabels.color = palette.textPrimary; } } return true; } // ========================================================================== // CHART WRAPPER FOR AUTO-STYLING // ========================================================================== /** * Wrap the Chart constructor to automatically apply CDL styling */ function wrapChartConstructor() { if (typeof Chart === 'undefined') return; const OriginalChart = Chart; // Create a wrapper that auto-applies colors window.Chart = function(ctx, config) { // Auto-assign colors if not specified if (config && config.data) { config.data = autoAssignColors(config.data, config.type); } // Merge default options for specific chart types if (config && config.options) { config.options = applyChartTypeDefaults(config.type, config.options); } // Call original constructor return new OriginalChart(ctx, config); }; // Copy static properties and methods Object.setPrototypeOf(window.Chart, OriginalChart); Object.assign(window.Chart, OriginalChart); // Preserve the prototype chain window.Chart.prototype = OriginalChart.prototype; } /** * Apply chart-type specific defaults */ function applyChartTypeDefaults(chartType, userOptions) { const options = { ...userOptions }; switch (chartType) { case 'bar': case 'horizontalBar': // Bar chart defaults if (!options.scales) options.scales = {}; if (!options.scales.x) options.scales.x = {}; if (!options.scales.y) options.scales.y = {}; // Hide x-axis grid for cleaner look if (options.scales.x.grid === undefined) { options.scales.x.grid = { display: false }; } break; case 'line': // Line chart defaults if (!options.interaction) { options.interaction = { intersect: false, mode: 'index' }; } break; case 'pie': case 'doughnut': // Pie/doughnut defaults if 
(!options.plugins) options.plugins = {}; if (options.plugins.legend === undefined) { const palette = ensurePalette(); options.plugins.legend = { position: 'right', labels: { font: { family: palette.fontFamily, size: FONT_CONFIG.sizes.legend, }, color: palette.textPrimary, padding: 15, }, }; } break; case 'radar': // Radar chart defaults - keep as-is, scale defaults applied globally break; case 'scatter': case 'bubble': // Scatter/bubble defaults if (!options.scales) options.scales = {}; if (!options.scales.x) options.scales.x = {}; if (!options.scales.y) options.scales.y = {}; break; } return options; } // ========================================================================== // CONVENIENCE FUNCTIONS FOR USERS // Exposed on window.CDLChart for easy access // ========================================================================== window.CDLChart = { // Color palette access (getters to ensure lazy initialization) get colors() { return ensurePalette().chartColors; }, get palette() { return ensurePalette(); }, // Get specific color by index getColor: getColor, // Get color with transparency getColorWithAlpha: getColorWithAlpha, // Get array of colors for a specific count getColors: function(count) { ensurePalette(); const result = []; for (let i = 0; i < count; i++) { result.push(getColor(i)); } return result; }, // Font configuration fonts: FONT_CONFIG, // Quick chart creation helpers // These create minimal config that auto-applies all styling /** * Create a simple bar chart * @param {string} canvasId - Canvas element ID * @param {string[]} labels - X-axis labels * @param {number[]} data - Data values * @param {object} options - Optional overrides */ bar: function(canvasId, labels, data, options = {}) { return new Chart(document.getElementById(canvasId), { type: 'bar', data: { labels: labels, datasets: [{ data: data }], }, options: { plugins: { legend: { display: false } }, ...options, }, }); }, /** * Create a simple line chart * @param {string} canvasId - 
Canvas element ID * @param {string[]} labels - X-axis labels * @param {Array} datasets - Array of {label, data} objects * @param {object} options - Optional overrides */ line: function(canvasId, labels, datasets, options = {}) { return new Chart(document.getElementById(canvasId), { type: 'line', data: { labels: labels, datasets: datasets.map(ds => ({ label: ds.label, data: ds.data, fill: ds.fill !== undefined ? ds.fill : true, })), }, options: options, }); }, /** * Create a simple pie chart * @param {string} canvasId - Canvas element ID * @param {string[]} labels - Slice labels * @param {number[]} data - Data values * @param {object} options - Optional overrides */ pie: function(canvasId, labels, data, options = {}) { return new Chart(document.getElementById(canvasId), { type: 'pie', data: { labels: labels, datasets: [{ data: data }], }, options: options, }); }, /** * Create a simple scatter chart * @param {string} canvasId - Canvas element ID * @param {Array} datasets - Array of {label, data: [{x, y}]} objects * @param {object} options - Optional overrides */ scatter: function(canvasId, datasets, options = {}) { return new Chart(document.getElementById(canvasId), { type: 'scatter', data: { datasets: datasets.map(ds => ({ label: ds.label, data: ds.data, })), }, options: options, }); }, /** * Create a doughnut chart * @param {string} canvasId - Canvas element ID * @param {string[]} labels - Slice labels * @param {number[]} data - Data values * @param {object} options - Optional overrides */ doughnut: function(canvasId, labels, data, options = {}) { return new Chart(document.getElementById(canvasId), { type: 'doughnut', data: { labels: labels, datasets: [{ data: data }], }, options: options, }); }, /** * Create a radar chart * @param {string} canvasId - Canvas element ID * @param {string[]} labels - Axis labels * @param {Array} datasets - Array of {label, data} objects * @param {object} options - Optional overrides */ radar: function(canvasId, labels, datasets, 
options = {}) { return new Chart(document.getElementById(canvasId), { type: 'radar', data: { labels: labels, datasets: datasets.map(ds => ({ label: ds.label, data: ds.data, })), }, options: options, }); }, }; // ========================================================================== // INITIALIZATION // ========================================================================== function initialize() { // Wait for Chart.js to be available if (typeof Chart !== 'undefined') { applyGlobalDefaults(); wrapChartConstructor(); console.log('CDL Chart defaults applied successfully.'); return true; } else { // Chart.js not yet loaded - wait and retry let retries = 0; const maxRetries = 50; // 5 seconds max wait const checkInterval = setInterval(function() { retries++; if (typeof Chart !== 'undefined') { clearInterval(checkInterval); applyGlobalDefaults(); wrapChartConstructor(); console.log('CDL Chart defaults applied successfully (after waiting for Chart.js).'); } else if (retries >= maxRetries) { clearInterval(checkInterval); console.warn('Chart.js not found after waiting. CDL Chart defaults not applied.'); } }, 100); return false; } } // Initialize IMMEDIATELY - this must run BEFORE any chart creation scripts // Chart.js CDN should be loaded before this script initialize(); })();

Lecture 22: Image diffusion and multimodal generation

PSYC 51.17: Models of language and communication

Jeremy R. Manning
Dartmouth College
Winter 2026

Learning objectives

  1. Explain how latent diffusion compresses computation via a VAE to enable high-resolution image generation
  2. Describe how CLIP creates a shared embedding space for text and images
  3. Explain how cross-attention and classifier-free guidance enable text-controlled image generation
  4. Compare the U-Net and Transformer (DiT) backbones for diffusion models
  5. Generate images from text prompts using the HuggingFace diffusers library

From text to images

In Lecture 21, we saw how diffusion models generate text by iteratively unmasking tokens (discrete diffusion). Today we turn to the domain where diffusion was born — images — and ask: how do we generate high-resolution images from text prompts?

Challenge Solution (this lecture)
Images are huge (786K+ values) Latent diffusion — compress first, diffuse in latent space
Text must control generation CLIP + cross-attention — bind words to spatial regions
Output must match the prompt Classifier-free guidance — amplify text alignment
Architecture must scale U-Net or DiT — backbones for the denoising network

The pixel problem

The DDPM framework (Lecture 21) can be applied to images by adding Gaussian noise to pixels and learning to reverse the process. But for a 512×512 RGB image, the denoising network must process $512 \times 512 \times 3 = 786{,}432$ values at every step — far too expensive for high-resolution generation.

The U-Net (Ronneberger et al., 2015) is an encoder-decoder neural network with skip connections, originally designed for image segmentation. It became the default backbone for diffusion models — we'll cover it in detail later this lecture.

Resolution Pixels Memory Time per image
64 × 64 12,288 ~2 GB ~30 seconds
256 × 256 196,608 ~8 GB ~5 minutes
512 × 512 786,432 ~32 GB ~20 minutes

The solution: don't run diffusion in pixel space. Run it in a compressed latent space.

Latent diffusion

Latent diffusion separates image generation into two stages:

  1. Compression: A pretrained variational autoencoder (VAE) encodes images into a compact latent space (typically 8× spatial compression)
  2. Generation: The diffusion process operates entirely in the latent space before decoding back to pixels

The VAE is described on the next slide.

Latent diffusion pipeline

Rombach et al. (2022, CVPR) "High-resolution image synthesis with latent diffusion models" — The paper that introduced latent diffusion and enabled Stable Diffusion.

Variational autoencoder (VAE)

A VAE learns to compress images into a low-dimensional latent representation and reconstruct them:

  1. Encoder: Maps image $\mathbf{x} \in \mathbb{R}^{512 \times 512 \times 3}$ to mean $\boldsymbol{\mu}$ and standard deviation $\boldsymbol{\sigma}$ of a latent distribution
  • $\boldsymbol{\mu}$ represents the "average" latent representation for the image
  • $\boldsymbol{\sigma}$ captures uncertainty — how much the latent can vary while still reconstructing the image
  2. Sampling: Draw $\mathbf{z} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\sigma}^2)$ — a latent vector in $\mathbb{R}^{64 \times 64 \times 4}$
  3. Decoder: Reconstructs the image from $\mathbf{z}$

VAE architecture

The VAE learns to preserve perceptually important information while discarding redundant detail. Diffusion in latent space is ~50× cheaper than in pixel space, with negligible quality loss. This single insight enabled Stable Diffusion — the first open-source, consumer-GPU image generator.

How convolutions compress images

A convolution slides a small filter (kernel) across an image, computing a weighted sum at each position. With stride 2, the filter moves 2 pixels at a time, halving the spatial dimensions:

  • 8×8 input → 3×3 conv, stride 2 → 4×4 output (4× fewer values)
  • 4×4 → 3×3 conv, stride 2 → 2×2 (4× fewer again)

Stacking convolution layers creates a hierarchy: pixels → edges → textures → objects.

Convolution downsampling

In practice, kernels are learned, not fixed. Early layers detect edges and gradients; deeper layers compose these into textures and shapes. The averaging kernel above is simplified — real networks learn what to compress. This hierarchy (pixels → edges → textures → objects) is exactly what the VAE encoder learns.

The VAE bottleneck

Property Value
Input resolution 512 × 512 × 3
Latent resolution 64 × 64 × 4
Compression ratio 48× (786K → 16K values)
Reconstruction quality Near-lossless for natural images

The VAE is trained once and frozen. The diffusion model only ever sees latent representations — it never touches pixels during training or sampling.

  • Reconstruction loss: Ensure decoded image matches the original
  • KL divergence: Regularize the latent space so it's smooth and continuous
  • Perceptual loss: Compare features extracted by a pretrained network (e.g., VGG), not raw pixels — so the model prioritizes visual similarity over pixel-exact matching
  • Adversarial loss: A discriminator network tries to distinguish real from reconstructed images, pushing the decoder toward photorealistic outputs

The Kullback-Leibler (KL) divergence measures how one probability distribution $Q$ diverges from a reference distribution $P$:

$$D_{KL}(Q \| P) = \int Q(z) \log \frac{Q(z)}{P(z)} \, dz$$

Intuitively, it quantifies how much information is lost when $Q$ is used to approximate $P$. In VAEs, we want the learned latent distribution to be close to a simple prior (e.g., standard normal) to ensure smoothness and generalization.

CLIP: connecting text and images

CLIP learns a shared embedding space for text and images. It was trained on 400 million image-text pairs from the internet:

  1. Image encoder (Vision Transformer): Maps an image to a vector
  2. Text encoder (Transformer): Maps a caption to a vector
  3. Contrastive training: Matching pairs are pulled close; non-matching pairs are pushed apart

CLIP provides the "language understanding" for text-to-image systems. When you type "a cat wearing a hat," CLIP's text encoder converts it into an embedding that captures the visual meaning — not just the linguistic meaning. This embedding then guides diffusion via cross-attention.

Radford et al. (2021, ICML) "Learning transferable visual models from natural language supervision" — Connection to Lecture 11: CLIP extends the idea of learned embeddings (Word2Vec, GloVe) to a joint text-image space.

Text conditioning with cross-attention

To generate images from text, the denoising backbone adds cross-attention layers (recall attention from Lecture 15). At each layer:

  1. The text prompt is encoded by CLIP into a sequence of token embeddings
  2. The image features at that layer serve as queries (Q)
  3. The text embeddings serve as keys (K) and values (V)
  4. Cross-attention allows each spatial location in the image to attend to relevant words

Cross-attention mechanism

The word "cat" activates high attention weights in the spatial region where the cat is generated. The word "hat" activates weights near the top of the cat region. This spatial-linguistic binding is learned entirely from image-caption pairs.

Classifier-free guidance

Classifier-free guidance (CFG) improves alignment between text and generated images. During training, the text condition is randomly dropped some fraction of the time. At inference:

$$\tilde{\boldsymbol{\epsilon}} = \underbrace{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)}_{\text{unconditional}} + w \cdot \Big(\underbrace{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)}_{\text{conditional}} - \underbrace{\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)}_{\text{unconditional}}\Big)$$

  • $\boldsymbol{\epsilon}_\theta$ — the denoising network (U-Net or DiT), parameterized by $\theta$
  • $\mathbf{x}_t$ — the noisy latent at timestep $t$
  • $c$ — the text condition (CLIP embedding of the prompt)
  • $\varnothing$ — no text (empty prompt, i.e., unconditional)
  • Unconditional $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, \varnothing)$ — what the model predicts without any text guidance
  • Conditional $\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t, c)$ — what it predicts guided by the prompt
  • $w$ — the guidance scale (typically 7–15): how much to amplify the text signal

Think of CFG as asking: "What's different about images that match this prompt versus random images?" Then amplifying that difference. The model learns both what a "dog on a beach" looks like and what a random image looks like — guidance amplifies the gap. See the next slide for examples.

Ho & Salimans (2022) "Classifier-free diffusion guidance" — Used in virtually all modern text-to-image systems.

The guidance scale tradeoff

Classifier-free guidance spectrum

Nearly every text-to-image system uses CFG. Stable Diffusion defaults to $w = 7.5$; DALL-E 2 uses $w \approx 4$. The range of 7–10 balances prompt fidelity against visual quality — pushing higher creates oversaturated, artifact-prone images.

Try it: varying the guidance scale


1from diffusers import StableDiffusionPipeline
2import torch
3
4pipe = StableDiffusionPipeline.from_pretrained(
5    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
6).to("cuda")
7
8for scale in [1.0, 7.5, 20.0]:
9    img = pipe("a cat in a hat", guidance_scale=scale).images[0]
10    img.save(f"guidance_{scale}.png")
  • scale = 1.0: Effectively no guidance — diverse but often off-prompt
  • scale = 7.5: Default sweet spot — faithful to prompt with natural variation
  • scale = 20.0: Over-guided — saturated colors, sharp artifacts, but very literal

U-Net: the original diffusion backbone

The U-Net is the default backbone for diffusion models from DDPM through Stable Diffusion 1–2. Its U-shaped design provides multi-scale reasoning:

  1. Encoder (downsampling): Reduces spatial resolution while increasing channels — captures global context
  2. Bottleneck: Lowest resolution, largest receptive field — reasons about overall composition
  3. Decoder (upsampling): Restores spatial resolution — generates fine details
  4. Skip connections: Concatenate encoder features directly to the matching decoder layer — without these, the decoder must reconstruct spatial detail from the bottleneck alone, losing fine texture and edges

U-Net architecture

Ronneberger et al. (2015, MICCAI) "U-Net: convolutional networks for biomedical image segmentation" — Originally for medical imaging, now the workhorse of diffusion.

Diffusion Transformer (DiT)

The Diffusion Transformer (DiT) replaces the U-Net with a standard Vision Transformer:

  1. Patchify: Divide the noisy latent into non-overlapping patches (e.g., 2×2)
  2. Flatten: Treat patches as a sequence of tokens (just like ViT treats image patches — Lecture 15)
  3. Process: Apply standard Transformer blocks with self-attention
  4. Unpatchify: Reshape back to spatial dimensions to get the predicted noise

DiT pipeline

  • Vision Transformer (ViT): A Transformer that treats an image as a sequence of patches (like tokens in text) — no convolutions needed
  • Noisy latent: The VAE-compressed image with Gaussian noise added at timestep $t$ — this is the input the denoising network must "clean up"
  • Predicted noise: The network's estimate of what noise was added — subtract it from the noisy latent to get a cleaner image

DiT: why replace U-Net?

DiT-XL/2 (675M parameters) achieves state-of-the-art FID of 2.27 on ImageNet, beating all previous diffusion models. More importantly, DiT shows clean scaling behavior — larger models consistently produce better results, with no architectural bottlenecks.

  • Fréchet Inception Distance (FID) (Heusel et al., 2017) measures the distance between real and generated image distributions using features from a pretrained Inception network. Lower FID = more realistic.
  • Inductive biases are assumptions built into the architecture — e.g., convolutions assume local spatial structure; Transformers make fewer such assumptions and let the model learn structure from data.

U-Nets have strong inductive biases for spatial data (locality, hierarchy). These help with small models but become constraints at scale. Transformers make fewer assumptions — the same lesson we saw with language models (Lectures 15–16).

Peebles & Xie (2023, ICCV) "Scalable diffusion models with Transformers" — DiT showed that Transformers can replace U-Nets in diffusion, and the result scales better.

Try it: generate an image


1from diffusers import StableDiffusionPipeline
2import torch
3
4pipe = StableDiffusionPipeline.from_pretrained(
5    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
6).to("cuda")
7
8image = pipe("a photo of an astronaut riding a horse on mars").images[0]
9image.save("astronaut.png")
  1. CLIP encodes your text prompt into embeddings
  2. The U-Net iteratively denoises a random latent, guided by those embeddings via cross-attention
  3. CFG amplifies the text signal at each step (default guidance_scale=7.5)
  4. The VAE decoder converts the final latent back to a 512×512 pixel image

Every concept from this lecture is working together in those 5 lines of code.

Take-home messages

  • The key bottleneck in high-resolution generation wasn't the diffusion process itself — it was where you run it. Compressing to latent space via a VAE made consumer-GPU generation possible.
  • Text-to-image generation requires a bridge between modalities: CLIP creates a shared embedding space (extending the idea of word embeddings from Lecture 11), and cross-attention lets the diffusion model "listen" to text at every spatial location (using the same mechanism as encoder-decoder attention from Lecture 15).
  • Classifier-free guidance shows that controlling generation is as important as generation itself — the trick is surprisingly simple: learn what conditional and unconditional outputs look like, then amplify the difference.
  • The field evolves by composing innovations (VAE + CLIP + CFG + U-Net/DiT), not replacing them. Each addresses one specific limitation.

Questions?

📧 Email
💬 Discord

More diffusion applications (text-to-video, text-to-audio) and ethics of multimodal generative models