/**
 * CDL Chart.js theme defaults.
 *
 * Usage:
 * 1. Load Chart.js (e.g. from a CDN) before this script.
 * 2. Include this script.
 * 3. Create charts with minimal configuration - colors are auto-applied!
 */
(function () {
  'use strict';

  // ==========================================================================
  // READ COLORS FROM CSS CUSTOM PROPERTIES
  // This ensures chart colors stay in sync with the theme
  // ==========================================================================

  /**
   * Get a CSS custom property value from :root.
   *
   * @param {string} name - Property name, e.g. '--chart-color-1'.
   * @param {string} [fallback=''] - Returned when the property is unset/empty
   *   or when no styling environment exists (non-browser context).
   * @returns {string} The trimmed property value, or the fallback.
   */
  function getCSSVar(name, fallback = '') {
    // Guard both globals so the script is inert outside a browser.
    if (typeof getComputedStyle === 'undefined' || typeof document === 'undefined') {
      return fallback;
    }
    const value = getComputedStyle(document.documentElement).getPropertyValue(name).trim();
    return value || fallback;
  }

  /**
   * Build the full palette from CSS custom properties, with hard-coded
   * fallbacks so charts still render when the theme CSS is missing.
   * @returns {object} Palette: brand colors, chart color arrays, semantic
   *   colors, grid/axis colors, and the chart font family.
   */
  function buildPaletteFromCSS() {
    return {
      // Primary brand colors
      dartmouthGreen: getCSSVar('--dartmouth-green', '#00693e'),
      textPrimary: getCSSVar('--text-primary', '#0a2518'),
      textSecondary: getCSSVar('--text-secondary', '#0a3d23'),

      // Chart colors (from CSS --chart-color-N variables)
      chartColors: [
        getCSSVar('--chart-color-1', '#00693e'),
        getCSSVar('--chart-color-2', '#267aba'),
        getCSSVar('--chart-color-3', '#ffa00f'),
        getCSSVar('--chart-color-4', '#9d162e'),
        getCSSVar('--chart-color-5', '#8a6996'),
        getCSSVar('--chart-color-6', '#a5d75f'),
        getCSSVar('--chart-color-7', '#003c73'),
        getCSSVar('--chart-color-8', '#d94415'),
        getCSSVar('--chart-color-9', '#643c20'),
        getCSSVar('--chart-color-10', '#c4dd88'),
        getCSSVar('--chart-color-11', '#f5dc69'),
        getCSSVar('--chart-color-12', '#424141'),
      ],

      // Background colors (semi-transparent versions)
      chartBgColors: [
        getCSSVar('--chart-bg-1', 'rgba(0, 105, 62, 0.5)'),
        getCSSVar('--chart-bg-2', 'rgba(38, 122, 186, 0.5)'),
        getCSSVar('--chart-bg-3', 'rgba(255, 160, 15, 0.5)'),
        getCSSVar('--chart-bg-4', 'rgba(157, 22, 46, 0.5)'),
        getCSSVar('--chart-bg-5', 'rgba(138, 105, 150, 0.5)'),
        getCSSVar('--chart-bg-6', 'rgba(165, 215, 95, 0.5)'),
      ],

      // Semantic colors
      positive: getCSSVar('--chart-positive', '#00693e'),
      negative: getCSSVar('--chart-negative', '#9d162e'),
      neutral: getCSSVar('--chart-neutral', '#424141'),
      highlight: getCSSVar('--chart-highlight', '#ffa00f'),

      // Grid and axis colors
      gridLight: getCSSVar('--chart-grid-light', 'rgba(0, 105, 62, 0.1)'),
      gridMedium: getCSSVar('--chart-grid-medium', 'rgba(0, 105, 62, 0.15)'),
      gridDark: getCSSVar('--chart-grid-dark', 'rgba(0, 105, 62, 0.2)'),
      axisColor: getCSSVar('--chart-axis-color', '#0a2518'),

      // Font
      fontFamily: getCSSVar('--chart-font-family', "'Avenir LT Std', 'Avenir', 'Avenir Next', -apple-system, BlinkMacSystemFont, sans-serif"),
    };
  }

  // Initialize palette (will be populated when DOM is ready)
  let CDL_PALETTE = null;
  // For convenience, expose primary chart colors array
  let CHART_COLORS = null;

  // ==========================================================================
  // FONT CONFIGURATION
  // Responsive font sizes based on typical Marp slide dimensions (1280x720)
  // ==========================================================================

  const FONT_CONFIG = {
    sizes: {
      title: 22,      // Chart title
      subtitle: 18,   // Subtitle
      legend: 16,     // Legend labels
      axisTitle: 18,  // Axis titles
      axisTicks: 16,  // Axis tick labels
      tooltip: 14,    // Tooltip text
      dataLabels: 14, // Data labels on charts
    },
    weight: {
      normal: 400,
      medium: 500,
      bold: 600,
    },
  };

  // ==========================================================================
  // HELPER FUNCTIONS
  // ==========================================================================

  /**
   * Ensure the palette has been read from CSS (lazy, one-time).
   * @returns {object} The initialized palette.
   */
  function ensurePalette() {
    if (!CDL_PALETTE) {
      CDL_PALETTE = buildPaletteFromCSS();
      CHART_COLORS = CDL_PALETTE.chartColors;
    }
    return CDL_PALETTE;
  }

  /**
   * Get the color for a dataset at a given index.
   * Cycles through the palette if there are more datasets than colors.
   * @param {number} index - Dataset index (non-negative integer).
   * @returns {string} A CSS color string.
   */
  function getColor(index) {
    ensurePalette();
    return CHART_COLORS[index % CHART_COLORS.length];
  }

  /**
   * Return `color` with the given alpha transparency applied.
   * Supports #rrggbb and shorthand #rgb hex, rgba(...), and rgb(...);
   * any other format is returned unchanged.
   * @param {string} color - Input color.
   * @param {number} alpha - Alpha in [0, 1].
   * @returns {string} An rgba(...) string (or the input, if unrecognized).
   */
  function getColorWithAlpha(color, alpha) {
    // Handle hex colors
    if (color.startsWith('#')) {
      let hex = color.slice(1);
      // Expand shorthand #rgb to #rrggbb so the channel slices are valid.
      if (hex.length === 3) {
        hex = hex.split('').map((c) => c + c).join('');
      }
      const r = parseInt(hex.slice(0, 2), 16);
      const g = parseInt(hex.slice(2, 4), 16);
      const b = parseInt(hex.slice(4, 6), 16);
      return `rgba(${r}, ${g}, ${b}, ${alpha})`;
    }
    // Handle rgba colors: swap out the trailing alpha component.
    if (color.startsWith('rgba')) {
      return color.replace(/[\d.]+\)$/, `${alpha})`);
    }
    // Handle rgb colors: convert to rgba. (Previously the rgba branch's
    // regex would have clobbered the blue channel for rgb() inputs.)
    if (color.startsWith('rgb(')) {
      return color.replace(/^rgb\(/, 'rgba(').replace(/\)$/, `, ${alpha})`);
    }
    return color;
  }

  /**
   * Auto-assign palette colors (and sensible per-type visual defaults) to
   * every dataset in `data` that doesn't already specify its own.
   * Mutates and returns `data`.
   * @param {object} data - Chart.js data object ({ datasets: [...] }).
   * @param {string} chartType - Chart.js chart type ('bar', 'line', ...).
   * @returns {object} The same `data` object.
   */
  function autoAssignColors(data, chartType) {
    if (!data || !data.datasets) return data;

    data.datasets.forEach((dataset, index) => {
      const baseColor = getColor(index);

      // Only assign colors/defaults if not already specified by the caller.
      switch (chartType) {
        case 'bar':
        case 'horizontalBar':
          if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; }
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; }
          break;

        case 'line':
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (!dataset.backgroundColor) {
            dataset.backgroundColor = getColorWithAlpha(baseColor, 0.1);
          }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 3; }
          if (dataset.pointRadius === undefined) { dataset.pointRadius = 6; }
          if (!dataset.pointBackgroundColor) { dataset.pointBackgroundColor = baseColor; }
          if (dataset.tension === undefined) { dataset.tension = 0.3; }
          break;

        case 'scatter':
        case 'bubble':
          if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; }
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (dataset.pointRadius === undefined) { dataset.pointRadius = 15; }
          if (dataset.pointHoverRadius === undefined) { dataset.pointHoverRadius = 18; }
          break;

        case 'pie':
        case 'doughnut':
        case 'polarArea':
          // For pie charts, we need multiple colors for one dataset
          if (!dataset.backgroundColor) {
            const numItems = dataset.data ? dataset.data.length : 6;
            dataset.backgroundColor = [];
            for (let i = 0; i < numItems; i++) {
              dataset.backgroundColor.push(getColor(i));
            }
          }
          if (!dataset.borderColor) {
            dataset.borderColor = '#d8d8d8'; // Slide background
          }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; }
          break;

        case 'radar':
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (!dataset.backgroundColor) {
            dataset.backgroundColor = getColorWithAlpha(baseColor, 0.2);
          }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; }
          if (dataset.pointRadius === undefined) { dataset.pointRadius = 4; }
          if (!dataset.pointBackgroundColor) { dataset.pointBackgroundColor = baseColor; }
          break;

        default:
          // Generic color assignment
          if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; }
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
      }
    });

    return data;
  }

  // ==========================================================================
  // CHART.JS GLOBAL DEFAULTS
  // ==========================================================================

  /**
   * Push CDL fonts/colors into Chart.defaults (global, plugin, and per-scale).
   * @returns {boolean} true if defaults were applied; false if Chart.js is
   *   not loaded.
   */
  function applyGlobalDefaults() {
    if (typeof Chart === 'undefined') {
      console.warn('Chart.js not loaded. chart-defaults.js requires Chart.js to be loaded first.');
      return false;
    }

    // Ensure palette is loaded from CSS
    const palette = ensurePalette();

    // Font defaults
    Chart.defaults.font.family = palette.fontFamily;
    Chart.defaults.font.size = FONT_CONFIG.sizes.axisTicks;
    Chart.defaults.color = palette.textPrimary;

    // Responsive defaults
    Chart.defaults.responsive = true;
    Chart.defaults.maintainAspectRatio = false;

    // Animation (subtle)
    Chart.defaults.animation.duration = 400;

    // Plugin defaults
    // Legend
    Chart.defaults.plugins.legend.labels.font = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.legend,
      weight: FONT_CONFIG.weight.normal,
    };
    Chart.defaults.plugins.legend.labels.color = palette.textPrimary;
    Chart.defaults.plugins.legend.labels.usePointStyle = true;
    Chart.defaults.plugins.legend.labels.padding = 20;

    // Title
    Chart.defaults.plugins.title.font = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.title,
      weight: FONT_CONFIG.weight.medium,
    };
    Chart.defaults.plugins.title.color = palette.textPrimary;

    // Tooltip
    Chart.defaults.plugins.tooltip.backgroundColor = palette.textPrimary;
    Chart.defaults.plugins.tooltip.titleFont = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.tooltip,
      weight: FONT_CONFIG.weight.medium,
    };
    Chart.defaults.plugins.tooltip.bodyFont = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.tooltip,
    };
    Chart.defaults.plugins.tooltip.cornerRadius = 4;
    Chart.defaults.plugins.tooltip.padding = 10;

    // Scale defaults (for cartesian charts)
    // These need to be applied per-scale type
    const scaleDefaults = {
      grid: { color: palette.gridLight, lineWidth: 1 },
      border: { color: palette.gridDark, width: 1 },
      ticks: {
        font: { family: palette.fontFamily, size: FONT_CONFIG.sizes.axisTicks },
        color: palette.textPrimary,
      },
      title: {
        font: {
          family: palette.fontFamily,
          size: FONT_CONFIG.sizes.axisTitle,
          weight: FONT_CONFIG.weight.normal,
        },
        color: palette.textPrimary,
      },
    };

    // Merge scaleDefaults into one cartesian scale's default config.
    // (Replaces three identical copy-pasted stanzas in the original.)
    const applyToScale = (scale) => {
      if (!scale) return;
      if (scale.grid) Object.assign(scale.grid, scaleDefaults.grid);
      if (scale.border) Object.assign(scale.border, scaleDefaults.border);
      if (scale.ticks) Object.assign(scale.ticks, scaleDefaults.ticks);
      if (scale.title) Object.assign(scale.title, scaleDefaults.title);
    };

    if (Chart.defaults.scales) {
      applyToScale(Chart.defaults.scales.linear);
      applyToScale(Chart.defaults.scales.category);
      applyToScale(Chart.defaults.scales.logarithmic);
    }

    // Apply scale defaults to radial scale (for radar charts)
    if (Chart.defaults.scales && Chart.defaults.scales.radialLinear) {
      const radial = Chart.defaults.scales.radialLinear;
      if (radial.grid) radial.grid.color = palette.gridLight;
      if (radial.angleLines) radial.angleLines.color = palette.gridMedium;
      if (radial.pointLabels) {
        radial.pointLabels.font = {
          family: palette.fontFamily,
          size: FONT_CONFIG.sizes.axisTicks,
        };
        radial.pointLabels.color = palette.textPrimary;
      }
    }

    return true;
  }

  // ==========================================================================
  // CHART WRAPPER FOR AUTO-STYLING
  // ==========================================================================

  /**
   * Replace window.Chart with a wrapper that auto-applies CDL colors and
   * per-type option defaults before delegating to the real constructor.
   * Static members and `instanceof Chart` keep working via the prototype
   * chain set up below.
   */
  function wrapChartConstructor() {
    if (typeof Chart === 'undefined') return;

    const OriginalChart = Chart;

    // Create a wrapper that auto-applies colors
    window.Chart = function (ctx, config) {
      // Auto-assign colors if not specified
      if (config && config.data) {
        config.data = autoAssignColors(config.data, config.type);
      }
      // Merge default options for specific chart types
      if (config && config.options) {
        config.options = applyChartTypeDefaults(config.type, config.options);
      }
      // Call original constructor
      return new OriginalChart(ctx, config);
    };

    // Statics (Chart.register, Chart.defaults, ...) resolve through the
    // prototype chain; Object.assign additionally copies own enumerable ones.
    Object.setPrototypeOf(window.Chart, OriginalChart);
    Object.assign(window.Chart, OriginalChart);

    // Preserve the prototype chain so `instanceof Chart` still holds.
    window.Chart.prototype = OriginalChart.prototype;
  }

  /**
   * Apply chart-type specific option defaults. Nested option objects are
   * shallow-cloned before being edited so the caller's config object is
   * never mutated in place (the original mutated userOptions.scales etc.).
   * @param {string} chartType - Chart.js chart type.
   * @param {object} userOptions - Options supplied by the caller.
   * @returns {object} A new options object with defaults filled in.
   */
  function applyChartTypeDefaults(chartType, userOptions) {
    const options = { ...userOptions };

    switch (chartType) {
      case 'bar':
      case 'horizontalBar':
        // Bar chart defaults
        options.scales = { ...options.scales };
        options.scales.x = { ...options.scales.x };
        options.scales.y = { ...options.scales.y };
        // Hide x-axis grid for cleaner look
        if (options.scales.x.grid === undefined) {
          options.scales.x.grid = { display: false };
        }
        break;

      case 'line':
        // Line chart defaults
        if (!options.interaction) {
          options.interaction = { intersect: false, mode: 'index' };
        }
        break;

      case 'pie':
      case 'doughnut':
        // Pie/doughnut defaults
        options.plugins = { ...options.plugins };
        if (options.plugins.legend === undefined) {
          const palette = ensurePalette();
          options.plugins.legend = {
            position: 'right',
            labels: {
              font: { family: palette.fontFamily, size: FONT_CONFIG.sizes.legend },
              color: palette.textPrimary,
              padding: 15,
            },
          };
        }
        break;

      case 'radar':
        // Radar chart defaults - keep as-is, scale defaults applied globally
        break;

      case 'scatter':
      case 'bubble':
        // Scatter/bubble defaults
        options.scales = { ...options.scales };
        options.scales.x = { ...options.scales.x };
        options.scales.y = { ...options.scales.y };
        break;
    }

    return options;
  }

  // ==========================================================================
  // CONVENIENCE FUNCTIONS FOR USERS
  // Exposed on window.CDLChart for easy access
  // ==========================================================================

  window.CDLChart = {
    // Color palette access (getters to ensure lazy initialization)
    get colors() { return ensurePalette().chartColors; },
    get palette() { return ensurePalette(); },

    // Get specific color by index
    getColor: getColor,

    // Get color with transparency
    getColorWithAlpha: getColorWithAlpha,

    /**
     * Get an array of the first `count` palette colors (cycling if needed).
     * @param {number} count - Number of colors wanted.
     * @returns {string[]}
     */
    getColors: function (count) {
      ensurePalette();
      const result = [];
      for (let i = 0; i < count; i++) {
        result.push(getColor(i));
      }
      return result;
    },

    // Font configuration
    fonts: FONT_CONFIG,

    // Quick chart creation helpers
    // These create minimal config that auto-applies all styling

    /**
     * Create a simple bar chart
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - X-axis labels
     * @param {number[]} data - Data values
     * @param {object} options - Optional overrides
     */
    bar: function (canvasId, labels, data, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'bar',
        data: { labels: labels, datasets: [{ data: data }] },
        options: {
          plugins: { legend: { display: false } },
          ...options,
        },
      });
    },

    /**
     * Create a simple line chart
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - X-axis labels
     * @param {Array} datasets - Array of {label, data} objects
     * @param {object} options - Optional overrides
     */
    line: function (canvasId, labels, datasets, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'line',
        data: {
          labels: labels,
          datasets: datasets.map(ds => ({
            label: ds.label,
            data: ds.data,
            fill: ds.fill !== undefined ? ds.fill : true,
          })),
        },
        options: options,
      });
    },

    /**
     * Create a simple pie chart
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - Slice labels
     * @param {number[]} data - Data values
     * @param {object} options - Optional overrides
     */
    pie: function (canvasId, labels, data, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'pie',
        data: { labels: labels, datasets: [{ data: data }] },
        options: options,
      });
    },

    /**
     * Create a simple scatter chart
     * @param {string} canvasId - Canvas element ID
     * @param {Array} datasets - Array of {label, data: [{x, y}]} objects
     * @param {object} options - Optional overrides
     */
    scatter: function (canvasId, datasets, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'scatter',
        data: {
          datasets: datasets.map(ds => ({ label: ds.label, data: ds.data })),
        },
        options: options,
      });
    },

    /**
     * Create a doughnut chart
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - Slice labels
     * @param {number[]} data - Data values
     * @param {object} options - Optional overrides
     */
    doughnut: function (canvasId, labels, data, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'doughnut',
        data: { labels: labels, datasets: [{ data: data }] },
        options: options,
      });
    },

    /**
     * Create a radar chart
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - Axis labels
     * @param {Array} datasets - Array of {label, data} objects
     * @param {object} options - Optional overrides
     */
    radar: function (canvasId, labels, datasets, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'radar',
        data: {
          labels: labels,
          datasets: datasets.map(ds => ({ label: ds.label, data: ds.data })),
        },
        options: options,
      });
    },
  };

  // ==========================================================================
  // INITIALIZATION
  // ==========================================================================

  /**
   * Apply defaults now if Chart.js is present; otherwise poll for it
   * (every 100 ms, up to 5 s) and apply once it appears.
   * @returns {boolean} true if defaults were applied synchronously.
   */
  function initialize() {
    // Wait for Chart.js to be available
    if (typeof Chart !== 'undefined') {
      applyGlobalDefaults();
      wrapChartConstructor();
      console.log('CDL Chart defaults applied successfully.');
      return true;
    }

    // Chart.js not yet loaded - wait and retry
    let retries = 0;
    const maxRetries = 50; // 5 seconds max wait
    const checkInterval = setInterval(function () {
      retries++;
      if (typeof Chart !== 'undefined') {
        clearInterval(checkInterval);
        applyGlobalDefaults();
        wrapChartConstructor();
        console.log('CDL Chart defaults applied successfully (after waiting for Chart.js).');
      } else if (retries >= maxRetries) {
        clearInterval(checkInterval);
        console.warn('Chart.js not found after waiting. CDL Chart defaults not applied.');
      }
    }, 100);
    return false;
  }

  // Initialize IMMEDIATELY - this must run BEFORE any chart creation scripts
  // Chart.js CDN should be loaded before this script
  initialize();
})();

Lecture 24: The thinking revolution

PSYC 51.17: Models of language and communication

Jeremy R. Manning
Dartmouth College
Winter 2026

Learning objectives

  1. Explain chain-of-thought prompting — why intermediate reasoning steps improve LLM performance
  2. Describe how reasoning models are trained via reinforcement learning on verifiable rewards
  3. Explain the mechanics of test-time compute scaling — what happens inside a reasoning model at inference
  4. Analyze why thinking works: the relationship between token generation and computation
  5. Evaluate the evidence: is longer "thinking" genuine reasoning or sophisticated pattern completion?
  6. Explain Mixture of Experts — how sparse activation lets models store more knowledge while keeping inference costs manageable

Welcome back!

This is our final week of new material. Three (content) lectures remain:

  • Today: The thinking revolution — how and why reasoning models work
  • Wednesday: Agents, tools, and the agentic era
  • Friday: The reckoning — society, safety, and what comes next

Final Project presentations are on March 9 (next Monday, our final day of class). Deliverables:

  • A notebook containing your project's code, results, and analysis (should run in Google Colab)
  • A presentation (presented in class on March 9) summarizing what you did, what you found, and why it matters. You should also upload your slides to Canvas.
  • A final report (roughly 2–5 pages) describing your project in detail

I can be available during our X-hour (Thursday) to help with final projects, if there is interest in this. I'll treat it as an open Q&A session; I'll plan to stay, answer questions or help on a first-come-first-served basis, until there are no more questions, and then we'll wrap up.

A new paradigm: test-time compute

Every model we've studied so far (rules-based models, embeddings, autoregressive models, and diffusion models) improved primarily by making training bigger: more data, more parameters, more compute during training. In late 2024, a new paradigm emerged: test-time compute scaling — spending more compute at inference to improve results.

A smaller model that "thinks longer" can (sometimes) outperform a larger model that answers immediately. This decouples model quality from model size in a way that changes how we build and deploy AI.

Check out some interactive examples in the companion notebook.

Two scaling axes

Training compute Inference compute
When Before deployment (once) At query time (every call)
What improves Base knowledge, capabilities Reasoning depth on hard problems
Scaling law Kaplan et al. (2020) Snell et al. (2024)

Two scaling axes

  • Traditional scaling: improving models by increasing training data, parameters, and compute (Kaplan et al., 2020)
  • Test-time scaling: improving outputs by spending more compute at inference (Snell et al., 2024)
  • Scaling law: a power-law relationship between compute invested and model performance
  • "Both axes" (purple in chart): models like o3 that invest heavily in both training and inference compute

Chain-of-thought prompting

Chain-of-thought (CoT) prompting showed that LLMs perform dramatically better on reasoning tasks when prompted to generate intermediate steps before answering. This requires no model changes — just different prompting.

Chain-of-thought

Each generated token is a computation step. When a model writes "2 cans × 3 = 6," it's predicting "6" as the most likely next token based on patterns learned during training — and that prediction then gets stored in context, where subsequent tokens can build on it. More intermediate tokens = more serial computation = harder problems become solvable.

GSM8K (Cobbe et al., 2021) — Grade School Math 8K, a dataset of 8,500 grade-school math word problems requiring 2–8 reasoning steps using basic arithmetic. A standard benchmark for multi-step mathematical reasoning in LLMs.

Why do intermediate steps help?

  • Threshold circuit: a Boolean circuit whose gates are threshold functions — a gate outputs 1 if at least $k$ of its $n$ inputs are 1. These circuits can compute addition, multiplication, and sorting.
  • Constant-depth ($\mathsf{TC}^0$): a threshold circuit with a fixed number of layers, no matter the input size. It can do a lot in parallel but has limited sequential depth.
  • Why this matters: a transformer with $L$ layers is essentially a constant-depth threshold circuit — it always performs exactly $L$ sequential steps per token, regardless of problem difficulty.

A transformer with $L$ layers performs $L$ sequential computation steps per token. For a 100-layer model answering in one token, you get 100 steps of computation — regardless of problem difficulty.

But if the model generates $N$ intermediate tokens first, it gets $L \times N$ steps — the entire model runs once per token, and each token can attend to all previous tokens.

Merrill & Sabharwal (2024) proved that constant-depth transformers are limited to problems in the complexity class $\mathsf{TC}^0$ (constant-depth threshold circuits). But with a chain-of-thought of length $T$, a transformer can simulate $T$ steps of any Turing machine — making it Turing-complete.

In plain language: without CoT, transformers literally cannot solve certain problems no matter how large. With CoT, they can solve anything (given enough tokens).

From prompting to training: the evolution

Year Technique Key idea Reference
2022 Chain-of-thought prompting Prompt the model with reasoning examples Wei et al.
2023 Tree-of-thought, self-consistency Generate multiple paths, pick the best Yao et al.
2024 Reasoning models (o1) Train the model via RL to reason on its own OpenAI
2025 Adaptive thinking (Claude 4.x) Model decides when and how much to think Anthropic

CoT prompting (Wei et al., 2022) showed that intermediate reasoning helps. Reasoning models take this further: instead of prompting the model to reason, you train it via reinforcement learning on verifiable rewards (correct math, passing code tests). The model learns to generate its own internal reasoning traces — and these traces can be far more effective than human-written examples.

How reasoning models are trained

The key innovation: instead of supervised learning (human writes the reasoning), use reinforcement learning where the model discovers its own reasoning strategies:

  1. Sample: Model generates multiple complete solutions (reasoning + answer) for each problem
  2. Verify: Check which answers are correct (math: exact match; code: passes unit tests)
  3. Update: Reinforce reasoning patterns that led to correct answers; suppress those that didn't
  4. Repeat: Over millions of problems, the model learns which reasoning strategies work

This only works for verifiable tasks — problems where we can automatically check correctness. Math, coding, and formal logic have clear right/wrong answers. Open-ended writing, ethics, and creative tasks do not. This is a fundamental limitation of the RL approach to reasoning.

GRPO: how DeepSeek-R1 learns to reason

DeepSeek-R1 uses GRPO (Group Relative Policy Optimization) — a simpler alternative to PPO (Proximal Policy Optimization, the standard RL algorithm used for RLHF, which requires a separate "critic" model to estimate value). GRPO eliminates the critic by comparing solutions within each group:

  1. For each problem, generate a group of $G$ candidate solutions (e.g., $G = 64$)
  2. Score each: correct answer → reward $+1$, wrong → reward $0$
  3. Compute relative advantage: how much better/worse than the group average?
  4. Update the model to increase probability of above-average solutions

Problem: "What is 17 × 23?"
  Solution 1: "17 × 23 = 17 × 20 + 17 × 3 = 340 + 51 = 391" ✓ → reinforce
  Solution 2: "17 × 23 = 17 × 25 - 17 × 2 = 425 - 34 = 391" ✓ → reinforce
  Solution 3: "17 × 23 = 300 + 91 = 391"                    ✓ → reinforce (less)
  Solution 4: "17 × 23 = 381"                               ✗ → suppress

What emerges from RL training

DeepSeek-R1 was fine-tuned with RL only — no human-written reasoning examples at all. The reward signal was solely whether the final answer was correct. Yet the model spontaneously developed:

  • Self-verification: "Let me check this... wait, that's wrong"
  • Backtracking: "Actually, I should try a different approach"
  • Structured reasoning: Breaking problems into numbered sub-steps
  • Reflection: "This seems too easy, let me double-check"

These reasoning strategies were never demonstrated to the model. They emerged purely because they lead to more correct answers. The model "discovered" that doubting itself, re-examining its work, and trying alternative approaches is useful — through optimization pressure alone.

The American Invitational Mathematics Examination is a 15-question, 3-hour competition for top high school math students. Problems cover algebra, geometry, number theory, and combinatorics. Each answer is an integer 0–999. It is widely used as a benchmark for AI mathematical reasoning.

DeepSeek-R1's performance on the AIME jumped from 15.6% (base model) to 71.0% (after RL training).

The reasoning model pipeline

Reasoning model pipeline

  1. User sends a query — the model begins generating "thinking tokens"
  2. Internal reasoning — the model works through the problem step-by-step, potentially backtracking and self-correcting (this is where the extra compute goes)
  3. Final answer — a polished response generated after reasoning is complete
  4. RL reward loop (during training only) — correct answers reinforce the reasoning patterns that produced them

What thinking tokens look like in practice

When a reasoning model processes a query, it generates two types of content:

  1. Thinking tokens — internal reasoning (may be hidden or visible depending on the provider)
  2. Output tokens — the final answer shown to the user

[THINKING]  Assume √2 = p/q, coprime. Then 2q²=p²,
            so p even. Let p=2k → q also even.
            Contradiction! ✓
[OUTPUT]    Proof: √2 is irrational by contradiction...
OpenAI o-series Claude DeepSeek-R1
Thinking visibility Hidden (summarized) Visible via API Visible (open-weight)
Control reasoning_effort budget_tokens Token count
Thinking tags Not exposed thinking block in response <think>...</think> tags

The s1 experiment: reasoning is surprisingly simple

The s1 model showed that test-time scaling can be achieved with remarkably little effort:

  1. Fine-tune Qwen2.5-32B on just 1,000 curated examples — the "s1K" dataset, a curated set of 1,000 challenging math, science, and coding problems with detailed reasoning traces
  2. At inference, apply budget forcing: when the model tries to stop reasoning, append the token "Wait" to the model's own output, forcing it to continue thinking rather than giving a final answer
  3. Result: Exceeds o1-preview (a much larger model, trained using human feedback) on AIME 2024 (the 15-question high school math competition from a few slides ago) by up to 27%!

You don't need massive RL infrastructure to get reasoning capabilities. A small amount of high-quality reasoning data + a simple inference trick can unlock substantial test-time scaling. This suggests reasoning is latent in large pretrained models — it just needs to be activated.

Claude's extended thinking

Anthropic implemented reasoning as a toggle within a single model — same weights, different inference behavior. Unlike OpenAI, Claude's thinking tokens are visible to developers and can be controlled using their API.


from anthropic import Anthropic

client = Anthropic()  # Reads ANTHROPIC_API_KEY from environment
response = client.messages.create(
    model="claude-sonnet-4-6-20250514",
    max_tokens=16000,
    thinking={"type": "enabled", "budget_tokens": 10000},
    messages=[{"role": "user", "content": "Prove that √2 is irrational."}]
)
# Response includes a visible "thinking" block + final answer
Feature OpenAI o-series Claude extended thinking
Thinking visibility Hidden Visible
Control mechanism Reasoning effort (low/med/high) Budget tokens (1K–128K) or adaptive
Latest innovation o4-mini: tools in reasoning loop Interleaved thinking: reason between tool calls

Mixture of Experts: doing more with less

In a standard (dense) transformer, every parameter activates for every token. In a Mixture of Experts (MoE) model (Shazeer et al., 2017), each layer contains $N$ expert sub-networks (FFNs), but only $k$ are activated per token:

  1. A small router network scores each expert for the current token
  2. The top-$k$ experts are selected (e.g., 2 of 8, or 8 of 256)
  3. Only selected experts compute; the rest are skipped entirely
  4. Outputs are combined as a weighted sum based on router scores

MoE architecture

Model Total params Active per token Efficiency
Mixtral 8x7B 46.7B 12.9B (2/8 experts) 3.6×
DeepSeek-V3 671B 37B (8/256 experts) 18×
Llama 4 Maverick 400B 17B (1/128 experts) 23×

Efficiency = total / active — how much more knowledge the model stores vs. what it computes per token. MoE was first proposed by Shazeer et al. (2017) and refined by Fedus et al. (2022).

The frontier landscape: early 2026

Model Developer Key capability
Claude Opus 4.6 Anthropic 1M context, adaptive thinking, visible reasoning
GPT-5 / 5.2 OpenAI Native multimodal, 100% AIME 2025 (GPT-5.2)
Gemini 3.1 Pro Google 1M context, built-in thinking, 48.4% HLE
DeepSeek-R1 DeepSeek Open-weight reasoning, 671B MoE, $5.5M training
Llama 4 Maverick Meta Open-weight, 400B MoE, 1M context
Qwen 3 Alibaba 81.5% AIME 2025 (235B base), open-weight

Nearly every frontier model now uses both reasoning capabilities and Mixture of Experts. DeepSeek-V3 trained for ~$5.6M — roughly 1/20th of GPT-4's estimated cost — proving that reasoning (RL) + efficient architecture (MoE) is a winning combination.

Benchmark saturation and the reasoning gap

Benchmark What it measures Best Human Status
MATH-500 500 competition-level math problems 99.4% Saturated
AIME 2025 15-question high school math competition 100% Saturated
GPQA Diamond PhD-level science questions 94.3% ~65% Near-saturated
SWE-bench Verified Real GitHub issue resolution 80.9% Active
ARC-AGI-2 Novel visual reasoning patterns ~54% 60% Closing
Humanity's Last Exam Expert cross-domain questions 48.4% ~90% Not saturated

Thinking dramatically helps on problems that decompose into verifiable steps (math, coding). It helps less on problems requiring novel abstractions (ARC-AGI-2) or deep domain expertise (HLE). More thinking tokens ≠ more understanding — it's more computation within the same learned representations. ARC-AGI-2 has seen rapid recent progress (~54%, up from 4.4% in late 2024), though a gap to human performance remains.

Questions?

📧 Email
💬 Discord

Agents, tools, and the agentic era: when LLMs start acting in the world