/*
 * 2. Include this script:
 * 3. Create charts with minimal configuration - colors are auto-applied!
 */
(function() {
  'use strict';

  // Host global object: `window` in browsers, `globalThis` elsewhere
  // (e.g. headless test runners). Identical to `window` in a browser.
  const globalScope = typeof window !== 'undefined' ? window : globalThis;

  // ==========================================================================
  // READ COLORS FROM CSS CUSTOM PROPERTIES
  // This ensures chart colors stay in sync with the theme
  // ==========================================================================

  /**
   * Get a CSS custom property value from :root.
   * @param {string} name - Custom property name (e.g. '--dartmouth-green').
   * @param {string} [fallback=''] - Returned when the property is unset or
   *   no DOM/getComputedStyle is available.
   * @returns {string} Trimmed property value, or the fallback.
   */
  function getCSSVar(name, fallback = '') {
    if (typeof getComputedStyle === 'undefined') return fallback;
    const value = getComputedStyle(document.documentElement).getPropertyValue(name).trim();
    return value || fallback;
  }

  /**
   * Build the full color/font palette from CSS custom properties,
   * falling back to hard-coded theme defaults when a property is unset.
   * @returns {object} Palette of colors, grid styles, and font family.
   */
  function buildPaletteFromCSS() {
    return {
      // Primary brand colors
      dartmouthGreen: getCSSVar('--dartmouth-green', '#00693e'),
      textPrimary: getCSSVar('--text-primary', '#0a2518'),
      textSecondary: getCSSVar('--text-secondary', '#0a3d23'),

      // Chart colors (from CSS --chart-color-N variables)
      chartColors: [
        getCSSVar('--chart-color-1', '#00693e'),
        getCSSVar('--chart-color-2', '#267aba'),
        getCSSVar('--chart-color-3', '#ffa00f'),
        getCSSVar('--chart-color-4', '#9d162e'),
        getCSSVar('--chart-color-5', '#8a6996'),
        getCSSVar('--chart-color-6', '#a5d75f'),
        getCSSVar('--chart-color-7', '#003c73'),
        getCSSVar('--chart-color-8', '#d94415'),
        getCSSVar('--chart-color-9', '#643c20'),
        getCSSVar('--chart-color-10', '#c4dd88'),
        getCSSVar('--chart-color-11', '#f5dc69'),
        getCSSVar('--chart-color-12', '#424141'),
      ],

      // Background colors (semi-transparent versions)
      chartBgColors: [
        getCSSVar('--chart-bg-1', 'rgba(0, 105, 62, 0.5)'),
        getCSSVar('--chart-bg-2', 'rgba(38, 122, 186, 0.5)'),
        getCSSVar('--chart-bg-3', 'rgba(255, 160, 15, 0.5)'),
        getCSSVar('--chart-bg-4', 'rgba(157, 22, 46, 0.5)'),
        getCSSVar('--chart-bg-5', 'rgba(138, 105, 150, 0.5)'),
        getCSSVar('--chart-bg-6', 'rgba(165, 215, 95, 0.5)'),
      ],

      // Semantic colors
      positive: getCSSVar('--chart-positive', '#00693e'),
      negative: getCSSVar('--chart-negative', '#9d162e'),
      neutral: getCSSVar('--chart-neutral', '#424141'),
      highlight: getCSSVar('--chart-highlight', '#ffa00f'),

      // Grid and axis colors
      gridLight: getCSSVar('--chart-grid-light', 'rgba(0, 105, 62, 0.1)'),
      gridMedium: getCSSVar('--chart-grid-medium', 'rgba(0, 105, 62, 0.15)'),
      gridDark: getCSSVar('--chart-grid-dark', 'rgba(0, 105, 62, 0.2)'),
      axisColor: getCSSVar('--chart-axis-color', '#0a2518'),

      // Font
      fontFamily: getCSSVar('--chart-font-family', "'Avenir LT Std', 'Avenir', 'Avenir Next', -apple-system, BlinkMacSystemFont, sans-serif"),
    };
  }

  // Initialize palette (will be populated lazily when first needed)
  let CDL_PALETTE = null;
  // For convenience, expose primary chart colors array
  let CHART_COLORS = null;

  // ==========================================================================
  // FONT CONFIGURATION
  // Responsive font sizes based on typical Marp slide dimensions (1280x720)
  // ==========================================================================
  const FONT_CONFIG = {
    sizes: {
      title: 22,      // Chart title
      subtitle: 18,   // Subtitle
      legend: 16,     // Legend labels
      axisTitle: 18,  // Axis titles
      axisTicks: 16,  // Axis tick labels
      tooltip: 14,    // Tooltip text
      dataLabels: 14, // Data labels on charts
    },
    weight: {
      normal: 400,
      medium: 500,
      bold: 600,
    },
  };

  // ==========================================================================
  // HELPER FUNCTIONS
  // ==========================================================================

  /**
   * Ensure the palette has been read from CSS (lazy, cached).
   * @returns {object} The initialized palette.
   */
  function ensurePalette() {
    if (!CDL_PALETTE) {
      CDL_PALETTE = buildPaletteFromCSS();
      CHART_COLORS = CDL_PALETTE.chartColors;
    }
    return CDL_PALETTE;
  }

  /**
   * Get color for a dataset at given index.
   * Cycles through the palette if there are more datasets than colors.
   * @param {number} index - Dataset index (any non-negative integer).
   * @returns {string} A CSS color string.
   */
  function getColor(index) {
    ensurePalette();
    return CHART_COLORS[index % CHART_COLORS.length];
  }

  /**
   * Get a color with the given alpha transparency.
   * Handles #rgb and #rrggbb hex as well as rgba(...) inputs; any other
   * format is returned unchanged.
   * @param {string} color - Source color.
   * @param {number} alpha - Alpha in [0, 1].
   * @returns {string} An rgba(...) string (or the input, if unrecognized).
   */
  function getColorWithAlpha(color, alpha) {
    // Handle hex colors (#rgb shorthand and #rrggbb)
    if (color.startsWith('#')) {
      let hex = color.slice(1);
      if (hex.length === 3) {
        // Expand shorthand, e.g. #0a3 -> #00aa33
        hex = hex.split('').map(function(c) { return c + c; }).join('');
      }
      const r = parseInt(hex.slice(0, 2), 16);
      const g = parseInt(hex.slice(2, 4), 16);
      const b = parseInt(hex.slice(4, 6), 16);
      return `rgba(${r}, ${g}, ${b}, ${alpha})`;
    }
    // Handle rgba colors: swap out the trailing alpha component
    if (color.startsWith('rgba')) {
      return color.replace(/[\d.]+\)$/, `${alpha})`);
    }
    return color;
  }

  /**
   * Generate colors for all datasets in chart data.
   * Automatically assigns colors (and sensible per-type styling) to any
   * dataset that does not already specify them. Mutates and returns `data`.
   * @param {object} data - Chart.js data object ({labels, datasets}).
   * @param {string} chartType - Chart.js chart type (e.g. 'bar', 'line').
   * @returns {object} The same data object, with colors filled in.
   */
  function autoAssignColors(data, chartType) {
    if (!data || !data.datasets) return data;

    data.datasets.forEach((dataset, index) => {
      const baseColor = getColor(index);

      // Only assign colors if not already specified
      switch (chartType) {
        case 'bar':
        case 'horizontalBar':
          if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; }
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; }
          break;

        case 'line':
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (!dataset.backgroundColor) {
            dataset.backgroundColor = getColorWithAlpha(baseColor, 0.1);
          }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 3; }
          if (dataset.pointRadius === undefined) { dataset.pointRadius = 6; }
          if (!dataset.pointBackgroundColor) { dataset.pointBackgroundColor = baseColor; }
          if (dataset.tension === undefined) { dataset.tension = 0.3; }
          break;

        case 'scatter':
        case 'bubble':
          if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; }
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (dataset.pointRadius === undefined) { dataset.pointRadius = 15; }
          if (dataset.pointHoverRadius === undefined) { dataset.pointHoverRadius = 18; }
          break;

        case 'pie':
        case 'doughnut':
        case 'polarArea':
          // For pie charts, we need multiple colors for one dataset
          if (!dataset.backgroundColor) {
            const numItems = dataset.data ? dataset.data.length : 6;
            dataset.backgroundColor = [];
            for (let i = 0; i < numItems; i++) {
              dataset.backgroundColor.push(getColor(i));
            }
          }
          if (!dataset.borderColor) {
            dataset.borderColor = '#d8d8d8'; // Slide background
          }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; }
          break;

        case 'radar':
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
          if (!dataset.backgroundColor) {
            dataset.backgroundColor = getColorWithAlpha(baseColor, 0.2);
          }
          if (dataset.borderWidth === undefined) { dataset.borderWidth = 2; }
          if (dataset.pointRadius === undefined) { dataset.pointRadius = 4; }
          if (!dataset.pointBackgroundColor) { dataset.pointBackgroundColor = baseColor; }
          break;

        default:
          // Generic color assignment
          if (!dataset.backgroundColor) { dataset.backgroundColor = baseColor; }
          if (!dataset.borderColor) { dataset.borderColor = baseColor; }
      }
    });

    return data;
  }

  // ==========================================================================
  // CHART.JS GLOBAL DEFAULTS
  // ==========================================================================

  /**
   * Apply theme-wide Chart.js defaults (fonts, colors, tooltips, scales).
   * @returns {boolean} True if defaults were applied; false if Chart.js is
   *   not loaded.
   */
  function applyGlobalDefaults() {
    if (typeof Chart === 'undefined') {
      console.warn('Chart.js not loaded. chart-defaults.js requires Chart.js to be loaded first.');
      return false;
    }

    // Ensure palette is loaded from CSS
    const palette = ensurePalette();

    // Font defaults
    Chart.defaults.font.family = palette.fontFamily;
    Chart.defaults.font.size = FONT_CONFIG.sizes.axisTicks;
    Chart.defaults.color = palette.textPrimary;

    // Responsive defaults
    Chart.defaults.responsive = true;
    Chart.defaults.maintainAspectRatio = false;

    // Animation (subtle)
    Chart.defaults.animation.duration = 400;

    // Legend
    Chart.defaults.plugins.legend.labels.font = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.legend,
      weight: FONT_CONFIG.weight.normal,
    };
    Chart.defaults.plugins.legend.labels.color = palette.textPrimary;
    Chart.defaults.plugins.legend.labels.usePointStyle = true;
    Chart.defaults.plugins.legend.labels.padding = 20;

    // Title
    Chart.defaults.plugins.title.font = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.title,
      weight: FONT_CONFIG.weight.medium,
    };
    Chart.defaults.plugins.title.color = palette.textPrimary;

    // Tooltip
    Chart.defaults.plugins.tooltip.backgroundColor = palette.textPrimary;
    Chart.defaults.plugins.tooltip.titleFont = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.tooltip,
      weight: FONT_CONFIG.weight.medium,
    };
    Chart.defaults.plugins.tooltip.bodyFont = {
      family: palette.fontFamily,
      size: FONT_CONFIG.sizes.tooltip,
    };
    Chart.defaults.plugins.tooltip.cornerRadius = 4;
    Chart.defaults.plugins.tooltip.padding = 10;

    // Scale defaults (for cartesian charts), applied per scale type below.
    const scaleDefaults = {
      grid: {
        color: palette.gridLight,
        lineWidth: 1,
      },
      border: {
        color: palette.gridDark,
        width: 1,
      },
      ticks: {
        font: {
          family: palette.fontFamily,
          size: FONT_CONFIG.sizes.axisTicks,
        },
        color: palette.textPrimary,
      },
      title: {
        font: {
          family: palette.fontFamily,
          size: FONT_CONFIG.sizes.axisTitle,
          weight: FONT_CONFIG.weight.normal,
        },
        color: palette.textPrimary,
      },
    };

    // Apply shared scale defaults to each cartesian scale type.
    ['linear', 'category', 'logarithmic'].forEach(function(scaleType) {
      const scale = Chart.defaults.scales && Chart.defaults.scales[scaleType];
      if (!scale) return;
      if (scale.grid) Object.assign(scale.grid, scaleDefaults.grid);
      if (scale.border) Object.assign(scale.border, scaleDefaults.border);
      if (scale.ticks) Object.assign(scale.ticks, scaleDefaults.ticks);
      if (scale.title) Object.assign(scale.title, scaleDefaults.title);
    });

    // Radial scale (for radar charts) has a different structure.
    if (Chart.defaults.scales && Chart.defaults.scales.radialLinear) {
      const radial = Chart.defaults.scales.radialLinear;
      if (radial.grid) radial.grid.color = palette.gridLight;
      if (radial.angleLines) radial.angleLines.color = palette.gridMedium;
      if (radial.pointLabels) {
        radial.pointLabels.font = {
          family: palette.fontFamily,
          size: FONT_CONFIG.sizes.axisTicks,
        };
        radial.pointLabels.color = palette.textPrimary;
      }
    }

    return true;
  }

  // ==========================================================================
  // CHART WRAPPER FOR AUTO-STYLING
  // ==========================================================================

  /**
   * Wrap the Chart constructor so every chart created afterwards
   * automatically gets CDL colors and per-type option defaults.
   */
  function wrapChartConstructor() {
    if (typeof Chart === 'undefined') return;

    const OriginalChart = Chart;

    // Create a wrapper that auto-applies colors
    globalScope.Chart = function(ctx, config) {
      // Auto-assign colors if not specified
      if (config && config.data) {
        config.data = autoAssignColors(config.data, config.type);
      }
      // Merge default options for specific chart types
      if (config && config.options) {
        config.options = applyChartTypeDefaults(config.type, config.options);
      }
      // Call original constructor
      return new OriginalChart(ctx, config);
    };

    // Copy static properties and methods, and preserve the prototype chain
    // so `instanceof Chart` and static helpers still work.
    Object.setPrototypeOf(globalScope.Chart, OriginalChart);
    Object.assign(globalScope.Chart, OriginalChart);
    globalScope.Chart.prototype = OriginalChart.prototype;
  }

  /**
   * Apply chart-type specific option defaults.
   * Returns a new options object; nested objects that are modified are
   * cloned first so the caller's config is never mutated.
   * @param {string} chartType - Chart.js chart type.
   * @param {object} userOptions - User-supplied options.
   * @returns {object} Options with type-specific defaults filled in.
   */
  function applyChartTypeDefaults(chartType, userOptions) {
    const options = { ...userOptions };

    switch (chartType) {
      case 'bar':
      case 'horizontalBar':
        // Bar chart defaults (clone before writing: don't mutate caller's objects)
        options.scales = { ...options.scales };
        options.scales.x = { ...options.scales.x };
        options.scales.y = { ...options.scales.y };
        // Hide x-axis grid for cleaner look
        if (options.scales.x.grid === undefined) {
          options.scales.x.grid = { display: false };
        }
        break;

      case 'line':
        // Line chart defaults
        if (!options.interaction) {
          options.interaction = { intersect: false, mode: 'index' };
        }
        break;

      case 'pie':
      case 'doughnut':
        // Pie/doughnut defaults (clone before writing)
        options.plugins = { ...options.plugins };
        if (options.plugins.legend === undefined) {
          const palette = ensurePalette();
          options.plugins.legend = {
            position: 'right',
            labels: {
              font: {
                family: palette.fontFamily,
                size: FONT_CONFIG.sizes.legend,
              },
              color: palette.textPrimary,
              padding: 15,
            },
          };
        }
        break;

      case 'radar':
        // Radar chart defaults - keep as-is, scale defaults applied globally
        break;

      case 'scatter':
      case 'bubble':
        // Scatter/bubble defaults (clone before writing)
        options.scales = { ...options.scales };
        options.scales.x = { ...options.scales.x };
        options.scales.y = { ...options.scales.y };
        break;
    }

    return options;
  }

  // ==========================================================================
  // CONVENIENCE FUNCTIONS FOR USERS
  // Exposed on window.CDLChart for easy access
  // ==========================================================================
  globalScope.CDLChart = {
    // Color palette access (getters to ensure lazy initialization)
    get colors() { return ensurePalette().chartColors; },
    get palette() { return ensurePalette(); },

    // Get specific color by index
    getColor: getColor,

    // Get color with transparency
    getColorWithAlpha: getColorWithAlpha,

    // Get array of colors for a specific count
    getColors: function(count) {
      ensurePalette();
      const result = [];
      for (let i = 0; i < count; i++) {
        result.push(getColor(i));
      }
      return result;
    },

    // Font configuration
    fonts: FONT_CONFIG,

    // Quick chart creation helpers
    // These create minimal config that auto-applies all styling

    /**
     * Create a simple bar chart.
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - X-axis labels
     * @param {number[]} data - Data values
     * @param {object} options - Optional overrides
     */
    bar: function(canvasId, labels, data, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'bar',
        data: {
          labels: labels,
          datasets: [{ data: data }],
        },
        options: {
          plugins: { legend: { display: false } },
          ...options,
        },
      });
    },

    /**
     * Create a simple line chart.
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - X-axis labels
     * @param {Array} datasets - Array of {label, data} objects
     * @param {object} options - Optional overrides
     */
    line: function(canvasId, labels, datasets, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'line',
        data: {
          labels: labels,
          datasets: datasets.map(ds => ({
            label: ds.label,
            data: ds.data,
            fill: ds.fill !== undefined ? ds.fill : true,
          })),
        },
        options: options,
      });
    },

    /**
     * Create a simple pie chart.
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - Slice labels
     * @param {number[]} data - Data values
     * @param {object} options - Optional overrides
     */
    pie: function(canvasId, labels, data, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'pie',
        data: {
          labels: labels,
          datasets: [{ data: data }],
        },
        options: options,
      });
    },

    /**
     * Create a simple scatter chart.
     * @param {string} canvasId - Canvas element ID
     * @param {Array} datasets - Array of {label, data: [{x, y}]} objects
     * @param {object} options - Optional overrides
     */
    scatter: function(canvasId, datasets, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'scatter',
        data: {
          datasets: datasets.map(ds => ({
            label: ds.label,
            data: ds.data,
          })),
        },
        options: options,
      });
    },

    /**
     * Create a doughnut chart.
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - Slice labels
     * @param {number[]} data - Data values
     * @param {object} options - Optional overrides
     */
    doughnut: function(canvasId, labels, data, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'doughnut',
        data: {
          labels: labels,
          datasets: [{ data: data }],
        },
        options: options,
      });
    },

    /**
     * Create a radar chart.
     * @param {string} canvasId - Canvas element ID
     * @param {string[]} labels - Axis labels
     * @param {Array} datasets - Array of {label, data} objects
     * @param {object} options - Optional overrides
     */
    radar: function(canvasId, labels, datasets, options = {}) {
      return new Chart(document.getElementById(canvasId), {
        type: 'radar',
        data: {
          labels: labels,
          datasets: datasets.map(ds => ({
            label: ds.label,
            data: ds.data,
          })),
        },
        options: options,
      });
    },
  };

  // ==========================================================================
  // INITIALIZATION
  // ==========================================================================

  /**
   * Apply defaults now if Chart.js is present; otherwise poll for up to
   * 5 seconds (Chart.js CDN may still be loading).
   * @returns {boolean} True if defaults were applied synchronously.
   */
  function initialize() {
    if (typeof Chart !== 'undefined') {
      applyGlobalDefaults();
      wrapChartConstructor();
      console.log('CDL Chart defaults applied successfully.');
      return true;
    }

    // Chart.js not yet loaded - wait and retry
    let retries = 0;
    const maxRetries = 50; // 5 seconds max wait
    const checkInterval = setInterval(function() {
      retries++;
      if (typeof Chart !== 'undefined') {
        clearInterval(checkInterval);
        applyGlobalDefaults();
        wrapChartConstructor();
        console.log('CDL Chart defaults applied successfully (after waiting for Chart.js).');
      } else if (retries >= maxRetries) {
        clearInterval(checkInterval);
        console.warn('Chart.js not found after waiting. CDL Chart defaults not applied.');
      }
    }, 100);
    // In non-browser hosts the timer would keep the process alive; unref it.
    // No-op in browsers, where setInterval returns a plain number.
    if (typeof checkInterval.unref === 'function') checkInterval.unref();
    return false;
  }

  // Initialize IMMEDIATELY - this must run BEFORE any chart creation scripts
  // Chart.js CDN should be loaded before this script
  initialize();
})();

Lecture 19: BERT variants

PSYC 51.17: Models of language and communication

Jeremy R. Manning
Dartmouth College
Winter 2026

Learning objectives

  1. Identify key limitations of the original BERT training procedure
  2. Explain how RoBERTa, ALBERT, DistilBERT, ELECTRA, and ModernBERT each address different limitations
  3. Compare parameter efficiency, training efficiency, and inference speed across variants
  4. Select the appropriate variant for a given task and resource constraint
  5. Argue whether encoder models remain relevant in the era of GPT-style decoders

Play around with different BERT variants in our companion Notebook!

What could be improved?

Training procedure: NSP may hurt performance; static masking reuses the same masks every epoch; only 15% of tokens provide training signal (recall the 80/10/10 MLM procedure).

Scale: Trained on only 3.3B words with 100K steps — modern datasets are 100× larger (and often train for much longer).

Efficiency: 110M parameters are all active for every input. Large memory footprint for deployment.

RoBERTa: robustly optimized BERT

RoBERTa (Liu et al., 2019) keeps BERT's architecture but fixes the training recipe:

  1. Remove NSP — Next Sentence Prediction hurt performance. Use only MLM with full sentences.
  2. Dynamic masking — Generate a new masking pattern every time a sequence is seen, instead of reusing the same mask.
  3. Larger batches, more data — Batch size 8K sequences (vs BERT's 256), 160GB text (vs 16GB), 500K steps (vs 100K).
  4. Longer sequences — Train on longer contiguous text for better long-range understanding.

Training procedure matters as much as architecture. RoBERTa shows that BERT was significantly undertrained.

Dynamic masking

BERT used static masking — the same mask positions every epoch, risking memorization. RoBERTa generates new masks on-the-fly during training:

  • Epoch 1: "My [MASK] is cute"
  • Epoch 2: "My dog [MASK] cute"
  • Epoch 3: "My dog is [MASK]"

More diverse training signal → better generalization. Combined with removing NSP and training longer on more data, this alone accounts for most of RoBERTa's gains.

RoBERTa results

Task (metric) BERT-Large RoBERTa Improvement
SQuAD 2.0 (F1) 83.1 89.4 +6.3
MNLI (accuracy) 86.7 90.2 +3.5
SST-2 (accuracy) 94.9 96.4 +1.5
RACE (accuracy) 72.0 83.2 +11.2

SQuAD 2.0: reading comprehension + unanswerable questions (human F1: 89.5 — RoBERTa essentially matches humans). MNLI: natural language inference across 10 genres (human: 92.0%, random: 33.3%). SST-2: binary sentiment classification (human: ~97%, random: 50%). RACE: multiple-choice reading comprehension from English exams (human: 92.2%, random: 25%). The RACE gain (+11.2 pts) is especially striking because it requires multi-sentence reasoning — exactly the capability that better training unlocks.

  • Dynamic masking consistently outperforms static masking
  • More data + longer training = substantial gains
  • Removing NSP improves downstream task performance
  • Some tasks see enormous gains (RACE: +11.2 points!)

ALBERT: a lite BERT

ALBERT (Lan et al., 2019) dramatically reduces BERT's parameter count through three innovations:

1. Factorized embedding parameters:

  • BERT: vocabulary (30K) × hidden size (768) = 23M parameters
  • ALBERT: vocabulary (30K) × embedding size (128) + embedding (128) × hidden (768) = 3.9M parameters
  • 83% fewer embedding parameters

2. Cross-layer parameter sharing:

  • All 12 transformer layers share the same weights
  • Like running the same layer 12 times (iterative refinement)
  • 89% fewer transformer parameters

3. Sentence Order Prediction (SOP):

  • Replaces NSP with a harder task: are sentences A, B in the correct order?
  • Forces the model to learn discourse coherence, not just topic matching

Factorized embedding parameters

BERT embeds each token directly into the hidden dimension $C = 768$. ALBERT introduces a bottleneck $E = 128 \ll C$:

$$\underbrace{V \times C}_{\text{BERT: 30K} \times \text{768}} \quad\longrightarrow\quad \underbrace{V \times E}_{\text{30K} \times \text{128}} \;\times\; \underbrace{E \times C}_{\text{128} \times \text{768}}$$

This factorization applies to the token embeddings, which dominate the parameter count. Positional ($512 \times E$) and segment ($2 \times E$) embeddings also use $E$, but they're tiny. All three are summed in $E$-space, then projected to $C$ with a single linear layer.

The vocabulary matrix is low-rank — tokens cluster into a much smaller subspace than $C = 768$. Embeddings only need to encode token identity; the projection layer handles the mapping to hidden space. Result: 83% fewer embedding parameters (23M → 3.9M).

ALBERT parameter efficiency

Model Layers Hidden size Parameters
BERT-base 12 768 110M
ALBERT-base 12 768 12M
ALBERT-large 24 1024 18M
ALBERT-xlarge 24 2048 60M
ALBERT-xxlarge 12 4096 235M

Factorized embedding in Python


1import torch.nn as nn
2
3# BERT: direct embedding (30K vocab × 768 hidden = 23M params)
4bert_embed = nn.Embedding(30000, 768)
5
6# ALBERT: two-step embedding (30K × 128 + 128 × 768 = 3.9M params)
7albert_embed = nn.Embedding(30000, 128)
8albert_project = nn.Linear(128, 768)
9# 83% fewer embedding parameters!

Cross-layer parameter sharing: BERT vs ALBERT


1class BERT:
2    def __init__(self):
3        # Each layer has unique parameters
4        self.layers = [TransformerLayer() for _ in range(12)]
5        # 12 × 7M params = 85M params in transformer layers
6
7    def forward(self, x):
8        for layer in self.layers:
9            x = layer(x)   # Different weights each time
10        return x

Cross-layer parameter sharing: BERT vs ALBERT


1class ALBERT:
2    def __init__(self):
3        # Single shared layer
4        self.shared_layer = TransformerLayer()
5        # 1 × 7M params = 7M params (89% reduction!)
6
7    def forward(self, x):
8        for _ in range(12):
9            x = self.shared_layer(x)  # Same weights reused
10        return x

ALBERT has 89% fewer parameters but the same compute cost — it still performs 12 forward passes through the transformer layer. The savings are in memory, not speed.

DistilBERT: knowledge distillation

Knowledge distillation (Sanh et al., 2019) compresses BERT into a smaller, faster model:

  • Teacher: Full BERT-base (12 layers, frozen)
  • Student: DistilBERT (6 layers, trainable)
  • Training signal: Student learns to match the teacher's soft probability distributions, not just the hard labels

The teacher's "wrong" predictions contain useful information — e.g., predicting "cat" is more likely than "car" for a masked animal slot tells the student about semantic similarity.

Metric BERT-base DistilBERT Change
Parameters 110M 66M 40% smaller
Inference speed 1× 1.6× 60% faster
GLUE score 79.6 77.0 97% retained

GLUE (General Language Understanding Evaluation) is a benchmark suite of 9 tasks — including MNLI, SST-2, and others — that tests grammar, sentiment, similarity, and inference. The score is an average across all tasks; human baseline is ~87.

DistilBERT training in Python


1teacher = BertModel.from_pretrained("bert-base")  # 12 layers, frozen
2student = DistilBertModel(num_layers=6)            # 6 layers, trainable
3
4for batch in training_data:
5    # Teacher provides "soft targets" (probability distributions)
6    with torch.no_grad():
7        teacher_logits = teacher(batch)  # e.g., [0.7, 0.2, 0.1, ...]
8
9    # Student tries to match teacher's distribution
10    student_logits = student(batch)
11
12    # Distillation loss: KL divergence between soft distributions
13    # Temperature T=2 softens the distribution (more informative)
14    loss_distill = KL_divergence(
15        softmax(student_logits / T),
16        softmax(teacher_logits / T)
17    )
18    loss_mlm = masked_lm_loss(student_logits, labels)
19
20    loss = 0.5 * loss_distill + 0.5 * loss_mlm

ELECTRA: efficient learning from all tokens

ELECTRA (Clark et al., 2020) uses a generator-discriminator setup:

  1. A small generator (like a mini-BERT) fills in masked tokens with plausible replacements
  2. A discriminator classifies every token as original or replaced
  3. The discriminator provides a training signal for all tokens — not just the 15% that were masked

BERT's MLM replaces tokens with [MASK] — an obvious tell that never appears in real text. The discriminator's task is harder: the generator produces plausible substitutions ("ate" for "cooked"), so the discriminator must understand the full context deeply enough to detect subtle semantic mismatches. This is closer to how humans process language — we don't spot blank slots, we notice when something doesn't quite fit.

ELECTRA in Python


1import torch
2from transformers import AutoTokenizer, AutoModelForMaskedLM, ElectraForPreTraining
3
4tokenizer = AutoTokenizer.from_pretrained("google/electra-small-generator")
5generator = AutoModelForMaskedLM.from_pretrained("google/electra-small-generator")
6discriminator = ElectraForPreTraining.from_pretrained("google/electra-small-discriminator")
7
8# Step 1: Mask a token and let the generator fill it in
9original = "The chef cooked a delicious meal"
10masked = "The chef [MASK] a delicious meal"
11inputs = tokenizer(masked, return_tensors="pt")
12
13with torch.no_grad():
14    gen_logits = generator(**inputs).logits
15
16mask_idx = (inputs.input_ids == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
17predicted_id = gen_logits[0, mask_idx].argmax(dim=-1)
18replacement = tokenizer.decode(predicted_id)
19print(f"Generator filled [MASK] → '{replacement}'")  # e.g., "prepared"
20
21# Step 2: Discriminator classifies EVERY token as original or replaced
22fake_ids = inputs.input_ids.clone()
23fake_ids[0, mask_idx] = predicted_id
24with torch.no_grad():
25    disc_logits = discriminator(fake_ids).logits
26
27predictions = (disc_logits.squeeze() > 0).long()  # positive logit = "fake"
28tokens = tokenizer.convert_ids_to_tokens(fake_ids[0])
29for tok, pred in zip(tokens, predictions):
30    print(f"  {tok:12s}{'REPLACED' if pred else 'original'}")

ELECTRA efficiency

ELECTRA learns from 100% of tokens (every position gets a real/replaced label), compared to BERT's 15%. This 6.7× increase in training signal means ELECTRA reaches BERT-level performance with 4× less compute.

ELECTRA is ideal when you have limited compute budget:

  • ELECTRA-Small outperforms BERT-Small
  • ELECTRA-Base is competitive with BERT-Large
  • With the same compute budget, ELECTRA consistently wins

ModernBERT: 6 years of decoder tricks, applied to encoders

ModernBERT (Warner et al., 2024) asks: what if we rebuilt BERT from scratch using everything we've learned from training decoders?

Component BERT (2018) ModernBERT (2024)
Position encoding Learned absolute (512 max) RoPE (8,192 tokens)
Attention Full quadratic Flash Attention + alternating global/local
Padding Processes pad tokens Unpadding (only real tokens)
Training data 3.3B words 2 trillion tokens (600×)
Code understanding None Trained on code corpora

SOTA on GLUE, retrieval (MTEB), and code understanding. Available as answerdotai/ModernBERT-base and answerdotai/ModernBERT-large. Proves that the encoder architecture still has room to grow.

ModernBERT key innovations

Instead of learning a fixed position embedding for each slot (BERT's approach, capped at 512 tokens), RoPE encodes position by rotating the query and key vectors in attention. Relative distances are captured by the angle between rotated vectors. This generalizes to sequences longer than training length — ModernBERT handles 8,192 tokens vs BERT's 512.

Standard attention materializes the full $T \times T$ attention matrix in GPU memory ($O(T^2)$ space). Flash Attention computes attention tile-by-tile in fast on-chip SRAM, never storing the full matrix. Same exact result, but ~2–4× faster and uses $O(T)$ memory. ModernBERT alternates between global attention (every token sees every token) and local attention (sliding window) across layers.

Batched inputs require padding shorter sequences to the same length. BERT wastes compute processing these pad tokens through every layer. Unpadding strips pad tokens before the transformer and reinserts them after, so the model only processes real tokens. In batches with variable-length inputs, this can save 20–30% of total compute.

Variant comparison summary

Model Key innovation Best for
BERT MLM + NSP Baseline, well-understood
RoBERTa Better training recipe Maximum quality (classic)
ALBERT Parameter sharing Memory-constrained deployment
DistilBERT Knowledge distillation Speed-critical production
ELECTRA Replaced token detection Limited training budget
ModernBERT Modern training + RoPE + Flash Attention Maximum quality (2024)
  • Best quality (2024): ModernBERT-Large
  • Best efficiency: DistilBERT
  • Limited memory: ALBERT
  • Limited training budget: ELECTRA
  • Good default: RoBERTa-Base or ModernBERT-Base

Other notable BERT variants

DeBERTa (Microsoft, 2020): Disentangled attention separates content and position representations. Enhanced mask decoder. State-of-the-art on SuperGLUE.

SpanBERT (Facebook, 2019): Masks random contiguous spans instead of individual tokens. Span boundary objective. Better for extractive tasks (QA, coreference).

ERNIE (Baidu, 2019): Entity-level and phrase-level masking. Knowledge-enhanced pre-training. Strong on Chinese NLP tasks.

BART (Facebook, 2019): Encoder-decoder architecture (not encoder-only). Denoising autoencoder with various corruption strategies. Excellent for generation tasks.

The case against encoders

2020 — In-context learning: GPT-3 (Brown et al.) showed that a single decoder model can perform classification, NLI, and QA via prompting — tasks that previously required fine-tuning separate BERT models for each.

2023 — Decoders match fine-tuned encoders: GPT-4 (OpenAI) matched or exceeded fine-tuned BERT/RoBERTa on many NLU benchmarks without any task-specific training.

2024 — Decoders do retrieval too: GritLM (Muennighoff et al.) demonstrated a single decoder model that handles both generation and embedding at SOTA levels — the last domain where encoders had a clear advantage.

Why fine-tune six BERT models for six tasks when one decoder does them all via prompting?

Gemma Encoder and the encoder renaissance

Gemma (Google, 2024) is a family of open-weight decoder-only language models (2B and 7B parameters) built from the same research behind Gemini. Outperforms similarly sized open models on 11/18 text benchmarks.

Gemma Encoder (2025) repurposes Gemma's decoder weights for bidirectional encoding — Google's first encoder-only model since BERT. Competitive with ModernBERT on sentence embedding tasks.

If Google — a company betting heavily on decoder-only models (Gemini) — still releases an encoder model, it signals that encoders serve a purpose decoders can't efficiently fill. The encoder isn't dead; it's being modernized.

Questions?

📧 Email
💬 Discord

Encoder models in the real world — applications, brain-model convergence, and what it all means for language and society

split: 25

split: 30