Explore how different tokenization algorithms break down text
What is tokenization? Tokenization breaks text into smaller units (tokens) that language models can process. Different models use different algorithms: GPT-2 uses BPE, BERT uses WordPiece, and T5 uses SentencePiece. The ␣ symbol indicates a leading space in the original text.
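If you want to reproduce this comparison outside the page, here is a minimal sketch using the HuggingFace transformers library. The model names are the standard public checkpoints and the sample sentence is only illustrative; the "chars per token" line is just one simple compression metric, not necessarily the exact statistic the page reports.

```python
# Minimal sketch: tokenize the same sentence with GPT-2 (BPE), BERT (WordPiece),
# and T5 (SentencePiece) and compare tokens and IDs.
# Assumes the `transformers` package is installed and can download the models.
from transformers import AutoTokenizer

text = "Tokenization breaks text into smaller units."  # illustrative sample

for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    tok = AutoTokenizer.from_pretrained(name)
    tokens = tok.tokenize(text)
    ids = tok.convert_tokens_to_ids(tokens)
    print(f"{name}: vocab size = {tok.vocab_size}")
    print("  tokens:", tokens)   # GPT-2 shows Ġ, BERT shows ##, T5 shows ▁
    print("  ids:   ", ids)
    print("  chars per token:", round(len(text) / len(tokens), 2))
```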
GPT-2 (BPE)
OpenAI
Vocab: 50,257 · Tokens: - · Ratio: -
Token IDs
BERT (WordPiece)
Google
Vocab: 30,522 · Tokens: - · Ratio: -
Token IDs
T5 (SentencePiece)
Google
Vocab: 32,128 · Tokens: - · Ratio: -
Token IDs
Comparison Statistics
Byte Pair Encoding (BPE) starts from individual characters and iteratively merges the most frequent adjacent pair of symbols. Watch the algorithm work step by step (a minimal code sketch follows the mode options below):
Start Visualization: Begin the algorithm and see each merge step
Next Step: Advance one merge at a time to understand each decision
Auto Play: Watch the algorithm run automatically
BPE Mode:
Mode: Simplified (Educational) - 16 common patterns
• Simplified: Uses 16 hardcoded common English patterns for easy learning
• Real GPT-2: Uses 5,000 actual merge rules from OpenAI's GPT-2 tokenizer
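To make the merge loop concrete, here is a small educational sketch of BPE training on a tiny corpus. The corpus and the number of merge steps are illustrative assumptions; this is the classic word-level formulation, not GPT-2's byte-level variant or its real merge table.

```python
# Educational BPE sketch: repeatedly merge the most frequent adjacent pair of
# symbols. Each word starts as a character sequence plus an end-of-word mark.
from collections import Counter

corpus = ["low", "lower", "lowest", "newer", "wider"]  # illustrative corpus
words = [list(w) + ["</w>"] for w in corpus]

def most_frequent_pair(words):
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0] if pairs else None

def merge_pair(words, pair):
    a, b = pair
    merged = []
    for w in words:
        out, i = [], 0
        while i < len(w):
            if i + 1 < len(w) and (w[i], w[i + 1]) == (a, b):
                out.append(a + b)   # merge the pair into a single symbol
                i += 2
            else:
                out.append(w[i])
                i += 1
        merged.append(out)
    return merged

for step in range(8):               # a handful of merge steps for illustration
    best = most_frequent_pair(words)
    if best is None:
        break
    (a, b), count = best
    print(f"step {step + 1}: merge {a!r} + {b!r} (count = {count})")
    words = merge_pair(words, (a, b))

print("final segmentations:", [" ".join(w) for w in words])
```

Each printed merge corresponds to one row of the merge history shown by the visualization; the final segmentations are the "current state" after all merges so far.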
BPE Algorithm Steps
Click "Load Text" to begin the BPE visualization
Merge History
Current State
Visualization will appear here
Current Step: 0
Token Count: 0
Pairs Found: 0
BPE Merge Tree
Explore Tokenizer Vocabularies
Browse the complete vocabulary of each tokenizer. Each model uses different strategies:
GPT-2: 50,257 tokens using Byte Pair Encoding. Leading spaces shown as ␣ (original: Ġ).
BERT: 30,522 tokens using WordPiece. Subword continuations shown as ·· (original: ##). Includes 994 [unused0]–[unused993] placeholder tokens reserved for fine-tuning.
T5: 32,128 tokens using SentencePiece. Leading spaces shown as ␣ (original: ▁).
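A sketch of how you might browse these vocabularies yourself, again assuming the transformers package. The checkpoint names are the standard public ones, the slice of IDs shown is an arbitrary choice to skip special/unused/byte tokens, and the display substitutions mirror the markers described above.

```python
# Minimal sketch: peek into each tokenizer's vocabulary and show the raw marker
# characters (Ġ for GPT-2, ## for BERT, ▁ for T5) next to a cleaned display form.
from transformers import AutoTokenizer

def preview_vocab(name, n=10):
    tok = AutoTokenizer.from_pretrained(name)
    vocab = tok.get_vocab()                      # maps token string -> id
    print(f"{name}: {len(vocab)} entries")
    # Skip the lowest ids (special, unused, and byte tokens) for a more
    # representative slice of ordinary subwords.
    for token, idx in sorted(vocab.items(), key=lambda kv: kv[1])[1000:1000 + n]:
        display = token.replace("Ġ", "␣").replace("▁", "␣").replace("##", "··")
        print(f"  {idx:>6}  raw={token!r:<15} display={display!r}")

for name in ["gpt2", "bert-base-uncased", "t5-small"]:
    preview_vocab(name)
```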