Interactive Tokenization Visualizer

Explore how different tokenization algorithms break down text

The visualizer first downloads the GPT-2, BERT, and T5 tokenizers from HuggingFace.
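To reproduce this step outside the browser, something like the sketch below works. It assumes the HuggingFace transformers library, and the checkpoint names (gpt2, bert-base-uncased, t5-small) are assumptions; the page does not say which variants it loads.

```python
# Minimal sketch: load the three tokenizers and compare how each one
# splits the same sentence. Checkpoint names are assumptions; the page
# does not specify which BERT or T5 variants it downloads.
from transformers import AutoTokenizer

checkpoints = {
    "GPT-2": "gpt2",
    "BERT": "bert-base-uncased",
    "T5": "t5-small",
}
tokenizers = {name: AutoTokenizer.from_pretrained(ckpt)
              for name, ckpt in checkpoints.items()}

text = "Tokenizers break text into subword units."
for name, tok in tokenizers.items():
    print(f"{name:6} {tok.tokenize(text)}")
```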

Byte Pair Encoding (BPE) iteratively merges the most frequent pair of adjacent tokens, starting from individual characters. Watch the algorithm work step by step (a minimal code sketch of the merge loop follows the mode options below):

  • Start Visualization: Begin the algorithm and see each merge step
  • Next Step: Advance one merge at a time to understand each decision
  • Auto Play: Watch the algorithm run automatically
BPE Mode:
• Simplified (Educational): Uses 16 hardcoded common English patterns for easy learning
• Real GPT-2: Uses 5,000 actual merge rules from OpenAI's GPT-2 tokenizer
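The merge loop animated here can be sketched in a dozen lines. This is an educational approximation in the spirit of the Simplified mode, not GPT-2's implementation: it counts adjacent pairs within the input itself, whereas real GPT-2 BPE works on bytes and replays a fixed, pre-learned merge list.

```python
# Educational BPE sketch: repeatedly merge the most frequent adjacent pair.
from collections import Counter

def bpe_steps(text, max_merges=10):
    tokens = list(text)                           # start from single characters
    for step in range(1, max_merges + 1):
        pairs = Counter(zip(tokens, tokens[1:]))  # count adjacent pairs
        if not pairs:
            break
        best = max(pairs, key=pairs.get)          # most frequent adjacent pair
        if pairs[best] < 2:                       # no pair repeats: stop merging
            break
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])  # apply the merge
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
        print(f"Step {step}: merge {best} -> {tokens}")
    return tokens

bpe_steps("lowerlowestlowly")
```

Each printed line corresponds to one "Next Step" in the visualizer: the chosen pair is that step's merge rule, and the shrinking list is the current token state.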

BPE Algorithm Steps

Click "Load Text" to begin the BPE visualization

Merge History

A log of the merges applied so far.

Current State

Tracks the current step, the token count, and the number of candidate pairs found.

BPE Merge Tree

Explore Tokenizer Vocabularies

Browse the complete vocabulary of each tokenizer. Each model uses a different strategy:

• GPT-2: 50,257 tokens using Byte Pair Encoding. Leading spaces are displayed with a visible marker (raw vocabulary entries use Ġ).
• BERT: 30,522 tokens using WordPiece. Subword continuations are displayed as ·· (raw entries use ##). Includes 994 placeholder tokens, [unused0] through [unused993], reserved for fine-tuning.
• T5: 32,128 tokens using SentencePiece. Leading spaces are displayed with a visible marker (raw entries use ▁).

Hover over tokens to see raw values; a sketch for browsing these vocabularies programmatically appears below.
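A programmatic counterpart to this browser, again a sketch assuming HuggingFace transformers and the same checkpoint names as above:

```python
# Sketch: inspect each vocabulary's size and its raw marker conventions.
from transformers import AutoTokenizer

gpt2 = AutoTokenizer.from_pretrained("gpt2")
bert = AutoTokenizer.from_pretrained("bert-base-uncased")
t5 = AutoTokenizer.from_pretrained("t5-small")

print(len(gpt2), len(bert), len(t5))
# GPT-2 reports 50,257 and BERT 30,522; the T5 tokenizer reports fewer
# entries than the 32,128 quoted above, which counts the model's padded
# embedding rows rather than the raw SentencePiece vocabulary.

bert_vocab = bert.get_vocab()                    # token string -> token ID
print(sorted(t for t in bert_vocab if t.startswith("##"))[:5])  # '##' pieces
print(len([t for t in bert_vocab if t.startswith("[unused")]))  # 994 reserved

# GPT-2 and T5 mark leading spaces directly in the raw token string:
print([t for t in gpt2.get_vocab() if t.startswith("Ġ")][:5])
print([t for t in t5.get_vocab() if t.startswith("▁")][:5])
```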

The vocabulary browser also reports the selected tokenizer's Vocabulary Size and Subword Tokens counts, and lists each entry with its token ID, token, type, and length.