Benchmark Tasks
Semantic Similarity
Test how well models capture semantic similarity between sentence pairs.
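A minimal sketch of how this task can be scored, assuming the sentence-transformers and scipy packages; the checkpoint name, sentence pairs, and human ratings below are invented for illustration:

```python
from sentence_transformers import SentenceTransformer
from scipy.stats import spearmanr

# Hypothetical pairs with human similarity ratings (0 = unrelated, 5 = equivalent)
pairs = [
    ("A man is playing a guitar", "Someone plays an instrument"),
    ("Kids are playing in the park", "Children play outside"),
    ("A man is playing a guitar", "A chef is cooking pasta"),
]
human_scores = [4.2, 4.5, 0.5]

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
a = model.encode([p[0] for p in pairs], normalize_embeddings=True)
b = model.encode([p[1] for p in pairs], normalize_embeddings=True)

# With normalized embeddings, the dot product equals the cosine similarity
model_scores = (a * b).sum(axis=1)
print("Spearman correlation:", spearmanr(model_scores, human_scores).correlation)
```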
Word Analogies
Test analogical reasoning: "A is to B as C is to ?"
Topic Categorization
Test how well models group similar items using k-means clustering.
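A minimal sketch of how the clustering task can be scored, assuming the sentence-transformers and scikit-learn packages; the checkpoint name, texts, and labels are invented for illustration:

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Hypothetical mini-corpus: two topics, labeled 0 (sports) and 1 (cooking)
texts = [
    "The striker scored in the final minute",      # 0
    "Their defense collapsed in the second half",  # 0
    "Whisk the eggs before adding the flour",      # 1
    "Simmer the sauce until it thickens",          # 1
]
labels = [0, 0, 1, 1]

# Embed all sentences with a small, fast model (assumed checkpoint name)
model = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = model.encode(texts, normalize_embeddings=True)

# Cluster the embeddings and compare the clusters to the true topics
pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print("Adjusted Rand Index:", adjusted_rand_score(labels, pred))
```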
How Embedding Models Work
Understanding the transformer architecture behind modern text embeddings.
Sentence Embedding Pipeline
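The pipeline boils down to four steps: tokenize, run the transformer encoder, mean-pool the token vectors, and L2-normalize. A minimal sketch of those steps with Hugging Face transformers; the checkpoint name is an assumption:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

name = "sentence-transformers/all-MiniLM-L6-v2"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

sentences = ["The cat sat on the mat", "A kitten rested on the rug"]

# 1. Tokenization: text -> subword ids plus an attention mask for padding
batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

# 2. Transformer encoder: a contextual vector for every token
with torch.no_grad():
    token_embeddings = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# 3. Mean pooling: average token vectors, ignoring padding positions
mask = batch["attention_mask"].unsqueeze(-1).float()
sentence_embeddings = (token_embeddings * mask).sum(dim=1) / mask.sum(dim=1)

# 4. L2-normalize so that the dot product equals cosine similarity
sentence_embeddings = F.normalize(sentence_embeddings, p=2, dim=1)
print(sentence_embeddings.shape)  # torch.Size([2, 384])
```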
Key Concepts
Tokenization
Text is split into subword units (WordPiece/BPE). "embedding" → ["em", "##bed", "##ding"]. This handles unknown words gracefully.
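A quick way to see the subword splits is to call a WordPiece tokenizer directly; the bert-base-uncased checkpoint here is an assumption for illustration:

```python
from transformers import AutoTokenizer

# WordPiece tokenizer used by BERT-style models (checkpoint name is an assumption)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embedding"))
# e.g. ['em', '##bed', '##ding'] -- words outside the vocabulary fall back to known subwords
```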
Self-Attention
Each token attends to all other tokens, learning contextual relationships. "bank" gets different representations in "river bank" vs "bank account".
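As a rough illustration, the contextual vector of a single token can be pulled out of the encoder and compared across contexts; the checkpoint, helper function, and example sentences are assumptions for the sketch:

```python
import torch
from transformers import AutoTokenizer, AutoModel

name = "bert-base-uncased"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def bank_vector(sentence):
    # Return the contextual vector for the token "bank" in the sentence
    batch = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state[0]  # (seq_len, hidden)
    tokens = tokenizer.convert_ids_to_tokens(batch["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]

a = bank_vector("He sat by the river bank")
b = bank_vector("She opened a bank account")
print(torch.cosine_similarity(a, b, dim=0).item())  # noticeably below 1.0
```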
Mean Pooling
Token vectors are averaged into a single sentence vector, with the attention mask used to exclude padding positions. This captures the overall semantic meaning of the text.
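A toy numpy sketch of that masked mean, mirroring the pooling step in the pipeline sketch above:

```python
import numpy as np

# Toy example: 4 token vectors (the last one is padding) averaged into one sentence vector
token_vecs = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 0.0]])
mask = np.array([1, 1, 1, 0])  # 1 = real token, 0 = padding

sentence_vec = (token_vecs * mask[:, None]).sum(axis=0) / mask.sum()
print(sentence_vec)  # [3. 4.] -- the mean of the three real token vectors
```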
Contrastive Learning
Models are trained to make similar sentences close in vector space and dissimilar sentences far apart.
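A simplified sketch of an in-batch contrastive objective (InfoNCE-style, in the spirit of the multiple-negatives ranking loss used to train many sentence-embedding models); the function name, temperature value, and random inputs are assumptions:

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """InfoNCE-style loss with in-batch negatives (a simplified sketch).

    anchors, positives: (batch, dim) embeddings where anchors[i] and positives[i]
    form a similar pair; every other row in the batch acts as a negative.
    """
    anchors = F.normalize(anchors, dim=1)
    positives = F.normalize(positives, dim=1)
    # Similarity of every anchor to every positive in the batch
    scores = anchors @ positives.T / temperature   # (batch, batch)
    targets = torch.arange(scores.size(0))         # matching pairs sit on the diagonal
    # Cross-entropy pulls matching pairs together and pushes the rest apart
    return F.cross_entropy(scores, targets)

# Toy usage with random vectors
loss = in_batch_contrastive_loss(torch.randn(8, 384), torch.randn(8, 384))
print(loss.item())
```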
Model Specifications
| Model | Dimensions | Parameters | Layers | Best For |
|---|---|---|---|---|
| MiniLM-L6-v2 | 384 | 22M | 6 | Fast inference, real-time apps |
| MPNet-base-v2 | 768 | 110M | 12 | High quality, semantic search |
| Multilingual-MiniLM | 384 | 118M | 12 | 50+ languages, cross-lingual |
| BGE-small-en | 384 | 33M | 12 | Retrieval, RAG systems |
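A quick sanity check of the output dimensions, assuming the usual sentence-transformers and BAAI checkpoint names for the models in the table:

```python
from sentence_transformers import SentenceTransformer

# Checkpoint names are assumptions based on the table above
for name in ["all-MiniLM-L6-v2", "all-mpnet-base-v2",
             "paraphrase-multilingual-MiniLM-L12-v2", "BAAI/bge-small-en-v1.5"]:
    model = SentenceTransformer(name)
    print(name, model.get_sentence_embedding_dimension())  # 384 / 768 / 384 / 384
```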
Similarity Computation
Cosine Similarity:
Measures the cosine of the angle between two vectors: cos(θ) = (A · B) / (||A|| ||B||). Values range from -1 (opposite directions) to 1 (same direction).
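A small self-contained sketch of the computation:

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(cosine_similarity(a, b))   #  1.0 -- same direction
print(cosine_similarity(a, -b))  # -1.0 -- opposite direction
```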
Word Analogy (Vector Arithmetic):
Semantic relationships are encoded as vector offsets, so adding and subtracting vectors can complete analogies: king - man + woman ≈ queen.
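A toy sketch of analogy solving by vector arithmetic; the vectors below are invented for illustration, real systems use learned word embeddings:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def solve_analogy(vectors, a, b, c):
    """Return the word whose vector is closest to b - a + c ("a is to b as c is to ?")."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

# Made-up 3-dimensional vectors purely for illustration
vectors = {
    "man":    np.array([1.0, 0.0, 0.2]),
    "king":   np.array([1.0, 1.0, 0.2]),
    "woman":  np.array([0.0, 0.0, 0.8]),
    "queen":  np.array([0.0, 1.0, 0.8]),
    "prince": np.array([1.0, 0.8, 0.1]),
}
print(solve_analogy(vectors, "man", "king", "woman"))  # queen
```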