Embeddings Comparison Lab

Compare and benchmark multiple embedding models across various NLP tasks


Benchmark Tasks

Semantic Similarity

Test how well models capture semantic similarity between sentence pairs.

Word Analogies

Test analogical reasoning: "A is to B as C is to ?"


Topic Categorization

Test how well models group similar items using k-means clustering.
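
The categorization task above can be sketched with a minimal k-means implementation. The toy 2-D "embeddings" and the first-k-points initialisation are illustrative assumptions; real runs would cluster high-dimensional sentence vectors.

```python
import numpy as np

def kmeans(X, k, iters=20):
    """Plain k-means: assign each vector to its nearest centroid, then move
    each centroid to the mean of its assigned vectors. Initialisation here is
    simply the first k points, which is enough for a toy demo."""
    centroids = X[:k].copy()
    for _ in range(iters):
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Toy "embeddings": two well-separated groups in 2-D.
X = np.array([[0.0, 0.1], [0.9, 1.0], [0.1, 0.0], [1.0, 0.9]])
labels = kmeans(X, k=2)  # points 0 & 2 share a cluster, as do 1 & 3
```

In the lab, the clustering quality of each model can then be scored by how often items from the same topic land in the same cluster.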

How Embedding Models Work

Understanding the transformer architecture behind modern text embeddings.

Sentence Embedding Pipeline

Input Text: "The cat sat on the mat"
  ↓ Tokenizer → ["The", "cat", "sat", "on", "the", "mat"]
  ↓ Transformer Encoder → token embeddings → self-attention (×N layers) → feed-forward network
  ↓ Pooling → mean of all token vectors
Sentence Embedding: [0.12, -0.45, 0.78, ...] (384 or 768 dims)
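
The pipeline above can be mimicked end to end with a toy stand-in: the whitespace "tokenizer" and the hash-seeded pseudo-embeddings are hypothetical placeholders for the real tokenizer and transformer layers, but the shape of the computation (tokens → token vectors → mean pooling → one fixed-size sentence vector) is the same.

```python
import zlib
import numpy as np

DIM = 384  # MiniLM-sized output, matching the diagram above

def token_embedding(token, dim=DIM):
    # Deterministic pseudo-embedding per token (illustrative only); a real
    # model looks up learned vectors and runs them through transformer layers.
    rng = np.random.default_rng(zlib.crc32(token.encode("utf-8")))
    return rng.standard_normal(dim)

def embed_sentence(text):
    tokens = text.lower().split()                      # stand-in tokenizer
    token_vecs = np.stack([token_embedding(t) for t in tokens])
    return token_vecs.mean(axis=0)                     # mean pooling

vec = embed_sentence("The cat sat on the mat")
# vec.shape == (384,): one fixed-size vector regardless of sentence length
```

The key property to notice is that sentences of any length collapse to the same dimensionality, which is what makes vector comparison between arbitrary texts possible.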

Key Concepts

Tokenization

Text is split into subword units (WordPiece/BPE). "embedding" → ["em", "##bed", "##ding"]. This handles unknown words gracefully.
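
The split shown above can be reproduced with a greedy longest-match routine in the spirit of WordPiece. The tiny vocabulary here is a hypothetical stand-in; real models ship vocabularies of roughly 30k pieces.

```python
# Hypothetical mini-vocabulary; "##" marks a continuation piece.
VOCAB = {"em", "##bed", "##ding", "[UNK]"}

def wordpiece(word, vocab=VOCAB):
    """Greedily take the longest vocabulary piece at each position."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate substring until it matches a vocab entry.
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        if end == start:              # nothing matched -> unknown token
            return ["[UNK]"]
        start = end
    return pieces

wordpiece("embedding")  # -> ['em', '##bed', '##ding']
```

Because any unseen word either decomposes into known pieces or falls back to `[UNK]`, the model never has to reject input outright.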

Self-Attention

Each token attends to all other tokens, learning contextual relationships. "bank" gets different representations in "river bank" vs "bank account".
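
Single-head scaled dot-product attention, the mechanism described above, can be sketched in a few lines of numpy. The random projection matrices stand in for learned weights.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product attention over token vectors X of shape (n, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # every token scores every token
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ V                       # context-mixed representations

rng = np.random.default_rng(0)
n, d = 6, 8                                  # 6 tokens, toy dimension 8
X = rng.standard_normal((n, d))
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)          # shape (6, 8)
```

Because each output row is a weighted mix of all value vectors, the same surface token ("bank") ends up with different representations in different contexts.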

Mean Pooling

Token vectors are averaged to create a single sentence vector. This captures the overall semantic meaning of the text.
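
In practice the average usually respects an attention mask so that padding tokens are excluded, which a short sketch makes concrete:

```python
import numpy as np

def mean_pool(token_vecs, mask):
    """Average only the real tokens; padded positions (mask == 0) are ignored."""
    mask = np.asarray(mask, dtype=float)[:, None]   # shape (n_tokens, 1)
    return (token_vecs * mask).sum(axis=0) / mask.sum()

vecs = np.array([[1.0, 2.0],
                 [3.0, 4.0],
                 [0.0, 0.0]])                       # last row is padding
v = mean_pool(vecs, mask=[1, 1, 0])
# v == [2.0, 3.0]: the mean of the two real token vectors only
```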

Contrastive Learning

Models are trained to make similar sentences close in vector space and dissimilar sentences far apart.
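
The training objective can be sketched as an InfoNCE-style contrastive loss: each anchor sentence should be most similar to its own paired positive, with the other positives in the batch serving as in-batch negatives. The temperature value is an illustrative default, not taken from any specific model.

```python
import numpy as np

def info_nce(anchors, positives, temperature=0.05):
    """Contrastive loss over a batch: row i of `anchors` should match
    row i of `positives`; every other row acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                 # scaled cosine similarities
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))            # correct match is column i
```

Minimising this loss pulls matched pairs together and pushes mismatched pairs apart, which is exactly the geometry the similarity tasks above rely on.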

Model Specifications

Model                Dimensions  Parameters  Layers  Best For
MiniLM-L6-v2         384         22M         6       Fast inference, real-time apps
MPNet-base-v2        768         110M        12      High quality, semantic search
Multilingual-MiniLM  384         118M        12      50+ languages, cross-lingual
BGE-small-en         384         33M         6       Retrieval, RAG systems

Similarity Computation

Cosine Similarity:

cos(A, B) = (A · B) / (||A|| × ||B||)

Measures the angle between two vectors; values range from -1 (opposite directions) to 1 (identical direction), with 0 meaning the vectors are orthogonal.
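
The formula translates directly into code:

```python
import numpy as np

def cosine(a, b):
    """cos(A, B) = (A · B) / (||A|| * ||B||)"""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cosine([1, 0], [1, 0])   # ->  1.0 (same direction)
cosine([1, 0], [0, 1])   # ->  0.0 (orthogonal)
cosine([1, 0], [-1, 0])  # -> -1.0 (opposite)
```

Because it depends only on direction, cosine similarity ignores vector magnitude, which is why embeddings are often compared this way rather than with raw Euclidean distance.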

Word Analogy (Vector Arithmetic):

king - man + woman ≈ queen

Semantic relationships are encoded as vector differences, so adding and subtracting embedding vectors can solve analogies.
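
The king/queen example can be demonstrated with hand-built toy embeddings. The 3-D vectors below are hypothetical, constructed so that "royalty" and gender occupy separate dimensions; real models learn hundreds of entangled dimensions from data.

```python
import numpy as np

# Hypothetical embeddings: dim 0 ~ royalty, dim 1 ~ male, dim 2 ~ female.
emb = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
    "apple": np.array([0.1, 0.5, 0.4]),   # unrelated distractor
}

def analogy(a, b, c, emb=emb):
    """Answer 'a is to b as c is to ?' via emb[b] - emb[a] + emb[c],
    returning the nearest word that is not one of the inputs."""
    target = emb[b] - emb[a] + emb[c]
    def cos(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))
    candidates = {w: v for w, v in emb.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cos(candidates[w], target))

analogy("man", "king", "woman")  # -> 'queen'
```

Excluding the three input words from the candidates is standard practice, since the nearest neighbour of the target vector is frequently one of the inputs themselves.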