Assignment 3: Representing Meaning - A Computational Exploration of Semantic Space
Overview
"You shall know a word by the company it keeps." - J.R. Firth, 1957
In this assignment, you will embark on a deep exploration of how machines represent meaning. Working with 250,000 Wikipedia articles, you will implement and compare methods spanning five decades of computational linguistics and natural language processing - from classical statistical techniques to modern large language models.
This assignment asks fundamental questions: What does it mean to "understand" the meaning of a document? How do different computational approaches capture semantic relationships? What aspects of human semantic knowledge can be modeled through distributional representations? Which methods best capture the conceptual structure of human knowledge as encoded in Wikipedia?
You will implement ~10 different embedding methods, perform sophisticated quantitative and qualitative analyses, create beautiful interactive visualizations, and connect your findings to theories of meaning in cognitive science and linguistics. This is a substantial, 2-week assignment that will deepen your understanding of how we represent and compute with meaning.
Dataset
You can automatically download the dataset from Dropbox if it doesn't already exist in your working directory. The following code will handle downloading the dataset, checking if it's present, and loading it into your notebook:
```python
import os
import urllib.request
import pickle

# Define the file name and URL
dataset_url = 'https://www.dropbox.com/s/v4juxkc5v2rd0xr/wikipedia.pkl?dl=1'
dataset_path = 'wikipedia.pkl'

# Download the dataset if it doesn't exist
if not os.path.exists(dataset_path):
    print("Downloading dataset...")
    urllib.request.urlretrieve(dataset_url, dataset_path)
    print("Download complete.")

# Load the dataset
with open(dataset_path, 'rb') as f:
    wikipedia = pickle.load(f)
```
The dataset is formatted as a list of dictionary (dict) objects, each with the following keys/values:
- 'title': The title of the article (string).
- 'text': The full text of the article (string).
- 'id': A unique identifier for the article (string).
- 'url': The link to the Wikipedia page (string).
Learning Objectives
By completing this assignment, you will:
- Understand the evolution of semantic representation from classical to modern NLP
- Implement and compare traditional, neural, and LLM-based embedding methods
- Develop expertise in clustering evaluation and unsupervised learning
- Connect computational methods to cognitive theories of semantic memory
- Create publication-quality visualizations of high-dimensional semantic spaces
- Think critically about what different methods capture (and miss) about meaning
Part 1: Implementing the Embedding Zoo (40 points)
Task 1.1: Classical Statistical Methods (8 points)
Implement two foundational approaches from classical NLP:
Latent Semantic Analysis (LSA)
- Use TF-IDF followed by truncated SVD (dimensionality reduction)
- Implementation: `sklearn.feature_extraction.text.TfidfVectorizer` + `sklearn.decomposition.TruncatedSVD` (see the sketch at the end of this task)
- Reduce to 300 dimensions
- Resources: Original LSA paper (Deerwester et al., 1990)

Latent Dirichlet Allocation (LDA)
- Classic probabilistic topic model
- Implementation: `sklearn.decomposition.LatentDirichletAllocation`
- Use 50-100 topics (experiment with different values)
- Resources: Original LDA paper (Blei et al., 2003)

Questions to consider:
- What do the LSA dimensions represent? What about LDA topics?
- How interpretable are the resulting representations?
- Do these methods capture syntactic or semantic relationships better?
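A minimal sketch of both pipelines (assuming the `wikipedia` list from the loading code above; `max_features`, the topic count, and other hyperparameters are placeholder values you should tune):

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

texts = [article['text'] for article in wikipedia]  # `wikipedia` from the loading snippet above

# LSA: TF-IDF followed by truncated SVD down to 300 dimensions
tfidf = TfidfVectorizer(max_features=50_000, stop_words='english')
X_tfidf = tfidf.fit_transform(texts)
lsa_embeddings = TruncatedSVD(n_components=300, random_state=0).fit_transform(X_tfidf)  # (n_articles, 300)

# LDA: raw term counts, then a 75-topic model; topic proportions serve as document embeddings
counts = CountVectorizer(max_features=50_000, stop_words='english').fit_transform(texts)
lda_embeddings = LatentDirichletAllocation(n_components=75, random_state=0).fit_transform(counts)  # (n_articles, 75)
```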
Task 1.2: Static Word Embeddings (8 points)
Implement three influential neural word embedding methods, aggregating word vectors to create document embeddings:
Word2Vec
- Use pre-trained `word2vec-google-news-300` from `gensim`
- Aggregate word vectors (try: mean, TF-IDF weighted mean, max pooling)
- Resources: Efficient Estimation of Word Representations (Mikolov et al., 2013)

GloVe
- Use pre-trained `glove-wiki-gigaword-300` from `gensim`
- Aggregate word vectors using the same methods as for Word2Vec
- Resources: GloVe: Global Vectors (Pennington et al., 2014)

FastText
- Use pre-trained `fasttext-wiki-news-subwords-300` from `gensim`
- Aggregate word vectors; FastText handles OOV words through subword embeddings
- Resources: Enriching Word Vectors with Subword Information (Bojanowski et al., 2017)

Implementation notes (see the sketch at the end of this task):
- For each method, experiment with different aggregation strategies
- Compare simple averaging vs. TF-IDF weighted averaging
- Handle articles longer than typical context windows appropriately
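One possible starting point for the simplest aggregation strategy, an unweighted mean over in-vocabulary tokens; the whitespace tokenizer and the fixed 300-dimension fallback are simplifying assumptions:

```python
import numpy as np
import gensim.downloader as api

w2v = api.load('word2vec-google-news-300')  # swap in 'glove-wiki-gigaword-300' or 'fasttext-wiki-news-subwords-300'

def mean_embedding(text, keyed_vectors, dim=300):
    """Average the vectors of all in-vocabulary tokens; return a zero vector if none are found."""
    tokens = text.lower().split()
    vectors = [keyed_vectors[t] for t in tokens if t in keyed_vectors]
    return np.mean(vectors, axis=0) if vectors else np.zeros(dim)

w2v_doc_embeddings = np.vstack([mean_embedding(a['text'], w2v) for a in wikipedia])
```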
Task 1.3: Contextualized Embeddings (8 points)
Implement transformer-based embeddings that capture context:
BERT
- Use `bert-base-uncased` from Hugging Face Transformers
- Extract embeddings from the [CLS] token or mean pool over all tokens (see the sketch at the end of this task)
- Handle long documents with sliding windows or truncation
- Resources: BERT: Pre-training of Deep Bidirectional Transformers (Devlin et al., 2019)

GPT-2
- Use `gpt2` from Hugging Face Transformers
- Extract embeddings from the final token or mean pool
- Compare different pooling strategies
- Resources: Language Models are Unsupervised Multitask Learners (Radford et al., 2019)

Questions to consider:
- How do contextualized embeddings differ from static ones?
- What information do [CLS] tokens capture?
- How does truncation affect representation quality?
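A minimal sketch of [CLS] vs. mean pooling for BERT; it simply truncates each article to 512 tokens, so you will still want one of the long-document strategies above for full articles:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
bert = AutoModel.from_pretrained('bert-base-uncased').eval()

@torch.no_grad()
def bert_embed(text, pooling='cls'):
    """Embed one document, truncated to BERT's 512-token window."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors='pt')
    hidden = bert(**inputs).last_hidden_state        # shape: (1, seq_len, 768)
    if pooling == 'cls':
        return hidden[:, 0, :].squeeze(0).numpy()    # [CLS] token embedding
    return hidden.mean(dim=1).squeeze(0).numpy()     # mean pooling over all tokens
```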
Task 1.4: Modern Sentence/Document Embeddings (8 points)
Implement state-of-the-art embedding methods designed specifically for sentences and documents:
Sentence Transformers
- Use `all-MiniLM-L6-v2` or `all-mpnet-base-v2` from `sentence-transformers` (see the sketch at the end of this task)
- These models are fine-tuned specifically for semantic similarity
- Resources: Sentence-BERT (Reimers & Gurevych, 2019)

LLM-Based Embeddings
- Use Llama 3 8B via `llm2vec` or a similar embedding adapter
- Alternative: use Llama 3 hidden states directly
- Resources: LLM2Vec paper

Questions to consider:
- Why are sentence transformers better for semantic similarity?
- How do LLM embeddings compare to smaller, specialized models?
- What's the trade-off between model size and embedding quality?
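A minimal sketch for the sentence-transformers route; `encode()` handles batching and tokenization, and very long articles are truncated to the model's own context window:

```python
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer('all-MiniLM-L6-v2')
texts = [a['text'] for a in wikipedia]
sbert_embeddings = st_model.encode(texts, batch_size=64, show_progress_bar=True,
                                   normalize_embeddings=True)  # shape: (n_articles, 384)
```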
Task 1.5: Modern Topic Models (8 points)
Implement neural topic models that combine traditional topic modeling with modern embeddings:
BERTopic
- Uses BERT embeddings + UMAP + HDBSCAN for topic discovery
- Implementation: `bertopic` library (see the sketch at the end of this task)
- Analyze discovered topics and their coherence
- Resources: BERTopic documentation

Top2Vec
- Jointly learns document and word embeddings with topic vectors
- Implementation: `top2vec` library
- Compare topic quality with BERTopic and classical LDA
- Resources: Top2Vec paper (Angelov, 2020)

Questions to consider:
- How do neural topic models differ from LDA?
- Are the discovered topics more coherent? More diverse?
- Can you validate topics against known Wikipedia categories?
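A minimal sketch of fitting BERTopic with its defaults (sentence-transformer embeddings, UMAP, HDBSCAN); `min_topic_size=50` is just a starting value for a corpus this large:

```python
from bertopic import BERTopic

texts = [a['text'] for a in wikipedia]
topic_model = BERTopic(min_topic_size=50, verbose=True)   # defaults: sentence-transformer + UMAP + HDBSCAN
topics, probs = topic_model.fit_transform(texts)
print(topic_model.get_topic_info().head(20))              # largest topics with their keyword summaries
```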
Deliverables for Part 1:
- Implementation of all 10+ embedding methods
- For each method: clear documentation of hyperparameters chosen
- Analysis of computational cost (time, memory) for each method
- Embeddings saved in a consistent format for downstream analysis
Part 2: Sophisticated Evaluation and Analysis (30 points)
Task 2.1: Clustering with Multiple Algorithms (10 points)
For each embedding method, apply multiple clustering algorithms:
K-Means Clustering
- Determine the optimal k using: elbow method, silhouette analysis, gap statistic (see the sketch at the end of this task)
- Compare different initialization strategies (k-means++, random)
- Implementation: `sklearn.cluster.KMeans`

Hierarchical Clustering
- Use agglomerative clustering with different linkage methods (ward, average, complete)
- Create dendrograms to visualize cluster relationships
- Implementation: `sklearn.cluster.AgglomerativeClustering`, `scipy.cluster.hierarchy`

Density-Based Clustering
- Automatically discover clusters without specifying k
- Handle noise and outliers
- Implementation: `sklearn.cluster.DBSCAN`, `hdbscan`

Deliverables:
- Systematic comparison of clustering results across methods
- Justification for final clustering choices for each embedding type
- Analysis: Do different embeddings suggest different optimal numbers of clusters?
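A minimal sketch of silhouette-based selection of k for k-means, assuming `embeddings` is an (n_articles × d) NumPy array for one method; the range of k values and the subsample size are arbitrary choices:

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

for k in range(5, 55, 5):
    labels = KMeans(n_clusters=k, init='k-means++', n_init=10, random_state=0).fit_predict(embeddings)
    # Subsample the silhouette computation so it stays tractable on 250K articles
    score = silhouette_score(embeddings, labels, metric='cosine', sample_size=10_000, random_state=0)
    print(f'k={k}: silhouette={score:.3f}')
```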
Task 2.2: Quantitative Evaluation Metrics (8 points)
Implement comprehensive metrics to evaluate embedding and clustering quality:
Clustering Quality Metrics
- Silhouette Score: measures cluster cohesion and separation
- Davies-Bouldin Index: ratio of within-cluster to between-cluster distances
- Calinski-Harabasz Index: ratio of between-cluster to within-cluster variance
- Dunn Index: ratio of minimum inter-cluster to maximum intra-cluster distance

Embedding Quality Metrics
- Intrinsic dimensionality: estimate effective dimensionality using PCA explained variance
- Local structure preservation: compare nearest neighbors before/after embedding
- Global structure preservation: correlation between distance matrices
- Isotropy: measure how uniformly embeddings fill the space

Semantic Coherence Metrics
- Topic coherence (for topic models): PMI, Cv, UMass coherence scores
- Within-cluster semantic similarity: average cosine similarity within clusters
- Between-cluster semantic distance: separation of cluster centroids

Deliverables:
- Table comparing all embedding methods across all metrics (see the starter sketch below)
- Statistical significance testing (e.g., bootstrap confidence intervals)
- Radar/spider plots showing relative strengths of different methods
- Discussion: Which metrics matter most for this task?
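A minimal sketch computing three of these metrics for one embedding/clustering pair (`embeddings` and `labels` as above); the Dunn index and the embedding-quality measures are not built into scikit-learn and would need separate implementations:

```python
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score

def clustering_metrics(embeddings, labels):
    """Return standard clustering-quality metrics (higher silhouette/CH and lower DB are better)."""
    return {
        'silhouette': silhouette_score(embeddings, labels, sample_size=10_000, random_state=0),
        'davies_bouldin': davies_bouldin_score(embeddings, labels),
        'calinski_harabasz': calinski_harabasz_score(embeddings, labels),
    }
```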
Task 2.3: Qualitative Analysis (6 points)
Go beyond numbers to understand what each method captures:
Cluster Interpretation
- For each embedding method, examine the top clusters
- Sample representative articles from each cluster
- Use Llama 3 via Ollama to generate descriptive labels for clusters
- Compare cluster interpretability across methods

Error Analysis
- Identify articles that are consistently mis-clustered across methods
- Find articles where different methods disagree strongly
- Analyze: what makes these articles difficult?

Semantic Neighborhoods
- Use nearest neighbor search to find similar articles (see the sketch at the end of this task)
- Compare: do different methods find different "neighbors"?
- Analyze specific examples (e.g., the "Machine Learning" article: what's nearby?)

Analogy and Relationship Tests
- Test whether embeddings support analogies (e.g., "Paris:France :: London:?")
- Compare methods on capturing different relationship types (geographic, categorical, temporal)

Deliverables:
- Cluster labels for the top 20 clusters from the best-performing methods
- Case studies of interesting articles/clusters
- Analysis of what different embedding types capture (syntax vs. semantics, topics vs. style, etc.)
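A minimal sketch of a semantic-neighborhood query; `nearest_articles` is a hypothetical helper, and the title lookup assumes an exact match in the dataset:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def nearest_articles(title, embeddings, articles, k=10):
    """Return the k nearest articles (by cosine similarity) to the article with the given title."""
    idx = next(i for i, a in enumerate(articles) if a['title'] == title)
    sims = cosine_similarity(embeddings[idx:idx + 1], embeddings).ravel()
    neighbors = np.argsort(-sims)[1:k + 1]            # skip the query article itself
    return [(articles[i]['title'], float(sims[i])) for i in neighbors]

print(nearest_articles('Machine learning', sbert_embeddings, wikipedia))
```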
Task 2.4: Cross-Method Comparison (6 points)
Directly compare embedding methods:
Embedding Similarity Analysis
- Compute correlations between the distance matrices of different methods (see the sketch at the end of this task)
- Use a Mantel test for significance
- Create a similarity matrix: which methods produce the most similar embeddings?

Cluster Agreement Analysis
- Where do different methods agree on cluster assignments?
- Use a co-association matrix to find robust clusters
- Identify method-specific vs. universal structure

Efficiency Analysis
- Plot quality metrics vs. computational cost (time, memory, CO2)
- Identify the "Pareto frontier" of methods
- Recommendation: which method for which use case?

Deliverables:
- Comprehensive comparison table/visualization
- Discussion of method families (classical vs. neural vs. LLM)
- Recommendations with justification
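A minimal sketch of the distance-matrix comparison on a random subsample; `embeddings_a` and `embeddings_b` stand for two methods' embedding arrays, and a full Mantel test would add a permutation-based significance step on top of this correlation:

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
idx = rng.choice(len(wikipedia), size=2_000, replace=False)   # subsample: a full 250K x 250K distance matrix won't fit
d_a = pdist(embeddings_a[idx], metric='cosine')               # condensed pairwise distances, method A
d_b = pdist(embeddings_b[idx], metric='cosine')               # same article pairs, method B
print(spearmanr(d_a, d_b))
```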
Part 3: Advanced Clustering and Visualization (15 points)
Task 3.1: Multi-Level Clustering (5 points)
Explore hierarchical structure in the Wikipedia knowledge space:
Hierarchical Clustering Analysis
- Create multi-level hierarchies (e.g., Science → Physics → Quantum Mechanics)
- Use recursive clustering or hierarchical agglomerative methods
- Validate against Wikipedia's actual category structure (if available)

Dendrogram Visualization
- Create informative dendrograms showing cluster relationships (see the sketch at the end of this task)
- Color-code by major categories
- Interactive dendrograms with Plotly

Deliverables:
- Multi-level cluster hierarchy for the best-performing method
- Visualization of cluster relationships
- Analysis: Does the automatic hierarchy match human organization?
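A minimal sketch of a Ward-linkage dendrogram built over cluster centroids rather than all 250K articles (assuming `embeddings` and a NumPy array of cluster `labels` from Part 2):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

cluster_ids = np.unique(labels)
centroids = np.vstack([embeddings[labels == c].mean(axis=0) for c in cluster_ids])
Z = linkage(centroids, method='ward')                          # Ward linkage over cluster centroids
dendrogram(Z, labels=[f'cluster {c}' for c in cluster_ids])
plt.tight_layout()
plt.show()
```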
Task 3.2: Interactive Visualization (10 points)
Create publication-quality, interactive visualizations:
Dimensionality Reduction for Visualization
- Apply both UMAP and t-SNE to reduce embeddings to 2D and 3D
- Compare: how do the visualizations differ? Which preserves structure better?
- Experiment with hyperparameters (perplexity for t-SNE, n_neighbors for UMAP)
For each major embedding method, create:
- 3D Interactive Scatter Plot
- Each point is an article
- Color by cluster assignment
- Size by article length or importance (e.g., page views)
- Hover: show article title, cluster label, and snippet
- Enable rotation, zoom, selection
- 2D Hexbin Density Plot
- Show density of articles in embedding space
- Overlay cluster boundaries
- Interactive region selection
- Cluster Comparison View
- Side-by-side comparison of same data in different embedding spaces
- Linked selections (select in one, highlight in others)
- Show how cluster assignments change
Additional visualization ideas:
- Embedding space "map" with topic regions labeled
- Trajectory visualization (if exploring temporal data)
- Network graph of nearest neighbors
- Confusion matrix of cluster assignments across methods
Deliverables (see the UMAP + Plotly sketch after this list):
- At least 5 high-quality interactive Plotly visualizations
- Comparison of t-SNE vs. UMAP for this dataset
- Insights discovered through visualization
- Embedded visualizations in notebook with clear captions
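A minimal sketch of the 3D interactive scatter plot: UMAP to three dimensions, then Plotly Express. The column names and marker settings are arbitrary choices, and you may want to downsample before plotting all 250K points:

```python
import pandas as pd
import plotly.express as px
import umap

coords = umap.UMAP(n_components=3, n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(embeddings)
df = pd.DataFrame(coords, columns=['x', 'y', 'z'])
df['title'] = [a['title'] for a in wikipedia]
df['cluster'] = [str(c) for c in labels]                  # string labels give a discrete color scale

fig = px.scatter_3d(df, x='x', y='y', z='z', color='cluster', hover_name='title', opacity=0.6)
fig.update_traces(marker_size=2)                          # small markers keep a large point cloud readable
fig.show()
```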
Part 4: Cognitive Science Connection (10 points)
Task 4.1: Distributional Semantics Theory (4 points)
Connect your computational work to theories of meaning:
Theoretical Foundations
- Explain the distributional hypothesis and its cognitive plausibility
- Discuss: How do computational embeddings relate to human semantic memory?
- Compare to theories: semantic networks, feature-based semantics, prototype theory

Empirical Comparison to Human Judgments
- Compare embedding similarities to human similarity judgments (see the sketch at the end of this task)
- Use datasets like SimLex-999, WordSim-353, or create your own
- Compute the correlation between embedding cosine similarity and human ratings
- Compare cluster structure to human categorization
- Do the discovered clusters match human category boundaries?

Deliverables:
- Essay-style section (2-3 pages) connecting your work to cognitive science
- Quantitative comparison to human judgments
- Discussion: What do embeddings capture about human semantics? What do they miss?
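A minimal sketch of the comparison to human ratings, assuming `wordsim` is a list of (word1, word2, rating) tuples you have loaded from WordSim-353 (or a similar dataset) and `w2v` is a word-vector model from Part 1:

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

pairs = [(w1, w2, score) for w1, w2, score in wordsim if w1 in w2v and w2 in w2v]
model_sims = [cosine(w2v[w1], w2v[w2]) for w1, w2, _ in pairs]
human_sims = [score for _, _, score in pairs]
print(spearmanr(model_sims, human_sims))                  # rank correlation with human ratings
```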
Task 4.2: What Is Meaning? (6 points)
Critically analyze what different methods capture:
Philosophical Analysis
- What notion of "meaning" does each method operationalize?
- Distinguish: sense vs. reference, intension vs. extension, connotation vs. denotation
- Discuss: Can meaning be reduced to distribution? What's missing?

Method Comparison Through Examples
- Show concrete examples where methods disagree
- Analyze: When is LSA better than BERT? When is LDA better than BERTopic?
- Discuss compositionality: how well do methods handle phrases and documents?

Critical Considerations
- What biases are baked into different embedding methods?
- How do corpus choice and size affect representations?
- Discuss the "Chinese Room" argument in the context of embeddings

Deliverables:
- Thoughtful analysis (2-3 pages) of what "meaning" means in your models
- Case studies showing different methods' strengths/weaknesses
- Reflection on the limits of distributional semantics
Part 5: Advanced Extensions and Applications (5 points)
Choose at least ONE of the following extensions:
Option A: Cross-Lingual Embeddings
Implementation:
- Use multilingual models (mBERT, XLM-R, LaBSE)
- Compare the same concepts across languages
- Test: Do "Machine Learning" in English and "Apprentissage Automatique" in French cluster together?

Analysis:
- Evaluate cross-lingual alignment quality
- Identify language-specific vs. universal concepts
- Application: cross-lingual information retrieval
Option B: Temporal Analysis
Implementation:
- If the dataset has timestamps, analyze how topics evolve over time
- Use dynamic topic models or time-sliced embeddings
- Track: How has "AI" or "Climate Change" evolved in Wikipedia?

Analysis:
- Visualize concept drift over time
- Identify emerging vs. declining topics
- Connect to real-world events
Option C: Practical Applications
Implement at least one real-world application:
Semantic Search Engine
- Given a query, find the most relevant articles using embedding similarity
- Compare search quality across embedding methods
- Evaluation: NDCG, precision@k, user study

Article Recommendation
- "Articles similar to this one" feature
- Diversity-aware recommendations (not all from the same cluster)
- Evaluation: coverage, diversity, serendipity

Automatic Portal Generation
- Use LLMs to generate cluster descriptions
- Create Wikipedia "portals" automatically
- Evaluation: human judgment of quality

Knowledge Graph Construction
- Extract relationships between articles from the embedding space
- Build a graph: nodes = articles, edges = strong semantic similarity
- Analysis: community detection, centrality measures

Deliverables (for whichever extension you choose):
- Full implementation of the chosen extension
- Evaluation demonstrating its value
- Discussion of how it enhances understanding of semantic representations
Submission Guidelines
GitHub Classroom Submission
This assignment is submitted via GitHub Classroom. Follow these steps:
- Accept the assignment: Click the assignment link provided in Canvas or by your instructor
- Repository: github.com/ContextLab/embeddings-llm-course
- This creates your own private repository for the assignment
- Clone your repository:
git clone https://github.com/ContextLab/embeddings-llm-course-YOUR_USERNAME.git
- Complete your work:
- Work in Google Colab, Jupyter, or your preferred environment
- Save your notebook to the repository
- Commit and push your changes:
git add .
git commit -m "Complete Wikipedia embeddings assignment"
git push
- Verify submission: Check that your latest commit appears in your GitHub repository before the deadline
Notebook Requirements
Submit a Google Colaboratory notebook (or Jupyter notebook) that includes:
Technical Requirements
- Reproducibility
- All code necessary to download datasets and models
- Clear installation instructions for required packages
- Random seeds set for reproducibility
- Must run in Google Colab with GPU runtime (T4 or better)
- Estimated runtime: 2-4 hours for full notebook
- Organization
- Clear section headers matching assignment parts
- Table of contents with navigation links
- Markdown cells explaining approach, decisions, and insights
- Code comments for complex operations
- Summary sections after each major part
- Outputs
- All visualizations embedded in notebook
- Tables and metrics clearly formatted
- Long outputs (model training) can be summarized
- Save embeddings to files to avoid recomputation
- Writing Quality
- Clear, concise explanations
- Proper citations for papers and methods
- Academic writing style for analysis sections
- Proofread for grammar and clarity
What to Submit
- Primary Deliverable: Google Colab notebook link (ensure sharing is enabled)
- Optional: Saved embeddings and models (via Google Drive link)
- Optional: Standalone HTML export of notebook with all outputs
Collaboration Policy
- You may discuss high-level approaches with classmates
- You may use GenAI assistance (ChatGPT, Claude, GitHub Copilot)
- Must document what assistance you used and how
- Must understand and be able to explain all code
- All analysis, writing, and insights must be your own
- Cite any external code or ideas used
Grading Rubric (100 points total)
Part 1: Implementation (40 points)
| Component | Points | Criteria |
|---|---|---|
| Classical Methods (LSA, LDA) | 8 | Correct implementation, reasonable hyperparameters, working embeddings |
| Static Embeddings (Word2Vec, GloVe, FastText) | 8 | Proper aggregation strategies, handling of OOV words, documented choices |
| Contextualized Embeddings (BERT, GPT-2) | 8 | Appropriate pooling, handling of long documents, clear methodology |
| Modern Embeddings (Sentence-BERT, Llama) | 8 | Correct model usage, comparison of approaches, quality embeddings |
| Topic Models (BERTopic, Top2Vec) | 8 | Proper configuration, coherent topics, analysis of topic quality |
- Full credit requires all methods working with reasonable quality
- Partial credit for methods with minor issues or limited analysis
- Bonus points (up to +5) for additional methods or particularly elegant implementations
Part 2: Evaluation and Analysis (30 points)
| Component | Points | Criteria |
|---|---|---|
| Clustering Algorithms | 10 | Multiple algorithms implemented, systematic comparison, justified choices |
| Quantitative Metrics | 8 | Comprehensive metrics, correct implementation, statistical rigor |
| Qualitative Analysis | 6 | Thoughtful interpretation, case studies, error analysis |
| Cross-Method Comparison | 6 | Direct comparisons, correlation analysis, actionable insights |
- Quality of analysis matters more than quantity
- Must go beyond surface-level observations
- Statistical significance testing required for comparisons
Part 3: Visualization (15 points)
| Component | Points | Criteria |
|---|---|---|
| Multi-Level Clustering | 5 | Clear hierarchy, validated against structure, insightful analysis |
| Interactive Visualizations | 10 | High-quality Plotly plots, appropriate techniques, informative and beautiful |
- Visualizations must be publication-quality
- Interactivity should enhance understanding
- Captions and explanations required
- Bonus points (up to +3) for particularly creative or insightful visualizations
Part 4: Cognitive Science Connection (10 points)
| Component | Points | Criteria |
|---|---|---|
| Theoretical Connection | 4 | Clear explanation of distributional semantics, connection to cognitive science |
| Critical Analysis | 6 | Thoughtful discussion of meaning, limitations, philosophical depth |
- Must engage seriously with cognitive science literature
- Superficial treatment will not receive full credit
- Depth and nuance valued over length
Part 5: Advanced Extensions (5 points)
| Component | Points | Criteria |
|---|---|---|
| Extension Implementation | 3 | Working implementation of chosen extension, appropriate methodology |
| Extension Analysis | 2 | Evaluation of extension, insights gained, discussion of implications |
- Choose extension that interests you
- Quality over quantity
- Bonus points (up to +5) for multiple high-quality extensions
Overall Quality (Holistic Assessment)
| Aspect | Impact on Grade |
|---|---|
| Code Quality | Clean, well-documented, efficient code can earn bonus points |
| Writing Quality | Clear, insightful writing enhances grade; poor writing reduces it |
| Creativity | Novel approaches or insights can earn significant bonus points |
| Reproducibility | Non-reproducible results may lose up to 10 points |
| Late Submission | Standard course late policy applies |
Maximum Total: 110 points (100 base + 10 possible bonus)
Resources and References
Key Papers (from Syllabus Weeks 3-4)
Foundational Papers:
- Deerwester et al. (1990). Indexing by Latent Semantic Analysis
- Blei, Ng, & Jordan (2003). Latent Dirichlet Allocation
- Mikolov et al. (2013). Efficient Estimation of Word Representations in Vector Space
- Pennington, Socher, & Manning (2014). GloVe: Global Vectors for Word Representation
- Bojanowski et al. (2017). Enriching Word Vectors with Subword Information
- Devlin et al. (2019). BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- Radford et al. (2019). Language Models are Unsupervised Multitask Learners
- Reimers & Gurevych (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
- Angelov (2020). Top2Vec: Distributed Representations of Topics
- Grootendorst (2022). BERTopic: Neural Topic Modeling with a Class-based TF-IDF Procedure

Visualization:
- van der Maaten & Hinton (2008). Visualizing Data using t-SNE
- McInnes, Healy, & Melville (2018). UMAP: Uniform Manifold Approximation and Projection

Cognitive Science:
- Lenci (2008). Distributional semantics in linguistic and cognitive research
- Landauer & Dumais (1997). A Solution to Plato's Problem: The Latent Semantic Analysis Theory of Acquisition
Tutorials and Documentation
Libraries:
- Scikit-learn Clustering
- Gensim Word Embeddings
- Hugging Face Transformers
- Sentence Transformers
- BERTopic Documentation
- Plotly Python
- UMAP Documentation
Tutorials:
- Jay Alammar's Illustrated Word2Vec
- Jay Alammar's Illustrated BERT
- Topic Modeling with BERTopic (walkthrough)
- Understanding UMAP
Datasets for Human Judgments
- SimLex-999 - Semantic similarity ratings
- WordSim-353 - Word similarity dataset
- MEN Dataset - Semantic relatedness
Tips and Best Practices
Computational Efficiency
Working with Large Datasets:
- Start with a subset (e.g., 10K articles) for development
- Once your code works, scale to the full 250K dataset
- Cache embeddings to avoid recomputation (see the sketch after this subsection)
- Use batch processing for transformer models

GPU Usage:
- Request a GPU runtime in Colab (Runtime → Change runtime type)
- Monitor GPU memory usage with `nvidia-smi`
- Use mixed precision (fp16) for large models
- Clear the cache between models: `torch.cuda.empty_cache()`

Time Estimates:
- Budget ~30 minutes per embedding method for the full dataset
- Topic models (BERTopic, Top2Vec) are slowest (1-2 hours)
- Clustering is fast (<5 minutes for most methods)
- Visualization can be slow for 250K points; consider downsampling
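A minimal sketch of two of these habits, caching embeddings to disk and freeing GPU memory between models; the file name and compute function below are hypothetical:

```python
import os
import numpy as np
import torch

def cached(path, compute_fn):
    """Load embeddings from `path` if present; otherwise compute, save, and return them."""
    if os.path.exists(path):
        return np.load(path)
    embeddings = compute_fn()
    np.save(path, embeddings)
    return embeddings

# bert_embeddings = cached('bert_embeddings.npy', compute_bert_embeddings)  # hypothetical compute function

if torch.cuda.is_available():
    torch.cuda.empty_cache()  # after deleting one model, release cached GPU memory before loading the next
```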
Implementation Tips
Handling Long Documents:
- Transformers have a maximum input length (512 tokens for BERT, 1024 for GPT-2)
- Strategies (a chunking sketch follows this subsection):
  - Truncate to the first N tokens (simple but loses information)
  - Sliding window with averaging (more complete but slower)
  - Hierarchical: split the document, embed chunks, aggregate
  - Use models with longer context (Longformer, BigBird)

Aggregating Word Vectors:
- Simple mean: `np.mean(word_vectors, axis=0)`
- TF-IDF weighted: weight by term importance
- Max pooling: `np.max(word_vectors, axis=0)`
- Try multiple strategies and compare

Choosing Hyperparameters:
- For clustering: use the elbow method and silhouette scores
- For UMAP: `n_neighbors=15`, `min_dist=0.1` are good defaults
- For t-SNE: `perplexity` values of 30-50 work well for large datasets
- Document your choices and justify them
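A minimal sketch of the hierarchical strategy, chunking by whitespace words as a rough stand-in for model tokens and reusing an embedding function such as the BERT helper sketched in Part 1:

```python
import numpy as np

def chunked_embed(text, embed_fn, chunk_words=350):
    """Split a long article into word chunks, embed each chunk, and average the chunk embeddings."""
    words = text.split()
    chunks = [' '.join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)] or ['']
    return np.mean([embed_fn(chunk) for chunk in chunks], axis=0)

# Example: doc_vec = chunked_embed(wikipedia[0]['text'], bert_embed)
```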
Analysis Tips
Statistical Rigor:
- Don't just report one number; use confidence intervals (see the bootstrap sketch below)
- Compare methods with statistical tests (paired t-test, Wilcoxon)
- Use bootstrap resampling for uncertainty estimates
- Consider multiple random initializations for clustering
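A minimal sketch of a bootstrap confidence interval for one clustering metric, assuming `embeddings` and `labels` are NumPy arrays; the number of resamples and the silhouette subsample size are arbitrary:

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
scores = []
for _ in range(200):                                       # 200 bootstrap resamples of the articles
    idx = rng.choice(len(labels), size=len(labels), replace=True)
    if len(np.unique(labels[idx])) > 1:                    # silhouette needs at least two clusters present
        scores.append(silhouette_score(embeddings[idx], labels[idx], sample_size=5_000, random_state=0))
print('95% CI:', np.percentile(scores, [2.5, 97.5]))
```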
Effective Visualization:
- Use colorblind-friendly palettes (viridis, plasma, ColorBrewer)
- Label axes and include legends
- Make plots large enough to read
- Interactive > static for exploratory analysis
- Include captions explaining what to look for
Qualitative Analysis:
- Look at actual examples, not just metrics
- Sample from the best AND worst clusters
- Find interesting edge cases
- Use LLMs to help interpret, but verify their interpretations
Common Pitfalls to Avoid
Don't:
- Forget to normalize embeddings before clustering (when using cosine similarity)
- Use too many clusters (overfitting) or too few (underfitting)
- Trust metrics blindly; always inspect examples
- Cherry-pick results that fit expectations
- Plagiarize or use GenAI output without understanding it

Do:
- Start simple, then add complexity
- Validate assumptions with examples
- Compare multiple approaches
- Document limitations and failures
- Be honest about what works and what doesn't
Using GenAI Effectively
Good uses:
- Explaining error messages
- Suggesting library functions for specific tasks
- Helping debug code
- Generating boilerplate code
- Explaining concepts from papers

Poor uses:
- Generating analysis without understanding it
- Writing interpretation sections wholesale
- Implementing methods you don't understand
- Avoiding learning the underlying concepts
Expected Timeline
This is a 2-week intensive assignment (Weeks 3-4 of the course). With GenAI assistance and focused implementation, here's a suggested timeline:
Week 1: Implementation of All Methods, Initial Clustering
- Set up environment, download dataset
- Implement all classical methods (LSA, LDA)
- Implement all static embeddings (Word2Vec, GloVe, FastText)
- Implement contextualized embeddings (BERT, GPT-2)
- Implement modern embeddings (Sentence-BERT, Llama)
- Implement topic models (BERTopic, Top2Vec)
- Generate embeddings for full dataset
- Apply clustering algorithms (K-Means, Hierarchical, DBSCAN/HDBSCAN)
- Compute initial quantitative metrics
Week 2: Analysis, Visualization, Cognitive Science Connection, and Extensions
- Perform qualitative analysis and error analysis
- Cross-method comparison and interpretation
- Create multi-level clustering hierarchies
- Create interactive Plotly visualizations (3D scatter, 2D hexbin, cluster comparisons)
- Implement dimensionality reduction (t-SNE and UMAP)
- Write cognitive science connection and theoretical analysis
- Implement advanced extension (Option A, B, or C)
- Polish all outputs and finalize documentation
Frequently Asked Questions
Q: Do I really need to implement all 10+ methods? A: Yes. The comparison is the core of the assignment. However, if you have significant technical difficulties with one method, document the issue and move on.
Q: Can I use a smaller subset of the data? A: For development, yes. For final submission, use the full 250K articles (or justify why you used fewer).
Q: How do I handle out-of-memory errors? A: Use batch processing, reduce batch size, use smaller models, or process in chunks. Ask for help if stuck.
Q: Can I use different Wikipedia data? A: Prefer the provided dataset for comparability, but you can use different data if you have a good reason (must justify).
Q: How much analysis is enough? A: Quality > quantity. Deep analysis of a few interesting findings beats superficial treatment of many.
Q: Can I work with a partner? A: Check course policy. Generally, collaboration is allowed but each person must submit their own work.
Q: How important are the visualizations? A: Very important. They're worth 15 points and central to understanding the results. Invest time here.
Q: What if my results are "bad" (low metrics, unclear clusters)? A: Document what you found! Negative results are still results. Analyze why and what it means.
Q: Can I use commercial APIs (OpenAI, Anthropic)? A: Prefer open-source models, but if you have credits, you can use APIs for comparison.
Getting Help
- Office Hours: Best place for debugging and conceptual questions
- Course Forum: Great for sharing tips and common issues
- GenAI: Useful for coding help, but understand before using
- Library Documentation: Always check official docs first
- Papers: When in doubt, read the original paper
Learning Goals: What You'll Take Away
By completing this assignment, you will:
- Technical Skills
- Mastery of diverse embedding techniques
- Experience with unsupervised learning and clustering
- Expertise in evaluation metrics and analysis
- Ability to create publication-quality visualizations
- Conceptual Understanding
- Deep understanding of distributional semantics
- Knowledge of trade-offs between different approaches
- Appreciation for the evolution of NLP methods
- Critical thinking about what "meaning" means computationally
- Research Skills
- Ability to design and conduct comparative experiments
- Statistical rigor in evaluation
- Clear communication of findings
- Connection between theory and practice
- Practical Knowledge
- Experience working with large-scale NLP datasets
- Understanding of computational constraints
- Ability to debug and troubleshoot ML pipelines
- Portfolio-worthy project for job applications
Final Words
This assignment is designed to be challenging, extensive, and deeply educational. It's not meant to be completed in a weekend. Take your time, explore the methods, think deeply about the results, and enjoy the process of discovering how machines represent meaning.
The goal isn't just to get high metrics or pretty plots - it's to develop a sophisticated understanding of how different computational approaches capture semantic information, and to think critically about what these methods reveal (and obscure) about language and meaning.
Don't hesitate to be creative, try additional ideas, and follow interesting threads you discover. The best projects will show intellectual curiosity, rigorous analysis, and genuine insight.
Good luck, and enjoy exploring the semantic space!