Assignment 3: Exploring Document Embeddings

Due: February 2, 2026 at 11:59 PM EST
Timeline: 1 week

Ready to start? Accept the assignment via the GitHub Classroom link.

Overview

"You shall know a word by the company it keeps." - J.R. Firth, 1957

In this assignment, you will explore how machines represent the meaning of documents. Working with Wikipedia articles, you will implement and compare embedding methods spanning five decades of computational linguistics—from classical statistical techniques to modern transformer-based models.

Concretely, you will implement the methods (Part 1), visualize the resulting embedding spaces (Part 2), evaluate them on a document matching task (Part 3), and reflect on your findings (Part 4). This is a 1-week assignment designed to be achievable with GenAI assistance. Focus on understanding the trade-offs between methods rather than exhaustive implementation details.

Learning Objectives

By completing this assignment, you will:

- Implement and compare document embedding methods, from classical statistical techniques to transformer-based models
- Visualize embedding spaces using dimensionality reduction and clustering
- Quantitatively evaluate embedding quality on a document matching task
- Reflect on the trade-offs between methods and on how machines represent meaning

Dataset

We provide a curated dataset of 250,000 Wikipedia articles. The following code downloads and loads the dataset:

```python
import os
import urllib.request
import pickle

# Download the dataset if it doesn't exist
dataset_url = 'https://www.dropbox.com/s/v4juxkc5v2rd0xr/wikipedia.pkl?dl=1'
dataset_path = 'wikipedia.pkl'

if not os.path.exists(dataset_path):
    print("Downloading dataset (~750MB)...")
    urllib.request.urlretrieve(dataset_url, dataset_path)
    print("Download complete.")

# Load the dataset
with open(dataset_path, 'rb') as f:
    wikipedia = pickle.load(f)

print(f"Loaded {len(wikipedia)} articles")
```

Each article is a dictionary; the later parts of this assignment rely on at least its title and full text.

Important: For development and testing, start with a small subset (e.g., 5,000-10,000 articles). Scale up for final results as time permits.

Part 1: Implement Embedding Methods (40 points)

Implement at least 10 of the following embedding approaches. For each method, create document-level embeddings for your Wikipedia subset.

Classical Methods

1. Latent Semantic Analysis (LSA)
2. TF-IDF + SVD
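
A minimal scikit-learn sketch of method 2, assuming the articles expose a `text` field; the vocabulary cap and 300 components are illustrative choices, and one common way to distinguish method 1 (LSA) is to run the same SVD on raw term counts (CountVectorizer) instead of TF-IDF weights:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

texts = [article['text'] for article in wikipedia]  # assumes a 'text' field

# TF-IDF term-document matrix, then SVD down to dense document vectors
tfidf = TfidfVectorizer(max_features=50_000, stop_words='english')
X = tfidf.fit_transform(texts)               # sparse (n_docs, n_terms)

svd = TruncatedSVD(n_components=300, random_state=0)
doc_embeddings = svd.fit_transform(X)        # dense (n_docs, 300)
```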

Static Word Embeddings (aggregate to document level)

3. Word2Vec
4. GloVe
5. FastText
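
A minimal sketch of document-level aggregation by mean pooling, here with pretrained GloVe vectors from gensim's downloader; Word2Vec and FastText vectors aggregate the same way, and the model name and whitespace tokenization are illustrative assumptions:

```python
import numpy as np
import gensim.downloader

kv = gensim.downloader.load('glove-wiki-gigaword-300')  # KeyedVectors

def embed_document(text):
    """Mean of the word vectors of all in-vocabulary tokens."""
    tokens = [t for t in text.lower().split() if t in kv]
    if not tokens:
        return np.zeros(kv.vector_size)
    return np.mean([kv[t] for t in tokens], axis=0)

doc_embeddings = np.vstack([embed_document(a['text']) for a in wikipedia])
```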

Transformer-Based Embeddings

6. Sentence-BERT (all-MiniLM-L6-v2)
7. Sentence-BERT (all-mpnet-base-v2)
8. BGE (BAAI General Embedding)
9. E5
10. Nomic Embed
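
A minimal sentence-transformers sketch for methods 6-7; BGE, E5, and Nomic Embed load the same way under a different model name, but some (E5 in particular) expect query/passage prefixes, so check each model card:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
doc_embeddings = model.encode(
    texts,                      # the same list of article texts as above
    batch_size=64,
    show_progress_bar=True,
    normalize_embeddings=True,  # unit norm, so cosine similarity = dot product
)
```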

Additional Options (pick any to reach 10+)

Deliverables for Part 1


Part 2: Visualization (25 points)

Create visualizations to understand the structure of your embedding spaces.

2.1 Dimensionality Reduction with UMAP

For each embedding method:
  1. Apply UMAP to reduce to 2D: umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine')
  2. Store the 2D coordinates for visualization
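
A minimal sketch of both steps, assuming `doc_embeddings` holds one method's document vectors; `random_state` is added here only for reproducibility:

```python
import umap

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, metric='cosine', random_state=42)
umap_coords = reducer.fit_transform(doc_embeddings)  # (n_samples, 2)
```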

2.2 Clustering

Apply clustering to discover document groups:

- K-Means
- HDBSCAN
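
A minimal sketch of both options, reusing `doc_embeddings` and `umap_coords` from above; running HDBSCAN on the 2D UMAP coordinates rather than the raw embeddings is a common choice, and `n_clusters` / `min_cluster_size` are starting points to tune:

```python
from sklearn.cluster import KMeans
import hdbscan

kmeans_labels = KMeans(n_clusters=20, random_state=0).fit_predict(doc_embeddings)

# HDBSCAN labels points it cannot confidently cluster as -1 (noise)
hdbscan_labels = hdbscan.HDBSCAN(min_cluster_size=50).fit_predict(umap_coords)
```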

2.3 DataMapPlot Visualizations

Create publication-quality visualizations using DataMapPlot:

```python
import datamapplot

# Create labeled visualization
fig, ax = datamapplot.create_plot(
    umap_coords,        # (n_samples, 2) array
    cluster_labels,     # Label for each point
    title="Wikipedia Embeddings",
    sub_title="Method: Sentence-BERT",
    label_wrap_width=20,
    darkmode=False,
)
```

Required visualizations:
  1. At least 3 different embedding methods visualized with DataMapPlot
  2. Compare K-Means vs HDBSCAN clustering on the same embeddings
  3. Include cluster labels that are meaningful (e.g., use article titles or LLM-generated descriptions)

2.4 Comparison Visualizations

Create at least one visualization that directly compares methods (for example, side-by-side DataMapPlot or UMAP panels of the same articles under different embeddings).

Deliverables for Part 2


Part 3: Document Matching Evaluation (20 points)

Evaluate embedding quality through a document matching task.

The Task

For each document:
  1. Split it into two halves (first half and second half of the text)
  2. Embed each half separately
  3. For the first half, find the most similar embedding among all second halves
  4. A "match" is correct if the retrieved second half belongs to the same original document
This tests whether embeddings capture document-level semantics consistently.
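
A minimal sketch of the split, assuming each article's body is in a `text` field and dividing at the midpoint by word count:

```python
def split_document(text):
    """Return the first and second halves of a document, split by word count."""
    words = text.split()
    mid = len(words) // 2
    return ' '.join(words[:mid]), ' '.join(words[mid:])

halves = [split_document(a['text']) for a in wikipedia]
first_halves = [first for first, _ in halves]
second_halves = [second for _, second in halves]
```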

Implementation

```python
def evaluate_document_matching(embeddings_first_half, embeddings_second_half):
    """
    Compute matching accuracy.

    For each first-half embedding, find the nearest second-half embedding.
    Return the fraction where the nearest neighbor is the correct match.
    """
    # Your implementation here
    pass
```

Metrics to Report

For each embedding method, report:
  1. Accuracy@1: Fraction where the correct second half is the top match
  2. Accuracy@5: Fraction where the correct second half is in the top 5 matches
  3. Mean Reciprocal Rank (MRR): Average of 1/rank for the correct match
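
One possible shape for the evaluation, assuming both inputs are L2-normalized NumPy arrays in the same document order (the names `emb_first` and `emb_second` are placeholders for your own embeddings of the two halves):

```python
import numpy as np

def matching_ranks(first, second):
    """Rank (1 = best) of the correct second half for each first half."""
    sims = first @ second.T                    # cosine similarity for unit vectors
    order = np.argsort(-sims, axis=1)          # candidate indices, best first
    correct = np.arange(len(first))
    return np.argmax(order == correct[:, None], axis=1) + 1

ranks = matching_ranks(emb_first, emb_second)
accuracy_at_1 = np.mean(ranks == 1)
accuracy_at_5 = np.mean(ranks <= 5)
mrr = np.mean(1.0 / ranks)
```

Note that the full similarity matrix costs O(n²) memory, which is another reason to evaluate on a subset.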

Visualization

Create a bar plot with error bars (or another appropriate visualization) comparing all embedding methods. Use bootstrap resampling to compute confidence intervals for your metrics.
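
A minimal bootstrap sketch for a 95% confidence interval on Accuracy@1, assuming the `ranks` array from the sketch above:

```python
import numpy as np

rng = np.random.default_rng(0)
boot = [np.mean(rng.choice(ranks, size=len(ranks)) == 1)  # resample with replacement
        for _ in range(1000)]
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
```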

Deliverables for Part 3


Part 4: Reflection Essays (15 points)

Write two short essays (each 300-500 words) reflecting on your findings.

Essay 1: Trade-offs Between Methods (Required)

Discuss the trade-offs you observed: for example, embedding quality versus computational cost, and where the classical methods remain competitive with transformer-based ones.

Essay 2: What Is Meaning? (Choose this OR your own topic)

Reflect on the deeper questions this assignment raises about how machines represent the meaning of documents.

Alternative: Write about a topic of your choice related to embeddings, visualization, or semantic representation. Clear it with the instructor if unsure.

Deliverables for Part 4


Submission Guidelines

GitHub Classroom Submission

  1. Accept the assignment via the GitHub Classroom link above
  2. Clone your repository
  3. Complete your work in Google Colab
  4. Push your notebook to the repository before the deadline

Notebook Requirements

Your notebook should run top to bottom without errors, with each part of the assignment clearly labeled.

Before Submitting


Grading Rubric (100 points)

| Component | Points | Criteria |
|---|---|---|
| Part 1: Embeddings | 40 | 10+ methods implemented correctly, reasonable hyperparameters |
| Part 2: Visualization | 25 | UMAP + clustering + 5+ quality DataMapPlot visualizations |
| Part 3: Evaluation | 20 | Document matching implemented, bar plot with error bars, analysis |
| Part 4: Essays | 15 | Thoughtful, substantive reflection (300-500 words each) |
Bonus opportunities:

Tips for Success

Start Small

Develop and debug on a 5,000-10,000-article subset (see the Dataset note), then scale up for final results.

Computational Efficiency

Compute each method's embeddings once and cache them to disk; batch transformer encoding (a Colab GPU runtime helps).

A suggested schedule:

| Day | Tasks |
|---|---|
| 1-2 | Set up environment, implement 5+ embedding methods |
| 3-4 | Complete remaining embeddings, create visualizations |
| 5 | Implement document matching evaluation |
| 6 | Run final experiments, write essays |
| 7 | Polish, verify reproducibility, submit |

Using GenAI Effectively

You're encouraged to use ChatGPT, Claude, or Copilot to scaffold code, debug errors, and navigate unfamiliar libraries; the assignment is designed with that assistance in mind. You must still understand and verify everything you submit: the goal is insight into the trade-offs between methods, not the volume of code produced.

Resources

Key Libraries

- sentence-transformers
- gensim
- scikit-learn
- umap-learn
- hdbscan
- datamapplot

Papers


Questions?

Good luck exploring the semantic space!