Due: January 30, 2026 at 11:59 PM EST

Assignment 3: Representing Meaning - A Computational Exploration of Semantic Space

Overview

"You shall know a word by the company it keeps." - J.R. Firth, 1957

In this assignment, you will embark on a deep exploration of how machines represent meaning. Working with 250,000 Wikipedia articles, you will implement and compare methods spanning five decades of computational linguistics and natural language processing - from classical statistical techniques to modern large language models.

This assignment asks fundamental questions: What does it mean to "understand" the meaning of a document? How do different computational approaches capture semantic relationships? What aspects of human semantic knowledge can be modeled through distributional representations? Which methods best capture the conceptual structure of human knowledge as encoded in Wikipedia?

You will implement ~10 different embedding methods, perform sophisticated quantitative and qualitative analyses, create beautiful interactive visualizations, and connect your findings to theories of meaning in cognitive science and linguistics. This is a substantial, 2-week assignment that will deepen your understanding of how we represent and compute with meaning.

Dataset

You can automatically download the dataset from Dropbox if it doesn't already exist in your working directory. The following code checks whether the dataset is present, downloads it if needed, and loads it into your notebook:

```python
import os
import urllib.request
import pickle

# Define the file name and URL
dataset_url = 'https://www.dropbox.com/s/v4juxkc5v2rd0xr/wikipedia.pkl?dl=1'
dataset_path = 'wikipedia.pkl'

# Download the dataset if it doesn't exist
if not os.path.exists(dataset_path):
    print("Downloading dataset...")
    urllib.request.urlretrieve(dataset_url, dataset_path)
    print("Download complete.")

# Load the dataset
with open(dataset_path, 'rb') as f:
    wikipedia = pickle.load(f)
```
The dataset is formatted as a list of dictionary (dict) objects, each with the following keys/values:

There are 250K randomly selected articles in all.

Learning Objectives

By completing this assignment, you will:

Part 1: Implementing the Embedding Zoo (40 points)

Task 1.1: Classical Statistical Methods (8 points)

Implement two foundational approaches from classical NLP:

- Latent Semantic Analysis (LSA)
- Latent Dirichlet Allocation (LDA)

Key questions to explore:
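A minimal scikit-learn sketch of both methods, using TF-IDF features for LSA and raw counts for LDA; the toy corpus and hyperparameters below are illustrative only, not prescribed settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import TruncatedSVD, LatentDirichletAllocation

# Toy corpus; in the assignment this would be the 250K Wikipedia articles.
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stock markets fell sharply today",
    "investors sold shares in the market",
]

# LSA: truncated SVD over a TF-IDF term-document matrix.
tfidf = TfidfVectorizer().fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0)
lsa_embeddings = lsa.fit_transform(tfidf)     # shape: (n_docs, n_components)

# LDA: per-document topic mixtures over raw term counts.
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda_embeddings = lda.fit_transform(counts)    # each row sums to 1

print(lsa_embeddings.shape, lda_embeddings.shape)  # -> (4, 2) (4, 2)
```

With the real corpus you would use far more components (e.g. 100-300) and sparse matrices throughout.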

Task 1.2: Static Word Embeddings (8 points)

Implement three influential neural word embedding methods, aggregating word vectors to create document embeddings:

- Word2Vec
- GloVe
- FastText

Implementation notes:
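One common aggregation strategy, mean pooling with out-of-vocabulary (OOV) tokens skipped, can be sketched as follows. The toy dictionary stands in for pretrained vectors (in practice, gensim `KeyedVectors`); note that FastText can instead back off to subword vectors for OOV words, and weighted pooling (e.g. by TF-IDF) is another option worth comparing:

```python
import numpy as np

# Toy stand-in for a pretrained word-vector lookup.
vectors = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
dim = 2

def embed_document(tokens, vectors, dim):
    """Mean-pool word vectors; skip OOV tokens (one of several strategies)."""
    vecs = [vectors[t] for t in tokens if t in vectors]
    if not vecs:                    # all tokens OOV: fall back to a zero vector
        return np.zeros(dim)
    return np.mean(vecs, axis=0)

doc = embed_document(["cat", "dog", "unknown_word"], vectors, dim)
print(doc)  # -> [0.5 0.5]
```

Document your choice of aggregation and OOV handling, since both affect downstream clustering.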

Task 1.3: Contextualized Embeddings (8 points)

Implement transformer-based embeddings that capture context:

- BERT
- GPT-2

Key questions:
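A key design decision here is how to pool token vectors into a single document vector. Below is a sketch of attention-mask-aware mean pooling, using toy numpy arrays in place of a real model's `last_hidden_state` (which you would obtain from a Hugging Face `transformers` model):

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, ignoring padding positions.
    token_embeddings: (seq_len, hidden); attention_mask: (seq_len,) of 0/1."""
    mask = attention_mask[:, None].astype(float)
    summed = (token_embeddings * mask).sum(axis=0)
    count = mask.sum()
    return summed / max(count, 1.0)   # guard against all-padding input

# Toy example: two real tokens plus one padding token.
emb = np.array([[1.0, 3.0], [3.0, 1.0], [99.0, 99.0]])
mask = np.array([1, 1, 0])
print(mean_pool(emb, mask))  # -> [2. 2.]
```

Compare this against CLS-token pooling (BERT) and last-token pooling (GPT-2), and report how the choice affects your results.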

Task 1.4: Modern Sentence/Document Embeddings (8 points)

Implement state-of-the-art embedding methods designed specifically for sentences and documents:

- Sentence Transformers
- Llama 3 Embeddings

Key questions:

Task 1.5: Modern Topic Models (8 points)

Implement neural topic models that combine traditional topic modeling with modern embeddings:

- BERTopic
- Top2Vec

Key questions:

Deliverable for Part 1:

Part 2: Sophisticated Evaluation and Analysis (30 points)

Task 2.1: Clustering with Multiple Algorithms (10 points)

For each embedding method, apply multiple clustering algorithms:

- K-Means Clustering
- Hierarchical Clustering
- Density-Based Clustering (DBSCAN/HDBSCAN)

Deliverable:
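The three algorithm families can be run side by side on the same embeddings with scikit-learn. The synthetic 2-D data and parameter values below are placeholders for your real embeddings and tuned settings:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Two well-separated synthetic blobs stand in for document embeddings.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])

labels = {
    "kmeans": KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X),
    "agglomerative": AgglomerativeClustering(n_clusters=2).fit_predict(X),
    "dbscan": DBSCAN(eps=0.5, min_samples=5).fit_predict(X),  # -1 = noise
}
for name, y in labels.items():
    print(name, sorted({int(v) for v in y}))
```

Note that K-Means and agglomerative clustering require choosing the number of clusters up front, while DBSCAN discovers it (and may label outliers as noise), which is itself worth discussing.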

Task 2.2: Quantitative Evaluation Metrics (8 points)

Implement comprehensive metrics to evaluate embedding and clustering quality:

- Clustering Quality Metrics
- Embedding Quality Metrics
- Semantic Coherence Metrics

Deliverable:
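A sketch of computing internal (silhouette, no ground truth needed) and external (ARI, NMI, which assume reference labels) clustering metrics with scikit-learn, on synthetic data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, adjusted_rand_score,
                             normalized_mutual_info_score)

# Synthetic data with known labels stands in for embeddings + reference labels.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
true = np.array([0] * 50 + [1] * 50)

pred = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("silhouette:", silhouette_score(X, pred))      # internal: geometry only
print("ARI:", adjusted_rand_score(true, pred))       # external: vs. reference
print("NMI:", normalized_mutual_info_score(true, pred))
```

For the Wikipedia data you won't have gold labels, so internal metrics and semantic-coherence measures carry most of the weight; external metrics apply if you derive reference categories (e.g. from article metadata).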

Task 2.3: Qualitative Analysis (6 points)

Go beyond numbers to understand what each method captures:

- Cluster Interpretation
- Error Analysis
- Semantic Relationships
- Analogical Reasoning

Deliverable:
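The analogical-reasoning check (a : b :: c : ?) reduces to vector arithmetic plus a nearest-neighbor lookup. A toy sketch, with hand-made 2-D vectors standing in for your trained word embeddings:

```python
import numpy as np

# Hand-made toy vectors; replace with your Word2Vec/GloVe lookups.
vocab = {
    "king":  np.array([1.0, 1.0]),
    "queen": np.array([1.0, 0.0]),
    "man":   np.array([0.0, 1.0]),
    "woman": np.array([0.0, 0.1]),
    "apple": np.array([-1.0, 0.5]),
}

def analogy(a, b, c, vocab):
    """Solve a : b :: c : ? via vec(b) - vec(a) + vec(c), excluding the inputs."""
    target = vocab[b] - vocab[a] + vocab[c]
    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
    return max((w for w in vocab if w not in (a, b, c)),
               key=lambda w: cos(vocab[w], target))

print(analogy("man", "king", "woman", vocab))  # -> queen
```

With real embeddings, gensim's `most_similar(positive=..., negative=...)` performs the same computation efficiently over the full vocabulary.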

Task 2.4: Cross-Method Comparison (6 points)

Directly compare embedding methods:

- Embedding Similarity Analysis
- Consensus Clustering
- Performance vs. Cost Trade-offs

Deliverable:
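Two simple cross-method comparisons can be sketched as follows: the adjusted Rand index measures agreement between two clusterings (it is invariant to label renaming), and rank correlation between the pairwise-distance structures of two embedding spaces gives a representational-similarity check. All data below is synthetic:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
emb_a = rng.normal(size=(30, 8))           # stand-in for method A's embeddings
emb_b = emb_a @ rng.normal(size=(8, 8))    # stand-in for method B's embeddings

labels_a = [0] * 15 + [1] * 15             # stand-ins for two clusterings:
labels_b = [1] * 15 + [0] * 15             # same partition, labels swapped

print("ARI:", adjusted_rand_score(labels_a, labels_b))  # prints 1.0 here
rho, _ = spearmanr(pdist(emb_a), pdist(emb_b))
print("distance-structure correlation:", round(rho, 3))
```

Comparing every pair of methods this way yields a method-by-method agreement matrix, which is a natural input to consensus clustering.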

Part 3: Advanced Clustering and Visualization (15 points)

Task 3.1: Multi-Level Clustering (5 points)

Explore hierarchical structure in the Wikipedia knowledge space:

- Hierarchical Clustering Analysis
- Dendrogram Analysis

Deliverable:
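A SciPy sketch of multi-level analysis: build one linkage tree, then cut it at different depths to get coarse and fine partitions; `dendrogram` draws the tree in a notebook. Synthetic data stands in for your embeddings:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Three synthetic blobs stand in for document embeddings.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(3, 0.3, (20, 2)),
               rng.normal((0, 3), 0.3, (20, 2))])

Z = linkage(X, method="ward")                      # agglomerative merge tree
coarse = fcluster(Z, t=2, criterion="maxclust")    # top-level split
fine = fcluster(Z, t=3, criterion="maxclust")      # finer split
print(len(set(coarse)), len(set(fine)))            # -> 2 3
# In a notebook: dendrogram(Z, truncate_mode="level", p=5) to inspect structure.
```

Cutting one tree at multiple levels guarantees the partitions nest, which makes the hierarchy directly interpretable.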

Task 3.2: Interactive Visualization (10 points)

Create publication-quality, interactive visualizations:

- Dimensionality Reduction for Visualization
- Interactive Plotly Visualizations

For each major embedding method, create:

  1. 3D Interactive Scatter Plot
    • Each point is an article
    • Color by cluster assignment
    • Size by article length or importance (e.g., page views)
    • Hover: show article title, cluster label, and snippet
    • Enable rotation, zoom, selection
  2. 2D Hexbin Density Plot
    • Show density of articles in embedding space
    • Overlay cluster boundaries
    • Interactive region selection
  3. Cluster Comparison View
    • Side-by-side comparison of same data in different embedding spaces
    • Linked selections (select in one, highlight in others)
    • Show how cluster assignments change
Advanced Visualizations

Deliverable:

Part 4: Cognitive Science Connection (10 points)

Task 4.1: Distributional Semantics Theory (4 points)

Connect your computational work to theories of meaning:

- Theoretical Foundations
- Empirical Connection

Deliverable:

Task 4.2: What Is Meaning? (6 points)

Critically analyze what different methods capture:

- Philosophical Analysis
- Comparative Semantics
- Limitations and Biases

Deliverable:

Part 5: Advanced Extensions and Applications (5 points)

Choose at least ONE of the following extensions:

Option A: Cross-Lingual Embeddings

Implementation:

Analysis:

Option B: Temporal Analysis

Implementation:

Analysis:

Option C: Practical Applications

Implement at least one real-world application:

- Semantic Search Engine
- Recommendation System
- Automatic Summarization/Labeling
- Knowledge Graph Construction

Deliverable for Part 5:
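If you choose the semantic search engine, its core is cosine-similarity ranking of precomputed document embeddings against an embedded query. A sketch with synthetic vectors standing in for real embeddings:

```python
import numpy as np

def cosine_search(query_vec, doc_matrix, top_k=3):
    """Return indices of the top_k rows of doc_matrix most similar to query_vec."""
    q = query_vec / np.linalg.norm(query_vec)
    D = doc_matrix / np.linalg.norm(doc_matrix, axis=1, keepdims=True)
    scores = D @ q                       # cosine similarity per document
    return np.argsort(scores)[::-1][:top_k]

rng = np.random.default_rng(0)
docs = rng.normal(size=(10, 4))              # stand-in document embeddings
query = docs[7] + rng.normal(0, 0.01, 4)     # near-duplicate of document 7
print(cosine_search(query, docs, top_k=3))   # document 7 ranks first
```

At 250K documents this brute-force scan is still feasible, but an approximate-nearest-neighbor index (e.g. FAISS) would make it interactive.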

Submission Guidelines

GitHub Classroom Submission

This assignment is submitted via GitHub Classroom. Follow these steps:

  1. Accept the assignment: Click the assignment link provided in Canvas or by your instructor
  2. Clone your repository:

   git clone https://github.com/ContextLab/embeddings-llm-course-YOUR_USERNAME.git

  3. Complete your work:
    • Work in Google Colab, Jupyter, or your preferred environment
    • Save your notebook to the repository
  4. Commit and push your changes:

   git add .
   git commit -m "Complete Wikipedia embeddings assignment"
   git push

  5. Verify submission: Check that your latest commit appears in your GitHub repository before the deadline
Deadline: January 30, 2026 at 11:59 PM EST

Notebook Requirements

Submit a Google Colaboratory notebook (or Jupyter notebook) that includes:

Technical Requirements

  1. Reproducibility
    • All code necessary to download datasets and models
    • Clear installation instructions for required packages
    • Random seeds set for reproducibility
    • Must run in Google Colab with GPU runtime (T4 or better)
    • Estimated runtime: 2-4 hours for full notebook
  2. Organization
    • Clear section headers matching assignment parts
    • Table of contents with navigation links
    • Markdown cells explaining approach, decisions, and insights
    • Code comments for complex operations
    • Summary sections after each major part
  3. Outputs
    • All visualizations embedded in notebook
    • Tables and metrics clearly formatted
    • Long outputs (model training) can be summarized
    • Save embeddings to files to avoid recomputation
  4. Writing Quality
    • Clear, concise explanations
    • Proper citations for papers and methods
    • Academic writing style for analysis sections
    • Proofread for grammar and clarity

What to Submit

  1. Primary Deliverable: Google Colab notebook link (ensure sharing is enabled)
  2. Optional: Saved embeddings and models (via Google Drive link)
  3. Optional: Standalone HTML export of notebook with all outputs

Collaboration Policy


Grading Rubric (100 points total)

Part 1: Implementation (40 points)

| Component | Points | Criteria |
| --- | --- | --- |
| Classical Methods (LSA, LDA) | 8 | Correct implementation, reasonable hyperparameters, working embeddings |
| Static Embeddings (Word2Vec, GloVe, FastText) | 8 | Proper aggregation strategies, handling of OOV words, documented choices |
| Contextualized Embeddings (BERT, GPT-2) | 8 | Appropriate pooling, handling of long documents, clear methodology |
| Modern Embeddings (Sentence-BERT, Llama) | 8 | Correct model usage, comparison of approaches, quality embeddings |
| Topic Models (BERTopic, Top2Vec) | 8 | Proper configuration, coherent topics, analysis of topic quality |

Grading notes:

Part 2: Evaluation and Analysis (30 points)

| Component | Points | Criteria |
| --- | --- | --- |
| Clustering Algorithms | 10 | Multiple algorithms implemented, systematic comparison, justified choices |
| Quantitative Metrics | 8 | Comprehensive metrics, correct implementation, statistical rigor |
| Qualitative Analysis | 6 | Thoughtful interpretation, case studies, error analysis |
| Cross-Method Comparison | 6 | Direct comparisons, correlation analysis, actionable insights |

Grading notes:

Part 3: Visualization (15 points)

| Component | Points | Criteria |
| --- | --- | --- |
| Multi-Level Clustering | 5 | Clear hierarchy, validated against structure, insightful analysis |
| Interactive Visualizations | 10 | High-quality Plotly plots, appropriate techniques, informative and beautiful |

Grading notes:

Part 4: Cognitive Science Connection (10 points)

| Component | Points | Criteria |
| --- | --- | --- |
| Theoretical Connection | 4 | Clear explanation of distributional semantics, connection to cognitive science |
| Critical Analysis | 6 | Thoughtful discussion of meaning, limitations, philosophical depth |

Grading notes:

Part 5: Advanced Extensions (5 points)

| Component | Points | Criteria |
| --- | --- | --- |
| Extension Implementation | 3 | Working implementation of chosen extension, appropriate methodology |
| Extension Analysis | 2 | Evaluation of extension, insights gained, discussion of implications |

Grading notes:

Overall Quality (Holistic Assessment)

| Aspect | Impact on Grade |
| --- | --- |
| Code Quality | Clean, well-documented, efficient code can earn bonus points |
| Writing Quality | Clear, insightful writing enhances grade; poor writing reduces it |
| Creativity | Novel approaches or insights can earn significant bonus points |
| Reproducibility | Non-reproducible results may lose up to 10 points |
| Late Submission | Standard course late policy applies |

Maximum Total: 110 points (100 base + 10 possible bonus)


Resources and References

Key Papers (from Syllabus Weeks 3-4)

Foundational Papers:

Neural Methods:

Modern Topic Models:

Evaluation and Analysis:

Cognitive Science Connection:

Tutorials and Documentation

Libraries:

Tutorials:

Datasets for Human Judgments


Tips and Best Practices

Computational Efficiency

Working with Large Datasets:

GPU Utilization:

Time Management:

Implementation Tips

Handling Long Documents:

Aggregating Word Vectors:

Choosing Hyperparameters:
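For documents longer than a model's context window, one common strategy is to split them into overlapping chunks, embed each chunk, and average the results. A sketch of the chunker (chunk and overlap sizes are illustrative):

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Yield overlapping windows of tokens covering the whole document."""
    step = chunk_size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + chunk_size]

# A 1200-token document splits into three chunks with 64-token overlaps.
tokens = [f"tok{i}" for i in range(1200)]
chunks = list(chunk_tokens(tokens))
print(len(chunks), [len(c) for c in chunks])  # -> 3 [512, 512, 304]
```

The overlap prevents sentences at chunk boundaries from being seen without context; alternatives include truncation (simplest, loses content) and weighting chunks by length when averaging.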

Analysis Tips

Statistical Rigor:

Visualization Best Practices:

Qualitative Analysis:

Common Pitfalls to Avoid

Don't:

Do:

Using GenAI Effectively

Good uses:

Bad uses:

Remember: You must understand and be able to explain everything you submit.

Expected Timeline

This is a 2-week intensive assignment (Weeks 3-4 of the course). With GenAI assistance and focused implementation, here's a suggested timeline:

Week 1: Implementation of All Methods, Initial Clustering

Week 2: Analysis, Visualization, Cognitive Science Connection, and Extensions

Total estimated effort: 25-35 hours (with GenAI assistance for coding and documentation)

Frequently Asked Questions

Q: Do I really need to implement all 10+ methods? A: Yes. The comparison is the core of the assignment. However, if you have significant technical difficulties with one method, document the issue and move on.

Q: Can I use a smaller subset of the data? A: For development, yes. For final submission, use the full 250K articles (or justify why you used fewer).

Q: How do I handle out-of-memory errors? A: Use batch processing, reduce batch size, use smaller models, or process in chunks. Ask for help if stuck.

Q: Can I use different Wikipedia data? A: Prefer the provided dataset for comparability, but you can use different data if you have a good reason (must justify).

Q: How much analysis is enough? A: Quality > quantity. Deep analysis of a few interesting findings beats superficial treatment of many.

Q: Can I work with a partner? A: Check course policy. Generally, collaboration is allowed but each person must submit their own work.

Q: How important are the visualizations? A: Very important. They're worth 15 points and central to understanding the results. Invest time here.

Q: What if my results are "bad" (low metrics, unclear clusters)? A: Document what you found! Negative results are still results. Analyze why and what it means.

Q: Can I use commercial APIs (OpenAI, Anthropic)? A: Prefer open-source models, but if you have credits, you can use APIs for comparison.


Getting Help


Learning Goals: What You'll Take Away

By completing this assignment, you will:

  1. Technical Skills
    • Mastery of diverse embedding techniques
    • Experience with unsupervised learning and clustering
    • Expertise in evaluation metrics and analysis
    • Ability to create publication-quality visualizations
  2. Conceptual Understanding
    • Deep understanding of distributional semantics
    • Knowledge of trade-offs between different approaches
    • Appreciation for the evolution of NLP methods
    • Critical thinking about what "meaning" means computationally
  3. Research Skills
    • Ability to design and conduct comparative experiments
    • Statistical rigor in evaluation
    • Clear communication of findings
    • Connection between theory and practice
  4. Practical Knowledge
    • Experience working with large-scale NLP datasets
    • Understanding of computational constraints
    • Ability to debug and troubleshoot ML pipelines
    • Portfolio-worthy project for job applications

Final Words

This assignment is designed to be challenging, extensive, and deeply educational. It's not meant to be completed in a weekend. Take your time, explore the methods, think deeply about the results, and enjoy the process of discovering how machines represent meaning.

The goal isn't just to get high metrics or pretty plots - it's to develop a sophisticated understanding of how different computational approaches capture semantic information, and to think critically about what these methods reveal (and obscure) about language and meaning.

Don't hesitate to be creative, try additional ideas, and follow interesting threads you discover. The best projects will show intellectual curiosity, rigorous analysis, and genuine insight.

Good luck, and enjoy exploring the semantic space!