Due: January 23, 2026 at 11:59 PM EST

Assignment 2: Advanced SPAM Classifier with Multi-Method Comparison

Overview

Spam detection is one of the most successful and widely-deployed applications of machine learning. In this assignment, you will build a comprehensive spam classification system that not only achieves high performance but also provides deep insights into what makes spam detection work—and what makes it fail.

Unlike a simple "build a classifier" task, this assignment requires you to:
  1. Implement multiple classification approaches (traditional ML and neural methods)
  2. Conduct rigorous comparative analysis across methods
  3. Perform extensive error analysis to understand failure modes
  4. Test adversarial robustness by trying to fool your own classifier
  5. Consider real-world deployment constraints (speed, memory, class imbalance)
This assignment mirrors real-world ML engineering: you'll make architecture decisions, justify trade-offs, and demonstrate that you understand not just how to build models, but why they work.

Timeline: This assignment is designed to be completed in 1 week (7 days) while remaining comprehensive in scope. By using GenAI tools to accelerate implementation, you can focus your time on the deeper analytical work—error analysis, robustness testing, and deriving insights—that separates excellent work from good work.

Learning Objectives

By completing this assignment, you will:

Dataset

A sample dataset is provided (training.zip), consisting of two folders: spam and ham. You can use this dataset to develop, train, and test your classifiers. When evaluating your solution, a new dataset, structured in the same way (with "spam" and "ham" folders), will be used. Your models should generalize well to this unseen data.

Important Notes:

Required Components

Part 1: Multiple Classifier Implementations (40 points)

You must implement and train at least three different classifiers:

1. Traditional ML Baseline (15 points)

Implement two of the following with proper feature engineering:

For traditional methods, you must:

Example features to consider:
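For instance, URL counts, excessive punctuation, and the fraction of fully uppercase words are common handcrafted signals in spam detection. As a concrete but non-prescriptive starting point for the baseline itself, the sketch below pairs a TF-IDF representation with two classical models; the hyperparameters are illustrative, and the emails and labels variables are assumed to be loaded already:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# emails: list[str], labels: list[int] (1 = spam, 0 = ham), loaded elsewhere
X_train, X_val, y_train, y_val = train_test_split(
    emails, labels, test_size=0.2, stratify=labels, random_state=42
)

# Two illustrative baselines sharing the same TF-IDF feature representation
baselines = {
    "naive_bayes": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("clf", MultinomialNB()),
    ]),
    "logistic_regression": Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2), min_df=2)),
        ("clf", LogisticRegression(max_iter=1000)),
    ]),
}

for name, model in baselines.items():
    model.fit(X_train, y_train)
    val_probs = model.predict_proba(X_val)[:, 1]  # Probability of spam
    print(f"{name}: validation AUC = {roc_auc_score(y_val, val_probs):.4f}")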

2. Neural/Transformer-Based Model (15 points)

Implement at least one neural approach:

For neural methods, you must:
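One way to satisfy this requirement is to fine-tune DistilBERT with the Hugging Face transformers library. The sketch below is illustrative only: the model choice, training arguments, and the train_texts/train_labels and val_texts/val_labels variables are assumptions about your own setup, not a required design.

import torch
from torch.utils.data import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

class EmailDataset(Dataset):
    """Wraps tokenized emails and labels for the Trainer API."""
    def __init__(self, texts, labels, tokenizer, max_length=256):
        self.encodings = tokenizer(texts, truncation=True, padding=True,
                                   max_length=max_length)
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: torch.tensor(v[idx]) for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

train_dataset = EmailDataset(train_texts, train_labels, tokenizer)
val_dataset = EmailDataset(val_texts, val_labels, tokenizer)

training_args = TrainingArguments(
    output_dir="distilbert-spam",
    num_train_epochs=2,              # Illustrative; tune as needed
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    seed=42,
)

trainer = Trainer(model=model, args=training_args,
                  train_dataset=train_dataset, eval_dataset=val_dataset)
trainer.train()

# Spam probability for a single email (useful later for ensembling and AUC)
def neural_spam_probability(email_text):
    inputs = tokenizer(email_text, truncation=True, max_length=256,
                       return_tensors="pt").to(model.device)
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()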

3. Ensemble Method (10 points)

Create an ensemble that combines predictions from your best models. Document your ensemble strategy and show whether it improves over the individual models.
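Many ensembling strategies are reasonable here. One simple option, sketched below under the assumption that each model exposes a function mapping email text to a spam probability (like the functions sketched above), is a weighted average of the individual probabilities:

import numpy as np

def ensemble_spam_probability(email_text, prob_fns, weights=None):
    """Average (optionally weighted) spam probabilities from several models.

    prob_fns: list of functions mapping email text -> probability of spam.
    weights: optional list of floats, one per model (defaults to equal weights).
    """
    probs = np.array([fn(email_text) for fn in prob_fns])
    if weights is None:
        weights = np.ones(len(prob_fns))
    weights = np.array(weights, dtype=float)
    return float(np.average(probs, weights=weights))

# Example usage (hypothetical model functions from earlier sections):
# p = ensemble_spam_probability(email, [nb_spam_probability,
#                                       lr_spam_probability,
#                                       neural_spam_probability])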

Part 2: Comprehensive Evaluation (25 points)

For each classifier, you must report:

Quantitative Metrics (15 points)

Create a comparison table showing all metrics for all classifiers.
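A pandas DataFrame keyed by model name is a convenient way to build this table. The sketch below assumes each model produces spam probabilities for a shared test set (model_probs, y_test); the particular metrics shown are illustrative, so include whichever metrics the assignment requires:

import pandas as pd
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

def metrics_row(y_true, y_prob, threshold=0.5):
    """Compute a dictionary of common classification metrics from spam probabilities."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_prob),
    }

# model_probs: dict mapping model name -> list of spam probabilities on the test set
comparison = pd.DataFrame({name: metrics_row(y_test, probs)
                           for name, probs in model_probs.items()}).T
print(comparison.round(4))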

Computational Efficiency (5 points)

For each model, measure and report computational costs (e.g., training time, per-email inference latency, and model size). This helps you understand deployment trade-offs.
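Timing results depend heavily on hardware, so treat the sketch below as a rough template rather than a fixed protocol; it assumes a fitted scikit-learn pipeline named model and measures wall-clock training time and per-email inference latency:

import time

def time_inference(classify_fn, emails, repeats=1):
    """Return average seconds per email for a classification function."""
    start = time.perf_counter()
    for _ in range(repeats):
        for email in emails:
            classify_fn(email)
    elapsed = time.perf_counter() - start
    return elapsed / (repeats * len(emails))

# Training time for a hypothetical sklearn pipeline
start = time.perf_counter()
model.fit(X_train, y_train)
print(f"Training time: {time.perf_counter() - start:.2f} s")

# Per-email inference latency on a sample of the test set
latency = time_inference(lambda e: model.predict([e])[0], X_test[:200])
print(f"Inference: {latency * 1000:.2f} ms per email")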

Statistical Significance (5 points)
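The assignment does not prescribe a particular test; one common option for comparing two classifiers evaluated on the same test set is McNemar's test on their disagreements. The sketch below is one such approach and assumes hard 0/1 predictions from both models on the same examples:

from statsmodels.stats.contingency_tables import mcnemar

def compare_classifiers_mcnemar(y_true, preds_a, preds_b):
    """McNemar's test: do two classifiers differ significantly on the same test set?"""
    a_correct = [p == t for p, t in zip(preds_a, y_true)]
    b_correct = [p == t for p, t in zip(preds_b, y_true)]
    # 2x2 contingency table of (A correct?, B correct?) counts
    table = [[0, 0], [0, 0]]
    for a, b in zip(a_correct, b_correct):
        table[0 if a else 1][0 if b else 1] += 1
    result = mcnemar(table, exact=True)
    return result.statistic, result.pvalue

# stat, p = compare_classifiers_mcnemar(y_test, nb_preds, bert_preds)
# print(f"McNemar p-value: {p:.4f}")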

Part 3: Error Analysis (20 points)

This is where you demonstrate deep understanding:

Systematic Error Categorization (10 points)

  1. Identify failure cases: Find at least 20 misclassified emails (10 false positives, 10 false negatives); see the sketch after this list for one way to collect them
  2. Categorize errors: Group them into patterns:
    • False Positives: Legitimate emails classified as spam (e.g., promotional emails, newsletters)
    • False Negatives: Spam that slipped through (e.g., sophisticated phishing, image-based spam)
  3. Analyze patterns: What do misclassified emails have in common?
    • Vocabulary overlap between spam and ham
    • Short emails with little context
    • Emails with unusual formatting
    • Multilingual content
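As a starting point for step 1 above, the following sketch collects false positives and false negatives from a model's test-set predictions; the X_test, y_test, and y_pred names are assumptions about your own variables:

# Collect misclassified test emails for manual inspection
false_positives = []  # Ham classified as spam
false_negatives = []  # Spam classified as ham

for email, true_label, pred_label in zip(X_test, y_test, y_pred):
    if pred_label == 1 and true_label == 0:
        false_positives.append(email)
    elif pred_label == 0 and true_label == 1:
        false_negatives.append(email)

print(f"{len(false_positives)} false positives, {len(false_negatives)} false negatives")

# Inspect a few of each category to look for shared patterns
for email in false_positives[:10]:
    print(email[:300], "\n---")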

Comparative Error Analysis (5 points)

Feature Importance Analysis (5 points)

Part 4: Adversarial Testing (10 points)

Test the robustness of your classifiers:

Create Adversarial Examples (5 points)

Manually craft at least 5 emails that attempt to evade spam detection. Test these on all of your classifiers. Which models are most robust?
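Hand-crafting is the requirement, but simple programmatic perturbations can help you explore what fools each model. The obfuscation tricks below are illustrative assumptions, not a prescribed list:

import random

def obfuscate(text, seed=42):
    """Apply a few simple spam-style obfuscations to probe classifier robustness."""
    random.seed(seed)
    substitutions = {"free": "fr3e", "money": "m0ney", "win": "w1n", "offer": "0ffer"}
    perturbed = []
    for word in text.split():
        lower = word.lower()
        if lower in substitutions:
            perturbed.append(substitutions[lower])
        elif random.random() < 0.05:
            # Occasionally split a word with a period to dodge token matching
            mid = len(word) // 2
            perturbed.append(word[:mid] + "." + word[mid:])
        else:
            perturbed.append(word)
    return " ".join(perturbed)

# Compare each classifier's prediction on the original vs. obfuscated email
# for name, classify_fn in classifiers.items():
#     print(name, classify_fn(spam_email), classify_fn(obfuscate(spam_email)))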

Robustness Analysis (5 points)

Test your classifiers against common evasion tactics (e.g., character substitutions, inserted benign text, or obfuscated URLs). How much does performance degrade? Which models are most robust?

Part 5: Real-World Considerations (5 points)

Discuss the following:

Class Imbalance (2 points)

Deployment Scenarios (3 points)

Given different constraints, which model would you choose? Justify your recommendations with evidence from your experiments.

Deliverables

Submit a single Jupyter notebook that includes:

1. Code Implementation

2. Markdown Documentation

Your notebook must include well-written markdown sections:

Introduction

Methods

For each classifier:

Results

Error Analysis

Adversarial Testing

Discussion

Reflection

3. Code Quality

Grading Rubric (100 points total)

Your assignment will be graded according to the following breakdown:

Technical Implementation (40 points)

Evaluation and Analysis (45 points)

Adversarial Testing and Robustness (10 points)

Real-World Considerations (5 points)

Documentation and Presentation (15 points)

Bonus Points (up to 10 points)

Note: Maximum score is capped at 110/100.

Evaluation Metrics

While your grade is based on the rubric above, your model's performance will also be tested on a held-out dataset. This serves as a sanity check—if your models perform poorly (e.g., below 0.85 AUC), you may lose points even if other components are complete.

Expected Performance Benchmarks:

The following code can be used to evaluate your classifiers during development:
import os
import zipfile
import shutil
from pathlib import Path
from sklearn.metrics import roc_auc_score

def evaluate_classifier(zip_path: str, classify_email_fn) -> float:
    """
    Evaluate a classifier's performance on a dataset contained in a zip archive.

    Parameters:
        zip_path (str): Path to the zip archive containing "spam" and "ham" folders.
        classify_email_fn (function): A function handle to classify_email(email_text: str) -> int.

    Returns:
        float: The AUC (Area Under the Curve) score of the classifier.
    """
    # Step 1: Set up paths and directories
    dataset_dir = Path(zip_path).with_suffix('')  # Directory name based on the zip name (without .zip)
    temp_extracted = False  # Track whether we extracted the zip (for cleanup)

    # Step 2: Check if the dataset is already extracted
    if not dataset_dir.exists():
        print(f"Extracting {zip_path}...")
        with zipfile.ZipFile(zip_path, 'r') as zip_ref:
            zip_ref.extractall(dataset_dir)
        temp_extracted = True  # Mark that we extracted files

    # Step 3: Prepare to collect the data
    emails = []
    labels = []

    # Traverse the spam folder
    spam_folder = dataset_dir / "spam"
    for file_path in spam_folder.iterdir():
        if file_path.is_file():
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                email_text = file.read()
            emails.append(email_text)
            labels.append(1)  # Spam is labeled as 1

    # Traverse the ham folder
    ham_folder = dataset_dir / "ham"
    for file_path in ham_folder.iterdir():
        if file_path.is_file():
            with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
                email_text = file.read()
            emails.append(email_text)
            labels.append(0)  # Ham is labeled as 0

    # Step 4: Classify all emails
    predictions = [classify_email_fn(email) for email in emails]

    # Step 5: Calculate AUC score
    auc_score = roc_auc_score(labels, predictions)

    # Step 6: Clean up if necessary
    if temp_extracted:
        print(f"Cleaning up extracted files from {dataset_dir}...")
        shutil.rmtree(dataset_dir)

    return auc_score

You can call this function in your notebook to evaluate individual models during development:

auc_score = evaluate_classifier('training.zip', classify_email)
print(f"Model AUC Score: {auc_score:.4f}")

Tips for Success

Complete Within 1 Week: Suggested Daily Schedule

While this assignment is comprehensive in scope, it's designed to be completable in 7 days. Here's a suggested timeline (students can use GenAI to accelerate implementation):

Key to Success: Use GenAI coding assistants to accelerate boilerplate code and feature engineering, but invest your time in understanding results, analyzing errors, and writing insightful analysis.

Use Version Control

Leverage GenAI Tools Effectively (Critical for 1-Week Timeline)

Since this assignment must be completed in 7 days, using AI coding assistants is essential for accelerating implementation while you focus on the analytical components.

DO use GenAI for:

DON'T use GenAI as a shortcut for:

Workflow: Generate code scaffolds with GenAI, then spend your time on data exploration, result interpretation, error categorization, and insightful analysis. The best submissions show deep understanding of the why, not just the how.

The goal is to learn ML concepts deeply while shipping a complete, well-analyzed project in 7 days.

Feature Engineering Matters

For traditional ML models:
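The specifics are up to you, but a few simple numeric signals often complement bag-of-words features. The features below are illustrative examples only (echoing the "example features to consider" in Part 1), not a required set:

import re

def handcrafted_features(email_text):
    """Extract a few simple numeric features that often separate spam from ham."""
    words = email_text.split()
    num_words = max(len(words), 1)
    return {
        "num_urls": len(re.findall(r"https?://\S+", email_text)),
        "num_exclamations": email_text.count("!"),
        "uppercase_word_ratio": sum(w.isupper() for w in words) / num_words,
        "num_digits": sum(c.isdigit() for c in email_text),
        "email_length": len(email_text),
    }

# These can be concatenated with TF-IDF features, for example via sklearn's
# FeatureUnion, or appended as extra columns to your feature matrix.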

Monitor for Overfitting

Debugging Poor Performance

If your models aren't performing well:
  1. Check data quality: Are there mislabeled examples?
  2. Verify preprocessing: Are you handling special characters and URLs correctly?
  3. Inspect predictions: Look at specific examples where the model fails.
  4. Try simpler models first: Debug Naive Bayes before attempting BERT.
  5. Check class balance: Are you predicting only the majority class?

Make Comparisons Fair

Document Everything

Resources

Spam Detection Research

Transformer Fine-Tuning

Evaluation and Metrics

Adversarial Robustness

Handling Imbalanced Data

Python Libraries

Datasets (Optional Augmentation)

If you use external datasets, you must cite them clearly in your notebook.

Submission Guidelines

GitHub Classroom Submission

This assignment is submitted via GitHub Classroom. Follow these steps:

  1. Accept the assignment: Click the assignment link provided in Canvas or by your instructor
  2. Clone your repository:
   git clone https://github.com/ContextLab/spam-classifier-llm-course-YOUR_USERNAME.git
  3. Complete your work:
    • Work in Google Colab, Jupyter, or your preferred environment
    • Save your notebook to the repository
  4. Commit and push your changes:
   git add .
   git commit -m "Complete SPAM classifier assignment"
   git push
  5. Verify submission: Check that your latest commit appears in your GitHub repository before the deadline
Deadline: January 23, 2026 at 11:59 PM EST

What to Submit

Submit one Jupyter notebook (.ipynb file) in your GitHub Classroom repository.

Notebook Requirements

Your notebook must:
  1. Run from top to bottom without errors in a clean Google Colab environment
  2. Include all necessary code for training, evaluation, and analysis
  3. Download any required data/models within the notebook (don't assume files are present)
  4. Set random seeds for reproducibility (e.g., np.random.seed(42); see the sketch after this list)
  5. Have a reasonable runtime: Full execution should complete in under 60 minutes on Colab (use DistilBERT instead of BERT-base to stay within this constraint)
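For requirement 4, a single seeding cell near the top of the notebook (as sketched below) keeps runs reproducible; the torch lines only matter if you use PyTorch:

import random
import numpy as np
import torch

SEED = 42
random.seed(SEED)          # Python's built-in RNG
np.random.seed(SEED)       # NumPy (sklearn uses this unless random_state is set)
torch.manual_seed(SEED)    # PyTorch CPU RNG
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)  # PyTorch GPU RNGs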

Organization

Structure your notebook with clear sections:
1. Introduction and Setup
  • Import libraries
  • Load data
  • Exploratory data analysis
2. Data Preprocessing
  • Train/val/test split
  • Text cleaning functions
  • Feature engineering utilities
3. Model Implementations
  • Traditional ML models (separate subsections for each)
  • Neural model (BERT/DistilBERT)
  • Ensemble method
4. Evaluation
  • Metrics computation
  • Comparison tables
  • Visualizations
  • Statistical tests
5. Error Analysis
  • Failure case examination
  • Pattern identification
  • Feature importance
6. Adversarial Testing
  • Adversarial examples
  • Robustness tests
7. Discussion and Conclusions
  • Model comparison
  • Real-world considerations
  • Reflection
8. References
  • Papers cited
  • Datasets used
  • Resources consulted

Formatting

File Naming

Name your file: LastName_FirstName_Assignment2.ipynb

Example: Smith_Jane_Assignment2.ipynb

Pre-Submission Checklist

Before submitting, verify:

Deadline

One week from assignment release (7 calendar days)

Late submissions will be penalized according to the course late policy.

Academic Integrity

While you are encouraged to use AI coding assistants and to discuss concepts with peers, you may be asked to explain your implementation decisions in office hours or during grading. Make sure you can justify your choices.

Getting Help

If you're stuck:
  1. Review the tips and resources in this document
  2. Ask specific questions in office hours or on the course forum
  3. Debug systematically: Isolate the problem, test components individually
  4. Start simple: Get a basic version working before adding complexity
Remember: The goal is to learn about text classification, evaluation, and error analysis. Don't get lost in trying to achieve the highest possible score—focus on understanding the concepts deeply.

Final Notes

This assignment is designed to be challenging but achievable within 1 week. You're expected to:

The best submissions will show:

Remember: The 7-day timeline is realistic because:
  1. GenAI can generate boilerplate code (training loops, metrics, visualizations)
  2. DistilBERT trains faster than BERT-base
  3. You can run many experiments in parallel on Colab's GPUs
  4. The most valuable insights come from analysis, not implementation time
Good luck, and enjoy building your spam classifier in a week!