Assignment 5: Build and Train a GPT Model

Due: February 16, 2026 at 11:59 PM EST

Timeline: 1 Week

Overview

In this capstone-style assignment, you will build a GPT (Generative Pre-trained Transformer) model from the ground up and train it to generate text. This assignment goes beyond using pre-trained models—you will implement the core components of a transformer architecture, understand how autoregressive language modeling works, and gain deep insights into what makes modern large language models tick.

This is a 1-week assignment designed to be ambitious yet achievable with modern GenAI tools (ChatGPT, Claude, GitHub Copilot) to assist with implementation, debugging, and optimization.

GPT models are decoder-only transformers that have revolutionized natural language processing. By building one yourself, you'll understand the intricate details of self-attention mechanisms, positional encodings, layer normalization, and the training dynamics that enable these models to generate coherent text.

This is an ambitious assignment that will challenge you to think deeply about architecture design, optimization, and evaluation. However, with modern tools and GenAI assistance, it's entirely achievable—and incredibly rewarding.

Learning Objectives

By the end of this assignment, you will be able to:

  1. Implement transformer architecture components including multi-head self-attention, feed-forward networks, layer normalization, and residual connections
  2. Understand autoregressive language modeling and how GPT models generate text one token at a time
  3. Implement causal (masked) self-attention to ensure the model can only attend to previous tokens
  4. Design and implement positional encodings to give the model a sense of token position
  5. Build a complete training pipeline including data loading, batching, loss computation, and optimization
  6. Apply different text generation strategies including greedy decoding, temperature sampling, top-k, and nucleus (top-p) sampling
  7. Analyze what the model has learned by visualizing attention patterns and token embeddings
  8. Compare your model's performance with established models like GPT-2
  9. Understand the trade-offs between model size, training time, and generation quality

Background

The Transformer Architecture

The transformer architecture, introduced in the seminal paper "Attention is All You Need" (Vaswani et al., 2017), revolutionized sequence modeling by replacing recurrent neural networks with self-attention mechanisms. GPT uses only the decoder portion of the transformer, removing the cross-attention layers used in encoder-decoder models.

Key Components

Self-Attention: The core mechanism that allows each token to attend to all previous tokens (in the case of GPT, due to causal masking). The attention mechanism computes:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k))V
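
As a reference point, here is a minimal PyTorch sketch of this formula (the function name and the (batch, heads, seq, d_k) tensor layout are illustrative assumptions, not a required interface):

   import math
   import torch
   import torch.nn.functional as F

   def scaled_dot_product_attention(q, k, v):
       """Compute softmax(QK^T / sqrt(d_k)) V for tensors shaped (batch, heads, seq, d_k)."""
       d_k = q.size(-1)
       scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # (batch, heads, seq, seq)
       weights = F.softmax(scores, dim=-1)                # attention distribution over keys
       return weights @ v                                 # weighted sum of value vectors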

Multi-Head Attention: Instead of a single attention mechanism, GPT uses multiple attention heads in parallel, each learning different types of relationships between tokens.

Causal Masking: Unlike BERT (which uses bidirectional attention), GPT applies a causal mask during attention computation to ensure that when predicting token i, the model can only attend to tokens at positions j < i. This makes the model autoregressive.
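
Extending the sketch above, the causal mask is typically applied to the score matrix before the softmax; here is one hedged way to do it (tensor shapes assumed as before):

   import math
   import torch
   import torch.nn.functional as F

   def causal_attention(q, k, v):
       """Scaled dot-product attention where position i attends only to positions j <= i."""
       T, d_k = q.size(-2), q.size(-1)
       scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
       mask = torch.tril(torch.ones(T, T, device=q.device)).bool()  # lower-triangular mask
       scores = scores.masked_fill(~mask, float("-inf"))            # block attention to future tokens
       return F.softmax(scores, dim=-1) @ v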

Position Encodings: Since the attention mechanism is position-agnostic, we must explicitly encode position information. This can be done with sinusoidal encodings (as in the original paper) or learned embeddings (as in GPT).
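
A hedged sketch of the learned-embedding approach (the module name and hyperparameters are placeholders):

   import torch
   import torch.nn as nn

   class LearnedPositionalEmbedding(nn.Module):
       """GPT-style learned position embeddings, added to the token embeddings."""
       def __init__(self, max_len, d_model):
           super().__init__()
           self.pos_emb = nn.Embedding(max_len, d_model)

       def forward(self, token_emb):
           # token_emb: (batch, seq_len, d_model)
           positions = torch.arange(token_emb.size(1), device=token_emb.device)
           return token_emb + self.pos_emb(positions)  # broadcasts over the batch dimension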

Feed-Forward Networks: After attention, each position is processed by a position-wise feed-forward network (two linear transformations with a non-linearity in between).
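
For example, a minimal sketch of such a network (GELU and d_ff = 4 * d_model are common choices, not requirements):

   import torch.nn as nn

   class FeedForward(nn.Module):
       """Position-wise feed-forward network: expand, apply a non-linearity, project back."""
       def __init__(self, d_model, d_ff):
           super().__init__()
           self.net = nn.Sequential(
               nn.Linear(d_model, d_ff),
               nn.GELU(),
               nn.Linear(d_ff, d_model),
           )

       def forward(self, x):
           return self.net(x)  # applied independently at every position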

Layer Normalization and Residual Connections: These help with training stability and gradient flow through deep networks.
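
One hedged sketch of how these pieces fit together in a pre-norm transformer block (nn.MultiheadAttention stands in here for the causal attention module you will write yourself; note the explicit causal mask):

   import torch
   import torch.nn as nn

   class TransformerBlock(nn.Module):
       """Pre-norm block: x + Attention(LN(x)), then x + FeedForward(LN(x))."""
       def __init__(self, d_model, n_heads, d_ff):
           super().__init__()
           self.ln1 = nn.LayerNorm(d_model)
           self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
           self.ln2 = nn.LayerNorm(d_model)
           self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

       def forward(self, x):
           T = x.size(1)
           future = torch.triu(torch.ones(T, T, device=x.device), diagonal=1).bool()
           h = self.ln1(x)
           attn_out, _ = self.attn(h, h, h, attn_mask=future)  # True entries are masked out
           x = x + attn_out                                    # residual connection around attention
           x = x + self.ff(self.ln2(x))                        # residual connection around feed-forward
           return x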

The GPT Family

Your Task

You will implement, train, and analyze a GPT-style model. The assignment is divided into several interconnected tasks:

1. Implement Transformer Components

Build the core components of the GPT architecture:

a) Multi-Head Self-Attention with Causal Masking
b) Position-wise Feed-Forward Network
c) Positional Encoding
d) Transformer Block
e) GPT Model
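
As a starting point for part (e), here is one hedged skeleton that strings the components together (it reuses the TransformerBlock sketched in the Background section; all hyperparameter names are placeholders):

   import torch
   import torch.nn as nn

   class GPT(nn.Module):
       """Minimal GPT: token + position embeddings, a stack of blocks, and a language-model head."""
       def __init__(self, vocab_size, max_len, d_model, n_heads, d_ff, n_layers):
           super().__init__()
           self.tok_emb = nn.Embedding(vocab_size, d_model)
           self.pos_emb = nn.Embedding(max_len, d_model)
           self.blocks = nn.ModuleList(
               [TransformerBlock(d_model, n_heads, d_ff) for _ in range(n_layers)]  # see Background sketch
           )
           self.ln_f = nn.LayerNorm(d_model)
           self.lm_head = nn.Linear(d_model, vocab_size, bias=False)

       def forward(self, idx):
           # idx: (batch, seq_len) of token ids
           positions = torch.arange(idx.size(1), device=idx.device)
           x = self.tok_emb(idx) + self.pos_emb(positions)
           for block in self.blocks:
               x = block(x)
           return self.lm_head(self.ln_f(x))  # logits over the vocabulary at every position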

2. Implement the Training Pipeline

a) Data Preparation
b) Training Loop
c) Hyperparameter Selection
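
For parts (b) and (c), here is a hedged sketch of a next-token training loop (the function signature, the get_batch helper, and the hyperparameter values are assumptions for you to replace):

   import torch
   import torch.nn.functional as F

   def train(model, get_batch, num_steps=5000, lr=3e-4):
       """get_batch(split) is assumed to return (x, y) token-id tensors of shape (batch, seq_len),
       where y is x shifted left by one position."""
       optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
       for step in range(num_steps):
           x, y = get_batch("train")
           logits = model(x)                                   # (batch, seq_len, vocab_size)
           loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
           optimizer.zero_grad(set_to_none=True)
           loss.backward()
           torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # guard against exploding gradients
           optimizer.step()
           if step % 500 == 0:
               print(f"step {step}: train loss {loss.item():.3f}")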

3. Generate Text with Multiple Sampling Strategies

Implement and compare different text generation methods:

a) Greedy Decoding: Always pick the most likely next token

b) Temperature Sampling: Sample from the probability distribution with adjustable temperature (higher = more random)

c) Top-k Sampling: Sample from only the k most likely tokens

d) Nucleus (Top-p) Sampling: Sample from the smallest set of tokens whose cumulative probability exceeds p

Generate samples with each method and analyze the quality, diversity, and coherence of the outputs. What works best for your model and dataset?
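
To give a sense of how these strategies slot into a generation loop, here is a hedged sketch combining temperature and top-k (names and defaults are illustrative; nucleus sampling follows the same pattern using cumulative probabilities over the sorted distribution):

   import torch
   import torch.nn.functional as F

   @torch.no_grad()
   def generate(model, idx, max_new_tokens, temperature=1.0, top_k=None):
       """Autoregressively extend idx, a (batch, seq_len) tensor of token ids."""
       for _ in range(max_new_tokens):
           logits = model(idx)[:, -1, :] / temperature      # logits for the final position only
           if top_k is not None:
               kth = torch.topk(logits, top_k).values[:, [-1]]
               logits = logits.masked_fill(logits < kth, float("-inf"))  # keep the k most likely tokens
           probs = F.softmax(logits, dim=-1)
           next_token = torch.multinomial(probs, num_samples=1)          # greedy decoding would use argmax
           idx = torch.cat([idx, next_token], dim=1)
       return idx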

4. Analyze Learned Representations

a) Attention Pattern Visualization
b) Token Embedding Analysis
c) Probing Tasks (Optional)
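
For part (a), one hedged way to plot a single head's attention weights as a heatmap (it assumes you have extracted the post-softmax weights as a (seq_len, seq_len) tensor):

   import matplotlib.pyplot as plt

   def plot_attention(attn_weights, tokens):
       """Heatmap of one head's attention: rows are query positions, columns are key positions."""
       fig, ax = plt.subplots(figsize=(6, 6))
       ax.imshow(attn_weights.detach().cpu().numpy(), cmap="viridis")
       ax.set_xticks(range(len(tokens)))
       ax.set_yticks(range(len(tokens)))
       ax.set_xticklabels(tokens, rotation=90)
       ax.set_yticklabels(tokens)
       ax.set_xlabel("Key position")
       ax.set_ylabel("Query position")
       plt.show()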

5. Compare with Pre-trained Models

a) Load a Pre-trained Model
b) Quantitative Comparison
c) Qualitative Analysis
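
For part (a), GPT-2 is available through the HuggingFace transformers library; a minimal sketch of loading it and sampling a continuation (the prompt and decoding settings are placeholders):

   from transformers import GPT2LMHeadModel, GPT2TokenizerFast

   tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
   model = GPT2LMHeadModel.from_pretrained("gpt2")
   model.eval()

   inputs = tokenizer("The transformer architecture", return_tensors="pt")
   outputs = model.generate(**inputs, max_new_tokens=30, do_sample=True, top_p=0.9,
                            pad_token_id=tokenizer.eos_token_id)
   print(tokenizer.decode(outputs[0], skip_special_tokens=True))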

6. Dataset Selection

Choose one of the following datasets or propose your own:

a) Code Generation
b) Stories/Creative Writing
c) Domain-Specific Text
d) Dialogue
e) Shakespeare or Classic Literature

Choose a dataset that interests you and that's appropriate for the model size you can feasibly train. Remember: you don't need billions of tokens to build something impressive!
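
If you go the character-level route (e.g., Shakespeare), here is a hedged sketch of the data pipeline (the file path, split ratio, and batch settings are placeholders):

   import torch

   text = open("input.txt", encoding="utf-8").read()    # placeholder path to your corpus
   chars = sorted(set(text))
   stoi = {ch: i for i, ch in enumerate(chars)}         # character -> integer id
   data = torch.tensor([stoi[ch] for ch in text], dtype=torch.long)

   n = int(0.9 * len(data))                             # 90/10 train/validation split
   train_data, val_data = data[:n], data[n:]

   def get_batch(split, block_size=128, batch_size=32):
       """Sample random contiguous chunks; targets are the inputs shifted by one character."""
       source = train_data if split == "train" else val_data
       ix = torch.randint(len(source) - block_size - 1, (batch_size,))
       x = torch.stack([source[i:i + block_size] for i in ix])
       y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
       return x, y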

Implementation Options

You have flexibility in how you approach this assignment:

Option 1: Implementing from Scratch

Implement everything from first principles using PyTorch or JAX.

Option 2: Using nanoGPT as a Starting Point

Andrej Karpathy's nanoGPT provides a clean, minimal implementation:

Option 3: Using HuggingFace Transformers

Use the HuggingFace library but implement key components yourself.

Whichever option you choose, you must demonstrate deep understanding of how the model works. Simply calling library functions without explanation is insufficient.

Technical Requirements

Model Specifications

Your model should have at minimum:

Training Requirements

Code Quality

Deliverables

Submit a Google Colaboratory notebook that includes:

1. Implementation (40%)

2. Training Results (20%)

3. Generated Examples (15%)

4. Analysis and Visualization (15%)

5. Comparison with Baseline (10%)

6. Markdown Explanations Throughout

Evaluation Criteria

Your assignment will be evaluated on:

  1. Correctness of Implementation (30%)
    • Model architecture correctly implements GPT design
    • Attention mechanisms properly use causal masking
    • Training loop correctly computes loss and updates parameters
    • No critical bugs or errors
  2. Quality of Training and Results (25%)
    • Model successfully trains and converges
    • Reasonable hyperparameter choices
    • Generated text demonstrates learning
    • Proper evaluation on validation set
  3. Depth of Analysis (25%)
    • Thoughtful examination of attention patterns
    • Meaningful visualization and interpretation
    • Insightful comparison with baseline model
    • Understanding of model behavior and limitations
  4. Code Quality and Documentation (10%)
    • Clean, well-organized code
    • Comprehensive markdown explanations
    • Clear documentation of design choices
    • Reproducible results
  5. Creativity and Insight (10%)
    • Interesting dataset choice or experiments
    • Novel visualizations or analyses
    • Thoughtful discussion of results
    • Extensions beyond basic requirements

Extensions and Bonus Challenges

If you finish the core assignment and want to push further, consider these extensions:

Architecture Modifications

Advanced Training

Advanced Evaluation

Specialized Applications

Interpretability

Resources

Essential Reading

Video Resources

Code Resources

Additional Papers

Tools and Libraries

Tips for Success (1-Week Timeline)

  1. Use GenAI aggressively: Claude, ChatGPT, and GitHub Copilot are your friends. Use them to:
    • Implement transformer components you haven't built before
    • Debug errors and understand PyTorch behavior
    • Write boilerplate code and data processing pipelines
    • Explain unfamiliar concepts or papers
  2. Start small: Begin with a tiny model (2-4 layers, small hidden size) to debug your implementation quickly
  3. Validate components: Test each component individually before assembling the full model
  4. Monitor carefully: Watch for NaN losses, exploding gradients, or other training instabilities
  5. Use gradient clipping: This prevents exploding gradients in early training
  6. Overfit a small batch first: Ensure your model can memorize a tiny amount of data before scaling up (see the sketch after this list)
  7. Compare with references: If results seem off, compare your implementation with nanoGPT or other references
  8. Save often: Checkpointing is critical—you don't want to lose hours of training
  9. Document everything: Future you (and the grader) will thank you for clear explanations
  10. Have fun: This is a challenging but incredibly rewarding assignment!
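
For tips 5 and 6, here is a hedged sketch of the single-batch sanity check (model and get_batch are assumed to come from your own pipeline); if the loss does not fall close to zero, there is probably a bug:

   import torch
   import torch.nn.functional as F

   def overfit_one_batch(model, batch, steps=500, lr=3e-4):
       """Repeatedly train on one fixed batch; a correct model should memorize it."""
       x, y = batch
       optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
       for step in range(steps):
           logits = model(x)
           loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
           optimizer.zero_grad(set_to_none=True)
           loss.backward()
           torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # tip 5: gradient clipping
           optimizer.step()
           if step % 100 == 0:
               print(f"step {step}: loss {loss.item():.3f}")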

Submission

GitHub Classroom Submission

This assignment is submitted via GitHub Classroom. Follow these steps:

  1. Accept the assignment: Click the assignment link provided in Canvas or by your instructor
  2. Clone your repository:
   git clone https://github.com/ContextLab/gpt-llm-course-YOUR_USERNAME.git
  3. Complete your work:
    • Work in Google Colab, Jupyter, or your preferred environment
    • Save your notebook to the repository
  4. Commit and push your changes:
   git add .
   git commit -m "Complete GPT assignment"
   git push
  5. Verify submission: Check that your latest commit appears in your GitHub repository before the deadline

Deadline: February 16, 2026 at 11:59 PM EST

Notebook Requirements

Submit a single Google Colaboratory notebook that:

  1. Runs without errors on a clean Colab instance with GPU runtime
  2. Automatically downloads/installs any required dependencies
  3. Can load your trained model checkpoint (upload to Google Drive or HuggingFace Hub)
  4. Contains comprehensive markdown cells explaining every step
  5. Includes all code for implementation, training, generation, and analysis
  6. Shows all visualizations and results inline
  7. Demonstrates clear understanding of transformer architecture and training dynamics

Note on Training: If your model takes too long to train in the notebook, you can train it separately and load the checkpoint in your notebook. However, your notebook should include all training code and explain your training process thoroughly.

Academic Integrity

You may:

You must:

Do not:

This assignment is about learning. Use all available tools to learn deeply, but ensure the final submission represents your own understanding and effort.

Good luck! Building a GPT model is a rite of passage for anyone serious about understanding modern AI. Enjoy the journey, and don't hesitate to reach out if you get stuck.