Ready to Start?

Due: March 9, 2026 at 11:59 PM EST

Final Project: Your LLM Research Capstone

Overview

The final project is your opportunity to synthesize everything you've learned throughout this course into an ambitious, cutting-edge project that showcases your mastery of large language models and conversational AI. This is a capstone experience where you'll tackle a challenging problem that would be nearly impossible in a traditional course—but is achievable with the help of modern GenAI tools like Claude, ChatGPT, and GitHub Copilot.

Unlike the structured problem sets, this project is open-ended and research-oriented. You have the creative freedom to explore novel applications, replicate and extend published research, build innovative systems, or conduct rigorous evaluations of LLM capabilities. The goal is to produce work that is intellectually sophisticated, technically ambitious, and potentially publishable or deployable in the real world.

This is your chance to:

Explore a topic you're genuinely passionate about
Push the boundaries of what you thought possible
Create something meaningful that extends beyond the classroom
Build a portfolio piece that demonstrates your AI/ML capabilities
Potentially contribute to ongoing research in the field

You will work in teams of 2-3 students over 4 weeks (Weeks 7-10), leveraging your combined skills and interests to tackle problems that would be challenging for any individual. The collaborative nature of this project mirrors real-world research and industry practices.

Learning Objectives

By completing this project, you will:

Synthesize course concepts: Integrate knowledge from all assignments—from rule-based systems (ELIZA) to modern transformers (GPT), from embeddings to fine-tuning.
Design and execute research: Formulate research questions, design methodologies, implement solutions, and critically evaluate results.
Master LLM tools and frameworks: Work extensively with Hugging Face, OpenAI/Anthropic APIs, PyTorch, and other modern ML frameworks.
Leverage GenAI for ambitious projects: Use AI coding assistants to implement complex systems that would traditionally take months.
Communicate technical work: Present findings clearly through code, visualizations, presentations, and written reports.
Think critically about AI: Evaluate limitations, biases, ethical implications, and societal impacts of your work.
Collaborate effectively: Work in a team to divide tasks, integrate components, and produce cohesive results.

Project Scope

Your project should represent approximately 4 weeks of focused work for a team of 2-3 students (Weeks 7-10). This translates to roughly 15-20 hours per person per week, though the actual time investment may vary based on your background and the project's complexity.

Scale and Ambition

With GenAI assistance, you can tackle projects that would have been impossible just a few years ago. However, be realistic about scope:

Appropriate scale:

Fine-tuning a model on a custom dataset for a specific task
Building a multi-component system (e.g., RAG with custom embeddings and retrieval)
Replicating a recent paper and extending it with new experiments
Creating a novel evaluation framework for LLM capabilities
Developing an agent-based system with multiple tools and reasoning components

Too small:

Simple prompt engineering exercises
Basic API wrapper applications
Straightforward classification tasks using existing models
Tutorials or literature reviews without implementation

Too ambitious (probably):

Training a foundation model from scratch
Building a production-scale application with full frontend/backend
Projects requiring weeks of data collection
Multiple independent research questions that should be separate projects

Finding the Right Balance

A good rule of thumb: your project should involve at least 3-4 substantial technical components (e.g., custom data processing + model fine-tuning + evaluation framework + analysis), where each component requires thoughtful design and implementation.

Project Types

Your project should fall into one or more of the following categories:

1. Novel Applications of LLMs

Design and implement a new application that leverages LLMs in creative ways. Focus on solving a real problem or creating something genuinely useful. Examples:

Domain-specific assistants (medical diagnosis helper, legal research tool, educational tutor)
Creative tools (story co-writer, game master, music lyric generator)
Accessibility applications (sign language translation, simplified text generation)
Professional tools (code documentation generator, literature review assistant)

2. Research Replication and Extension

Reproduce results from a recent research paper, then extend the work with new experiments, datasets, or analyses. Examples:

Replicate a fine-tuning approach from a paper, then test on new domains
Reproduce an evaluation study, then extend with additional models or metrics
Implement a published architecture, then ablate components to understand contributions

3. New Model Architectures or Training Approaches

Explore modifications to existing architectures or novel training procedures. This is advanced but achievable with GenAI help. Examples:

Modified attention mechanisms for specific tasks
Novel fine-tuning approaches (e.g., parameter-efficient methods)
Hybrid architectures combining different model types
Custom loss functions or training objectives

4. Analysis and Evaluation of LLMs

Conduct rigorous empirical studies of LLM capabilities, limitations, or behaviors. Examples:

Systematic evaluation of reasoning abilities across multiple models
Bias analysis in specific domains or demographic groups
Consistency and reliability testing under various conditions
Comparative studies of different prompting strategies or model families

5. Multimodal and Agent-Based Systems

Build systems that integrate multiple modalities (text, image, audio) or implement autonomous agents with reasoning and tool use. Examples:

Vision-language systems for image description or visual question answering
Agents that can plan, use tools, and complete complex tasks
Multi-agent systems with specialized roles and communication
Cross-modal retrieval or generation systems

6. Safety, Alignment, and Ethics

Investigate critical issues around AI safety, bias, fairness, or societal impact. Examples:

Jailbreaking detection and mitigation strategies
Bias measurement and debiasing techniques
Factuality and hallucination analysis
Privacy and security considerations in LLM applications

Project Ideas

Here are 20 diverse project ideas to inspire you. These span different difficulty levels and interest areas. Feel free to use these as starting points, combine elements, or create something entirely new.

RAG and Retrieval Systems

Domain-Specific RAG System: Build a retrieval-augmented generation system for a specialized domain (e.g., medical literature, legal documents, historical archives). Implement custom embeddings, retrieval strategies, and evaluate against baselines.
Conversational Search Engine: Create a multi-turn conversational search system that maintains context, asks clarifying questions, and provides source citations. Compare different retrieval and reranking strategies.
Hybrid Retrieval Architecture: Combine dense embeddings, sparse retrieval (BM25), and knowledge graphs for enhanced retrieval. Systematically evaluate each component's contribution.

Fine-Tuning and Adaptation

Low-Resource Language Adaptation: Fine-tune an LLM for a low-resource language or dialect using limited data. Explore techniques like cross-lingual transfer, data augmentation, and parameter-efficient fine-tuning.
Scientific Writing Assistant: Fine-tune a model to help researchers write better papers by learning from high-quality publications in a specific field. Implement style transfer and citation generation.
Personalized Conversation Models: Develop methods to adapt dialogue models to individual user preferences and communication styles while preserving helpfulness and safety.

Dialogue and Interaction

Multi-Turn Reasoning Dialogues: Build a conversational system that can engage in extended reasoning tasks (math problems, logic puzzles, strategic planning) while explaining its thought process.
Empathetic Support Chatbot: Create a psychologically-informed chatbot for emotional support or mental health check-ins. Evaluate empathy, safety, and effectiveness using human evaluation and automated metrics.
Socratic Tutoring System: Implement an educational assistant that uses Socratic questioning to guide students through problem-solving rather than providing direct answers.

Code and Technical Applications

Code Documentation Generator: Build a system that generates comprehensive documentation for code repositories, including docstrings, README files, and tutorials. Test on diverse codebases and programming languages.
Bug Detection and Repair: Develop an LLM-based system for finding and fixing bugs in code. Create a benchmark and compare against existing tools.
Technical Interview Practice System: Create an interactive system that conducts realistic technical interviews, provides feedback, and adapts difficulty based on performance.

Creative and Cultural Applications

Interactive Fiction Engine: Build a system for collaborative storytelling where the model maintains narrative coherence, character consistency, and player agency across long interactions.
Cross-Cultural Communication Assistant: Develop a tool that helps bridge cultural communication gaps by explaining context, suggesting phrasing, and highlighting potential misunderstandings.
Historical Dialogue Simulation: Create historically-accurate conversational agents representing figures from different time periods, trained on period-appropriate texts and evaluated for authenticity.

Cognitive Science and Modeling

LLMs as Cognitive Models: Systematically evaluate whether LLMs exhibit human-like cognitive biases, heuristics, and reasoning patterns. Compare model behavior to psychological literature.
Language Acquisition Simulation: Investigate how LLMs "learn" linguistic structures by analyzing learning trajectories, comparing to child language acquisition data.
Theory of Mind in LLMs: Design experiments to test whether models exhibit theory of mind capabilities—understanding beliefs, intentions, and mental states of others.

Interpretability and Analysis

Mechanistic Interpretability Study: Use techniques like attention visualization, activation probing, and causal interventions to understand how models perform specific tasks (e.g., factual recall, arithmetic, syntax).
Failure Mode Taxonomy: Systematically categorize and analyze LLM failures across multiple models and tasks. Develop predictors for when failures occur and potential mitigations.

Advanced Reasoning and Agents

Multi-Step Planning Agent: Build an agent that can decompose complex goals, create plans, execute actions with tools, and adapt based on feedback. Test on challenging benchmarks.
Scientific Hypothesis Generator: Create a system that reads research papers, identifies gaps, and proposes testable hypotheses. Evaluate novelty and plausibility with domain experts.

Requirements

Your final project must include three key deliverables:

1. Code Implementation (Due Week 10)

Submit a well-documented Jupyter notebook that:

Runs completely in Google Colaboratory without errors
Includes clear markdown explanations of your approach, decisions, and findings
Contains all code necessary to reproduce your results
Automatically downloads/loads any required datasets, models, or resources
Follows best practices: modular code, informative comments, reproducible results
Includes visualizations, tables, and figures to illustrate key findings
Provides clear instructions for running experiments or using your system

Code quality matters: Your notebook should be something you'd be proud to share publicly or include in a portfolio.

2. Presentation (Due Week 10)

Create a 10-12 minute video presentation and present it in class:

Video submission: Upload to YouTube (unlisted is fine) and include the link in your writeup
In-class component: Be prepared to present your video and lead a 5-10 minute discussion/Q&A
Content: Cover motivation, approach, key results, insights, and limitations
Visuals: Use slides, demos, visualizations—make it engaging and clear
Team participation: All team members should contribute to the presentation

The presentation should be accessible to your classmates—assume they're smart but may not know the specific technical details of your domain.

3. Written Writeup (Due Week 10)

Submit a 2-5 page writeup (not including references or appendices) structured as:

Introduction (0.5-1 page)

Motivation and background
Research questions or goals
Why this problem is interesting/important

Approach (1-2 pages)

Methods, models, datasets used
Technical details of your implementation
Design decisions and justifications

Results (1-1.5 pages)

Key findings, with figures and tables
Quantitative metrics and qualitative observations
Comparison to baselines or related work

Discussion (0.5-1 page)

Interpretation of results
Limitations and potential improvements
Broader implications
Future directions

References

Cite relevant papers, tools, datasets
Follow a consistent citation format (APA, IEEE, etc.)

Write clearly and concisely. Think of this as a mini research paper suitable for a workshop or course anthology.

Timeline and Milestones

Weeks 7–8: Team Formation and Brainstorming

Form teams of 2-3 students
Brainstorm project ideas
Review recent papers and existing systems for inspiration
Identify datasets, models, and tools needed
Create timeline and divide initial responsibilities
Consider meeting with the instructor early on to ensure your project is appropriately scoped
Deliverable: Team formed, project idea identified

Week 9: Core Development and Iteration

Implement main technical components
Run experiments and collect results
Iterate based on findings—adjust approaches as needed
Begin drafting writeup and presentation materials
Deliverable: Major progress on implementation, preliminary results

Week 10: Final Push and Presentations

Finalize code and ensure reproducibility
Complete all experiments and analysis
Finish writeup and presentation video
Present to class and engage in discussions
Deliverable: All final materials submitted, presentation delivered

Grading Rubric

Your project will be evaluated holistically, with the final grade broken down as:

Implementation (45%)

Technical quality: Is the code well-written, documented, and reproducible?
Sophistication: Does it demonstrate mastery of course concepts?
Functionality: Does it work as intended?
Reproducibility: Can results be replicated in Colab?

Results and Analysis (25%)

Rigor: Are experiments well-designed and thorough?
Insights: Do results provide meaningful insights?
Evaluation: Are appropriate metrics and baselines used?
Critical thinking: Are limitations and interpretations discussed thoughtfully?

Presentation (15%)

Clarity: Is the presentation clear and well-organized?
Engagement: Is it interesting and accessible?
Completeness: Does it cover all key aspects?
Participation: Did all team members contribute?

Writeup (10%)

Writing quality: Is it clear, concise, and well-structured?
Completeness: Does it cover all required sections?
Depth: Does it provide sufficient detail and analysis?
References: Are sources properly cited?

Teamwork and Collaboration (5%)

Distribution of work: Did all team members contribute meaningfully?
Integration: Are components well-integrated into a cohesive whole?
Process: Did the team work effectively together?

Note: Exceptional projects that go above and beyond expectations may receive bonus points. Projects that demonstrate novel insights, publishable quality, or significant real-world applicability will be highlighted.

Resources and Support

Course Materials

Review all assignments and lectures—they contain techniques, tools, and insights directly applicable to your project:

Assignment 1 (ELIZA): Rule-based systems, pattern matching, conversation management
Assignment 2 (SPAM): Classification, evaluation metrics, text preprocessing
Assignment 3 (Wikipedia): Embeddings (LDA, BERT, GPT-2, Llama), clustering, dimensionality reduction (PCA, UMAP)
Assignment 4 (Customer Service): Dialogue systems, context management, response generation
Assignment 5 (GPT): Model architectures, training from scratch, transformer mechanics

Key Tools and Frameworks

Hugging Face Transformers: Pre-trained models, tokenizers, fine-tuning pipelines
Hugging Face Datasets: Access to thousands of datasets
OpenAI API / Anthropic API: Access to GPT-4, Claude, and other commercial models
LangChain: Building LLM applications, chains, and agents
FAISS / Pinecone: Vector databases for retrieval systems
PyTorch / TensorFlow: Deep learning frameworks
Weights & Biases / TensorBoard: Experiment tracking and visualization
Ollama: Running open-source models locally

Finding Datasets

Recent Research

Stay current with cutting-edge work:

arXiv cs.CL (Computation and Language)
arXiv cs.AI (Artificial Intelligence)
Papers with Code (Latest research with code)
ACL Anthology (NLP conference papers)
NeurIPS, ICML, ICLR (ML conferences)

Getting Help

Discord: Use our course Discord for questions, discussions, and collaboration
Office Hours: Schedule time to discuss project ideas, troubleshoot issues, or get feedback
Peers: Discuss ideas with other teams—collaboration across teams is encouraged
GenAI Tools: Use ChatGPT, Claude, GitHub Copilot extensively—document what you use

Tips for Success

Start Early

Don't underestimate the time required. Even with GenAI assistance, projects take longer than expected. Starting early gives you time to iterate, handle unexpected challenges, and produce polished results.

Embrace Failure and Iteration

Not everything will work on the first try. That's normal in research. Build in time to try multiple approaches, debug issues, and refine your methods based on what you learn.

Use GenAI Aggressively

This is where you can be truly ambitious. Use tools like Claude, ChatGPT, and GitHub Copilot to:

Implement complex architectures you haven't built before
Write data processing pipelines quickly
Debug tricky errors
Generate boilerplate code
Understand unfamiliar libraries or papers
Brainstorm ideas and approaches

Remember: You're responsible for understanding and validating what AI generates. Always test, verify, and critically evaluate AI-generated code.

Scope Appropriately

It's better to do one thing really well than three things poorly. If you find your project is too ambitious, narrow the focus rather than delivering incomplete work.

Communicate Clearly

Technical sophistication is important, but so is clear communication. Make sure your code, writeup, and presentation are understandable to intelligent non-experts.

Document Everything

Keep track of experiments, decisions, and results as you go. This makes writing the final report much easier and helps you understand what worked and why.

Leverage Your Team's Strengths

Different team members may excel at different aspects (coding, experimentation, writing, visualization). Divide labor strategically but ensure everyone understands the full project.

Make It Yours

Choose a project you're genuinely excited about. Passion and curiosity will sustain you through challenges and lead to better results.

Example Projects from Prior Years

Note: As this is a new course, we don't yet have prior student projects to showcase. However, here are examples of the caliber of work we hope to see:

Cognitive bias evaluation framework: Systematic testing of whether LLMs exhibit human-like cognitive biases, with comparisons across model families
Domain-adapted medical QA system: Fine-tuned model for answering medical questions with retrieval augmentation, evaluated against human experts
Interactive mathematical reasoning tutor: Socratic dialogue system that guides students through proofs and problem-solving
Multimodal scientific figure understanding: System that can interpret graphs, diagrams, and figures from research papers and answer questions about them
Debate agent with evidence retrieval: Autonomous system that can argue positions on controversial topics by retrieving and citing evidence

Your project can match or exceed these examples. With modern tools and GenAI assistance, sophisticated projects are within reach.

Submission Guidelines

GitHub Classroom Submission

This assignment is submitted via GitHub Classroom. Follow these steps:

Accept the assignment: Click the assignment link provided in Canvas or by your instructor
- Repository: github.com/ContextLab/final-project-llm-course
- This creates your own private repository for the assignment
Clone your repository:

   git clone https://github.com/ContextLab/final-project-llm-course-YOUR_USERNAME.git

Complete your work:
- Work in Google Colab, Jupyter, or your preferred environment
- Save your notebook, writeup, and presentation materials to the repository
Commit and push your changes:

   git add .
   git commit -m "Complete final project"
   git push

Verify submission: Check that your latest commit appears in your GitHub repository before the deadline

Deadlines

Final Submission (Code, Writeup, Presentation): March 9, 2026 at 11:59 PM EST

Technical Requirements

Google Colaboratory Compatibility

Your project must run in Google Colab. This means:

Use publicly accessible datasets (or include download code)
Include all installation commands in the notebook (!pip install ...)
Don't rely on local files or proprietary data
Test in a fresh Colab session before submission
Consider compute constraints (RAM, GPU availability)
Use !nvidia-smi to check GPU allocation if needed

Model and Data Accessibility

Use open-source models (Hugging Face) or provide API keys instructions
For large models, use quantized versions or API access
Include code to automatically download datasets
Document any manual setup steps clearly

Reproducibility

Set random seeds for reproducible results
Include version numbers for key libraries
Save and load model checkpoints appropriately
Document any hyperparameters used

Documentation

Use markdown cells liberally to explain your approach
Include docstrings for functions
Comment complex code sections
Provide a "Quick Start" section at the top of your notebook
Include a table of contents for longer notebooks

Final Thoughts

This final project is your chance to create something remarkable. You've learned about the full spectrum of language models—from simple pattern matching to sophisticated transformers. You've implemented these systems, evaluated them, and thought critically about their capabilities and limitations.

Now it's time to apply that knowledge to a problem you care about. With GenAI tools at your disposal, you can tackle projects that would have been impossible for a student team just a few years ago. Take advantage of this unique moment in history.

The best projects will:

Demonstrate technical sophistication and mastery
Provide novel insights or create genuinely useful systems
Show critical thinking about AI capabilities and limitations
Be executed with care, rigor, and attention to detail
Communicate results clearly and engagingly

We're excited to see what you create. Good luck, and remember: ambitious goals combined with thoughtful execution lead to extraordinary results.

Questions? Use the course Discord or attend office hours. We're here to help you succeed.

Ready to start? Form your team, start brainstorming, and prepare to build something amazing.