Assignment 4: Context-Aware Customer Service Chatbot
Timeline: 1 Week
Overview
In this assignment, you will build a sophisticated, context-aware customer service chatbot that uses modern transformer models and retrieval-augmented generation (RAG) techniques. Unlike traditional rule-based chatbots, your system will leverage semantic understanding to match customer queries with relevant knowledge base entries and generate contextually appropriate responses.
This is a 1-week assignment designed to be achievable with GenAI assistance (ChatGPT, Claude, GitHub Copilot). You can focus on system design, integration, and evaluation rather than getting bogged down in low-level implementation details.
You will implement a complete RAG pipeline that:
- Uses transformer-based encoders (BERT or similar) to understand customer queries semantically
- Performs efficient semantic search over a knowledge base using vector similarity
- Retrieves relevant context to ground responses in factual information
- Handles multi-turn conversations while maintaining context
- Compares your approach against simpler keyword-matching baselines
Learning Objectives
By completing this assignment, you will develop the following skills:
- Semantic Understanding: Apply transformer-based models (BERT, Sentence-BERT) to encode text into meaningful vector representations
- Information Retrieval: Implement efficient semantic search using vector similarity and libraries like FAISS
- Retrieval-Augmented Generation (RAG): Combine retrieval and generation to produce grounded, factual responses
- Evaluation Design: Develop metrics to assess chatbot quality, including retrieval accuracy and response relevance
- System Architecture: Design and implement a complete end-to-end conversational AI system
- Baseline Comparison: Understand the importance of baselines by comparing against keyword-matching approaches
- Production Considerations: Handle edge cases, multi-turn conversations, and system scalability
Background
Context-Aware Language Understanding
Traditional customer service systems often rely on keyword matching or simple pattern recognition (like your Assignment 1 ELIZA chatbot). However, customers express the same need in many different ways:
- "I can't log into my account"
- "My password isn't working"
- "I'm having authentication issues"
- "The login page keeps rejecting my credentials"
Transformer-based encoders handle this variation through:
- Contextualized Embeddings: BERT produces vector representations where semantically similar text has similar vectors
- Semantic Similarity: Using cosine similarity or other distance metrics to find relevant knowledge base entries
- Dense Retrieval: Unlike sparse keyword methods (TF-IDF, BM25), dense vector representations capture deeper semantic meaning
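As a minimal illustration of the semantic-similarity idea above, the sketch below computes cosine similarity between hand-made toy vectors. The vectors and their names are invented for illustration; real sentence embeddings have hundreds of dimensions and come from an encoder model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Toy 3-dimensional "embeddings" (made up; not produced by a real model)
login_issue = [0.9, 0.1, 0.0]
password_issue = [0.8, 0.2, 0.1]
return_policy = [0.0, 0.2, 0.9]

sim_related = cosine_similarity(login_issue, password_issue)
sim_unrelated = cosine_similarity(login_issue, return_policy)
print(f"login vs password: {sim_related:.3f}")
print(f"login vs returns:  {sim_unrelated:.3f}")
```

The two account-related vectors score much higher than the unrelated pair, which is the property dense retrieval relies on.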
Retrieval-Augmented Generation (RAG)
RAG systems combine the strengths of retrieval (finding relevant information) and generation (producing natural language). The typical pipeline:
- Encode: Convert the user query into a vector representation
- Retrieve: Find the most similar entries in your knowledge base
- Augment: Include retrieved context with the query
- Generate: Produce a response grounded in the retrieved information
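The four steps above can be sketched as one function with pluggable parts. Everything prefixed with `toy_` is a hypothetical stand-in (word-overlap "encoding", an echo "generator") so the example runs without downloading a model; in your assignment these would be a sentence encoder, a vector index, and an LLM or template.

```python
def rag_answer(query, knowledge_base, encode, similarity, generate, k=2):
    """Run encode -> retrieve -> augment -> generate with pluggable parts."""
    # 1. Encode the query
    query_vec = encode(query)
    # 2. Retrieve the k most similar knowledge base entries
    scored = [(similarity(query_vec, encode(e["question"])), e) for e in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top_entries = [entry for _, entry in scored[:k]]
    # 3. Augment: put the retrieved answers next to the query
    context = "\n".join(f"- {e['answer']}" for e in top_entries)
    prompt = f"Customer question: {query}\nRelevant information:\n{context}"
    # 4. Generate a response grounded in that context
    return generate(prompt), top_entries


# --- Toy stand-ins (assumptions for illustration, not semantic models) ---
def toy_encode(text):
    return set(text.lower().replace("?", "").split())


def toy_similarity(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0  # Jaccard overlap


def toy_generate(prompt):
    # Echoes the retrieved context instead of calling an LLM
    return "Based on our records:\n" + prompt.split("Relevant information:\n", 1)[1]


kb = [
    {"question": "How do I reset my password?",
     "answer": "Click 'Forgot Password' on the login page."},
    {"question": "What is your return policy?",
     "answer": "We offer a 30-day return policy for most items."},
]

answer, sources = rag_answer(
    "how can I reset my password", kb, toy_encode, toy_similarity, toy_generate, k=1
)
print(answer)
```

Passing the components as arguments is one possible design; it makes it easy to swap the semantic encoder for the keyword baseline later.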
Key Papers and Concepts
- BERT (Devlin et al., 2018): Bidirectional encoder representations from transformers
- Sentence-BERT (Reimers & Gurevych, 2019): Modified BERT for efficient sentence embeddings
- RAG (Lewis et al., 2020): Retrieval-augmented generation for knowledge-intensive tasks
- Dense Passage Retrieval (Karpukhin et al., 2020): Using dense representations for passage retrieval
Dataset
You will use a customer service FAQ dataset. We recommend one of the following options:
Option 1: HuggingFace Dataset (Recommended)
Use the customer_support_twitter dataset or similar customer service datasets from HuggingFace:
from datasets import load_dataset
# Load customer support conversations
dataset = load_dataset("salesken/customer_support_twitter")
Alternatively, explore these datasets:
- salesken/customer_support_twitter: Real customer support conversations
- banking77: Banking customer service intents
- SetFit/customer_support: Multi-domain customer support dataset
Option 2: Create Your Own Knowledge Base
You can create a synthetic knowledge base for a specific domain (e-commerce, banking, tech support, etc.):
knowledge_base = [
    {
        "question": "How do I reset my password?",
        "answer": "To reset your password, click 'Forgot Password' on the login page. Enter your email address, and we'll send you a reset link. Follow the link to create a new password.",
        "category": "account_access"
    },
    {
        "question": "What is your return policy?",
        "answer": "We offer a 30-day return policy for most items. Products must be in original condition with tags attached. Refunds are processed within 5-7 business days.",
        "category": "returns"
    },
    # Add 100+ entries for a meaningful knowledge base
]
Option 3: Web Scraping
Scrape FAQs from public customer support pages (ensure compliance with terms of service):
# Example: Parse FAQ pages from a website
import requests
from bs4 import BeautifulSoup
# Your scraping code here
Requirements:
- Minimum 100 FAQ entries for meaningful evaluation
- Diverse topics/categories (at least 5-10 categories)
- Both simple and complex queries
- Include some ambiguous questions that require context
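A quick sanity check can catch a knowledge base that falls short of the size and category requirements above. This is a minimal sketch that assumes the Option 2 schema (`question`, `answer`, `category`); `validate_knowledge_base` is a hypothetical helper name, and datasets from Option 1 may use different field names.

```python
REQUIRED_FIELDS = ("question", "answer", "category")


def validate_knowledge_base(entries, min_entries=100, min_categories=5):
    """Return a list of human-readable problems (an empty list means OK)."""
    problems = []
    for i, entry in enumerate(entries):
        for field in REQUIRED_FIELDS:
            value = entry.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"entry {i}: missing or empty '{field}'")
    if len(entries) < min_entries:
        problems.append(f"only {len(entries)} entries; need at least {min_entries}")
    categories = {e.get("category") for e in entries if e.get("category")}
    if len(categories) < min_categories:
        problems.append(
            f"only {len(categories)} categories; need at least {min_categories}"
        )
    return problems


sample = [
    {"question": "How do I reset my password?",
     "answer": "Click 'Forgot Password'.", "category": "account_access"},
    {"question": "What is your return policy?",
     "answer": "", "category": "returns"},
]
for problem in validate_knowledge_base(sample):
    print(problem)
```

The check only covers the countable requirements; query diversity and ambiguity still need a manual look.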
Your Tasks
1. Build a Semantic Search System
Implement a semantic search system that can find relevant FAQ entries given a customer query.
Requirements:
a) Encode the Knowledge Base:
- Use a pre-trained sentence transformer model (e.g., sentence-transformers/all-MiniLM-L6-v2)
- Generate embeddings for all FAQ questions/answers
- Store embeddings for efficient retrieval
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
# Your code here
b) Implement Efficient Search:
- Use FAISS or similar library for fast similarity search
- Implement both cosine similarity and L2 distance metrics
- Support retrieving top-k results
import faiss
import numpy as np
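# Build FAISS index
# `embeddings` is the output of model.encode(...) from step (a); FAISS expects float32
embeddings = np.asarray(embeddings, dtype="float32")
dimension = embeddings.shape[1]
index = faiss.IndexFlatL2(dimension)  # exact search; returns squared L2 distances
index.add(embeddings)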
c) Query Processing:
- Encode incoming queries using the same model
- Retrieve top-k most similar FAQ entries
- Return results with similarity scores
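To make the query-processing contract concrete, here is a brute-force sketch of top-k retrieval that returns indices with cosine scores. It is a pure-Python stand-in for what FAISS does far more efficiently; the vectors are invented, and `top_k` is a hypothetical helper name.

```python
import heapq
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def top_k(query_vec, doc_vecs, k=3):
    """Return [(doc_index, cosine_score), ...] for the k best matches, best first."""
    q = l2_normalize(query_vec)
    scores = []
    for idx, doc in enumerate(doc_vecs):
        d = l2_normalize(doc)
        # On unit vectors the dot product equals cosine similarity
        scores.append((sum(a * b for a, b in zip(q, d)), idx))
    best = heapq.nlargest(k, scores)
    return [(idx, score) for score, idx in best]


# Toy document vectors (made up; stand-ins for encoded FAQ entries)
doc_vecs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 0.0, 1.0],
]
results = top_k([1.0, 0.05, 0.0], doc_vecs, k=2)
for idx, score in results:
    print(f"doc {idx}: {score:.4f}")
```

Whatever index you use, returning `(index, score)` pairs like this keeps the downstream thresholding and evaluation code independent of the search backend.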
2. Implement a Baseline (Keyword Matching)
Create a simple baseline using traditional keyword matching to demonstrate the value of semantic search.
Requirements:
- Implement TF-IDF with cosine similarity OR BM25
- Use the same knowledge base
- Compare retrieval quality against your semantic approach
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
# Your baseline implementation
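If you choose BM25 for the baseline, the sketch below shows the scoring formula in plain Python. It is for understanding only: `SimpleBM25` is a hypothetical class, it uses one common IDF variant (with `+ 1` inside the logarithm so scores stay non-negative), and the `rank-bm25` package listed in the requirements is the more practical choice for the assignment itself.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


class SimpleBM25:
    def __init__(self, documents, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in documents]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avg_len = sum(len(t) for t in self.doc_tokens) / len(self.doc_tokens)
        n_docs = len(documents)
        # Document frequency: in how many documents each term appears
        df = Counter(term for toks in self.doc_tokens for term in set(toks))
        self.idf = {
            t: math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0) for t, n in df.items()
        }

    def score(self, query, doc_index):
        freqs = self.doc_freqs[doc_index]
        doc_len = len(self.doc_tokens[doc_index])
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f == 0:
                continue
            denom = f + self.k1 * (1 - self.b + self.b * doc_len / self.avg_len)
            total += self.idf[term] * f * (self.k1 + 1) / denom
        return total

    def rank(self, query, k=3):
        scored = [(self.score(query, i), i) for i in range(len(self.doc_tokens))]
        scored.sort(reverse=True)
        return [(i, s) for s, i in scored[:k]]


faq_questions = [
    "How do I reset my password?",
    "What is your return policy?",
    "How do I track my order?",
]
bm25 = SimpleBM25(faq_questions)
print(bm25.rank("reset password", k=1))
print(bm25.rank("I can't log into my account", k=3))
```

The second query shares only function words ("I", "my") with the FAQ questions, so the keyword ranking is essentially arbitrary. Paraphrases like this are where the comparison with semantic search becomes interesting.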
3. Build the Retrieval Mechanism
Develop a complete retrieval system that:
a) Handles Query Variations:
- Test with paraphrased queries
- Handle typos and misspellings (optional: add spelling correction)
- Support queries of varying length
- Implement category-based filtering (if applicable)
- Use confidence thresholds to reject low-quality matches
- Handle "no good match" scenarios gracefully
- Optionally implement a re-ranking step using cross-encoders
- This can improve retrieval quality for ambiguous queries
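Category filtering and confidence thresholds can be combined in one small post-processing step. The sketch below assumes results arrive as `(entry, score)` pairs where higher scores are better (for example, cosine similarity); `select_matches` is a hypothetical helper, and the 0.5 threshold is an arbitrary placeholder you would tune on your own validation queries.

```python
def select_matches(results, min_score=0.5, category=None, max_results=3):
    """Filter (entry, score) pairs by score threshold and optional category.

    Assumes higher scores mean better matches (e.g., cosine similarity).
    """
    kept = [
        (entry, score)
        for entry, score in results
        if score >= min_score
        and (category is None or entry.get("category") == category)
    ]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:max_results]


results = [
    ({"question": "How do I reset my password?", "category": "account_access"}, 0.82),
    ({"question": "What is your return policy?", "category": "returns"}, 0.41),
    ({"question": "How do I change my email?", "category": "account_access"}, 0.63),
]

confident = select_matches(results, min_score=0.5)
returns_only = select_matches(results, min_score=0.5, category="returns")
if not returns_only:
    # "No good match" path: nothing cleared the threshold in this category
    print("No good match: ask the customer to rephrase or hand off to a human.")
```

An empty list is the signal for the "no good match" scenario; the response layer decides what to say in that case.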
4. Generate Contextual Responses
Use the retrieved context to generate helpful responses.
Requirements:
a) Template-Based Generation (Minimum):
- Use retrieved FAQ answers directly
- Format responses naturally
- Include confidence indicators
b) LLM-Based Generation:
- Use a lightweight LLM (via Ollama, HuggingFace, or OpenAI API)
- Provide retrieved context in the prompt
- Generate natural, conversational responses
# Example prompt structure
def generate_response(query, retrieved_contexts):
    # retrieved_contexts: retrieved FAQ text, already formatted as one string
    prompt = f"""You are a helpful customer service assistant.

Customer Question: {query}

Relevant Information:
{retrieved_contexts}

Provide a helpful, accurate response based on the information above.
Do not make up information not present in the context."""
    # Call your LLM here; call_llm is a placeholder for your own client function
    response = call_llm(prompt)
    return response
c) Response Quality:
- Ensure responses are grounded in retrieved context
- Handle cases where no good match exists
- Provide clear, actionable information
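For the template-based minimum, one possible shape is a function that returns the best FAQ answer with a confidence indicator and falls back to a fixed message when nothing was retrieved. The function names, the fallback wording, and the 0.8 / 0.6 cut-offs below are illustrative assumptions, not required values.

```python
NO_MATCH_MESSAGE = (
    "I'm sorry, I couldn't find an answer to that. "
    "Could you rephrase, or would you like to contact a support agent?"
)


def confidence_label(score):
    # Cut-offs are placeholders; calibrate them against your own score distribution
    if score >= 0.8:
        return "high"
    if score >= 0.6:
        return "medium"
    return "low"


def template_response(matches):
    """matches: list of (faq_entry, score) pairs sorted best first; may be empty."""
    if not matches:
        return NO_MATCH_MESSAGE
    entry, score = matches[0]
    label = confidence_label(score)
    question = entry["question"]
    lines = [
        entry["answer"],
        f'(Confidence: {label}; matched FAQ: "{question}")',
    ]
    if len(matches) > 1:
        related = "; ".join(e["question"] for e, _ in matches[1:])
        lines.append(f"Related questions: {related}")
    return "\n".join(lines)


matches = [
    ({"question": "How do I reset my password?",
      "answer": "Click 'Forgot Password' on the login page."}, 0.86),
    ({"question": "How do I change my email?",
      "answer": "Open Account Settings and edit your email."}, 0.64),
]
reply = template_response(matches)
print(reply)
print(template_response([]))
```

Because the answer text is copied verbatim from the knowledge base, this approach is grounded by construction, which makes it a useful reference point when checking LLM output for hallucination.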
5. Handle Multi-Turn Conversations
Extend your system to maintain context across multiple conversation turns.
Requirements:
a) Conversation State:
- Track conversation history
- Maintain context from previous turns
- Update query encoding with conversation context
- Combine current query with relevant prior context
- Implement conversation summarization for long dialogues
- Handle follow-up questions ("What about shipping?", "How long does that take?")
Example:
User: "I want to return an item"
Bot: "Our return policy allows returns within 30 days..."
User: "How do I start the process?"
Bot: [Uses context that this is about returns]
User: "What about shipping costs?"
Bot: [Understands this relates to return shipping]
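One simple way to get the behavior in the dialogue above is to prepend recent user messages to the current query before encoding it. The sketch below is a minimal version of that idea; `ConversationState` and its method names are hypothetical, and plain concatenation is only one option (query rewriting with an LLM or summarization are alternatives for longer dialogues).

```python
from collections import deque


class ConversationState:
    """Keeps the last few turns and builds a context-enriched retrieval query."""

    def __init__(self, max_turns=3):
        # Each item is a (user_text, bot_text) pair; old turns drop off automatically
        self.history = deque(maxlen=max_turns)

    def add_turn(self, user_text, bot_text):
        self.history.append((user_text, bot_text))

    def contextual_query(self, current_query):
        """Prefix the current query with recent user messages for retrieval."""
        previous = [user for user, _ in self.history]
        return " ".join(previous + [current_query])


state = ConversationState(max_turns=3)
state.add_turn(
    "I want to return an item",
    "Our return policy allows returns within 30 days...",
)
query_for_retrieval = state.contextual_query("What about shipping costs?")
print(query_for_retrieval)
```

With the earlier "return" message in the query text, the encoder has a chance to place "What about shipping costs?" near return-shipping FAQs rather than delivery FAQs. The window size is an assumption to tune: too many old turns can drag retrieval toward stale topics.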
6. Evaluate Response Quality
Develop comprehensive evaluation metrics for your system.
Required Metrics:
a) Retrieval Metrics:
- Precision@k: Of the k retrieved documents, how many are relevant?
- Recall@k: Of all relevant documents, how many are in top-k?
- MRR (Mean Reciprocal Rank): Position of first relevant result
- Create a test set with ground-truth relevant FAQs for queries
b) Response Quality Metrics:
- Semantic Similarity: Compare generated response to ground truth
- Factual Grounding: Verify responses don't hallucinate information
- Human Evaluation: Test on sample queries and manually assess quality
c) Baseline Comparison:
- Compare semantic search vs. keyword matching on all metrics
- Use statistical tests (for example, a paired test on per-query scores) to check whether differences are significant
- Visualize results with charts/graphs
d) Error Analysis:
- Identify failure modes (when does the system fail?)
- Categorize errors (retrieval failures vs. generation issues)
- Provide examples of good and bad responses
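The retrieval metrics named in this section can be implemented in a few lines. The sketch below uses the standard definitions and assumes each query has a ranked list of retrieved FAQ ids plus a set of ground-truth relevant ids; the ids shown are invented.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved ids that are relevant."""
    if k <= 0:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all relevant ids that appear in the top k."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in retrieved[:k] if doc_id in relevant) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant id, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """runs: list of (retrieved, relevant) pairs, one per test query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Toy test set (ids are made up)
retrieved = ["faq3", "faq1", "faq7"]
relevant = {"faq1", "faq9"}

p3 = precision_at_k(retrieved, relevant, k=3)
r3 = recall_at_k(retrieved, relevant, k=3)
mrr = mean_reciprocal_rank([(retrieved, relevant), (["faq2"], {"faq2"})])
print(f"P@3={p3:.3f}  R@3={r3:.3f}  MRR={mrr:.3f}")
```

Running the same functions on the semantic system and the keyword baseline gives per-query score pairs, which is the input a paired significance test needs.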
7. Advanced Features (Optional Bonus)
Implement one or more of these for extra credit:
- Hybrid Search: Combine semantic and keyword search for better results
- Cross-Encoder Re-ranking: Use a cross-encoder model to re-rank retrieved results
- Query Expansion: Expand queries with synonyms or related terms
- Active Learning: Identify uncertain cases for human review
- Multimodal Support: Handle queries about images or documents
- Intent Classification: Classify query intent before retrieval
- Conversation Analytics: Track common issues and query patterns
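For the hybrid search option, one commonly used way to combine a semantic ranking with a keyword ranking is reciprocal rank fusion (RRF), which needs only the rank positions and not comparable scores. This is a sketch of that one technique, not the only way to build hybrid search; the constant `k=60` is a frequently used default, and the FAQ ids are invented.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids (best first) into one ranked list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by id for determinism
    return sorted(scores, key=lambda d: (-scores[d], str(d)))


semantic_ranking = ["faq2", "faq5", "faq1"]  # e.g., from the vector index
keyword_ranking = ["faq5", "faq9", "faq2"]   # e.g., from TF-IDF or BM25

fused = reciprocal_rank_fusion([semantic_ranking, keyword_ranking])
print(fused)
```

Entries that both retrievers rank well rise to the top, while entries found by only one retriever are kept but ranked lower.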
Technical Requirements
Required Libraries
# Core ML/NLP
transformers>=4.30.0
sentence-transformers>=2.2.0
torch>=2.0.0
# Vector Search
faiss-cpu>=1.7.4  # or faiss-gpu for GPU support
# Traditional IR (baseline)
scikit-learn>=1.3.0
rank-bm25>=0.2.2
# Data Handling
datasets>=2.14.0
pandas>=2.0.0
numpy>=1.24.0
# Visualization
matplotlib>=3.7.0
seaborn>=0.12.0
plotly>=5.14.0
# Optional: LLM Integration
openai>=0.27.0  # if using OpenAI
# or use Ollama for local LLMs
Recommended Models
Sentence Encoders (choose one or compare multiple):
- sentence-transformers/all-MiniLM-L6-v2 (fast, good quality)
- sentence-transformers/all-mpnet-base-v2 (higher quality)
- BAAI/bge-small-en-v1.5 (strong retrieval quality for its size)
- intfloat/e5-small-v2 (efficient, strong performance)
Cross-Encoder (for optional re-ranking):
- cross-encoder/ms-marco-MiniLM-L-6-v2
LLMs (for response generation):
- Local: Llama 3 8B via Ollama
- API: OpenAI GPT-3.5-turbo or GPT-4
- HuggingFace: Flan-T5, GPT-2, or similar
Computational Requirements
- CPU: Sufficient for most encoder models with FAISS
- GPU: Recommended for faster embedding generation with larger models
- Memory: 8GB+ RAM recommended
- Storage: ~2GB for models and data
Deliverables
Submit a Google Colaboratory notebook that includes:
1. Code Implementation (60%)
- Complete, runnable code for all required components
- Clear code organization with functions/classes
- Proper error handling and edge case management
- Comments explaining key design decisions
2. Documentation (20%)
- Markdown cells explaining your approach for each section
- System architecture diagram or description
- Model selection justification
- Design decisions and trade-offs
3. Evaluation and Analysis (15%)
- Comprehensive evaluation metrics implementation
- Baseline comparison with statistical analysis
- Visualizations (charts, confusion matrices, example outputs)
- Error analysis with specific examples
4. Examples and Demo (5%)
- At least 10 example queries with system responses
- Include both successful and failure cases
- Demo of multi-turn conversation (at least 3 turns)
- Comparison of semantic vs. keyword baseline on same queries
Required Sections in Notebook
- Introduction: Overview of your system
- Data Loading: Load and explore the knowledge base
- Semantic Search Implementation: Encoder model and FAISS
- Baseline Implementation: TF-IDF or BM25
- Response Generation: Template or LLM-based
- Multi-Turn Handling: Conversation state management
- Evaluation: Metrics, comparisons, and analysis
- Examples: Interactive demos
- Conclusion: Findings, limitations, future improvements
Evaluation Criteria
Your assignment will be graded on the following criteria:
Technical Implementation (40 points)
- Semantic Search (15 pts): Correct implementation of encoder + FAISS
- Baseline (5 pts): Working keyword-matching baseline
- Response Generation (10 pts): Quality and grounding of responses
- Multi-Turn (10 pts): Effective conversation context handling
Evaluation and Analysis (25 points)
- Metrics (10 pts): Proper implementation of retrieval and quality metrics
- Comparison (10 pts): Thorough baseline comparison with statistics
- Error Analysis (5 pts): Insightful analysis of failure modes
Code Quality and Documentation (20 points)
- Code Organization (10 pts): Clean, modular, well-commented code
- Documentation (10 pts): Clear markdown explanations throughout
Examples and Presentation (10 points)
- Quality Examples (5 pts): Diverse, illustrative examples
- Demo (5 pts): Working interactive demonstration
Creativity and Innovation (5 points)
- Advanced Features (3 pts): Implementation of optional features
- Novel Insights (2 pts): Unique observations or improvements
Grading Rubric
- A (90-100): Exceptional implementation with advanced features, thorough evaluation, and clear documentation
- B (80-89): Complete implementation of all required components with good evaluation
- C (70-79): Working system with basic evaluation and some documentation gaps
- D (60-69): Partial implementation with significant gaps
- F (<60): Incomplete or non-functional submission
Tips for Success
Getting Started (1-Week Timeline)
- Start Simple: Begin with a small knowledge base (20-30 FAQs) to test your pipeline
- Incremental Development: Build and test each component separately before integration
- Use Examples: Work through concrete examples at each step
- Validate Early: Check that embeddings and retrieval make sense before moving to generation
- Leverage GenAI: Use ChatGPT, Claude, or GitHub Copilot to accelerate implementation. Ask for help understanding libraries, debugging errors, and optimizing code.
Common Pitfalls to Avoid
- Ignoring Normalization: Normalize embeddings for cosine similarity
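- Wrong Distance Metric: FAISS IndexFlatL2 ranks by (squared) Euclidean distance, which matches cosine ranking only when the vectors are L2-normalized. To get cosine scores directly, normalize the embeddings and use IndexFlatIP (inner product)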
- Memory Issues: Batch embedding generation for large knowledge bases
- Overfitting to Examples: Test on diverse, unseen queries
- Hallucination: Always ground responses in retrieved context
- Ignoring Edge Cases: Handle no-match scenarios gracefully
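The normalization and distance-metric pitfalls above come down to one identity: for unit-length vectors, squared L2 distance equals 2 - 2 * cosine similarity, so both metrics produce the same ranking. The short check below verifies this numerically on made-up vectors using plain Python rather than FAISS.

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def squared_l2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Arbitrary example vectors, normalized to unit length
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([1.0, 2.0, 2.0])

cos = dot(a, b)          # cosine similarity (vectors are unit length)
dist_sq = squared_l2(a, b)

# Identity for unit vectors: ||a - b||^2 = 2 - 2 * cos(a, b)
print(f"cosine={cos:.4f}  squared_L2={dist_sq:.4f}  2-2cos={2 - 2 * cos:.4f}")
```

Without normalization the identity does not hold, and a long vector can look "far" from a query even when it points in the same direction.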
Debugging Strategies
- Print Similarities: Inspect actual similarity scores to understand retrieval
- Manual Inspection: Look at retrieved documents for sample queries
- Embedding Visualization: Use t-SNE/UMAP to visualize embedding space
- Start Small: Debug with 10 FAQs before scaling to 100+
Performance Optimization
- Cache Embeddings: Don't re-encode the knowledge base every time
- Batch Processing: Encode multiple queries at once
- FAISS GPU: Use GPU-accelerated FAISS for large knowledge bases
- Model Selection: Balance model size with quality needs
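Caching can be as simple as saving the embeddings to disk under a key derived from the model name and the texts, so that changing either one invalidates the cache. The sketch below is one way to do this; the helper names are hypothetical, `fake_encode` stands in for a real encoder, and with NumPy arrays you might prefer `np.save` over pickle. Only load pickle files you created yourself.

```python
import hashlib
import json
import os
import pickle
import tempfile


def kb_fingerprint(texts, model_name):
    """Short hash that changes when the texts or the model name change."""
    payload = json.dumps({"model": model_name, "texts": texts}, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def load_or_compute_embeddings(texts, model_name, encode_fn, cache_dir):
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, f"emb_{kb_fingerprint(texts, model_name)}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)  # cache hit: skip encoding entirely
    embeddings = encode_fn(texts)
    with open(path, "wb") as f:
        pickle.dump(embeddings, f)
    return embeddings


# Stand-in encoder that records how often it is called (not a real model)
calls = []


def fake_encode(texts):
    calls.append(len(texts))
    return [[float(len(t))] for t in texts]


cache_dir = tempfile.mkdtemp()
texts = ["How do I reset my password?", "What is your return policy?"]
first = load_or_compute_embeddings(texts, "toy-model", fake_encode, cache_dir)
second = load_or_compute_embeddings(texts, "toy-model", fake_encode, cache_dir)
print(f"encoder calls: {len(calls)}")
```

The second call reads from disk, so the encoder runs only once for an unchanged knowledge base.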
Going Beyond Requirements
- Implement A/B testing framework to compare different models
- Add conversation analytics dashboard
- Create a web interface using Gradio or Streamlit
- Fine-tune sentence transformers on your domain data
- Implement feedback loops for continuous improvement
Resources and References
Key Papers
- BERT: Devlin et al. (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805
- Sentence-BERT: Reimers & Gurevych (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks". arXiv:1908.10084
- RAG: Lewis et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks". arXiv:2005.11401
- Dense Passage Retrieval: Karpukhin et al. (2020). "Dense Passage Retrieval for Open-Domain Question Answering". arXiv:2004.04906
- ColBERT: Khattab & Zaharia (2020). "ColBERT: Efficient and Effective Passage Search via Contextualized Late Interaction over BERT". arXiv:2004.12832
Documentation and Tutorials
- HuggingFace Transformers: https://huggingface.co/docs/transformers/
- Sentence Transformers: https://www.sbert.net/
- FAISS: https://github.com/facebookresearch/faiss/wiki
- LangChain (optional framework): https://python.langchain.com/
Code Examples
- Sentence Transformers Examples: https://www.sbert.net/examples/applications/semantic-search/README.html
- FAISS Tutorial: https://github.com/facebookresearch/faiss/wiki/Getting-started
- RAG Tutorial: https://huggingface.co/docs/transformers/model_doc/rag
Datasets
- HuggingFace Datasets Hub: https://huggingface.co/datasets
- Search for: "customer support", "FAQ", "helpdesk"
- Banking77: https://huggingface.co/datasets/banking77
- MS MARCO (optional, for retrieval practice): https://microsoft.github.io/msmarco/
Tools and Libraries
- Gradio (for UI): https://gradio.app/
- Streamlit (for UI): https://streamlit.io/
- Weights & Biases (for experiment tracking): https://wandb.ai/
Additional Reading
- "Building Chatbots with Python" - Sumit Raj
- "Natural Language Processing with Transformers" - Lewis Tunstall et al.
- "Speech and Language Processing" (Chapter on Question Answering) - Jurafsky & Martin
Related Techniques
- Hybrid Search: Combining dense and sparse retrieval
- Query Expansion: Enhancing queries with related terms
- Pseudo-Relevance Feedback: Using top results to refine queries
- Learning to Rank: ML approaches to re-ranking results
Submission Guidelines
GitHub Classroom Submission
This assignment is submitted via GitHub Classroom. Follow these steps:
- Accept the assignment: Click the assignment link provided in Canvas or by your instructor
- Repository: github.com/ContextLab/customer-service-bot-llm-course
- This creates your own private repository for the assignment
- Clone your repository:
git clone https://github.com/ContextLab/customer-service-bot-llm-course-YOUR_USERNAME.git
- Complete your work:
- Work in Google Colab, Jupyter, or your preferred environment
- Save your notebook to the repository
- Commit and push your changes:
git add .
git commit -m "Complete customer service chatbot assignment"
git push
- Verify submission: Check that your latest commit appears in your GitHub repository before the deadline
Notebook Requirements
- Runtime: The notebook must run from start to finish without errors
- Permissions: Ensure the notebook is accessible (include in your GitHub repository)
- Dependencies: All required packages should be installed in the notebook
- Data: Include code to automatically download any required datasets
- Output: Keep cell outputs visible in your submission
Before Submission Checklist
- Notebook runs completely in a fresh Colab session
- All required sections are included with markdown explanations
- Code is well-commented and organized
- Evaluation metrics are properly implemented and visualized
- At least 10 diverse examples are included
- Multi-turn conversation demo is working
- Baseline comparison is complete with statistical analysis
- All visualizations are clear and properly labeled
- No hardcoded paths (use relative paths or automatic downloads)
- Cell outputs are visible and meaningful
Academic Integrity
You are encouraged to:
- Use generative AI tools (ChatGPT, Claude, Copilot) to help write code
- Collaborate with classmates on understanding concepts
- Search for tutorials and examples online
- Ask questions in class or office hours
You must:
- Write your own analysis and explanations
- Understand every line of code you submit
- Cite any significant code you use from external sources
- Submit your own original work
Questions?
If you have questions about the assignment:
- Check this README thoroughly
- Review the resources and references section
- Post questions in the course forum
- Attend office hours
- Email the instructor/TA with specific questions
This assignment is designed to give you hands-on experience with modern NLP techniques used in production systems. The skills you develop here—semantic search, RAG, and evaluation—are directly applicable to real-world AI applications.