Conversational AI agent that collects user preferences and provides personalized movie recommendations using Retrieval-Augmented Generation (RAG).
This movie recommendation system orchestrates multiple specialized AI agents that work together to understand user preferences and deliver personalized movie suggestions.
At its core, a Conversation Orchestrator coordinates four distinct agents: the ExtractorAgent analyzes user input to extract structured preferences (genres, keywords, sentiment) using Pydantic schemas; the RequesterAgent generates contextual follow-up questions when information is incomplete; the RecommenderAgent performs hybrid RAG search combining semantic similarity with genre filtering to retrieve relevant movies; and the SummarizerAgent summarizes the conversation.
When the ExtractorAgent detects negative sentiment (frustration, impatience), the orchestrator immediately skips further questioning and proceeds directly to recommendations, ensuring a responsive user experience.
The system maintains conversation state across multiple turns, accumulating preferences incrementally while providing graceful fallbacks at every layer, from LLM failures to database errors.
It's powered by local Ollama models; gpt-oss was chosen for most of the agents, while llama3.1 and llama3.2 were also tested (with worse results).
- What This Project Does
- Architecture
- System Flow
- Setup Instructions
- Testing
- Key Design Decisions
- Potential Improvements
Modular agents that handle specific conversation tasks with standardized interfaces:
- `base.py` - Abstract agent protocol and shared infrastructure (see the sketch after this list)
  - `Agent`: Abstract base class with `execute()` method
  - `AgentResponse`: Standardized response wrapper with error handling
  - `AgentErrorType`: Enum for error categorization
- `extractor.py` - ExtractorAgent
  - Extracts structured data using Instructor + Pydantic
  - Returns: `ExtractedInfo` (genres, preferences, sentiment)
  - Chosen model: `gpt-oss:20b` @ temp 0.0 - note: focuses on accurate results to avoid extra turns
  - Uses MD_JSON mode for Ollama compatibility
  - Fallback: Empty `ExtractedInfo` model
- `requester.py` - RequesterAgent
  - Generates contextual follow-up questions
  - Analyzes conversation history and missing information
  - Chosen model: `llama3.2:3b` @ temp 0.5 - note: does the job well at a better speed
  - Fallback: Generic question if LLM fails
- `recommender.py` - RecommenderAgent
  - Retrieves movies via RAG semantic search
  - Formats recommendations with natural language
  - Chosen model: `gpt-oss:20b` @ temp 0.5 - note: gives the user a good, personalised recommendation
  - Fallback: Plain text list if LLM fails
- `summarizer.py` - SummarizerAgent
  - Generates conversation summaries for storage
  - Runs once at end of conversation
  - Chosen model: `gpt-oss:20b` @ temp 0.5 - note: speed matters less here since this step could be handled asynchronously
  - Returns: String summary or None on error
- `orchestrator.py` - ConversationOrchestrator
  - Coordinates agent execution flow
  - Manages conversation loop (max 20 turns by default)
  - Handles voice/text input switching
  - Saves conversation state to JSON
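As a rough illustration of the shared protocol in `base.py`, a minimal sketch is shown below; the exact `AgentResponse` fields, error categories, and the `execute()` signature are assumptions, not the project's actual code:

```python
# Minimal sketch of base.py - field names and the execute() signature are assumptions;
# only Agent, AgentResponse and AgentErrorType come from the module description above.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from enum import Enum
from typing import Any, Optional


class AgentErrorType(str, Enum):
    """Categories used to classify agent failures."""
    LLM_ERROR = "llm_error"
    DATABASE_ERROR = "database_error"
    VALIDATION_ERROR = "validation_error"


@dataclass
class AgentResponse:
    """Standardized wrapper returned by every agent."""
    success: bool
    data: Optional[Any] = None
    error_type: Optional[AgentErrorType] = None
    error_message: Optional[str] = None


class Agent(ABC):
    """Abstract base class that all agents implement."""

    @abstractmethod
    def execute(self, state: "State") -> AgentResponse:
        """Run the agent against the current conversation state."""
        ...
```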
Foundational data structures and configuration:
- `models.py` - Pydantic data models with validation (see the sketch after this list)
  - `Message`: Conversation history entries (role, content)
  - `ExtractedInfo`: User preferences (genres, preferences, sentiment)
  - `Movie`: Movie metadata (title, year, rating, genres, overview)
  - `Genre`: Enum of 19 TMDB genres
  - `Sentiment`: Enum (positive, neutral, negative)
  - `Role`: Enum (user, assistant, system)
- `state.py` - Conversation state management
  - `State`: Maintains conversation history and extracted info
  - Accumulates data across multiple turns
  - Provides JSON serialization for persistence
- `config.py` - Centralized configuration with environment variables
  - Model selection (EXTRACTION_MODEL, REQUESTER_MODEL, etc.)
  - Temperature settings per agent
  - Ollama API configuration
  - ChromaDB settings
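A rough sketch of the models described above, assuming Pydantic; the actual field names, defaults, and the full 19-genre enum live in `models.py`:

```python
# Illustrative sketch of models.py - field names and defaults are assumptions,
# and the Genre enum is truncated (the real one lists all 19 TMDB genres).
from enum import Enum
from typing import List, Optional
from pydantic import BaseModel


class Genre(str, Enum):
    ACTION = "Action"
    COMEDY = "Comedy"
    DRAMA = "Drama"
    # ... 16 more TMDB genres


class Sentiment(str, Enum):
    POSITIVE = "positive"
    NEUTRAL = "neutral"
    NEGATIVE = "negative"


class Role(str, Enum):
    USER = "user"
    ASSISTANT = "assistant"
    SYSTEM = "system"


class Message(BaseModel):
    role: Role
    content: str


class ExtractedInfo(BaseModel):
    genres: List[Genre] = []
    preferences: List[str] = []
    sentiment: Sentiment = Sentiment.NEUTRAL


class Movie(BaseModel):
    title: str
    year: Optional[int] = None
    rating: Optional[float] = None
    genres: List[Genre] = []
    overview: str = ""
```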
External service integrations and client management:
- `llm_client.py` - LLM client singleton management
- `database.py` - ChromaDB persistent client
- `listener.py` - Voice input using Whisper and webrtcvad for VAD
- `speaker.py` - Voice output using pyttsx3 TTS
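For illustration, `database.py` presumably wraps something like the snippet below; the path and collection name match the defaults mentioned elsewhere in this README, but the caching approach is an assumption:

```python
# Sketch of a ChromaDB persistent client singleton (database.py) - the
# lru_cache-based caching is an assumption, not the project's actual code.
from functools import lru_cache

import chromadb


@lru_cache(maxsize=1)
def get_chroma_collection(path: str = "./chroma_db", name: str = "movies"):
    """Return a cached handle to the persistent 'movies' collection."""
    client = chromadb.PersistentClient(path=path)
    return client.get_or_create_collection(name=name)
```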
Semantic search and vector database operations:
- `retriever.py` - Hybrid semantic search
  - `retrieve_movies()`: Main retrieval function
  - Generates query embeddings using `embeddinggemma`
  - Applies genre metadata filtering
  - Returns top-N results sorted by similarity
- `indexer.py` - Dataset indexing pipeline (see the sketch after this list)
  - `index_movies()`: Batch embedding generation
  - Reads from `data/movies.csv`
  - Creates boolean genre fields for filtering (ChromaDB metadata doesn't support lists)
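Because ChromaDB metadata values can't be lists, genres are flattened into boolean fields at indexing time. A simplified sketch of that part of `index_movies()`; the CSV column names and genre separator are assumptions:

```python
# Simplified sketch of the indexing step with boolean genre fields.
# CSV column names and the "|" genre separator are assumptions.
import csv

import chromadb
import ollama


def index_movies(csv_path: str = "data/movies.csv") -> None:
    collection = chromadb.PersistentClient(path="./chroma_db").get_or_create_collection("movies")
    with open(csv_path, newline="", encoding="utf-8") as f:
        for i, row in enumerate(csv.DictReader(f)):
            # Embed the movie overview for semantic search
            embedding = ollama.embeddings(
                model="embeddinggemma", prompt=row["overview"]
            )["embedding"]
            # Flatten genres into boolean fields, e.g. genre_Action=True,
            # so they can be used as ChromaDB metadata filters later
            metadata = {"title": row["title"], "year": int(row["year"])}
            for genre in row["genres"].split("|"):
                metadata[f"genre_{genre}"] = True
            collection.add(
                ids=[str(i)],
                embeddings=[embedding],
                documents=[row["overview"]],
                metadatas=[metadata],
            )
```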
Specialized system prompts for each agent:
- `extractor.py` - Structured extraction prompts
- `requester.py` - Question generation prompts
- `recommender.py` - Movie presentation prompts
- `summarizer.py` - Conversation summary prompts
```mermaid
flowchart TD
Start([User Input]) --> Extract[1. Extractor Agent<br/>Extract genres<br/>Extract preferences<br/>Detect sentiment]
Extract --> Update[2. Update State<br/>Merge new info<br/>Maintain history]
Update --> Check{3. Check Completeness<br/>Have genres OR prefs?<br/>Negative sentiment?}
Check -->|YES| Recommend[4a. Recommender<br/>RAG Search<br/>Format Reply]
Recommend --> Show[5a. Show Movies]
Show --> Summarize[6. Summarizer<br/>Generate Summary]
Summarize --> Save[7. Save State<br/>JSON + Summary<br/>END]
Save --> End([End])
Check -->|NO| Question[4b. Requester<br/>Ask Question]
Question --> Loop[5b. Loop Back<br/>Get User Input]
    Loop --> Start
```
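The pseudo-Python below mirrors this flow; the helper names (`get_user_input`, `show`, `state.merge`, `state.save`) are illustrative assumptions based on the diagram, not the actual `orchestrator.py` API:

```python
# Illustrative outline of the conversation loop - helper names are assumptions.
def run_conversation(state, agents, get_user_input, show, max_turns: int = 20):
    extractor, requester, recommender, summarizer = agents
    for _ in range(max_turns):
        state.add_user_message(get_user_input())        # voice or text input
        extracted = extractor.execute(state)            # 1. extract genres/prefs/sentiment
        state.merge(extracted.data)                     # 2. update state, keep history

        # 3. completeness check: enough info, or is the user getting frustrated?
        has_info = bool(state.info.genres or state.info.preferences)
        impatient = state.info.sentiment == "negative"
        if has_info or impatient:
            show(recommender.execute(state).data)       # 4a/5a. RAG search, show movies
            summary = summarizer.execute(state)         # 6. summarize the conversation
            state.save(summary.data)                    # 7. persist JSON + summary, end
            return
        show(requester.execute(state).data)             # 4b/5b. ask a follow-up, loop back
```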
```mermaid
flowchart TD
Prefs[User Preferences<br/>Genres: Action<br/>Prefs: fast-paced] --> Query[Build Search Query<br/>fast-paced]
Query --> Embed[Generate Query Embedding<br/>Ollama embeddinggemma]
Embed --> Filter[Build Genre Filter<br/>genre_Action: true]
Filter --> ChromaDB[ChromaDB Query<br/>Semantic + Filter]
    ChromaDB --> Results[Top 5 Movies<br/>Sorted by similarity]
```
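A minimal sketch of this retrieval flow using the ChromaDB and Ollama Python clients; the query construction and where-filter shape in the real `retriever.py` may differ:

```python
# Sketch of the hybrid retrieval step - query text and filter shape are assumptions;
# only the embedding model and the top-5 cutoff come from this README.
import chromadb
import ollama


def retrieve_movies(preferences: list[str], genres: list[str], n_results: int = 5):
    collection = chromadb.PersistentClient(path="./chroma_db").get_collection("movies")
    # Build the search query from free-text preferences (e.g. "fast-paced")
    query_text = ", ".join(preferences) or "movies"  # basic default query today
    embedding = ollama.embeddings(model="embeddinggemma", prompt=query_text)["embedding"]
    # Genre metadata filter on the boolean fields written at indexing time
    where = {f"genre_{genres[0]}": True} if genres else None
    results = collection.query(
        query_embeddings=[embedding],
        n_results=n_results,
        where=where,
    )
    return results["metadatas"][0]  # top-N movie metadata, sorted by similarity
```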
Before you begin, ensure you have the following installed:
- Python 3.10+

  ```bash
  python --version  # Should be 3.10 or higher
  ```
- Ollama - Local LLM inference engine

  ```bash
  # Install from https://ollama.ai/
  # Or via Homebrew on macOS:
  brew install ollama

  # Verify installation
  ollama --version
  ```
- System Dependencies (for voice mode - optional)

  ```bash
  # macOS
  brew install portaudio ffmpeg

  # Ubuntu/Debian
  sudo apt-get install portaudio19-dev ffmpeg
  ```
```bash
# Clone the repository
git clone <repository-url>
cd movie-recommender

# Create virtual environment
python -m venv .venv

# Activate virtual environment
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
```

```bash
# Install all required packages
pip install -r requirements.txt
```

```bash
# Pull the recommended models (one-time setup)
# This one can be used for every agent
ollama pull gpt-oss:20b
# (Optional) Can be used for the RequesterAgent to speed things up a bit
ollama pull llama3.2:3b
# Used as the embedding model for semantic search
ollama pull embeddinggemma
# Verify models are installed
ollama list
```

```bash
# Copy the example configuration and edit it to customize models and temperatures per agent
cp .env.example .env
```

Default Configuration (already optimized):

```env
# Models
EXTRACTION_MODEL=gpt-oss:20b
REQUESTER_MODEL=llama3.2:3b
RECOMMENDER_MODEL=gpt-oss:20b
SUMMARIZER_MODEL=gpt-oss:20b
EMBEDDING_MODEL=embeddinggemma
# Temperatures
EXTRACTION_TEMPERATURE=0.0
REQUESTER_TEMPERATURE=0.5
RECOMMENDER_TEMPERATURE=0.5
SUMMARIZER_TEMPERATURE=0.5
```

```bash
# Index the dataset (1,000 movies from data/movies.csv)
python -m scripts.index_dataset
# Expected output:
# Processing movies... ████████████████████████████████████████ 100% 0:01:23
# ✅ Successfully indexed 1000 movies to ChromaDB collection 'movies'
```

What this does:
- Generates embeddings for each movie's overview
- Stores vectors + metadata in ChromaDB (`./chroma_db/`)
- Takes around a minute
Pro Tip: You can interrupt with Ctrl+C to test with a smaller subset first.
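If you want to double-check what ended up in the store, an optional sanity check with the ChromaDB Python client (not part of the project's scripts):

```python
# Optional sanity check after indexing (run from the project root)
import chromadb

collection = chromadb.PersistentClient(path="./chroma_db").get_collection("movies")
print(collection.count())               # expect 1000 if the full dataset was indexed
print(collection.peek(1)["metadatas"])  # inspect one record's metadata and genre flags
```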
Run a quick test to ensure everything is working:
```bash
python -m scripts.run
```

To use voice mode:

```bash
python -m scripts.run --voice
```

- 🎤 Speak your responses instead of typing
- 🔊 Hear the agent's messages
- First run: Whisper downloads ~140MB model automatically
For verbose output:

```bash
python -m scripts.run --verbose
```

Run the test suite:

```bash
pytest tests
```

At the end of each conversation, a transcript is stored along with a summary in the `./conversations` folder:

```
./conversations/conversation_YYYYMMDD_HHMMSS.json
```

A few examples are included in the `./conversations/examples` folder:
- Complete information provided upfront: `1-information-upfront.json`
- Information provided after a follow-up: `2-information-follow-up.json`
- Impatience detected and a recommendation provided without collected information: `3-without-information-due-to-impatience.json`
- Some off-topic questions: `4-off-topic.json`
- One agent per task: Modular agent architecture allows fine-tuning models, temperatures and prompts for every specific use case.
- Protocol-based design: All agents implement a common `Agent` protocol with a standardized `AgentResponse`, enabling easy swapping and testing.
- Model selection strategy: Different models for different tasks (gpt-oss for accuracy, llama3.2 for speed), optimized through experimentation.
- Structured outputs with Pydantic: Using Instructor library with Pydantic models ensures type-safe, validated extractions.
- Voice-aware prompts: Dynamically adjusted formatting based on output mode (text vs. speech) for better UX.
- Sentiment detection: Detects user frustration to skip additional questions and provide immediate recommendations.
- Flexible extraction logic: Accepts either genres OR preference descriptions, lowering the barrier for users to get results.
- Hybrid search: Combines semantic similarity (embeddings) with metadata filtering (genre tags) for more accurate retrieval.
- Embedding model choice: `embeddinggemma` appeared to be a strong open-source embedding model in our investigations and produced good results.
- Graceful degradation: Fallback messages for every agent ensure the user always receives a response, no matter what fails.
- Multi-turn conversation flow: The maximum number of turns is configurable (20 by default).
- JSON persistence: Conversations and their summaries are stored as JSON at the end for simplicity.
- Edge case handling: Manages empty inputs, off-topic responses, and varying levels of user detail.
- Enhanced test coverage: More comprehensive e2e flow testing and edge case scenarios.
- LLM evaluation framework: Implement automated evaluation of agent responses for quality and accuracy.
- Performance benchmarking: Track response times, accuracy metrics, and user satisfaction scores.
- Integration tests with real LLMs: Currently mocked in tests; real model integration tests would catch prompt regressions.
- Model experimentation: Only tried 3 models; could explore more specialized models for each task.
- Prompt versioning: Track prompt changes and A/B test different phrasings for better results.
- Temperature tuning: More granular temperature optimization per use case.
- Adding more filters: Rating could be used to weight results.
- User feedback loop: Learn from user reactions to improve future recommendations.
- Improve default search query: Right now it returns arbitrary movies; we could build a default query based on rating.
- Caching: Store frequent queries and embeddings to reduce latency and API calls.
- Asynchronous processing: Make summarization and non-critical operations async to improve response times.
- Real-time conversation updates: Stream conversation state to storage after each turn, not just at the end.
- Rate limiting & quotas: Protect against abuse and manage token usage per user/session.
- Containerization: Docker setup for consistent deployment across environments.
- Input sanitization: More robust validation and sanitization of user inputs.
- Content filtering: Detect and handle inappropriate or off-topic requests more strictly.
- PII detection: Identify and redact personally identifiable information.
- Streaming responses: Stream LLM responses token-by-token for better perceived performance.
- Progress indicators: Show when the system is thinking/searching for better transparency.
- Multi-language support: Extend beyond English.