A Retrieval-Augmented Generation (RAG) system that allows you to chat with multiple PDF documents using Google's Gemini AI. Ask questions about your documents and get intelligent answers based on their content.
- Upload multiple PDFs and ask questions about their content
- Intelligent search through your documents using AI embeddings
- Persistent storage - build the index once, use it multiple times
- Smart responses using Google's Gemini 2.5 Flash model
- Easy setup in Google Colab with no local installation required
Before you start, you'll need:
- Google Account for accessing Google Colab
- Google Gemini API Key (free tier available)
- Visit Google AI Studio
- Click "Create API Key"
- Copy your API key (you'll need this later)
- Click the "Open in Colab" badge at the top of this README to open directly in Google Colab
- Alternative: Download the `Multi_PDF_RAG_with_LlamaIndex.ipynb` file from the GitHub repository and upload it to Google Colab
- In Google Colab, look for the key icon in the left sidebar
- Click on "Secrets" tab
- Click "Add new secret"
- Name: `geminiapikey`
- Value: Paste your Google Gemini API key here
- Toggle the "Notebook access" switch to ON
Cell 1: Install Dependencies
!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini
What this does:
- `llama-index`: Core framework for building RAG applications
- `pypdf`: Library for reading and processing PDF files
- `llama-index-embeddings-gemini`: Google Gemini embedding model integration
- `llama-index-llms-gemini`: Google Gemini language model integration
- `-q` flag: Quiet installation (less verbose output)
Cell 2: Import Libraries
from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage
What each import does:
- `pathlib.Path`: Handle file system paths
- `os`: Operating system interface for file operations
- `google.colab.userdata`: Access Colab secrets (API keys)
- `VectorStoreIndex`: Create a searchable vector database from documents
- `SimpleDirectoryReader`: Read PDF files from specified locations
- `Settings`: Global configuration for LlamaIndex
- `StorageContext`: Manage persistent storage of the vector index
- `SentenceSplitter`: Split documents into manageable chunks
- `GeminiEmbedding`: Convert text to vector embeddings using Gemini
- `Gemini`: Google's language model for generating responses
Cell 3: Get API Key
API_KEY = userdata.get('geminiapikey')
What this does:
- Retrieves your Gemini API key from Colab's secure storage
- `userdata.get()` safely accesses the secret you stored earlier (a quick sanity check is sketched below)
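If you want to confirm the key was picked up before continuing, here is a small optional check (note that `userdata.get()` should already raise an error if the secret is missing or notebook access is off):
# Optional: confirm a non-empty key was retrieved from Colab secrets
print("API key loaded:", bool(API_KEY), "| length:", len(API_KEY or ""))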
- Upload your PDFs to Colab:
  - Click the folder icon in the left sidebar
  - Drag and drop your PDF files into the file browser
  - Wait for upload to complete
- Configure file paths:
Cell 4: Set PDF Locations
pdf_directory = ['/content/part-1.pdf','/content/part-2.pdf']
Update this with your files:
# Replace with your actual PDF file names
pdf_directory = [
'/content/your-document-1.pdf',
'/content/your-document-2.pdf',
'/content/your-document-3.pdf'
]
What this does:
- Creates a list of file paths pointing to your uploaded PDFs
- `/content/` is Colab's default upload directory
- You can add as many PDF files as needed (a quick path check is sketched below)
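Before building the index, it can help to confirm that each entry actually points at an uploaded file. A tiny optional check (`Path` is already imported in Cell 2):
# Optional: verify every entry in pdf_directory exists under /content/
for pdf_path in pdf_directory:
    print(pdf_path, "->", "found" if Path(pdf_path).exists() else "MISSING")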
Cell 5-6: Set up storage and chunk size
persist_dir = "./storage"
chunk_size = 1024
What these settings mean:
- `persist_dir`: Directory where the vector index will be saved
- `chunk_size`: The target number of tokens in each text chunk
- Smaller chunks (512): More precise but less context
- Larger chunks (2048): More context but less precise
- 1024 is a good balance for most documents
Cell 7: Create storage directory
Path(persist_dir).mkdir(exist_ok=True)
What this does:
- Creates the storage directory if it doesn't exist
- `exist_ok=True` prevents errors if the directory already exists
Cell 8: Set up embedding model
Settings.embed_model = GeminiEmbedding(
model_name="models/embedding-001", api_key=API_KEY
)
What this does:
- Configures the embedding model that converts text to vectors
- `embedding-001` is Google's text embedding model
- These vectors enable semantic search through your documents (a quick smoke test is sketched below)
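If you want to confirm the embedding model is reachable before indexing, a one-off test embedding works as a smoke test; a minimal sketch (the sample text is arbitrary):
# Optional smoke test: embed a short string and report the vector size
sample_vector = Settings.embed_model.get_text_embedding("hello world")
print("Embedding dimensions:", len(sample_vector))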
Cell 9: Configure language model and text processing
Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size
What each setting does:
- `Settings.llm`: The AI model that generates answers to your questions
- `gemini-2.5-flash`: A fast and efficient version of Gemini
- `SentenceSplitter`: Intelligently splits text at sentence boundaries
- The global `chunk_size` setting ensures consistency (a one-line smoke test is sketched below)
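A quick way to confirm the language model settings work end to end is a single `complete()` call; a minimal sketch (the prompt is arbitrary):
# Optional smoke test: send one prompt through the configured Gemini model
print(Settings.llm.complete("Reply with the single word: ready"))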
Cell 10: Main function to create or load index
def load_or_create_index():
    """Load existing index or create new one if it doesn't exist"""
    if not os.listdir(persist_dir):
        print("Creating new index...")
        # Load PDF documents
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()
        # Create and persist index
        index = VectorStoreIndex.from_documents(
            documents, show_progress=True
        )
        index.storage_context.persist(persist_dir=persist_dir)
    else:
        print("Loading existing index...")
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
    return index
What this function does:
- Checks if an index already exists:
  - `os.listdir(persist_dir)` checks whether the storage directory has files
  - If empty, creates a new index; if not, loads the existing one
- Creating a new index (first time):
  - `SimpleDirectoryReader(input_files=pdf_directory).load_data()`:
    - Reads all PDFs from your specified paths
    - Extracts text content from each PDF
  - `VectorStoreIndex.from_documents()`:
    - Splits documents into chunks
    - Converts each chunk to vector embeddings
    - Creates a searchable vector database
  - `show_progress=True`: Shows a progress bar during creation
  - `index.storage_context.persist()`: Saves the index to disk for reuse
- Loading an existing index:
  - `StorageContext.from_defaults()`: Loads the storage configuration
  - `load_index_from_storage()`: Reconstructs the index from saved files
  - Much faster than recreating from scratch (see the rebuild sketch after this list)
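One behavior worth noting: because the function only builds when the storage directory is empty, adding or swapping PDFs later will not update an existing index. A small optional sketch for forcing a rebuild (this helper is not part of the original notebook; it simply clears the storage folder first):
# Optional: force a rebuild after changing pdf_directory
import shutil
shutil.rmtree(persist_dir, ignore_errors=True)   # delete the old index files
Path(persist_dir).mkdir(exist_ok=True)           # recreate the empty folder
index = load_or_create_index()                   # now rebuilds from the PDFs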
Cell 11: Initialize the system
index = load_or_create_index()
What happens here:
- Calls the function to either create or load your document index
- First run: Processes all PDFs (may take several minutes)
- Subsequent runs: Loads quickly from storage
Cell 12: Query function
def query_pdfs(question):
    """Query the PDF knowledge base"""
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        verbose=True
    )
    response = query_engine.query(question)
    return response
What each parameter does:
- `similarity_top_k=3`: Retrieves the 3 most relevant text chunks
- `response_mode="compact"`: Generates concise, focused answers
- `verbose=True`: Shows which chunks were used for the answer
- `query_engine.query(question)`: Searches the index and generates a response
How the query process works:
- Your question is converted to a vector
- System finds 3 most similar text chunks from your PDFs
- These chunks are sent to Gemini along with your question
- Gemini generates an answer based on the relevant content (the retrieval step on its own is sketched below)
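To watch step 2 in isolation, you can run just the retrieval stage without calling the LLM. A minimal sketch (the question string is only an example):
# Retrieval-only sketch: fetch the top-k chunks without generating an answer
retriever = index.as_retriever(similarity_top_k=3)
for hit in retriever.retrieve("What is this document about?"):
    # Each hit pairs a chunk of your PDFs with a similarity score
    print(hit.score, "|", hit.node.get_content()[:120].replace("\n", " "))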
Cell 13: Example query
response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?")
print(response)
Try different types of questions:
# Summarization questions
response = query_pdfs("What are the main topics covered in these documents?")
print(response)
# Specific factual questions
response = query_pdfs("What methodology was used in the research?")
print(response)
# Analytical questions
response = query_pdfs("What are the key findings and conclusions?")
print(response)
# Comparative questions
response = query_pdfs("How do the authors' recommendations differ between documents?")
print(response)
Understanding the output:
- The system will show which document chunks were used (a source-inspection sketch follows this list)
- Answers are generated based on actual content from your PDFs
- If information isn't found, the system will indicate this
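To inspect the supporting chunks programmatically, the response object carries them in `source_nodes`. A small sketch (the `file_name` metadata key is what SimpleDirectoryReader normally attaches, so treat it as an assumption):
# Sketch: list the chunks (and scores) that backed an answer
response = query_pdfs("What are the main topics covered in these documents?")
for source in response.source_nodes:
    print(source.node.metadata.get("file_name"), "| score:", source.score)
    print(source.node.get_content()[:150].replace("\n", " "), "...")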
# In Cell 6, modify the chunk size
chunk_size = 1024 # Default: 1024
# Options and their effects:
chunk_size = 512 # Smaller chunks = more precise answers, less context
chunk_size = 1024 # Balanced approach (recommended)
chunk_size = 2048 # Larger chunks = more context, potentially less precise
When to adjust chunk size:
- Use smaller chunks (512) for documents with dense, specific information
- Use larger chunks (2048) for documents that need more context to understand
# In the query_pdfs function, modify these parameters:
def query_pdfs(question):
    query_engine = index.as_query_engine(
        similarity_top_k=3,        # Number of relevant chunks to retrieve
        response_mode="compact",   # How to format the response
        verbose=True               # Show source information
    )
    response = query_engine.query(question)
    return response
Parameter explanations:
# Retrieve more or fewer relevant chunks
similarity_top_k=1 # Fast, but may miss context
similarity_top_k=3 # Good balance (recommended)
similarity_top_k=5 # More comprehensive, slower
# Different response modes
response_mode="compact" # Concise answers
response_mode="tree_summarize" # Hierarchical summarization
response_mode="accumulate" # Detailed, comprehensive responses
# Verbose output control
verbose=True # Shows which chunks were used (helpful for debugging)
verbose=False # Cleaner output, just the answer
# Customize the embedding model (Cell 8)
Settings.embed_model = GeminiEmbedding(
model_name="models/embedding-001", # Google's embedding model
api_key=API_KEY,
# Optional: add custom parameters
)
# Customize the language model (Cell 9)
Settings.llm = Gemini(
api_key=API_KEY,
model_name="models/gemini-2.5-flash", # Fast model
# Alternative: "models/gemini-1.5-pro" for more complex reasoning
temperature=0.1, # Lower = more focused, Higher = more creative
max_tokens=1000 # Maximum response length
)
# Advanced text splitting options (Cell 9)
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter
# Sentence-based splitting (default - recommended)
Settings.text_splitter = SentenceSplitter(
chunk_size=chunk_size,
chunk_overlap=20, # Overlap between chunks for continuity
)
# Token-based splitting (alternative)
Settings.text_splitter = TokenTextSplitter(
chunk_size=chunk_size,
chunk_overlap=50,
)
# Create a custom query engine with specific instructions
def query_pdfs_with_custom_prompt(question, custom_instruction=""):
    """Query with custom instructions for the AI"""
    # Custom system prompt
    system_prompt = f"""
    You are an expert document analyst. {custom_instruction}
    Always cite which document or section your information comes from.
    If you cannot find relevant information, say so clearly.
    """
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        system_prompt=system_prompt
    )
    response = query_engine.query(question)
    return response
# Example usage
response = query_pdfs_with_custom_prompt(
"Summarize the methodology",
"Focus on technical details and be very specific about procedures."
)
print(response)
This system works great for:
- Research papers: "What are the limitations mentioned in this study?"
- Legal documents: "What are the key terms and conditions?"
- Technical manuals: "How do I troubleshoot this specific issue?"
- Reports: "What were the main conclusions and recommendations?"
- Books: "What challenges did the main character face?"
Your Colab Environment/
├── Multi_PDF_RAG_with_LlamaIndex.ipynb # Main notebook from GitHub
├── your-document-1.pdf # Your uploaded PDFs
├── your-document-2.pdf
├── storage/ # Auto-created vector database
│ ├── docstore.json
│ ├── index_store.json
│ └── vector_store.json
└── README.md # This documentation
When you ask a question, the system will:
- Search through your documents for relevant content
- Retrieve the most similar text chunks
- Generate an answer using the found information
- Provide source context when available
- Error: Authentication failed
- Solution: Double-check that your API key is correctly stored in Colab secrets with the exact name `geminiapikey`
- Error: File not found
- Solution: Verify your PDF file paths in `pdf_directory`. Use the `/content/filename.pdf` format
- Error: Runtime disconnected or out of memory
- Solution: Try reducing `chunk_size` to 512 or process fewer PDFs at once
- Issue: Taking too long to process
- Solution:
- Use smaller PDFs (under 100 pages each)
- Reduce `chunk_size`
- Process fewer documents simultaneously
- Issue: Answers are not relevant or accurate
- Solution:
- Ask more specific questions
- Increase `similarity_top_k` to retrieve more context
- Ensure your PDFs contain the information you're asking about
Here's the complete notebook code with detailed explanations:
# ===== CELL 1: Install Dependencies =====
!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini
# ===== CELL 2: Import Required Libraries =====
from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage
# ===== CELL 3: Get API Key from Colab Secrets =====
API_KEY = userdata.get('geminiapikey')
# ===== CELL 4: Define PDF File Paths =====
# IMPORTANT: Update these paths with your actual PDF files
pdf_directory = ['/content/part-1.pdf','/content/part-2.pdf']
# Example with more files:
# pdf_directory = [
# '/content/research-paper-1.pdf',
# '/content/research-paper-2.pdf',
# '/content/manual.pdf'
# ]
# ===== CELL 5: Set Storage Directory =====
persist_dir = "./storage" # Where the vector index will be saved
# ===== CELL 6: Configure Chunk Size =====
chunk_size = 1024 # Size of text chunks for processing
# ===== CELL 7: Create Storage Directory =====
Path(persist_dir).mkdir(exist_ok=True)
# ===== CELL 8: Configure Embedding Model =====
Settings.embed_model = GeminiEmbedding(
model_name="models/embedding-001",
api_key=API_KEY
)
# ===== CELL 9: Configure Language Model and Text Processing =====
Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size
# ===== CELL 10: Main Function - Create or Load Vector Index =====
def load_or_create_index():
    """
    This function either:
    1. Creates a new vector index from your PDFs (first time)
    2. Loads an existing index from storage (subsequent times)
    """
    if not os.listdir(persist_dir):
        print("Creating new index...")
        print("This may take a few minutes for large PDFs...")
        # Read all PDF files
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()
        print(f"Loaded {len(documents)} documents")
        # Create vector embeddings and searchable index
        index = VectorStoreIndex.from_documents(
            documents,
            show_progress=True  # Shows progress bar
        )
        # Save the index for future use
        index.storage_context.persist(persist_dir=persist_dir)
        print("Index created and saved successfully!")
    else:
        print("Loading existing index...")
        # Load previously created index
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
        print("Index loaded successfully!")
    return index
# ===== CELL 11: Initialize the System =====
# This will either create a new index or load existing one
index = load_or_create_index()
# ===== CELL 12: Query Function =====
def query_pdfs(question):
    """
    Function to ask questions about your PDFs

    Args:
        question (str): Your question about the documents

    Returns:
        Response object with answer and source information
    """
    print(f"Question: {question}")
    print("-" * 50)
    # Create query engine
    query_engine = index.as_query_engine(
        similarity_top_k=3,        # Get 3 most relevant text chunks
        response_mode="compact",   # Generate concise response
        verbose=True               # Show source information
    )
    # Get response
    response = query_engine.query(question)
    return response
# ===== CELL 13: Example Query =====
# Ask your first question
response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?")
print(response)
# ===== CELL 14: Additional Example Queries =====
# Try different types of questions:
# Summarization
response = query_pdfs("Provide a summary of the main topics discussed")
print("SUMMARY:")
print(response)
print("\n" + "="*60 + "\n")
# Specific facts
response = query_pdfs("What specific methods or approaches are mentioned?")
print("METHODS:")
print(response)
print("\n" + "="*60 + "\n")
# Analysis
response = query_pdfs("What are the key conclusions or findings?")
print("CONCLUSIONS:")
print(response)
Phase 1: Setup (Cells 1-9)
- Install required Python packages
- Import necessary libraries
- Get API key from secure storage
- Configure file paths and settings
- Set up AI models (embedding and language models)
Phase 2: Index Creation (Cells 10-11)
- Check if vector index already exists
- If not, read PDFs and create embeddings
- Save index for future use
- If exists, load from storage
Phase 3: Querying (Cells 12-14)
- Define function to process questions
- Search through vector database for relevant content
- Generate AI-powered answers
- Display results with source information
# ===== OPTIONAL: Add this cell for debugging =====
def debug_index_info():
    """Display information about your vector index"""
    print("=== INDEX INFORMATION ===")
    print(f"Storage directory: {persist_dir}")
    print(f"Directory exists: {os.path.exists(persist_dir)}")
    if os.path.exists(persist_dir):
        files = os.listdir(persist_dir)
        print(f"Storage files: {files}")
    print(f"PDF files to process: {pdf_directory}")
    for pdf_path in pdf_directory:
        exists = os.path.exists(pdf_path)
        print(f"  {pdf_path}: {'✓ Found' if exists else '✗ Missing'}")

# Run this to check your setup
debug_index_info()
# ===== OPTIONAL: Test with simple question first =====
def test_system():
    """Test the system with a simple question"""
    try:
        response = query_pdfs("What is this document about?")
        print("✓ System working correctly!")
        print("Response:", str(response)[:200] + "...")
        return True
    except Exception as e:
        print("✗ Error in system:")
        print(f"Error: {e}")
        return False

# Run this to test your setup
test_system()
- Gemini API: Free tier includes generous limits
- Google Colab: Free tier sufficient for most use cases
- Storage: Vector indices stored temporarily in Colab session
- Session-based: Data is lost when the Colab runtime disconnects (a Drive-based workaround is sketched after this list)
- File size: Large PDFs (>100MB) may cause memory issues
- Languages: Works best with English text
- Complex layouts: Tables are flattened to plain text and images are not extracted
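If losing the index on disconnect is a problem, one optional workaround is to keep `persist_dir` on Google Drive instead of the temporary Colab filesystem. A hedged sketch (the folder name `pdf_rag_storage` is just an example; run this before Cell 5 and skip the original `persist_dir` assignment):
# Sketch: persist the vector index to Google Drive so it survives disconnects
from pathlib import Path
from google.colab import drive

drive.mount('/content/drive')                           # authorize Drive access
persist_dir = "/content/drive/MyDrive/pdf_rag_storage"  # example folder name
Path(persist_dir).mkdir(parents=True, exist_ok=True)    # create it if missing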
We welcome contributions to improve this project! Here's how you can help:
- Fork the Repository
  - Visit https://github.com/yasithrashan/llamaindex-multipdf-rag
  - Click the "Fork" button
- Make Your Improvements
  - Test with different types of PDFs
  - Add new features or fix bugs
  - Update documentation as needed
- Submit a Pull Request
  - Describe your changes and their benefits
Contribution ideas:
- New features such as a web interface or batch processing
- Performance optimizations and better error handling
- More examples and troubleshooting guides
- Testing with different PDF types and edge cases
MIT License - feel free to use and modify for your projects.
If you encounter issues:
- Check the troubleshooting section above for common solutions
- Verify all setup steps were completed correctly
- Try with a simple, small PDF first to test the system
- Search existing issues on GitHub
- Open a new issue with detailed error information if needed
If this project helped you, please consider giving it a ⭐ on GitHub!