
Multi-PDF RAG with LlamaIndex

GitHub Repository Open in Colab

A Retrieval-Augmented Generation (RAG) system that allows you to chat with multiple PDF documents using Google's Gemini AI. Ask questions about your documents and get intelligent answers based on their content.

What This Project Does

  • Upload multiple PDFs and ask questions about their content
  • Intelligent search through your documents using AI embeddings
  • Persistent storage - build the index once, use it multiple times
  • Smart responses using Google's Gemini 2.5 Flash model
  • Easy setup in Google Colab with no local installation required

Prerequisites

Before you start, you'll need:

  1. Google Account for accessing Google Colab
  2. Google Gemini API Key (free tier available)
    • Visit Google AI Studio
    • Click "Create API Key"
    • Copy your API key (you'll need this later)

Step-by-Step Setup Guide

Step 1: Get the Code

  1. Click the "Open in Colab" badge at the top of this README to open the notebook directly in Google Colab
  2. Alternative: Download the Multi_PDF_RAG_with_LlamaIndex.ipynb file from the GitHub repository and upload it to Google Colab

Step 2: Set Up Your API Key

  1. In Google Colab, look for the key icon in the left sidebar
  2. Click on "Secrets" tab
  3. Click "Add new secret"
  4. Name: geminiapikey
  5. Value: Paste your Google Gemini API key here
  6. Toggle the "Notebook access" switch to ON

Step 3: Install Required Libraries

Cell 1: Install Dependencies

!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini

What this does:

  • llama-index: Core framework for building RAG applications
  • pypdf: Library for reading and processing PDF files
  • llama-index-embeddings-gemini: Google Gemini embedding model integration
  • llama-index-llms-gemini: Google Gemini language model integration
  • -q flag: Quiet installation (less verbose output)

Step 4: Import Required Modules

Cell 2: Import Libraries

from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage

What each import does:

  • pathlib.Path: Handle file system paths
  • os: Operating system interface for file operations
  • google.colab.userdata: Access Colab secrets (API keys)
  • VectorStoreIndex: Create searchable vector database from documents
  • SimpleDirectoryReader: Read PDF files from specified locations
  • Settings: Global configuration for LlamaIndex
  • StorageContext: Manage persistent storage of the vector index
  • SentenceSplitter: Split documents into manageable chunks
  • GeminiEmbedding: Convert text to vector embeddings using Gemini
  • Gemini: Google's language model for generating responses

Step 5: Configure API Access

Cell 3: Get API Key

API_KEY = userdata.get('geminiapikey')

What this does:

  • Retrieves your Gemini API key from Colab's secure storage
  • userdata.get() safely accesses the secret you stored earlier
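
If the secret is missing or notebook access is switched off, userdata.get() will fail or return nothing. A minimal sanity check you can add to the same cell (an optional addition, not part of the original notebook):

# Optional: confirm the key was actually retrieved before configuring the models
if not API_KEY:
    raise ValueError("Gemini API key not found - check the Colab secret named 'geminiapikey'.")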

Step 6: Prepare Your PDF Files

  1. Upload your PDFs to Colab:

    • Click the folder icon in the left sidebar
    • Drag and drop your PDF files into the file browser
    • Wait for upload to complete
  2. Configure file paths:

Cell 4: Set PDF Locations

pdf_directory = ['/content/part-1.pdf','/content/part-2.pdf']

Update this with your files:

# Replace with your actual PDF file names
pdf_directory = [
    '/content/your-document-1.pdf',
    '/content/your-document-2.pdf',
    '/content/your-document-3.pdf'
]

What this does:

  • Creates a list of file paths pointing to your uploaded PDFs
  • /content/ is Colab's default upload directory
  • You can add as many PDF files as needed
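
If you prefer not to list every file by hand, you can also collect all PDFs uploaded to /content automatically. A small optional sketch that simply replaces the manual list above:

# Optional: gather every PDF in Colab's upload directory automatically
from pathlib import Path

pdf_directory = [str(p) for p in Path('/content').glob('*.pdf')]
print(f"Found {len(pdf_directory)} PDF(s): {pdf_directory}")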

Step 7: Configure Storage and Processing Settings

Cells 5-6: Set up storage and chunk size

persist_dir = "./storage"
chunk_size = 1024

What these settings mean:

  • persist_dir: Directory where the vector index will be saved
  • chunk_size: The approximate number of tokens in each text chunk
    • Smaller chunks (512): More precise but less context
    • Larger chunks (2048): More context but less precise
    • 1024 is a good balance for most documents

Cell 7: Create storage directory

Path(persist_dir).mkdir(exist_ok=True)

What this does:

  • Creates the storage directory if it doesn't exist
  • exist_ok=True prevents errors if directory already exists

Step 8: Configure AI Models

Cell 8: Set up embedding model

Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", api_key=API_KEY
)

What this does:

  • Configures the embedding model that converts text to vectors
  • embedding-001 is Google's text embedding model
  • These vectors enable semantic search through your documents
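
To confirm the embedding model is reachable before indexing, you can embed a short test string (an optional check, assuming Cell 8 has already run):

# Optional: quick embedding check - prints the vector dimension
sample_vector = Settings.embed_model.get_text_embedding("hello world")
print(f"Embedding dimension: {len(sample_vector)}")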

Cell 9: Configure language model and text processing

Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size

What each setting does:

  • Settings.llm: The AI model that generates answers to your questions
  • gemini-2.5-flash: Fast and efficient version of Gemini
  • SentenceSplitter: Intelligently splits text at sentence boundaries
  • Global chunk_size setting ensures consistency
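
You can also give the language model a one-line smoke test at this point (optional, assuming Cell 9 has run and the API key is valid):

# Optional: confirm the Gemini LLM responds before building the index
print(Settings.llm.complete("Reply with the single word: ready"))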

Step 9: Create the RAG System

Cell 10: Main function to create or load index

def load_or_create_index():
    """Load existing index or create new one if it doesn't exist"""
    if not os.listdir(persist_dir):
        print("Creating new index...")
        # Load PDF documents
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()

        # Create and persist index
        index = VectorStoreIndex.from_documents(
            documents, show_progress=True
        )
        index.storage_context.persist(persist_dir=persist_dir)
    else:
        print("Loading existing index...")
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)

    return index

What this function does:

  1. Checks if index already exists:

    • os.listdir(persist_dir) checks if storage directory has files
    • If empty, creates new index; if not, loads existing one
  2. Creating new index (first time):

    • SimpleDirectoryReader(input_files=pdf_directory).load_data():
      • Reads all PDFs from your specified paths
      • Extracts text content from each PDF
    • VectorStoreIndex.from_documents():
      • Splits documents into chunks
      • Converts each chunk to vector embeddings
      • Creates searchable vector database
    • show_progress=True: Shows progress bar during creation
    • index.storage_context.persist(): Saves index to disk for reuse
  3. Loading existing index:

    • StorageContext.from_defaults(): Loads storage configuration
    • load_index_from_storage(): Reconstructs index from saved files
    • Much faster than recreating from scratch
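
Because the function only rebuilds when the storage directory is empty, adding or replacing PDFs later means the old index must be cleared first. A minimal rebuild helper (a hypothetical addition, not part of the original notebook):

import shutil

def rebuild_index():
    """Delete the saved index and rebuild it from the current pdf_directory."""
    if os.path.exists(persist_dir):
        shutil.rmtree(persist_dir)           # remove stale index files
    Path(persist_dir).mkdir(exist_ok=True)   # recreate the empty directory
    return load_or_create_index()            # empty directory triggers a rebuild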

Cell 11: Initialize the system

index = load_or_create_index()

What happens here:

  • Calls the function to either create or load your document index
  • First run: Processes all PDFs (may take several minutes)
  • Subsequent runs: Loads quickly from storage

Step 10: Create Query Function

Cell 12: Query function

def query_pdfs(question):
    """Query the PDF knowledge base"""
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        verbose=True
    )
    response = query_engine.query(question)
    return response

What each parameter does:

  • similarity_top_k=3: Retrieves the 3 most relevant text chunks
  • response_mode="compact": Generates concise, focused answers
  • verbose=True: Shows which chunks were used for the answer
  • query_engine.query(question): Searches index and generates response

How the query process works:

  1. Your question is converted to a vector
  2. System finds 3 most similar text chunks from your PDFs
  3. These chunks are sent to Gemini along with your question
  4. Gemini generates an answer based on the relevant content
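
If you want to see the retrieval step on its own, you can query the retriever directly and print the matched chunks with their similarity scores (an optional sketch that reuses the index created in Cell 11):

# Optional: inspect retrieval without generating an answer
retriever = index.as_retriever(similarity_top_k=3)
nodes = retriever.retrieve("What challenges do the heroes face?")
for node in nodes:
    print(f"Score: {node.score:.3f}")
    print(node.node.get_content()[:200], "...\n")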

Step 11: Start Asking Questions

Cell 13: Example query

response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?")
print(response)

Try different types of questions:

# Summarization questions
response = query_pdfs("What are the main topics covered in these documents?")
print(response)

# Specific factual questions
response = query_pdfs("What methodology was used in the research?")
print(response)

# Analytical questions
response = query_pdfs("What are the key findings and conclusions?")
print(response)

# Comparative questions
response = query_pdfs("How do the authors' recommendations differ between documents?")
print(response)

Understanding the output:

  • The system will show which document chunks were used
  • Answers are generated based on actual content from your PDFs
  • If information isn't found, the system will indicate this
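
The response object also keeps the chunks it was built from, so you can check exactly which passages fed each answer. An optional sketch, assuming the query_pdfs function above:

response = query_pdfs("What are the main topics covered in these documents?")
print(response)

# Each source node carries the chunk text, a similarity score, and metadata
# such as the originating file name added by SimpleDirectoryReader
for source in response.source_nodes:
    print(f"\nScore: {source.score:.3f}  File: {source.node.metadata.get('file_name')}")
    print(source.node.get_content()[:150], "...")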

Advanced Configuration Options

Customizing Chunk Size

# In Cell 6, modify the chunk size
chunk_size = 1024  # Default: 1024

# Options and their effects:
chunk_size = 512   # Smaller chunks = more precise answers, less context
chunk_size = 1024  # Balanced approach (recommended)
chunk_size = 2048  # Larger chunks = more context, potentially less precise

When to adjust chunk size:

  • Use smaller chunks (512) for documents with dense, specific information
  • Use larger chunks (2048) for documents that need more context to understand

Customizing Search Parameters

# In the query_pdfs function, modify these parameters:
def query_pdfs(question):
    query_engine = index.as_query_engine(
        similarity_top_k=3,      # Number of relevant chunks to retrieve
        response_mode="compact", # How to format the response
        verbose=True            # Show source information
    )
    response = query_engine.query(question)
    return response

Parameter explanations:

# Retrieve more or fewer relevant chunks
similarity_top_k=1    # Fast, but may miss context
similarity_top_k=3    # Good balance (recommended)
similarity_top_k=5    # More comprehensive, slower

# Different response modes
response_mode="compact"        # Concise answers
response_mode="tree_summarize" # Hierarchical summarization
response_mode="accumulate"     # Detailed, comprehensive responses

# Verbose output control
verbose=True   # Shows which chunks were used (helpful for debugging)
verbose=False  # Cleaner output, just the answer
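
If you want to experiment without editing query_pdfs each time, one option is a variant that exposes these settings as arguments (a hypothetical helper, not part of the original notebook; it assumes the index from Cell 11):

def query_pdfs_custom(question, top_k=3, mode="compact", show_sources=False):
    """Variant of query_pdfs with adjustable retrieval and response settings."""
    query_engine = index.as_query_engine(
        similarity_top_k=top_k,
        response_mode=mode,
        verbose=show_sources,
    )
    return query_engine.query(question)

# Example: broader retrieval with a hierarchical summary
response = query_pdfs_custom("Summarize the documents", top_k=5, mode="tree_summarize")
print(response)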

Advanced Model Configuration

# Customize the embedding model (Cell 8)
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001",  # Google's embedding model
    api_key=API_KEY,
    # Optional: add custom parameters
)

# Customize the language model (Cell 9)
Settings.llm = Gemini(
    api_key=API_KEY, 
    model_name="models/gemini-2.5-flash",  # Fast model
    # Alternative: "models/gemini-1.5-pro" for more complex reasoning
    temperature=0.1,  # Lower = more focused, Higher = more creative
    max_tokens=1000   # Maximum response length
)

Custom Text Splitting

# Advanced text splitting options (Cell 9)
from llama_index.core.node_parser import SentenceSplitter, TokenTextSplitter

# Sentence-based splitting (default - recommended)
Settings.text_splitter = SentenceSplitter(
    chunk_size=chunk_size,
    chunk_overlap=20,  # Overlap between chunks for continuity
)

# Token-based splitting (alternative)
Settings.text_splitter = TokenTextSplitter(
    chunk_size=chunk_size,
    chunk_overlap=50,
)

Adding Custom Prompts

# Create a custom query engine with specific instructions
def query_pdfs_with_custom_prompt(question, custom_instruction=""):
    """Query with custom instructions for the AI"""
    
    # Custom system prompt
    system_prompt = f"""
    You are an expert document analyst. {custom_instruction}
    Always cite which document or section your information comes from.
    If you cannot find relevant information, say so clearly.
    """
    
    query_engine = index.as_query_engine(
        similarity_top_k=3,
        response_mode="compact",
        system_prompt=system_prompt
    )
    
    response = query_engine.query(question)
    return response

# Example usage
response = query_pdfs_with_custom_prompt(
    "Summarize the methodology", 
    "Focus on technical details and be very specific about procedures."
)
print(response)

Example Use Cases

This system works great for:

  • Research papers: "What are the limitations mentioned in this study?"
  • Legal documents: "What are the key terms and conditions?"
  • Technical manuals: "How do I troubleshoot this specific issue?"
  • Reports: "What were the main conclusions and recommendations?"
  • Books: "What challenges did the main character face?"

File Structure After Setup

Your Colab Environment/
├── Multi_PDF_RAG_with_LlamaIndex.ipynb    # Main notebook from GitHub
├── your-document-1.pdf                    # Your uploaded PDFs
├── your-document-2.pdf
├── storage/                               # Auto-created vector database
│   ├── docstore.json
│   ├── index_store.json
│   └── vector_store.json
└── README.md                              # This documentation

Understanding the Output

When you ask a question, the system will:

  1. Search through your documents for relevant content
  2. Retrieve the most similar text chunks
  3. Generate an answer using the found information
  4. Provide source context when available

Troubleshooting Common Issues

API Key Problems

  • Error: Authentication failed
  • Solution: Double-check that your API key is correctly stored in Colab secrets with the exact name geminiapikey

File Path Issues

  • Error: File not found
  • Solution: Verify your PDF file paths in pdf_directory. Use /content/filename.pdf format

Memory or Timeout Issues

  • Error: Runtime disconnected or out of memory
  • Solution: Try reducing chunk_size to 512 or process fewer PDFs at once

Slow Performance

  • Issue: Taking too long to process
  • Solution:
    • Use smaller PDFs (under 100 pages each)
    • Reduce chunk_size
    • Process fewer documents simultaneously

Poor Answer Quality

  • Issue: Answers are not relevant or accurate
  • Solution:
    • Ask more specific questions
    • Increase similarity_top_k to retrieve more context
    • Ensure your PDFs contain the information you're asking about

Complete Code Walkthrough

Here's the complete notebook code with detailed explanations:

Complete Notebook Structure

# ===== CELL 1: Install Dependencies =====
!pip install -q llama-index pypdf
!pip install -q llama-index-embeddings-gemini
!pip install -q llama-index-llms-gemini
# ===== CELL 2: Import Required Libraries =====
from pathlib import Path
import os
from google.colab import userdata
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.core.storage import StorageContext
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.core import load_index_from_storage
# ===== CELL 3: Get API Key from Colab Secrets =====
API_KEY = userdata.get('geminiapikey')
# ===== CELL 4: Define PDF File Paths =====
# IMPORTANT: Update these paths with your actual PDF files
pdf_directory = ['/content/part-1.pdf','/content/part-2.pdf']

# Example with more files:
# pdf_directory = [
#     '/content/research-paper-1.pdf',
#     '/content/research-paper-2.pdf',
#     '/content/manual.pdf'
# ]
# ===== CELL 5: Set Storage Directory =====
persist_dir = "./storage"  # Where the vector index will be saved
# ===== CELL 6: Configure Chunk Size =====
chunk_size = 1024  # Size of text chunks for processing
# ===== CELL 7: Create Storage Directory =====
Path(persist_dir).mkdir(exist_ok=True)
# ===== CELL 8: Configure Embedding Model =====
Settings.embed_model = GeminiEmbedding(
    model_name="models/embedding-001", 
    api_key=API_KEY
)
# ===== CELL 9: Configure Language Model and Text Processing =====
Settings.llm = Gemini(api_key=API_KEY, model_name="models/gemini-2.5-flash")
Settings.text_splitter = SentenceSplitter(chunk_size=chunk_size)
Settings.chunk_size = chunk_size
# ===== CELL 10: Main Function - Create or Load Vector Index =====
def load_or_create_index():
    """
    This function either:
    1. Creates a new vector index from your PDFs (first time)
    2. Loads an existing index from storage (subsequent times)
    """
    if not os.listdir(persist_dir):
        print("Creating new index...")
        print("This may take a few minutes for large PDFs...")
        
        # Read all PDF files
        documents = SimpleDirectoryReader(input_files=pdf_directory).load_data()
        print(f"Loaded {len(documents)} documents")

        # Create vector embeddings and searchable index
        index = VectorStoreIndex.from_documents(
            documents, 
            show_progress=True  # Shows progress bar
        )
        
        # Save the index for future use
        index.storage_context.persist(persist_dir=persist_dir)
        print("Index created and saved successfully!")
        
    else:
        print("Loading existing index...")
        # Load previously created index
        storage_context = StorageContext.from_defaults(persist_dir=persist_dir)
        index = load_index_from_storage(storage_context)
        print("Index loaded successfully!")

    return index
# ===== CELL 11: Initialize the System =====
# This will either create a new index or load existing one
index = load_or_create_index()
# ===== CELL 12: Query Function =====
def query_pdfs(question):
    """
    Function to ask questions about your PDFs
    
    Args:
        question (str): Your question about the documents
    
    Returns:
        Response object with answer and source information
    """
    print(f"Question: {question}")
    print("-" * 50)
    
    # Create query engine
    query_engine = index.as_query_engine(
        similarity_top_k=3,      # Get 3 most relevant text chunks
        response_mode="compact", # Generate concise response
        verbose=True            # Show source information
    )
    
    # Get response
    response = query_engine.query(question)
    return response
# ===== CELL 13: Example Query =====
# Ask your first question
response = query_pdfs("What challenges do the heroes face on their journey to recover the Crystal of Lumina?")
print(response)
# ===== CELL 14: Additional Example Queries =====
# Try different types of questions:

# Summarization
response = query_pdfs("Provide a summary of the main topics discussed")
print("SUMMARY:")
print(response)
print("\n" + "="*60 + "\n")

# Specific facts
response = query_pdfs("What specific methods or approaches are mentioned?")
print("METHODS:")
print(response)
print("\n" + "="*60 + "\n")

# Analysis
response = query_pdfs("What are the key conclusions or findings?")
print("CONCLUSIONS:")
print(response)

Understanding the Code Flow

Phase 1: Setup (Cells 1-9)

  1. Install required Python packages
  2. Import necessary libraries
  3. Get API key from secure storage
  4. Configure file paths and settings
  5. Set up AI models (embedding and language models)

Phase 2: Index Creation (Cells 10-11)

  1. Check if vector index already exists
  2. If not, read PDFs and create embeddings
  3. Save index for future use
  4. If exists, load from storage

Phase 3: Querying (Cells 12-14)

  1. Define function to process questions
  2. Search through vector database for relevant content
  3. Generate AI-powered answers
  4. Display results with source information

Debugging and Monitoring Code

# ===== OPTIONAL: Add this cell for debugging =====
def debug_index_info():
    """Display information about your vector index"""
    print("=== INDEX INFORMATION ===")
    print(f"Storage directory: {persist_dir}")
    print(f"Directory exists: {os.path.exists(persist_dir)}")
    
    if os.path.exists(persist_dir):
        files = os.listdir(persist_dir)
        print(f"Storage files: {files}")
    
    print(f"PDF files to process: {pdf_directory}")
    for pdf_path in pdf_directory:
        exists = os.path.exists(pdf_path)
        print(f"  {pdf_path}: {'✓ Found' if exists else '✗ Missing'}")

# Run this to check your setup
debug_index_info()
# ===== OPTIONAL: Test with simple question first =====
def test_system():
    """Test the system with a simple question"""
    try:
        response = query_pdfs("What is this document about?")
        print("✓ System working correctly!")
        print("Response:", str(response)[:200] + "...")
        return True
    except Exception as e:
        print("✗ Error in system:")
        print(f"Error: {e}")
        return False

# Run this to test your setup
test_system()

Cost Considerations

  • Gemini API: The free tier includes generous limits
  • Google Colab: The free tier is sufficient for most use cases
  • Storage: Vector indices are stored temporarily in the Colab session

Limitations

  • Session-based: Data is lost when the Colab runtime disconnects (see the Google Drive sketch below)
  • File size: Large PDFs (>100MB) may cause memory issues
  • Languages: Works best with English text
  • Complex layouts: Tables may lose their structure and images are not extracted; only the text layer of each PDF is processed
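
If you need the index to survive runtime restarts, one common workaround (an optional sketch, not part of the original notebook) is to mount Google Drive and point persist_dir at a Drive folder before Cells 7-11 run:

# Optional: keep the vector index on Google Drive across sessions
from google.colab import drive
drive.mount('/content/drive')

persist_dir = "/content/drive/MyDrive/rag_storage"   # hypothetical Drive folder
Path(persist_dir).mkdir(parents=True, exist_ok=True)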

Contributing

We welcome contributions to improve this project! Here's how you can help:

  1. Fork the Repository

  2. Make Your Improvements

    • Test with different types of PDFs
    • Add new features or fix bugs
    • Update documentation as needed
  3. Submit a Pull Request

    • Describe your changes and their benefits

Ideas for Contributions

  • New features like web interface or batch processing
  • Performance optimizations and better error handling
  • More examples and troubleshooting guides
  • Testing with different PDF types and edge cases

License

MIT License - feel free to use and modify for your projects.

Support

If you encounter issues:

  1. Check the troubleshooting section above for common solutions
  2. Verify all setup steps were completed correctly
  3. Try with a simple, small PDF first to test the system
  4. Search existing issues on GitHub
  5. Open a new issue with detailed error information if needed

Star the Repository

If this project helped you, please consider giving it a ⭐ on GitHub!
