weaviate-tutorials/workshop-pdf-driven-rag

Building a PDF-driven RAG system with Weaviate

This repository contains materials for the online workshop "Building a PDF-driven RAG system with Weaviate".

You’ll learn how to:

  • Extract and preprocess text and images from PDFs
  • Chunk and embed document content
  • Store and retrieve data using Weaviate
  • Build Retrieval-Augmented Generation (RAG) pipelines that combine text and images
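To make the chunking step more concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. It is only an illustration: the function name `chunk_text` and the parameters `chunk_size` and `overlap` are hypothetical, and the workshop notebooks may chunk documents differently.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character windows of chunk_size.

    Each window shares `overlap` characters with the previous one.
    (Hypothetical helper, not taken from the workshop code.)
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap keeps a sentence that straddles a boundary present in two neighbouring chunks, at the cost of storing some text twice.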

Requirements

  • Python 3.10+
  • Anthropic API key (for generative AI) and Cohere API key (both are requested in the .env setup step).
    • In the live session, the instructor may provide temporary keys.
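If you are unsure whether your interpreter meets the Python 3.10+ requirement, a short check along these lines can help. This is a sketch; the helper name `meets_minimum` is hypothetical and not part of the workshop code.

```python
import sys

# Minimum version taken from the requirements list above.
MIN_VERSION = (3, 10)


def meets_minimum(version_info=None, minimum=MIN_VERSION):
    """Return True if a (major, minor, ...) version satisfies the minimum."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) >= tuple(minimum)


if __name__ == "__main__":
    found = sys.version.split()[0]
    if meets_minimum():
        print("Python version OK:", found)
    else:
        print("Python 3.10+ is required, found:", found)
```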

Important

🚀 WORKSHOP SETUP - COMPLETE BEFORE STARTING

🔧 Required Setup Steps

You must complete ALL these steps before running the workshop notebooks!

Setup Checklist:
☐ Repository cloned and configured
☐ Environment file (.env) created
☐ API keys added to .env file
☐ Python environment set up
☐ Dependencies installed

Setup instructions

1. Repository setup

Clone this repository:

git clone git@github.com:weaviate-tutorials/workshop-pdf-driven-rag.git
cd workshop-pdf-driven-rag

Copy the .env.example to .env:

cp .env.example .env

2. AI model provider API keys

Fill in the API keys in your .env file:

  • Add your ANTHROPIC_API_KEY
  • Add your COHERE_API_KEY

Note

In the live session, the instructor may provide temporary keys.
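Before opening the notebooks, it can be useful to confirm that both keys are actually set. The sketch below uses only the standard library and assumes the .env file holds simple `KEY=VALUE` lines. The helper names `parse_env_text` and `missing_keys` are hypothetical, and the notebooks may load the file in a different way.

```python
import os
from pathlib import Path

# Key names taken from the setup step above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "COHERE_API_KEY")


def parse_env_text(text):
    """Parse simple KEY=VALUE lines, ignoring blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Remove one layer of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    file_values = parse_env_text(env_path.read_text()) if env_path.exists() else {}
    # Variables already exported in the shell take precedence over the file.
    exported = {k: v for k, v in os.environ.items() if k in REQUIRED_KEYS}
    missing = missing_keys({**file_values, **exported})
    print("All keys set." if not missing else f"Missing keys: {', '.join(missing)}")
```

Run it from the repository root so that the relative `.env` path resolves.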

3. Set up your Python environment

Choose one of these methods:

Option A: Using venv & pip:

python3 -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate.bat`
pip install -r requirements.txt

Option B: Using uv:

uv sync
source .venv/bin/activate

Option C: Use any other tool you prefer

Workshop notebooks

The workshop is organized as a series of numbered Jupyter notebooks.

  • 1_basics_of_working_with_pdfs.ipynb
  • 2_basic_rag.ipynb
  • 3_pdfs_with_images.ipynb
  • 4_pdfs_simplified.ipynb

Tip

There is also a completed version of each notebook, with a -completed suffix. If you get stuck, you can refer to these.

Running the notebooks

You can run the notebooks using Jupyter/JupyterLab or VSCode.

  • Using JupyterLab:

    • Activate the Python environment where you installed the dependencies
    • Start JupyterLab (with hidden files visible):
    jupyter lab --ContentsManager.allow_hidden=True
    • Open the notebook file you want to run
  • Using VSCode:

    • Open the repository folder in VSCode
    • Open the notebook file you want to run
    • Make sure to select the correct Python interpreter (the one where you installed the dependencies)

Helper scripts

The repository also includes these helper scripts:

  • preprocess_pdf_to_img.py: Convert PDF pages to images
  • preprocess_pdf_to_md.py: Convert PDF to markdown text
  • preprocess_img_to_embeddings_cohere.py: Pre-generate image embeddings and object data with Cohere

🎉 Next Steps

Congratulations on completing the workshop! Ready to take your RAG skills to the next level?

📚 Continue Learning

Free Resources:

  • 🎓 Weaviate Academy - Free courses on vector databases, RAG, and AI applications
  • 📖 Documentation - Complete guides, tutorials, and API references
  • 🎙️ Workshops - Live workshops and events

Upcoming Workshops:

🌟 Join the Community

Connect with other developers and get help:


Thank you for building with Weaviate! Happy vector searching!
