This repository contains materials for the online workshop "Building a PDF-driven RAG system with Weaviate".
You’ll learn how to:
- Extract and preprocess text and images from PDFs
- Chunk and embed document content
- Store and retrieve data using Weaviate
- Build Retrieval-Augmented Generation (RAG) pipelines that combine text and images
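As a rough illustration of the chunking step listed above, here is a minimal sketch of fixed-size chunking with overlap. The function name `chunk_text` and its `chunk_size` and `overlap` parameters are hypothetical and chosen only for this example; the workshop notebooks may chunk differently.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of up to chunk_size characters.

    Consecutive chunks share `overlap` characters, so a sentence cut at a
    chunk boundary still appears whole in at least one neighbouring chunk.
    Hypothetical helper for illustration, not the workshop's implementation.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

For example, `chunk_text("abcdefghij", chunk_size=4, overlap=1)` returns `["abcd", "defg", "ghij"]`: each chunk starts with the last character of the previous one.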
Prerequisites:
- Python 3.10+
- Anthropic API key (for generative AI use)
  - In the live session, the instructor may provide temporary keys.
Important: 🚀 WORKSHOP SETUP - COMPLETE BEFORE STARTING
Clone this repository:
    git clone git@github.com:weaviate-tutorials/workshop-pdf-driven-rag.git
    cd workshop-pdf-driven-rag
Copy the .env.example to .env:
    cp .env.example .env
Fill in the API keys in your .env file:
- Add your ANTHROPIC_API_KEY
- Add your COHERE_API_KEY
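To confirm that both keys are filled in before opening the notebooks, a small check along these lines may help. It is a sketch that assumes the .env file uses plain `KEY=value` lines; the helper names `parse_env` and `missing_keys` are hypothetical and not part of the repository.

```python
from pathlib import Path

# The two variable names come from the setup step above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "COHERE_API_KEY")


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=value lines, skipping blanks and # comments.

    Assumption: no multi-line values or variable interpolation.
    """
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    if not env_path.exists():
        print("No .env file found; copy .env.example to .env first.")
    else:
        missing = missing_keys(parse_env(env_path.read_text()))
        if missing:
            print("Missing keys: " + ", ".join(missing))
        else:
            print("All required keys are set.")
```

Run it from the repository root so that the relative `.env` path resolves.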
Install the dependencies using one of these methods:
Option A: Using venv & pip:
    python3 -m venv .venv
    source .venv/bin/activate # On Windows use `.venv\Scripts\activate.bat`
    pip install -r requirements.txt
Option B: Using uv:
    uv sync
    source .venv/bin/activate
Option C: Use any other tool you prefer
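Whichever option you pick, it can be worth checking that the active interpreter meets the Python 3.10+ requirement and runs inside a virtual environment. The following is a minimal sketch using only the standard library; the function names `version_ok` and `in_virtualenv` are hypothetical.

```python
import sys

MIN_VERSION = (3, 10)  # the workshop's stated minimum Python version


def version_ok(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def in_virtualenv() -> bool:
    """Return True when running inside a venv-style environment.

    Inside such an environment sys.prefix points at the environment,
    while sys.base_prefix still points at the base interpreter.
    """
    return sys.prefix != sys.base_prefix


print("Python version OK:", version_ok())
print("Virtual environment active:", in_virtualenv())
```

If the second line prints `False`, the environment is probably not activated in the current shell.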
The workshop is organized as a series of numbered Jupyter notebooks.
- 1_basics_of_working_with_pdfs.ipynb
- 2_basic_rag.ipynb
- 3_pdfs_with_images.ipynb
- 4_pdfs_simplified.ipynb
Tip: There is also a completed version of each notebook, with a -completed suffix. If you get stuck, you can refer to these.
You can run the notebooks using Jupyter/JupyterLab or VSCode.
- Using JupyterLab:
  - Activate the Python environment where you installed the dependencies
  - Start JupyterLab (with hidden files visible):
        jupyter lab --ContentsManager.allow_hidden=True
  - Open the notebook file you want to run
- Using VSCode:
  - Open the repository folder in VSCode
  - Open the notebook file you want to run
  - Make sure to select the correct Python interpreter (the one where you installed the dependencies)
There are some helper scripts:
- preprocess_pdf_to_img.py: Convert PDF pages to images
- preprocess_pdf_to_md.py: Convert PDF to markdown text
- preprocess_img_to_embeddings_cohere.py: Pre-generate image embeddings & object data with Cohere embeddings
Congratulations on completing the workshop! Ready to take your RAG skills to the next level?
Free Resources:
- 🎓 Weaviate Academy - Free courses on vector databases, RAG, and AI applications
- 📖 Documentation - Complete guides, tutorials, and API references
- 🎙️ Workshops - Live workshops and events
Upcoming Workshops:
- Oct 2: Intro to building AI-native applications with Weaviate
- Oct 7: Building Intelligent Chatbots with Pydantic AI and Weaviate
- More workshops listed on our events page
Connect with other developers and get help:
- 💬 Slack Community - Meet & share projects
- 💻 Forum - Ask questions & get support
- 📧 Newsletter - Latest AI & vector database news
Thank you for building with Weaviate! Happy vector searching! ✨