Building a PDF-driven RAG system with Weaviate

This repository contains materials for the online workshop "Building a PDF-driven RAG system with Weaviate".

You’ll learn how to:

Extract and preprocess text and images from PDFs
Chunk and embed document content
Store and retrieve data using Weaviate
Build Retrieval-Augmented Generation (RAG) pipelines that combine text and images
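
As a rough illustration of the chunking step listed above, here is a minimal sketch of fixed-size chunking with overlap. It is not taken from the workshop notebooks, which may chunk differently (for example by heading or by token count). The function name `chunk_text` and its default sizes are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Illustrative only: the name and defaults are assumptions, not workshop code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be at least 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a window has reached the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", chunk_size=4, overlap=2):
        print(chunk)
```

The overlap means that a sentence cut at a chunk boundary still appears whole in at least one neighbouring chunk, at the cost of storing some text twice.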

Requirements

Python 3.10+
Anthropic API key (for generative AI use) and Cohere API key (used for embeddings).
- In the live session, the instructor may provide temporary keys.

Important

🚀 WORKSHOP SETUP - COMPLETE BEFORE STARTING

🔧 Required Setup Steps

Complete ALL of these steps before running the workshop notebooks!

Setup Checklist:
☐ Repository cloned and configured
☐ Environment file (.env) created
☐ API keys added to .env file
☐ Python environment set up
☐ Dependencies installed

Setup instructions

1. Repository setup

Clone this repository:

git clone git@github.com:weaviate-tutorials/workshop-pdf-driven-rag.git
cd workshop-pdf-driven-rag

Copy the .env.example to .env:

cp .env.example .env

2. AI model provider API keys

Fill in the API keys in your .env file:

Add your ANTHROPIC_API_KEY
Add your COHERE_API_KEY

Note

In the live session, the instructor may provide temporary keys.
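
To confirm that both keys made it into your .env file, a small check like the one below can help. This is a sketch using only the standard library. The parser handles plain KEY=VALUE lines and is not a full dotenv implementation, and the helper names (`parse_env_text`, `missing_keys`, `check_env_file`) are hypothetical rather than part of this repository.

```python
from pathlib import Path

# Key names taken from the setup step above.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "COHERE_API_KEY")


def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(env: dict[str, str], required=REQUIRED_KEYS) -> list[str]:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not env.get(key)]


def check_env_file(path: str = ".env") -> list[str]:
    """Return the required keys missing from the given file (all if no file)."""
    env_path = Path(path)
    text = env_path.read_text() if env_path.exists() else ""
    return missing_keys(parse_env_text(text))


if __name__ == "__main__":
    missing = check_env_file()
    print("Missing keys:", ", ".join(missing) if missing else "none")
```

Note that this only detects absent or empty values. A placeholder left over from .env.example would still count as present.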

3. Set up your Python environment

Choose one of these methods:

Option A: Using venv & pip:

python3 -m venv .venv
source .venv/bin/activate  # On Windows use `.venv\Scripts\activate.bat`
pip install -r requirements.txt

Option B: Using uv:

uv sync
source .venv/bin/activate

Option C: Use any other tool you prefer
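
Whichever option you choose, you can sanity-check the result from Python. The sketch below checks the Python 3.10+ requirement and whether a venv-style environment is active. The function names are hypothetical, and the environment check compares `sys.prefix` with `sys.base_prefix`, so it recognises environments created by venv or uv but not, for example, conda environments.

```python
import sys

# Minimum version from the Requirements section.
MIN_VERSION = (3, 10)


def version_ok(version_info=sys.version_info, minimum=MIN_VERSION) -> bool:
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= minimum


def in_virtual_env(prefix=sys.prefix, base_prefix=sys.base_prefix) -> bool:
    """Return True if a venv-style environment is active.

    Inside a venv, sys.prefix points at the environment while
    sys.base_prefix still points at the base installation.
    """
    return prefix != base_prefix


if __name__ == "__main__":
    print("Python version:", sys.version.split()[0])
    print("Meets 3.10+ requirement:", version_ok())
    print("Virtual environment active:", in_virtual_env())
```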

Workshop notebooks

The workshop is organized as a series of numbered Jupyter notebooks.

1_basics_of_working_with_pdfs.ipynb
2_basic_rag.ipynb
3_pdfs_with_images.ipynb
4_pdfs_simplified.ipynb

Tip

There are also completed versions of each notebook, with a -complete suffix (for example, 2_basic_rag-complete.ipynb). If you get stuck, you can refer to these.

Running the notebooks

You can run the notebooks using Jupyter/JupyterLab or VSCode.

Using JupyterLab:
- Activate the Python environment where you installed the dependencies
- Start JupyterLab (with hidden files visible):
```
jupyter lab --ContentsManager.allow_hidden=True
```
- Open the notebook file you want to run

Using VSCode:
- Open the repository folder in VSCode
- Open the notebook file you want to run
- Make sure to select the correct Python interpreter (the one where you installed the dependencies)
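
If you are unsure whether the notebook is using the right interpreter, running a cell like the one below shows which Python executable is active and whether the expected packages can be found. This is a sketch: the module names in `EXPECTED_MODULES` are an assumption about what requirements.txt installs, so adjust the list to match your setup. The helper name `missing_modules` is hypothetical.

```python
import sys
from importlib.util import find_spec

# Assumed top-level import names; check requirements.txt for the real list.
EXPECTED_MODULES = ["weaviate", "anthropic", "cohere"]


def missing_modules(module_names):
    """Return the top-level modules that cannot be found by this interpreter."""
    return [name for name in module_names if find_spec(name) is None]


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    missing = missing_modules(EXPECTED_MODULES)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

If the interpreter path does not point inside your environment (for example .venv), or modules are reported missing, switch the kernel or interpreter and run the cell again.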

Helper scripts

The repository also includes helper scripts for preprocessing the PDFs:

preprocess_pdf_to_img.py: Convert PDF pages to images
preprocess_pdf_to_md.py: Convert PDF to markdown text
preprocess_img_to_embeddings_cohere.py: Pre-generate image embeddings and object data with Cohere

🎉 Next Steps

Congratulations on completing the workshop! Ready to take your RAG skills to the next level?

📚 Continue Learning

Free Resources:

🎓 Weaviate Academy - Free courses on vector databases, RAG, and AI applications
📖 Documentation - Complete guides, tutorials, and API references
🎙️ Workshops - Live workshops and events

Upcoming Workshops:

Oct 2: Intro to building AI-native applications with Weaviate
Oct 7: Building Intelligent Chatbots with Pydantic AI and Weaviate
More workshops listed on our events page

🌟 Join the Community

Connect with other developers and get help:

💬 Slack Community - Meet & share projects
💻 Forum - Ask questions & get support
📧 Newsletter - Latest AI & vector database news

Thank you for building with Weaviate! Happy vector searching! ✨
