PolyRAG is a modular agentic RAG framework optimized for small language models (SLMs) with small context windows.
Agents and tools are designed to pipe outputs directly, auto-correct imperfect inputs, and minimize the main agent's context load. Every feature is built for small, slow, or local LLMs.
Modular: Bring Your Own Data, Lexicon, and LLM
This repo uses MedRxiv publications as a demo, but you can connect PolyRAG to any database, lexicon, or document set: just adapt the system prompt and DB connection. Any LLM backend is supported. Indexing scripts are in `scripts/`. See "Customization" below.
Suggested actions when starting a new conversation.
Ask complex research questions and get precise, sourced answers: PolyRAG consults the lexicon and finds relevant paragraphs in documents.
View PDFs with automatically highlighted, contextually relevant blocks.
Generate publication trend graphs and other visualizations from natural language requests, with results piped through the agent-tool chain.
See how PolyRAG chains SQL, RAG, and PDF tools to answer technical questions, step by step.
- **Semi-structured Extraction**:
  - Uses NLM Ingestor and Tika with data type detection and tree structure.
  - Localization and regex rules for extracting structured parts.
  - Produces a structure with type, parent, child, and position.
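To make the extracted structure concrete, here is a minimal sketch of what a block tree with type, parent, child, and position might look like. The class and field names (`Block`, `block_type`, `add_child`) are illustrative assumptions, not PolyRAG's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an extracted block; field names are
# illustrative, not PolyRAG's real data model.
@dataclass
class Block:
    block_type: str            # e.g. "header", "para", "list_item", "table"
    text: str
    position: int              # order of the block within the document
    parent: "Block | None" = None
    children: list["Block"] = field(default_factory=list)

    def add_child(self, child: "Block") -> None:
        # Wire up both directions of the tree relationship.
        child.parent = self
        self.children.append(child)

# Build a tiny tree: a section header with one paragraph under it.
root = Block("header", "Methods", position=0)
root.add_child(Block("para", "We analyzed 1,200 preprints.", position=1))
```

A real extractor would populate such a tree from the NLM Ingestor / Tika output rather than by hand.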
- **Indexing**:
  - Uses PostgreSQL TSVector (French) for efficient, scalable full-text search with tokenization and stemming.
  - No embeddings by default: lighter, scalable, and future-ready for on-premise models.
  - Excellent performance for technical queries.
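As a sketch of what TSVector-based retrieval looks like, the helper below builds a parameterized ranked full-text query. The table name (`blocks`) and column name (`content`) are hypothetical, and the language would come from the `LANGUAGE` setting described under configuration:

```python
# Sketch of a ranked PostgreSQL full-text query; "blocks" and "content"
# are hypothetical names, not PolyRAG's actual schema.
def build_fts_query(terms: str, language: str = "french"):
    sql = (
        "SELECT id, ts_rank(to_tsvector(%(lang)s::regconfig, content), q) AS rank\n"
        "FROM blocks, plainto_tsquery(%(lang)s::regconfig, %(terms)s) AS q\n"
        "WHERE to_tsvector(%(lang)s::regconfig, content) @@ q\n"
        "ORDER BY rank DESC LIMIT 10"
    )
    # Parameters are passed separately so the driver can escape them.
    params = {"lang": language, "terms": terms}
    return sql, params

sql, params = build_fts_query("essai clinique randomisé")
```

The returned pair could be handed to a driver such as psycopg via `cursor.execute(sql, params)`; in production you would precompute the `to_tsvector` column and index it with GIN rather than evaluate it per query.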
- Agentic RAG: Modular agents for database and document queries.
- Interactive PDF viewer with contextual highlights.
- Natural language to SQL and graphing, coordinated by agents.
- Conversation history, feedback, and moderation.
- Docker and Python support for easy deployment.
```sh
# Set required environment variables (see .env.example for values to put in .env)

# Install dependencies (uv recommended)
pip install uv
uv sync --frozen
source .venv/bin/activate

# Run the service
python src/run_service.py

# In another terminal
source .venv/bin/activate
streamlit run src/streamlit-app.py
```
To quickly get started with real data as shown in the screenshots, you can populate your PostgreSQL database using the following dump:
Download the demo database dump (Google Drive)
This dump contains metadata and vectorized PDFs from MedRxiv (first half of 2025), enabling you to reproduce the demo experience out of the box.
To restore the dump:
```sh
# Example command (adjust connection details as needed)
pg_restore -d your_database_name -U your_postgres_user -h your_postgres_host -p your_postgres_port /path/to/downloaded/dump_file
```
All configuration is handled via environment variables in your `.env` file. See `.env.example` for a full list. Key options include:

- `OPENAI_API_KEY`, `AZURE_OPENAI_API_KEY`, `DEEPSEEK_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `GROQ_API_KEY`: API keys for supported LLM providers.
- `USE_AWS_BEDROCK`: Enable Amazon Bedrock integration (`true`/`false`).
- `AWS_KB_ID`: Amazon Bedrock Knowledge Base ID.
- `DATABASE_URL`: PostgreSQL connection string.
- `SCHEMA_APP_DATA`: Database schema for application data (default: `document_data`).
- `LANGUAGE`: Language for text search queries (default: `english`).
- `NLM_INGESTOR_API`: URL for the NLM Ingestor service.
- `UPLOADED_PDF_PARSER`: Parser for uploaded PDFs (`pypdf`, `nlm-ingestor`, etc.).
- `DISPLAY_TEXTS_JSON_PATH`: Path to display texts JSON.
- `SYSTEM_PROMPT_PATH`: Path to the system prompt file.
- `NO_AUTH`: Set to `True` to disable authentication (not recommended for production).

Copy `.env.example` to `.env` and fill in the required values for your setup.
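As an illustration of how these variables might be consumed at startup, here is a minimal loader sketch. PolyRAG's actual configuration code may differ; only the variable names and the defaults (`document_data`, `english`) come from the list above:

```python
import os

# Sketch of reading the documented environment variables; defaults
# mirror the ones listed above. Not PolyRAG's actual loader.
def load_config() -> dict:
    return {
        "database_url": os.environ["DATABASE_URL"],  # required, no default
        "schema_app_data": os.getenv("SCHEMA_APP_DATA", "document_data"),
        "language": os.getenv("LANGUAGE", "english"),
        "use_aws_bedrock": os.getenv("USE_AWS_BEDROCK", "false").lower() == "true",
        "no_auth": os.getenv("NO_AUTH", "False").lower() == "true",
    }

# Demo: supply a placeholder connection string if none is set.
os.environ.setdefault("DATABASE_URL", "postgresql://user:pass@localhost:5432/db")
cfg = load_config()
```

Parsing booleans explicitly (rather than `bool(os.getenv(...))`) matters here: the string `"false"` is truthy in Python.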
PolyRAG is designed to be easily adapted to your own use case—across any domain, database, or document set. The MedRxiv setup provided here is just a showcase.
To use PolyRAG with your own data:
- **Database Connection**:
  - Edit the `DATABASE_URL` in your `.env` file to point to your own PostgreSQL (or compatible) database.
  - Adjust schema/table names as needed in your configuration.
- **System Prompt**:
  - Adapt the system prompt file (see the `SYSTEM_PROMPT_PATH` variable in your `.env`) to fit your domain, lexicon, and user instructions.
- **Indexing Your Data**:
  - Use the scripts in the `scripts/` directory to index your own documents:
    - `index-folder-script.py`: Index documents from a local folder.
    - `index-urls-script.py`: Index documents from a list of URLs.
    - `scrape_arxiv.py`, `scrape_medrxiv.py`: Example scrapers for scientific sources.
  - You can create your own scripts following these templates for other data sources.
- **LLM Backend**:
  - PolyRAG is backend-agnostic. Set the appropriate API key(s) in your `.env` to use OpenAI, Mistral, DeepSeek, Anthropic, Google, Groq, or your own local LLM.
- **Display Texts & Instructions**:
  - Customize user-facing texts and instructions by editing the files referenced in your `.env` (e.g., `DISPLAY_TEXTS_JSON_PATH`, `instructions.md`).
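A display-texts file of this kind can be loaded with a few lines of standard-library code. The sketch below uses a temporary file and made-up keys (`welcome`, `suggestions`); the actual JSON schema expected by PolyRAG may differ:

```python
import json
import os
import tempfile

# Sketch of loading the file pointed to by DISPLAY_TEXTS_JSON_PATH;
# the keys used here are illustrative, not PolyRAG's actual schema.
def load_display_texts(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Demo with a temporary file standing in for the real JSON.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    json.dump({"welcome": "Hello!", "suggestions": ["Trends in 2025?"]}, f)
    path = f.name

texts = load_display_texts(path)
os.remove(path)
```

Keeping user-facing strings in a JSON file like this is what makes the UI adaptable to a new domain without touching the application code.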
For more advanced customization, see the code in the `src/` directory and adapt agents, tools, or workflows as needed.