PolyRAG is a modular agentic RAG framework optimized for small language models (SLMs) with small context windows.
Agents and tools are designed to pipe outputs directly, auto-correct imperfect inputs, and minimize the main agent's context load. Every feature is built for small, slow, or local LLMs.
Modular: Bring Your Own Data, Lexicon, and LLM
This repo uses MedRxiv publications as a demo, but you can connect PolyRAG to any database, lexicon, or document set: just adapt the system prompt and DB connection. Any LLM backend is supported. Indexing scripts are in `scripts/`. See "Customization" below.
Suggested actions when starting a new conversation.
Ask complex research questions and get precise, sourced answers: PolyRAG consults the lexicon and finds relevant paragraphs in documents.
View PDFs with automatically highlighted, contextually relevant blocks.
Generate publication trend graphs and other visualizations from natural language requests, with results piped through the agent-tool chain.
See how PolyRAG chains SQL, RAG, and PDF tools to answer technical questions, step by step.
- **Semi-structured Extraction**:
  - Uses NLM Ingestor and Tika with data type detection and tree structure.
  - Localization and regex rules for extracting structured parts.
  - Produces a structure with type, parent, child, and position.
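To make the extracted structure concrete, here is a minimal sketch of what a block tree with type, parent, child, and position might look like. The class and field names (`Block`, `block_type`, `add_child`) are illustrative assumptions, not PolyRAG's actual schema:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an extracted block; field names are
# illustrative, not PolyRAG's real data model.
@dataclass
class Block:
    block_type: str            # e.g. "header", "para", "list_item", "table"
    text: str
    position: int              # order of the block within the document
    parent: "Block | None" = None
    children: list["Block"] = field(default_factory=list)

    def add_child(self, child: "Block") -> None:
        # Wire up both directions of the tree relationship.
        child.parent = self
        self.children.append(child)

# Build a tiny tree: a section header with one paragraph under it.
root = Block("header", "Methods", position=0)
root.add_child(Block("para", "We analyzed 1,200 preprints.", position=1))
```

A real extractor would populate such a tree from the NLM Ingestor / Tika output rather than by hand.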
- **Indexing**:
  - Uses PostgreSQL TSVector (French) for efficient, scalable full-text search with tokenization and stemming.
  - No embeddings by default: lighter, scalable, and future-ready for on-premise models.
  - Excellent performance for technical queries.
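As a sketch of what TSVector-based retrieval looks like, the helper below builds a parameterized ranked full-text query. The table name (`blocks`) and column name (`content`) are hypothetical, and the language would come from the `LANGUAGE` setting described under configuration:

```python
# Sketch of a ranked PostgreSQL full-text query; "blocks" and "content"
# are hypothetical names, not PolyRAG's actual schema.
def build_fts_query(terms: str, language: str = "french"):
    sql = (
        "SELECT id, ts_rank(to_tsvector(%(lang)s::regconfig, content), q) AS rank\n"
        "FROM blocks, plainto_tsquery(%(lang)s::regconfig, %(terms)s) AS q\n"
        "WHERE to_tsvector(%(lang)s::regconfig, content) @@ q\n"
        "ORDER BY rank DESC LIMIT 10"
    )
    # Parameters are passed separately so the driver can escape them.
    params = {"lang": language, "terms": terms}
    return sql, params

sql, params = build_fts_query("essai clinique randomisé")
```

The returned pair could be handed to a driver such as psycopg via `cursor.execute(sql, params)`; in production you would precompute the `to_tsvector` column and index it with GIN rather than evaluate it per query.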
- Agentic RAG: Modular agents for database and document queries.
- Interactive PDF viewer with contextual highlights.
- Natural language to SQL and graphing, coordinated by agents.
- Conversation history, feedback, and moderation.
- Docker and Python support for easy deployment.
```sh
# Set required environment variables (see .env.example for values to put in .env)

# Install dependencies (uv recommended)
pip install uv
uv sync --frozen
source .venv/bin/activate

# Run the service
python src/run_service.py

# In another terminal
source .venv/bin/activate
streamlit run src/streamlit-app.py
```
To quickly get started with real data as shown in the screenshots, you can populate your PostgreSQL database using the following dump:
Download the demo database dump (Google Drive)
This dump contains metadata and vectorized PDFs from MedRxiv (first half of 2025), enabling you to reproduce the demo experience out of the box.
To restore the dump:
```sh
# Example command (adjust connection details as needed)
pg_restore -d your_database_name -U your_postgres_user -h your_postgres_host -p your_postgres_port /path/to/downloaded/dump_file
```
All configuration is handled via environment variables in your `.env` file. See `.env.example` for a full list. Key options include:

- `OPENAI_API_KEY`, `AZURE_OPENAI_API_KEY`, `DEEPSEEK_API_KEY`, `ANTHROPIC_API_KEY`, `GOOGLE_API_KEY`, `GROQ_API_KEY`: API keys for supported LLM providers.
- `USE_AWS_BEDROCK`: Enable Amazon Bedrock integration (`true`/`false`).
- `AWS_KB_ID`: Amazon Bedrock Knowledge Base ID.
- `DATABASE_URL`: PostgreSQL connection string.
- `SCHEMA_APP_DATA`: Database schema for application data (default: `document_data`).
- `LANGUAGE`: Language for text search queries (default: `english`).
- `NLM_INGESTOR_API`: URL for the NLM Ingestor service.
- `UPLOADED_PDF_PARSER`: Parser for uploaded PDFs (`pypdf`, `nlm-ingestor`, etc.).
- `DISPLAY_TEXTS_JSON_PATH`: Path to display texts JSON.
- `SYSTEM_PROMPT_PATH`: Path to the system prompt file.
- `NO_AUTH`: Set to `True` to disable authentication (not recommended for production).

Copy `.env.example` to `.env` and fill in the required values for your setup.
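As an illustration of how these variables might be consumed at startup, here is a minimal loader sketch. PolyRAG's actual configuration code may differ; only the variable names and the defaults (`document_data`, `english`) come from the list above:

```python
import os

# Sketch of reading the documented environment variables; defaults
# mirror the ones listed above. Not PolyRAG's actual loader.
def load_config() -> dict:
    return {
        "database_url": os.environ["DATABASE_URL"],  # required, no default
        "schema_app_data": os.getenv("SCHEMA_APP_DATA", "document_data"),
        "language": os.getenv("LANGUAGE", "english"),
        "use_aws_bedrock": os.getenv("USE_AWS_BEDROCK", "false").lower() == "true",
        "no_auth": os.getenv("NO_AUTH", "False").lower() == "true",
    }

# Demo: supply a placeholder connection string if none is set.
os.environ.setdefault("DATABASE_URL", "postgresql://user:pass@localhost:5432/db")
cfg = load_config()
```

Parsing booleans explicitly (rather than `bool(os.getenv(...))`) matters here: the string `"false"` is truthy in Python.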
PolyRAG is designed to be easily adapted to your own use case—across any domain, database, or document set. The MedRxiv setup provided here is just a showcase.
To use PolyRAG with your own data:
- **Database Connection**:
  - Edit the `DATABASE_URL` in your `.env` file to point to your own PostgreSQL (or compatible) database.
  - Adjust schema/table names as needed in your configuration.
- **System Prompt**:
  - Adapt the system prompt file (see the `SYSTEM_PROMPT_PATH` variable in your `.env`) to fit your domain, lexicon, and user instructions.
- **Indexing Your Data**:
  - Use the scripts in the `scripts/` directory to index your own documents:
    - `index-folder-script.py`: Index documents from a local folder.
    - `index-urls-script.py`: Index documents from a list of URLs.
    - `scrape_arxiv.py`, `scrape_medrxiv.py`: Example scrapers for scientific sources.
  - You can create your own scripts following these templates for other data sources.
- **LLM Backend**:
  - PolyRAG is backend-agnostic. Set the appropriate API key(s) in your `.env` to use OpenAI, Mistral, DeepSeek, Anthropic, Google, Groq, or your own local LLM.
- **Display Texts & Instructions**:
  - Customize user-facing texts and instructions by editing the files referenced in your `.env` (e.g., `DISPLAY_TEXTS_JSON_PATH`, `instructions.md`).
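A display-texts file of this kind can be loaded with a few lines of standard-library code. The sketch below uses a temporary file and made-up keys (`welcome`, `suggestions`); the actual JSON schema expected by PolyRAG may differ:

```python
import json
import os
import tempfile

# Sketch of loading the file pointed to by DISPLAY_TEXTS_JSON_PATH;
# the keys used here are illustrative, not PolyRAG's actual schema.
def load_display_texts(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Demo with a temporary file standing in for the real JSON.
with tempfile.NamedTemporaryFile(
    "w", suffix=".json", delete=False, encoding="utf-8"
) as f:
    json.dump({"welcome": "Hello!", "suggestions": ["Trends in 2025?"]}, f)
    path = f.name

texts = load_display_texts(path)
os.remove(path)
```

Keeping user-facing strings in a JSON file like this is what makes the UI adaptable to a new domain without touching the application code.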
For more advanced customization, see the code in the `src/` directory and adapt agents, tools, or workflows as needed.