Ragazza

The name "Ragazza" was chosen as it's memorable, simple, and cleverly contains "RAG" (Retrieval-Augmented Generation) within its spelling.

A tool to convert PDF slides into markdown format with AI-powered content analysis, suitable for loading into LLM Models.

Features

Extracts text content from PDF slides
Generates visual descriptions using Claude Models (or any other AWS Bedrock availables)
Provides educational purpose analysis for each slide
Supports error handling and retry mechanisms
Progress tracking with tqdm
Comprehensive logging

Installation

Using pip

pip install ragazza

From source

Clone the repository:

git clone https://github.com/evereven-tech/ragazza.git
cd ragazza

Use make commands for development:

make help         # Show all available commands
make install      # Install package in production mode
make install-dev  # Install in development mode with dev dependencies
make build        # Build package distribution
make lint         # Check style with flake8
make test         # Run tests
make clean        # Clean up build artifacts

System Dependencies

You'll need to install Poppler for PDF processing:

# Ubuntu/Debian
sudo apt-get install poppler-utils

# macOS
brew install poppler

# Windows: Download and add Poppler to PATH

AWS Configuration

This tool requires access to AWS Bedrock to use Claude AI models for content analysis:

Ensure you have an AWS account with Bedrock access enabled
Request model access for Claude models in the AWS Bedrock console
Configure your AWS credentials:
```
pip install awscli
aws configure
```
Enter your AWS Access Key ID, Secret Access Key, and set default region to 'us-east-1'

Important notes:

Using AWS Bedrock incurs costs based on token usage
AWS Bedrock may not be available in all regions
Your AWS user/role needs permissions for 'bedrock:InvokeModel'
If you don't have AWS Bedrock access, this tool cannot function properly

Usage

Basic usage:

ragazza input.pdf output.md

Advanced options:

ragazza --model "anthropic.claude-3-5-sonnet-20241022-v2:0" --max-tokens 1000 input.pdf output.md

Output

The script generates:

A markdown file with structured content for each slide
Temporary images in ./tmp directory (automatically cleaned up)
A log file (ragazza.log) with processing details

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
src/ragazza		src/ragazza
tests		tests
.gitignore		.gitignore
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
requirements.dev.txt		requirements.dev.txt
requirements.txt		requirements.txt
setup.py		setup.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Ragazza

Features

Installation

Using pip

From source

System Dependencies

AWS Configuration

Usage

Output

About

Uh oh!

Releases

Packages

Languages

License

evereven-tech/ragazza

Folders and files

Latest commit

History

Repository files navigation

Ragazza

Features

Installation

Using pip

From source

System Dependencies

AWS Configuration

Usage

Output

About

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages