The name "Ragazza" was chosen as it's memorable, simple, and cleverly contains "RAG" (Retrieval-Augmented Generation) within its spelling.
A tool to convert PDF slides into markdown format with AI-powered content analysis, suitable for loading into LLM Models.
- Extracts text content from PDF slides
- Generates visual descriptions using Claude Models (or any other AWS Bedrock availables)
- Provides educational purpose analysis for each slide
- Supports error handling and retry mechanisms
- Progress tracking with tqdm
- Comprehensive logging
pip install ragazza
-
Clone the repository:
git clone https://github.com/evereven-tech/ragazza.git cd ragazza
-
Use make commands for development:
make help # Show all available commands make install # Install package in production mode make install-dev # Install in development mode with dev dependencies make build # Build package distribution make lint # Check style with flake8 make test # Run tests make clean # Clean up build artifacts
You'll need to install Poppler for PDF processing:
# Ubuntu/Debian
sudo apt-get install poppler-utils
# macOS
brew install poppler
# Windows: Download and add Poppler to PATH
This tool requires access to AWS Bedrock to use Claude AI models for content analysis:
- Ensure you have an AWS account with Bedrock access enabled
- Request model access for Claude models in the AWS Bedrock console
- Configure your AWS credentials:
pip install awscli aws configure
- Enter your AWS Access Key ID, Secret Access Key, and set default region to 'us-east-1'
Important notes:
- Using AWS Bedrock incurs costs based on token usage
- AWS Bedrock may not be available in all regions
- Your AWS user/role needs permissions for 'bedrock:InvokeModel'
- If you don't have AWS Bedrock access, this tool cannot function properly
Basic usage:
ragazza input.pdf output.md
Advanced options:
ragazza --model "anthropic.claude-3-5-sonnet-20241022-v2:0" --max-tokens 1000 input.pdf output.md
The script generates:
- A markdown file with structured content for each slide
- Temporary images in ./tmp directory (automatically cleaned up)
- A log file (ragazza.log) with processing details