This application anonymizes large PDF, Markdown, or plain text files using LLMs.
- High-Quality Anonymization: Leverages LLMs to identify and replace Personally Identifiable Information (PII) with high accuracy.
- Large File Support: Consistently anonymizes large files (tested up to 1GB).
- Multi-Provider & Cost-Effective: Free to use with local Ollama models. It also supports major providers like OpenAI, Anthropic, Google, Hugging Face, and OpenRouter.
- Reversible: Supports deanonymization to recover original data when needed.
- Multi-Format: Works with PDF, Markdown, and plain text files.
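To illustrate what "reversible" means in practice, here is a minimal sketch of placeholder-based anonymization with a saved mapping. It is not the project's implementation: it swaps the LLM for a simple email regex, and the names `anonymize`, `deanonymize`, and the `[EMAIL_n]` placeholder format are hypothetical, chosen only for this example.

```python
import re

# Hypothetical stand-in for LLM-based PII detection: matches email addresses only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def anonymize(text: str) -> tuple[str, dict[str, str]]:
    """Replace each distinct email with a stable placeholder.

    Returns the anonymized text and a placeholder -> original mapping.
    """
    mapping: dict[str, str] = {}  # placeholder -> original value
    seen: dict[str, str] = {}     # original value -> placeholder

    def substitute(match: re.Match) -> str:
        original = match.group(0)
        if original not in seen:
            placeholder = f"[EMAIL_{len(seen) + 1}]"
            seen[original] = placeholder
            mapping[placeholder] = original
        return seen[original]

    return EMAIL_RE.sub(substitute, text), mapping


def deanonymize(text: str, mapping: dict[str, str]) -> str:
    """Restore the original values using the saved mapping."""
    for placeholder, original in mapping.items():
        text = text.replace(placeholder, original)
    return text
```

Reusing the same placeholder for repeated values keeps references consistent across a long document, and storing the mapping separately is what makes the round trip possible.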
This project is a monorepo containing two main packages:
- packages/pdf-anonymizer-core: The core library containing the anonymization and deanonymization logic. See the core README for more details.
- packages/pdf-anonymizer-cli: A command-line interface for using the anonymizer. See the CLI README for detailed usage instructions.
- Install uv: This project uses uv for package management. Follow the official installation instructions.
- Clone the repository:
    git clone <repository_url>
    cd anonymizer
- Install dependencies:
    uv sync --group dev
- Install Ollama (optional): If you want to use a local model for anonymization, install Ollama.
- Set up environment variables: Create a .env file in the packages/pdf-anonymizer-cli directory and add the necessary API keys for the providers you want to use. For example:
    # For Google models
    GOOGLE_API_KEY="YOUR_GOOGLE_API_KEY"
    # For OpenAI models
    OPENAI_API_KEY="YOUR_OPENAI_API_KEY"
    # For Anthropic models
    ANTHROPIC_API_KEY="YOUR_ANTHROPIC_API_KEY"
    # For Hugging Face models
    HUGGING_FACE_TOKEN="YOUR_HF_TOKEN"
    # For OpenRouter models
    OPENROUTER_API_KEY="YOUR_OPENROUTER_KEY"
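Before running the tool, it can help to confirm which provider keys the .env file actually defines. The following is a small standard-library sketch, not part of the project: `parse_env` and `configured_providers` are hypothetical helpers, and the parser handles only simple `NAME="value"` lines like those in the example above.

```python
from pathlib import Path

# Variable names come from the example above; the provider labels are for display only.
PROVIDER_KEYS = {
    "Google": "GOOGLE_API_KEY",
    "OpenAI": "OPENAI_API_KEY",
    "Anthropic": "ANTHROPIC_API_KEY",
    "Hugging Face": "HUGGING_FACE_TOKEN",
    "OpenRouter": "OPENROUTER_API_KEY",
}


def parse_env(text: str) -> dict[str, str]:
    """Parse simple NAME=value lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[name.strip()] = value.strip().strip("\"'")
    return values


def configured_providers(env: dict[str, str]) -> list[str]:
    """Return the providers whose key is present and non-empty."""
    return [name for name, key in PROVIDER_KEYS.items() if env.get(key)]


if __name__ == "__main__":
    env_path = Path("packages/pdf-anonymizer-cli/.env")
    if env_path.exists():
        print(configured_providers(parse_env(env_path.read_text())))
    else:
        print(f"{env_path} not found")
```

Local Ollama models are described as free to use, so an empty result here does not necessarily mean the tool is unusable.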
To anonymize a file, use the pdf-anonymizer command:
pdf-anonymizer run document.pdf
For detailed command-line options and examples, please refer to the CLI README.
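To process several files in one go, the command can be wrapped in a short script. This is a sketch under a few assumptions: the pdf-anonymizer command is available on your PATH, the .md and .txt extensions stand for the Markdown and plain text formats, and only the `run <file>` form shown above is used. The helper names are hypothetical.

```python
import subprocess
from pathlib import Path

# Assumed extensions for the supported formats (PDF, Markdown, plain text).
SUPPORTED_SUFFIXES = {".pdf", ".md", ".txt"}


def build_command(path: Path) -> list[str]:
    """Build the CLI invocation for one file, using only the documented `run` form."""
    return ["pdf-anonymizer", "run", str(path)]


def find_documents(folder: Path) -> list[Path]:
    """List supported files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
    )


def anonymize_folder(folder: Path) -> None:
    """Run the anonymizer on each supported file, stopping at the first failure."""
    for document in find_documents(folder):
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(build_command(document), check=True)
```

Calling `anonymize_folder(Path("documents"))` would then process each supported file in a documents folder one at a time. Passing the command as a list avoids shell quoting problems with file names that contain spaces.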
To run the test suite:
uv run pytest