PDF Table Extraction Prompt Tester - A tool for testing different prompts to extract tables from PDFs using GPT-4o and measuring their repeatability

Displayr/pdftocsv

PDF Table Extractor - MVP

A simple tool to test different prompts for extracting tables from PDFs using GPT-4o and measure their repeatability.

Setup

  1. Install dependencies:

    pip install -r requirements.txt

  2. Set your OpenAI API key (choose one method):

    Option A: Create config.py file

    # Copy the template and edit it
    cp config.py.template config.py
    # Then edit config.py and replace "your-openai-api-key-here" with your actual key

    Option B: Environment variable

    export OPENAI_API_KEY="your-api-key-here"
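
The two options above, together with the --api-key flag, suggest a lookup order for the key. The following is a minimal sketch of one way to resolve it, not the tool's actual code: the function name resolve_api_key, the attribute name OPENAI_API_KEY inside config.py, and the precedence (flag, then config.py, then environment) are all assumptions made for illustration.

```python
import importlib
import os


def resolve_api_key(cli_key=None, config_module="config", attr="OPENAI_API_KEY"):
    """Return the first key found: CLI flag, then config module, then environment.

    The precedence and the attribute name are assumptions, not taken from the tool.
    """
    if cli_key:
        return cli_key
    try:
        config = importlib.import_module(config_module)
    except ImportError:
        config = None
    key = getattr(config, attr, None) if config else None
    # Ignore the placeholder left over from config.py.template.
    if key and key != "your-openai-api-key-here":
        return key
    return os.environ.get("OPENAI_API_KEY")
```

Skipping the template placeholder means a copied but unedited config.py falls through to the environment variable rather than sending an invalid key.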

Usage

Basic Usage

python pdf_table_extractor.py "pdfs/sample.pdf" "Extract all tables from this PDF and convert them to CSV format"

With Custom Parameters

python pdf_table_extractor.py "pdfs/sample.pdf" "Extract all tables from this PDF and convert them to CSV format" --runs 10 --output results.json

With Simultaneous Testing (Faster)

python pdf_table_extractor.py "pdfs/sample.pdf" "Extract all tables from this PDF and convert them to CSV format" --runs 10 --async-mode
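
Running the tests simultaneously is faster because each run spends most of its time waiting on the API. A rough sketch of that idea with asyncio follows; run_once is a hypothetical stand-in that sleeps instead of calling GPT-4o, and the real script may organize its concurrency differently.

```python
import asyncio


async def run_once(run_id: int) -> str:
    # Hypothetical stand-in for a single extraction request.
    await asyncio.sleep(0.01)
    return "name,value\nA,1"


async def run_all(runs: int) -> list:
    # gather() schedules every run at once and returns results in submission order.
    return await asyncio.gather(*(run_once(i) for i in range(runs)))


results = asyncio.run(run_all(10))
print(f"Collected {len(results)} results")
```

With sequential runs the total time is roughly the sum of the individual waits; with gather it is closer to the longest single wait, subject to API rate limits.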

Parameters

  • pdf_path: Path to the PDF file to process
  • prompt: The prompt to test for table extraction
  • --runs: Number of test runs (default: 5)
  • --output: Output file for detailed results (JSON format)
  • --api-key: OpenAI API key (optional if set via config.py or the OPENAI_API_KEY environment variable)
  • --async-mode: Run tests simultaneously for faster execution
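
For reference, the parameters above map naturally onto an argparse definition. This is a sketch based only on the list above, assuming the script uses argparse; the help strings and the build_parser name are illustrative.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Test a table-extraction prompt for repeatability"
    )
    parser.add_argument("pdf_path", help="Path to the PDF file to process")
    parser.add_argument("prompt", help="The prompt to test for table extraction")
    parser.add_argument("--runs", type=int, default=5, help="Number of test runs")
    parser.add_argument("--output", help="Output file for detailed results (JSON)")
    parser.add_argument("--api-key", help="OpenAI API key")
    parser.add_argument("--async-mode", action="store_true",
                        help="Run tests simultaneously")
    return parser


# Parse an explicit argument list instead of sys.argv so the example is self-contained.
args = build_parser().parse_args(
    ["pdfs/sample.pdf", "Extract all tables", "--runs", "10", "--async-mode"]
)
print(args.runs, args.async_mode)
```

argparse converts the dashes in --api-key and --async-mode to underscores, so the values are read as args.api_key and args.async_mode.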

Example Prompts to Test

  1. Simple extraction:

    "Extract all tables from this PDF and convert them to CSV format"
    
  2. Detailed extraction:

    "Find all tabular data in this PDF. For each table, extract the data and format it as CSV. Include headers and preserve the structure. If there are multiple tables, separate them clearly."
    
  3. Structured extraction:

    "Analyze this PDF and extract any tables or structured data. Convert each table to CSV format with proper headers. Maintain the original column order and data types."
    

Output

The tool will show:

  • Total number of test runs
  • Number of unique results
  • Agreement percentage (the share of runs that produced the most common result)
  • The most common result
  • All individual results
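
The metrics above can be computed from the list of raw results with a frequency count. The sketch below is one plausible way to do it, not necessarily how the tool does: the summarize function, the dictionary keys, and the choice to strip surrounding whitespace before comparing are assumptions.

```python
from collections import Counter


def summarize(results):
    """Summarize repeatability for a non-empty list of result strings."""
    if not results:
        raise ValueError("at least one result is required")
    # Assumption: results that differ only in leading/trailing whitespace count as equal.
    counts = Counter(r.strip() for r in results)
    most_common, freq = counts.most_common(1)[0]
    return {
        "total_runs": len(results),
        "unique_results": len(counts),
        "agreement_pct": 100.0 * freq / len(results),
        "most_common": most_common,
    }


summary = summarize(["a,b\n1,2", "a,b\n1,2", "a,b\n1,3", "a,b\n1,2\n"])
print(summary)
```

In this example three of the four runs match after stripping, so the agreement is 75% with two unique results. Exact string comparison is strict: two CSVs that differ only in quoting or column spacing would count as different results unless they are normalized first.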

🌐 Web UI (Recommended)

For the best experience, use the modern web interface:

python start_ui.py

Then open your browser to: http://localhost:5000

Web UI Features:

  • 🎨 Beautiful, modern interface designed specifically for testing C# NonStructuredDataReader prompts
  • 📝 Pre-filled forms with the exact instructions and initial message from your C# code
  • Concurrent testing for faster results
  • 📊 Real-time progress and visual results
  • 📈 Agreement metrics and repeatability analysis
  • 📚 Test history to compare different prompt variations
  • 📱 Responsive design works on desktop and mobile

📁 Files

  • app.py: Flask web application
  • start_ui.py: Easy startup script for the web UI
  • pdf_table_extractor.py: Main script (command line)
  • requirements.txt: Python dependencies
  • test_prompts.py: Example script for testing multiple prompts
  • templates/: Web UI templates
  • static/: Web UI assets (CSS, JS)
  • pdfs/: Directory containing sample PDF files
