nelsonmestevao/pdf_extractor

PdfExtractor

A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.

PdfExtractor calls Python's pdfplumber library through pythonx to extract text from PDFs. It accepts either a file path or the PDF's binary contents, so it works for local files as well as PDFs held in memory, such as HTTP downloads.

Features

  • 🔍 Extract text from single or multiple PDF pages
  • 📐 Area-based extraction using bounding boxes
  • 🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
  • 📊 Get PDF metadata like title, author, creation date
  • 🐍 Leverages Python's powerful pdfplumber library
  • 🚀 Simple and intuitive API
  • ✅ Comprehensive test coverage
  • 📚 Full documentation

Installation

Add pdf_extractor to your list of dependencies in mix.exs:

def deps do
  [
    {:pdf_extractor, "~> 0.5.0"}
  ]
end

Then add it to the supervision tree in your application's start function:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      PdfExtractor
      # ... your other children
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Usage

Extract text from specific regions using bounding boxes {x0, y0, x1, y1}:

areas = %{
  0 => {0, 0, 300, 200},      # Top-left area of page 0
  1 => [
    {200, 300, 600, 500},     # Bottom-right area of page 1
    {0, 0, 200, 250}          # Top-left area of page 1
  ]
}
PdfExtractor.extract_text("path/to/document.pdf", areas)
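
Because an areas map mixes single tuples and lists of tuples, it can help to check it before handing it to the extractor. The following is a minimal sketch of such a check. `AreaCheck` and `validate/1` are hypothetical names, not part of PdfExtractor, and the rule that `x0 < x1` and `y0 < y1` is an assumption drawn from the examples above rather than a documented requirement.

```elixir
defmodule AreaCheck do
  # Hypothetical helper, not part of PdfExtractor.
  # Returns :ok when every bounding box is a 4-tuple {x0, y0, x1, y1}
  # of numbers with x0 < x1 and y0 < y1 (an assumed rule), otherwise
  # {:error, {page, box}} for an offending box.
  def validate(areas) when is_map(areas) do
    areas
    |> Enum.flat_map(fn {page, boxes} ->
      # A page may map to one tuple or to a list of tuples.
      boxes |> List.wrap() |> Enum.map(&{page, &1})
    end)
    |> Enum.find(fn {_page, box} -> not valid_box?(box) end)
    |> case do
      nil -> :ok
      {page, box} -> {:error, {page, box}}
    end
  end

  defp valid_box?({x0, y0, x1, y1})
       when is_number(x0) and is_number(y0) and is_number(x1) and is_number(y1),
       do: x0 < x1 and y0 < y1

  defp valid_box?(_other), do: false
end
```

With the `areas` map shown above, `AreaCheck.validate(areas)` returns `:ok`; a box such as `{300, 0, 100, 200}` on page 2 would yield `{:error, {2, {300, 0, 100, 200}}}`.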

Return Format

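The function returns a map keyed by zero-based page number. Each value is the extracted text: a string for a page with a single area (or none), or a list of strings, one per area, for a page with several areas: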

%{
  0 => "Text from page 0...",
  1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
  2 => "Text from page 2..."
}
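
Since a page's value may be either a string or a list of strings, callers often want one uniform shape. A minimal sketch of one way to get it is shown here. `ResultShape` and `flatten/1` are hypothetical names, not part of PdfExtractor, and the sketch assumes only the return format shown above.

```elixir
defmodule ResultShape do
  # Hypothetical helper, not part of PdfExtractor.
  # Turns %{page => text | [text]} into a list of
  # {page, area_index, text} tuples sorted by page number,
  # so callers only have to handle a single shape.
  def flatten(result) when is_map(result) do
    result
    |> Enum.sort_by(fn {page, _texts} -> page end)
    |> Enum.flat_map(fn {page, texts} ->
      texts
      |> List.wrap()          # "text" becomes ["text"]; lists pass through
      |> Enum.with_index()
      |> Enum.map(fn {text, index} -> {page, index, text} end)
    end)
  end
end
```

For example, `ResultShape.flatten(%{0 => "a", 1 => ["b", "c"]})` returns `[{0, 0, "a"}, {1, 0, "b"}, {1, 1, "c"}]`.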

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Built on top of the excellent pdfplumber Python library
  • Uses pythonx for seamless Python integration
