PdfExtractor

A powerful and easy-to-use Elixir library for extracting text and metadata from PDF files.

PdfExtractor calls Python's pdfplumber library through pythonx to extract text and metadata from PDFs. It works on both file paths and in-memory binaries, so it suits local file processing as well as PDFs fetched over HTTP.

Features

🔍 Extract text from single or multiple PDF pages
📍 Area-based extraction using bounding boxes
🌐 Work with PDF data directly from memory (e.g., HTTP downloads)
📊 Get PDF metadata like title, author, creation date
🐍 Leverages Python's powerful pdfplumber library
🚀 Simple and intuitive API
✅ Comprehensive test coverage
📚 Full documentation

Installation

Add pdf_extractor to your list of dependencies in mix.exs:

def deps do
  [
    {:pdf_extractor, "~> 0.5.0"}
  ]
end

Then add it to the children of your application's supervision tree:

defmodule MyApp.Application do
  use Application

  def start(_type, _args) do
    children = [
      PdfExtractor,
      ...
    ]

    opts = [strategy: :one_for_one, name: MyApp.Supervisor]
    Supervisor.start_link(children, opts)
  end
end

Usage

Extract text from specific regions by passing a map from page number to a bounding box {x0, y0, x1, y1}, or to a list of bounding boxes when a page has several regions:

areas = %{
  0 => {0, 0, 300, 200},      # Top-left area of page 0
  1 => [
    {200, 300, 600, 500},     # Bottom-right area of page 1
    {0, 0, 200, 250}          # Top-left area of page 1
  ]
}
PdfExtractor.extract_text("path/to/document.pdf", areas)
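
Before calling extract_text, it can help to sanity-check the areas map. The sketch below is only an illustration: AreasSketch and valid_areas?/1 are hypothetical names, not part of PdfExtractor, and the rules it enforces (non-negative integer page numbers, x0 < x1 and y0 < y1) are assumptions that are consistent with the examples above rather than documented requirements.

```elixir
defmodule AreasSketch do
  # Hypothetical helper, not part of PdfExtractor's API.
  # Returns true when every page key is a non-negative integer and every
  # value is a well-formed box or a list of well-formed boxes.
  def valid_areas?(areas) when is_map(areas) do
    Enum.all?(areas, fn {page, boxes} ->
      is_integer(page) and page >= 0 and
        boxes |> List.wrap() |> Enum.all?(&valid_box?/1)
    end)
  end

  # Assumption: a box is {x0, y0, x1, y1} with the smaller coordinate first.
  defp valid_box?({x0, y0, x1, y1})
       when is_number(x0) and is_number(y0) and is_number(x1) and is_number(y1) do
    x0 >= 0 and y0 >= 0 and x0 < x1 and y0 < y1
  end

  defp valid_box?(_other), do: false
end

areas = %{
  0 => {0, 0, 300, 200},
  1 => [{200, 300, 600, 500}, {0, 0, 200, 250}]
}

IO.inspect(AreasSketch.valid_areas?(areas))
```

List.wrap/1 lets the same check handle both the single-box and the list-of-boxes form.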

Return Format

extract_text returns a map keyed by page number. Each value is the extracted text as a string, or a list of strings (one per area) when several areas were requested for that page:

%{
  0 => "Text from page 0...",
  1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
  2 => "Text from page 2..."
}
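
Because a value can be either a string or a list of strings, callers usually want to normalize the result before using it. The following is a minimal sketch of one way to do that in plain Elixir; ReturnFormatSketch and flatten_pages/1 are hypothetical names introduced for illustration, and the input is the sample map from above rather than real library output.

```elixir
defmodule ReturnFormatSketch do
  # Hypothetical helper, not part of PdfExtractor's API.
  # Turns the result map into a list of {page, text} tuples sorted by page,
  # joining multi-area pages into one newline-separated string.
  def flatten_pages(result) when is_map(result) do
    result
    |> Enum.sort_by(fn {page, _text} -> page end)
    |> Enum.map(fn {page, text} ->
      {page, text |> List.wrap() |> Enum.join("\n")}
    end)
  end
end

result = %{
  0 => "Text from page 0...",
  1 => ["Text from page 1 (first area)...", "Text from page 1 (second area)..."],
  2 => "Text from page 2..."
}

result
|> ReturnFormatSketch.flatten_pages()
|> Enum.each(fn {page, text} -> IO.puts("Page #{page}:\n#{text}\n") end)
```

Sorting explicitly matters here because map iteration order should not be relied on for page order.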

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on top of the excellent pdfplumber Python library
Uses pythonx for seamless Python integration
