A minimal, research-focused pipeline for extracting tables from PDFs using state-of-the-art DETR-based models (Table Transformer) and simple postprocessing. This project is intended as a quick testbed and reference for table extraction, not as a production-ready solution.
Want to play around with the model? You can use our Google Colab notebook to experiment interactively. The notebook includes a nice explanation and is easy to copy and use in your own Google Drive.
Requirements:
- Docker installed
```bash
# Build and run the API (serves on http://localhost:5000)
docker-compose up --build
```
- PDF files placed in `pdfs/` will be accessible inside the container.
- Extracted images will be saved to `images/`.
Requirements:
- Python 3.10+
- pip
```bash
# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install --upgrade pip
pip install -r requirements.txt

# Download models (optional, will auto-download on first run)
python local_model.py

# Start the API
python src/app/main.py
```

A simple static HTML frontend is provided in `frontend/index.html`. Open it in your browser and connect to the backend at http://localhost:5000.
- PDF Upload: User uploads a PDF via the frontend or API.
- PDF to Images: Each page is converted to an image using PyMuPDF (`src/app/pdf_reader.py`).
- Table Detection & Structure Recognition:
  - TD model (Table Transformer Detection): detects tables and rotated tables in the image.
  - TSR model (Table Structure Recognition): recognizes table structure (rows, columns, headers, spanning cells).
- OCR: Each detected cell is read using EasyOCR. We suggest switching to a more powerful OCR model.
- Postprocessing:
  - Detected bounding boxes are mapped to a grid.
  - OCR results are assembled into a 2D array (list of lists of strings).
- API Output: Returns the extracted table(s) as JSON.

TD model labels:
- `"table"`: standard table
- `"table rotated"`: rotated table
- `"no object"`: no table detected

TSR model classes:
- `0`: `table`
- `1`: `table column`
- `2`: `table row`
- `3`: `table column header`
- `4`: `table projected row header`
- `5`: `table spanning cell`
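As a minimal sketch, the three TD outcomes above can be decided from the model's post-processed detections. The helper below is hypothetical (not code from this repo), and the confidence threshold of 0.7 is an assumption:

```python
# Hypothetical helper illustrating the three detection outcomes listed above.
# Label names follow the Table Transformer detection checkpoint; the 0.7
# threshold is an assumption, not a value taken from this repo.
def classify_detection(detections, threshold=0.7):
    """detections: list of (label, score) pairs from the TD model."""
    kept = [(label, score) for label, score in detections if score >= threshold]
    if not kept:
        return "no object"
    # Report the highest-scoring detection.
    label, _ = max(kept, key=lambda d: d[1])
    return label

print(classify_detection([("table", 0.95)]))          # -> table
print(classify_detection([("table rotated", 0.88)]))  # -> table rotated
print(classify_detection([("table", 0.3)]))           # -> no object
```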
- `POST /process`
  Upload a PDF file (`pdf` field).
  Returns: `{ "data": [ [ ["cell1", "cell2", ...], ... ] ] }`
- `GET /health`
  Health check.
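A sketch of a client for the `/process` endpoint, assuming the `requests` library, the `pdf` form field described above, and a server running on localhost:5000 (the file path is illustrative):

```python
# Hypothetical client for POST /process; `requests` and the example PDF path
# are assumptions, and the response shape mirrors the JSON example above.
import requests

def extract_tables(pdf_path, url="http://localhost:5000/process"):
    """Upload a PDF and return the extracted tables as lists of rows."""
    with open(pdf_path, "rb") as fh:
        resp = requests.post(url, files={"pdf": fh}, timeout=120)
    resp.raise_for_status()
    return resp.json()["data"]

if __name__ == "__main__":
    for table in extract_tables("pdfs/example.pdf"):
        for row in table:
            print(row)
```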
- Bounding boxes for rows, columns, and spanning cells are detected.
- A grid is constructed by intersecting row and column boxes.
- Spanning cells are mapped to all grid positions they cover.
- Each cell is cropped and OCR is applied.
- The result is a rectangular 2D array, padded as needed.
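The grid-construction steps above can be sketched as follows. The box format `(x0, y0, x1, y1)` and the helper names are assumptions for illustration; the real pipeline crops each cell image and runs EasyOCR, whereas the toy example below substitutes a dictionary lookup:

```python
# Minimal sketch of the grid-based postprocessing described above; box layout
# and helper names are assumptions, and `ocr` stands in for an OCR call.
def intersect(row, col):
    """Intersect a row box with a column box to get a cell box."""
    return (max(row[0], col[0]), max(row[1], col[1]),
            min(row[2], col[2]), min(row[3], col[3]))

def build_grid(rows, cols, ocr):
    """rows/cols: boxes sorted top-to-bottom / left-to-right.
    ocr: callable mapping a cell box to its text."""
    return [[ocr(intersect(r, c)) for c in cols] for r in rows]

# Toy example: two rows x two columns on a 100x100 page.
rows = [(0, 0, 100, 50), (0, 50, 100, 100)]
cols = [(0, 0, 50, 100), (50, 0, 100, 100)]
texts = {(0, 0, 50, 50): "a", (50, 0, 100, 50): "b",
         (0, 50, 50, 100): "c", (50, 50, 100, 100): "d"}
print(build_grid(rows, cols, texts.get))  # [['a', 'b'], ['c', 'd']]
```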
Note: For robust postprocessing, thresholding, and extracting meaning from model predictions, we strongly recommend referring to the official Microsoft Table Transformer `postprocess.py`. Their code covers many edge cases and implements much more comprehensive table structure extraction. As this was not the main scope of our project, we suggest using their approach for production or research-grade extraction. They also provide clear instructions for training and fine-tuning the model.
- This repo is a quick testbed, not a production system.
- For complex tables, multi-page tables, or robust extraction, refer to the above projects.
- Postprocessing is intentionally simple and may fail on edge cases.
- We suggest using a stronger OCR model, as EasyOCR fails in many cases.
For examples of complex table extraction, see the RAGFlow project and its documentation.
```
TableExtractorDETR/
├── src/app/            # Main backend code (Flask API, inference, PDF reader)
├── models/             # Downloaded model weights (auto-populated)
├── pdfs/               # Input PDFs
├── images/             # Output images from PDF pages
├── frontend/           # Static HTML frontend
├── requirements.txt    # Python dependencies
├── Dockerfile          # Docker build
├── docker-compose.yml  # Docker Compose config
└── local_model.py      # Script to pre-download models
```