Merged
6 changes: 6 additions & 0 deletions .vscode/settings.json
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
{
  "files.exclude": {
    "**/__pycache__": true,
    "**/*.egg-info": true
  }
}
695 changes: 21 additions & 674 deletions LICENSE

Large diffs are not rendered by default.

4 changes: 2 additions & 2 deletions Makefile
@@ -66,12 +66,12 @@ setup-local:
.PHONY: install-linux
install-linux:
@echo -e "\033[1;34m Installing Linux dependencies...\033[0m"; \
sudo apt update && sudo apt install -y libmagic1 tesseract-ocr poppler-utils pkg-config
sudo apt update && sudo apt install -y libmagic1 poppler-utils pkg-config

.PHONY: install-macos
install-macos:
@echo -e "\033[1;34m Installing macOS dependencies...\033[0m"; \
brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf

.PHONY: install-requirements
install-requirements:
30 changes: 13 additions & 17 deletions README.md
@@ -7,8 +7,8 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
![hero doc extract](ocr-hero.webp)

## Features:
- **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
- **No cloud/external dependencies** - everything you need ships and is configured via `docker-compose`: PyTorch-based OCR (EasyOCR) + Ollama; no data is sent outside your dev/server environment,
- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies, including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/) and [easyOCR](https://github.com/JaidedAI/EasyOCR)
- **PDF/Office to JSON** conversion using Ollama-supported models (e.g. Llama 3.1)
- **LLM-improved OCR results** - Llama is pretty good at fixing spelling and text issues in the OCR output
- **Removing PII** - this tool can be used to remove Personally Identifiable Information from documents - see `examples`
@@ -39,8 +39,6 @@ Before running the example see [getting started](#getting-started)

![Converting Invoice to JSON](./screenshots/example-2.png)

**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.

## Getting started

You might want to run the app directly on your machine for development purposes, or to use, for example, Apple GPUs (which Docker does not support at the moment).
@@ -114,7 +112,7 @@ This command will install all the dependencies - including Redis (via Docker, so

(Mac) - Dependencies
```
brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf
brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf
```

(Mac) - You need to start up the Celery worker
@@ -312,9 +310,11 @@ python client/cli.py llm_pull --model llama3.2-vision
and only then run this specific prompt query:

```bash
python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt
python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en
```

**Note:** The `language` argument tells the OCR strategy which language model weights to load. You can specify multiple languages as a comma-separated list: `en,de,pl`.
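Internally, a comma-separated language string ultimately has to reach `easyocr.Reader`, which expects a list of language codes. A minimal sketch of that parsing (the helper name `parse_languages` is illustrative, not part of the project's API):

```python
def parse_languages(language):
    """Split a comma-separated language string (e.g. 'en,de,pl') into the
    list of language codes EasyOCR expects, trimming stray whitespace."""
    return [code.strip() for code in language.split(",") if code.strip()]

# A reader would then be created roughly as:
#   easyocr.Reader(parse_languages("en,de"))
```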

The `ocr` command can store the results using the `storage_profiles`:
- **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty, the file is not saved
- **storage_filename**: Output filename - a path relative to the `root_path` set in the storage profile (by default, relative to the `/storage` folder); placeholders are available for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` for dates and `{HH}`, `{MM}`, `{SS}` for times
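The placeholder expansion above can be done with plain `str.format` and `datetime`; a sketch under the assumption that the project uses something equivalent (the helper name is illustrative, not the actual implementation):

```python
from datetime import datetime
import os


def format_storage_filename(template, file_name, now=None):
    """Expand placeholders like {file_name}, {file_extension}, {Y}, {mm},
    {dd}, {HH}, {MM}, {SS} in a storage_filename template (sketch)."""
    now = now or datetime.now()
    base, ext = os.path.splitext(file_name)
    return template.format(
        file_name=base,
        file_extension=ext.lstrip("."),
        Y=now.strftime("%Y"), mm=now.strftime("%m"), dd=now.strftime("%d"),
        HH=now.strftime("%H"), MM=now.strftime("%M"), SS=now.strftime("%S"),
    )
```

For example, the template `{Y}/{mm}/{file_name}.md` applied to `invoice.pdf` would yield a date-bucketed path such as `2024/01/invoice.md`.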
@@ -410,37 +410,39 @@ apiClient.uploadFile(formData).then(response => {
- **Method**: POST
- **Parameters**:
- **file**: PDF, image or Office file to be processed.
- **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
- **ocr_cache**: Whether to cache the OCR result (true or false).
- **prompt**: When provided, used for Ollama post-processing of the OCR result
- **model**: When provided along with a prompt, this model is used for the LLM processing
- **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty, the file is not saved
- **storage_filename**: Output filename - a path relative to the `root_path` set in the storage profile (by default, relative to the `/storage` folder); placeholders are available for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` for dates and `{HH}`, `{MM}`, `{SS}` for times
- **language**: One or more language codes (`en` or `en,pl,de`) telling the OCR strategy which language weights to load

Example:

```bash
curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=marker" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=easyocr" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload"
```
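The same multipart upload can be issued from Python. A sketch that only assembles the form fields (the field names mirror the parameters above; posting with the third-party `requests` library is shown in a comment rather than executed):

```python
def build_ocr_upload_fields(strategy="easyocr", ocr_cache=True,
                            prompt="", model="", language="en"):
    """Assemble the form fields for POST /ocr/upload (sketch).
    The file itself travels as a separate multipart part."""
    return {
        "strategy": strategy,
        "ocr_cache": "true" if ocr_cache else "false",
        "prompt": prompt,
        "model": model,
        "language": language,
    }

# With `requests` installed, this could be posted as:
#   requests.post("http://localhost:8000/ocr/upload",
#                 files={"file": open("examples/example-mri.pdf", "rb")},
#                 data=build_ocr_upload_fields())
```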

### OCR Endpoint via JSON request
- **URL**: /ocr/request
- **Method**: POST
- **Parameters** (JSON body):
- **file**: Base64 encoded PDF file content.
- **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
- **ocr_cache**: Whether to cache the OCR result (true or false).
- **prompt**: When provided, used for Ollama post-processing of the OCR result.
- **model**: When provided along with a prompt, this model is used for the LLM processing.
- **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty, the file is not saved.
- **storage_filename**: Output filename - a path relative to the `root_path` set in the storage profile (by default, relative to the `/storage` folder); placeholders are available for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` for dates and `{HH}`, `{MM}`, `{SS}` for times.
- **language**: One or more language codes (`en` or `en,pl,de`) telling the OCR strategy which language weights to load.

Example:

```bash
curl -X POST "http://localhost:8000/ocr/request" -H "Content-Type: application/json" -d '{
"file": "<base64-encoded-file-content>",
"strategy": "marker",
"strategy": "easyocr",
"ocr_cache": true,
"prompt": "",
"model": "llama3.1",
@@ -598,13 +600,7 @@ AWS_S3_BUCKET_NAME=your-bucket-name
```

## License
This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details.

**Important note on [marker](https://github.com/VikParuchuri/marker) license***:

The weights for the models are licensed `cc-by-nc-sa-4.0`, but Marker's author will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to/).


This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.

## Contact
In case of any questions please contact us at: [email protected]
18 changes: 12 additions & 6 deletions client/cli.py
@@ -6,17 +6,19 @@
import math
from ollama import pull

def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
ocr_url = os.getenv('OCR_UPLOAD_URL', 'http://localhost:8000/ocr/upload')
files = {'file': open(file_path, 'rb')}
if not ocr_cache:
print("OCR cache disabled.")

data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile}
data = {'ocr_cache': ocr_cache, 'model': model, 'strategy': strategy, 'storage_profile': storage_profile, 'language': language}

if storage_filename:
data['storage_filename'] = storage_filename

print(data) # @todo change to log debug in the future

try:
if prompt_file:
prompt = open(prompt_file, 'r').read()
@@ -42,7 +44,7 @@ def ocr_upload(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1',
print(f"Failed to upload file: {response.text}")
return None

def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None):
def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1', strategy='llama_vision', storage_profile='default', storage_filename=None, language='en'):
ocr_url = os.getenv('OCR_REQUEST_URL', 'http://localhost:8000/ocr/request')
with open(file_path, 'rb') as f:
file_content = base64.b64encode(f.read()).decode('utf-8')
@@ -52,7 +54,8 @@ def ocr_request(file_path, ocr_cache, prompt, prompt_file=None, model='llama3.1'
'model': model,
'strategy': strategy,
'storage_profile': storage_profile,
'file': file_content
'file': file_content,
'language': language
}

if storage_filename:
@@ -175,6 +178,7 @@ def main():
ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
#ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')

# Sub-command for uploading a file via file upload - @deprecated - it's a backward compatibility gimmick
@@ -189,6 +193,7 @@ def main():
ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
#ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')


@@ -204,6 +209,7 @@ def main():
ocr_request_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
ocr_request_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use. You may use some formatting - see the docs')
ocr_request_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use')
ocr_request_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')

# Sub-command for getting the result
result_parser = subparsers.add_parser('result', help='Get the OCR result by specified task id.')
@@ -239,7 +245,7 @@ def main():

if args.command == 'ocr' or args.command == 'ocr_upload':
print(args)
result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
result = ocr_upload(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
if result is None:
print("Error uploading file.")
return
@@ -251,7 +257,7 @@ def main():
if text_result:
print(text_result)
elif args.command == 'ocr_request':
result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename)
result = ocr_request(args.file, False if args.disable_ocr_cache else args.ocr_cache, args.prompt, args.prompt_file, args.model, args.strategy, args.storage_profile, args.storage_filename, args.language)
if result is None:
print("Error uploading file.")
return
6 changes: 2 additions & 4 deletions config/strategies.yaml
@@ -1,7 +1,5 @@
strategies:
llama_vision:
class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy
marker:
class: text_extract_api.extract.strategies.marker.MarkerStrategy
tesseract:
class: text_extract_api.extract.strategies.tesseract.TesseractStrategy
easyocr:
class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy
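Each strategy entry maps a name to a dotted class path, which a config loader would typically resolve with `importlib` at startup. A minimal sketch of that resolution (demonstrated on a stdlib class, since the project's modules aren't assumed importable here):

```python
import importlib


def load_class(dotted_path):
    """Resolve a dotted 'package.module.ClassName' path to the class
    object, as a strategies.yaml loader might do for each entry."""
    module_path, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)

# e.g. load_class("text_extract_api.extract.strategies.easyocr.EasyOCRStrategy")
```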
2 changes: 0 additions & 2 deletions dev.Dockerfile
@@ -8,8 +8,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
&& apt-get update --fix-missing \
&& apt-get install -y \
libgl1-mesa-glx \
tesseract-ocr \
libtesseract-dev \
poppler-utils \
libmagic1 \
libmagic-dev \
2 changes: 0 additions & 2 deletions dev.gpu.Dockerfile
@@ -43,8 +43,6 @@ RUN apt-get clean && rm -rf /var/lib/apt/lists/* \
&& apt-get update --fix-missing \
&& apt-get install -y \
libgl1-mesa-glx \
tesseract-ocr \
libtesseract-dev \
poppler-utils \
libpoppler-cpp-dev \
&& rm -rf /var/lib/apt/lists/*
4 changes: 1 addition & 3 deletions pyproject.toml
@@ -12,9 +12,9 @@ readme = "README.md"
requires-python = ">=3.8"
dependencies = [
"fastapi",
"easyocr",
"celery",
"redis",
"pytesseract",
"opencv-python-headless",
"pdf2image",
"ollama",
@@ -27,8 +27,6 @@ dependencies = [
"google-auth-httplib2",
"google-auth-oauthlib",
"transformers",
"surya-ocr==0.4.14",
"marker-pdf==0.2.6",
"boto3",
"Pillow",
"python-magic==0.4.27",
3 changes: 0 additions & 3 deletions run.sh
@@ -52,9 +52,6 @@ echo "Starting Redis"
echo "Your ENV settings loaded from .env.localhost file: "
printenv

echo "Downloading models"
python -c 'from marker.models import load_all_models; load_all_models()'

CELERY_BIN="$(pwd)/.venv/bin/celery"
CELERY_PID=$(pgrep -f "$CELERY_BIN")
REDIS_PORT=6379 # will move it to .envs in near future
56 changes: 56 additions & 0 deletions text_extract_api/extract/strategies/easyocr.py
@@ -0,0 +1,56 @@
import io
import numpy as np
from PIL import Image
import easyocr

from text_extract_api.extract.strategies.strategy import Strategy
from text_extract_api.files.file_formats.file_format import FileFormat
from text_extract_api.files.file_formats.image import ImageFileFormat


class EasyOCRStrategy(Strategy):
    @classmethod
    def name(cls) -> str:
        return "easyOCR"

    def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:
        """
        Extract text using EasyOCR after converting the input file to images
        (if not already an ImageFileFormat).
        """

        # Ensure we can actually convert the input file to ImageFileFormat
        if (
            not isinstance(file_format, ImageFileFormat)
            and not file_format.can_convert_to(ImageFileFormat)
        ):
            raise TypeError(
                f"EasyOCR - format {file_format.mime_type} is not supported (yet?)"
            )

        # Convert the input file to a list of ImageFileFormat objects
        images = FileFormat.convert_to(file_format, ImageFileFormat)

        # Initialize the EasyOCR Reader for the requested languages,
        # e.g. language='en,fr' loads English and French weights
        reader = easyocr.Reader(language.split(','))

        # Process each image, extracting text
        all_extracted_text = []
        for image_format in images:
            # Convert the in-memory bytes to a PIL Image
            pil_image = Image.open(io.BytesIO(image_format.binary))

            # Convert PIL image to numpy array for EasyOCR
            np_image = np.array(pil_image)

            # Perform OCR; with `detail=0`, we get just text, no bounding boxes
            ocr_result = reader.readtext(np_image, detail=0)  # TODO: add bounding box support as described in #37

            # Combine all lines into a single string for that image/page
            extracted_text = "\n".join(ocr_result)
            all_extracted_text.append(extracted_text)

        # Join text from all images/pages
        full_text = "\n\n".join(all_extracted_text)
        return full_text
2 changes: 1 addition & 1 deletion text_extract_api/extract/strategies/llama_vision.py
@@ -16,7 +16,7 @@ class LlamaVisionStrategy(Strategy):
def name(cls) -> str:
return "llama_vision"

def extract_text(self, file_format: FileFormat):
def extract_text(self, file_format: FileFormat, language: str = 'en') -> str:

if (
not isinstance(file_format, ImageFileFormat)
24 changes: 0 additions & 24 deletions text_extract_api/extract/strategies/marker.py

This file was deleted.

2 changes: 1 addition & 1 deletion text_extract_api/extract/strategies/strategy.py
@@ -27,7 +27,7 @@ def name(cls) -> str:
raise NotImplementedError("Strategy subclasses must implement name")

@classmethod
def extract_text(cls, file_format: Type["FileFormat"]):
def extract_text(cls, file_format: Type["FileFormat"], language: str = 'en') -> str:
raise NotImplementedError("Strategy subclasses must implement extract_text method")

@classmethod