feat: easyOCR added, tesseract - removed, marker - removed, license changed to MIT #91
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| { | ||
|     "files.exclude": { | ||
|         "**/__pycache__": true, | ||
|         "**/*.egg-info": true | ||
|     } | ||
| } |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -7,8 +7,8 @@ The API is built with FastAPI and uses Celery for asynchronous task processing. | ||
|  | ||
|
|
||
| ## Features: | ||
| - **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment, | ||
| - **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract) | ||
| - **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment, | ||
| - **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR) | ||
| - **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1) | ||
| - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text | ||
| - **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples` | ||
| @@ -39,8 +39,6 @@ Before running the example see [getting started](#getting-started) | ||
|
|
||
|  | ||
|
|
||
| **Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables. | ||
|
|
||
| ## Getting started | ||
|
|
||
| You might want to run the app directly on your machine for development purposes OR to use for example Apple GPUs (which are not supported by Docker at the moment). | ||
| @@ -114,7 +112,7 @@ This command will install all the dependencies - including Redis (via Docker, so | ||
|
|
||
| (MAC) - Dependencies | ||
| ``` | ||
| brew update && brew install libmagic tesseract poppler pkg-config ghostscript ffmpeg automake autoconf | ||
| brew update && brew install libmagic poppler pkg-config ghostscript ffmpeg automake autoconf | ||
| ``` | ||
|
|
||
| (Mac) - You need to startup the celery worker | ||
| @@ -312,9 +310,11 @@ python client/cli.py llm_pull --model llama3.2-vision | ||
| and only after to run this specific prompt query: | ||
|
|
||
| ```bash | ||
| python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt | ||
| python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en | ||
| ``` | ||
|
|
||
| **Note:** The language argument is used for the OCR strategy to load the model weights for the selected language. You can specify multiple languages as a list: `en,de,pl` etc. | ||
|
|
||
| The `ocr` command can store the results using the `storage_profiles`: | ||
| - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved | ||
| - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting | ||
| @@ -410,37 +410,39 @@ apiClient.uploadFile(formData).then(response => { | ||
| - **Method**: POST | ||
| - **Parameters**: | ||
| - **file**: PDF, image or Office file to be processed. | ||
| - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`). | ||
| - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`). | ||
| - **ocr_cache**: Whether to cache the OCR result (true or false). | ||
| - **prompt**: When provided, will be used for Ollama processing the OCR result | ||
| - **model**: When provided along with the prompt - this model will be used for LLM processing | ||
| - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved | ||
| - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting | ||
| - **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights | ||
|
|
||
| Example: | ||
|
|
||
| ```bash | ||
| curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=marker" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload" | ||
| curl -X POST -H "Content-Type: multipart/form-data" -F "file=@examples/example-mri.pdf" -F "strategy=easyocr" -F "ocr_cache=true" -F "prompt=" -F "model=" "http://localhost:8000/ocr/upload" | ||
| ``` | ||
|
|
||
| ### OCR Endpoint via JSON request | ||
| - **URL**: /ocr/request | ||
| - **Method**: POST | ||
| - **Parameters** (JSON body): | ||
| - **file**: Base64 encoded PDF file content. | ||
| - **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`). | ||
| - **strategy**: OCR strategy to use (`llama_vision` or `easyocr`). | ||
| - **ocr_cache**: Whether to cache the OCR result (true or false). | ||
| - **prompt**: When provided, will be used for Ollama processing the OCR result. | ||
| - **model**: When provided along with the prompt - this model will be used for LLM processing. | ||
| - **storage_profile**: Used to save the result - the `default` profile (`/storage_profiles/default.yaml`) is used by default; if empty file is not saved. | ||
| - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting. | ||
| - **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights | ||
|
|
||
| Example: | ||
|
|
||
| ```bash | ||
| curl -X POST "http://localhost:8000/ocr/request" -H "Content-Type: application/json" -d '{ | ||
| "file": "<base64-encoded-file-content>", | ||
| "strategy": "marker", | ||
| "strategy": "easyocr", | ||
| "ocr_cache": true, | ||
| "prompt": "", | ||
| "model": "llama3.1", | ||
| @@ -598,13 +600,7 @@ AWS_S3_BUCKET_NAME=your-bucket-name | ||
| ``` | ||
|
|
||
| ## License | ||
| This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details. | ||
|
|
||
| **Important note on [marker](https://github.com/VikParuchuri/marker) license***: | ||
|
|
||
| The weights for the models are licensed `cc-by-nc-sa-4.0`, but Marker's author will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to/). | ||
|
|
||
|
|
||
| This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details. | ||
|
|
||
| ## Contact | ||
| In case of any questions please contact us at: [email protected] | ||
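The `/ocr/request` parameters listed in the README diff above can be assembled from Python as well as from `curl`. The following is a minimal sketch using only the standard library; the helper name `build_ocr_request` is hypothetical and not part of the project, and the field names are taken from the README text. It only builds the JSON body and does not send it.

```python
import base64
import json


def build_ocr_request(file_bytes: bytes, strategy: str = "easyocr",
                      language: str = "en", ocr_cache: bool = True) -> str:
    """Build a JSON body for POST /ocr/request (hypothetical helper).

    Field names follow the README parameter list; the file content is
    Base64 encoded as the README requires.
    """
    body = {
        "file": base64.b64encode(file_bytes).decode("ascii"),
        "strategy": strategy,
        "ocr_cache": ocr_cache,
        "prompt": "",
        "model": "",
        # One or many language codes, comma separated (e.g. "en,pl,de")
        "language": language,
    }
    return json.dumps(body)


# Placeholder bytes stand in for a real PDF read from disk.
payload = build_ocr_request(b"%PDF-1.4 demo", language="en,pl")
```

The resulting string would be sent with `Content-Type: application/json` to `http://localhost:8000/ocr/request`, as in the README's `curl` example.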
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,7 +1,5 @@ | ||
| strategies: | ||
|   llama_vision: | ||
|     class: text_extract_api.extract.strategies.llama_vision.LlamaVisionStrategy | ||
|   marker: | ||
|     class: text_extract_api.extract.strategies.marker.MarkerStrategy | ||
|   tesseract: | ||
|     class: text_extract_api.extract.strategies.tesseract.TesseractStrategy | ||
|   easyocr: | ||
|     class: text_extract_api.extract.strategies.easyocr.EasyOCRStrategy |
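The config above maps each strategy name to a dotted class path. How the project resolves those paths is not shown in this diff; the sketch below is one common way to do it with `importlib`, under the assumption that each `class` value has the form `package.module.ClassName`. The helper `load_class` and the `demo` entry are hypothetical, and a standard-library class stands in for a strategy so the example runs without the project installed.

```python
import importlib


def load_class(dotted_path: str) -> type:
    """Resolve 'package.module.ClassName' to the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Stand-in for the parsed YAML; "collections.OrderedDict" replaces a real
# strategy class path purely for illustration.
strategies = {"demo": {"class": "collections.OrderedDict"}}

# Build a name -> class registry from the config mapping.
registry = {name: load_class(entry["class"]) for name, entry in strategies.items()}
```

With a registry like this, a new strategy such as `easyocr` needs only a config entry, with no change to the lookup code.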
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,56 @@ | ||
| import io | ||
| import numpy as np | ||
| from PIL import Image | ||
| import easyocr | ||
| | ||
| from text_extract_api.extract.strategies.strategy import Strategy | ||
| from text_extract_api.files.file_formats.file_format import FileFormat | ||
| from text_extract_api.files.file_formats.image import ImageFileFormat | ||
| | ||
| | ||
| class EasyOCRStrategy(Strategy): | ||
|     @classmethod | ||
|     def name(cls) -> str: | ||
|         return "easyOCR" | ||
| | ||
|     def extract_text(self, file_format: FileFormat, language: str = 'en') -> str: | ||
|         """ | ||
|         Extract text using EasyOCR after converting the input file to images | ||
|         (if not already an ImageFileFormat). | ||
|         """ | ||
| | ||
|         # Ensure we can actually convert the input file to ImageFileFormat | ||
|         if ( | ||
|             not isinstance(file_format, ImageFileFormat) | ||
|             and not file_format.can_convert_to(ImageFileFormat) | ||
|         ): | ||
|             raise TypeError( | ||
|                 f"EasyOCR - format {file_format.mime_type} is not supported (yet?)" | ||
|             ) | ||
| | ||
|         # Convert the input file to a list of ImageFileFormat objects | ||
|         images = FileFormat.convert_to(file_format, ImageFileFormat) | ||
| | ||
|         # Initialize the EasyOCR Reader | ||
|         # Add or change languages to your needs, e.g., ['en', 'fr'] | ||
|         reader = easyocr.Reader(language.split(',')) | ||
| | ||
|         # Process each image, extracting text | ||
|         all_extracted_text = [] | ||
|         for image_format in images: | ||
|             # Convert the in-memory bytes to a PIL Image | ||
|             pil_image = Image.open(io.BytesIO(image_format.binary)) | ||
| | ||
|             # Convert PIL image to numpy array for EasyOCR | ||
|             np_image = np.array(pil_image) | ||
| | ||
|             # Perform OCR; with `detail=0`, we get just text, no bounding boxes | ||
|             ocr_result = reader.readtext(np_image, detail=0) # TODO: add bounding boxes support as described in #37 | ||
| | ||
|             # Combine all lines into a single string for that image/page | ||
|             extracted_text = "\n".join(ocr_result) | ||
|             all_extracted_text.append(extracted_text) | ||
| | ||
|         # Join text from all images/pages | ||
|         full_text = "\n\n".join(all_extracted_text) | ||
|         return full_text |
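Two pieces of logic in the strategy above can be checked without installing EasyOCR: how the `language` string becomes a list, and how per-page lines are joined. The sketch below isolates both as plain functions; `parse_languages` and `join_pages` are hypothetical names, not part of the PR. Unlike the bare `language.split(',')` in the diff, this version also trims whitespace and drops empty entries, on the assumption that callers may pass values such as `"en, de"`.

```python
from typing import List


def parse_languages(language: str) -> List[str]:
    """Split a comma-separated code string into a clean list of codes."""
    return [code.strip() for code in language.split(",") if code.strip()]


def join_pages(pages: List[List[str]]) -> str:
    """Join OCR lines with newlines and pages with a blank line,
    mirroring the two join steps in extract_text."""
    return "\n\n".join("\n".join(lines) for lines in pages)


codes = parse_languages("en, de,pl")
text = join_pages([["Patient name", "Date"], ["Findings"]])
```

Here `codes` is the list that would be handed to `easyocr.Reader`, and `text` has the same shape as the string `extract_text` returns for a two-page input.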
This file was deleted.
Good idea, I'll commit similar settings for PyCharm :)