README.md (+13 -17: 13 additions & 17 deletions)
@@ -7,8 +7,8 @@ The API is built with FastAPI and uses Celery for asynchronous task processing.
 ## Features:
-- **No Cloud/external dependencies** all you need: PyTorch based OCR (Marker) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
-- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [marker](https://github.com/VikParuchuri/marker) and [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [surya-ocr](https://github.com/VikParuchuri/surya) or [tessereact](https://github.com/h/pytesseract)
+- **No Cloud/external dependencies** all you need: PyTorch based OCR (EasyOCR) + Ollama are shipped and configured via `docker-compose` no data is sent outside your dev/server environment,
+- **PDF/Office to Markdown** conversion with very high accuracy using different OCR strategies including [llama3.2-vision](https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/), [easyOCR](https://github.com/JaidedAI/EasyOCR)
 - **PDF/Office to JSON** conversion using Ollama supported models (eg. LLama 3.1)
 - **LLM Improving OCR results** LLama is pretty good with fixing spelling and text issues in the OCR text
 - **Removing PII** This tool can be used for removing Personally Identifiable Information out of document - see `examples`
@@ -39,8 +39,6 @@ Before running the example see [getting started](#getting-started)
-**Note:** As you may observe in the example above, `marker-pdf` sometimes mismatches the cols and rows which could have potentially great impact on data accuracy. To improve on it there is a feature request [#3](https://github.com/CatchTheTornado/text-extract-api/issues/3) for adding alternative support for [`tabled`](https://github.com/VikParuchuri/tabled) model - which is optimized for tables.
-
 ## Getting started

 You might want to run the app directly on your machine for development purposes OR to use for example Apple GPUs (which are not supported by Docker at the moment).
@@ -114,7 +112,7 @@ This command will install all the dependencies - including Redis (via Docker, so
 python client/cli.py ocr_upload --file examples/example-mri.pdf --ocr_cache --prompt_file=examples/example-mri-remove-pii.txt --language en
 ```

+**Note:** The language argument is used for the OCR strategy to load the model weights for the selected language. You can specify multiple languages as a list: `en,de,pl` etc.
+
 The `ocr` command can store the results using the `storage_profiles`:
 - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
 - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
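
As a rough illustration of the `storage_filename` placeholders listed above, the sketch below expands a template with Python's `str.format`. The helper name `resolve_storage_filename` is hypothetical and not taken from the project, and it assumes `{file_name}` is the source name without its extension; the actual implementation may differ.

```python
from datetime import datetime


def resolve_storage_filename(template: str, source_name: str, now: datetime) -> str:
    """Expand the documented placeholders in a storage_filename template.

    Hypothetical helper: assumes {file_name} excludes the extension.
    """
    stem, dot, extension = source_name.rpartition(".")
    if not dot:  # the source name has no extension
        stem, extension = source_name, ""
    values = {
        "file_name": stem,
        "file_extension": extension,
        "Y": now.strftime("%Y"),   # four-digit year
        "mm": now.strftime("%m"),  # month
        "dd": now.strftime("%d"),  # day
        "HH": now.strftime("%H"),  # hour
        "MM": now.strftime("%M"),  # minute
        "SS": now.strftime("%S"),  # second
    }
    return template.format(**values)


print(resolve_storage_filename(
    "results/{Y}/{mm}/{file_name}.md",
    "example-mri.pdf",
    datetime(2024, 11, 5, 9, 30, 0),
))
# prints "results/2024/11/example-mri.md"
```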
 - **file**: PDF, image or Office file to be processed.
-- **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
 - **ocr_cache**: Whether to cache the OCR result (true or false).
 - **prompt**: When provided, will be used for Ollama processing the OCR result
 - **model**: When provided along with the prompt - this model will be used for LLM processing
 - **storage_profile**: Used to save the result - the `default` profile (`./storage_profiles/default.yaml`) is used by default; if empty file is not saved
 - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting
+- **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights
-- **strategy**: OCR strategy to use (`marker`, `llama_vision` or `tesseract`).
+- **strategy**: OCR strategy to use (`llama_vision` or `easyocr`).
 - **ocr_cache**: Whether to cache the OCR result (true or false).
 - **prompt**: When provided, will be used for Ollama processing the OCR result.
 - **model**: When provided along with the prompt - this model will be used for LLM processing.
 - **storage_profile**: Used to save the result - the `default` profile (`/storage_profiles/default.yaml`) is used by default; if empty file is not saved.
 - **storage_filename**: Outputting filename - relative path of the `root_path` set in the storage profile - by default a relative path to `/storage` folder; can use placeholders for dynamic formatting: `{file_name}`, `{file_extension}`, `{Y}`, `{mm}`, `{dd}` - for date formatting, `{HH}`, `{MM}`, `{SS}` - for time formatting.
+- **language**: One or many (`en` or `en,pl,de`) language codes for the OCR to load the language weights
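
The comma-separated `language` value described above could be turned into a list of codes roughly as follows. `parse_languages` is a hypothetical helper and not part of this change; it assumes the OCR strategy expects a list of language codes (EasyOCR's `Reader`, for instance, takes a list), and that stray whitespace or empty entries should be ignored.

```python
def parse_languages(value: str) -> list[str]:
    """Split a value such as 'en,pl,de' into a clean list of language codes.

    Hypothetical helper: trims whitespace, lowercases, and drops empty entries.
    """
    codes = [code.strip().lower() for code in value.split(",")]
    return [code for code in codes if code]


print(parse_languages("en"))          # prints ['en']
print(parse_languages("en, PL ,de"))  # prints ['en', 'pl', 'de']
```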
 Example:

 ```bash
 curl -X POST "http://localhost:8000/ocr/request" -H "Content-Type: application/json" -d '{
-This project is licensed under the GNU General Public License. See the [LICENSE](LICENSE) file for details.
-
-**Important note on [marker](https://github.com/VikParuchuri/marker) license***:
-
-The weights for the models are licensed `cc-by-nc-sa-4.0`, but Marker's author will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to/).
-
-
+This project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.
 ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
 ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
 ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
 #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')

 # Sub-command for uploading a file via file upload - @deprecated - it's a backward compatibility gimmick
@@ -189,6 +193,7 @@ def main():
 ocr_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
 ocr_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use for the file')
 ocr_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use for the file. You may use some formatting - see the docs')
+ocr_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')
 #ocr_parser.add_argument('--async_mode', action='store_true', help='Enable async mode for the OCR task')
@@ -204,6 +209,7 @@ def main():
 ocr_request_parser.add_argument('--print_progress', default=True, action='store_true', help='Print the progress of the OCR task')
 ocr_request_parser.add_argument('--storage_profile', type=str, default='default', help='Storage profile to use. You may use some formatting - see the docs')
 ocr_request_parser.add_argument('--storage_filename', type=str, default=None, help='Storage filename to use')
+ocr_request_parser.add_argument('--language', type=str, default='en', help='Language to use for the OCR task')

 # Sub-command for getting the result
 result_parser = subparsers.add_parser('result', help='Get the OCR result by specified task id.')
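
A minimal, self-contained sketch of how the new `--language` option behaves with `argparse`. It rebuilds only two of the arguments shown in the diff on a standalone parser (the real CLI attaches them to subparsers), and `--show_progress` is a hypothetical option added purely for comparison. It also illustrates something visible in the context lines: combining `default=True` with `action='store_true'` means `--print_progress` is `True` whether or not the flag is passed.

```python
import argparse

parser = argparse.ArgumentParser()
# Same definitions as in the diff, attached to a standalone parser for brevity.
parser.add_argument('--language', type=str, default='en',
                    help='Language to use for the OCR task')
parser.add_argument('--print_progress', default=True, action='store_true',
                    help='Print the progress of the OCR task')
# Hypothetical alternative (Python 3.9+): also generates a --no-show_progress flag.
parser.add_argument('--show_progress', default=True,
                    action=argparse.BooleanOptionalAction)

defaults = parser.parse_args([])
custom = parser.parse_args(['--language', 'en,pl,de', '--no-show_progress'])

print(defaults.language, defaults.print_progress, defaults.show_progress)
# prints "en True True"
print(custom.language, custom.print_progress, custom.show_progress)
# prints "en,pl,de True False"
```

The value of `--language` stays a plain string here, so splitting `en,pl,de` into individual codes would have to happen somewhere downstream of the parser.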