Feature/54 add docling support #113
…g-support # Conflicts: # config/strategies.yaml # text_extract_api/extract/tasks.py
…54-add-docling-support
…ling-support-btrojan
…support-btrojan Feature/54 add docling support btrojan
+ support for docx, txt, html, plain, csv, json, xml + a few fixes
OK, I think this is cool and done right. Great job, @choinek!
The only thing: please don't remove the RemoteStrategy references, as they are required for the remote marker-pdf to work.
That's the only requested change before the merge.
Thanks!
@pkarw To be honest, I don't remember why I commented it out; I'm checking it now. :)
+ use default pdf file format for pdf instead of using docling for that
…into feature/54-add-docling-support
tests:
Adding support for docling
Adding a special docling document type, which does not convert to any other format itself, but from which text can be extracted directly
Using text_gatherer, a mechanism tied to the ExtractResult DTO that was previously introduced as independent functionality
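
To illustrate the two points above, here is a minimal sketch of a document type that offers no conversions but hands its text straight to a gatherer callback. Everything except the names `ExtractResult` and `text_gatherer` is hypothetical (`ExtractResultSketch`, `DirectTextDocument`, `convert_to`, `extract_text`), and the real signatures in this PR may well differ; treat it as an illustration of the idea, not as the code that was merged.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ExtractResultSketch:
    """Hypothetical stand-in for the ExtractResult DTO mentioned in the PR."""
    chunks: List[str] = field(default_factory=list)

    def text_gatherer(self, chunk: str) -> None:
        # Collect each piece of text as the strategy produces it.
        self.chunks.append(chunk)

    @property
    def text(self) -> str:
        # Join the gathered pieces into the final extracted text.
        return "".join(self.chunks)


@dataclass
class DirectTextDocument:
    """Hypothetical document type: no conversions, text is read directly."""
    filename: str
    content: bytes

    def convert_to(self, target: str) -> None:
        # This type deliberately converts to nothing.
        raise NotImplementedError(f"{self.filename} cannot be converted to {target}")

    def extract_text(self, gather: Callable[[str], None]) -> None:
        # Feed the text to the gatherer line by line, keeping line endings.
        for line in self.content.decode("utf-8").splitlines(keepends=True):
            gather(line)


doc = DirectTextDocument("note.txt", b"first line\nsecond line\n")
result = ExtractResultSketch()
doc.extract_text(result.text_gatherer)
```

Passing the gatherer as a plain callable keeps the document type unaware of how the result is stored, which is one plausible reason for keeping the mechanism on the DTO side.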
Note: using a prompt with docling is recommended even though docling already converts to markdown by itself. On several test invoices its conversion was quite poor, so for testing purposes I added this prompt in OcrRequest:
"Only answer the main task. No hints, advice, side information, examples, or extra comments. Make markdown from provided file, or just send plain text if its not convertable."
I'm using Llama 3.1.
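
As a rough sketch of how such a prompt might be attached to a request, the snippet below builds a request payload as a plain dictionary. The field names (`strategy`, `file`, `prompt`, `model`), the helper `build_ocr_request`, and the model tag `llama3.1` are assumptions made for illustration; the actual `OcrRequest` fields in the project may be named differently. The prompt text is copied verbatim from the note above.

```python
import json

# Prompt quoted in the PR description, kept verbatim.
DOCLING_PROMPT = (
    "Only answer the main task. No hints, advice, side information, examples, "
    "or extra comments. Make markdown from provided file, or just send plain "
    "text if its not convertable."
)


def build_ocr_request(file_name: str,
                      prompt: str = DOCLING_PROMPT,
                      model: str = "llama3.1") -> dict:
    """Build a hypothetical request payload; field names are assumptions."""
    return {
        "strategy": "docling",  # assumed strategy identifier
        "file": file_name,
        "prompt": prompt,
        "model": model,         # assumed tag for Llama 3.1
    }


payload = build_ocr_request("invoice.pdf")
print(json.dumps(payload, indent=2))
```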