@choinek choinek commented Mar 27, 2025

Adding support for docling

Adding a special docling document type. It is not converted to any other format; instead, text is extracted from it directly.

Extraction uses text_gatherer, a mechanism tied to the ExtractResult DTO that was introduced earlier as an independent feature.
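
To illustrate the idea of a result object that carries its own text gatherer, here is a minimal sketch. It is not the code from this PR: the `ExtractResult` fields shown here, `FakeDoclingDocument`, and `gather_paragraphs` are all assumptions made up for illustration, and the real docling and project classes may look quite different.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional


@dataclass
class ExtractResult:
    """Hypothetical DTO: holds the extracted value and an optional text gatherer."""
    value: Any
    text_gatherer: Optional[Callable[[Any], str]] = None

    @property
    def text(self) -> str:
        # Use the custom gatherer when one was supplied, otherwise fall back to str().
        if self.text_gatherer is not None:
            return self.text_gatherer(self.value)
        return str(self.value)


@dataclass
class FakeDoclingDocument:
    """Stand-in for a parsed document; NOT the real docling class."""
    paragraphs: List[str]


def gather_paragraphs(doc: FakeDoclingDocument) -> str:
    # Hypothetical gatherer: join paragraphs with blank lines between them.
    return "\n\n".join(doc.paragraphs)


# The document is never converted to another format; text is pulled from it directly.
result = ExtractResult(
    FakeDoclingDocument(["Invoice", "Total: 754.00"]),
    gather_paragraphs,
)
print(result.text)
```

The point of the design, as sketched, is that callers only read `result.text` and do not need to know which document type produced it.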

Note: using a prompt with docling is recommended, even though docling already converts to markdown by itself. On several test invoices its markdown output was quite poor, so for testing purposes I added this prompt to OcrRequest:
"Only answer the main task. No hints, advice, side information, examples, or extra comments. Make markdown from provided file, or just send plain text if its not convertable."
I'm using llama 3.1.

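As a rough sketch of how such a prompt could be attached to a request: the payload keys (`file`, `strategy`, `model`, `prompt`), the `build_ocr_request` helper, and the `llama3.1` model tag below are assumptions for illustration only and are not taken from the actual OcrRequest definition. The prompt text is copied verbatim from the note above.

```python
import json

# Prompt text copied verbatim from the PR description.
PROMPT = (
    "Only answer the main task. No hints, advice, side information, examples, "
    "or extra comments. Make markdown from provided file, or just send plain "
    "text if its not convertable."
)


def build_ocr_request(file_b64: str, strategy: str = "docling",
                      model: str = "llama3.1") -> dict:
    """Build a request body. All key names here are hypothetical."""
    return {
        "file": file_b64,      # base64-encoded file content (assumed field)
        "strategy": strategy,  # assumed name for selecting the docling path
        "model": model,        # assumed model tag for llama 3.1
        "prompt": PROMPT,
    }


payload = build_ocr_request("aGVsbG8=")
print(json.dumps(payload, indent=2))
```
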
@choinek choinek self-assigned this Mar 27, 2025
@choinek choinek changed the title [WIP] Feature/54 add docling support Feature/54 add docling support Apr 26, 2025
@choinek choinek requested a review from pkarw April 26, 2025 14:04

@pkarw pkarw left a comment

Ok I think this is cool and done right. Great job @choinek!

The only thing: please don't remove the RemoteStrategy references, as they're required for the remote marker-pdf to work.

That's the only requested change before the merge

Thanks!

choinek commented Apr 28, 2025

@pkarw
> please don't remove the RemoteStrategy references as it's required for the remote marker-pdf to work

To be honest, I don't remember why I commented it out. I'm checking it now. :)

@choinek choinek linked an issue Apr 29, 2025 that may be closed by this pull request
6 tasks
choinek added 2 commits April 29, 2025 03:39
+ use default pdf file format for pdf instead of using docling for that
@choinek choinek merged commit 70af9f1 into main Apr 29, 2025

choinek commented Apr 29, 2025

tests:

[2025-04-29 04:17:42,360: INFO/MainProcess] Task text_extract_api.extract.tasks.ocr_task[ae60f5fd-faa4-4546-87d7-eb403a9911c6] succeeded in 24.2156476250384s: '### Acme Invoice Ltd

#### Notes
From: Acme Invoice Ltd  
Darrow Street 2  
E1 7AW Portsoken  
London  

#### Invoice For
John Doe  
Invoice ID: INVISI24/2024  
2048 Michigan Str  
Address Line 2: PO Number: 60601 Chicago; US  

#### Issue Date: 17/09/2024  
Due Date: 11/10/2024  

### Subject

| Description       | Quantity   | Unit Price   |   Amount |
|-------------------|------------|--------------|----------|
| iPhone 13 PRO MAX |            | 700.00       |      700 |
| Magic Mouse       |            | 54.00        |       54 |
|                   |            |              |        0 |
|                   |            |              |        0 |
|                   |            |              |        0 |
|                   |            |              |        0 |

#### Subtotal
754.00

#### Discount (0.25 25%)
Amount Due: 701.22'
[2025-04-29 04:17:42,404: INFO/MainProcess] Task text_extract_api.extract.tasks.ocr_task[3023fd58-0288-4eda-916b-fd712bf78089] received
[2025-04-29 04:17:42,411


Here is the text after fixing spelling issues, removing personal information, and replacing it with "ANONYMIZED":

**Dikengil Radiology Associates**

*   **Address:** ANONYMIZED
*   **Phone:** ANONYMIZED

**Patient Information**

*   **Name:** ANONYMIZED
*   **RE:** ANONYMIZED; 55F
*   **Acct #:** ANONYMIZED
*   **DOB:** 00/00/1966
*   **Study:** Brain MRI
*   **DOS:** 04/29/2021

**Clinical History**

*   **Age:** 55 years old
*   **Sex:** Female
*   **Interruption:** Intermittent positional headaches

**Technique**

*   **MRI Technique:** Non-contrast MRI of the brain performed in three orthogonal planes using T1/T2/T2* FLAIR/T2 GRE/Diffusion-ADC sequences.

**Findings**

*   **Lateral Ventricles:** Normal volume and configuration
*   **Third and Fourth Ventricles:** Normal volume and configuration
*   **Diffusion-ADC Sequences:** No evidence of diffusion restriction or ADC signal changes suggestive of infarction or hemorrhage

**Conclusion**

*   **Diagnosis:** Chiari I malformation with 10mm descent of cerebellar tonsils
*   **Recommendations:** Further evaluation and management by a neurosurgeon for potential surgical intervention.

I removed the following personal information:

*   Names (Phil Referrer, Jane, Mary)
*   Address (0 Maywood Ave, Maywood, NJ 00000)
*   Phone number (201-725-0913)
*   Date of Birth (00/00/1966)

 '### Introduction

#### Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.

### Main Content

Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur.

Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum. Phasellus imperdiet ligula sit amet arcu suscipit, et placerat metus egestas.

Vestibulum ante ipsum.

Primis in faucibus orci luctus et ultrices posuere cubilia curae; Curabitur bibendum, nisl ut luctus euismod, elit ligula hendrerit lorem, nec tincidunt libero nisl a metus.

Aliquam erat volutpat.

Ut vehicula purus a magna malesuada, a tincidunt mi interdum.

### Details

Nullam tincidunt vehicula nulla, nec tincidunt metus. Integer tristique tristique turpis, a convallis elit fermentum a. Sed volutpat risus nec arcu...'

Development

Successfully merging this pull request may close these issues.

[feat] Add docling support
