Problem Description
The _resolve_sub_lines function in doctr/models/builder.py uses a fixed paragraph_break parameter to determine when to split lines horizontally. This approach fails when processing documents with two text columns side by side, as it doesn't adapt to the actual spacing patterns in the document.
Current Behavior
When processing a document with two text columns side by side, the algorithm groups words from both columns into a single line because it uses a fixed threshold (paragraph_break) that doesn't consider the actual gaps between words in the specific line being processed.
Example scenario:
Column 1: "This is some text"    Column 2: "Another text block"
          "More text here"                 "Different content"
Current result: Words from both columns get grouped into the same line if the gap between them is smaller than the fixed paragraph_break threshold.
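To make the failure concrete, here is a minimal sketch of a fixed-threshold split. The function name split_fixed, the box coordinates, and the threshold value 0.035 are assumptions chosen for illustration (boxes are taken to be relative (xmin, ymin, xmax, ymax) rows); this is not the library's actual code.

```python
import numpy as np


def split_fixed(boxes, word_idcs, paragraph_break):
    """Hypothetical stand-in for a split that uses one fixed gap threshold."""
    if not word_idcs:
        return []
    # Order the words from left to right
    order = sorted(word_idcs, key=lambda w: boxes[w, 0])
    lines, sub_line = [], [order[0]]
    for i in order[1:]:
        # Horizontal gap between the previous word's right edge and this word's left edge
        dist = boxes[i, 0] - boxes[sub_line[-1], 2]
        if dist < paragraph_break:
            sub_line.append(i)  # same sub-line
        else:
            lines.append(sub_line)
            sub_line = [i]
    lines.append(sub_line)
    return lines


# Illustrative data: three words in column 1, two words in column 2.
# Word gaps are about 0.01; the gutter between the columns is about 0.03.
boxes = np.array([
    [0.05, 0.10, 0.12, 0.14],
    [0.13, 0.10, 0.18, 0.14],
    [0.19, 0.10, 0.27, 0.14],
    [0.30, 0.10, 0.38, 0.14],
    [0.39, 0.10, 0.45, 0.14],
])

merged = split_fixed(boxes, list(range(5)), paragraph_break=0.035)
print(merged)  # [[0, 1, 2, 3, 4]]: both columns end up in one sub-line
```

With these numbers the gutter (about 0.03) is three times the word spacing, yet it still falls under the fixed threshold, so the two columns are merged.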
Root Cause
In _resolve_sub_lines() (line 92), the condition uses a fixed parameter:
if dist < self.paragraph_break:
    horiz_break = False  # Same sub-line
This ignores the actual spacing patterns within the specific line being processed, causing words from different columns to be incorrectly grouped together.
Proposed Solution
Replace the fixed paragraph_break parameter with a dynamic calculation based on the median of gaps between consecutive words in the current line:
def _resolve_sub_lines(self, boxes: np.ndarray, word_idcs: list[int]) -> list[list[int]]:
    """Order boxes to group them in sub-lines"""
    if len(word_idcs) <= 1:
        return [word_idcs]

    # Sort words by x position to ensure proper order
    word_idcs_sorted = sorted(word_idcs, key=lambda w: boxes[w][0])

    # Compute median of gaps between consecutive words in this line
    gaps = []
    for i in range(len(word_idcs_sorted) - 1):
        current_right = boxes[word_idcs_sorted[i]][2]
        next_left = boxes[word_idcs_sorted[i + 1]][0]
        gap = next_left - current_right
        if gap > 0:  # Only consider positive gaps (no overlapping)
            gaps.append(gap)

    if not gaps:
        return [word_idcs_sorted]

    x_gap_med = np.median(gaps)

    # Split the line based on gaps that exceed the median * 2
    lines = []
    sub_line = [word_idcs_sorted[0]]
    for i in word_idcs_sorted[1:]:
        horiz_break = True
        prev_box = boxes[sub_line[-1]]
        dist = boxes[i, 0] - prev_box[2]
        # Use dynamic threshold based on line-specific gap median
        if dist < x_gap_med * 2:
            horiz_break = False
        if horiz_break:
            lines.append(sub_line)
            sub_line = [i]
        else:
            sub_line.append(i)

    if sub_line:
        lines.append(sub_line)
    return lines
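For experimenting outside the builder class, the same logic can be written as a free function. This is a sketch under assumptions: split_by_median_gap is a hypothetical name, and the boxes are made-up relative (xmin, ymin, xmax, ymax) coordinates for one line spanning two columns.

```python
import numpy as np


def split_by_median_gap(boxes, word_idcs):
    """Standalone sketch of the median-gap split proposed above."""
    if len(word_idcs) <= 1:
        return [list(word_idcs)]
    # Order the words from left to right
    order = sorted(word_idcs, key=lambda w: boxes[w, 0])
    # Gap between each word and its right-hand neighbour
    gaps = [boxes[b, 0] - boxes[a, 2] for a, b in zip(order, order[1:])]
    positive = [g for g in gaps if g > 0]  # ignore overlapping boxes
    if not positive:
        return [order]
    threshold = 2 * float(np.median(positive))
    lines, sub_line = [], [order[0]]
    for i, gap in zip(order[1:], gaps):
        if gap < threshold:
            sub_line.append(i)  # same sub-line
        else:
            lines.append(sub_line)
            sub_line = [i]
    lines.append(sub_line)
    return lines


# Illustrative data: word gaps of about 0.01, column gutter of about 0.03
boxes = np.array([
    [0.05, 0.10, 0.12, 0.14],
    [0.13, 0.10, 0.18, 0.14],
    [0.19, 0.10, 0.27, 0.14],
    [0.30, 0.10, 0.38, 0.14],
    [0.39, 0.10, 0.45, 0.14],
])

sub_lines = split_by_median_gap(boxes, list(range(5)))
print(sub_lines)  # [[0, 1, 2], [3, 4]]
```

Here the median gap is about 0.01, so the threshold is about 0.02 and only the gutter exceeds it.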
Key Changes
Remove fixed paragraph_break parameter - No longer relies on a global threshold
Calculate line-specific gap median - Computes the median of gaps between consecutive words in the current line
Dynamic threshold - Uses x_gap_med * 2 as the threshold for splitting
Questions for Discussion
Threshold tuning: Is x_gap_med * 2 an appropriate multiplier, or should it be configurable?
Performance impact: How would calculating the median for each line affect processing speed?
Edge cases: How should we handle lines with very few words or overlapping boxes?
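As input for the threshold and edge-case questions, here is a small sketch that makes the multiplier a parameter and runs it on two awkward lines. The gap_factor parameter, the function name, and all coordinates are assumptions for illustration only.

```python
import numpy as np


def split_by_median_gap(boxes, word_idcs, gap_factor=2.0):
    """Median-gap split with a hypothetical configurable multiplier."""
    if len(word_idcs) <= 1:
        return [list(word_idcs)]
    order = sorted(word_idcs, key=lambda w: boxes[w, 0])
    gaps = [boxes[b, 0] - boxes[a, 2] for a, b in zip(order, order[1:])]
    positive = [g for g in gaps if g > 0]  # overlapping boxes give gaps <= 0
    if not positive:
        return [order]
    threshold = gap_factor * float(np.median(positive))
    lines, sub_line = [], [order[0]]
    for i, gap in zip(order[1:], gaps):
        if gap < threshold:
            sub_line.append(i)
        else:
            lines.append(sub_line)
            sub_line = [i]
    lines.append(sub_line)
    return lines


# Edge case 1: one word per column. The single gap is its own median,
# so it can never exceed gap_factor * median when gap_factor > 1.
two_words = np.array([
    [0.05, 0.10, 0.20, 0.14],
    [0.60, 0.10, 0.75, 0.14],
])
never_split = split_by_median_gap(two_words, [0, 1])
print(never_split)  # [[0, 1]]

# Edge case 2: few words, one overlap. Only two positive gaps remain
# (about 0.01 and 0.20), so the gutter itself pulls the median up to 0.105.
few_words = np.array([
    [0.05, 0.10, 0.15, 0.14],
    [0.14, 0.10, 0.22, 0.14],  # overlaps the previous box
    [0.23, 0.10, 0.30, 0.14],
    [0.50, 0.10, 0.60, 0.14],  # second column
])
with_factor_2 = split_by_median_gap(few_words, [0, 1, 2, 3], gap_factor=2.0)
with_factor_1_5 = split_by_median_gap(few_words, [0, 1, 2, 3], gap_factor=1.5)
print(with_factor_2)    # [[0, 1, 2, 3]]
print(with_factor_1_5)  # [[0, 1, 2], [3]]
```

In this toy data the result on short lines depends heavily on the multiplier, and a two-word line is never split, which suggests that lines with very few words may need a fallback such as the existing fixed threshold.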
Use Cases Affected