Issue: Imperfect line grouping in side-by-side text blocks #2012

ArthurMrl · 2025-09-03T12:11:05Z

Problem Description

The _resolve_sub_lines function in doctr/models/builder.py uses a fixed paragraph_break parameter to determine when to split lines horizontally. This approach fails when processing documents with two text columns side by side, as it doesn't adapt to the actual spacing patterns in the document.

Current Behavior

When processing a document with two text columns side by side, the algorithm groups words from both columns into a single line because it uses a fixed threshold (paragraph_break) that doesn't consider the actual gaps between words in the specific line being processed.

Example scenario:

Column 1: "This is some text"     Column 2: "Another text block"
          "More text here"                  "Different content"

Current result: Words from both columns get grouped into the same line whenever the gap between the two columns is smaller than the fixed paragraph_break threshold.

Root Cause

In _resolve_sub_lines() (line 92), the condition uses a fixed parameter:

if dist < self.paragraph_break:
    horiz_break = False  # Same sub-line

This ignores the actual spacing patterns within the specific line being processed, causing words from different columns to be incorrectly grouped together.
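The effect can be reproduced without the builder class. The following is a minimal sketch, not the library's code: split_fixed is a hypothetical standalone function that applies the same fixed-threshold rule, the boxes are made-up (xmin, ymin, xmax, ymax) values in relative coordinates, and 0.035 is used only as an illustrative threshold.

```python
import numpy as np


def split_fixed(boxes: np.ndarray, word_idcs: list[int], paragraph_break: float) -> list[list[int]]:
    """Split one line into sub-lines using a single fixed gap threshold (hypothetical helper)."""
    # Order the words left to right by their xmin coordinate
    ordered = sorted(word_idcs, key=lambda w: boxes[w, 0])
    if len(ordered) <= 1:
        return [ordered]

    sub_lines = [[ordered[0]]]
    for idx in ordered[1:]:
        # Gap between this word's left edge and the previous word's right edge
        dist = boxes[idx, 0] - boxes[sub_lines[-1][-1], 2]
        if dist < paragraph_break:
            sub_lines[-1].append(idx)  # same sub-line
        else:
            sub_lines.append([idx])  # horizontal break
    return sub_lines


# Made-up boxes: two columns of three words each.
# Word gaps are about 0.01, the gutter between the columns is about 0.03.
boxes = np.array([
    [0.10, 0.10, 0.18, 0.12],
    [0.19, 0.10, 0.27, 0.12],
    [0.28, 0.10, 0.36, 0.12],
    [0.39, 0.10, 0.47, 0.12],
    [0.48, 0.10, 0.56, 0.12],
    [0.57, 0.10, 0.65, 0.12],
])

merged = split_fixed(boxes, list(range(6)), paragraph_break=0.035)
print(merged)
```

Every gap in this example, including the gutter, is below the threshold, so all six words come back as a single sub-line even though the gutter is about three times wider than the word spacing.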

Proposed Solution

Replace the fixed paragraph_break parameter with a dynamic calculation based on the median of gaps between consecutive words in the current line:

def _resolve_sub_lines(self, boxes: np.ndarray, word_idcs: list[int]) -> list[list[int]]:
    """Order boxes to group them in sub-lines"""
    if len(word_idcs) <= 1:
        return [word_idcs]
    
    # Sort words by x position to ensure proper order
    word_idcs_sorted = sorted(word_idcs, key=lambda w: boxes[w][0])
    
    # Compute median of gaps between consecutive words in this line
    gaps = []
    for i in range(len(word_idcs_sorted) - 1):
        current_right = boxes[word_idcs_sorted[i]][2]
        next_left = boxes[word_idcs_sorted[i + 1]][0]
        gap = next_left - current_right
        if gap > 0:  # Only consider positive gaps (no overlapping)
            gaps.append(gap)
    
    if not gaps:
        return [word_idcs_sorted]
    
    x_gap_med = np.median(gaps)
    
    # Split the line based on gaps that exceed the median * 2
    lines = []
    sub_line = [word_idcs_sorted[0]]
    
    for i in word_idcs_sorted[1:]:
        horiz_break = True
        prev_box = boxes[sub_line[-1]]
        dist = boxes[i, 0] - prev_box[2]
        
        # Use dynamic threshold based on line-specific gap median
        if dist < x_gap_med * 2:
            horiz_break = False
        
        if horiz_break:
            lines.append(sub_line)
            sub_line = [i]
        else:
            sub_line.append(i)
    
    if sub_line:
        lines.append(sub_line)
    
    return lines
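To try the idea outside the builder class, here is a compact standalone sketch of the same logic. The function name split_by_median_gap and its multiplier argument are hypothetical, and the boxes are made-up (xmin, ymin, xmax, ymax) values in relative coordinates; this is an illustration of the proposal, not tested against real documents.

```python
import numpy as np


def split_by_median_gap(boxes: np.ndarray, word_idcs: list[int], multiplier: float = 2.0) -> list[list[int]]:
    """Split one line into sub-lines where a gap reaches multiplier * median gap (hypothetical helper)."""
    # Order the words left to right by their xmin coordinate
    ordered = sorted(word_idcs, key=lambda w: boxes[w, 0])
    if len(ordered) <= 1:
        return [ordered]

    # Gap between each word and its right-hand neighbour
    gaps = [boxes[b, 0] - boxes[a, 2] for a, b in zip(ordered, ordered[1:])]
    positive = [g for g in gaps if g > 0]  # overlapping boxes are ignored for the median
    if not positive:
        return [ordered]

    threshold = multiplier * float(np.median(positive))

    sub_lines = [[ordered[0]]]
    for idx, gap in zip(ordered[1:], gaps):
        if gap < threshold:
            sub_lines[-1].append(idx)  # same sub-line
        else:
            sub_lines.append([idx])  # horizontal break
    return sub_lines


# Made-up boxes: two columns of three words each.
# Word gaps are about 0.01, the gutter between the columns is about 0.03.
boxes = np.array([
    [0.10, 0.10, 0.18, 0.12],
    [0.19, 0.10, 0.27, 0.12],
    [0.28, 0.10, 0.36, 0.12],
    [0.39, 0.10, 0.47, 0.12],
    [0.48, 0.10, 0.56, 0.12],
    [0.57, 0.10, 0.65, 0.12],
])

split = split_by_median_gap(boxes, list(range(6)))
pair = split_by_median_gap(boxes, [2, 3])
print(split, pair)
```

With six words the median gap is about 0.01, so the 0.03 gutter exceeds the 0.02 threshold and the line splits into the two columns. With only two words (indices 2 and 3) the single gap is its own median, so it can never reach twice that value and the pair is never split, which is one of the edge cases a final design would need to address.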

Key Changes

Remove the fixed paragraph_break check: sub-line splitting no longer relies on a global threshold
Calculate line-specific gap median: computes the median of the positive gaps between consecutive words in the current line
Dynamic threshold: uses x_gap_med * 2 as the threshold for splitting

Questions for Discussion

Threshold tuning: Is x_gap_med * 2 an appropriate multiplier, or should it be configurable?
Performance impact: How would calculating the median for each line affect processing speed?
Edge cases: How should we handle lines with very few words or overlapping boxes?
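One possible direction for the first and third questions, sketched under assumptions: make the multiplier an argument and fall back to a fixed threshold when a line has too few positive gaps for the median to be meaningful. The helper adaptive_break_threshold and its parameters (multiplier, fallback, min_gaps) are hypothetical names chosen for illustration, and the default values are placeholders rather than tuned settings.

```python
import numpy as np


def adaptive_break_threshold(
    gaps,
    multiplier: float = 2.0,
    fallback: float = 0.035,
    min_gaps: int = 3,
) -> float:
    """Return the gap size at which a horizontal break is declared (hypothetical helper).

    Uses multiplier * median of the positive gaps, or the fixed fallback
    when fewer than min_gaps positive gaps are available.
    """
    values = np.asarray(gaps, dtype=float)
    positive = values[values > 0]  # overlapping boxes produce non-positive gaps
    if positive.size < min_gaps:
        return fallback
    return multiplier * float(np.median(positive))


enough_gaps = adaptive_break_threshold([0.01, 0.01, 0.03, 0.01, 0.01])
too_few_gaps = adaptive_break_threshold([0.03])
with_overlap = adaptive_break_threshold([-0.01, 0.02])
print(enough_gaps, too_few_gaps, with_overlap)
```

On the performance question, the median is computed over only the handful of gaps in one line, so the extra cost is probably small compared with detection and recognition, but that would need to be measured on real documents.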

Use Cases Affected

Multi-column documents (newspapers, magazines)
Side-by-side text blocks
Tables with separate text regions
Any layout with horizontally separated text areas
