UTF-8 Encoding Issues in API Response Text Fields #11

@pablospe

Description

The Draftable API is returning text fields with UTF-8 encoding problems (mojibake) that require post-processing to be readable.

Problem Details

Issue: Text fields (leftText, rightText, text) in API responses contain improperly encoded UTF-8 characters, likely due to encoding/decoding issues in the text extraction pipeline.

Root Cause: This appears to be double-encoding, an encoding detection error, or improper handling of mixed character encodings from source PDFs.
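
For illustration, the snippet below reproduces that failure mode in Python: correctly encoded UTF-8 bytes are re-decoded with the wrong codec, turning "café" into mojibake. This is only a sketch of the suspected mechanism, not Draftable's actual pipeline code.

original = "café"

# Correct UTF-8 bytes for the string above
utf8_bytes = original.encode("utf-8")      # b'caf\xc3\xa9'

# Re-decoding those bytes as Latin-1 produces the mojibake seen in responses
mojibake = utf8_bytes.decode("latin-1")    # 'cafÃ©'

print(mojibake)  # cafÃ©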

Examples

Current API Response (garbled):

{
  "leftText": "cafÃ©"  // Mojibake; should be "café"
}

After ftfy.fix_text() (corrected):

{
  "leftText": "café"  // Properly decoded UTF-8
}

More severe example:

{
  "leftText": "Ã¡Ã©Ã­Ã³Ãº"  // Mojibake; should read "áéíóú"
}

After ftfy.fix_text():

{
  "leftText": "áéíóú"  // Correct Spanish accented characters  
}

Impact

  • Text appears garbled to end users
  • Search functionality fails on corrupted text
  • Document comparison accuracy is reduced
  • Requires client-side post-processing with libraries like ftfy

Workaround

Using the ftfy Python library to repair the encoding issues:

import ftfy

def fix_json_text_fields(response_data):
    """Fix UTF-8 encoding issues in Draftable API response."""
    target_fields = {"rightText", "text", "leftText"}
    stack = [response_data]
    
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            for key, value in current.items():
                if key in target_fields and isinstance(value, str):
                    current[key] = ftfy.fix_text(value)  # Fixes mojibake
                elif isinstance(value, (dict, list)):
                    stack.append(value)
        elif isinstance(current, list):
            for item in current:
                if isinstance(item, (dict, list)):
                    stack.append(item)
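
For example, applying the helper to a payload shaped like the examples above (the sample dict below is made up for illustration and is not an actual Draftable response):

response = {
    "changes": [
        {"leftText": "cafÃ©", "rightText": "cafÃ©"},
    ],
    "text": "some cafÃ© text",
}

fix_json_text_fields(response)             # repairs the fields in place
print(response["changes"][0]["leftText"])  # café
print(response["text"])                    # some café text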

Suggested Fix

Please ensure proper UTF-8 handling in the API response pipeline:

  1. Text extraction: Verify correct encoding detection from source PDFs
  2. Internal processing: Maintain UTF-8 throughout the pipeline
  3. API serialization: Ensure JSON output uses proper UTF-8 encoding
  4. Consider using ftfy server-side to automatically fix common encoding issues (a sketch follows this list)
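
As a rough sketch of items 3 and 4 (assuming a Python server and a flat dict payload; this is not Draftable's actual serialization code), the repair could happen once, just before the JSON is emitted:

import json
import ftfy

def serialize_response(payload: dict) -> bytes:
    """Hypothetical server-side step: repair known text fields, then emit UTF-8 JSON."""
    for field in ("leftText", "rightText", "text"):
        value = payload.get(field)
        if isinstance(value, str):
            payload[field] = ftfy.fix_text(value)
    # ensure_ascii=False keeps accented characters as real UTF-8 instead of \uXXXX escapes
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")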

Environment

  • Occurs with PDFs containing non-ASCII characters (accents, symbols, non-Latin scripts)
  • Frequency: Common with multilingual documents
  • All API endpoints returning text fields are affected

This significantly impacts international document processing. Server-side encoding fixes would eliminate the need for client-side workarounds.
