Description
The Draftable API is returning text fields with UTF-8 encoding problems (mojibake) that require post-processing to be readable.
Problem Details
Issue: Text fields (leftText, rightText, text) in API responses contain improperly encoded UTF-8 characters, likely due to encoding/decoding issues in the text extraction pipeline.
Root Cause: Appears to be double-encoding, encoding detection errors, or improper handling of mixed character encodings from source PDFs.
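To illustrate the double-encoding hypothesis, the sketch below reproduces the classic failure using only the Python standard library: UTF-8 bytes are decoded with the wrong codec (Latin-1 here). Whether this is what actually happens inside the Draftable pipeline is an assumption, and make_mojibake is a hypothetical helper name used only for this demonstration.

```python
def make_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Encode text as UTF-8, then decode the bytes with the wrong codec."""
    return text.encode("utf-8").decode(wrong_codec)


once = make_mojibake("café")
twice = make_mojibake(once)  # the same mistake applied a second time

print(once)         # cafÃ©
print(repr(twice))  # longer and even less readable than `once`
```

Each pass through the wrong codec turns every non-ASCII character into two or more characters, which matches the kind of output shown in the examples in this report.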
Examples
Current API Response (garbled):
After ftfy.fix_text() (corrected):
{
  "leftText": "café" // Properly decoded UTF-8
}
More severe example:
{
  "leftText": "áéÃóú" // Garbled accented characters
}
After ftfy.fix_text():
{
  "leftText": "áéíóú" // Correct Spanish accented characters
}
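For the simplest case (UTF-8 bytes read as Latin-1 exactly once), the repair can be sketched by reversing the mistake with the standard library. This is only an illustration of the idea, not a description of how ftfy works internally; ftfy covers considerably more cases. undo_latin1_mojibake is a hypothetical helper name.

```python
def undo_latin1_mojibake(text: str) -> str:
    """Reverse a single UTF-8-read-as-Latin-1 mistake; return text unchanged if it does not apply."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside Latin-1, or the bytes are
        # not valid UTF-8, so this particular repair is not applicable.
        return text


garbled = "caf\u00c3\u00a9"           # the string "cafÃ©"
print(undo_latin1_mojibake(garbled))  # café
print(undo_latin1_mojibake("café"))   # café (already correct, left alone)
```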
Impact
- Text appears garbled to end users
- Search functionality fails on corrupted text
- Document comparison accuracy is reduced
- Requires client-side post-processing with libraries like ftfy
Workaround
Using the ftfy Python library to repair the encoding issues:
import ftfy


def fix_json_text_fields(response_data):
    """Fix UTF-8 encoding issues in Draftable API response (modifies it in place)."""
    target_fields = {"rightText", "text", "leftText"}
    # Walk the nested JSON structure iteratively to avoid recursion limits.
    stack = [response_data]
    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            for key, value in current.items():
                if key in target_fields and isinstance(value, str):
                    current[key] = ftfy.fix_text(value)  # Fixes mojibake
                elif isinstance(value, (dict, list)):
                    stack.append(value)
        elif isinstance(current, list):
            for item in current:
                if isinstance(item, (dict, list)):
                    stack.append(item)
    # Returned for convenience; the input object itself has been updated.
    return response_data
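To help pinpoint which fields are affected before repairing them, a read-only variant of the traversal can report the paths of fields that a fixer would change. This is a sketch under assumptions: find_changed_fields and latin1_roundtrip are hypothetical names, the sample response shape is invented for illustration and is not the real Draftable schema, and the stand-in fixer only handles the simplest case. In practice ftfy.fix_text could be passed as the fixer instead.

```python
from typing import Any, Callable, List

TARGET_FIELDS = {"leftText", "rightText", "text"}


def latin1_roundtrip(text: str) -> str:
    """Stand-in fixer: undo one UTF-8-read-as-Latin-1 mistake if possible."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def find_changed_fields(data: Any, fixer: Callable[[str], str], path: str = "$") -> List[str]:
    """Return JSON-style paths of target fields that `fixer` would change."""
    changed: List[str] = []
    if isinstance(data, dict):
        for key, value in data.items():
            child = f"{path}.{key}"
            if key in TARGET_FIELDS and isinstance(value, str):
                if fixer(value) != value:
                    changed.append(child)
            else:
                changed.extend(find_changed_fields(value, fixer, child))
    elif isinstance(data, list):
        for index, item in enumerate(data):
            changed.extend(find_changed_fields(item, fixer, f"{path}[{index}]"))
    return changed


# Invented sample payload, for illustration only.
sample = {
    "changes": [{"leftText": "caf\u00c3\u00a9", "rightText": "tea"}],
    "meta": {"text": "ok"},
}
print(find_changed_fields(sample, latin1_roundtrip))  # ['$.changes[0].leftText']
```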
Suggested Fix
Please ensure proper UTF-8 handling in the API response pipeline:
- Text extraction: Verify correct encoding detection from source PDFs
- Internal processing: Maintain UTF-8 throughout the pipeline
- API serialization: Ensure JSON output uses proper UTF-8 encoding
- Consider using ftfy server-side to automatically fix common encoding issues
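As a rough sketch of the serialization point above, the following shows one way to emit JSON as explicit UTF-8 bytes with a matching charset declaration. It assumes a Python backend purely for illustration; the actual Draftable server stack is not known, and serialize_response is a hypothetical function name.

```python
import json
from typing import Dict, Tuple


def serialize_response(payload: dict) -> Tuple[bytes, Dict[str, str]]:
    """Serialize payload to UTF-8 JSON bytes and declare the charset explicitly."""
    # ensure_ascii=False keeps characters such as "é" as-is in the JSON text;
    # the explicit .encode("utf-8") fixes the byte encoding in one place.
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {"Content-Type": "application/json; charset=utf-8"}
    return body, headers


body, headers = serialize_response({"leftText": "café"})
# Decoding with the declared charset gives back the original text.
print(json.loads(body.decode("utf-8"))["leftText"])  # café
```

Keeping text as Unicode strings internally and encoding exactly once, at the response boundary, avoids the repeated encode/decode steps that typically produce mojibake.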
Environment
- Occurs with PDFs containing non-ASCII characters (accents, symbols, non-Latin scripts)
- Frequency: Common with multilingual documents
- All API endpoints returning text fields are affected
This significantly impacts international document processing. Server-side encoding fixes would eliminate the need for client-side workarounds.