UTF-8 Encoding Issues in API Response Text Fields #11

## Description

The Draftable API is returning text fields with UTF-8 encoding problems (mojibake) that require post-processing to be readable.

## Problem Details

**Issue**: Text fields (`leftText`, `rightText`, `text`) in API responses contain mojibake: non-ASCII characters arrive as multi-character sequences such as `Ã©` instead of `é`.

**Suspected root cause**: Double-encoding (UTF-8 bytes decoded as a single-byte encoding and then re-encoded), encoding detection errors, or improper handling of mixed character encodings from source PDFs, most likely in the text extraction pipeline.
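To illustrate the double-encoding hypothesis, here is a minimal sketch using only the Python standard library. It assumes the wrong codec is Latin-1, which has not been confirmed for Draftable's pipeline, and the helper names `simulate_mojibake` and `undo_mojibake` are hypothetical.

```python
def simulate_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Encode as UTF-8, then decode with the wrong codec (the suspected bug)."""
    return text.encode("utf-8").decode(wrong_codec)


def undo_mojibake(text: str, wrong_codec: str = "latin-1") -> str:
    """Reverse the mistake; return the input unchanged if it does not round-trip."""
    try:
        return text.encode(wrong_codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


garbled = simulate_mojibake("café")
print(garbled)                 # cafÃ©
print(undo_mojibake(garbled))  # café
```

Each non-ASCII character becomes two characters because its two UTF-8 bytes are each read as a separate Latin-1 character. This matches the pattern in the responses we receive, but it is only one of the possible causes listed above.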

  ## Examples

  ### Current API Response (garbled):
  ```jsonc
  {
    "leftText": "cafÃ©"  // Should be "café"
  }
```

  After `ftfy.fix_text()` (corrected):
  ```jsonc
  {
    "leftText": "café"  // Properly decoded UTF-8
  }
```
  More severe example:
  ```jsonc
  {
    "leftText": "Ã¡Ã©Ã­Ã³Ãº"  // Garbled accented characters
  }
```
  After `ftfy.fix_text()`:
  ```jsonc
  {
    "leftText": "áéíóú"  // Correct Spanish accented characters
  }
```

## Impact

- Text appears garbled to end users
- Search functionality fails on corrupted text
- Document comparison accuracy is reduced
- Requires client-side post-processing with libraries like `ftfy`

## Workaround

We currently use the `ftfy` Python library to repair the affected fields client-side:

```python
import ftfy

def fix_json_text_fields(response_data):
    """Repair mojibake in Draftable API text fields, in place.

    Walks the parsed JSON iteratively with an explicit stack and returns
    the same (mutated) object for convenience.
    """
    target_fields = {"leftText", "rightText", "text"}
    stack = [response_data]

    while stack:
        current = stack.pop()
        if isinstance(current, dict):
            for key, value in current.items():
                if key in target_fields and isinstance(value, str):
                    current[key] = ftfy.fix_text(value)  # Fixes mojibake
                elif isinstance(value, (dict, list)):
                    stack.append(value)  # Descend into nested containers
        elif isinstance(current, list):
            # Only nested containers can hold the keyed text fields
            stack.extend(item for item in current if isinstance(item, (dict, list)))

    return response_data
```

## Suggested Fix

Please ensure proper UTF-8 handling in the API response pipeline:

1. **Text extraction**: Verify correct encoding detection from source PDFs
2. **Internal processing**: Maintain UTF-8 throughout the pipeline  
3. **API serialization**: Ensure JSON output uses proper UTF-8 encoding
4. **Consider using ftfy server-side** to automatically fix common encoding issues
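As a rough illustration of how the serialization step could be guarded by an automated check, the sketch below flags strings that look like UTF-8 bytes decoded as Windows-1252 or Latin-1. The function name `looks_like_mojibake` and the heuristic itself are assumptions made for illustration. It is much narrower than what `ftfy` does, and it can misfire on text that legitimately contains sequences like `Ã©`.

```python
import json


def looks_like_mojibake(text: str) -> bool:
    """Heuristic: True if the text re-encodes to single bytes that form valid UTF-8."""
    if text.isascii():
        return False
    for codec in ("cp1252", "latin-1"):
        try:
            text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
        return True  # Non-ASCII text that round-trips this way is suspicious
    return False


# Raw UTF-8 and \uXXXX escapes are both valid JSON for the same string,
# so neither form should be flagged after parsing.
raw = json.dumps({"leftText": "café"}, ensure_ascii=False).encode("utf-8")
escaped = json.dumps({"leftText": "café"}).encode("utf-8")  # ensure_ascii defaults to True
for body in (raw, escaped):
    parsed = json.loads(body)
    assert parsed["leftText"] == "café"
    assert not looks_like_mojibake(parsed["leftText"])

print(looks_like_mojibake("cafÃ©"))  # True
```

A check like this could run against a small set of multilingual sample PDFs, so that regressions in extraction or serialization are caught before they reach API clients.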

## Environment

- Occurs with PDFs containing non-ASCII characters (accents, symbols, non-Latin scripts)
- Frequency: Common with multilingual documents
- All API endpoints returning text fields are affected

This significantly impacts international document processing. Server-side encoding fixes would eliminate the need for client-side workarounds.
