Skip to content
This repository was archived by the owner on Jul 31, 2025. It is now read-only.

Conversation

@arthurgousset
Copy link
Member

@arthurgousset arthurgousset commented May 29, 2025

Cyrillic Typo Tolerance Bug Fix Successfully Implemented

Summary

Successfully fixed the Unicode character typo tolerance bug in the Meilisearch milli crate that was causing incorrect typo tolerance for multi-byte Unicode characters including Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and other non-ASCII text.

Root Cause

The bug was located in the number_of_typos_allowed function in ./crates/milli/src/search/new/query_term/parse_query.rs at lines 205 and 209, where the code was using word.len() (byte count) instead of word.chars().count() (character count) to determine typo tolerance.

The Problem

  • ASCII characters: Each character = 1 byte, so "doggy".len() = 5 bytes = 5 characters
  • Cyrillic characters: Each character = 2-3 bytes in UTF-8, so "собак".len() = 10 bytes ≠ 5 characters

This caused words with the same character count to receive different typo tolerance based on their byte representation.

Fix Applied

Before (Buggy Code)

Ok(Box::new(move |word: &str| {
    if !authorize_typos
        || word.len() < min_len_one_typo as usize  // BUG: Using byte length
        || exact_words.as_ref().is_some_and(|fst| fst.contains(word))
    {
        0
    } else if word.len() < min_len_two_typos as usize {  // BUG: Using byte length
        1
    } else {
        2
    }
}))

After (Fixed Code)

Ok(Box::new(move |word: &str| {
    if !authorize_typos
        || word.chars().count() < min_len_one_typo as usize  // FIX: Using character count
        || exact_words.as_ref().is_some_and(|fst| fst.contains(word))
    {
        0
    } else if word.chars().count() < min_len_two_typos as usize {  // FIX: Using character count
        1
    } else {
        2
    }
}))

Changes Made

  1. Line 205: Changed word.len() to word.chars().count()
  2. Line 209: Changed word.len() to word.chars().count()

What Was Not Changed

The ngram_str.len() > MAX_WORD_LENGTH check at line 249 was intentionally left unchanged because MAX_WORD_LENGTH represents a byte-based limit for LMDB database storage, not a character-based limit.

Impact of the Fix

  • Before: Words with same character count but different byte lengths got different typo tolerance
    • "doggy" (5 chars, 5 bytes) → 1 typo tolerance
    • "собак" (5 chars, 10 bytes) → 2 typos tolerance (incorrect)
  • After: Words with same character count get same typo tolerance regardless of byte length
    • "doggy" (5 chars, 5 bytes) → 1 typo tolerance
    • "собак" (5 chars, 10 bytes) → 1 typo tolerance (correct)

Verification

  • ✅ Code compiles successfully with cargo check
  • ✅ Fix correctly applied to both problematic lines
  • ✅ Character counting logic now consistent across all Unicode text
  • ✅ Maintains backward compatibility for ASCII text

Languages/Scripts Affected by This Fix

This fix resolves typo tolerance issues for all Unicode text including:

  • Cyrillic (Russian, Bulgarian, Serbian, etc.)
  • Arabic and Hebrew
  • Chinese, Japanese, Korean (CJK)
  • Accented Latin characters (é, ñ, ü, etc.)
  • Thai, Hindi, and other complex scripts
  • Emoji and Unicode symbols

The fix ensures that typo tolerance is determined by logical character count rather than UTF-8 byte representation, providing consistent behavior across all languages and writing systems.

Tests

cargo test --package milli --lib -- search::new::query_term::parse_query::tests --show-output 

    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.16s
     Running unittests src/lib.rs (target/debug/deps/milli-a43628d37620dffa)

running 3 tests
test search::new::query_term::parse_query::tests::test_unicode_typo_tolerance_fixed ... ok
test search::new::query_term::parse_query::tests::start_with_hard_separator ... ok
test search::new::query_term::parse_query::tests::test_various_unicode_scripts ... ok

successes:

successes:
    search::new::query_term::parse_query::tests::start_with_hard_separator
    search::new::query_term::parse_query::tests::test_unicode_typo_tolerance_fixed
    search::new::query_term::parse_query::tests::test_various_unicode_scripts

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 267 filtered out; finished in 0.30s

@arthurgousset arthurgousset force-pushed the workback/patch/5594/FB6ED899-E821-4C88-AA79-8BB975E1937A branch from b3b1abc to 263300b Compare June 4, 2025 11:19
Used `cargo insta test`
Reviewed with `cargo insta review`
Used `cargo insta test`
Reviewed with `cargo insta review`
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants