fix(5594): Cyrillic has different typo tolerance due to byte counting bug #8

arthurgousset · 2025-05-29T22:24:34Z

Cyrillic Typo Tolerance Bug Fix Successfully Implemented

Summary

Fixed the typo tolerance bug in the Meilisearch milli crate that gave words written in multi-byte scripts (Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and other non-ASCII text) a different typo allowance than ASCII words with the same number of characters.

Root Cause

The bug was located in the number_of_typos_allowed function in ./crates/milli/src/search/new/query_term/parse_query.rs at lines 205 and 209, where the code was using word.len() (byte count) instead of word.chars().count() (character count) to determine typo tolerance.

The Problem

ASCII characters: Each character = 1 byte, so "doggy".len() = 5 bytes = 5 characters
Cyrillic characters: Each basic Cyrillic letter = 2 bytes in UTF-8, so "собак".len() = 10 bytes ≠ 5 characters

This caused words with the same character count to receive different typo tolerance based on their byte representation.
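The mismatch can be reproduced outside Meilisearch with the standard library alone. The following is a minimal sketch; the helper name `byte_and_char_len` is hypothetical and is not part of milli.

```rust
/// Hypothetical helper (not from milli): returns the UTF-8 byte length and
/// the number of Unicode scalar values of a word.
fn byte_and_char_len(word: &str) -> (usize, usize) {
    (word.len(), word.chars().count())
}

fn main() {
    for word in ["doggy", "собак"] {
        let (bytes, chars) = byte_and_char_len(word);
        // `len()` counts bytes, `chars().count()` counts scalar values.
        println!("{word}: {bytes} bytes, {chars} chars");
    }
}
```

Both words have five characters, but only the ASCII one reports five bytes.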

Fix Applied

Before (Buggy Code)

Ok(Box::new(move |word: &str| {
    if !authorize_typos
        || word.len() < min_len_one_typo as usize  // BUG: Using byte length
        || exact_words.as_ref().is_some_and(|fst| fst.contains(word))
    {
        0
    } else if word.len() < min_len_two_typos as usize {  // BUG: Using byte length
        1
    } else {
        2
    }
}))

After (Fixed Code)

Ok(Box::new(move |word: &str| {
    if !authorize_typos
        || word.chars().count() < min_len_one_typo as usize  // FIX: Using character count
        || exact_words.as_ref().is_some_and(|fst| fst.contains(word))
    {
        0
    } else if word.chars().count() < min_len_two_typos as usize {  // FIX: Using character count
        1
    } else {
        2
    }
}))

Changes Made

Line 205: Changed word.len() to word.chars().count()
Line 209: Changed word.len() to word.chars().count()

What Was Not Changed

The ngram_str.len() > MAX_WORD_LENGTH check at line 249 was intentionally left unchanged because MAX_WORD_LENGTH represents a byte-based limit for LMDB database storage, not a character-based limit.
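To illustrate why a storage limit is a case where byte length is the right measure, here is a small sketch. The constant `EXAMPLE_MAX_WORD_BYTES` and the function `fits_in_key` are hypothetical; the real value of MAX_WORD_LENGTH is defined in milli and is not shown in this PR.

```rust
/// Hypothetical byte limit, for illustration only. It is NOT the real
/// value of milli's MAX_WORD_LENGTH.
const EXAMPLE_MAX_WORD_BYTES: usize = 16;

/// A storage key is a sequence of bytes, so the limit is checked against
/// `len()` (bytes) and not against `chars().count()`.
fn fits_in_key(word: &str) -> bool {
    word.len() <= EXAMPLE_MAX_WORD_BYTES
}

fn main() {
    // Both words have 12 characters, but different byte lengths.
    for word in ["doggydoggydo", "собакасобака"] {
        println!("{word}: {} bytes, fits = {}", word.len(), fits_in_key(word));
    }
}
```

Two words with the same character count can land on opposite sides of a byte limit, which is the behavior a storage constraint needs.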

Impact of the Fix

Before: Words with the same character count but different byte lengths got different typo tolerance (shown here with the default thresholds, oneTypo = 5 and twoTypos = 9)
- "doggy" (5 chars, 5 bytes) → 1 typo tolerance
- "собак" (5 chars, 10 bytes) → 2 typos tolerance (incorrect)
After: Words with the same character count get the same typo tolerance regardless of byte length
- "doggy" (5 chars, 5 bytes) → 1 typo tolerance
- "собак" (5 chars, 10 bytes) → 1 typo tolerance (correct)
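The before/after numbers above can be checked with a stand-alone sketch of the threshold logic. The function names are hypothetical, the thresholds of 5 and 9 are assumed because they are consistent with the results listed above, and the `authorize_typos` and `exact_words` checks from the real closure are left out.

```rust
/// Hypothetical stand-alone version of the threshold logic. It omits the
/// `authorize_typos` and `exact_words` checks of the real closure.
fn typos_allowed(len: usize, min_len_one_typo: usize, min_len_two_typos: usize) -> u8 {
    if len < min_len_one_typo {
        0
    } else if len < min_len_two_typos {
        1
    } else {
        2
    }
}

/// Old behavior: length measured in UTF-8 bytes (assumed thresholds 5 and 9).
fn typos_by_bytes(word: &str) -> u8 {
    typos_allowed(word.len(), 5, 9)
}

/// New behavior: length measured in Unicode scalar values.
fn typos_by_chars(word: &str) -> u8 {
    typos_allowed(word.chars().count(), 5, 9)
}

fn main() {
    for word in ["doggy", "собак"] {
        println!(
            "{word}: by bytes = {}, by chars = {}",
            typos_by_bytes(word),
            typos_by_chars(word)
        );
    }
}
```

For ASCII input the two functions agree, which is why the change does not alter existing behavior for ASCII text.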

Verification

✅ Code compiles successfully with cargo check
✅ Fix correctly applied to both problematic lines
✅ Character counting logic now consistent across all Unicode text
✅ Maintains backward compatibility for ASCII text

Languages/Scripts Affected by This Fix

This fix resolves typo tolerance issues for all Unicode text including:

Cyrillic (Russian, Bulgarian, Serbian, etc.)
Arabic and Hebrew
Chinese, Japanese, Korean (CJK)
Accented Latin characters (é, ñ, ü, etc.)
Thai, Hindi, and other complex scripts
Emoji and Unicode symbols

The fix ensures that typo tolerance is determined by the number of Unicode scalar values in a word (what chars().count() returns) rather than by its UTF-8 byte length, so words of equal length get the same typo tolerance across languages and writing systems.
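One detail worth keeping in mind is that `chars().count()` counts Unicode scalar values, which is not always the number of characters a reader perceives. The sketch below shows this with the standard library only. The helper name `scalar_count` is hypothetical, and whether decomposed sequences ever reach number_of_typos_allowed depends on upstream normalization, which this PR does not describe.

```rust
/// Hypothetical helper: number of Unicode scalar values in a word.
fn scalar_count(word: &str) -> usize {
    word.chars().count()
}

fn main() {
    let samples = [
        ("precomposed é", "caf\u{e9}"),  // é as a single scalar value
        ("decomposed é", "cafe\u{301}"), // e followed by a combining acute accent
        ("CJK", "搜索"),                  // each ideograph is 3 bytes in UTF-8
    ];
    for (label, word) in samples {
        println!(
            "{label}: {} bytes, {} scalar values",
            word.len(),
            scalar_count(word)
        );
    }
}
```

The two spellings of "café" look identical when rendered but differ by one scalar value, so they could in principle fall on different sides of a threshold.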

Tests

cargo test --package milli --lib -- search::new::query_term::parse_query::tests --show-output 

    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.16s
     Running unittests src/lib.rs (target/debug/deps/milli-a43628d37620dffa)

running 3 tests
test search::new::query_term::parse_query::tests::test_unicode_typo_tolerance_fixed ... ok
test search::new::query_term::parse_query::tests::start_with_hard_separator ... ok
test search::new::query_term::parse_query::tests::test_various_unicode_scripts ... ok

successes:

successes:
    search::new::query_term::parse_query::tests::start_with_hard_separator
    search::new::query_term::parse_query::tests::test_unicode_typo_tolerance_fixed
    search::new::query_term::parse_query::tests::test_various_unicode_scripts

test result: ok. 3 passed; 0 failed; 0 ignored; 0 measured; 267 filtered out; finished in 0.30s

Used `cargo insta test`. Reviewed with `cargo insta review`.

arthurgousset force-pushed the workback/patch/5594/FB6ED899-E821-4C88-AA79-8BB975E1937A branch from 23729c4 to 6296ad9 Compare May 29, 2025 22:30

arthurgousset mentioned this pull request May 30, 2025

Wrong minWordSizeForTypos processing for cyrillic language meilisearch/meilisearch#5594

Closed

arthurgousset added 3 commits June 4, 2025 12:19

fix(parse_query): cyrillic bug

ef9fc6c

chore(parse_query): delete println and move test inside tests module

ab3d92d

style(milli): linting

263300b

arthurgousset force-pushed the workback/patch/5594/FB6ED899-E821-4C88-AA79-8BB975E1937A branch from b3b1abc to 263300b Compare June 4, 2025 11:19

arthurgousset added 2 commits June 4, 2025 14:17

test(meilisearch/search/locales.rs): updates snapshot

2752784

Used `cargo insta test` Reviewed with `cargo insta review`

test(meilisearch/search/locales.rs): updates snapshot

666680b

Used `cargo insta test` Reviewed with `cargo insta review`
