This repository was archived by the owner on Jul 31, 2025. It is now read-only.
fix(5594): Cyrillic has different typo tolerance due to byte counting bug #8
Cyrillic Typo Tolerance Bug Fix Successfully Implemented
Summary
Successfully fixed the Unicode typo tolerance bug in the Meilisearch milli crate that was causing incorrect typo tolerance for multi-byte Unicode text, including Cyrillic, Arabic, Hebrew, Chinese, Japanese, Korean, and other non-ASCII scripts.
Root Cause
The bug was located in the number_of_typos_allowed function in ./crates/milli/src/search/new/query_term/parse_query.rs at lines 205 and 209, where the code was using word.len() (byte count) instead of word.chars().count() (character count) to determine typo tolerance.
The Problem
"doggy".len() = 5 bytes = 5 characters
"собак".len() = 10 bytes ≠ 5 characters
This caused words with the same character count to receive different typo tolerance based on their byte representation.
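The mismatch can be reproduced outside Meilisearch. The following is a minimal sketch; the helper name byte_and_char_len is hypothetical and exists only for illustration.

```rust
/// Returns (UTF-8 byte length, number of `char`s) for a word.
/// `str::len` counts bytes; `chars().count()` counts Unicode scalar values.
fn byte_and_char_len(word: &str) -> (usize, usize) {
    (word.len(), word.chars().count())
}
```

For "doggy" both numbers are 5, while for "собак" the helper returns 10 bytes and 5 characters, because each Cyrillic letter takes two bytes in UTF-8.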
Fix Applied
Before (Buggy Code)
After (Fixed Code)
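A minimal sketch of the change, written as simplified standalone functions rather than the actual code in parse_query.rs. The function names and constants are hypothetical; the thresholds assume Meilisearch's documented defaults (one typo from 5 characters, two typos from 9), and any other checks the real function performs are omitted.

```rust
// Hypothetical thresholds, assumed to mirror Meilisearch's default settings.
const MIN_LEN_ONE_TYPO: usize = 5;
const MIN_LEN_TWO_TYPOS: usize = 9;

/// Maps a measured word length to the number of typos allowed.
fn typos_for_len(len: usize) -> u8 {
    if len < MIN_LEN_ONE_TYPO {
        0
    } else if len < MIN_LEN_TWO_TYPOS {
        1
    } else {
        2
    }
}

/// Before: the length is measured in UTF-8 bytes.
fn typos_allowed_by_bytes(word: &str) -> u8 {
    typos_for_len(word.len())
}

/// After: the length is measured in Unicode scalar values (`char`s).
fn typos_allowed_by_chars(word: &str) -> u8 {
    typos_for_len(word.chars().count())
}
```

Under these assumed thresholds, "doggy" is allowed one typo either way, whereas "собак" is allowed two typos when measured in bytes (10) and one typo when measured in characters (5). Note that chars().count() counts Unicode scalar values, not grapheme clusters, so text with combining marks may still count more than one unit per visible character.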
Changes Made
Line 205: word.len() to word.chars().count()
Line 209: word.len() to word.chars().count()
What Was Not Changed
The ngram_str.len() > MAX_WORD_LENGTH check at line 249 was intentionally left unchanged because MAX_WORD_LENGTH represents a byte-based limit for LMDB database storage, not a character-based limit.
Impact of the Fix
Verification
cargo check
Languages/Scripts Affected by This Fix
This fix resolves typo tolerance issues for all Unicode text, including Cyrillic, Arabic, Hebrew, Chinese, Japanese, and Korean.
The fix ensures that typo tolerance is determined by logical character count rather than UTF-8 byte representation, providing consistent behavior across all languages and writing systems.
Tests