At the moment, RediSearch uses a very simple tokenizer for documents and a slightly more sophisticated tokenizer for queries. Both allow a degree of control over string escaping and tokenization.
Note: Text fields and tag fields are tokenized by different mechanisms; this document covers only text fields. For tag fields, please refer to the [Tag Fields](/Tags) documentation.
## The Rules of Text Field Tokenization
1. All punctuation marks and whitespace (besides underscores) separate the document and queries into tokens. For example, any character in `,.<>{}[]"':;!@#$%^&*()-+=~` will break the text into terms, so the text `foo-bar.baz...bag` will be tokenized into `[foo, bar, baz, bag]`.
2. Escaping separators in both queries and documents is done by prepending a backslash to any separator. For example, the text `hello\-world hello-world` will be tokenized as `[hello-world, hello, world]`. **Note** that in most languages you will need an extra backslash to signify an actual backslash when formatting the document or query, so in redis-cli, for example, the text will be entered as `hello\\-world`.
3. Underscores (`_`) are not used as separators in either documents or queries, so the text `hello_world` will remain as is after tokenization.
4. Repeating spaces or punctuation marks are stripped.
5. Latin characters are all converted to lowercase.
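The rules above can be illustrated with a minimal Python sketch. This is not the actual RediSearch implementation (which is written in C); it is a hypothetical model of the documented behavior: separators split tokens, a backslash escapes the following separator, underscores pass through, repeated separators produce no empty tokens, and Latin letters are lowercased.

```python
# Sketch of RediSearch-style text-field tokenization (illustrative only).
SEPARATORS = set(",.<>{}[]\"':;!@#$%^&*()-+=~ \t\n")

def tokenize(text):
    tokens, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == "\\" and i + 1 < len(text) and text[i + 1] in SEPARATORS:
            # Rule 2: an escaped separator becomes part of the current token.
            current.append(text[i + 1])
            i += 2
        elif ch in SEPARATORS:
            # Rules 1 and 4: separators end the current token; runs of
            # separators yield no empty tokens.
            if current:
                tokens.append("".join(current))
                current = []
            i += 1
        else:
            # Rules 3 and 5: underscores fall through here and are kept;
            # Latin characters are lowercased.
            current.append(ch.lower())
            i += 1
    if current:
        tokens.append("".join(current))
    return tokens
```

For example, `tokenize("foo-bar.baz...bag")` yields `['foo', 'bar', 'baz', 'bag']`, and `tokenize(r"hello\-world hello-world")` yields `['hello-world', 'hello', 'world']`.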