From e7c6f8e116f7fb6de239a3339637d28fded7f296 Mon Sep 17 00:00:00 2001
From: Matthew McNeely
Date: Thu, 21 Aug 2025 17:38:32 -0400
Subject: [PATCH] Add ngram updates

---
 dgraph/concepts/index-tokenize.mdx | 6 +-
 dgraph/dql/functions.mdx | 31 ++++++++-
 dgraph/dql/indexes.mdx | 1 +
 dgraph/graphql/schema/dgraph-schema.mdx | 1 +
 dgraph/graphql/schema/directives/search.mdx | 73 +++++++++++++--------
 5 files changed, 80 insertions(+), 32 deletions(-)

diff --git a/dgraph/concepts/index-tokenize.mdx b/dgraph/concepts/index-tokenize.mdx
index 88bf611d..c6a3ab0d 100644
--- a/dgraph/concepts/index-tokenize.mdx
+++ b/dgraph/concepts/index-tokenize.mdx
@@ -27,6 +27,6 @@ property. E.g. if a Book Node has a Title attribute, and you add a "term" index,
 each word (term) in the text will be indexed. The word "Tokenizer" derives its
 name from tokenizing operations to create this index type.

-Similary if the Book has a publicationDateTime you can add a day or year index.
-The "tokenizer" here extracts the value to be indexed, which may be the day or
-hour of the dateTime, or only the year.
+Similarly, if the Book has a publicationDateTime you can add a day or year
+index. The "tokenizer" here extracts the value to be indexed, which may be the
+day or hour of the dateTime, or only the year.
diff --git a/dgraph/dql/functions.mdx b/dgraph/dql/functions.mdx
index db23df93..eede7af7 100644
--- a/dgraph/dql/functions.mdx
+++ b/dgraph/dql/functions.mdx
@@ -80,8 +80,8 @@ Schema Types: `string`

 Index Required: `term`

-Matches strings that have any of the specified terms in any order; case
-insensitive.
+Matches strings that have any of the specified terms in any order (case
+insensitive).

 #### Usage at root

@@ -117,6 +117,31 @@ Steven Spielberg.
 }
 ```

+## N-gram search
+
+Syntax Examples: `ngram(predicate, "a string of text")`
+
+Schema Types: `string`
+
+Index Required: `ngram`
+
+The `ngram` index tokenizes a string into shingles (contiguous sequences of n
+words), with support for stop word removal and stemming. The `ngram` function
+matches strings that contain the given sequence of terms.
+
+#### Usage at root
+
+Query example: all nodes that have a `name` containing `quick`, `brown`, and
+`fox`.
+
+```json
+{
+  me(func: ngram(name@en, "quick brown fox")) {
+    name@en
+  }
+}
+```
+
 ## Regular expressions

 Syntax Examples: `regexp(predicate, /regular-expression/)` or case insensitive
@@ -474,7 +499,7 @@ Query Example: Movies initially released in 1977, listed by genre.
 }
 ```

-## uid
+## UID

 Syntax Examples:

diff --git a/dgraph/dql/indexes.mdx b/dgraph/dql/indexes.mdx
index a15aaee4..bc450904 100644
--- a/dgraph/dql/indexes.mdx
+++ b/dgraph/dql/indexes.mdx
@@ -43,6 +43,7 @@ The indices available for strings are as follows.
 | `le`, `ge`, `lt`, `gt` | `exact` | Allows faster sorting. |
 | `allofterms`, `anyofterms` | `term` | Allows searching by a term in a sentence. |
 | `alloftext`, `anyoftext` | `fulltext` | Matching with language specific stemming and stopwords. |
+| `ngram` | `ngram` | Contiguous sequence matching (shingles) with stop word removal and stemming. |
 | `regexp` | `trigram` | Regular expression matching. Can also be used for equality checking. |
diff --git a/dgraph/graphql/schema/dgraph-schema.mdx b/dgraph/graphql/schema/dgraph-schema.mdx
index 78b91b2b..f1e1bcad 100644
--- a/dgraph/graphql/schema/dgraph-schema.mdx
+++ b/dgraph/graphql/schema/dgraph-schema.mdx
@@ -67,6 +67,7 @@ enum DgraphIndex {
   term
   fulltext
   trigram
+  ngram
   regexp
   year
   month
diff --git a/dgraph/graphql/schema/directives/search.mdx b/dgraph/graphql/schema/directives/search.mdx
index 145b68b7..0295a109 100644
--- a/dgraph/graphql/schema/directives/search.mdx
+++ b/dgraph/graphql/schema/directives/search.mdx
@@ -85,15 +85,13 @@ contain the term "GraphQL".

 ```graphql
 queryAuthor(filter: { name: { eq: "Diggy" } } ) {
-  posts(filter: { title: { anyofterms: "GraphQL" }}) {
-    title
   }
 }
 ```

 Dgraph can build search types with the ability to search between a range. For
-example with the above Post type with datePublished field, a query can find
-publish dates within a range
+example, with the preceding Post type with the `datePublished` field, a query
+can find publish dates within a range.

 ```graphql
 query {
@@ -104,8 +102,8 @@ query {
 ```

 Dgraph can also build GraphQL search ability to find match a value from a list.
-For example with the above Author type with the name field, a query can return
-the Authors that match a list
+For example with the preceding Author type with the name field, a query can
+return the Authors that match a list

 ```graphql
 queryAuthor(filter: { name: { in: ["Diggy", "Jarvis"] } } ) {
@@ -115,13 +113,13 @@ queryAuthor(filter: { name: { in: ["Diggy", "Jarvis"] } } ) {

 There's different search possible for each type as explained below.

-### Int, Float and DateTime
+### Int, float and dateTime

 | argument | constructed filter |
 | -------- | ------------------------------------------------- |
 | none | `lt`, `le`, `eq`, `in`, `between`, `ge`, and `gt` |

-Search for fields of types `Int`, `Float` and `DateTime` is enabled by adding
+Search for fields of types `Int`, `Float` and `dateTime` is enabled by adding
 `@search` to the field with no arguments. For example, if a schema contains:

 ```graphql
@@ -187,7 +185,7 @@ queryAuthor(filter: { name: { eq: "Diggy" } } ) {
 }
 ```

-### DateTime
+### dateTime

 | argument | constructed filters |
 | --------------------------------- | ------------------------------------------------- |
@@ -198,14 +196,14 @@ the search index should be built: by year, month, day or hour. `@search`
 defaults to year, but once you understand your data and query patterns, you
 might want to changes that like `@search(by: [day])`.

-### Boolean
+### Boolean fields

 | argument | constructed filter |
 | -------- | ------------------ |
 | none | `true` and `false` |

-Booleans can only be tested for true or false. If `isPublished: Boolean @search`
-is in the schema, then the search allows
+Boolean fields can only be tested for `true` or `false`. If
+`isPublished: Boolean @search` is in the schema, then the search allows

 ```graphql
 filter: { isPublished: true }
@@ -229,6 +227,7 @@ you have the following options as arguments to `@search`.
 | `regexp` | `regexp` (regular expressions) |
 | `term` | `allofterms` and `anyofterms` |
 | `fulltext` | `alloftext` and `anyoftext` |
+| `ngram` | `ngram` |

 - _Schema rule_: `hash` and `exact` can't be used together.

@@ -250,7 +249,7 @@ query {
 }
 ```

-to find users with names lexicographically after "Diggy".
+to find users with names lexicographically after "Diggy."

 #### String regular expression search

@@ -283,12 +282,8 @@ query {
 }
 ```

-will match all posts with both "GraphQL and "tutorial" in the title, while
 `anyofterms: "GraphQL tutorial"` would match posts with either "GraphQL" or
-"tutorial".

-`fulltext` search is Google-stye text search with stop words, stemming. etc. So
-`alloftext: "run woman"` would match "run" as well as "running", etc. For
 example, to find posts that talk about fantastic GraphQL tutorials:

 ```graphql
@@ -297,6 +292,32 @@ query {
 }
 ```

+#### String ngram search
+
+The `ngram` index tokenizes a string into contiguous sequences of n words, with
+support for stop word removal and stemming. N-gram search matches if the indexed
+string contains the given sequence of terms.
+
+If the schema has
+
+```graphql
+type Post {
+  title: String @search(by: [ngram])
+  ...
+}
+```
+
+then
+
+```graphql
+query {
+  queryPost(filter: { title: { ngram: "quick brown fox" } } ) { ... }
+}
+```
+
+will match all posts that contain the contiguous sequence "quick brown fox" in
+the title.
+
 #### Strings with multiple searches

 It is possible to add multiple string indexes to a field. For example to search
@@ -310,7 +331,7 @@ type Author {
 }
 ```

-### Enums
+### enums

 | argument | constructed searches |
 | -------- | --------------------------------------------------------------------- |
@@ -319,8 +340,8 @@ type Author {
 | `exact` | `lt`, `le`, `eq`, `in`, `between`, `ge`, and `gt` (lexicographically) |
 | `regexp` | `regexp` (regular expressions) |

-Enums are serialized in Dgraph as strings. `@search` with no arguments is the
-same as `@search(by: [hash])` and provides `eq` and `in` searches. Also
+enum fields are serialized in Dgraph as strings. `@search` with no arguments is
+the same as `@search(by: [hash])` and provides `eq` and `in` searches. Also
 available for enums are `exact` and `regexp`. For hash and exact search on
 enums, the literal enum value, without quotes `"..."`, is used, for regexp,
 strings are required. For example:
@@ -387,7 +408,7 @@ type Hotel {
 }
 ```

-#### near
+#### Near

 The `near` filter matches all entities where the location given by a field is
 within a distance `meters` from a coordinate.
@@ -408,7 +429,7 @@ queryHotel(filter: {
 }
 ```

-#### within
+#### Within

 The `within` filter matches all entities where the location given by a field is
 within a defined `polygon`.
@@ -441,7 +462,7 @@ queryHotel(filter: {
 }
 ```

-#### contains
+#### Contains

 The `contains` filter matches all entities where the `Polygon` or `MultiPolygon`
 field contains another given `point` or `polygon`.
@@ -489,7 +510,7 @@ A `contains` example using `polygon`:
 }
 ```

-#### intersects
+#### Intersects

 The `intersects` filter matches all entities where the `Polygon` or
 `MultiPolygon` field intersects another given `polygon` or `multiPolygon`.
@@ -579,8 +600,8 @@ Unions can be queried only as a field of a type. Union queries can't be ordered,
 but you can filter and paginate them.

-  Union queries do not support the `order` argument. The results will be ordered
-  by the `uid` of each node in ascending order.
+  Union queries don't support the `order` argument. The results will be ordered
+  by the UID of each node in ascending order.

 For example, the following schema will enable to query the `members` union field