Skip to content

The Marisa extension by Query.Farm integrates the fast, space-efficient MARISA trie into DuckDB, enabling high-performance string lookups, prefix searches, and autocomplete functionality.

License

Notifications You must be signed in to change notification settings

Query-farm/marisa

Repository files navigation

DuckDB Marisa Extension by Query.Farm

The Marisa extension, developed by Query.Farm, adds MARISA (Matching Algorithm with Recursively Implemented StorAge) trie functionality for DuckDB. MARISA is a static and space-efficient trie data structure that enables fast string lookups, prefix searches, and predictive text operations.

Use Cases

MARISA tries are particularly useful for:

  • Autocomplete/Type-ahead functionality: Use marisa_predictive() to find all completions for a partial string
  • Spell checking: Use marisa_lookup() to verify if words exist in a dictionary
  • URL routing: Efficiently match URL patterns and extract parameters
  • IP address prefix matching: Network routing and firewall rules
  • String deduplication: Compact storage of large sets of strings with common prefixes

Installation

marisa is a DuckDB Community Extension.

You can now use this by using this SQL:

install marisa from community;
load marisa;

The marisa extension provides several functions for working with MARISA tries:

Creating a Trie

Use the marisa_trie() aggregate function to create a trie from string data:

CREATE TABLE employees(name TEXT);
INSERT INTO employees VALUES('Alice'), ('Bob'), ('Charlie'), ('David'), ('Eve'), ('Frank'), ('Mallory'), ('Megan'), ('Oscar'), ('Melissa');

-- Create a trie from the employee names
CREATE TABLE employees_trie AS SELECT marisa_trie(name) AS trie FROM employees;

SELECT trie, octet_length(trie) FROM employees_trie;
┌─────────────────────────────────────────────────────────────────┬────────────────────┐
│                              trie                               │ octet_length(trie) │
│                              blob                               │       int64        │
├─────────────────────────────────────────────────────────────────┼────────────────────┤
│ We love Marisa.\x00\x08\x00\x00\x00\x00\x00\x00\x00\xFD\x1B`\…  │        4160        │
└─────────────────────────────────────────────────────────────────┴────────────────────┘

Lookup Function

Check if a string exists in the trie using marisa_lookup():

-- Check if 'Alice' exists in the trie (returns true)
SELECT marisa_lookup(trie, 'Alice') FROM employees_trie;
┌──────────────────────────────┐
│ marisa_lookup(trie, 'Alice') │
│           boolean            │
├──────────────────────────────┤
│ true                         │
└──────────────────────────────┘

-- Check if 'Unknown' exists in the trie (returns false)
SELECT marisa_lookup(trie, 'Unknown') FROM employees_trie;
┌────────────────────────────────┐
│ marisa_lookup(trie, 'Unknown') │
│            boolean             │
├────────────────────────────────┤
│ false                          │
└────────────────────────────────┘

Common Prefix Search

Find all strings in the trie that are prefixes of a given string using marisa_common_prefix():

CREATE TABLE countries(name TEXT);
INSERT INTO countries VALUES ('U'), ('US'), ('USA');
CREATE TABLE countries_trie AS SELECT marisa_trie(name) AS trie FROM countries;

-- Find all prefixes of 'USA' (returns ['U', 'US', 'USA'])
SELECT marisa_common_prefix(trie, 'USA', 10) FROM countries_trie;
┌───────────────────────────────────────┐
│ marisa_common_prefix(trie, 'USA', 10) │
│               varchar[]               │
├───────────────────────────────────────┤
│ [U, US, USA]                          │
└───────────────────────────────────────┘

Predictive Search

Find all strings in the trie that start with a given prefix using marisa_predictive():

-- Find all names starting with 'Me' (returns ['Megan', 'Melissa'])
SELECT marisa_predictive(trie, 'Me', 10) FROM employees_trie;
┌───────────────────────────────────┐
│ marisa_predictive(trie, 'Me', 10) │
│             varchar[]             │
├───────────────────────────────────┤
│ [Megan, Melissa]                  │
└───────────────────────────────────┘

Function Reference

marisa_trie(column)

Type: Aggregate Function Description: Creates a MARISA trie from string values in a column. Parameters:

  • column (VARCHAR): Column containing strings to build the trie from

Example:

SELECT marisa_trie(name) FROM employees;

marisa_lookup(trie, search_string)

Type: Scalar Function Description: Checks if a string exists in the trie. Parameters:

  • trie (BLOB): The trie created by marisa_trie()
  • search_string (VARCHAR): String to search for

Returns: BOOLEAN (true if found, false otherwise)

Example:

SELECT marisa_lookup(trie, 'Alice') FROM employees_trie;

marisa_common_prefix(trie, search_string, max_results)

Type: Scalar Function Description: Finds all strings in the trie that are prefixes of the search string. Parameters:

  • trie (BLOB): The trie created by marisa_trie()
  • search_string (VARCHAR): String to find prefixes for
  • max_results (INTEGER): Maximum number of results to return

Returns: LIST(VARCHAR) - List of prefix matches

Example:

SELECT marisa_common_prefix(trie, 'USA', 10) FROM countries_trie;
-- Returns: ['U', 'US', 'USA']

marisa_predictive(trie, prefix, max_results)

Type: Scalar Function Description: Finds all strings in the trie that start with the given prefix. Parameters:

  • trie (BLOB): The trie created by marisa_trie()
  • prefix (VARCHAR): Prefix to search for
  • max_results (INTEGER): Maximum number of results to return

Returns: LIST(VARCHAR) - List of strings starting with the prefix

Example:

SELECT marisa_predictive(trie, 'Me', 10) FROM employees_trie;
-- Returns: ['Megan', 'Melissa']

About

The Marisa extension by Query.Farm integrates the fast, space-efficient MARISA trie into DuckDB, enabling high-performance string lookups, prefix searches, and autocomplete functionality.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published