Skip to content

Goatherd\Commons\Word

goatherd edited this page Jun 28, 2012 · 4 revisions

Goatherd\Commons\Word

Generic digital search tree (or trie) implementation approach.

There are several proposals for very basic tree or trie data structures in PHP. But most (reviewed) implementations use inefficient node objects restricting there practical use. None was found to provide additional logic, like iteration, suffix compression or balancing (in case of ternary search trees).

Note that there are superior ways to represent dictionary data when coded and packed using bit masks for example. If you require any of those, do not use PHP for your project. (well I did not check, but there might be some difference in time and space when processing the human genome in C/Perl vs PHP).

Should solve

  • implemented using arrays
  • optimised for retrieval (iteration, search)
  • optimised for de-serialisation (speed and size)
  • must hold millions of short words (<50 characters) within reasonable time and space
  • must provide iteration logic
  • must provide serialisation logic
  • must provide fuzzy search
  • must provide prefix search (list all words with common prefix)
  • may provide suffix search
  • must provide a generic trie
  • should provide a generic tree implementation
  • should provide a balanced ternary search tree (TST)
  • must provide a (read-only) DAG-transformation for TST and generic tries
  • should provide a radix tree/ compact DAWG
  • may provide a suffix tree

Proposal state

Initial profiling done. Trie, DAG and TST design approach reviewed. Interface design started.

Generic tree and digital search tree implementation provided for interface testing.

Design

Design is based upon node arrays (as opposed to list representation or node objects). Tree data structures are optimised for traversal and de-serialisation.

drawbacks

Edges do share the limitations of array keys: only integer and non-integer strings are supported. (integer string and any scalar data type will be converted).

Generic tree

Might be implemented using the node layout array(N_DATA , N_CHILDREN) with N_DATA being optional. N_CHILDREN is an array of node pointers denoted by edge-label.

It should be tested if flat arrays are worth the overhead of escape sequences.

Digital search tree (DST)

Provide a simple trie. Nodes are flat arrays and edges 'short strings' (as PHP does not know a char data type).

An extension is provided that compresses leaves by replacing array(-1=>$data) with $data ($data being boolean true if not used as hash table).

Directed acyclic word graph (DAWG)

Is provided as suffix-compression algorithm for tries. Results will most likely be read-only.

Ternary search tree

Might use the node layout array(N_DATA, N_CHAR, N_LO, N_EQ, N_HI) with N_DATA being optional. N_LO, N_EQ and N_HI are node pointers.

Roadmap

Initial interface proposal and sample implementation scheduled for July 2012.