-
Notifications
You must be signed in to change notification settings - Fork 0
Goatherd\Commons\Word
Generic digital search tree (or trie) implementation approach.
There are several proposals for very basic tree or trie data structures in PHP. But most (reviewed) implementations use inefficient node objects restricting there practical use. None was found to provide additional logic, like iteration, suffix compression or balancing (in case of ternary search trees).
Note that there are superior ways to represent dictionary data when coded and packed using bit masks for example. If you require any of those, do not use PHP for your project. (well I did not check, but there might be some difference in time and space when processing the human genome in C/Perl vs PHP).
- implemented using arrays
- optimised for retrieval (iteration, search)
- optimised for de-serialisation (speed and size)
- must hold millions of short words (<50 characters) within reasonable time and space
- must provide iteration logic
- must provide serialisation logic
- must provide fuzzy search
- must provide prefix search (list all words with common prefix)
- may provide suffix search
- must provide a generic trie
- should provide a generic tree implementation
- should provide a balanced ternary search tree (TST)
- must provide a (read-only) DAG-transformation for TST and generic tries
- should provide a radix tree/ compact DAWG
- may provide a suffix tree
Initial profiling done. Trie, DAG and TST design approach reviewed. Interface design started.
Generic tree and digital search tree implementation provided for interface testing.
Design is based upon node arrays (as opposed to list representation or node objects). Tree data structures are optimised for traversal and de-serialisation.
Edges do share the limitations of array keys: only integer and non-integer strings are supported. (integer string and any scalar data type will be converted).
Might be implemented using the node layout array(N_DATA , N_CHILDREN) with N_DATA being optional. N_CHILDREN is an array of node pointers denoted by edge-label.
It should be tested if flat arrays are worth the overhead of escape sequences.
Provide a simple trie. Nodes are flat arrays and edges 'short strings' (as PHP does not know a char data type).
An extension is provided that compresses leaves by replacing array(-1=>$data) with $data ($data being boolean true if not used as hash table).
Is provided as suffix-compression algorithm for tries. Results will most likely be read-only.
Might use the node layout array(N_DATA, N_CHAR, N_LO, N_EQ, N_HI) with N_DATA being optional.
N_LO, N_EQ and N_HI are node pointers.
Initial interface proposal and sample implementation scheduled for July 2012.