Fast table queries with interpolation search #2122
Merged
Intro
This should fix issue #1803.
Currently, querying examples in a dataset is O(n) because of the underlying pyarrow ChunkedArray implementation.
To fix this, I implemented interpolation search, which is very effective here because datasets usually satisfy the condition of evenly distributed chunks (the default chunk size is fixed).
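To illustrate the idea, here is a minimal sketch of interpolation search over chunk offsets. It is not the code from this PR: the function name `interpolation_search` and the `offsets` layout (cumulative chunk lengths starting at 0) are assumptions made for the example.

```python
def interpolation_search(offsets, i):
    """Return k such that offsets[k] <= i < offsets[k + 1].

    `offsets` is assumed to be a sorted list of cumulative chunk lengths
    starting at 0, so offsets[-1] is the total number of rows.
    """
    if not 0 <= i < offsets[-1]:
        raise IndexError(f"index {i} out of range")
    lo, hi = 0, len(offsets) - 1
    # Invariant: offsets[lo] <= i < offsets[hi]
    while True:
        # Guess the chunk by assuming rows are spread evenly between lo and hi.
        k = lo + (i - offsets[lo]) * (hi - lo) // (offsets[hi] - offsets[lo])
        if offsets[k] <= i < offsets[k + 1]:
            return k
        if i < offsets[k]:
            hi = k        # the target chunk is to the left of the guess
        else:
            lo = k + 1    # the target chunk is to the right of the guess


# Four chunks of sizes 1000, 1000, 1000 and 500
offsets = [0, 1000, 2000, 3000, 3500]
chunk = interpolation_search(offsets, 2500)
row_in_chunk = 2500 - offsets[chunk]
```

When all chunks have the same size, the first guess is already the right chunk, which is why a fixed default chunk size makes this approach work well.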
Benchmark
Here is a benchmark I did on bookcorpus (74M rows):
for the current implementation:
for the new one using interpolation search:
The RandIter class is just an iterable of 1024 random indices from 0 to 74004228.
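The benchmark code itself is not shown here, so the following is only a guess at what such a helper could look like. The constructor signature, the `seed` parameter, and the exclusive upper bound are assumptions, not the actual `RandIter` definition.

```python
import random


class RandIter:
    """Iterable of `size` random integers in [low, high) (hypothetical sketch)."""

    def __init__(self, low, high, size, seed=None):
        rng = random.Random(seed)
        # Draw the indices once so that repeated iteration yields the same values.
        self.indices = [rng.randrange(low, high) for _ in range(size)]

    def __iter__(self):
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)


# 1024 random row indices, matching the numbers quoted above
indices = RandIter(0, 74004228, 1024, seed=0)
```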
Here is also a plot showing the speed improvement depending on the dataset size:

Implementation details:
- datasets.table.Table objects implement interpolation search for the slice method
- [...] Table is initialized.
- datasets.table.Table.slice returns a datasets.table.Table using interpolation search
- datasets.table.Table.fast_slice returns a pyarrow.Table object using interpolation search. This is useful to get a part of a dataset if we don't need the indexing structure for future computations. For example, it's used when querying an example as a dictionary.
- A Dataset object is always backed by a datasets.table.Table object. If one passes a pyarrow.Table to initialize a Dataset, then it's converted to a datasets.table.Table

Checklist:
- [...] datasets.table.Table in Dataset objects

Fix #1803.
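The split between slice and fast_slice described in the implementation details can be sketched with plain Python lists standing in for pyarrow chunks. This is an illustration under assumptions, not the datasets.table.Table implementation: the class name `IndexedChunks`, its `offsets` attribute, and the choice to compute offsets in the constructor are all hypothetical.

```python
import itertools


class IndexedChunks:
    """Hypothetical stand-in for a table wrapper: chunks plus their offsets."""

    def __init__(self, chunks):
        self.chunks = list(chunks)
        # Indexing structure, computed once: [0, len(c0), len(c0) + len(c1), ...]
        self.offsets = [0] + list(itertools.accumulate(len(c) for c in self.chunks))

    def _chunk_index(self, i):
        # Interpolation search; assumes 0 <= i < self.offsets[-1].
        lo, hi = 0, len(self.offsets) - 1
        while True:
            span = self.offsets[hi] - self.offsets[lo]
            k = lo + (i - self.offsets[lo]) * (hi - lo) // span
            if self.offsets[k] <= i < self.offsets[k + 1]:
                return k
            if i < self.offsets[k]:
                hi = k
            else:
                lo = k + 1

    def fast_slice(self, offset=0, length=None):
        """Return the selected rows as raw chunk pieces, without an index."""
        if offset < 0:
            raise IndexError("offset must be non-negative")
        total = self.offsets[-1]
        stop = total if length is None else min(offset + length, total)
        if offset >= total or stop <= offset:
            return []
        k = self._chunk_index(offset)
        pieces = []
        while k < len(self.chunks) and self.offsets[k] < stop:
            start_in_chunk = max(offset - self.offsets[k], 0)
            stop_in_chunk = min(stop, self.offsets[k + 1]) - self.offsets[k]
            piece = self.chunks[k][start_in_chunk:stop_in_chunk]
            if piece:  # skip empty chunks
                pieces.append(piece)
            k += 1
        return pieces

    def slice(self, offset=0, length=None):
        """Same rows as fast_slice, re-wrapped so the result keeps its own offsets."""
        return IndexedChunks(self.fast_slice(offset, length))


table = IndexedChunks([[0, 1, 2], [3, 4, 5], [6, 7]])
raw = table.fast_slice(2, 3)    # plain pieces, no offsets rebuilt
wrapped = table.slice(2, 5)     # a new wrapper with fresh offsets
```

In this sketch, fast_slice skips rebuilding the offsets, which is the saving that matters when the result is consumed immediately, for example when converting a single example to a dictionary.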