@lhoestq lhoestq commented Mar 26, 2021

Intro

This should fix issue #1803

Currently, querying examples in a dataset is O(n) in the number of chunks, because the underlying pyarrow ChunkedArrays are scanned linearly to find the chunk containing a given row.
To fix this, I implemented interpolation search, which is very effective here since datasets usually satisfy its assumption of evenly distributed chunks (the default chunk size is fixed).
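
For reference, here is a minimal sketch of the idea, searching the cumulative chunk offsets for the chunk containing a row index (a simplified illustration, not necessarily the exact code in this PR):

```python
from typing import List

def _interpolation_search(offsets: List[int], i: int) -> int:
    """Return j such that offsets[j] <= i < offsets[j + 1].

    `offsets` holds the cumulative row counts of the chunks
    (offsets[0] == 0 and offsets[-1] == total number of rows).
    With evenly sized chunks this takes O(log log n) steps on average,
    instead of a linear scan over the chunks.
    """
    lo, hi = 0, len(offsets) - 1
    while lo < hi and offsets[lo] <= i < offsets[hi]:
        # Guess the chunk index by linear interpolation between the bounds.
        k = lo + (i - offsets[lo]) * (hi - lo) // (offsets[hi] - offsets[lo])
        if offsets[k] <= i < offsets[k + 1]:
            return k  # found the chunk containing row i
        elif offsets[k] <= i:
            lo = k + 1  # row i is in a later chunk
        else:
            hi = k  # row i is in an earlier chunk
    raise IndexError(f"index {i} is out of range")
```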

Benchmark

Here is a benchmark I did on bookcorpus (74M rows):

For the current implementation:

>>> python speed.py
Loaded dataset 'bookcorpus', len=74004228, nbytes=4835358766


========================= Querying unshuffled bookcorpus =========================

Avg access time key=1                                                 : 0.018ms
Avg access time key=74004227                                          : 0.215ms
Avg access time key=range(74003204, 74004228)                         : 1.416ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 92.532ms

========================== Querying shuffled bookcorpus ==========================

Avg access time key=1                                                 : 0.187ms
Avg access time key=74004227                                          : 6.642ms
Avg access time key=range(74003204, 74004228)                         : 90.941ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 3448.456ms

For the new one, using interpolation search:

>>> python speed.py
Loaded dataset 'bookcorpus', len=74004228, nbytes=4835358766


========================= Querying unshuffled bookcorpus =========================

Avg access time key=1                                                 : 0.076ms
Avg access time key=74004227                                          : 0.056ms
Avg access time key=range(74003204, 74004228)                         : 1.807ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 24.028ms

========================== Querying shuffled bookcorpus ==========================

Avg access time key=1                                                 : 0.061ms
Avg access time key=74004227                                          : 0.058ms
Avg access time key=range(74003204, 74004228)                         : 22.166ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 42.757ms

The RandIter class is just an iterable of 1024 random indices from 0 to 74004228.
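
The full `speed.py` isn't shown here, but a minimal harness in the same spirit could look like this (the `RandIter` and timing loop below are illustrative reconstructions, not the exact benchmark code):

```python
import random
import time

class RandIter:
    """Iterable of `size` random indices in [low, high), reproducible via `seed`."""

    def __init__(self, low: int, high: int, size: int, seed: int):
        self.low, self.high, self.size, self.seed = low, high, size, seed

    def __iter__(self):
        rng = random.Random(self.seed)
        for _ in range(self.size):
            yield rng.randrange(self.low, self.high)

def avg_access_time_ms(dataset, key, num_runs: int = 10) -> float:
    """Average wall-clock time of querying dataset[key], in milliseconds."""
    start = time.perf_counter()
    for _ in range(num_runs):
        if isinstance(key, RandIter):
            for i in key:  # query each random index individually
                _ = dataset[i]
        else:
            _ = dataset[key]
    return (time.perf_counter() - start) * 1000 / num_runs
```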

Here is also a plot showing the speed improvement depending on the dataset size:
[Plot: average query time of the old vs. new implementation as a function of dataset size]

Implementation details:

  • datasets.table.Table objects implement interpolation search for the slice method
  • The interpolation search requires storing the offsets of all the chunks of a table. These offsets are computed and stored when the Table is initialized.
  • datasets.table.Table.slice returns a datasets.table.Table using interpolation search
  • datasets.table.Table.fast_slice returns a pyarrow.Table object using interpolation search. This is useful for getting part of a dataset when we don't need the indexing structure for further computations, e.g. when querying an example as a dictionary (see the usage sketch after this list).
  • Now a Dataset object is always backed by a datasets.table.Table object. If one passes a pyarrow.Table to initialize a Dataset, it is converted to a datasets.table.Table.
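
Here is a small usage sketch of the new wrapper (the constructor and entry points below are assumptions for illustration and may differ from what ships in the PR):

```python
import pyarrow as pa
from datasets.table import Table  # the wrapper described in this PR

# Build a pyarrow Table out of several chunks, so that chunk lookup matters.
parts = [
    pa.Table.from_pydict({"text": [f"row {i}" for i in range(s, s + 1000)]})
    for s in range(0, 5000, 1000)
]
pa_table = pa.concat_tables(parts)

table = Table(pa_table)            # assumed constructor, may differ
sub = table.slice(2500, 10)        # datasets.table.Table, keeps the indexing structure
fast = table.fast_slice(2500, 10)  # plain pyarrow.Table, cheaper when indexing isn't needed
print(type(sub), type(fast))
```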

Checklist:

  • implement interpolation search
  • use datasets.table.Table in Dataset objects
  • update current tests
  • add tests for interpolation search
  • add comments and docstrings
  • add the benchmark to the CI

Fix #1803.
