@lhoestq lhoestq commented Mar 26, 2021

Intro

This should fix issue #1803

Currently, querying examples in a dataset is O(n) in the number of chunks, because the underlying pyarrow ChunkedArrays are scanned linearly to find the chunk containing a given row.
To fix this, I implemented interpolation search, which is very effective here since datasets usually satisfy its assumption of evenly distributed chunks (the default chunk size is fixed).
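
For reference, here is a minimal sketch of the idea, searching the cumulative chunk offsets for the chunk containing a row index (a simplified illustration, not necessarily the exact code in this PR):

```python
from typing import List

def _interpolation_search(offsets: List[int], i: int) -> int:
    """Return j such that offsets[j] <= i < offsets[j + 1].

    `offsets` holds the cumulative row counts of the chunks
    (offsets[0] == 0 and offsets[-1] == total number of rows).
    With evenly sized chunks this takes O(log log n) steps on average,
    instead of a linear scan over the chunks.
    """
    lo, hi = 0, len(offsets) - 1
    while lo < hi and offsets[lo] <= i < offsets[hi]:
        # Guess the chunk index by linear interpolation between the bounds.
        k = lo + (i - offsets[lo]) * (hi - lo) // (offsets[hi] - offsets[lo])
        if offsets[k] <= i < offsets[k + 1]:
            return k  # found the chunk containing row i
        elif offsets[k] <= i:
            lo = k + 1  # row i is in a later chunk
        else:
            hi = k  # row i is in an earlier chunk
    raise IndexError(f"index {i} is out of range")
```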

Benchmark

Here is a benchmark I did on bookcorpus (74M rows):

For the current implementation:

>>> python speed.py
Loaded dataset 'bookcorpus', len=74004228, nbytes=4835358766


========================= Querying unshuffled bookcorpus =========================

Avg access time key=1                                                 : 0.018ms
Avg access time key=74004227                                          : 0.215ms
Avg access time key=range(74003204, 74004228)                         : 1.416ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 92.532ms

========================== Querying shuffled bookcorpus ==========================

Avg access time key=1                                                 : 0.187ms
Avg access time key=74004227                                          : 6.642ms
Avg access time key=range(74003204, 74004228)                         : 90.941ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 3448.456ms

For the new one, using interpolation search:

>>> python speed.py
Loaded dataset 'bookcorpus', len=74004228, nbytes=4835358766


========================= Querying unshuffled bookcorpus =========================

Avg access time key=1                                                 : 0.076ms
Avg access time key=74004227                                          : 0.056ms
Avg access time key=range(74003204, 74004228)                         : 1.807ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 24.028ms

========================== Querying shuffled bookcorpus ==========================

Avg access time key=1                                                 : 0.061ms
Avg access time key=74004227                                          : 0.058ms
Avg access time key=range(74003204, 74004228)                         : 22.166ms
Avg access time key=RandIter(low=0, high=74004228, size=1024, seed=42): 42.757ms

The RandIter class is just an iterable of 1024 random indices from 0 to 74004228.
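
The full `speed.py` isn't shown here, but a minimal harness in the same spirit could look like this (the `RandIter` and timing loop below are illustrative reconstructions, not the exact benchmark code):

```python
import random
import time

class RandIter:
    """Iterable of `size` random indices in [low, high), reproducible via `seed`."""

    def __init__(self, low: int, high: int, size: int, seed: int):
        self.low, self.high, self.size, self.seed = low, high, size, seed

    def __iter__(self):
        rng = random.Random(self.seed)
        for _ in range(self.size):
            yield rng.randrange(self.low, self.high)

def avg_access_time_ms(dataset, key, num_runs: int = 10) -> float:
    """Average wall-clock time of querying dataset[key], in milliseconds."""
    start = time.perf_counter()
    for _ in range(num_runs):
        if isinstance(key, RandIter):
            for i in key:  # query each random index individually
                _ = dataset[i]
        else:
            _ = dataset[key]
    return (time.perf_counter() - start) * 1000 / num_runs
```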

Here is also a plot showing the speed improvement depending on the dataset size:
[Plot: average query time of the old vs. new implementation as a function of dataset size]

Implementation details:

  • datasets.table.Table objects implement interpolation search for the slice method
  • The interpolation search requires storing the offsets of all the chunks of a table. These offsets are computed and stored when the Table is initialized.
  • datasets.table.Table.slice returns a datasets.table.Table using interpolation search
  • datasets.table.Table.fast_slice returns a pyarrow.Table object using interpolation search. This is useful for getting part of a dataset when we don't need the indexing structure for further computations, e.g. when querying an example as a dictionary (see the usage sketch after this list).
  • Now a Dataset object is always backed by a datasets.table.Table object. If one passes a pyarrow.Table to initialize a Dataset, it is converted to a datasets.table.Table.
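
Here is a small usage sketch of the new wrapper (the constructor and entry points below are assumptions for illustration and may differ from what ships in the PR):

```python
import pyarrow as pa
from datasets.table import Table  # the wrapper described in this PR

# Build a pyarrow Table out of several chunks, so that chunk lookup matters.
parts = [
    pa.Table.from_pydict({"text": [f"row {i}" for i in range(s, s + 1000)]})
    for s in range(0, 5000, 1000)
]
pa_table = pa.concat_tables(parts)

table = Table(pa_table)            # assumed constructor, may differ
sub = table.slice(2500, 10)        # datasets.table.Table, keeps the indexing structure
fast = table.fast_slice(2500, 10)  # plain pyarrow.Table, cheaper when indexing isn't needed
print(type(sub), type(fast))
```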

Checklist:

  • implement interpolation search
  • use datasets.table.Table in Dataset objects
  • update current tests
  • add tests for interpolation search
  • add comments and docstrings
  • add the benchmark to the CI

Fix #1803.
