Closed
Labels
bug (Something isn't working)
Description
Describe the problem
This is NOT a bug, but a potential improvement goal.
DataFusion v19.rc1 turns on repartition_file_scans by default (#5295).
On my local MacBook Pro (2.6 GHz 6-Core Intel Core i7, 32 GB 2667 MHz DDR4), for the following query on the 14 GB ClickBench hits.parquet file:
- v19.rc1 took 12.343 seconds (about 6.8x faster than v18, which took 83.863 seconds)
- DuckDB v0.6.1 took real 0.566 user 1.876031 sys 0.357483
  - clock time 566ms
  - cpu time 1.87s
  - I think clock time is smaller than cpu time because it uses multiple CPU cores in parallel.
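The reading of the timings above can be checked with a little arithmetic. The sketch below divides CPU time (user + sys) by wall-clock time to estimate how many cores were busy on average, and computes the v18 to v19.rc1 speedup. All names are illustrative and not part of either tool; the numbers are copied from the measurements above.

```python
# Figures copied from the measurements reported in this issue.
V18_SECONDS = 83.863
V19_SECONDS = 12.343
DUCKDB_REAL, DUCKDB_USER, DUCKDB_SYS = 0.566, 1.876031, 0.357483


def average_busy_cores(real: float, user: float, sys: float) -> float:
    """CPU seconds consumed per wall-clock second (a rough parallelism estimate)."""
    return (user + sys) / real


speedup = V18_SECONDS / V19_SECONDS
busy_cores = average_busy_cores(DUCKDB_REAL, DUCKDB_USER, DUCKDB_SYS)

print(f"v18 -> v19.rc1 speedup: {speedup:.1f}x")
print(f"DuckDB average busy cores: {busy_cores:.1f}")
```

A ratio well above 1 is consistent with the guess that DuckDB spreads the scan over several cores; it says nothing about how the work is divided internally.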
To Reproduce
Download data file
wget --continue https://datasets.clickhouse.com/hits_compatible/hits.parquet
Prepare SQL
Create a file called create.sql:
CREATE EXTERNAL TABLE hits
STORED AS PARQUET
LOCATION 'hits.parquet';
Create a file called q23_no_order_limit_1.sql:
SELECT * FROM hits WHERE "URL" LIKE '%google%' limit 1;
Datafusion
git clone https://github.com/apache/arrow-datafusion.git
cd arrow-datafusion
git checkout 19.0.0-rc1
cd datafusion-cli
cargo build --release
target/release/datafusion-cli -f create.sql q23_no_order_limit_1.sql
// output: 1 row in set. Query took 12.343 seconds
DuckDB
brew install duckdb
duckdb
> .timer on
> SELECT * FROM read_parquet('hits.parquet') WHERE URL LIKE '%google%' LIMIT 1;
// output: Run Time (s): real 0.566 user 1.876031 sys 0.357483
Expected behavior
- With a single core, datafusion-cli takes about 2s (like DuckDB's CPU time)
- With multiple cores, datafusion-cli takes about 0.6s (like DuckDB's real time)
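To make these targets concrete, the sketch below computes how much faster the measured datafusion-cli run would need to be to reach each goal. The 2 s and 0.6 s targets are the rounded figures from the list above, and the names are illustrative only.

```python
# Measured datafusion-cli v19.rc1 time from this issue, in seconds.
MEASURED_SECONDS = 12.343

# Rounded targets from the expected-behavior list, in seconds.
TARGETS = {"single core": 2.0, "multi core": 0.6}


def required_speedup(measured: float, target: float) -> float:
    """Factor by which the measured time must shrink to hit the target."""
    return measured / target


gaps = {label: required_speedup(MEASURED_SECONDS, t) for label, t in TARGETS.items()}

for label, factor in gaps.items():
    print(f"{label}: needs about {factor:.1f}x improvement")
```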