diff --git a/benchmarks/README.md b/benchmarks/README.md
index 3016dfe3c6de..cb6b276c414e 100644
--- a/benchmarks/README.md
+++ b/benchmarks/README.md
@@ -23,7 +23,6 @@
This crate contains benchmarks based on popular public data sets and open
source benchmark suites, to help with performance and scalability testing of
DataFusion.
-
## Other engines

The benchmarks measure changes to DataFusion itself, rather than
its performance against other engines. For competitive
benchmarking, DataFusion is included in the benchmark setups for
several popular benchmarks that compare performance with other engines. For example:

-* [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
-* [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)
+- [ClickBench] scripts are in the [ClickBench repo](https://github.com/ClickHouse/ClickBench/tree/main/datafusion)
+- [H2o.ai `db-benchmark`] scripts are in [db-benchmark](https://github.com/apache/datafusion/tree/main/benchmarks/src/h2o.rs)

[ClickBench]: https://github.com/ClickHouse/ClickBench/tree/main
[H2o.ai `db-benchmark`]: https://github.com/h2oai/db-benchmark

@@ -65,39 +64,50 @@ Create / download a specific dataset (TPCH)

```shell
./bench.sh data tpch
```
+
Data is placed in the `data` subdirectory.

## Running benchmarks

Run the benchmark for the TPC-H dataset
+
```shell
./bench.sh run tpch
```
+
or for the TPC-H dataset at scale factor 10
+
```shell
./bench.sh run tpch10
```

To run a specific query, for example Q21
+
```shell
./bench.sh run tpch10 21
```

## Benchmark with modified configurations
+
### Select join algorithm
+
The benchmark runs with `prefer_hash_join == true` by default, which enforces the HASH join algorithm. To run TPCH benchmarks with a join algorithm other than HASH:
+
```shell
PREFER_HASH_JOIN=false ./bench.sh run tpch
```

### Configure with environment variables
-Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided environment variables are
+
+Any [datafusion options](https://datafusion.apache.org/user-guide/configs.html) that are provided as environment variables are
also considered by the benchmarks.
-The following configuration runs the TPCH benchmark with datafusion configured to *not* repartition join keys.
+The following configuration runs the TPCH benchmark with datafusion configured to _not_ repartition join keys.
+
```shell
DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false ./bench.sh run tpch
```
+
You might want to adjust the results location to avoid overwriting previous results.
Environment configuration that was picked up by datafusion is logged at the `info` level.
To verify that datafusion picked up your configuration, run the benchmarks with `RUST_LOG=info` or higher.
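For example, a single run can combine several of these settings (the particular values below are illustrative, not recommendations; any option from the configuration guide maps to a `DATAFUSION_`-prefixed variable with the dots replaced by underscores and upper-cased):

```shell
# Illustrative combination: pin the partition count, disable join
# repartitioning, and log the configuration DataFusion actually picked up.
# datafusion.execution.target_partitions -> DATAFUSION_EXECUTION_TARGET_PARTITIONS
DATAFUSION_EXECUTION_TARGET_PARTITIONS=8 \
DATAFUSION_OPTIMIZER_REPARTITION_JOINS=false \
RUST_LOG=info \
./bench.sh run tpch
```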
@@ -419,7 +429,7 @@ logs.

Example

-dfbench parquet-filter --path ./data --scale-factor 1.0
+dfbench parquet-filter --path ./data --scale-factor 1.0

generates the synthetic dataset at `./data/logs.parquet`. The size
of the dataset can be controlled through the `--scale-factor`
@@ -451,6 +461,7 @@ Iteration 2 returned 1781686 rows in 1947 ms
```

## Sort
+
Test performance of sorting large datasets

This test sorts a synthetic dataset generated during the
@@ -474,22 +485,27 @@ Additionally, an optional `--limit` flag is available for the sort benchmark. Wh

See [`sort_tpch.rs`](src/sort_tpch.rs) for more details.

### Sort TPCH Benchmark Example Runs
+
1. Run all queries with the default settings:
+
```bash
cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
```

2. Run a specific query:
+
```bash
cargo run --release --bin dfbench -- sort-tpch -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json' --query 2
```

3. Run all queries as TopK queries on presorted data:
+
```bash
cargo run --release --bin dfbench -- sort-tpch --sorted --limit 10 -p './datafusion/benchmarks/data/tpch_sf1' -o '/tmp/sort_tpch.json'
```

4. Run all queries with the `bench.sh` script:
+
```bash
./bench.sh run sort_tpch
```
@@ -527,66 +543,78 @@ External aggregation benchmarks run several aggregation queries with different m

This benchmark is inspired by [DuckDB's external aggregation paper](https://hannes.muehleisen.org/publications/icde2024-out-of-core-kuiper-boncz-muehleisen.pdf), specifically Section VI.

### External Aggregation Example Runs
+
1. Run all queries with predefined memory limits:
+
```bash
# Under 'benchmarks/' directory
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json'
```

2. Run a query with a specific memory limit:
+
```bash
cargo run --release --bin external_aggr -- benchmark -n 4 --iterations 3 -p '....../data/tpch_sf1' -o '/tmp/aggr.json' --query 1 --memory-limit 30M
```

3. Run all queries with the `bench.sh` script:
+
```bash
./bench.sh data external_aggr
./bench.sh run external_aggr
```
-
## h2o.ai benchmarks
+
The h2o.ai benchmarks are a set of performance tests for groupby and join operations. Beyond the standard h2o benchmark, there is also an extended benchmark for window functions. These benchmarks use synthetic data with configurable sizes (small: 1e7 rows, medium: 1e8 rows, big: 1e9 rows) to evaluate DataFusion's performance across different data scales.

Reference:
+
- [H2O AI Benchmark](https://duckdb.org/2023/04/14/h2oai.html)
- [Extended window benchmark](https://duckdb.org/2024/06/26/benchmarks-over-time.html#window-functions-benchmark)

### h2o benchmarks for groupby

#### Generate data for h2o benchmarks
+
There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.

1. Generate small data (1e7 rows)
+
```bash
./bench.sh data h2o_small
```
-
2. Generate medium data (1e8 rows)
+
```bash
./bench.sh data h2o_medium
```
-
3. Generate large data (1e9 rows)
+
```bash
./bench.sh data h2o_big
```

#### Run h2o benchmarks
+
There are three options for running h2o benchmarks: `small`, `medium`, and `big`.
+
1. Run small data benchmark
+
```bash
./bench.sh run h2o_small
```

2. Run medium data benchmark
+
```bash
./bench.sh run h2o_medium
```

3. Run large data benchmark
+
```bash
./bench.sh run h2o_big
```
@@ -594,6 +622,7 @@ There are three options for running h2o benchmarks: `small`, `medium`, and `big`

4. Run a specific query with a specific data path

For example, to run query 1 with the small data generated above:
+
```bash
cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7_100_0.csv --query 1
```
@@ -602,7 +631,7 @@ cargo run --release --bin dfbench -- h2o --path ./benchmarks/data/h2o/G1_1e7_1e7

There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`. The data is generated in the `data` directory.

-Here is a example to generate `small` dataset and run the benchmark. To run other
+Here is an example to generate the `small` dataset and run the benchmark. To run other
dataset size configurations, change the command as in the previous example.

```bash
@@ -616,6 +645,7 @@ dataset size configuration, change the command similar to the previous example.

To run a specific query with specific join data paths, the paths must include the 4 table files.
For example, to run query 1 with the small data generated above:
+
```bash
cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/join.sql --query 1
```
@@ -624,7 +654,7 @@ cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1

This benchmark extends the h2o benchmark suite to evaluate window function performance. The h2o window benchmark uses the same dataset as the h2o join benchmark. There are three options for generating data for h2o benchmarks: `small`, `medium`, and `big`.

-Here is a example to generate `small` dataset and run the benchmark. To run other
+Here is an example to generate the `small` dataset and run the benchmark. To run other
dataset size configurations, change the command as in the previous example.

```bash
@@ -638,6 +668,7 @@ dataset size configuration, change the command similar to the previous example.

To run a specific query with specific window data paths, the paths must include the 4 table files (the same as the h2o join dataset).
For example, to run query 1 with the small data generated above:
+
```bash
cargo run --release --bin dfbench -- h2o --join-paths ./benchmarks/data/h2o/J1_1e7_NA_0.csv,./benchmarks/data/h2o/J1_1e7_1e1_0.csv,./benchmarks/data/h2o/J1_1e7_1e4_0.csv,./benchmarks/data/h2o/J1_1e7_1e7_NA.csv --queries-path ./benchmarks/queries/h2o/window.sql --query 1
```
diff --git a/benchmarks/queries/clickbench/README.md b/benchmarks/queries/clickbench/README.md
index 7d27d3c6c3dc..ad4ae8f60fc1 100644
--- a/benchmarks/queries/clickbench/README.md
+++ b/benchmarks/queries/clickbench/README.md
@@ -5,12 +5,13 @@ This directory contains queries for the ClickBench benchmark https://benchmark.c

ClickBench is focused on aggregation and filtering performance (though it has no joins)

## Files:
-* `queries.sql` - Actual ClickBench queries, downloaded from the [ClickBench repository]
-* `extended.sql` - "Extended" DataFusion specific queries.
+
+- `queries.sql` - Actual ClickBench queries, downloaded from the [ClickBench repository]
+- `extended.sql` - "Extended" DataFusion-specific queries.

[ClickBench repository]: https://github.com/ClickHouse/ClickBench/blob/main/datafusion/queries.sql

-## "Extended" Queries
+## "Extended" Queries

The "extended" queries are not part of the official ClickBench benchmark.
Instead, they are used to test other DataFusion features that are not covered by
@@ -25,7 +26,7 @@ the standard benchmark. Each description below is for the corresponding line in

distinct string columns.

```sql
-SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
+SELECT COUNT(DISTINCT "SearchPhrase"), COUNT(DISTINCT "MobilePhone"), COUNT(DISTINCT "MobilePhoneModel")
FROM hits;
```

@@ -35,7 +36,6 @@ FROM hits;

**Important Query Properties**: multiple `COUNT DISTINCT`s. All three are small strings (length either 1 or 2).
-
```sql
SELECT COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserCountry"), COUNT(DISTINCT "BrowserLanguage")
FROM hits;
@@ -43,21 +43,20 @@ FROM hits;

### Q2: Top 10 analysis

-**Question**: "Find the top 10 "browser country" by number of distinct "social network"s,
-including the distinct counts of "hit color", "browser language",
+**Question**: "Find the top 10 "browser country" by number of distinct "social network"s,
+including the distinct counts of "hit color", "browser language",
and "social action"."

**Important Query Properties**: GROUP BY short string, multiple `COUNT DISTINCT`s. There are several small strings (length either 1 or 2).

```sql
SELECT "BrowserCountry", COUNT(DISTINCT "SocialNetwork"), COUNT(DISTINCT "HitColor"), COUNT(DISTINCT "BrowserLanguage"), COUNT(DISTINCT "SocialAction")
-FROM hits
-GROUP BY 1
-ORDER BY 2 DESC
+FROM hits
+GROUP BY 1
+ORDER BY 2 DESC
LIMIT 10;
```
-
### Q3: What is the income distribution for users in specific regions

**Question**: "What regions and social networks have the highest variance of parameter price?"

**Important Query Properties**: STDDEV and VAR aggregation functions, GROUP BY multiple small ints

```sql
-SELECT "SocialSourceNetworkID", "RegionID", COUNT(*), AVG("Age"), AVG("ParamPrice"), STDDEV("ParamPrice") as s, VAR("ParamPrice")
-FROM 'hits.parquet'
-GROUP BY "SocialSourceNetworkID", "RegionID"
+SELECT "SocialSourceNetworkID", "RegionID", COUNT(*), AVG("Age"), AVG("ParamPrice"), STDDEV("ParamPrice") as s, VAR("ParamPrice")
+FROM 'hits.parquet'
+GROUP BY "SocialSourceNetworkID", "RegionID"
HAVING s IS NOT NULL
-ORDER BY s DESC
+ORDER BY s DESC
LIMIT 10;
```

### Q4: Response start time distribution analysis (median)

-**Question**: Find the WatchIDs with the highest median "ResponseStartTiming" without Java enabled
+**Question**: Find the WatchIDs with the highest median "ResponseStartTiming" without Java enabled

**Important Query Properties**: MEDIAN aggregate function, high cardinality grouping that skips intermediate aggregation

@@ -102,10 +101,9 @@ Results look like

+-------------+---------------------+---+------+------+------+
```
-
### Q5: Response start time distribution analysis (p95)

-**Question**: Find the WatchIDs with the highest p95 "ResponseStartTiming" without Java enabled
+**Question**: Find the WatchIDs with the highest p95 "ResponseStartTiming" without Java enabled

**Important Query Properties**: APPROX_PERCENTILE_CONT aggregate function, high cardinality grouping that skips intermediate aggregation

@@ -122,6 +120,7 @@ LIMIT 10;
```

Results look like
+
```
+-------------+---------------------+---+------+------+------+
| ClientIP    | WatchID             | c | tmin | tp95 | tmax |
@@ -151,13 +150,14 @@ WHERE
    -- Stage 3: Heavy computations (expensive)
    AND regexp_match("Referer", '\/campaign\/(spring|summer)_promo') IS NOT NULL -- Find campaign-specific referrers
-    AND CASE
-        WHEN split_part(split_part("URL", 'resolution=', 2), '&', 1) ~ '^\d+$'
-        THEN split_part(split_part("URL", 'resolution=', 2), '&', 1)::INT
-        ELSE 0
+    AND CASE
+        WHEN split_part(split_part("URL", 'resolution=', 2), '&', 1) ~ '^\d+$'
+        THEN split_part(split_part("URL", 'resolution=', 2), '&', 1)::INT
+        ELSE 0
    END > 1920 -- Extract and validate resolution parameter
    AND levenshtein(CAST("UTMSource" AS STRING), CAST("UTMCampaign" AS STRING)) < 3 -- Verify UTM parameter similarity
```
+
The result is empty, since the rows have already been filtered out by `"SocialAction" = 'share'`.
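A quick standalone check makes this visible (not part of the benchmark suite; it reuses only the cheap leading filters from Q6). Because the full Q6 result is a subset of the rows matching these filters, a count of zero here implies the complete predicate can never match, regardless of the more expensive stages:

```sql
-- Rows matching only the cheap leading filters from Q6.
-- If this is already 0, the full Q6 WHERE clause returns 0 rows as well.
SELECT COUNT(*)
FROM hits
WHERE "IsMobile" = 1
  AND "MobilePhoneModel" LIKE 'iPhone%'
  AND "SocialAction" = 'share';
```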
### Q7: Device Resolution and Refresh Behavior Analysis

@@ -175,6 +175,7 @@ LIMIT 10;
```

Results look like
+
```
+---------------------+------+------+----------+
| WatchID             | wmin | wmax | srefresh |
@@ -192,10 +193,46 @@ Results look like
+---------------------+------+------+----------+
```

+### Q8: Average Latency and Response Time Analysis
+
+**Question**: Which combinations of operating system, region, and user agent exhibit the highest average latency? For each of these combinations, also report the average response time.
+
+**Important Query Properties**: Multiple averages of Durations, high cardinality grouping
+
+```sql
+SELECT "RegionID", "UserAgent", "OS", AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ResponseStartTiming")) as avg_response_time, AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ConnectTiming")) as avg_latency
+FROM hits
+GROUP BY "RegionID", "UserAgent", "OS"
+ORDER BY avg_latency DESC
+LIMIT 10;
+```
+
+Results look like
+
+```
++----------+-----------+-----+------------------------------------------+------------------------------------------+
+| RegionID | UserAgent | OS  | avg_response_time                        | avg_latency                              |
++----------+-----------+-----+------------------------------------------+------------------------------------------+
+| 22934    | 5         | 126 | 0 days 8 hours 20 mins 0.000000000 secs  | 0 days 8 hours 20 mins 0.000000000 secs  |
+| 22735    | 82        | 74  | 0 days 8 hours 20 mins 0.000000000 secs  | 0 days 8 hours 20 mins 0.000000000 secs  |
+| 21687    | 32        | 49  | 0 days 8 hours 20 mins 0.000000000 secs  | 0 days 8 hours 20 mins 0.000000000 secs  |
+| 18518    | 82        | 77  | 0 days 8 hours 20 mins 0.000000000 secs  | 0 days 8 hours 20 mins 0.000000000 secs  |
+| 14006    | 7         | 126 | 0 days 7 hours 58 mins 20.000000000 secs | 0 days 8 hours 20 mins 0.000000000 secs  |
+| 9803     | 82        | 77  | 0 days 8 hours 20 mins 0.000000000 secs  | 0 days 8 hours 20 mins 0.000000000 secs  |
+| 107108   | 82        | 77  | 0 days 8 hours 20 mins 0.000000000 secs  | 0 days 8 hours 20 mins 0.000000000 secs  |
+| 111626   | 7         | 44  | 0 days 7 hours 23 mins 12.500000000 secs | 0 days 8 hours 0 mins 47.000000000 secs  |
+| 17716    | 56        | 44  | 0 days 6 hours 48 mins 44.500000000 secs | 0 days 7 hours 35 mins 47.000000000 secs |
+| 13631    | 82        | 45  | 0 days 7 hours 23 mins 1.000000000 secs  | 0 days 7 hours 23 mins 1.000000000 secs  |
++----------+-----------+-----+------------------------------------------+------------------------------------------+
+10 row(s) fetched.
+Elapsed 30.195 seconds.
+```
+
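A note on the interval-valued averages above: in recent DataFusion versions `to_timestamp` interprets integer arguments as seconds since the Unix epoch, so subtracting two such timestamps yields a duration. That is why the top values plateau at `8 hours 20 mins`, which is exactly 30000 seconds, suggesting the underlying timing deltas top out at 30000. A standalone sanity check (not part of the suite):

```sql
-- 30000 s = 8 h 20 min, the plateau visible in the avg_latency column above
SELECT to_timestamp(30000) - to_timestamp(0) AS max_timing_delta;
```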
## Data Notes

Here are some interesting statistics about the data used in the queries

Max length of `"SearchPhrase"` is 1113 characters
+
```sql
> select min(length("SearchPhrase")) as "SearchPhrase_len_min", max(length("SearchPhrase")) "SearchPhrase_len_max" from 'hits.parquet' limit 10;
+----------------------+----------------------+
| SearchPhrase_len_min | SearchPhrase_len_max |
+----------------------+----------------------+
| 0                    | 1113                 |
+----------------------+----------------------+
```

@@ -205,8 +242,8 @@
-
Here is the schema of the data
+
```sql
> describe 'hits.parquet';
+-----------------------+-----------+-------------+
diff --git a/benchmarks/queries/clickbench/extended.sql b/benchmarks/queries/clickbench/extended.sql
index f210c3a56e35..93c39efe4f8e 100644
--- a/benchmarks/queries/clickbench/extended.sql
+++ b/benchmarks/queries/clickbench/extended.sql
@@ -6,3 +6,4 @@ SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin, MEDI
SELECT "ClientIP", "WatchID", COUNT(*) c, MIN("ResponseStartTiming") tmin, APPROX_PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY "ResponseStartTiming") tp95, MAX("ResponseStartTiming") tmax FROM 'hits' WHERE "JavaEnable" = 0 GROUP BY "ClientIP", "WatchID" HAVING c > 1 ORDER BY tp95 DESC LIMIT 10;
SELECT COUNT(*) AS ShareCount FROM hits WHERE "IsMobile" = 1 AND "MobilePhoneModel" LIKE 'iPhone%' AND "SocialAction" = 'share' AND "SocialSourceNetworkID" IN (5, 12) AND "ClientTimeZone" BETWEEN -5 AND 5 AND regexp_match("Referer", '\/campaign\/(spring|summer)_promo') IS NOT NULL AND CASE WHEN split_part(split_part("URL", 'resolution=', 2), '&', 1) ~ '^\d+$' THEN split_part(split_part("URL", 'resolution=', 2), '&', 1)::INT ELSE 0 END > 1920 AND levenshtein(CAST("UTMSource" AS STRING), CAST("UTMCampaign" AS STRING)) < 3;
SELECT "WatchID", MIN("ResolutionWidth") as wmin, MAX("ResolutionWidth") as wmax, SUM("IsRefresh") as srefresh FROM hits GROUP BY "WatchID" ORDER BY "WatchID" DESC LIMIT 10;
+SELECT "RegionID", "UserAgent", "OS", AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ResponseStartTiming")) as avg_response_time, AVG(to_timestamp("ResponseEndTiming")-to_timestamp("ConnectTiming")) as avg_latency FROM hits GROUP BY "RegionID", "UserAgent", "OS" ORDER BY avg_latency DESC limit 10;
\ No newline at end of file
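With the new row in `extended.sql`, Q8 runs together with the rest of the extended suite via the existing ClickBench targets in `bench.sh` (the target names below are an assumption based on the script, not taken from this diff):

```shell
# Download the single-file ClickBench dataset, then run the extended queries
./bench.sh data clickbench_1
./bench.sh run clickbench_extended
```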