Implement specialized group values for single Utf8/LargeUtf8/Binary/LargeBinary column #8827
Conversation
I think this is getting pretty close. I need to add some more fuzz tests, and maybe try to make this code generic so it also works for `BinaryArray`, but otherwise it is ready.
```rust
/// Maximum size of a string that can be inlined in the hash table
```
This code and its tests were moved into `bytes_map.rs` and made general.
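The short-string inlining idea behind this comment (a maximum size for values stored directly in the hash table) can be sketched in std-only Rust. This is an illustrative sketch with assumed names and an assumed 8-byte threshold, not the actual DataFusion `bytes_map.rs` code:

```rust
/// Illustrative sketch: values up to SHORT_LEN bytes are stored inline in a
/// fixed-size buffer, so short group keys need no extra heap indirection;
/// longer values are appended once to one shared buffer and referenced by
/// (offset, len). The names and the 8-byte threshold are assumptions.
const SHORT_LEN: usize = 8;

#[derive(Debug)]
enum Entry {
    /// Bytes stored directly in the entry itself
    Inline { len: u8, buf: [u8; SHORT_LEN] },
    /// Bytes appended to a shared buffer; the entry stores only offset + length
    Spilled { offset: usize, len: usize },
}

fn make_entry(value: &[u8], shared: &mut Vec<u8>) -> Entry {
    if value.len() <= SHORT_LEN {
        // Copy the short value into the inline buffer
        let mut buf = [0u8; SHORT_LEN];
        buf[..value.len()].copy_from_slice(value);
        Entry::Inline { len: value.len() as u8, buf }
    } else {
        // Append the long value to the shared buffer and remember where it is
        let offset = shared.len();
        shared.extend_from_slice(value);
        Entry::Spilled { offset, len: value.len() }
    }
}
```

Comparing two inline entries then touches no heap memory at all, which is part of why a generalized bytes map can beat the row-based path for short keys.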
Update: I have written a fuzz test (which actually caught several bugs 🐛). I plan to take a final pass through the code tomorrow morning and put it up for review. It is a pretty large PR, so I expect it will take time to review, but I think it will be worth it.
Update: the benchmark programs consistently show this query as slower, but I can't reproduce the difference. I am continuing to investigate.
```rust
});
// SAFETY: the offsets were constructed correctly in `insert_if_new` --
// monotonically increasing, overflows were checked.
let offsets = unsafe { OffsetBuffer::new_unchecked(ScalarBuffer::from(offsets)) };
```
Validating the offsets actually took significant time, which took me a while to find and track down (without this unsafe call, ClickBench Q28 slowed down by 10%).
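For context, the invariant that the checked constructor verifies, and that `new_unchecked` skips, can be sketched as a std-only helper. The function name here is illustrative, not the arrow-rs API:

```rust
/// Sketch of the offset invariant for a byte array: the first offset is 0
/// and offsets never decrease, so offsets[i]..offsets[i+1] is always a valid
/// non-negative range. A checked constructor must scan every offset (an O(n)
/// pass per output batch), which is the validation cost the `new_unchecked`
/// call above avoids. (Illustrative helper, not part of the arrow-rs API.)
fn offsets_are_valid(offsets: &[i32]) -> bool {
    offsets.first() == Some(&0) && offsets.windows(2).all(|w| w[0] <= w[1])
}
```

Since the group-values map constructs the offsets itself and already checks for overflow on insert, re-validating them at output time is pure overhead.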
FYI @thinkharderdev and @Dandandan, as you have expressed interest in this type of optimization before: this PR is now ready for review.
Dandandan left a comment:
🚀
Thank you for the review @Dandandan 🙏
🚀 |
Note: this looks like a large PR, but a substantial amount of it is moving code, tests, and documentation.
Which issue does this PR close?
closes #7064
Rationale for this change
This PR makes queries that group on strings faster.
ClickBench seems slightly slower in 34.0.0 than in 33.0.0 (#8789 (comment)), so we need some more real wins.
One thing that takes significant time in queries is grouping on a single string column, where the current `GroupsAccumulator` copies the strings several times:
1. Into `Rows`
2. Into `OwnedRow`
3. From `Rows` to form the final output `Array`

We can make such queries faster by avoiding the copies.
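The copy-avoidance idea can be sketched in std-only Rust: append each group key's bytes once into a shared buffer and record offsets, so the final output array can be built directly from (offsets, buffer). The type and method names here are illustrative assumptions, not the actual DataFusion implementation:

```rust
/// Illustrative sketch: group keys are appended exactly once to a contiguous
/// buffer, and `offsets[i]..offsets[i + 1]` delimits group i's bytes. A final
/// string/binary array can then be built from (offsets, buffer) directly,
/// instead of copying each key through several row representations.
struct ByteGroupValues {
    buffer: Vec<u8>,
    offsets: Vec<i32>,
}

impl ByteGroupValues {
    fn new() -> Self {
        // One leading 0 so group i is delimited by offsets[i]..offsets[i + 1]
        Self { buffer: Vec::new(), offsets: vec![0] }
    }

    /// Append a new group value (one copy, ever); returns its group index.
    fn append(&mut self, value: &[u8]) -> usize {
        self.buffer.extend_from_slice(value);
        self.offsets.push(self.buffer.len() as i32);
        self.offsets.len() - 2
    }

    /// Borrow group `i`'s bytes without copying.
    fn value(&self, i: usize) -> &[u8] {
        &self.buffer[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}
```

In the real code the (offsets, buffer) pair is what feeds the `OffsetBuffer` construction shown in the review snippet above, so producing output is a handoff rather than a copy.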
What changes are included in this PR?
- Moves the `SSOStringSet` that @jayzhan211 and I implemented in Optimize `COUNT(DISTINCT ...)` for strings (up to 9x faster) #8849 into `ArrowBinaryMap`, which also handles values and both string and binary data
- Extends the `COUNT DISTINCT` specialization to handle binary / large binary
- Uses `ArrowBinaryMap` to implement `GroupValuesByes`, a `GroupValues` implementation for String / LargeString / Binary / LargeBinary data

Are these changes tested?
Yes:
- New fuzz tests; follow-on ticket: `MemTable::with_sort_exprs` #9190

Are there any user-facing changes?
Faster performance.
For ClickBench, several of the longest-running queries got significantly faster. This is almost certainly the result of less string copying, as they are grouping by "Url", a string column. There are two that are reported to be slower, but given that they run in 10s of ms, I think this is likely measurement error.
For ClickBench extended, things got faster.
For TPCH there were also several queries that got faster, but the time spent is so small (10s of ms) that I think it is likely measurement error.
Full Performance Results
TODOs
- Potentially move `ArrowStringMap` upstream to arrow-rs (could use the `DictionaryArray` builder)