feat: Push down hashes to probe side in HashJoinExec #17529
Conversation
//! ```
//! The join portion of the query should look something like this:
//!
//! ```text
@adriangb I think this is probably ready for an initial look when you get a chance! I plan on adding unit + fuzz tests as well, but let me know if you have any other thoughts re: testing.

@alamb Would you be able to kick off benchmarks 🙏🏾? Specifically TPC-H against Parquet files. We'd want the following configuration options set:
/// Each element represents the column bounds computed by one partition.
bounds: Vec<PartitionBounds>,
/// Hashes from the left (build) side, if enabled
left_hashes: NoHashSet<u64>,
I found using a HashSet here to yield better performance than using a Vec<Arc<dyn JoinHashMapType>>, though it does of course result in extra allocations.
My guess is that this is primarily because Vec<Arc<dyn JoinHashMapType>> results in more indirection and scattered memory accesses, which likely means worse cache locality.
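For context, the `NoHashSet<u64>` above is presumably a `HashSet` whose keys are already hash values, so it can skip re-hashing them. A minimal sketch of such a type, assuming an identity hasher (the PR's actual definition may differ):

```rust
use std::collections::HashSet;
use std::hash::{BuildHasherDefault, Hasher};

/// Pass-through hasher: the u64 keys are already hashes, so re-hashing them
/// would be wasted work. (Hypothetical sketch, not necessarily the PR's code.)
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unimplemented!("only u64 keys are expected")
    }
    fn write_u64(&mut self, value: u64) {
        self.0 = value;
    }
}

type NoHashSet<T> = HashSet<T, BuildHasherDefault<IdentityHasher>>;
```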
I worry that the extra memory cost is prohibitively expensive: there are going to be queries that ran just fine previously but now OOM.
Since this is opt-in, hopefully it's not as much of an issue? I've mentioned this in the config documentation:
/// When set to true, hash joins will allow passing hashes from the build
/// side to the right side of the join. This can be useful to prune rows early on,
/// but may consume more memory.
In general though I agree that we shouldn't need extra allocations here. It gets a bit tricky because even if we combine all the hash tables from each build partition into a single, shareable table, each probe (stream) partition needs to be able to validate that the lookup is localized to its partition. Otherwise we'll see duplicate / incorrect results, I believe.
Though, it may just take some tweaking to the existing data structure. Haven't thought about it enough.
I think the way you'd do it is something like (col in hash_table_1) OR (col in hash_table_2) OR (...)
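In code that boils down to a membership test against each build partition's table in turn; sketched here with plain `HashSet`s since the exact `JoinHashMapType` lookup API isn't shown in this thread:

```rust
use std::collections::HashSet;

/// A probe-side hash may match if any build partition's table contains it.
fn may_match(per_partition_hashes: &[HashSet<u64>], hash: u64) -> bool {
    per_partition_hashes.iter().any(|table| table.contains(&hash))
}
```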
Ah yeah that's essentially the same as what I was referring to earlier with using a Vec<Arc<dyn JoinHashMapType>> (sorry probably should've clarified more).
@LiaCastaneda do you mean because of hash collisions?
I was thinking because localized lookups would be more efficient. IIUC, probe partition 0 should only check build partition 0's hash table, and so on. The problem is that on the probe side, in the evaluate() function (when the dynamic filter runs), we don't have information about which partition the batch belongs to.
I wonder if we can compute this "routing" using the RepartitionExec hashing to figure out the partition, then use the join's hashing for the actual hash lookup. I tried this two-hash approach branching off this PR and have something here; I think it returns correct results while using less memory in most cases (e.g., Q9 and Q18 save ~600MB each). @rkrishn7 feel free to take it if it helps or if it even makes sense :) I was just trying things out here.
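A rough, self-contained sketch of that two-hash routing idea, using std's hashing in place of DataFusion's `create_hashes` on Arrow arrays (the struct and method names here are illustrative, not the branch's actual API):

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashSet;
use std::hash::BuildHasher;

/// One set of build-side hashes per build partition, plus two seeds: the
/// repartitioning seed picks which partition's set to consult, and the join
/// seed produces the hash used for the membership test.
struct ProbeSideHashFilter {
    partition_hashes: Vec<HashSet<u64>>,
    repartition_state: RandomState,
    join_state: RandomState,
}

impl ProbeSideHashFilter {
    fn may_match(&self, key: &str) -> bool {
        // Route to the build partition this key would have been sent to by
        // RepartitionExec-style hash partitioning (same seed/columns assumed).
        let partition =
            (self.repartition_state.hash_one(key) as usize) % self.partition_hashes.len();
        // Then test membership using the join's hash of the same key.
        let join_hash = self.join_state.hash_one(key);
        self.partition_hashes[partition].contains(&join_hash)
    }
}
```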
Thanks @LiaCastaneda!
Yeah, so this is pretty much exactly what I had done previously. Instead of scanning through the Vec<Arc<dyn JoinHashMapType>>, we can leverage the fact that we know the hashing method downstream in RepartitionExec, and thus use the same seed/columns used to compute the hash for distributing across partitions.
But even though we get O(1) lookup in this approach, I found it to be not as performant as the single HashSet approach. My thinking is this occurs primarily because we're probing any one of the N HashMaps on the probe side, so it likely exhibits much worse cache efficiency than a single HashSet. I'm not sure if this aligns with your measurements.
But I do think @adriangb's thinking is correct in his comment here. Even if this approach is slightly less performant than allocating an entire new HashSet, it probably wins just on account of no extra memory overhead.
Interestingly, I had thought the HashSet approach yielded substantially better results, but looking at the results in your branch @LiaCastaneda they seem to be somewhat comparable? I will test again locally today so we have another comparison.
> I'm not sure if this aligns with your measurements

I think, in terms of latency, the results were similar. The key improvement I noticed was in memory usage: for instance, Q18 has all distinct left-side values, making it the heaviest in terms of memory. There's a difference from 5.1 GB down to 4.4 GB.
Ran your commit locally @LiaCastaneda and can confirm I see similar results as well! Strange - I thought I was doing essentially the same thing in my comparisons from a while ago 🤔 . Only difference then is I was running on my beefier Linux computer which I don't have at the moment, but I definitely could have just missed something.
Anyhow, since results look similar I propose we move forward with the approach of reusing the hash table(s)! I can apply your patch to this branch over the weekend if that sounds good. And thanks again for putting that up 🙌🏾
}

impl SharedBuildAccumulator {
    /// Creates a new [SharedBuildAccumulator] configured for the given partition mode
- /// Creates a new [SharedBuildAccumulator] configured for the given partition mode
+ /// Creates a new [`SharedBuildAccumulator`] configured for the given partition mode
Very cool! Incidentally, we were just discussing today with @gabotechs and @robtandy how to make HashJoin dynamic filter pushdown more compatible with distributed DataFusion and how to eliminate the latency associated with waiting until we have the full build side to create filters. One idea that came up was to push something like:

But for this PR the big question in my mind is going to be: is the cost of the extra evaluation of the hash worth it?
I am not quite sure how to set these options in the benchmarks...

Ah okay, then we probably won't see any changes since these need to be enabled for the changes here to take effect. Posting a comparison I did locally:
🤖: Benchmark completed
@rkrishn7 sorry if I haven't looped back here. I went on a bit of a tangent exploring #17632 and then had some vacation and a team offsite. This is overall very exciting work that I think will help a lot of people. My main concern with this change is the overhead and making a CPU / memory tradeoff decision for users. I think we might be able to ship it as an experimental thing with the feature flag defaulting to false as you've done in this PR, but long term I worry that an extra 8 GB of RAM consumed might be too much. Do you have any numbers on how fast and how much RAM these 3 different scenarios use for some queries? I don't mean to ask you to run them all, but I do remember you mentioning you have already.
My hypothesis is that the table will look something like this:
If that were the case, which is just a guess at this point, making a query 10x faster with no extra memory use is easy to justify; everyone wants that! Choosing to make some queries, say, 11x faster for 2x memory use is harder to justify. If the performance difference is larger and we think it is justified in some cases, maybe we can at least try to reserve the extra memory and fall back to re-using the existing hash tables? I also think it's worth thinking about integrating your suggestion from our conversation to use an
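Going back to the reserve-then-fall-back idea above: a minimal sketch, building on the `try_grow` call that already appears in this PR's diff. The enum, function, and the use of plain `Vec<u64>`s for per-partition hashes are illustrative stand-ins (in the real code the fallback would reuse the existing `JoinHashMapType` tables):

```rust
use std::collections::HashSet;

use datafusion_execution::memory_pool::MemoryReservation;

/// Either a dedicated merged set of build-side hashes (extra memory, fast
/// probes) or a fallback that keeps whatever the build side already produced.
enum ProbeSideHashes {
    Dedicated(HashSet<u64>),
    Reused(Vec<Vec<u64>>),
}

fn build_probe_hashes(
    reservation: &mut MemoryReservation,
    estimated_additional_size: usize,
    per_partition_hashes: Vec<Vec<u64>>,
) -> ProbeSideHashes {
    match reservation.try_grow(estimated_additional_size) {
        Ok(()) => {
            // Budget granted: merge everything into one set for fast probes.
            let total: usize = per_partition_hashes.iter().map(Vec::len).sum();
            let mut merged = HashSet::with_capacity(total);
            for hashes in &per_partition_hashes {
                merged.extend(hashes.iter().copied());
            }
            ProbeSideHashes::Dedicated(merged)
        }
        // Budget denied: don't fail the query, reuse what already exists.
        Err(_) => ProbeSideHashes::Reused(per_partition_hashes),
    }
}
```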
Never mind, I see discussion in #17171
👋 I was playing around with this feature today, here are some results for SF1 and SF10 (Claude did these nice summaries). In any case, with some manual logging for Q18 (which appears to be the heaviest join), I'm seeing 1.5M hashes for SF1 and 15M hashes for SF10 (just realized it's because all the build-side values are distinct). The correlation between more data and more memory is clear, but IMO, if there's a way to measure the total number of hashes across partitions, could we opt out of the feature during runtime and allow the user to configure this based on their available resources / pod size?
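That runtime opt-out could be as simple as comparing the total hash count against a user-configured cap; a tiny sketch with made-up names:

```rust
/// Hypothetical guard: only push build-side hashes to the probe side when the
/// total count across partitions stays under a user-configured limit.
fn should_push_down_hashes(hashes_per_partition: &[usize], max_pushed_hashes: usize) -> bool {
    let total: usize = hashes_per_partition.iter().sum();
    total <= max_pushed_hashes
}
```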
create_hashes(&expr_values, self.random_state, &mut hashes_buffer)?;

// Create a boolean array where each position indicates if the corresponding hash is in the set of known hashes
let mut buf = MutableBuffer::from_len_zeroed(bit_util::ceil(num_rows, 8));
Should be able to use MutableBuffer::collect_bool instead of setting bits.
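Something along these lines, reusing `num_rows` and `hashes_buffer` from the snippet above and assuming `left_hashes` is the shared set of build-side hashes; a sketch of the suggestion rather than a drop-in patch:

```rust
use arrow::array::BooleanArray;
use arrow::buffer::{BooleanBuffer, MutableBuffer};

// Build the bitmap in one pass instead of zero-initializing a buffer and
// setting individual bits afterwards.
let buf = MutableBuffer::collect_bool(num_rows, |i| left_hashes.contains(&hashes_buffer[i]));
let result = BooleanArray::new(BooleanBuffer::new(buf.into(), 0, num_rows), None);
```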
let estimated_additional_size =
    estimate_memory_size::<u64>(left_hash_map.num_hashes(), fixed_size)?;
inner.reservation.try_grow(estimated_additional_size)?;
inner.left_hashes.extend(left_hash_map.hashes());
Can we maybe derive the capacity for left_hashes upfront to reduce the cost of building the map?
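For example, on top of the snippet above (assuming `left_hashes` is a std-like set with a `reserve` method and `num_hashes()` is a cheap count, as used in the PR):

```rust
// Reserve space for the incoming hashes once, instead of letting the set
// grow (and re-hash its contents) repeatedly while extending.
inner.left_hashes.reserve(left_hash_map.num_hashes());
inner.left_hashes.extend(left_hash_map.hashes());
```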
Benchmarks look promising, but I think we should be able to reduce/remove regressions by avoiding the overhead of creating a new
Thanks for the reviews and testing @adriangb @LiaCastaneda @Dandandan! Apologies, I haven't had time to look at this the past couple weeks. I'll be able to give this more attention next week when I'm back home. Still very excited about this work and looking forward to addressing everyone's feedback/suggestions!
Which issue does this PR close?
What changes are included in this PR?
- A new configuration option (`hash_join_sideways_hash_passing`) to enable passing hashes from the build side to probe-side scans.
- A `HashComparePhysicalExpr` that is pushed down to supported right-side scans in hash join.

Are these changes tested?
Not yet. Plan to add unit + fuzz tests
Are there any user-facing changes?
Yes, new configuration option for hash join execution.