Convert more cross joins to inner joins (Address performance/execution plan of TPCH query 19) #3482

DhamoPS · 2022-09-14T17:06:18Z

* Added the new optimizer rule reduce_cross_join which would convert cross joins to inner joins if the filter has the join predicates for the corresponding tables.
* Join Predicates which are part of OR expressions of FILTER PREDICATES, were not pulled properly. Due to that, CROSS JOINs are chosen for those joins instead of innerjoin and causes query to run longer. Actually TPCH Q19 was taking more than 2000 sec in my laptop. With Fix, TPCH Q19 is taking ~3 sec.

Which issue does this PR close?

-- reduce_cross_join.rs is the new optimizer rule added to pull the
Joined predicates from Filters and convert cross_join into inner_join
-- tests/sql/subqueries.rs::added testcases for the above fix

Closes #78
Closes #2859

Rationale for this change

In [PR#3334](#3334), I have tried to fix this issue in planner.rs. But to handle DataFrameAPI queries, we have decided to move this rule to optimizer.

What changes are included in this PR?

new optimizer rule reduce_cross_join
unit test cases for the optimizer rule
test cases by running SQL stmts

Logs of TPCH Q19:
arrow-datafusion/target/release/tpch benchmark datafusion --iterations 3 --path ./data --format tbl --query 19 --batch-size 4096`

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 19, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "./data", file_format: "tbl", mem_table: false, output_path: None }
Query 19 iteration 0 took 2640.2 ms and returned 1 rows
Query 19 iteration 1 took 2099.2 ms and returned 1 rows
Query 19 iteration 2 took 2199.9 ms and returned 1 rows
Query 19 avg time: 2313.08 ms

Are there any user-facing changes?

No

* Added the new optimizer rule reduce_cross_join which would convert cross joins to inner joins if the filter has the join predicates for the corresponding tables.

DhamoPS · 2022-09-14T17:12:07Z

@andygrove or @alamb can you kick off CI please?

avantgardnerio · 2022-09-14T17:16:39Z

Thanks @DhamoPS ! It looks great and I appreciate the tests. I'll review in more detail tonight or tomorrow.

avantgardnerio · 2022-09-14T19:05:11Z

It might be important to note that master regressed in functionality: all the queries used to pass. Some were slow and only accurate to 4 decimals, but now several actually fail. Hopefully q19 isn't one of them.

Dandandan · 2022-09-14T20:52:04Z

It might be important to note that master regressed in functionality: all the queries used to pass. Some were slow and only accurate to 4 decimals, but now several actually fail. Hopefully q19 isn't one of them.

I think we should think of ways of adding the tpch queries to the CI?

alamb · 2022-09-14T20:52:34Z

It might be important to note that master regressed in functionality: all the queries used to pass. Some were slow and only accurate to 4 decimals, but now several actually fail. Hopefully q19 isn't one of them.

🤔 this sounds bad -- I thought we had tests in place to catch this. Clearly we didn't have enough.

I also plan to review this PR more carefully tomorrow

avantgardnerio · 2022-09-14T21:05:45Z

this sounds bad

Ah, my bad - this may be behind a flag: #3393 (comment)

One man's regression is another's more strict acceptance criteria.

Edit: and in any event, it looks like query 19 was unaffected. I should have reviewed which ones they were before speaking up.

avantgardnerio · 2022-09-14T21:10:18Z

I think we should think of ways of adding the tpch queries to the CI?

There are tests for several of them here: https://github.com/apache/arrow-datafusion/blob/5f029cc73755b6800217e370ebe0a7b5e8a6a224/datafusion/core/tests/sql/subqueries.rs#L119

kmitchener · 2022-09-14T21:12:38Z

@Dandandan adding verification of TPCH results to the CI would be nice. I was thinking of suggesting that once all the tests pass and we can verify results against the answers. Fixing q19 is one of the last major blockers, besides some decimal issues.

Edit: by "add all the tests", I mean that maybe we can load the benchmark data + answers to CI and actually run the tests in the benchmark itself? Anyway, I don't want to pull attention away from this issue.

- fixing the build failure in CI -> Rust/ clippy - Fixing the testcases

xudong963

If we have the rule in the optimizer, I think the logic of cross join -> inner join in planner can be deleted https://github.com/apache/arrow-datafusion/blob/master/datafusion/sql/src/planner.rs#L631. (It's ok to do it in the next ticket).

xudong963 · 2022-09-15T12:42:51Z

datafusion/core/tests/sql/subqueries.rs

Why add the test in the tests/sql/subqueries.rs file?

Other TPCH Queries related testcases are in subqueries.rs and so I have added this test also here.

In fact, tests in subqueries.rs are about correlated subqueries(you can know by the file name directly), not all tpch queries.

So you need to find a proper place for your tests.

Ok, I move the tests to different file

Perhaps we pull all the tpch_* queries out into a tcph.rs so they are easier to track.

xudong963 · 2022-09-15T12:45:27Z

datafusion/optimizer/src/reduce_cross_join.rs

It'll be clearer to highlight those sqls by ``.

avantgardnerio · 2022-09-15T13:49:18Z

datafusion/core/tests/sql/subqueries.rs

+1 for r## strings!

avantgardnerio · 2022-09-15T13:59:02Z

datafusion/optimizer/src/reduce_cross_join.rs

I don't understand this comment. It implies that this query cannot be optimized, but then the test goes on to assert that it has been converted to an inner join (which seems correct BTW).

It was a typo. I have corrected it now.

avantgardnerio · 2022-09-15T14:03:52Z

datafusion/optimizer/src/reduce_cross_join.rs

It seems weird to pass &self to a non-member function when this should be able to be made a member of the struct impl itself? (I realize this is just following an existing pattern from other optimizer rules)

yes, I just followed existing pattern. I hope it is fine.

avantgardnerio · 2022-09-15T14:05:13Z

datafusion/optimizer/src/reduce_cross_join.rs

Formatting nit:

Suggested change

}

} else {

avantgardnerio · 2022-09-15T14:07:08Z

datafusion/optimizer/src/reduce_cross_join.rs

If this parameter is not needed, let's just remove it. This isn't part of the trait, so there is no contractual need to keep it around.

Ok. removed the _optimizer_config

avantgardnerio · 2022-09-15T14:11:14Z

datafusion/optimizer/src/reduce_cross_join.rs

All branches appear to return Ok(()) (including intersect()), so this method should return () not Result().

ok removed.

avantgardnerio · 2022-09-15T14:27:29Z

datafusion/core/tests/sql/subqueries.rs

There should be no unwraps in optimizer rules. @andygrove and others have been actively making PRs to remove them, based on some fuzzing results where things that were never supposed to happen kept happening.

I'd suggest something like:

for (l, r) in possible_join_keys { let dt = match left_schema.field_from_column(l) { Ok(f) => f.data_type(), Err(e) => e.map_err(|e| context!("Field not found!", e))?, }; if right_schema.field_from_column(r).is_ok() && can_hash(dt) { join_keys.push((l.clone(), r.clone())); } else if right_schema.field_from_column(l).is_ok() && can_hash(dt) { join_keys.push((r.clone(), l.clone())); } }

ok, removed unwraps

Oh, sorry, I see I accidentally left the comment on an unwrap in a test - I think that's fine BTW. Thanks for removing the others.

avantgardnerio · 2022-09-15T14:33:27Z

datafusion/optimizer/src/reduce_cross_join.rs

Suggested change

) -> Result<()> {

) -> () {

avantgardnerio · 2022-09-15T14:44:55Z

datafusion/optimizer/src/reduce_cross_join.rs

Nice! I had no idea this function existed, and would have used it if I knew it did :)

avantgardnerio · 2022-09-15T14:54:50Z

datafusion/optimizer/src/reduce_cross_join.rs

Nice work tracking down all the parts that needed to be recursive :)

avantgardnerio · 2022-09-15T15:04:11Z

datafusion/optimizer/src/reduce_cross_join.rs

Additional documentation about return values would help here, e.g. what do Some() and None mean in context?

avantgardnerio · 2022-09-15T15:10:50Z

datafusion/optimizer/src/reduce_cross_join.rs

r## strings would be nice here.

avantgardnerio · 2022-09-15T15:33:31Z

datafusion/optimizer/src/reduce_cross_join.rs

Unless I'm missing something, left is misleading name for this variable - this is just the resultant plan after children have been optimized? There is no right?

ok, renamed it to new_plan

avantgardnerio

@DhamoPS , nice work! I spent a few hours looking it over and trying to break it with some tests. The logic looks solid.

I gave lots of feedback, but I think the only things preventing merge (IMO) are:

the unwraps should be replaced with proper error handling
the functions that return a Result that is always Ok should be simplified to reduce cognitive load

alamb · 2022-09-15T20:32:37Z

datafusion/optimizer/src/reduce_cross_join.rs

Suggested change

) -> () {

) {

I don't think you need to explicitly state in rust that you are returning ()

codecov-commenter · 2022-09-15T21:05:18Z

Codecov Report

Merging #3482 (672091a) into master (f3bb84f) will increase coverage by 0.03%.
The diff coverage is 95.04%.

@@            Coverage Diff             @@
##           master    #3482      +/-   ##
==========================================
+ Coverage   85.75%   85.79%   +0.03%     
==========================================
  Files         299      300       +1     
  Lines       55273    55513     +240     
==========================================
+ Hits        47402    47626     +224     
- Misses       7871     7887      +16

Impacted Files	Coverage Δ
datafusion/optimizer/src/reduce_cross_join.rs	`94.47% <94.47%> (ø)`
datafusion/core/src/execution/context.rs	`79.33% <100.00%> (+0.02%)`	⬆️
datafusion/core/tests/sql/predicates.rs	`100.00% <100.00%> (ø)`
datafusion/optimizer/tests/integration-test.rs	`89.04% <100.00%> (+0.15%)`	⬆️
datafusion/core/src/config.rs	`89.65% <0.00%> (-4.10%)`	⬇️
datafusion/expr/src/logical_plan/plan.rs	`77.52% <0.00%> (-0.34%)`	⬇️
...n/core/src/physical_plan/file_format/row_filter.rs	`87.30% <0.00%> (-0.31%)`	⬇️
datafusion/common/src/scalar.rs	`85.11% <0.00%> (-0.07%)`	⬇️
datafusion/proto/src/lib.rs	`94.14% <0.00%> (+<0.01%)`	⬆️
datafusion/sql/src/planner.rs	`80.97% <0.00%> (+0.02%)`	⬆️
... and 5 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

alamb

Thank you @DhamoPS -- I reviewed the tests carefully and they look very good 👌 I think if we add a few more showing this PR works with some more complicated OR predicates and with more than 2 tables it would be ready to merge.

Thank you @avantgardnerio for your thorough review

I went over the code a bit, and it also looks quite reasonable. I left a few comments but they are all stylistic.

I can't help feeling after this there are three somewhat redundant overlapping areas of the code

This one
the planner logic identified by @xudong963 in https://github.com/apache/arrow-datafusion/blob/master/datafusion/sql/src/planner.rs#L630-L681
The rewrite in RewriteDisjunctivePredicate

Maybe we can work as a follow on to consolidate / remove the redundancy.

All in all very nice work 👍

alamb · 2022-09-15T20:34:45Z

datafusion/optimizer/src/reduce_cross_join.rs

I don't think you need to explicitly state in rust that you are returning ()

alamb · 2022-09-15T20:35:41Z

datafusion/core/tests/sql/predicates.rs

alamb · 2022-09-15T20:38:00Z

datafusion/core/tests/sql/predicates.rs

that sure looks better 👍

alamb · 2022-09-15T20:40:17Z

datafusion/optimizer/src/reduce_cross_join.rs

I agree that this is a concern - and if it turns out to be ok, I think we should document why it can be global (across the entire plan) rather than per-Join

alamb · 2022-09-15T20:43:50Z

datafusion/optimizer/tests/integration-test.rs

It seems like having a parallel list of optimizer passes is not a good idea as it can get out of sync with the main list in context.rs

As this was not something introduced in this PR we shouldn't fix it here, but I think filing a follow on issue to fix this would be good. I can do so

datafusion/optimizer/src/reduce_cross_join.rs

alamb · 2022-09-15T20:55:12Z

datafusion/optimizer/src/reduce_cross_join.rs

FYIW you can write this more concisely using https://docs.rs/datafusion/12.0.0/datafusion/prelude/enum.Expr.html#method.and I think.

e.g. something like

Suggested change

binary_expr(

col("t1.a").eq(col("t2.a")),

And,

col("t2.c").lt(lit(20u32)),

),

col("t1.a").eq(col("t2.a"))

.and(

col("t2.c").lt(lit(20u32)),

),

alamb · 2022-09-15T20:56:13Z

datafusion/optimizer/src/reduce_cross_join.rs

alamb · 2022-09-15T20:58:06Z

datafusion/optimizer/src/reduce_cross_join.rs

This is a good test of `((t1.a = t2.a) AND (t2.c < 15)) OR ((t1.a = t2.a) OR (t2.c = 32))

Can you please also add a test that has the predicate(t1.a = t2.a) OR (t1.b = t2.b) showing it is not rewritten?

ok, I add more tests

I have added this test and I have added 4 table joins with multiple Filter expr.
With new testcases, I have found few bugs and I have fixed them now.
The testcases are working as expected and as explained.

alamb · 2022-09-15T21:02:27Z

datafusion/optimizer/src/reduce_cross_join.rs

I recommend adding some tests with three tables (not just 2)

DhamoPS · 2022-09-16T17:39:57Z

In predicates.rs, tpch_q19_pull_predicates_to_innerjoin_simplified() is having 3 tables test. I add one more test here as well.

alamb · 2022-09-17T10:58:49Z

To be clear, I think this PR just needs a few more tests, as discussed (and the clippy / fmt tests to be cleaned up) and it will be ready to merge

Thank you so much for your help @DhamoPS

-- Added more testcases and fixed minor fixes -- Addressed review comments

DhamoPS · 2022-09-19T13:35:13Z

I have added more testcases and addressed most of review comments. Please review.
@alamb Please start the CI as well.

alamb

Looks great -- thank you @DhamoPS . This is a pretty epic first contribution.

🏅 for the tests -- I think they will pay off in avoiding tricky join bugs later on, .

alamb · 2022-09-19T21:57:22Z

datafusion/optimizer/src/reduce_cross_join.rs

alamb · 2022-09-19T22:00:16Z

datafusion/optimizer/src/reduce_cross_join.rs

It seems like this particular comment is not addressed (they are still global). However, given the test coverage I think this PR can be merged as is

alamb · 2022-09-20T09:54:52Z

@DhamoPS it looks like there is a clippy failure on this PR -- once that is resolved I think we can merge this PR in

DhamoPS · 2022-09-20T10:17:29Z

@DhamoPS it looks like there is a clippy failure on this PR -- once that is resolved I think we can merge this PR in

yes, I am resolving the clippy failure and Lint failure. I would commit it in couple of hours. Thanks @alamb

DhamoPS · 2022-09-20T14:47:51Z

@alamb I have resolved the last clippy failure. I didn't know that I could run the clippy and lint tests locally.
I am sorry for multiple commits. I will ensure that these tests are successful on my next PR.
Thanks a lot for your help!

alamb · 2022-09-20T17:19:42Z

Thanks a lot for your help!

No problem -- thanks for sticking with it!

Your next PRs will be easier as after we have merged one, the CI checks start automatically on subsequent PRs

DhamoPS · 2022-09-20T18:28:51Z

@alamb Please merge this pull request to master.

alamb · 2022-09-20T18:34:40Z

🚀 really nice first PR -- thanks @DhamoPS !

ursabot · 2022-09-20T18:42:32Z

Benchmark runs are scheduled for baseline = 1261741 and contender = 7f26cff. 7f26cff is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

kmitchener · 2022-09-20T18:53:57Z

This is great @DhamoPS! Congrats on your first merged PR :) Will you be following up with a separate PR to remove the similar code from sql/planner.rs?

DhamoPS · 2022-09-20T19:58:04Z

@kmitchener Thanks. Yes, I will be following up with separate PR for cleanup in sql/planner.rs.
Could you create the git issue to track it?

kmitchener · 2022-09-20T20:19:02Z

@kmitchener Thanks. Yes, I will be following up with separate PR for cleanup in sql/planner.rs. Could you create the git issue to track it?

You should feel free to create an issue yourself when you're ready. I'm interested in the cleanup but I don't have any special ownership here.

Address performance/execution plan of TPCH query 19

9d5771b

* Added the new optimizer rule reduce_cross_join which would convert cross joins to inner joins if the filter has the join predicates for the corresponding tables.

github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Sep 14, 2022

DhamoPS mentioned this pull request Sep 14, 2022

Address performance/execution plan of TPCH query 19 #3334

Closed

Merge branch 'master' into bugfix_crossjoin_78

6f5f6ec

updating plan change after merging with latest master

d4e7a9b

minor fixes based on CI

cd01175

- fixing the build failure in CI -> Rust/ clippy - Fixing the testcases

xudong963 reviewed Sep 15, 2022

View reviewed changes

xudong963 mentioned this pull request Sep 15, 2022

Migrate the cross join -> inner join optimization from the planner to the optimizer #2859

Closed

avantgardnerio reviewed Sep 15, 2022

View reviewed changes

datafusion/core/tests/sql/subqueries.rs Outdated

Copy link

Contributor

avantgardnerio Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for r## strings!

avantgardnerio reviewed Sep 15, 2022

View reviewed changes

datafusion/optimizer/src/reduce_cross_join.rs Outdated

Copy link

Contributor

avantgardnerio Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting nit:

Suggested change

}

} else {

avantgardnerio reviewed Sep 15, 2022

View reviewed changes

avantgardnerio requested changes Sep 15, 2022

View reviewed changes

avantgardnerio reviewed Sep 15, 2022

View reviewed changes

datafusion/optimizer/src/reduce_cross_join.rs Outdated

Copy link

Contributor

avantgardnerio Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work tracking down all the parts that needed to be recursive :)

avantgardnerio reviewed Sep 15, 2022

View reviewed changes

avantgardnerio requested changes Sep 15, 2022

View reviewed changes

avantgardnerio approved these changes Sep 15, 2022

View reviewed changes

improved based on review comments

672091a

alamb reviewed Sep 15, 2022

View reviewed changes

alamb changed the title ~~Address performance/execution plan of TPCH query 19~~ Convert more cross joins to inner joins (Address performance/execution plan of TPCH query 19) Sep 17, 2022

alamb mentioned this pull request Sep 17, 2022

Consolidate optimizer passes in optimizer module for better testing #3524

Closed

Added more testcases

e868595

-- Added more testcases and fixed minor fixes -- Addressed review comments

alamb approved these changes Sep 19, 2022

View reviewed changes

DhamoPS added 2 commits September 20, 2022 16:14

resolved clippy and lint failures

82c81ee

resolved clippy failure

2974c5e

alamb merged commit 7f26cff into apache:master Sep 20, 2022

DhamoPS mentioned this pull request Sep 20, 2022

Clean-up the planner logic for moving join predicates from filters in sql/planner.rs which is moved to optimizer rule reduce_cross_join. #3554

Open

avantgardnerio deleted the bugfix_crossjoin_78 branch September 26, 2022 18:10

HuSen8891 mentioned this pull request Oct 13, 2022

Pushdown single column predicates from ON join clauses #3578

Merged

yuuch mentioned this pull request Oct 28, 2022

Works for cross to inner join #3994

Closed

-                binary_expr(
-                    col("t1.a").eq(col("t2.a")),
-                    And,
-                    col("t2.c").lt(lit(20u32)),
-                ),
+                    col("t1.a").eq(col("t2.a"))
+                    .and(
+                      col("t2.c").lt(lit(20u32)),
+                      ),

Convert more cross joins to inner joins (Address performance/execution plan of TPCH query 19) #3482

Convert more cross joins to inner joins (Address performance/execution plan of TPCH query 19) #3482

Uh oh!

Conversation

DhamoPS commented Sep 14, 2022 • edited by alamb Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are there any user-facing changes?

Uh oh!

DhamoPS commented Sep 14, 2022

Uh oh!

avantgardnerio commented Sep 14, 2022

Uh oh!

avantgardnerio commented Sep 14, 2022

Uh oh!

Dandandan commented Sep 14, 2022

Uh oh!

alamb commented Sep 14, 2022

Uh oh!

avantgardnerio commented Sep 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

avantgardnerio commented Sep 14, 2022

Uh oh!

kmitchener commented Sep 14, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

xudong963 left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

xudong963 Sep 15, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

DhamoPS commented Sep 14, 2022 •

edited by alamb

Loading

avantgardnerio commented Sep 14, 2022 •

edited

Loading

kmitchener commented Sep 14, 2022 •

edited

Loading

xudong963 Sep 15, 2022 •

edited

Loading