-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Convert more cross joins to inner joins (Address performance/execution plan of TPCH query 19) #3482
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
* Added the new optimizer rule reduce_cross_join which would convert cross joins to inner joins if the filter has the join predicates for the corresponding tables.
|
@andygrove or @alamb can you kick off CI please? |
|
Thanks @DhamoPS ! It looks great and I appreciate the tests. I'll review in more detail tonight or tomorrow. |
|
It might be important to note that master regressed in functionality: all the queries used to pass. Some were slow and only accurate to 4 decimals, but now several actually fail. Hopefully q19 isn't one of them. |
I think we should think of ways of adding the tpch queries to the CI? |
🤔 this sounds bad -- I thought we had tests in place to catch this. Clearly we didn't have enough. I also plan to review this PR more carefully tomorrow |
Ah, my bad - this may be behind a flag: #3393 (comment) One man's regression is another's more strict acceptance criteria. Edit: and in any event, it looks like query 19 was unaffected. I should have reviewed which ones they were before speaking up. |
There are tests for several of them here: https://github.com/apache/arrow-datafusion/blob/5f029cc73755b6800217e370ebe0a7b5e8a6a224/datafusion/core/tests/sql/subqueries.rs#L119 |
|
@Dandandan adding verification of TPCH results to the CI would be nice. I was thinking of suggesting that once all the tests pass and we can verify results against the answers. Fixing q19 is one of the last major blockers, besides some decimal issues. Edit: by "add all the tests", I mean that maybe we can load the benchmark data + answers to CI and actually run the tests in the benchmark itself? Anyway, I don't want to pull attention away from this issue. |
- fixing the build failure in CI -> Rust/ clippy - Fixing the testcases
xudong963
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we have the rule in the optimizer, I think the logic of cross join -> inner join in planner can be deleted https://github.com/apache/arrow-datafusion/blob/master/datafusion/sql/src/planner.rs#L631. (It's ok to do it in the next ticket).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add the test in the tests/sql/subqueries.rs file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other TPCH Queries related testcases are in subqueries.rs and so I have added this test also here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, tests in subqueries.rs are about correlated subqueries(you can know by the file name directly), not all tpch queries.
So you need to find a proper place for your tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I move the tests to different file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we pull all the tpch_* queries out into a tcph.rs so they are easier to track.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'll be clearer to highlight those sqls by ``.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for r## strings!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this comment. It implies that this query cannot be optimized, but then the test goes on to assert that it has been converted to an inner join (which seems correct BTW).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a typo. I have corrected it now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems weird to pass &self to a non-member function when this should be able to be made a member of the struct impl itself? (I realize this is just following an existing pattern from other optimizer rules)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I just followed existing pattern. I hope it is fine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting nit:
| } | |
| } else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this parameter is not needed, let's just remove it. This isn't part of the trait, so there is no contractual need to keep it around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. removed the _optimizer_config
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All branches appear to return Ok(()) (including intersect()), so this method should return () not Result().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should be no unwraps in optimizer rules. @andygrove and others have been actively making PRs to remove them, based on some fuzzing results where things that were never supposed to happen kept happening.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest something like:
for (l, r) in possible_join_keys {
let dt = match left_schema.field_from_column(l) {
Ok(f) => f.data_type(),
Err(e) => e.map_err(|e| context!("Field not found!", e))?,
};
if right_schema.field_from_column(r).is_ok() && can_hash(dt) {
join_keys.push((l.clone(), r.clone()));
} else if right_schema.field_from_column(l).is_ok() && can_hash(dt) {
join_keys.push((r.clone(), l.clone()));
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, removed unwraps
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, sorry, I see I accidentally left the comment on an unwrap in a test - I think that's fine BTW. Thanks for removing the others.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ) -> Result<()> { | |
| ) -> () { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! I had no idea this function existed, and would have used it if I knew it did :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work tracking down all the parts that needed to be recursive :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Additional documentation about return values would help here, e.g. what do Some() and None mean in context?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
r## strings would be nice here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless I'm missing something, left is misleading name for this variable - this is just the resultant plan after children have been optimized? There is no right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, renamed it to new_plan
avantgardnerio
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DhamoPS , nice work! I spent a few hours looking it over and trying to break it with some tests. The logic looks solid.
I gave lots of feedback, but I think the only things preventing merge (IMO) are:
- the unwraps should be replaced with proper error handling
- the functions that return a Result that is always Ok should be simplified to reduce cognitive load
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| ) -> () { | |
| ) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to explicitly state in rust that you are returning ()
Codecov Report
@@ Coverage Diff @@
## master #3482 +/- ##
==========================================
+ Coverage 85.75% 85.79% +0.03%
==========================================
Files 299 300 +1
Lines 55273 55513 +240
==========================================
+ Hits 47402 47626 +224
- Misses 7871 7887 +16
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @DhamoPS -- I reviewed the tests carefully and they look very good 👌 I think if we add a few more showing this PR works with some more complicated OR predicates and with more than 2 tables it would be ready to merge.
Thank you @avantgardnerio for your thorough review
I went over the code a bit, and it also looks quite reasonable. I left a few comments but they are all stylistic.
I can't help feeling after this there are three somewhat redundant overlapping areas of the code
- This one
- the planner logic identified by @xudong963 in https://github.com/apache/arrow-datafusion/blob/master/datafusion/sql/src/planner.rs#L630-L681
- The rewrite in RewriteDisjunctivePredicate
Maybe we can work as a follow on to consolidate / remove the redundancy.
All in all very nice work 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to explicitly state in rust that you are returning ()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that sure looks better 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that this is a concern - and if it turns out to be ok, I think we should document why it can be global (across the entire plan) rather than per-Join
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like having a parallel list of optimizer passes is not a good idea as it can get out of sync with the main list in context.rs
As this was not something introduced in this PR we shouldn't fix it here, but I think filing a follow on issue to fix this would be good. I can do so
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYIW you can write this more concisely using https://docs.rs/datafusion/12.0.0/datafusion/prelude/enum.Expr.html#method.and I think.
e.g. something like
| binary_expr( | |
| col("t1.a").eq(col("t2.a")), | |
| And, | |
| col("t2.c").lt(lit(20u32)), | |
| ), | |
| col("t1.a").eq(col("t2.a")) | |
| .and( | |
| col("t2.c").lt(lit(20u32)), | |
| ), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good test of `((t1.a = t2.a) AND (t2.c < 15)) OR ((t1.a = t2.a) OR (t2.c = 32))
Can you please also add a test that has the predicate(t1.a = t2.a) OR (t1.b = t2.b) showing it is not rewritten?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, I add more tests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added this test and I have added 4 table joins with multiple Filter expr.
With new testcases, I have found few bugs and I have fixed them now.
The testcases are working as expected and as explained.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recommend adding some tests with three tables (not just 2)
|
In predicates.rs, tpch_q19_pull_predicates_to_innerjoin_simplified() is having 3 tables test. I add one more test here as well. |
|
To be clear, I think this PR just needs a few more tests, as discussed (and the clippy / fmt tests to be cleaned up) and it will be ready to merge Thank you so much for your help @DhamoPS |
-- Added more testcases and fixed minor fixes -- Addressed review comments
|
I have added more testcases and addressed most of review comments. Please review. |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great -- thank you @DhamoPS . This is a pretty epic first contribution.
🏅 for the tests -- I think they will pay off in avoiding tricky join bugs later on, .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like this particular comment is not addressed (they are still global). However, given the test coverage I think this PR can be merged as is
|
@DhamoPS it looks like there is a clippy failure on this PR -- once that is resolved I think we can merge this PR in |
|
@alamb I have resolved the last clippy failure. I didn't know that I could run the clippy and lint tests locally. |
No problem -- thanks for sticking with it! Your next PRs will be easier as after we have merged one, the CI checks start automatically on subsequent PRs |
|
@alamb Please merge this pull request to master. |
|
🚀 really nice first PR -- thanks @DhamoPS ! |
|
Benchmark runs are scheduled for baseline = 1261741 and contender = 7f26cff. 7f26cff is a master commit associated with this PR. Results will be available as each benchmark for each run completes. |
|
This is great @DhamoPS! Congrats on your first merged PR :) Will you be following up with a separate PR to remove the similar code from sql/planner.rs? |
|
@kmitchener Thanks. Yes, I will be following up with separate PR for cleanup in sql/planner.rs. |
You should feel free to create an issue yourself when you're ready. I'm interested in the cleanup but I don't have any special ownership here. |
Which issue does this PR close?
-- reduce_cross_join.rs is the new optimizer rule added to pull the
Joined predicates from Filters and convert cross_join into inner_join
-- tests/sql/subqueries.rs::added testcases for the above fix
Closes #78
Closes #2859
Rationale for this change
In
[PR#3334](#3334), I have tried to fix this issue in planner.rs. But to handle DataFrameAPI queries, we have decided to move this rule to optimizer.What changes are included in this PR?
Logs of TPCH Q19:
arrow-datafusion/target/release/tpch benchmark datafusion --iterations 3 --path ./data --format tbl --query 19 --batch-size 4096`
Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 19, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "./data", file_format: "tbl", mem_table: false, output_path: None }
Query 19 iteration 0 took 2640.2 ms and returned 1 rows
Query 19 iteration 1 took 2099.2 ms and returned 1 rows
Query 19 iteration 2 took 2199.9 ms and returned 1 rows
Query 19 avg time: 2313.08 ms
Are there any user-facing changes?
No