Skip to content

Conversation

@DhamoPS
Copy link
Contributor

@DhamoPS DhamoPS commented Sep 14, 2022

* Added the new optimizer rule reduce_cross_join which would convert cross joins to inner joins if the filter has the join predicates for the corresponding tables.
* Join Predicates which are part of OR expressions of FILTER PREDICATES, were not pulled properly. Due to that, CROSS JOINs are chosen for those joins instead of innerjoin and causes query to run longer. Actually TPCH Q19 was taking more than 2000 sec in my laptop. With Fix, TPCH Q19 is taking ~3 sec.

Which issue does this PR close?

-- reduce_cross_join.rs is the new optimizer rule added to pull the
Joined predicates from Filters and convert cross_join into inner_join
-- tests/sql/subqueries.rs::added testcases for the above fix

Closes #78
Closes #2859

Rationale for this change

In [PR#3334](#3334), I have tried to fix this issue in planner.rs. But to handle DataFrameAPI queries, we have decided to move this rule to optimizer.

What changes are included in this PR?

  • new optimizer rule reduce_cross_join
  • unit test cases for the optimizer rule
  • test cases by running SQL stmts

Logs of TPCH Q19:
arrow-datafusion/target/release/tpch benchmark datafusion --iterations 3 --path ./data --format tbl --query 19 --batch-size 4096`

Running benchmarks with the following options: DataFusionBenchmarkOpt { query: 19, debug: false, iterations: 3, partitions: 2, batch_size: 4096, path: "./data", file_format: "tbl", mem_table: false, output_path: None }
Query 19 iteration 0 took 2640.2 ms and returned 1 rows
Query 19 iteration 1 took 2099.2 ms and returned 1 rows
Query 19 iteration 2 took 2199.9 ms and returned 1 rows
Query 19 avg time: 2313.08 ms

Are there any user-facing changes?

No

    * Added the new optimizer rule reduce_cross_join which would convert cross joins to inner joins if the filter has the join predicates for the corresponding tables.
@github-actions github-actions bot added core Core DataFusion crate optimizer Optimizer rules labels Sep 14, 2022
@DhamoPS
Copy link
Contributor Author

DhamoPS commented Sep 14, 2022

@andygrove or @alamb can you kick off CI please?

@avantgardnerio
Copy link
Contributor

Thanks @DhamoPS ! It looks great and I appreciate the tests. I'll review in more detail tonight or tomorrow.

@avantgardnerio
Copy link
Contributor

It might be important to note that master regressed in functionality: all the queries used to pass. Some were slow and only accurate to 4 decimals, but now several actually fail. Hopefully q19 isn't one of them.

@Dandandan
Copy link
Contributor

It might be important to note that master regressed in functionality: all the queries used to pass. Some were slow and only accurate to 4 decimals, but now several actually fail. Hopefully q19 isn't one of them.

I think we should think of ways of adding the tpch queries to the CI?

@alamb
Copy link
Contributor

alamb commented Sep 14, 2022

It might be important to note that master regressed in functionality: all the queries used to pass. Some were slow and only accurate to 4 decimals, but now several actually fail. Hopefully q19 isn't one of them.

🤔 this sounds bad -- I thought we had tests in place to catch this. Clearly we didn't have enough.

I also plan to review this PR more carefully tomorrow

@avantgardnerio
Copy link
Contributor

avantgardnerio commented Sep 14, 2022

this sounds bad

Ah, my bad - this may be behind a flag: #3393 (comment)

One man's regression is another's more strict acceptance criteria.

Edit: and in any event, it looks like query 19 was unaffected. I should have reviewed which ones they were before speaking up.

@avantgardnerio
Copy link
Contributor

I think we should think of ways of adding the tpch queries to the CI?

There are tests for several of them here: https://github.com/apache/arrow-datafusion/blob/5f029cc73755b6800217e370ebe0a7b5e8a6a224/datafusion/core/tests/sql/subqueries.rs#L119

@kmitchener
Copy link
Contributor

kmitchener commented Sep 14, 2022

@Dandandan adding verification of TPCH results to the CI would be nice. I was thinking of suggesting that once all the tests pass and we can verify results against the answers. Fixing q19 is one of the last major blockers, besides some decimal issues.

Edit: by "add all the tests", I mean that maybe we can load the benchmark data + answers to CI and actually run the tests in the benchmark itself? Anyway, I don't want to pull attention away from this issue.

- fixing the build failure in CI -> Rust/ clippy
- Fixing the testcases
Copy link
Member

@xudong963 xudong963 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we have the rule in the optimizer, I think the logic of cross join -> inner join in planner can be deleted https://github.com/apache/arrow-datafusion/blob/master/datafusion/sql/src/planner.rs#L631. (It's ok to do it in the next ticket).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why add the test in the tests/sql/subqueries.rs file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other TPCH Queries related testcases are in subqueries.rs and so I have added this test also here.

Copy link
Member

@xudong963 xudong963 Sep 15, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In fact, tests in subqueries.rs are about correlated subqueries(you can know by the file name directly), not all tpch queries.

So you need to find a proper place for your tests.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I move the tests to different file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps we pull all the tpch_* queries out into a tcph.rs so they are easier to track.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be clearer to highlight those sqls by ``.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for r## strings!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this comment. It implies that this query cannot be optimized, but then the test goes on to assert that it has been converted to an inner join (which seems correct BTW).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a typo. I have corrected it now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems weird to pass &self to a non-member function when this should be able to be made a member of the struct impl itself? (I realize this is just following an existing pattern from other optimizer rules)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I just followed existing pattern. I hope it is fine.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Formatting nit:

Suggested change
}
} else {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this parameter is not needed, let's just remove it. This isn't part of the trait, so there is no contractual need to keep it around.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok. removed the _optimizer_config

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All branches appear to return Ok(()) (including intersect()), so this method should return () not Result().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should be no unwraps in optimizer rules. @andygrove and others have been actively making PRs to remove them, based on some fuzzing results where things that were never supposed to happen kept happening.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd suggest something like:

                for (l, r) in possible_join_keys {
                    let dt = match left_schema.field_from_column(l) {
                        Ok(f) => f.data_type(),
                        Err(e) => e.map_err(|e| context!("Field not found!", e))?,
                    };
                    if right_schema.field_from_column(r).is_ok() && can_hash(dt) {
                        join_keys.push((l.clone(), r.clone()));
                    } else if right_schema.field_from_column(l).is_ok() && can_hash(dt) {
                        join_keys.push((r.clone(), l.clone()));
                    }
                }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, removed unwraps

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, sorry, I see I accidentally left the comment on an unwrap in a test - I think that's fine BTW. Thanks for removing the others.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> Result<()> {
) -> () {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I had no idea this function existed, and would have used it if I knew it did :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work tracking down all the parts that needed to be recursive :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional documentation about return values would help here, e.g. what do Some() and None mean in context?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r## strings would be nice here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless I'm missing something, left is misleading name for this variable - this is just the resultant plan after children have been optimized? There is no right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, renamed it to new_plan

Copy link
Contributor

@avantgardnerio avantgardnerio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DhamoPS , nice work! I spent a few hours looking it over and trying to break it with some tests. The logic looks solid.

I gave lots of feedback, but I think the only things preventing merge (IMO) are:

  • the unwraps should be replaced with proper error handling
  • the functions that return a Result that is always Ok should be simplified to reduce cognitive load

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
) -> () {
) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to explicitly state in rust that you are returning ()

@codecov-commenter
Copy link

Codecov Report

Merging #3482 (672091a) into master (f3bb84f) will increase coverage by 0.03%.
The diff coverage is 95.04%.

@@            Coverage Diff             @@
##           master    #3482      +/-   ##
==========================================
+ Coverage   85.75%   85.79%   +0.03%     
==========================================
  Files         299      300       +1     
  Lines       55273    55513     +240     
==========================================
+ Hits        47402    47626     +224     
- Misses       7871     7887      +16     
Impacted Files Coverage Δ
datafusion/optimizer/src/reduce_cross_join.rs 94.47% <94.47%> (ø)
datafusion/core/src/execution/context.rs 79.33% <100.00%> (+0.02%) ⬆️
datafusion/core/tests/sql/predicates.rs 100.00% <100.00%> (ø)
datafusion/optimizer/tests/integration-test.rs 89.04% <100.00%> (+0.15%) ⬆️
datafusion/core/src/config.rs 89.65% <0.00%> (-4.10%) ⬇️
datafusion/expr/src/logical_plan/plan.rs 77.52% <0.00%> (-0.34%) ⬇️
...n/core/src/physical_plan/file_format/row_filter.rs 87.30% <0.00%> (-0.31%) ⬇️
datafusion/common/src/scalar.rs 85.11% <0.00%> (-0.07%) ⬇️
datafusion/proto/src/lib.rs 94.14% <0.00%> (+<0.01%) ⬆️
datafusion/sql/src/planner.rs 80.97% <0.00%> (+0.02%) ⬆️
... and 5 more

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @DhamoPS -- I reviewed the tests carefully and they look very good 👌 I think if we add a few more showing this PR works with some more complicated OR predicates and with more than 2 tables it would be ready to merge.

Thank you @avantgardnerio for your thorough review

I went over the code a bit, and it also looks quite reasonable. I left a few comments but they are all stylistic.

I can't help feeling after this there are three somewhat redundant overlapping areas of the code

  1. This one
  2. the planner logic identified by @xudong963 in https://github.com/apache/arrow-datafusion/blob/master/datafusion/sql/src/planner.rs#L630-L681
  3. The rewrite in RewriteDisjunctivePredicate

Maybe we can work as a follow on to consolidate / remove the redundancy.

All in all very nice work 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think you need to explicitly state in rust that you are returning ()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that sure looks better 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that this is a concern - and if it turns out to be ok, I think we should document why it can be global (across the entire plan) rather than per-Join

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like having a parallel list of optimizer passes is not a good idea as it can get out of sync with the main list in context.rs

As this was not something introduced in this PR we shouldn't fix it here, but I think filing a follow on issue to fix this would be good. I can do so

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 357 to 361
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYIW you can write this more concisely using https://docs.rs/datafusion/12.0.0/datafusion/prelude/enum.Expr.html#method.and I think.

e.g. something like

Suggested change
binary_expr(
col("t1.a").eq(col("t2.a")),
And,
col("t2.c").lt(lit(20u32)),
),
col("t1.a").eq(col("t2.a"))
.and(
col("t2.c").lt(lit(20u32)),
),

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 423 to 440
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good test of `((t1.a = t2.a) AND (t2.c < 15)) OR ((t1.a = t2.a) OR (t2.c = 32))

Can you please also add a test that has the predicate(t1.a = t2.a) OR (t1.b = t2.b) showing it is not rewritten?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I add more tests

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added this test and I have added 4 table joins with multiple Filter expr.
With new testcases, I have found few bugs and I have fixed them now.
The testcases are working as expected and as explained.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend adding some tests with three tables (not just 2)

@DhamoPS
Copy link
Contributor Author

DhamoPS commented Sep 16, 2022

In predicates.rs, tpch_q19_pull_predicates_to_innerjoin_simplified() is having 3 tables test. I add one more test here as well.

@alamb
Copy link
Contributor

alamb commented Sep 17, 2022

To be clear, I think this PR just needs a few more tests, as discussed (and the clippy / fmt tests to be cleaned up) and it will be ready to merge

Thank you so much for your help @DhamoPS

@alamb alamb changed the title Address performance/execution plan of TPCH query 19 Convert more cross joins to inner joins (Address performance/execution plan of TPCH query 19) Sep 17, 2022
-- Added more testcases and fixed minor fixes
-- Addressed review comments
@DhamoPS
Copy link
Contributor Author

DhamoPS commented Sep 19, 2022

I have added more testcases and addressed most of review comments. Please review.
@alamb Please start the CI as well.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great -- thank you @DhamoPS . This is a pretty epic first contribution.

🏅 for the tests -- I think they will pay off in avoiding tricky join bugs later on, .

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this particular comment is not addressed (they are still global). However, given the test coverage I think this PR can be merged as is

@alamb
Copy link
Contributor

alamb commented Sep 20, 2022

@DhamoPS it looks like there is a clippy failure on this PR -- once that is resolved I think we can merge this PR in

@DhamoPS
Copy link
Contributor Author

DhamoPS commented Sep 20, 2022

@DhamoPS it looks like there is a clippy failure on this PR -- once that is resolved I think we can merge this PR in

yes, I am resolving the clippy failure and Lint failure. I would commit it in couple of hours. Thanks @alamb

@DhamoPS
Copy link
Contributor Author

DhamoPS commented Sep 20, 2022

@alamb I have resolved the last clippy failure. I didn't know that I could run the clippy and lint tests locally.
I am sorry for multiple commits. I will ensure that these tests are successful on my next PR.
Thanks a lot for your help!

@alamb
Copy link
Contributor

alamb commented Sep 20, 2022

Thanks a lot for your help!

No problem -- thanks for sticking with it!

Your next PRs will be easier as after we have merged one, the CI checks start automatically on subsequent PRs

@DhamoPS
Copy link
Contributor Author

DhamoPS commented Sep 20, 2022

@alamb Please merge this pull request to master.

@alamb alamb merged commit 7f26cff into apache:master Sep 20, 2022
@alamb
Copy link
Contributor

alamb commented Sep 20, 2022

🚀 really nice first PR -- thanks @DhamoPS !

@ursabot
Copy link

ursabot commented Sep 20, 2022

Benchmark runs are scheduled for baseline = 1261741 and contender = 7f26cff. 7f26cff is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

@kmitchener
Copy link
Contributor

This is great @DhamoPS! Congrats on your first merged PR :) Will you be following up with a separate PR to remove the similar code from sql/planner.rs?

@DhamoPS
Copy link
Contributor Author

DhamoPS commented Sep 20, 2022

@kmitchener Thanks. Yes, I will be following up with separate PR for cleanup in sql/planner.rs.
Could you create the git issue to track it?

@kmitchener
Copy link
Contributor

@kmitchener Thanks. Yes, I will be following up with separate PR for cleanup in sql/planner.rs. Could you create the git issue to track it?

You should feel free to create an issue yourself when you're ready. I'm interested in the cleanup but I don't have any special ownership here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate optimizer Optimizer rules

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Migrate the cross join -> inner join optimization from the planner to the optimizer Address performance/execution plan of TPCH query 19

8 participants