-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Reproducer tests for #18380 (resorting sorted inputs) #18352
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| "sorted", | ||
| &format!("{testdata}/alltypes_tiny_pages.parquet"), | ||
| ParquetReadOptions::default() | ||
| .file_sort_order(vec![vec![col("id").sort(true, false)]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Sidenote: Interestingly, with nulls_first: true (L3074 too), even with the fixes from #9867, the plan includes an extra SortExec node that re-sorts with nulls last. I'm not sure whether that's on purpose, or if there's another issue)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know whether the file is actually sorted or you just add this function to trick the planner to plan this file as it is sorted?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The file is not actually sorted no, but I was hoping this was a valid way of making the planner think it is, and plan accordingly.
Can this cause issues?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this cause issues?
Likely not but I am not % sure if we do anything special with parquet file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reproducer @rgehan . Now I understand what you are trying to do. I have a few comments to make the tests clearer
| "sorted", | ||
| &format!("{testdata}/alltypes_tiny_pages.parquet"), | ||
| ParquetReadOptions::default() | ||
| .file_sort_order(vec![vec![col("id").sort(true, false)]]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know whether the file is actually sorted or you just add this function to trick the planner to plan this file as it is sorted?
| SortExec: expr=[nullable_col@0 ASC], preserve_partitioning=[false] | ||
| DataSourceExec: file_groups={1 group: [[x]]}, projection=[nullable_col, non_nullable_col], file_type=parquet | ||
| DataSourceExec: file_groups={1 group: [[x]]}, projection=[nullable_col, non_nullable_col], output_ordering=[nullable_col@0 ASC], file_type=parquet | ||
| "); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this test is to let us know we need repartition_sorts = true for it to work. This works as expected: DF understands the 2 input of the Union is sorted and only do the merge after that. It will fail if repartition_sorts = false
| async fn reproducer_with_repartition_sorts_false() -> Result<()> { | ||
| reproducer_impl(false).await?; | ||
|
|
||
| // 💥 Doesn't pass, and generates this plan: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is expected that this test fails here. See
datafusion/datafusion/common/src/config.rs
Line 919 in e432d55
| pub repartition_sorts: bool, default = true |
| | | SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false] | | ||
| | | DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], file_type=parquet | | ||
| | | | | ||
| +---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Since the 2 input streams out of the union are sorted, the 2 streams coming out from Partial Aggregate are also sorted. Thus, we should only do the merge
And this only happens with repartition_sorts = true
datafusion/datafusion/common/src/config.rs
Line 919 in e432d55
| pub repartition_sorts: bool, default = true |
| // AggregateExec: mode=Partial, gby=[id@0 as id], aggr=[] | ||
| // UnionExec | ||
| // DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], output_ordering=[id@0 ASC NULLS LAST], file_type=parquet | ||
| // DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], file_type=parquet |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since repartition_sorts= false. This test fails as expected. I would modify the test to make it pass instead of not pass. See my comment for reproducer_e2e_impl below
| Ok(()) | ||
| } | ||
|
|
||
| async fn reproducer_e2e_impl(repartition_sorts: bool) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you modify this function to accept 2 parameters: repartition_sorts and expected_plan? Then at the comparison step, you compare with expected_plan. This will help make the tests clearer . You have 4 tests in this PR and only one should fail. The other 3 will pass
| Ok(()) | ||
| } | ||
|
|
||
| async fn reproducer_impl(repartition_sorts: bool) -> Result<()> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, can you have this function to accept 2 inputs and have all its 2 tests passed
|
@NGA-TRAN thanks a lot for the review and clarifications! I've adapted the PR to hopefully make it clearer what's expected / what's not. I'll also create an issue |
|
Here's the corresponding feature request: #18380 |
Once you've updated the PR so that three tests pass and one fails, mark the failing test as ignored (with comment) and move it to "ready for review." It's great that the repro is included in the repo. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Just a few minor comment from me.
@alamb This is a good reproducer for a nice optimization request
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| Ok(()) | ||
| } | ||
|
|
||
| #[ignore] // See https://github.com/apache/datafusion/issues/18380 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are we ignoring the test when the plan is also commented out in the body?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The plan is not commented out.
I asked @rgehan to add the relevant part of explain verbose for us see when the issue happens that was the commented out plan
Co-authored-by: Nga Tran <[email protected]>
|
@NGA-TRAN I've applied the requested changes 👍 |
|
I fixed the formatting/clippy issues, but one of the tests is still failing for formatting reasons. The plan in the snapshot refers to a machine-dependent filepath, so I had tweaked the test logic to replace it with some constant, but failed to realize this would impact the formatting of the snapshot too (since this is a space-padded table). |
|
I tweaked the tests to use simpler snapshots (no more table format), hopefully this should pass on the CI now 👍 |
Which issue does this PR close?
None, but relates to issue #9898
Rationale for this change
N/A
What changes are included in this PR?
This PR adds reproducer tests demonstrating issues with suboptimal optimizations performed on plans that mix pre-sorted parquets and
SortExecunder an UNION.Two sets of tests included:
datafusion/core/tests/physical_optimizer/enforce_sorting.rsdatafusion/core/tests/dataframe/mod.rs, starting from logical plans simulating our use-caseNote
These tests pass with the changes from #9867
Are these changes tested?
N/A
Are there any user-facing changes?
N/A