Reproducer tests for #18380 (resorting sorted inputs) #18352

rgehan · 2025-10-29T12:01:27Z

Which issue does this PR close?

None, but relates to issue #9898

Rationale for this change

N/A

What changes are included in this PR?

This PR adds reproducer tests demonstrating suboptimal optimizations on plans that mix pre-sorted Parquet files and SortExec under a UNION.

Two sets of tests are included:

Unit tests in datafusion/core/tests/physical_optimizer/enforce_sorting.rs
E2E-ish tests in datafusion/core/tests/dataframe/mod.rs, starting from logical plans simulating our use-case

Note

These tests pass with the changes from #9867

Are these changes tested?

N/A

Are there any user-facing changes?

N/A

rgehan · 2025-10-29T12:08:58Z

datafusion/core/tests/dataframe/mod.rs

+        "sorted",
+        &format!("{testdata}/alltypes_tiny_pages.parquet"),
+        ParquetReadOptions::default()
+            .file_sort_order(vec![vec![col("id").sort(true, false)]]),


(Sidenote: Interestingly, with nulls_first: true (L3074 too), even with the fixes from #9867, the plan includes an extra SortExec node that re-sorts with nulls last. I'm not sure whether that's on purpose, or if there's another issue)

Do you know whether the file is actually sorted, or did you just add this call to make the planner treat the file as sorted?

The file is not actually sorted, no, but I was hoping this was a valid way of making the planner think it is and plan accordingly.

Can this cause issues?

Likely not, but I am not entirely sure whether we do anything special with Parquet files.
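For reference, `col("id").sort(true, false)` declares ascending order with nulls last, which matches the `ASC NULLS LAST` shown in the plans in this PR. The sketch below uses only the standard library to show what it would mean for data to actually satisfy such a declaration. `cmp_nullable` and `is_sorted_as_declared` are hypothetical helper names for illustration, not DataFusion APIs.

```rust
use std::cmp::Ordering;

/// Compare two nullable values under (asc, nulls_first) semantics.
/// Null placement is decided by `nulls_first` alone; `asc` only
/// affects how two non-null values compare.
fn cmp_nullable(a: &Option<i64>, b: &Option<i64>, asc: bool, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            if asc { x.cmp(y) } else { y.cmp(x) }
        }
    }
}

/// Returns true when every adjacent pair respects the declared ordering.
fn is_sorted_as_declared(values: &[Option<i64>], asc: bool, nulls_first: bool) -> bool {
    values
        .windows(2)
        .all(|w| cmp_nullable(&w[0], &w[1], asc, nulls_first) != Ordering::Greater)
}
```

A check like this only matters if the query results are consumed. A test that only inspects the plan text never looks at the row order.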

NGA-TRAN

Thanks for the reproducer, @rgehan. Now I understand what you are trying to do. I have a few comments to make the tests clearer.

NGA-TRAN · 2025-10-29T17:34:09Z

datafusion/core/tests/dataframe/mod.rs

+        "sorted",
+        &format!("{testdata}/alltypes_tiny_pages.parquet"),
+        ParquetReadOptions::default()
+            .file_sort_order(vec![vec![col("id").sort(true, false)]]),


Do you know whether the file is actually sorted, or did you just add this call to make the planner treat the file as sorted?

NGA-TRAN · 2025-10-29T18:14:55Z

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

+          SortExec: expr=[nullable_col@0 ASC], preserve_partitioning=[false]
+            DataSourceExec: file_groups={1 group: [[x]]}, projection=[nullable_col, non_nullable_col], file_type=parquet
+          DataSourceExec: file_groups={1 group: [[x]]}, projection=[nullable_col, non_nullable_col], output_ordering=[nullable_col@0 ASC], file_type=parquet
+    ");


So this test tells us that repartition_sorts = true is needed for this to work. It works as expected: DataFusion understands that both inputs of the Union are sorted and only does the merge after that. It will fail if repartition_sorts = false.
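To make "only does the merge" concrete, here is a toy, standard-library-only sketch contrasting a linear merge of two already-sorted inputs with concatenating and re-sorting them. It is an illustration of the idea, not DataFusion's operator code, and the function names are hypothetical.

```rust
/// Merge two already-sorted slices in O(n + m) without re-sorting.
fn merge_sorted(left: &[i64], right: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(left.len() + right.len());
    let (mut i, mut j) = (0, 0);
    while i < left.len() && j < right.len() {
        // Take the smaller head element; ties go to the left input.
        if left[i] <= right[j] {
            out.push(left[i]);
            i += 1;
        } else {
            out.push(right[j]);
            j += 1;
        }
    }
    // At most one of these tails is non-empty.
    out.extend_from_slice(&left[i..]);
    out.extend_from_slice(&right[j..]);
    out
}

/// The suboptimal alternative: concatenate, then sort everything again,
/// ignoring the fact that both inputs were already ordered.
fn concat_then_sort(left: &[i64], right: &[i64]) -> Vec<i64> {
    let mut out = [left, right].concat();
    out.sort();
    out
}
```

Both functions return the same rows in the same order when the inputs are sorted. The difference is the work done to get there.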

NGA-TRAN · 2025-10-29T18:15:15Z

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

+async fn reproducer_with_repartition_sorts_false() -> Result<()> {
+    reproducer_impl(false).await?;
+
+    // 💥 Doesn't pass, and generates this plan:


It is expected that this test fails here. See

datafusion/datafusion/common/src/config.rs

Line 919 in e432d55

pub repartition_sorts: bool, default = true

NGA-TRAN · 2025-10-29T18:39:27Z

datafusion/core/tests/dataframe/mod.rs

+    |               |         SortExec: expr=[id@0 ASC NULLS LAST], preserve_partitioning=[false]                                                                                                                                     |
+    |               |           DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], file_type=parquet                                      |
+    |               |                                                                                                                                                                                                                 |
+    +---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+


👍 Since the two input streams of the union are sorted, the two streams coming out of the partial aggregate are also sorted. Thus, we should only do the merge.

And this only happens with repartition_sorts = true.

datafusion/datafusion/common/src/config.rs

Line 919 in e432d55

pub repartition_sorts: bool, default = true

NGA-TRAN · 2025-10-29T18:45:00Z

datafusion/core/tests/dataframe/mod.rs

+    //       AggregateExec: mode=Partial, gby=[id@0 as id], aggr=[]
+    //         UnionExec
+    //           DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], output_ordering=[id@0 ASC NULLS LAST], file_type=parquet
+    //           DataSourceExec: file_groups={1 group: [[{testdata}/alltypes_tiny_pages.parquet]]}, projection=[id], file_type=parquet


Since repartition_sorts = false, this test fails as expected. I would modify the test so that it passes instead of failing. See my comment on reproducer_e2e_impl below.

NGA-TRAN · 2025-10-29T18:47:50Z

datafusion/core/tests/dataframe/mod.rs

+    Ok(())
+}
+
+async fn reproducer_e2e_impl(repartition_sorts: bool) -> Result<()> {


Can you modify this function to accept 2 parameters: repartition_sorts and expected_plan? Then, at the comparison step, compare against expected_plan. This will help make the tests clearer. You have 4 tests in this PR and only one should fail; the other 3 should pass.
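The shape of the suggested refactor could look roughly like the sketch below. It is standard-library-only: `render_plan` is a hypothetical stand-in that just echoes the flag, where a real test would build and optimize the plan, and `check_plan` is an assumed name for the shared test body.

```rust
/// Hypothetical stand-in for "build the plan and render it as text".
/// A real test would run the optimizer; this toy only echoes the flag.
fn render_plan(repartition_sorts: bool) -> String {
    format!("ToyPlan: repartition_sorts={}\n", repartition_sorts)
}

/// Shared test body: the caller supplies the flag and the plan it expects,
/// so each test states its own expectation explicitly.
fn check_plan(repartition_sorts: bool, expected_plan: &str) -> Result<(), String> {
    let actual = render_plan(repartition_sorts);
    if actual.trim() == expected_plan.trim() {
        Ok(())
    } else {
        Err(format!(
            "plan mismatch\nexpected:\n{}\nactual:\n{}",
            expected_plan.trim(),
            actual.trim()
        ))
    }
}
```

Each test then becomes a one-line call such as `check_plan(true, "...")`, and a test whose expected plan is not yet produced can be marked as ignored on its own.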

NGA-TRAN · 2025-10-29T18:51:18Z

datafusion/core/tests/physical_optimizer/enforce_sorting.rs

+    Ok(())
+}
+
+async fn reproducer_impl(repartition_sorts: bool) -> Result<()> {


Similarly, can you have this function accept 2 inputs so that both of its tests pass?

rgehan · 2025-10-30T09:52:39Z

@NGA-TRAN thanks a lot for the review and clarifications! I've adapted the PR to hopefully make it clearer what's expected / what's not. I'll also create an issue

rgehan · 2025-10-30T10:57:20Z

Here's the corresponding feature request: #18380

NGA-TRAN · 2025-10-30T11:10:25Z

@NGA-TRAN thanks a lot for the review and clarifications! I've adapted the PR to hopefully make it clearer what's expected / what's not. I'll also create an issue

Once you've updated the PR so that three tests pass and one fails, mark the failing test as ignored (with comment) and move it to "ready for review." It's great that the repro is included in the repo.

NGA-TRAN

Looks great. Just a few minor comments from me.

@alamb This is a good reproducer for a nice optimization request

alamb

Thanks @rgehan and @NGA-TRAN

alamb · 2025-10-30T15:11:37Z

datafusion/core/tests/dataframe/mod.rs

+    Ok(())
+}
+
+#[ignore] // See https://github.com/apache/datafusion/issues/18380


why are we ignoring the test when the plan is also commented out in the body?

The plan is not commented out.
I asked @rgehan to add the relevant part of the verbose explain output so we can see when the issue happens; that is the commented-out plan.

rgehan · 2025-10-30T15:32:06Z

@NGA-TRAN I've applied the requested changes 👍

rgehan · 2025-11-01T11:22:57Z

I fixed the formatting/clippy issues, but one of the tests is still failing for formatting reasons.

The plan in the snapshot refers to a machine-dependent filepath, so I had tweaked the test logic to replace it with some constant, but failed to realize this would impact the formatting of the snapshot too (since this is a space-padded table).

rgehan · 2025-11-01T11:54:14Z

I tweaked the tests to use simpler snapshots (no more table format), hopefully this should pass on the CI now 👍
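The formatting mismatch described in this exchange can be reproduced with a small standard-library-only sketch. The helper names and the example path used in the usage note are hypothetical. The sketch shows why replacing a path inside a space-padded table row changes the row's width, and one possible way to compare only the plan text.

```rust
/// Render one row of a space-padded, pipe-delimited table.
/// `width` is the inner width the cell is left-aligned and padded to.
fn padded_row(cell: &str, width: usize) -> String {
    format!("| {:<width$} |", cell, width = width)
}

/// Replace a machine-dependent path with a constant placeholder.
/// The padding that was computed for the original path is left untouched,
/// so the row no longer lines up with a row rendered from the placeholder.
fn normalize_path(text: &str, real: &str, placeholder: &str) -> String {
    text.replace(real, placeholder)
}

/// Drop table borders and padding so only the cell text is compared.
fn strip_table(row: &str) -> String {
    row.trim().trim_matches('|').trim().to_string()
}
```

For example, a row rendered with a long checkout path and then normalized to `{testdata}` is shorter than the same row rendered directly from `{testdata}`, so a byte-for-byte snapshot comparison fails even though `strip_table` yields identical text for both.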

alamb · 2025-11-03T21:48:09Z

Thanks again @rgehan and @NGA-TRAN

rgehan added 2 commits October 29, 2025 12:51

Add reproducer tests

a4288a2

Add e2e reproducer test

fcd4d9f

github-actions bot added the core Core DataFusion crate label Oct 29, 2025

rgehan mentioned this pull request Oct 29, 2025

Teach UnionExec to require its inputs sorted #9898

Open

NGA-TRAN reviewed Oct 29, 2025

review: make assertions cleaner / fix expectations

b44cf08

rgehan mentioned this pull request Oct 30, 2025

Preserving sort on UnionExec inputs instead of introducing a suboptimal top-level sort #18380

Open

rgehan added 3 commits October 30, 2025 12:49

review: add verbose explain to comment

6c31946

review: ignore test, add reference to issue

566da90

extra: rename tests in prevision of merging

b09a26e

rgehan changed the title ~~Reproducer tests for #9898~~ Reproducer tests for #18380 Oct 30, 2025

rgehan marked this pull request as ready for review October 30, 2025 12:04

rgehan requested a review from NGA-TRAN October 30, 2025 12:05

NGA-TRAN approved these changes Oct 30, 2025

alamb approved these changes Oct 30, 2025

alamb changed the title ~~Reproducer tests for #18380~~ Reproducer tests for #18380 (resorting sorted inputs) Oct 30, 2025

rgehan and others added 2 commits October 30, 2025 16:17

review: add suggested comments

602ac89

Co-authored-by: Nga Tran <[email protected]>

review: full explain in comment

7f153b9

rgehan added 3 commits October 30, 2025 17:22

review: back to the excerpt

83e8d15

cargo fmt

f98d6ae

fix clippy issue

fa07937

simplify snapshot formatting to avoid mismatches

1fb2219

alamb added this pull request to the merge queue Nov 3, 2025

Merged via the queue into apache:main with commit e4f2b49 Nov 3, 2025
28 checks passed
