Enhance/Refactor Ordering Equivalence Properties #7566
Conversation
alamb left a comment:
Thank you very much @mustafasrepo
I am approving this PR based on the tests and their improvements, nice work! I also really like the refactoring you have done to the equivalence logic - I think that makes the code easier to understand ❤️
I am concerned with the use of `unwrap_or` to ignore errors, and I think we should address that (if not in this PR then in a future one -- I can file a ticket to track it).
Also, I think there may be additional optimization opportunities in `FilterExec`, which again could be done as a follow-on PR.
FYI @NGA-TRAN
Thanks again
```rust
fn ordering_equivalence_properties(&self) -> OrderingEquivalenceProperties {
    self.input.ordering_equivalence_properties()
    let stats = self.statistics();
    // Add the columns that have only one value (singleton) after filtering to constants.
```
Why is there a difference between `OrderingEquivalenceProperties` and `EquivalenceProperties`? It seems like using statistics-based equivalence as well as predicate-based equivalence would be relevant to both.
In other words, if the filter has a predicate like `column = 5`, shouldn't `column` be added to the list of constants even if the column had more than one value in the statistics?
> In other words, if the filter has a predicate like `column = 5`, shouldn't `column` be added to the list of constants even if the column had more than one value in the statistics?
In this case, statistics should be able to determine that the column will have a single value (5) from that point onwards, hence I presumed that using statistics is sufficient. However, if there are cases where it is obvious from the predicate that the value is constant but statistics fail to resolve it, we can change this implementation, I think.
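To make the predicate-based alternative concrete, here is a minimal standalone sketch (using a toy expression type, not DataFusion's actual `PhysicalExpr` API) of collecting columns that a conjunctive predicate pins to a single literal. Such columns could be treated as constants after the filter, independent of input statistics:

```rust
// Hypothetical, simplified expression type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Expr {
    Column(String),
    Literal(i64),
    Eq(Box<Expr>, Box<Expr>),
    And(Box<Expr>, Box<Expr>),
}

/// Collect columns that a conjunctive predicate binds to a single literal,
/// e.g. `a = 5 AND b = c` yields ["a"]. These are constant after filtering.
fn constant_columns(pred: &Expr, out: &mut Vec<String>) {
    match pred {
        Expr::And(l, r) => {
            constant_columns(l, out);
            constant_columns(r, out);
        }
        Expr::Eq(l, r) => match (l.as_ref(), r.as_ref()) {
            (Expr::Column(c), Expr::Literal(_)) | (Expr::Literal(_), Expr::Column(c)) => {
                out.push(c.clone())
            }
            _ => {}
        },
        _ => {}
    }
}
```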
```rust
let input_column_stats = match input_stats.column_statistics {
    Some(stats) => stats,
    None => return Statistics::default(),
    None => self
```
this seems like it is more general than just `FilterExec` -- shouldn't all `ExecutionPlan`s return statistics that have unbounded columns in the absence of other information? Maybe we should change `Statistics::default()` to do this 🤔
I agree with you. We can remove the `Statistics::default()` implementation and propagate unbounded columns (in the absence of information) from the source. I will try it.
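A minimal sketch of the "unknown per-column statistics" idea being discussed; the field and method names here are illustrative stand-ins, not DataFusion's exact API:

```rust
// Hypothetical, simplified mirror of the idea above.
#[derive(Debug, Clone, PartialEq)]
struct ColumnStatistics {
    min: Option<i64>,
    max: Option<i64>,
    distinct_count: Option<usize>,
}

impl ColumnStatistics {
    /// An "unbounded" column: nothing is known about its values.
    fn unknown() -> Self {
        ColumnStatistics { min: None, max: None, distinct_count: None }
    }
}

struct Statistics {
    num_rows: Option<usize>,
    column_statistics: Vec<ColumnStatistics>,
}

impl Statistics {
    /// Instead of an empty default, return one unknown entry per schema
    /// column, so downstream rules always see per-column placeholders.
    fn new_unknown(num_columns: usize) -> Self {
        Statistics {
            num_rows: None,
            column_statistics: vec![ColumnStatistics::unknown(); num_columns],
        }
    }
}
```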
```rust
let item = PhysicalSortRequirement::from_sort_exprs(item);
let item = prune_sort_reqs_with_constants(&item, &self.constants);
let ranges = get_compatible_ranges(&normalized_sort_reqs, &item);
let mut offset: i64 = 0;
```
why use an `i64` here? It seems like if `offset` were always a `usize`, the math below would be much simpler, as the `as usize` and `as i64` casts could be avoided.
`offset += head.len() as i64 - range as i64;` can result in a negative number, hence we use `i64`. However, I agree that the casting between `i64` and `usize` is a bit awkward; if we can avoid it, that would be great. I will try to rewrite this logic to remove the casting.
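One way to avoid the signed intermediate, sketched with a hypothetical helper (not DataFusion code): keep the index as `usize` and apply the additions and subtractions separately with checked arithmetic, so underflow surfaces as `None` instead of requiring an `i64` detour:

```rust
/// Compute `base + added - removed` entirely in `usize`, without casting
/// through `i64`. Returns `None` if the subtraction would underflow.
fn shift_index(base: usize, added: usize, removed: usize) -> Option<usize> {
    base.checked_add(added)?.checked_sub(removed)
}
```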
```text
------ProjectionExec: expr=[SUM(t2.t2_int)@1 as SUM(t2.t2_int), t2_id@0 as t2_id]
--ProjectionExec: expr=[t1_id@2 as t1_id, SUM(t2.t2_int)@0 as SUM(t2.t2_int), t2_id@1 as t2_id]
----CoalesceBatchesExec: target_batch_size=8192
------HashJoinExec: mode=Partitioned, join_type=Right, on=[(t2_id@1, t1_id@0)]
```
Is the difference in this plan that the inputs switched order? Do you know why they did?
With the changes in this PR, the filter no longer returns `Statistics::default()`; it propagates `num_rows` up to the `AggregateExec` above. The `join_selection` rule then chooses the filter side as the build side (since the filter side has fewer rows than the other side). Previously, since the number of rows was not propagated, the `join_selection` rule did not change sides. Since we propagate additional information now, the planner can choose a better build side.
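The build-side decision described here can be sketched as below; this is a simplified stand-in for the `join_selection` rule's logic, with hypothetical names, not the actual DataFusion implementation:

```rust
/// Decide whether to swap join inputs so the smaller input becomes the
/// hash-join build side. Without row counts, keep the planned order --
/// which is why propagating `num_rows` through filters matters.
fn should_swap_inputs(left_rows: Option<usize>, right_rows: Option<usize>) -> bool {
    match (left_rows, right_rows) {
        // Swap when the right input is smaller than the current build side.
        (Some(l), Some(r)) => r < l,
        // No statistics available: leave the plan unchanged.
        _ => false,
    }
}
```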
Thank you @mustafasrepo
```text
--CoalesceBatchesExec: target_batch_size=8192
----FilterExec: a@1 = 0 AND b@2 = 0
------RepartitionExec: partitioning=RoundRobinBatch(2), input_partitions=1
--------CsvExec: file_groups={1 group: [[WORKSPACE_ROOT/datafusion/core/tests/data/window_2.csv]]}, projection=[a0, a, b, c, d], output_ordering=[a@1 ASC NULLS LAST, b@2 ASC NULLS LAST, c@3 ASC NULLS LAST], has_header=true
```
This plan without SortExec is awesome. Thanks so much @mustafasrepo for implementing it and @alamb for reviewing it
Which issue does this PR close?
Closes #7162.
Rationale for this change
Ordering equivalence now considers constants during normalization; with this change we can use information available after a filter (when a result is constant after filtering) to produce better plans.
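The normalization idea can be sketched as follows: a column known to be constant after a filter is trivially sorted, so it can be dropped from a sort requirement. This is a simplified stand-in using plain column names rather than DataFusion's sort-requirement types:

```rust
/// Drop columns known to be constant from a sort requirement: a constant
/// column is trivially ordered, so requiring it adds nothing.
fn prune_constants(sort_cols: &[&str], constants: &[&str]) -> Vec<String> {
    sort_cols
        .iter()
        .filter(|c| !constants.contains(*c))
        .map(|c| c.to_string())
        .collect()
}
```

With `b` constant after a `b = 5` filter, a requirement on `[a, b, c]` reduces to `[a, c]`, letting the planner skip a `SortExec` when the input is already ordered on the remaining columns.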
While trying to close #7162, I realized that having `OrderingEquivalence` and `Equivalence` as different instantiations of the same generic is a bit constraining. I took this opportunity to refactor the `OrderingEquivalence` implementation (moving functions to struct methods, and keeping track of constants so that we can use this information during normalization to produce better plans).
What changes are included in this PR?
Are these changes tested?
Yes, new tests are added to show that constants are ignored during sort analysis.
Are there any user-facing changes?