-
Notifications
You must be signed in to change notification settings - Fork 1.8k
#17801 Improve nullability reporting of case expressions #17813
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
d2e613a to
88a911b
Compare
88a911b to
482d0be
Compare
4bbaa82 to
7f8d7cf
Compare
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @pepijnve --
I am not quite sure about this implementation (I am hoping #17628 might solve the problem too with more sophisticated case folding)
However, I verified it does solve the problem with running the benchmarks so from that perspective I think we should proceed
My only real concern is that the newly added tests cover only the new code, and not the "end to end" behavior you tracked down (namely that the case pattern with coalesce changes the nullability).
Would it be possible to add some of the cases as expr simplification tests too? Somewhere like here?
| #[test] |
datafusion/expr/src/expr_schema.rs
Outdated
| when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo")) | ||
| .otherwise(lit(0))?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: You can probably make this more concise using the eq method, something like this:
| when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo")) | |
| .otherwise(lit(0))?, | |
| when(col("foo").eq(lit(5))), col("foo")).otherwise(lit(0))?, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likewise there is Expr::and for ands that could be used as well below
However, the current setup of using and as a prefix is pretty clear too, so maybe what you have here is actually more readable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I missed that. I was looking for prefix versions, and hadn't realised infix ones existed too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up sticking with prefix notation for the boolean combinators and infix for the rest. Using infix for the boolean made it hard to read. I've also added the SQL equivalent as a comment.
datafusion/expr/src/expr_schema.rs
Outdated
| assert!(expr.nullable(&get_schema(false)).unwrap()); | ||
| } | ||
|
|
||
| fn check_nullability( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this a little confusing at first, because it makes an explicit assumption that expr's will never introduce nulls (in order for !expr.nullable(&get_schema(false))?, to be true). So for example, it wouldn't do the right thing with the NULLIF function NULLIF(foo, 25) or something
Maybe some comments would help
| fn check_nullability( | |
| /// Verifies that `expr` has `nullable` nullability when the 'foo' column is | |
| /// null. | |
| /// Also assumes and verifies that `expr` is NOT nullable when 'foo' is NOT null | |
| fn check_nullability( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've reworked the logical plan test cases already to (hopefully) make it more obvious what's going on. I hadn't given this function much thought since it was only a test thing.
datafusion/expr/src/expr_schema.rs
Outdated
| check_nullability( | ||
| when(binary_expr(col("foo"), Operator::Eq, lit(5)), col("foo")) | ||
| .otherwise(lit(0))?, | ||
| true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically this could also be reported as false, given that if foo is null, then the expr resolves to 0 (non null)
> create table t(foo int) as values (0), (NULL), (5);
0 row(s) fetched.
Elapsed 0.001 seconds.
> select foo, CASE WHEN foo=5 THEN foo ELSE 0 END from t;
+------+---------------------------------------------------------+
| foo | CASE WHEN t.foo = Int64(5) THEN t.foo ELSE Int64(0) END |
+------+---------------------------------------------------------+
| 0 | 0 |
| NULL | 0 |
| 5 | 5 |
+------+---------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.002 seconds.However, maybe we can improve that in a follow on PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, the const evaluation is far from complete. I tried to do something good enough for the coalesce simplification initially.
I was wondering the whole time if there isn't some existing null analysis logic somewhere in the codebase we could reuse. The best I could come up with is rewriting the full expression by replacing the then expression with literal NULL and then attempting const evaluation. But that got me worrying about planning overhead again.
datafusion/expr/src/expr_schema.rs
Outdated
| when( | ||
| or( | ||
| is_not_null(col("foo")), | ||
| binary_expr(col("foo"), Operator::Eq, lit(5)), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as above, I don't think this expression can everr be true so this overall expression is still non nullable
datafusion/expr/src/expr_schema.rs
Outdated
| col("foo"), | ||
| ) | ||
| .otherwise(lit(0))?, | ||
| true, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here too -- this expression is not nullabile
> select foo, CASE WHEN foo=5 OR foo IS NOT NULL THEN foo ELSE 0 END from t;
+------+------------------------------------------------------------------------------+
| foo | CASE WHEN t.foo = Int64(5) OR t.foo IS NOT NULL THEN t.foo ELSE Int64(0) END |
+------+------------------------------------------------------------------------------+
| 0 | 0 |
| NULL | 0 |
| 5 | 5 |
+------+------------------------------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.002 seconds.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, if you comment out the filter step (i.e. revert to the pre-patch version) all of these cases are reported as being nullable. The scope of this PR is to get at least some cases that are definitely not nullable reported as such, not ensure all cases are reported correctly.
datafusion/expr/src/expr_schema.rs
Outdated
| .otherwise(lit(0))?, | ||
| true, | ||
| get_schema, | ||
| )?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also please add a check with is_null in the OR clause (which should be null)
Something like the equivalent to
> select foo, CASE WHEN foo=5 OR foo IS NULL THEN foo ELSE 0 END from t;
+------+--------------------------------------------------------------------------+
| foo | CASE WHEN t.foo = Int64(5) OR t.foo IS NULL THEN t.foo ELSE Int64(0) END |
+------+--------------------------------------------------------------------------+
| 0 | 0 |
| NULL | NULL |
| 5 | 5 |
+------+--------------------------------------------------------------------------+
3 row(s) fetched.
Elapsed 0.000 seconds.Like
check_nullability(
when(
or(
binary_expr(col("foo"), Operator::Eq, lit(5)),
is_null(col("foo")),
),
col("foo"),
)
.otherwise(lit(0))?,
true,
get_schema,
)?;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added this test case
I warned you it wasn't very elegant. 😄 I don't think #17628 covers the same thing though. What we're trying to do here is get
I'm not sure what kind of test you have in mind. The end to end case is (admittedly very indirectly) covered by TPC-DS query 75 and the removal of the double optimisation. If you revert the production code change in this PR, but keep the test change you'll see that it fails. For the simplifier itself, I was wondering if there shouldn't be some internal assertions that verifies that the result of calling |
9516262 to
a6ab83a
Compare
|
@alamb thinking about this a bit more. I'm going to struggle expressing myself sufficiently clearly here, but I'll try to explain the idea behind what I'm doing. Maybe that can help us figure out a better way to express the idea. What I'm trying to do is improve the accuracy of the predicate In particular there's one interesting case (pun not intended) which results from the What I attempted to do in this PR is to look at the more general form I tried to implement this in a cheap, but imprecise way. My rationale was that even though it's not perfect, it's an improvement in accuracy over the current code. |
|
I've massaged the logical plan version of the code a bit further already to hopefully clarify what it's doing. I then ran the test cases with logging output rather than assertions before and after the extra filtering to illustrate what's being changed. After the change all tests pass. Before the patch it reports the following |
|
@alamb I've taken the logical expression portion of the PR another step further which ensures correct answers for the expressions you mentioned earlier. I can complete the physical expression portion as well if you like. Unless you tell me this path is a dead end. |
Thank you -- I will try and get to this one asap. Somehow every time i think I am getting the queue of reviews under control there are like 50 new notifications ! It is a good problem to have. |
No pressure from my side. I just write up my notes and move on to the next thing. Async delayed response is fine. |
|
I experimented a bit with the rewrite + const eval approach on the physical expression side of things. While attractive and simple to implement, the downside is that it's going to be very hard to ensure the logical and physical side agree. Logical needs to work without |
|
Than you -- this is on my list of things to review shortly |
datafusion/expr/Cargo.toml
Outdated
| [dependencies] | ||
| arrow = { workspace = true } | ||
| async-trait = { workspace = true } | ||
| bitflags = "2.9.4" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already a dependency via arrow-schema so I assumed it would be ok to use.
| // if the rule is not permitted to change the schema, confirm that it did not change. | ||
| if self.rule.schema_check() && plan.schema() != previous_schema { | ||
| if self.rule.schema_check() | ||
| && !is_allowed_schema_change(previous_schema.as_ref(), plan.schema().as_ref()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change relaxes the schema check slightly. It now allows individual fields to change from nullable to not-nullable which is ok because it only allow a strict subset of the original schema. schema_check has documentation stating that you should disable the schema check entirely if you want to do this. Seemed better to not have to disable checking entirely.
This is one of the reasons we (at least I) love working with you!
I don't necessarily think it is a dead end, I am just trying to balance getting a release out and
So what I think we should do is:
I will try and find enough time to properly review this PR this upcoming week |
|
@pepijnve have you already tried (just) special casing nullability reporting in case expressions like CASE
WHEN <x> IS NOT NULL THEN <x>
WHEN <y> IS NOT NULL THEN <y>
ELSE <z>To be the nullability of (this is basically a special case for the pattern created by the Please forgive me if you have already tried this, I can't remember the entire history I realize this PR is the same idea but implements a more general case of it |
|
BTW I don't think this PR obviates these other PRs -- I think instead it unblocks them / allows them to merge |
…ion (#18567) Note this targets the branch-51 release branch ## Which issue does this PR close? - part of #17558 - resolves #17801 in the 51 release branch ## Rationale for this change - We merged some clever rewrites for `coalesce` and `nvl2` to use `CASE` which are faster and more correct (👏 @chenkovsky @kosiew ) - However, these rewrites cause subtle schema mismatches in some cases planning (b/c the CASE simplification nullability logic can't determine the correct nullability in some cases - see #17801) - @pepijnve has some heroic efforts to fix the schema mismatch in #17813 (comment), but it is non trivial and I am worried about merging it so close to the 51 release and introducing new edge cases ## What changes are included in this PR? 1. Revert #17357 / e5dcc8c 3. Revert #17991 / ea83c26 2. Revert #18191 / 22c4214 2. Cherry-pick 6202254, a test that reproduces the schema mismatch issue (from #18536) 3. Cherry-pick 735cacf, a fix for the benchmarks that regressed due to the revert (from #17833) 4. Update datafusion-testing (see separate PR here) for extended tests (see apache/datafusion-testing#15) ## Are these changes tested? Yes I added a new test ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. -->
I added that since the benchmark is enabled and not double optimized in this PR as well. |
No worries, it's been a twisty path so far. That was v0.1 I think. To implement that you need to add some logic in After adding that it was immediately obvious that a number of other situations are just as trivial to test, so I thought I might as well add those. That's the version in the first commit It was #17813 (comment) that got me thinking that this can be made even more general. The other feedback I got in between then and now is what led to the current state of the PR. I've been familiarising myself with more of the code in Even better would be to also leverage the simplification and const evaluation from |
0fcaeaa to
ada24a2
Compare
ada24a2 to
427fc30
Compare
|
In 427fc30 I've added an SLT that illustrates how better nullability reporting may allow optimisation of certain queries. It's a contrived query, but it's just a test The query get squashed down to while current |
c52c9f4 to
4b879e4
Compare
|
🤖 |
|
I’m trying to see if I can make more use of existing code in the meantime. The three PRs related to (Nullable)Interval came out of that work. |
|
🤖: Benchmark completed Details
|
|
I am working up to reviewing this PR in detail |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for this PR @pepijnve
I think we could merge this code as is as it:
- Fixes the bug
- Is well commented and tested
- We don't have a realistic alternative despite trying pretty hard
I also ran the benchmarks in #17813 (comment) and am happy that this PR does not slow down planning measurably
However, as we have been discussing, I am worried that this PR introduces a substantial amount of additional code that is almost, but not quite, like existing features.
Anything you can do to help reduce the duplication would be most appreciated
| Expr::IsNull(Box::new(expr)) | ||
| } | ||
|
|
||
| /// Create is not null expression |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a nice drive by cleanup
| .collect::<Result<Vec<_>>>()?; | ||
| if then_nullable.contains(&true) { | ||
| Ok(true) | ||
| .filter_map(|(w, t)| { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This code is clear and easy to follow. Very nice
| [dependencies] | ||
| arrow = { workspace = true } | ||
| async-trait = { workspace = true } | ||
| bitflags = "2.0.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is already a dependency of arrow-schema
Thus while this is a new explicit dependency it is not a new-new dependency and thus I think it is fine to add
|
|
||
| /// Checks if the change from `old` schema to `new` is allowed or not. | ||
| /// The current implementation only allows nullability of individual fields to change | ||
| /// from 'nullable' to 'not nullable'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding rationale about why would be helpful here as it may not be immediately obvious
| /// from 'nullable' to 'not nullable'. | |
| /// from 'nullable' to 'not nullable' (that is that the physical expression knows | |
| /// more about its null-ness than its logical counterpart) | |
| /// - `Some(true)` if the predicate evaluates to a truthy value. | ||
| /// - `Some(false)` if the predicate evaluates to a falsy value. | ||
| /// - `None` if the predicate could not be evaluated. | ||
| fn evaluate_predicate(predicate: &Arc<dyn PhysicalExpr>) -> Result<Option<bool>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as a follow on, I feel like this would be a more generally useful function as it is not specific to case.
Maybe we could add it in expression utils or something: https://github.com/apache/datafusion/blob/149d3e9533acadfb92329f5445d7deb72f6f679c/datafusion/physical-expr/src/utils/mod.rs#L268-L267
| return Some(Ok(())); | ||
| } | ||
|
|
||
| // For branches with a nullable 'then' expression, try to determine |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also I found another related part of the code here:
| /// | |
| /// Guarantees are a mapping from an expression (which currently is always a | |
| /// column reference) to a [NullableInterval]. The interval represents the known | |
| /// possible values of the column. Using these known values, expressions are | |
| /// rewritten so they can be simplified using `ConstEvaluator` and `Simplifier`. | |
| /// | |
| /// For example, if we know that a column is not null and has values in the | |
| /// range [1, 10), we can rewrite `x IS NULL` to `false` or `x < 10` to `true`. | |
| /// | |
| /// See a full example in [`ExprSimplifier::with_guarantees()`]. | |
| /// | |
| /// [`ExprSimplifier::with_guarantees()`]: crate::simplify_expressions::expr_simplifier::ExprSimplifier::with_guarantees | |
| pub struct GuaranteeRewriter<'a> { | |
| guarantees: HashMap<&'a Expr, &'a NullableInterval>, | |
| } |
This code seems to be simplifying the predicates based on nullability information.
I am not sure if we can reuse the other parts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's actually what I'm investigating. I was thinking about moving that file over to expr-common and then rewriting using the when expression using a guarantee of then -> NullableInterval::Null. That would remove the need for the certainly_null_expr: Option<&Expr> argument.
The downside is that I then have to materialise the rewritten Expr where that's now done implicitly.
If you agree with the move to expr-common I can give this a go and then we can measure the planning impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure -- it sounds reasonable but I can't really tell without actually seeing the code
| certainly_null_expr: Option<&Expr>, | ||
| input_schema: &dyn ExprSchema, | ||
| ) -> Result<NullableInterval> { | ||
| let evaluator = PredicateBoundsEvaluator { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your idea that you might be able to use the NullableInterval code instead of this (now that you have fixed the bugs?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My idea was to see if I can remove TernarySet and use NullableInterval instead indeed. As I was trying to do so I stumbled upon the fact that the logical operators were broken.
The bitfield is much less 🤯 if you ask me, but it's silly to have multiple implementations of the same concept and using NullableInterval directly will get rid of the ugly translation block code.
|
🤖 |
|
🤖: Benchmark completed Details
|
Which issue does this PR close?
Rationale for this change
#17357 introduced a change that replaces
coalescefunction calls withcaseexpressions. In the current implementation these two differ in the way they report their nullability.coalesceis more precise thancaseall will report itself as not nullable in situations where the equivalentcasedoes report being nullable.The rest of the codebase currently does not expect the nullability property of an expression to change as a side effect of expression simplification. This PR is a first attempt to align the nullability of
coalesceandcase.What changes are included in this PR?
Tweaks to the
nullablelogic for the logical and physicalcaseexpression code to reportcaseas being not nullable in more situations.case, a best effort const evaluation of 'when' expressions is done to determine 'then' reachability. The code errs on the conservative side wrt nullability.case, const evaluation of 'when' expressions using a placeholder record batch is attempted to determine 'then' reachability. Again if const evaluation is not possible, the code errs on the conservative side.Are these changes tested?
Additional unit tests have been added to test the new logic.
Are there any user-facing changes?
No