-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix: The inconsistency between scalar and array on the cast decimal to timestamp #16539
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
datafusion/common/src/scalar/mod.rs
Outdated
| DataType::Timestamp(time_unit, None), | ||
| ) => { | ||
| let scale_factor = 10_i128.pow(*scale as u32); | ||
| let scale_factor = 10_i128.pow(*scale as u32 + 3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this depend on the timestamp precision (time unit)? Can you please add tests for other units?
Do we need this special casing at all for casting decimal to timestamp?
or can we end up calling cast_with_options, just like we do for virtually all other type pairs?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the underlying arrow kernel may not support casting decimal --> timestamp (I bet this is something spark related) 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the underlying arrow kernel may not support casting decimal --> timestamp (I bet this is something spark related) 🤔
@alamb here implements casting decimal --> timestamp
https://github.com/apache/arrow-rs/blob/b6240b32e235d4ca330372e3be31f784ba133252/arrow-cast/src/cast/mod.rs#L4032
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this depend on the timestamp precision (time unit)? Can you please add tests for other units?
Do we need this special casing at all for casting decimal to timestamp? or can we end up calling
cast_with_options, just like we do for virtually all other type pairs?
@findepi you are right, it's time unit related. I updated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the underlying arrow kernel may not support casting decimal --> timestamp
Can be!
So how does this work when the value is not constant-folded?
What code ends up executing in the 'normal query' case?
Can the same code be executed also in constant-folding case?
| SELECT CAST(CAST(x AS decimal(17,2)) AS timestamp(3)) = CAST(CAST(1 AS decimal(17,2)) AS timestamp(3)) from (values (1)) t(x); | ||
| ---- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Formatting like this will help see why these should be equal. Also it's enough to project the values. (They should be equal, but also they should be particular values).
SELECT CAST(CAST(1 AS decimal(17,2)) AS timestamp(3)) AS a UNION ALL
SELECT CAST(CAST(one AS decimal(17,2)) AS timestamp(3)) AS a FROM (VALUES (1)) t(one);
|
should I also change
|
| SELECT CAST(CAST(1 AS decimal(17,2)) AS timestamp(3)) AS a UNION ALL | ||
| SELECT CAST(CAST(one AS decimal(17,2)) AS timestamp(3)) AS a FROM (VALUES (1)) t(one); | ||
| ---- | ||
| 1970-01-01T00:00:00.001 | ||
| 1970-01-01T00:00:00.001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the desired semantics when casting from decimal(p,s) to timestamp(q)?
For example, when casting from decimal(38,3) to timestamp(3) it could feel natural to interpret decimal's whole integer part as seconds-of-epoch, and decimal's fraction as millis-of-second.
With the current implementation, the decimal's fraction seems to be discarded.
(Should we have tests with some non-whole integer decimal values as well)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what the semantics should be when casting from decimal --> timestamp. I recommend that we add the tests @findepi suggests and file a follow on ticket to clarify this point
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @chenkovsky and @findepi -- from my perspective this PR is an improvement as now the semantics for scalars match the semantics for arrays.
It is not 100% clear to me if this is the "right" behavior, but in my opinion consistent is better than inconsistent
Thanks again for working on this
| SELECT CAST(CAST(1 AS decimal(17,2)) AS timestamp(3)) AS a UNION ALL | ||
| SELECT CAST(CAST(one AS decimal(17,2)) AS timestamp(3)) AS a FROM (VALUES (1)) t(one); | ||
| ---- | ||
| 1970-01-01T00:00:00.001 | ||
| 1970-01-01T00:00:00.001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know what the semantics should be when casting from decimal --> timestamp. I recommend that we add the tests @findepi suggests and file a follow on ticket to clarify this point
agreed my second thought before merging is that: if we fix the bug now into one direction (whichever), and then reconsider desired semantics concluding we want opposite, it will be another breaking change and may be a point of contention. Given breaking changes should be done very judiciously (for good reasons), i would somewhat prefer to force us to decide on the desired semantics. |
Looks like the original was added as part of this PR from @jatin510 maybe they remember |
|
That PR added ScalarValue cast, but not the runtime cast. |
I am not sure -- we would need to trace the code down |
|
I think i understand. Before #15486 we were using Arrow's cast semantics for Decimal to Timestamp conversion. This semantics has unfortunate effect of discarding the fractional part of decimal. In #15486, a to_timestamp(decimal) was implemented, but only for scalar values / constant-foldable values (#15486 (comment)). To match the cast behavior in scalar tests, the The #15486 was merged fairly recently and didn't change the non-scalar code path. I think we should restore the decimal cast behavior to as it was before #15486. For those who -- like me -- don't want decimal fraction to get ignored, the to_timestamp seems like the alternative to use. Further, we should remove any other type-special handling in Last but not least, i start to appreciate engines which deliberately don't implement a cast from number to timestamp. there is no single good way to cast, so supporting this with named functions is better. |
I think this makes sense to me -- ideally we can document this somewhere so we don't churn / rework the semantics again in the future |
|
hey ! When i was working with #14612 I found out that, after changing the default My initial impression was, not to change the behaviour of the existing test cases with this parse option. |
datafusion/common/src/scalar/mod.rs
Outdated
| .to_array()?, | ||
| ( | ||
| ScalarValue::Decimal128(Some(decimal_value), _, scale), | ||
| DataType::Timestamp(time_unit, None), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In #16639 i tried to clear the path for more changes here.
If and once that one merged, we should be able to remove this whole code block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
#16639 is now merged.
please remove the whole let scalar_array = match (self, target_type) { ... } block, replacing it with just self.to_array()?.
|
What is the status of this PR? Shall we merge it? Or are there outstanding issues to resolve? |
requires an update -- #16539 (comment) |
|
@alamb PTAL |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @findepi and @chenkovsky
| _ => self.to_array()?, | ||
| }; | ||
|
|
||
| let scalar_array = self.to_array()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is pretty nice fixing bugs by deleting code 🏆
Which issue does this PR close?
Rationale for this change
In arrow(array), it will treat decimal as millisecond. but in datafusion(scalar), decimal will be treated as second.
for int64, datafusion's behavior is same as arrow. only decimal is incorrect.
What changes are included in this PR?
treat decimal as millisecond.
Are these changes tested?
UT
Are there any user-facing changes?
Yes