fix: coalesce schema issues #12308
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -151,21 +151,22 @@ impl ExprSchemable for Expr { | |
| .collect::<Result<Vec<_>>>()?; | ||
|
|
||
| // verify that function is invoked with correct number and type of arguments as defined in `TypeSignature` | ||
| data_types_with_scalar_udf(&arg_data_types, func).map_err(|err| { | ||
| plan_datafusion_err!( | ||
| "{} {}", | ||
| err, | ||
| utils::generate_signature_error_msg( | ||
| func.name(), | ||
| func.signature().clone(), | ||
| &arg_data_types, | ||
| let new_data_types = data_types_with_scalar_udf(&arg_data_types, func) | ||
|
Contributor
I think this is the root cause of the issue, and to solve it other changes are necessary. Therefore, I think we should go with this change and maybe further optimize the coercion in another PR.
Contributor
Author
Ok, so should I leave it as it is? Or change it back to how it was: `data_types_with_scalar_udf(&arg_data_types, func)`
Contributor
I defer to @jayzhan211 -- if he is good to merge this PR, let's get the conflicts resolved and merge it in. If there is additional work we know is needed / could be cleaned up, let's try and file them as tickets
Contributor
Author
Conflicts solved! 😄 |
||
| .map_err(|err| { | ||
| plan_datafusion_err!( | ||
| "{} {}", | ||
| err, | ||
| utils::generate_signature_error_msg( | ||
| func.name(), | ||
| func.signature().clone(), | ||
| &arg_data_types, | ||
| ) | ||
| ) | ||
| ) | ||
| })?; | ||
| })?; | ||
|
|
||
| // perform additional function arguments validation (due to limited | ||
| // expressiveness of `TypeSignature`), then infer return type | ||
| Ok(func.return_type_from_exprs(args, schema, &arg_data_types)?) | ||
| Ok(func.return_type_from_exprs(args, schema, &new_data_types)?) | ||
| } | ||
| Expr::WindowFunction(window_function) => self | ||
| .data_type_and_nullable_with_window_function(schema, window_function) | ||
|
|
||
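
The hunk above stops discarding the result of `data_types_with_scalar_udf` and instead passes the coerced types (`new_data_types`) into return-type inference. Below is a minimal sketch of why that matters. It uses a hypothetical stand-in enum `Ty` and hypothetical free functions (`coerce`, `return_type`, `infer_from_raw`, `infer_from_coerced`), not DataFusion's real `DataType` or `ScalarUDFImpl` API:

```rust
/// Hypothetical stand-in for a small subset of arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum Ty {
    Null,
    Utf8,
    Int64,
}

/// Toy coercion rule (an assumption for this sketch): Null and Utf8
/// arguments are coerced to Utf8, anything else is rejected.
fn coerce(arg_types: &[Ty]) -> Result<Vec<Ty>, String> {
    arg_types
        .iter()
        .map(|t| match t {
            Ty::Null | Ty::Utf8 => Ok(Ty::Utf8),
            other => Err(format!("unsupported argument type {:?}", other)),
        })
        .collect()
}

/// Toy return-type rule: the result has the type of the first argument.
fn return_type(arg_types: &[Ty]) -> Result<Ty, String> {
    arg_types
        .first()
        .cloned()
        .ok_or_else(|| "expected at least one argument".to_string())
}

/// Before the change: coercion is only used for validation, and the
/// return type is computed from the raw argument types.
fn infer_from_raw(arg_types: &[Ty]) -> Result<Ty, String> {
    let _validated = coerce(arg_types)?;
    return_type(arg_types)
}

/// After the change: the return type is computed from the coerced types.
fn infer_from_coerced(arg_types: &[Ty]) -> Result<Ty, String> {
    let coerced = coerce(arg_types)?;
    return_type(&coerced)
}
```

With a `Null` argument, `infer_from_raw` reports `Null` as the return type while `infer_from_coerced` reports `Utf8`, which is the type the function would actually receive after coercion.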
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -32,7 +32,6 @@ use datafusion_expr::ColumnarValue; | |
| use std::sync::Arc; | ||
| use std::{fmt, str::FromStr}; | ||
|
|
||
| use datafusion_expr::TypeSignature::*; | ||
| use datafusion_expr::{ScalarUDFImpl, Signature, Volatility}; | ||
| use std::any::Any; | ||
|
|
||
|
|
@@ -49,17 +48,8 @@ impl Default for EncodeFunc { | |
|
|
||
| impl EncodeFunc { | ||
| pub fn new() -> Self { | ||
| use DataType::*; | ||
| Self { | ||
| signature: Signature::one_of( | ||
|
Contributor
It seems to me that, by moving the signature away from a data-driven description (aka describing "what" is needed and letting some other code compute whether the given arguments match that signature), this PR is moving many of the functions towards a more functional style (each function has to implement its own custom coercion, likely resulting in significant duplication). What do you think (perhaps as a follow-on PR) of adding Maybe something like that would support automatically coercing arguments from null? Or maybe we should always support coercing Null to any type
Contributor
An alternative signature like `Signature::String`, similar to `Signature::numeric`, that also includes converting null to string?
Contributor
I am not sure -- I was just reacting that this "handle null" pattern seems common and it seems like this approach will require custom coerce logic for all functions 🤔
Member
Null to T coercion needs to be handled elsewhere anyway (e.g. when computing the type of a UNION, etc.). This is actually super fundamental for the DataFusion vision as a composable query engine. Coercion rules are very implementation-specific. If we had functions spiced up with coercions inside them, that would make those functions non-reusable.
Contributor
100% It seems to me like
Contributor
@findepi Are you suggesting something like general coercion that is not function-specific? But what if we want different coercion rules for different functions? Then we might need to do coercion function-wise.
Member
Why would we want different coercion rules for different functions?
Contributor
My idea is that it is more flexible for the user, although, without a real use case, it might be a premature optimization 🤔.
Contributor
Author
Sorry to chip in late; this PR addresses other issues, such as #12307. I wonder if I could split it and leave the changes regarding the coercion of functions in this one (to keep the discussion in one place) and the others in a new PR. Would that be ok? |
||
| vec![ | ||
| Exact(vec![Utf8, Utf8]), | ||
| Exact(vec![LargeUtf8, Utf8]), | ||
| Exact(vec![Binary, Utf8]), | ||
| Exact(vec![LargeBinary, Utf8]), | ||
| ], | ||
| Volatility::Immutable, | ||
| ), | ||
| signature: Signature::user_defined(Volatility::Immutable), | ||
| } | ||
| } | ||
| } | ||
|
|
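
The review thread above weighs a data-driven signature against per-function `coerce_types`, and asks whether `Null` could be coerced to any type generically. As a rough illustration of that idea only, here is a sketch of a shared matcher that treats `Null` as compatible with any expected type. The enum `SigType` and the functions `encode_candidates` and `match_exact_with_null` are hypothetical names for this sketch and are not part of DataFusion:

```rust
/// Hypothetical stand-in for a small subset of arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum SigType {
    Null,
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
    Int64,
}

/// The four exact argument lists that the removed `Signature::one_of`
/// declared for `encode`, written as plain data.
fn encode_candidates() -> Vec<Vec<SigType>> {
    use SigType::*;
    vec![
        vec![Utf8, Utf8],
        vec![LargeUtf8, Utf8],
        vec![Binary, Utf8],
        vec![LargeBinary, Utf8],
    ]
}

/// Returns the first candidate that the arguments match, where a `Null`
/// argument is accepted in any position (assumption: Null coerces to any type).
fn match_exact_with_null(
    args: &[SigType],
    candidates: &[Vec<SigType>],
) -> Option<Vec<SigType>> {
    candidates
        .iter()
        .find(|cand| {
            cand.len() == args.len()
                && args
                    .iter()
                    .zip(cand.iter())
                    .all(|(arg, expected)| arg == expected || *arg == SigType::Null)
        })
        .cloned()
}
```

Under this scheme the function keeps a declarative signature and the Null handling lives in one shared place. Whether first-match is the right tie-break when several candidates fit is a design question this sketch does not settle.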
@@ -77,23 +67,39 @@ impl ScalarUDFImpl for EncodeFunc { | |
| } | ||
|
|
||
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| use DataType::*; | ||
|
|
||
| Ok(match arg_types[0] { | ||
| Utf8 => Utf8, | ||
| LargeUtf8 => LargeUtf8, | ||
| Binary => Utf8, | ||
| LargeBinary => LargeUtf8, | ||
| Null => Null, | ||
| _ => { | ||
| return plan_err!("The encode function can only accept utf8 or binary."); | ||
| } | ||
| }) | ||
| Ok(arg_types[0].to_owned()) | ||
| } | ||
|
|
||
| fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
| encode(args) | ||
| } | ||
|
|
||
| fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> { | ||
| if arg_types.len() != 2 { | ||
| return plan_err!( | ||
| "{} expects to get 2 arguments, but got {}", | ||
| self.name(), | ||
| arg_types.len() | ||
| ); | ||
| } | ||
|
|
||
| if arg_types[1] != DataType::Utf8 { | ||
| return Err(DataFusionError::Plan("2nd argument should be Utf8".into())); | ||
| } | ||
|
|
||
| match arg_types[0] { | ||
| DataType::Utf8 | DataType::Binary | DataType::Null => { | ||
| Ok(vec![DataType::Utf8; 2]) | ||
| } | ||
| DataType::LargeUtf8 | DataType::LargeBinary => { | ||
| Ok(vec![DataType::LargeUtf8, DataType::Utf8]) | ||
| } | ||
| _ => plan_err!( | ||
| "1st argument should be Utf8 or Binary or Null, got {:?}", | ||
| arg_types[0] | ||
| ), | ||
| } | ||
| } | ||
| } | ||
|
|
||
| #[derive(Debug)] | ||
|
|
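
The `coerce_types` logic added for `EncodeFunc` in the hunk above can be read as three checks: argument count, second argument type, then a mapping of the first argument. The following standalone sketch mirrors those checks so they can be exercised without DataFusion. `ArgType` and `coerce_encode_args` are hypothetical names, and errors are plain `String`s rather than `DataFusionError`:

```rust
/// Hypothetical stand-in for a small subset of arrow's `DataType`.
#[derive(Debug, Clone, PartialEq)]
enum ArgType {
    Null,
    Utf8,
    LargeUtf8,
    Binary,
    LargeBinary,
    Int64,
}

/// Mirrors the checks in the diff's `coerce_types` for `encode`.
fn coerce_encode_args(arg_types: &[ArgType]) -> Result<Vec<ArgType>, String> {
    // 1. Exactly two arguments are required.
    if arg_types.len() != 2 {
        return Err(format!(
            "encode expects to get 2 arguments, but got {}",
            arg_types.len()
        ));
    }

    // 2. The second argument (the encoding name) must already be Utf8.
    if arg_types[1] != ArgType::Utf8 {
        return Err("2nd argument should be Utf8".to_string());
    }

    // 3. The first argument is mapped to a string type; Null is treated like Utf8.
    match &arg_types[0] {
        ArgType::Utf8 | ArgType::Binary | ArgType::Null => {
            Ok(vec![ArgType::Utf8, ArgType::Utf8])
        }
        ArgType::LargeUtf8 | ArgType::LargeBinary => {
            Ok(vec![ArgType::LargeUtf8, ArgType::Utf8])
        }
        other => Err(format!(
            "1st argument should be Utf8 or Binary or Null, got {:?}",
            other
        )),
    }
}
```

Because the new `return_type` in the diff returns `arg_types[0]`, the first element of the coerced list is also what the function reports as its return type.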
@@ -109,17 +115,8 @@ impl Default for DecodeFunc { | |
|
|
||
| impl DecodeFunc { | ||
| pub fn new() -> Self { | ||
| use DataType::*; | ||
| Self { | ||
| signature: Signature::one_of( | ||
| vec![ | ||
| Exact(vec![Utf8, Utf8]), | ||
| Exact(vec![LargeUtf8, Utf8]), | ||
| Exact(vec![Binary, Utf8]), | ||
| Exact(vec![LargeBinary, Utf8]), | ||
| ], | ||
| Volatility::Immutable, | ||
| ), | ||
| signature: Signature::user_defined(Volatility::Immutable), | ||
| } | ||
| } | ||
| } | ||
|
|
@@ -137,23 +134,39 @@ impl ScalarUDFImpl for DecodeFunc { | |
| } | ||
|
|
||
| fn return_type(&self, arg_types: &[DataType]) -> Result<DataType> { | ||
| use DataType::*; | ||
|
|
||
| Ok(match arg_types[0] { | ||
| Utf8 => Binary, | ||
| LargeUtf8 => LargeBinary, | ||
| Binary => Binary, | ||
| LargeBinary => LargeBinary, | ||
| Null => Null, | ||
| _ => { | ||
| return plan_err!("The decode function can only accept utf8 or binary."); | ||
| } | ||
| }) | ||
| Ok(arg_types[0].to_owned()) | ||
| } | ||
|
|
||
| fn invoke(&self, args: &[ColumnarValue]) -> Result<ColumnarValue> { | ||
| decode(args) | ||
| } | ||
|
|
||
| fn coerce_types(&self, arg_types: &[DataType]) -> Result<Vec<DataType>> { | ||
| if arg_types.len() != 2 { | ||
| return plan_err!( | ||
| "{} expects to get 2 arguments, but got {}", | ||
| self.name(), | ||
| arg_types.len() | ||
| ); | ||
| } | ||
|
|
||
| if arg_types[1] != DataType::Utf8 { | ||
| return plan_err!("2nd argument should be Utf8"); | ||
| } | ||
|
|
||
| match arg_types[0] { | ||
| DataType::Utf8 | DataType::Binary | DataType::Null => { | ||
| Ok(vec![DataType::Binary, DataType::Utf8]) | ||
| } | ||
| DataType::LargeUtf8 | DataType::LargeBinary => { | ||
| Ok(vec![DataType::LargeBinary, DataType::Utf8]) | ||
| } | ||
| _ => plan_err!( | ||
| "1st argument should be Utf8 or Binary or Null, got {:?}", | ||
| arg_types[0] | ||
| ), | ||
| } | ||
| } | ||
| } | ||
|
|
||
| #[derive(Debug, Copy, Clone)] | ||
|
|
||