Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #5157
There are many places in the code that use fallible functions on DFSchema to check if a column exists:
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column
For example, there is code that looks like this (it calls is_ok() or is_err() and totally discards the error, along with its string):
input_schema.field_from_column(col).is_ok()
This is problematic because these functions return a DataFusionError that not only has an allocated String but has often also gone through a lot of effort to construct a nice error message. You can see them appearing in the trace on #5157.
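To make the cost concrete, here is a minimal sketch using simplified stand-in types. `ToySchema`, `SchemaError`, and `column_exists_via_result` are hypothetical names for illustration, not the real DataFusion API, and the error text is only an approximation of the kind of message that gets built.

```rust
// Hypothetical, simplified stand-ins for DFSchema and its error type;
// the real DataFusion types carry more information.
#[derive(Debug)]
struct SchemaError(String);

struct ToySchema {
    fields: Vec<String>,
}

impl ToySchema {
    // Fallible lookup: on a miss it formats a message listing every valid
    // field, which allocates a String that grows with the schema size.
    fn field_from_column(&self, name: &str) -> Result<&String, SchemaError> {
        self.fields.iter().find(|f| f.as_str() == name).ok_or_else(|| {
            SchemaError(format!(
                "No field named '{}'. Valid fields are {}.",
                name,
                self.fields.join(", ")
            ))
        })
    }
}

// The pattern described above: the caller only wants a yes/no answer, so
// the freshly built error message is dropped as soon as is_ok() returns.
fn column_exists_via_result(schema: &ToySchema, name: &str) -> bool {
    schema.field_from_column(name).is_ok()
}
```

Every miss in `column_exists_via_result` pays for a `format!` and a `join` whose result is never read.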
As part of making the optimizer faster (related to #5157), we need to avoid these string allocations.
Thus I propose:
- Add new functions for checking that return a bool rather than an error
- Replace the uses of is_ok() / is_err() on the fallible functions with calls to the new functions
For example,
impl DFSchema {
    // existing function that returns Result
    pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}
    // new function that returns bool <---- Add this new function
    pub fn has_column(&self, column: &Column) -> bool {...}
}
And then replace the places in the code that have the pattern
input_schema.field_from_column(col).is_ok()
with
input_schema.has_column(col)
Describe the solution you'd like
Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )
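As a rough illustration of what one such bool-returning function might look like, here is a sketch on simplified stand-in types. `ToySchema`, `ToyField`, and this `Column` struct are hypothetical, and the matching rule (an unqualified column matches any field of that name) is an assumption, not necessarily what DFSchema does.

```rust
// Hypothetical stand-ins for Column and DFSchema; the real types differ.
struct Column {
    relation: Option<String>, // optional table qualifier
    name: String,
}

struct ToyField {
    qualifier: Option<String>,
    name: String,
}

struct ToySchema {
    fields: Vec<ToyField>,
}

impl ToySchema {
    // Bool-returning existence check: no Result, no error message, no allocation.
    // Assumed matching rule: an unqualified column matches any field with the
    // same name; a qualified column must match both qualifier and name.
    fn has_column(&self, column: &Column) -> bool {
        self.fields.iter().any(|f| {
            f.name == column.name
                && match &column.relation {
                    Some(r) => f.qualifier.as_deref() == Some(r.as_str()),
                    None => true,
                }
        })
    }
}
```

Because the check only compares borrowed strings, a miss costs a scan over the fields and nothing more.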
Describe alternatives you've considered
There are more involved proposals for larger changes to DFSchema, but simply avoiding this check might help a lot.
Additional context
I think this is a good first exercise, as the desired change is well spelled out and it is a software engineering exercise rather than one that requires deep DataFusion expertise.