Make a faster way to check column existence in optimizer (not is_err()) #5309

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Related to #5157

There are many places in the code that use fallible functions on DFSchema to check if a column exists:
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.index_of_column_by_name
https://docs.rs/datafusion-common/18.0.0/datafusion_common/struct.DFSchema.html#method.field_from_column

For example, there is code that looks like this (it calls is_ok() or is_err() and completely discards the error, including its string):

input_schema.field_from_column(col).is_ok()

This is problematic because these functions return a DataFusionError that not only contains an allocated String but has often gone through a lot of effort to construct a nice error message. You can see these calls appearing in the trace on #5157.
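To illustrate the cost, here is a minimal sketch using a hypothetical ToySchema (not the real DFSchema) whose lookup builds an error message on a miss. The counter and the message format are assumptions made for the illustration; the point is only that is_ok() still pays for the String that it immediately throws away.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Counts how many error messages get built (for illustration only).
static ERRORS_BUILT: AtomicUsize = AtomicUsize::new(0);

// Hypothetical stand-in for a schema: just a list of field names.
struct ToySchema {
    fields: Vec<String>,
}

impl ToySchema {
    // Fallible lookup, loosely in the style of the DFSchema functions above:
    // on a miss it allocates a String describing the valid fields.
    fn index_of(&self, name: &str) -> Result<usize, String> {
        self.fields
            .iter()
            .position(|f| f.as_str() == name)
            .ok_or_else(|| {
                ERRORS_BUILT.fetch_add(1, Ordering::Relaxed);
                format!("No field named '{}'. Valid fields are {:?}", name, self.fields)
            })
    }
}

// An existence check written with is_ok(): on a miss the error String
// is allocated, formatted, and then dropped without ever being read.
fn exists_via_result(schema: &ToySchema, name: &str) -> bool {
    schema.index_of(name).is_ok()
}
```

Every failed check through exists_via_result bumps the counter once, so a pass that probes many missing columns pays for that many formatted strings.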

As part of making the optimizer faster (related to #5157), we need to avoid these string allocations.

Thus I propose:

  1. Add new functions for checking that return a bool rather than an error
  2. Replace the uses of is_ok() and is_err() with calls to these new functions

For example, for field_from_column (which finds the field with the given qualified column):

impl DFSchema {
  // existing function that returns Result
  pub fn field_from_column(&self, column: &Column) -> Result<&DFField> {...}

  // new function that returns bool  <---- Add this new function
  pub fn has_column(&self, column: &Column) -> bool {...}
}

And then replace code that has the pattern

input_schema.field_from_column(col).is_ok()

With

input_schema.has_column(col)
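A rough sketch of how the two functions could sit side by side, using simplified stand-in types (Column, Field and Schema here are hypothetical, not the real DataFusion structs, and the matching rule is an assumption):

```rust
// Hypothetical, simplified stand-ins for DataFusion's Column, DFField and DFSchema.
#[derive(Debug, Clone, PartialEq)]
struct Column {
    relation: Option<String>,
    name: String,
}

#[derive(Debug)]
struct Field {
    qualifier: Option<String>,
    name: String,
}

struct Schema {
    fields: Vec<Field>,
}

impl Schema {
    // Shared matching rule (an assumption for this sketch): an unqualified
    // column matches on name alone, a qualified column must also match
    // the field's qualifier.
    fn matches(field: &Field, column: &Column) -> bool {
        field.name == column.name
            && match &column.relation {
                Some(rel) => field.qualifier.as_deref() == Some(rel.as_str()),
                None => true,
            }
    }

    // Fallible lookup: allocates and formats an error message on a miss.
    fn field_from_column(&self, column: &Column) -> Result<&Field, String> {
        self.fields
            .iter()
            .find(|f| Self::matches(f, column))
            .ok_or_else(|| format!("No field matching {:?}", column))
    }

    // Boolean check: same search, but no error value is ever constructed.
    fn has_column(&self, column: &Column) -> bool {
        self.fields.iter().any(|f| Self::matches(f, column))
    }
}
```

Because both functions go through the same matching helper, has_column(col) gives the same answer as field_from_column(col).is_ok(), which is what makes the replacement a mechanical one.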

Describe the solution you'd like
Ideally someone would do this transition one function on DFSchema at a time (not one giant PR please 🙏 )

Describe alternatives you've considered
There are more involved proposals for larger changes to DFSchema, but simply avoiding the error construction in these checks might help a lot.

Additional context
I think this is a good first exercise, as the desired change is well spelled out and it is a software engineering exercise rather than one that requires deep DataFusion expertise.
