Skip to content

Add support for third party table providers #41

@ccciudatu

Description

@ccciudatu

DataFusion Ray can be extended to leverage datasource integrations that are not built into DataFusion itself, but could be brought in as features, such as:

However, adding a custom table provider to a distributed DataFusion engine requires two things:

  1. registering the corresponding TableProviderFactory with the DataFusion SessionContext
  2. for integrations that define custom ExecutionPlan nodes, registering the corresponding PhysicalExtensionCodecs

The current code does not allow for such extensions, because:

  1. the datafusion SessionContext (which is wrapped in a PySessionContext) is created outside the datafusion-ray library and can only be used via its python interface (i.e. invoking named methods on PyAny flavours of session context and execution plan)
  2. the only extension codec used for plan serialization is the ShuffleCodec, which only handles the shuffle read/write nodes of datafusion-ray itself

The solution that we came up with for addressing the above limitations in our fork involves the following changes:

  1. Add (back) the rust dependency on datafusion-python and provide a python function that creates a PySessionContext from within datafusion_ray itself (which can be customized with the enabled table factories and whatnot and can also be downcast so we can use the rust reference directly).
    The "external" python datafusion.SessionContext() will continue to be supported, it will just have no support for any "extensions" that datafusion-ray was compiled with (since we don't want to attempt unsafe downcasts).
  2. Add an Extension trait (not necessarily limited to table providers) that gets implemented by each such "extension" in order to:
    a. customize the SessionContext before it gets returned to python (e.g. registering table provider factories, catalog providers etc.)
    b. provide a list of physical extension codecs required for serializing its custom physical plan nodes, if any
  3. create a composite Extensions singleton that prepares session contexts created by the new datafusion_ray.extended_session_context() python function and also maintains a composite PhysicalExtensionCodec to serialize both the built-in shuffle nodes as well as any additional codecs provided by the enabled extensions.

I'd be glad to open a PR for contributing this, unless the feature request is out of scope or there are plans to address it differently.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions