Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Feb 18, 2024

Which issue does this PR close?

Closes #8820

My primary rationale is that I think substrait will be a very important technology
in the upcoming years and substrait integration will be a very important feature
for DataFusion.

Thus as system designers evaluate DataFusion for thier uses, it is important
they understand the depth and breadth of DataFusion's substrait support. Having
reasonable documentation is the first key step for this

The current docs for substrait are minimal:
Screenshot 2024-02-17 at 11 11 21 AM

Rationale for this change

  1. Move most datafusion-proto content from README --> rustdocs (so they can be more easily tested and found)
  2. Add examples and documentation for datafusion-subtrait

Note that since datafusion-proto and datadusion-substrait are interrelated
(they perform similar functions) I also wanted to make their relationship clear
and the documentation consistent.

Sorry for this large of a PR but I think it makes sense for the content of both
crates tp be reviewed together as they are related. Also I had a plane trip
without internet so I got a bit carried away

What changes are included in this PR?

Are these changes tested?

Now that the examples are part of the rustdocs, I believe they are both easier to find as well as tested as part of the doctests that are part of CI 🎉

Are there any user-facing changes?

@github-actions github-actions bot added the substrait Changes to the substrait crate label Feb 18, 2024
@alamb alamb marked this pull request as ready for review February 18, 2024 08:21
@alamb alamb added the documentation Improvements or additions to documentation label Feb 18, 2024
Copy link
Contributor

@Jefffrey Jefffrey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documentation 🚀

(Sorry for all the nits 😅 )


[df]: https://crates.io/crates/datafusion
[datafusion]: https://arrow.apache.org/datafusion
[api docs]: http://docs.rs/datafusion-substrait/latest/datafusion-substrait
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be datafusion-proto?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, excellent catch. Fixed in f596397

The prost/tonic code can be generated by running, which in turn invokes the Rust binary located in [gen](./gen)

This is necessary after modifying the protobuf definitions or altering the dependencies of [gen](./gen), and requires a
valid installation of [protoc](https://github.com/protocolbuffers/protobuf#protocol-compiler-installation).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to the DataFusion doc about this for consistency: https://arrow.apache.org/datafusion/contributor-guide/#protoc-installation

Or better to have doc direct from source?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a link to the rendered docs makes sense -- updated in aa09db6. Thanks for the suggestion

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this before. Filed #9300

Comment on lines +39 to +41
//! The binary format created by this crate supports the full range of DataFusion
//! plans, but is DataFusion specific. See [datafusion-substrait] for a crate
//! which can encode many DataFusion plans using the [substrait.io] standard.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since format is DataFusion specific, are we able to give any info on version compatibility here, etc.? If that has been decided yet 🤔

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call -- I think the current state of affairs is that there is no gaurantee about version compatibility, and I added that in 7d91abf. If version compatibility is a desired feature I think what is most important would be a test harness for verifying it (which is a non trivial undertaking)

Copy link
Member

@waynexia waynexia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome! Thanks @alamb for making this 💯

//!
//! Potential uses of this crate:
//! * Use DataFusion run Substrait plans created by other systems (e.g. Apache Calcite)
//! * Use DataFusion to create plans to run on other systems
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

//! Potential uses of this crate:
//! * Use DataFusion run Substrait plans created by other systems (e.g. Apache Calcite)
//! * Use DataFusion to create plans to run on other systems
//! * Pass query plans over FFI boundaries, such as from Python to Rust
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to provide another use case from GreptimeDB, where substrait plan is used to pass query plans across compute nodes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a note in 708c7cd

I am curious why you chose to substrait and not datafusion-proto within the same system (we use datafusion-proto for this case, for its wider compatibility)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm considering combining other consumer/producer together so the "node" need not be the same at that time. It turns out the work is tremendous and the related ecosystem is not ready yet 🤣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm considering combining other consumer/producer together so the "node" need not be the same at that time.

That makes a lot of sense

It turns out the work is tremendous and the related ecosystem is not ready yet 🤣

Yeah, that is my feeling too. One reason I felt this PR would be good is that it might help find / attract people who want to play with the bleeding edge and make DataFusion's substrait support better. We'll see 🎣

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW filed #9299 to track adding some more support

Co-authored-by: Jeffrey Vo <[email protected]>
Co-authored-by: Ruihang Xia <[email protected]>
@github-actions github-actions bot removed the documentation Improvements or additions to documentation label Feb 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

substrait Changes to the substrait crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add documentation to datafusion-substrait crate about what it does

4 participants