-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Improve datafuson-proto and datafusion-substrait docs #9260
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Jefffrey
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Documentation 🚀
(Sorry for all the nits 😅 )
datafusion/proto/README.md
Outdated
|
|
||
| [df]: https://crates.io/crates/datafusion | ||
| [datafusion]: https://arrow.apache.org/datafusion | ||
| [api docs]: http://docs.rs/datafusion-substrait/latest/datafusion-substrait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be datafusion-proto?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, excellent catch. Fixed in f596397
datafusion/proto/REAME-dev.md
Outdated
| The prost/tonic code can be generated by running, which in turn invokes the Rust binary located in [gen](./gen) | ||
|
|
||
| This is necessary after modifying the protobuf definitions or altering the dependencies of [gen](./gen), and requires a | ||
| valid installation of [protoc](https://github.com/protocolbuffers/protobuf#protocol-compiler-installation). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to the DataFusion doc about this for consistency: https://arrow.apache.org/datafusion/contributor-guide/#protoc-installation
Or better to have doc direct from source?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a link to the rendered docs makes sense -- updated in aa09db6. Thanks for the suggestion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Apart from typo in file name, could maybe rename to CONTRIBUTING.md?
That's what is done in arrow-rs:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I missed this before. Filed #9300
| //! The binary format created by this crate supports the full range of DataFusion | ||
| //! plans, but is DataFusion specific. See [datafusion-substrait] for a crate | ||
| //! which can encode many DataFusion plans using the [substrait.io] standard. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since format is DataFusion specific, are we able to give any info on version compatibility here, etc.? If that has been decided yet 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call -- I think the current state of affairs is that there is no gaurantee about version compatibility, and I added that in 7d91abf. If version compatibility is a desired feature I think what is most important would be a test harness for verifying it (which is a non trivial undertaking)
waynexia
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome! Thanks @alamb for making this 💯
| //! | ||
| //! Potential uses of this crate: | ||
| //! * Use DataFusion run Substrait plans created by other systems (e.g. Apache Calcite) | ||
| //! * Use DataFusion to create plans to run on other systems |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A famous consumer is DuckDB: https://duckdb.org/docs/extensions/substrait.html
| //! Potential uses of this crate: | ||
| //! * Use DataFusion run Substrait plans created by other systems (e.g. Apache Calcite) | ||
| //! * Use DataFusion to create plans to run on other systems | ||
| //! * Pass query plans over FFI boundaries, such as from Python to Rust |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to provide another use case from GreptimeDB, where substrait plan is used to pass query plans across compute nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a note in 708c7cd
I am curious why you chose to substrait and not datafusion-proto within the same system (we use datafusion-proto for this case, for its wider compatibility)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering combining other consumer/producer together so the "node" need not be the same at that time. It turns out the work is tremendous and the related ecosystem is not ready yet 🤣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm considering combining other consumer/producer together so the "node" need not be the same at that time.
That makes a lot of sense
It turns out the work is tremendous and the related ecosystem is not ready yet 🤣
Yeah, that is my feeling too. One reason I felt this PR would be good is that it might help find / attract people who want to play with the bleeding edge and make DataFusion's substrait support better. We'll see 🎣
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW filed #9299 to track adding some more support
Co-authored-by: Jeffrey Vo <[email protected]> Co-authored-by: Ruihang Xia <[email protected]>
Which issue does this PR close?
Closes #8820
My primary rationale is that I think substrait will be a very important technology
in the upcoming years and substrait integration will be a very important feature
for DataFusion.
Thus as system designers evaluate DataFusion for thier uses, it is important
they understand the depth and breadth of DataFusion's substrait support. Having
reasonable documentation is the first key step for this
The current docs for substrait are minimal:

Rationale for this change
Note that since
datafusion-protoanddatadusion-substraitare interrelated(they perform similar functions) I also wanted to make their relationship clear
and the documentation consistent.
Sorry for this large of a PR but I think it makes sense for the content of both
crates tp be reviewed together as they are related. Also I had a plane trip
without internet so I got a bit carried away
What changes are included in this PR?
Are these changes tested?
Now that the examples are part of the rustdocs, I believe they are both easier to find as well as tested as part of the doctests that are part of CI 🎉
Are there any user-facing changes?