-
Notifications
You must be signed in to change notification settings - Fork 1k
[Variant] ParentState tracks builder-specific state in a uniform way #8324
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
@alamb I would be interested in your reaction to the approach in this PR vs. the potential alternatives? |
Another potential option is to move some/all the specialized builder code into the parquet-variant crate 🤔 |
|
🤖 |
Moving |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is a step in the right direction. Thank you for pushing this @scovich
I share your performance concern but I think we can address that if/when we have some benchmark results / are trying to improve performance. Let's get the functionality working first
I kicked off some benchmarks to gather more performance data
@klion26 @codephage2020 @liamzwbao perhaps you would also have a chance to review this PR as well and offer your thoughts.
parquet-variant/src/builder.rs
Outdated
| saved_value_builder_offset: usize, | ||
| metadata_builder: &'a mut dyn MetadataBuilder, | ||
| saved_metadata_builder_dict_size: usize, | ||
| state: Box<dyn CustomParentState + 'a>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 makes sense to me
I wonder if we should go even farther and change parent state to only have a state: Box<dyn CustomParentState + 'a>, (and not different variants for Variant, List, Object, etc)
That might simplify things even more, however, it would have the downside of adding allocation / dispatch for everything 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thought also crossed my mind. I initially discarded it because it seemed bad to make those existing uses slower as well... but thinking more, the overheads will mostly hurt code that builds top-level primitive variant values -- once we get into complex types, there will be plenty of other overheads to hide this small one.
Given that this PR (unfortunately) penalizes that fastest-path case, it probably wouldn't hurt (much) to make the more complex slow-path cases use the same approach?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I update the PR to use the same approach for everything (ParentState is now a struct instead of an enum).
Holler if you hate it and I can easily revert.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
... and I also apparently broke something. Debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found the problems:
- It wasn't safe for list and object parent state constructors to delegate to the normal constructor, because they make eager changes (which happens before invoking the delegated constructor and causes the wrong offsets to be captured). Fixed by directly creating the parent state, instead of delegating to the other constructor.
- There was a subtle design flaw in the original
CustomParentState::finish-- it was overfitted to theVariantArrayBuildercase, and calledmetadata_builder.finish()-- which no other builder wants to do. FIxed by changing the signature to just pass the builders, instead of finished offsets.
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
It doesn't seem like a bad idea to me, to be honest. The only potential issue would be people who wanted to use parquet-variant without the dependency on arrow (I am only theorizing here, I don't know if that is actually an important usecase). |
|
🤖 |
|
Rerunning the All the other benchmarks look pretty good to me |
|
🤖: Benchmark completed Details
|
That's an interesting point. In https://github.com/delta-io/delta-kernel-rs/, for example, arrow is the default (but not required) engine implementation -- a Delta connector based on the kernel could choose to use its own native facilities instead. But variant is part of the Delta specification, so it could be important to have access to a robust binary variant implementation even if the engine isn't otherwise using arrow. |
|
Re That doesn't make sense -- it doesn't even build variant values in the first place? (so no parent state creation) It's just pre-registering field names on a |
|
Also re benchmarks: Do we even have any benchmarks for appending primitive numeric values to a |
Not sure -- I agree let's not hold up this PR to look into the one benchmark report |
I just realized, I must have misunderstood something. The fact that |
According to my AI assistant:
@alamb is this really a common pattern? Is there a better way to import just |
If you really need just ArrowError arrow_schema is a fine way to do it I think it is more common to use arrow_schema for ArrowError and DataType I think parquet-variant uses ArrowError at the moment for convenience. Its own error type like We could add a |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
@scovich after your updates to have everything use the same indirection mechanism, it seems that many benchmarks got slower -- I think we should revert to having the specific enums. I am sorry for any confusion / rework I have caused. |
Honestly, that tells me we just don't have benchmarks to measure the impact of the original change, and that we either need to go generic or accept the architectural violation of having enum variants for array builders (but the latter would be difficult to pull off, given that |
|
@alamb -- I changed to generic Meanwhile:
Thoughts? |
|
|
||
| /// Builder-specific state for array building that manages array-level offsets and nulls | ||
| #[derive(Debug)] | ||
| struct ArrayBuilderState<'a> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just moved this code down to impl VariantBuilderExt that actually uses it. It seems better to have the actual array builder be the first class defined in this module.
| /// Creates a nested list builder. See e.g. [`VariantBuilder::new_list`]. Panics if the nested | ||
| /// builder cannot be created, see e.g. [`ObjectBuilder::new_list`]. | ||
| fn new_list(&mut self) -> ListBuilder<'_> { | ||
| fn new_list(&mut self) -> ListBuilder<'_, Self::State<'_>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NOTE: If associated type defaults were stable, we could simplify the code even further:
pub trait VariantBuilderExt {
/// The builder specific state used by nested builders
type State<'a>: BuilderSpecificState + 'a
where
Self: 'a;
type ListBuilder<'a> = ListBuilder<'a, Self::State<'a>>
where
Self: 'a;
type ObjectBuilder<'a> = ObjectBuilder<'a, Self::State<'a>>
where
Self: 'a;
...
fn try_new_list(&mut self) -> Result<Self::ListBuilder<'_>, ArrowError>;
fn try_new_object(&mut self) -> Result<Self::ObjectBuilder<'_>, ArrowError>;
}(the various trait implementors could just inherit and use Self::ListBuilder and Self::ObjectBuilder)
|
🤖 |
alamb
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks great -- thank you @scovich
I'll wait for the benchmarks to complete, but assuming they look good I'll plan to merge this
| metadata_builder: &'a mut dyn MetadataBuilder, | ||
| saved_metadata_builder_dict_size: usize, | ||
| builder_state: Box<dyn BuilderSpecificState + 'a>, | ||
| builder_state: S, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
| } | ||
|
|
||
| /// Marks the insertion as having succeeded and invokes | ||
| /// [`BuilderSpecificState::finish`]. Internal state will no longer roll back on drop. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the reference to the past ("no longer") a little confusing
| /// [`BuilderSpecificState::finish`]. Internal state will no longer roll back on drop. | |
| /// [`BuilderSpecificState::finish`]. | |
| /// | |
| /// Note: Does not call `rollback()` on drop |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
Looks great and the performance seems like they are even a little faster. 🚀 thank you so much @scovich |
Which issue does this PR close?
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.
Rationale for this change
The
ParentStateclass, combined withVariantBuilderExttrait, makes it pretty easy to work with variant builders. But it only works for "well-known" builder types -- which does not and cannot include theVariantArrayBuilderbecause it lives in a different crate.This becomes a problem for e.g. #8323, because it's currently impossible to append multiple values to a
VariantArrayBuilder-- it needs to create andfinishoneVariantArrayVariantBuilderadapter for each appended value.Plus, we will eventually need a
VariantValueArrayBuilderthat works with read-only metadata, for shredding, unshredding, and projecting variant values. Which will undoubtedly encounter the same sorts of problems, since shredding and unshredding code relies heavily onVariantBuilderExt.What changes are included in this PR?
Make
ParentStatea customizable struct instead of an enum, with aBuilderSpecificStatethat encapsulates the bits of finish and rollback logic specific to each kind of builder. This allowsVariantArrayBuilderto directly implementVariantBuilderExt. It simplifies both the array builder's implementation and the code that uses it, and also opens the way for other custom builders like theVariantValueArrayBuilderwe will eventually need.NOTE: One downside of this approach is the use of a boxed trait instance. This effectively requires a heap allocation (and virtual method dispatch) for every single value appended to a variant array, which I don't love. However, none of our builder-using benchmarks show a measurable slowdown.
If we don't like the overhead of the boxed trait approach, alternatives we've considered include:
VariantBuilderExt, even those that come from other crates.ParentStateto a (not dyn-compat) trait that becomes a type constraint for those classes.VariantBuilderExtis already not dyn-compatVariantArrayBuilderclass into theparquet-variantcrateparquet-varianta newarrow-arraydependency (currently, it only depends onarrow-schema).parquet-variantcrate.Are these changes tested?
Yes, many unit tests were updated to use the new approach instead of the old (removed) approach.
Are there any user-facing changes?
No, because variant support is still experimental, but:
ParentStatebecomes a struct that references a new publicBuilderSpecificStatetrait. All builders are updated to use it.VariantArrayBuildernow implementsVariantBuilderExtdirectly, and the oldVariantArrayVariantBuilderadapter class has been removed.