@scovich
Contributor

@scovich scovich commented Sep 11, 2025

Which issue does this PR close?

We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax.

  • Closes #NNN.

Rationale for this change

The ParentState class, combined with VariantBuilderExt trait, makes it pretty easy to work with variant builders. But it only works for "well-known" builder types -- which does not and cannot include the VariantArrayBuilder because it lives in a different crate.

This becomes a problem for e.g. #8323, because it's currently impossible to append multiple values to a VariantArrayBuilder -- it needs to create and finish one VariantArrayVariantBuilder adapter for each appended value.

Plus, we will eventually need a VariantValueArrayBuilder that works with read-only metadata, for shredding, unshredding, and projecting variant values. Which will undoubtedly encounter the same sorts of problems, since shredding and unshredding code relies heavily on VariantBuilderExt.

What changes are included in this PR?

Make ParentState a customizable struct instead of an enum, with a BuilderSpecificState that encapsulates the bits of finish and rollback logic specific to each kind of builder. This allows VariantArrayBuilder to directly implement VariantBuilderExt. It simplifies both the array builder's implementation and the code that uses it, and also opens the way for other custom builders like the VariantValueArrayBuilder we will eventually need.
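
A rough sketch of that shape, for readers who want to see the moving parts. This is not the PR's code: the trait signature, the field names, and the `ArrayOffsetsState` type are simplified assumptions made up for illustration. It only shows the idea of shared bookkeeping in the struct, builder-specific finish/rollback behind a boxed trait object, and rollback on drop unless `finish` was called.

```rust
/// Hypothetical, simplified version of the builder-specific hook.
trait BuilderSpecificState {
    /// Called once when the pending value is committed.
    fn finish(&mut self, value_buffer: &[u8]);
    /// Called when the pending value is abandoned; many builders need nothing here.
    fn rollback(&mut self) {}
}

/// Illustrative parent state: common bookkeeping lives in the struct,
/// builder-specific behavior lives behind the boxed trait object.
struct ParentState<'a> {
    value_buffer: &'a mut Vec<u8>,
    saved_value_offset: usize,
    state: Box<dyn BuilderSpecificState + 'a>,
    finished: bool,
}

impl<'a> ParentState<'a> {
    fn new(value_buffer: &'a mut Vec<u8>, state: Box<dyn BuilderSpecificState + 'a>) -> Self {
        // Remember where the buffer ended so an abandoned value can be undone.
        let saved_value_offset = value_buffer.len();
        Self { value_buffer, saved_value_offset, state, finished: false }
    }

    fn append(&mut self, bytes: &[u8]) {
        self.value_buffer.extend_from_slice(bytes);
    }

    /// Commits the value and lets the builder-specific state react.
    fn finish(mut self) {
        self.state.finish(&self.value_buffer[..]);
        self.finished = true;
    }
}

impl<'a> Drop for ParentState<'a> {
    fn drop(&mut self) {
        // Anything not explicitly finished is rolled back.
        if !self.finished {
            self.value_buffer.truncate(self.saved_value_offset);
            self.state.rollback();
        }
    }
}

/// Hypothetical array-level state: records the end offset of each committed row.
struct ArrayOffsetsState<'a> {
    offsets: &'a mut Vec<usize>,
}

impl<'a> BuilderSpecificState for ArrayOffsetsState<'a> {
    fn finish(&mut self, value_buffer: &[u8]) {
        self.offsets.push(value_buffer.len());
    }
}

/// Commits one value and abandons another; only the first leaves a trace.
fn demo() -> (Vec<u8>, Vec<usize>) {
    let mut buffer = Vec::new();
    let mut offsets = Vec::new();
    {
        let state = Box::new(ArrayOffsetsState { offsets: &mut offsets });
        let mut committed = ParentState::new(&mut buffer, state);
        committed.append(b"ab");
        committed.finish();
    }
    {
        let state = Box::new(ArrayOffsetsState { offsets: &mut offsets });
        let mut abandoned = ParentState::new(&mut buffer, state);
        abandoned.append(b"xyz");
        // Dropped here without calling finish, so the bytes are truncated away.
    }
    (buffer, offsets)
}
```

The `Box<dyn ...>` field is where the per-value heap allocation and virtual dispatch mentioned in the note come from.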

NOTE: One downside of this approach is the use of a boxed trait instance. This effectively requires a heap allocation (and virtual method dispatch) for every single value appended to a variant array, which I don't love. However, none of our builder-using benchmarks show a measurable slowdown.

If we don't like the overhead of the boxed trait approach, alternatives we've considered include:

  • Add new parent state enum variants for each new type of VariantBuilderExt, even those that come from other crates.
    • PRO: The least amount of code of any alternative I've considered
    • PRO: Zero additional overhead compared to "native" types
    • CON: Architectural violation to make parquet-variant crate (at least somewhat) aware of parquet-variant-compute crate that depends on it.
  • Make the various builder classes generic, and change ParentState to a (not dyn-compat) trait that becomes a type constraint for those classes.
    • NOTE: VariantBuilderExt is already not dyn-compat
    • PRO: Even less overhead than what we have today, because we no longer need enum variant dispatch all over the place
    • CON: A lot of code churn to make all the necessary classes generic. Tho it's unclear how much that will actually impact users of the API. Messy library code isn't necessarily bad, as long as it has a clean user surface.
  • Move the VariantArrayBuilder class into the parquet-variant crate
    • PRO: "fixes" the architectural violation
    • CON: Gives parquet-variant a new arrow-array dependency (currently, it only depends on arrow-schema).
    • CON: Not flexible or future-proof -- anyone wishing to add a new kind of builder must do it in the parquet-variant crate.

Are these changes tested?

Yes, many unit tests were updated to use the new approach instead of the old (removed) approach.

Are there any user-facing changes?

No, because variant support is still experimental, but:

  • ParentState becomes a struct that references a new public BuilderSpecificState trait. All builders are updated to use it.
  • VariantArrayBuilder now implements VariantBuilderExt directly, and the old VariantArrayVariantBuilder adapter class has been removed.

@github-actions github-actions bot added the parquet (Changes to the parquet crate) and parquet-variant (parquet-variant* crates) labels Sep 11, 2025
@scovich
Contributor Author

scovich commented Sep 11, 2025

@alamb I would be interested in your reaction to the approach in this PR vs. the potential alternatives?

@alamb
Contributor

alamb commented Sep 11, 2025

CON: Architectural violation to make parquet-variant crate (at least somewhat) aware of parquet-variant-compute crate that depends on it.

Another potential option is to move some/all the specialized builder code into the parquet-variant crate 🤔

@alamb
Contributor

alamb commented Sep 11, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (1110bb3) to 7e38bbb diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

@scovich
Contributor Author

scovich commented Sep 11, 2025

CON: Architectural violation to make parquet-variant crate (at least somewhat) aware of parquet-variant-compute crate that depends on it.

Another potential option is to move some/all the specialized builder code into the parquet-variant crate 🤔

Moving VariantArrayBuilder from parquet-variant-compute to the parquet-variant crate would indeed resolve the architectural violation. Is that even such a bad thing?

Contributor

@alamb alamb left a comment

I think this is a step in the right direction. Thank you for pushing this @scovich

I share your performance concern but I think we can address that if/when we have some benchmark results / are trying to improve performance. Let's get the functionality working first

I kicked off some benchmarks to gather more performance data

@klion26 @codephage2020 @liamzwbao perhaps you would also have a chance to review this PR as well and offer your thoughts.

saved_value_builder_offset: usize,
metadata_builder: &'a mut dyn MetadataBuilder,
saved_metadata_builder_dict_size: usize,
state: Box<dyn CustomParentState + 'a>,
Contributor

👍 makes sense to me

I wonder if we should go even farther and change parent state to only have a state: Box<dyn CustomParentState + 'a>, (and not different variants for Variant, List, Object, etc)

That might simplify things even more, however, it would have the downside of adding allocation / dispatch for everything 🤔

Contributor Author

The thought also crossed my mind. I initially discarded it because it seemed bad to make those existing uses slower as well... but thinking more, the overheads will mostly hurt code that builds top-level primitive variant values -- once we get into complex types, there will be plenty of other overheads to hide this small one.

Given that this PR (unfortunately) penalizes that fastest-path case, it probably wouldn't hurt (much) to make the more complex slow-path cases use the same approach?

Contributor Author

I updated the PR to use the same approach for everything (ParentState is now a struct instead of an enum).
Holler if you hate it and I can easily revert.

Contributor Author

... and I also apparently broke something. Debugging.

Contributor Author

Found the problems:

  1. It wasn't safe for list and object parent state constructors to delegate to the normal constructor, because they make eager changes (which happens before invoking the delegated constructor and causes the wrong offsets to be captured). Fixed by directly creating the parent state, instead of delegating to the other constructor.
  2. There was a subtle design flaw in the original CustomParentState::finish -- it was overfitted to the VariantArrayBuilder case, and called metadata_builder.finish() -- which no other builder wants to do. Fixed by changing the signature to just pass the builders, instead of finished offsets.
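
To make the first problem concrete, here is a minimal sketch of the ordering bug. It is not the PR's code: `RollbackPoint` and the single placeholder byte standing in for the "eager change" are hypothetical simplifications. The point is only that a delegated constructor which snapshots the buffer length after the eager write captures the wrong offset, so a later rollback leaves that write behind.

```rust
/// Hypothetical rollback snapshot: remembers how long the buffer was.
struct RollbackPoint {
    saved_len: usize,
}

impl RollbackPoint {
    /// The "normal" constructor: snapshots the current buffer length.
    fn capture(buffer: &[u8]) -> Self {
        RollbackPoint { saved_len: buffer.len() }
    }

    /// Buggy: the eager write happens first, then the delegated constructor
    /// snapshots a length that already includes it.
    fn for_list_buggy(buffer: &mut Vec<u8>) -> Self {
        buffer.push(0); // eager change (placeholder byte, an assumption for this sketch)
        RollbackPoint::capture(buffer)
    }

    /// Fixed: build the snapshot directly, before making the eager change.
    fn for_list_fixed(buffer: &mut Vec<u8>) -> Self {
        let saved_len = buffer.len();
        buffer.push(0); // same eager change, now covered by the snapshot
        RollbackPoint { saved_len }
    }

    fn rollback(&self, buffer: &mut Vec<u8>) {
        buffer.truncate(self.saved_len);
    }
}

/// Starts from a 3-byte buffer, creates a rollback point, rolls back,
/// and reports how many bytes remain.
fn leftover_after_rollback(make: fn(&mut Vec<u8>) -> RollbackPoint) -> usize {
    let mut buffer = vec![1, 2, 3];
    let point = make(&mut buffer);
    point.rollback(&mut buffer);
    buffer.len()
}
```

With the buggy constructor the buffer keeps the stray byte after rollback; with the fixed one it returns to its original three bytes.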

@alamb
Contributor

alamb commented Sep 11, 2025

🤖: Benchmark completed

Details

group                                                                main                                   variant-builder-custom-state
-----                                                                ----                                   ----------------------------
batch_json_string_to_variant json_list 8k string                     1.00     27.8±0.14ms        ? ?/sec    1.01     27.9±0.13ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    315.7±4.44ms        ? ?/sec    1.00    315.3±0.52ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.02      7.9±0.04ms        ? ?/sec    1.00      7.8±0.02ms        ? ?/sec
variant_get_primitive                                                1.00    617.8±1.12ns        ? ?/sec    1.00    619.7±6.88ns        ? ?/sec

@alamb
Contributor

alamb commented Sep 11, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (1110bb3) to 7e38bbb diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 11, 2025

🤖: Benchmark completed

Details

group                                       main                                   variant-builder-custom-state
-----                                       ----                                   ----------------------------
bench_extend_metadata_builder               1.00     57.4±3.39ms        ? ?/sec    1.11     63.6±3.15ms        ? ?/sec
bench_object_field_names_reverse_order      1.02     21.1±0.92ms        ? ?/sec    1.00     20.6±1.26ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1274.1±21.53µs        ? ?/sec    1.00  1280.4±21.61µs        ? ?/sec
bench_object_list_same_schema               1.01     25.4±0.24ms        ? ?/sec    1.00     25.2±0.24ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.7±0.18ms        ? ?/sec    1.01     13.8±0.17ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.3±0.01ms        ? ?/sec    1.01      3.3±0.01ms        ? ?/sec
bench_object_same_schema                    1.00     38.7±0.09ms        ? ?/sec    1.02     39.3±0.23ms        ? ?/sec
bench_object_unknown_schema                 1.00     16.2±0.14ms        ? ?/sec    1.00     16.3±0.04ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.7±0.01ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
iteration/validated_iteration               1.01     49.5±0.54µs        ? ?/sec    1.00     49.1±0.19µs        ? ?/sec
validation/unvalidated_construction         1.01      6.7±0.01µs        ? ?/sec    1.00      6.7±0.01µs        ? ?/sec
validation/validated_construction           1.00     61.1±0.13µs        ? ?/sec    1.00     60.9±0.14µs        ? ?/sec
validation/validation_cost                  1.01     54.6±0.08µs        ? ?/sec    1.00     54.2±0.06µs        ? ?/sec

@alamb
Contributor

alamb commented Sep 11, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (1110bb3) to 7e38bbb diff
BENCH_NAME=variant_validation
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_validation
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 11, 2025

🤖: Benchmark completed

Details

group                               main                                   variant-builder-custom-state
-----                               ----                                   ----------------------------
bench_validate_complex_object       1.00    230.9±0.44µs        ? ?/sec    1.00    230.4±0.45µs        ? ?/sec
bench_validate_large_nested_list    1.00     19.5±0.06ms        ? ?/sec    1.00     19.4±0.05ms        ? ?/sec
bench_validate_large_object         1.00     55.0±0.13ms        ? ?/sec    1.00     54.9±0.09ms        ? ?/sec

@alamb
Contributor

alamb commented Sep 11, 2025

CON: Architectural violation to make parquet-variant crate (at least somewhat) aware of parquet-variant-compute crate that depends on it.

Another potential option is to move some/all the specialized builder code into the parquet-variant crate 🤔

Moving VariantArrayBuilder from parquet-variant-compute to the parquet-variant crate would indeed resolve the architectural violation. Is that even such a bad thing?

It doesn't seem like a bad idea to me, to be honest.

The only potential issue would be people who wanted to use parquet-variant without the dependency on arrow (I am only theorizing here, I don't know if that is actually an important usecase).

@alamb
Contributor

alamb commented Sep 11, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (1110bb3) to 7e38bbb diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 11, 2025

Rerunning the BENCH_NAME=variant_builder benchmark to see if this is reproducible:

group                                       main                                   variant-builder-custom-state
-----                                       ----                                   ----------------------------
bench_extend_metadata_builder               1.00     57.4±3.39ms        ? ?/sec    1.11     63.6±3.15ms        ? ?/sec

All the other benchmarks look pretty good to me

@alamb
Contributor

alamb commented Sep 11, 2025

🤖: Benchmark completed

Details

group                                       main                                   variant-builder-custom-state
-----                                       ----                                   ----------------------------
bench_extend_metadata_builder               1.00     61.8±4.60ms        ? ?/sec    1.07     66.0±3.63ms        ? ?/sec
bench_object_field_names_reverse_order      1.00     21.9±0.67ms        ? ?/sec    1.01     22.2±1.08ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1277.9±29.72µs        ? ?/sec    1.00  1276.6±16.12µs        ? ?/sec
bench_object_list_same_schema               1.01     25.6±0.21ms        ? ?/sec    1.00     25.4±0.19ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.9±0.15ms        ? ?/sec    1.00     14.0±0.11ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.4±0.01ms        ? ?/sec    1.00      3.3±0.01ms        ? ?/sec
bench_object_same_schema                    1.00     38.9±0.20ms        ? ?/sec    1.01     39.3±0.11ms        ? ?/sec
bench_object_unknown_schema                 1.01     16.5±0.04ms        ? ?/sec    1.00     16.3±0.04ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.7±0.00ms        ? ?/sec    1.00      2.7±0.01ms        ? ?/sec
iteration/validated_iteration               1.00     49.3±0.06µs        ? ?/sec    1.01     49.7±0.39µs        ? ?/sec
validation/unvalidated_construction         1.00      6.7±0.01µs        ? ?/sec    1.00      6.7±0.01µs        ? ?/sec
validation/validated_construction           1.05     63.5±0.11µs        ? ?/sec    1.00     60.6±0.10µs        ? ?/sec
validation/validation_cost                  1.00     56.8±0.06µs        ? ?/sec    1.04     59.3±0.12µs        ? ?/sec

@scovich
Contributor Author

scovich commented Sep 11, 2025

CON: Architectural violation to make parquet-variant crate (at least somewhat) aware of parquet-variant-compute crate that depends on it.

Another potential option is to move some/all the specialized builder code into the parquet-variant crate 🤔

Moving VariantArrayBuilder from parquet-variant-compute to the parquet-variant crate would indeed resolve the architectural violation. Is that even such a bad thing?

It doesn't seem like a bad idea to me, to be honest.

The only potential issue would be people who wanted to use parquet-variant without the dependency on arrow (I am only theorizing here, I don't know if that is actually an important usecase).

That's an interesting point. In https://github.com/delta-io/delta-kernel-rs/, for example, arrow is the default (but not required) engine implementation -- a Delta connector based on the kernel could choose to use its own native facilities instead. But variant is part of the Delta specification, so it could be important to have access to a robust binary variant implementation even if the engine isn't otherwise using arrow.

@scovich
Contributor Author

scovich commented Sep 11, 2025

Re bench_extend_metadata_builder 11% slower:

That doesn't make sense -- it doesn't even build variant values in the first place? (so no parent state creation)

It's just pre-registering field names on a VariantBuilder (which has a dedicated -- not custom -- parent state)

@scovich
Contributor Author

scovich commented Sep 11, 2025

Also re benchmarks: Do we even have any benchmarks for appending primitive numeric values to a VariantArrayBuilder? Because that's the case that would expose the increased overhead, if any would.

@alamb
Contributor

alamb commented Sep 11, 2025

Also re benchmarks: Do we even have any benchmarks for appending primitive numeric values to a VariantArrayBuilder? Because that's the case that would expose the increased overhead, if any would.

Not sure -- I agree let's not hold up this PR to look into the one benchmark report

@scovich scovich requested a review from alamb September 11, 2025 19:29
@scovich scovich changed the title from "[Variant] Support custom variant builder parent state" to "[Variant] ParentState tracks builder-specific state in a uniform way" Sep 11, 2025
@scovich
Contributor Author

scovich commented Sep 11, 2025

  • CON: Each impl would have to define all trait fields and nearly all methods -- even the ones that are the same for every impl -- to avoid running afoul of the rust borrow checker. The same borrow checker issues are why every ParentState enum variant defines the same half dozen fields over and over, in addition to any variant-specific extra state.

I just realized, I must have misunderstood something. The fact that the BuilderSpecificState approach actually works, especially the ArrayBuilderState, suggests that the borrow checker is only a problem if the mut references are being taken by different functions?

@scovich
Contributor Author

scovich commented Sep 11, 2025

currently, it [parquet-variant crate] only depends on arrow-schema

According to my AI assistant:

The parquet-variant crate does NOT use any other Arrow functionality - it doesn't use Arrow arrays, schemas, data types, or compute functions. The dependency on arrow-schema is purely for the standardized ArrowError type, which provides consistent error handling across the Arrow ecosystem.

This is a common pattern in the Arrow Rust ecosystem where crates use arrow-schema just for the error types to maintain consistency, even when they don't use other Arrow functionality.

@alamb is this really a common pattern? Is there a better way to import just ArrowError?
Or nobody cares because the crate is anyway super lightweight (with no required dependencies on other crates)?

@alamb
Contributor

alamb commented Sep 12, 2025

@alamb is this really a common pattern? Is there a better way to import just ArrowError? Or nobody cares because the crate is anyway super lightweight (with no required dependencies on other crates)

If you really need just ArrowError arrow_schema is a fine way to do it

I think it is more common to use arrow_schema for ArrowError and DataType

I think parquet-variant uses ArrowError at the moment for convenience. Its own error type like VariantError probably makes sense long term

We could add From impls for ArrowError and it would be pretty easy to switch back and forth
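
A minimal sketch of what such conversions could look like. Everything here is an assumption for illustration: `VariantError` does not exist in the crate today, and `ArrowError` below is a one-variant local stand-in for `arrow_schema::ArrowError` so that the snippet compiles without the dependency.

```rust
use std::fmt;

/// Local stand-in for `arrow_schema::ArrowError`, reduced to a single variant.
#[derive(Debug, PartialEq)]
enum ArrowError {
    InvalidArgumentError(String),
}

/// Hypothetical crate-local error type for parquet-variant.
#[derive(Debug, PartialEq)]
enum VariantError {
    InvalidValue(String),
}

impl fmt::Display for VariantError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            VariantError::InvalidValue(msg) => write!(f, "invalid variant value: {}", msg),
        }
    }
}

impl std::error::Error for VariantError {}

/// Lets arrow-facing code use `?` on functions that return `VariantError`.
impl From<VariantError> for ArrowError {
    fn from(e: VariantError) -> Self {
        match e {
            VariantError::InvalidValue(msg) => ArrowError::InvalidArgumentError(msg),
        }
    }
}

/// The reverse direction, for code that wants to stay in `VariantError`.
impl From<ArrowError> for VariantError {
    fn from(e: ArrowError) -> Self {
        match e {
            ArrowError::InvalidArgumentError(msg) => VariantError::InvalidValue(msg),
        }
    }
}

/// Hypothetical variant-side check that knows nothing about arrow.
fn check_non_empty(bytes: &[u8]) -> Result<usize, VariantError> {
    if bytes.is_empty() {
        Err(VariantError::InvalidValue("empty value".to_string()))
    } else {
        Ok(bytes.len())
    }
}

/// Arrow-facing wrapper: `?` converts the error through the `From` impl.
fn arrow_facing(bytes: &[u8]) -> Result<usize, ArrowError> {
    Ok(check_non_empty(bytes)?)
}
```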

@alamb
Contributor

alamb commented Sep 12, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (b3fcdde) to 7e38bbb diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 12, 2025

🤖: Benchmark completed

Details

group                                                                main                                   variant-builder-custom-state
-----                                                                ----                                   ----------------------------
batch_json_string_to_variant json_list 8k string                     1.00     27.6±0.12ms        ? ?/sec    1.19     32.8±0.21ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    305.5±0.56ms        ? ?/sec    1.11    337.9±0.65ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.00      7.9±0.01ms        ? ?/sec    1.03      8.1±0.02ms        ? ?/sec
variant_get_primitive                                                1.00    620.0±1.35ns        ? ?/sec    1.00    621.1±0.79ns        ? ?/sec

@alamb
Contributor

alamb commented Sep 12, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (b3fcdde) to 7e38bbb diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 12, 2025

🤖: Benchmark completed

Details

group                                       main                                   variant-builder-custom-state
-----                                       ----                                   ----------------------------
bench_extend_metadata_builder               1.00     59.1±2.50ms        ? ?/sec    1.07     63.1±2.26ms        ? ?/sec
bench_object_field_names_reverse_order      1.23     21.4±0.48ms        ? ?/sec    1.00     17.4±0.80ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1278.4±15.36µs        ? ?/sec    1.08  1385.4±17.41µs        ? ?/sec
bench_object_list_same_schema               1.00     25.7±0.18ms        ? ?/sec    1.14     29.4±0.18ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.7±0.09ms        ? ?/sec    1.13     15.6±0.12ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.3±0.01ms        ? ?/sec    1.05      3.5±0.01ms        ? ?/sec
bench_object_same_schema                    1.00     38.5±0.09ms        ? ?/sec    1.10     42.2±0.09ms        ? ?/sec
bench_object_unknown_schema                 1.00     16.3±0.05ms        ? ?/sec    1.08     17.7±0.04ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.7±0.00ms        ? ?/sec    1.02      2.7±0.00ms        ? ?/sec
iteration/validated_iteration               1.01     49.9±0.11µs        ? ?/sec    1.00     49.3±0.16µs        ? ?/sec
validation/unvalidated_construction         1.00      6.7±0.01µs        ? ?/sec    1.00      6.7±0.02µs        ? ?/sec
validation/validated_construction           1.00     61.1±0.11µs        ? ?/sec    1.00     61.0±0.10µs        ? ?/sec
validation/validation_cost                  1.03     56.2±0.12µs        ? ?/sec    1.00     54.5±0.14µs        ? ?/sec

@alamb
Contributor

alamb commented Sep 12, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (b3fcdde) to 7e38bbb diff
BENCH_NAME=variant_validation
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_validation
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

@alamb
Contributor

alamb commented Sep 12, 2025

🤖: Benchmark completed

Details

group                               main                                   variant-builder-custom-state
-----                               ----                                   ----------------------------
bench_validate_complex_object       1.00    231.3±1.65µs        ? ?/sec    1.00    230.3±0.44µs        ? ?/sec
bench_validate_large_nested_list    1.00     19.2±0.04ms        ? ?/sec    1.02     19.6±0.04ms        ? ?/sec
bench_validate_large_object         1.00     55.1±0.08ms        ? ?/sec    1.00     55.3±0.08ms        ? ?/sec

@alamb
Contributor

alamb commented Sep 12, 2025

@scovich after your updates to have everything use the same indirection mechanism, it seems that many benchmarks got slower -- I think we should revert to having the specific enums. I am sorry for any confusion / rework I have caused.

@scovich
Contributor Author

scovich commented Sep 12, 2025

after your updates to have everything use the same indirection mechanism, it seems that many benchmarks got slower -- I think we should revert to having the specific enums. I am sorry for any confusion / rework I have caused.

Honestly, that tells me we just don't have benchmarks to measure the impact of the original change, and that we either need to go generic or accept the architectural violation of having enum variants for array builders (but the latter would be difficult to pull off, given that NullBuffer is part of the builder specific state, and arrow-array is not available in parquet-variant).

@scovich
Contributor Author

scovich commented Sep 12, 2025

@alamb -- I changed to generic ParentState<S: BuilderSpecificState>, could you take it for a benchmarking spin so we know what impact it has?

Meanwhile:

  • VariantBuilderExt did not become generic -- it can capture the genericity with an associated type instead.
  • Changing to generic only seems to have affected code implementing builders and parent state; I didn't have to change any use sites.
  • The PR is still a net win in LoC, in spite of that churn.

Thoughts?
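
For anyone skimming, a stripped-down sketch of the generic shape described above. The names mirror the PR, but the bodies, the fields, and the `RowCounter` / `ArrayBuilder` types are simplified assumptions, not the real implementation. It shows the state type becoming a generic parameter (no box, no dynamic dispatch) while the trait itself stays non-generic by exposing the state as a generic associated type.

```rust
/// Simplified stand-in for the builder-specific hook.
trait BuilderSpecificState {
    fn finish(&mut self);
}

/// Parent state is generic over its builder-specific state: no Box, no dyn.
struct ParentState<'a, S: BuilderSpecificState> {
    buffer: &'a mut Vec<u8>,
    builder_state: S,
}

struct ListBuilder<'a, S: BuilderSpecificState> {
    parent: ParentState<'a, S>,
}

impl<'a, S: BuilderSpecificState> ListBuilder<'a, S> {
    fn append(&mut self, byte: u8) {
        self.parent.buffer.push(byte);
    }

    fn finish(mut self) {
        self.parent.builder_state.finish();
    }
}

/// The trait is not generic; the state type is an associated type instead.
trait VariantBuilderExt {
    type State<'a>: BuilderSpecificState + 'a
    where
        Self: 'a;

    fn new_list(&mut self) -> ListBuilder<'_, Self::State<'_>>;
}

/// Hypothetical array-level state: counts finished rows.
struct RowCounter<'a> {
    rows: &'a mut usize,
}

impl<'a> BuilderSpecificState for RowCounter<'a> {
    fn finish(&mut self) {
        *self.rows += 1;
    }
}

/// Hypothetical array builder that plugs in its own state type.
struct ArrayBuilder {
    buffer: Vec<u8>,
    rows: usize,
}

impl VariantBuilderExt for ArrayBuilder {
    type State<'a> = RowCounter<'a> where Self: 'a;

    fn new_list(&mut self) -> ListBuilder<'_, Self::State<'_>> {
        ListBuilder {
            parent: ParentState {
                buffer: &mut self.buffer,
                builder_state: RowCounter { rows: &mut self.rows },
            },
        }
    }
}

/// A use site: generic over the builder only, never names the state type.
fn build_lists<B: VariantBuilderExt>(builder: &mut B, rows: &[&[u8]]) {
    for row in rows {
        let mut list = builder.new_list();
        for &byte in row.iter() {
            list.append(byte);
        }
        list.finish();
    }
}

fn demo() -> (Vec<u8>, usize) {
    let mut builder = ArrayBuilder { buffer: Vec::new(), rows: 0 };
    build_lists(&mut builder, &[&[1, 2], &[3, 4]]);
    (builder.buffer, builder.rows)
}
```

Note that `build_lists` only mentions `VariantBuilderExt`, which is consistent with use sites not needing changes.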


/// Builder-specific state for array building that manages array-level offsets and nulls
#[derive(Debug)]
struct ArrayBuilderState<'a> {
Contributor Author

I just moved this code down to the impl VariantBuilderExt that actually uses it. It seems better to have the actual array builder be the first class defined in this module.

/// Creates a nested list builder. See e.g. [`VariantBuilder::new_list`]. Panics if the nested
/// builder cannot be created, see e.g. [`ObjectBuilder::new_list`].
- fn new_list(&mut self) -> ListBuilder<'_> {
+ fn new_list(&mut self) -> ListBuilder<'_, Self::State<'_>> {
Contributor Author

@scovich scovich Sep 12, 2025

NOTE: If associated type defaults were stable, we could simplify the code even further:

pub trait VariantBuilderExt {
    /// The builder specific state used by nested builders
    type State<'a>: BuilderSpecificState + 'a
    where
        Self: 'a;
    
    type ListBuilder<'a> = ListBuilder<'a, Self::State<'a>> 
    where 
        Self: 'a;
    
    type ObjectBuilder<'a> = ObjectBuilder<'a, Self::State<'a>> 
    where 
        Self: 'a;

    ...

    fn try_new_list(&mut self) -> Result<Self::ListBuilder<'_>, ArrowError>;
    fn try_new_object(&mut self) -> Result<Self::ObjectBuilder<'_>, ArrowError>;
}

(the various trait implementors could just inherit and use Self::ListBuilder and Self::ObjectBuilder)

@alamb
Contributor

alamb commented Sep 13, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (f636d6f) to 7e38bbb diff
BENCH_NAME=variant_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_kernels
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

Contributor

@alamb alamb left a comment

I think this looks great -- thank you @scovich

I'll wait for the benchmarks to complete, but assuming they look good I'll plan to merge this

metadata_builder: &'a mut dyn MetadataBuilder,
saved_metadata_builder_dict_size: usize,
- builder_state: Box<dyn BuilderSpecificState + 'a>,
+ builder_state: S,
Contributor

❤️

}

/// Marks the insertion as having succeeded and invokes
/// [`BuilderSpecificState::finish`]. Internal state will no longer roll back on drop.
Contributor

I found the reference to the past ("no longer") a little confusing

Suggested change
- /// [`BuilderSpecificState::finish`]. Internal state will no longer roll back on drop.
+ /// [`BuilderSpecificState::finish`].
+ ///
+ /// Note: Does not call `rollback()` on drop

@alamb
Contributor

alamb commented Sep 13, 2025

🤖: Benchmark completed

Details

group                                                                main                                   variant-builder-custom-state
-----                                                                ----                                   ----------------------------
batch_json_string_to_variant json_list 8k string                     1.11     27.8±0.10ms        ? ?/sec    1.00     25.1±0.17ms        ? ?/sec
batch_json_string_to_variant random_json(2633 bytes per document)    1.00    303.7±0.54ms        ? ?/sec    1.02    309.4±6.90ms        ? ?/sec
batch_json_string_to_variant repeated_struct 8k string               1.06      7.9±0.06ms        ? ?/sec    1.00      7.4±0.02ms        ? ?/sec
variant_get_primitive                                                1.00    619.5±1.78ns        ? ?/sec    1.00    618.2±3.92ns        ? ?/sec

alamb commented Sep 13, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (f636d6f) to 7e38bbb diff
BENCH_NAME=variant_builder
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_builder
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

alamb commented Sep 13, 2025

🤖: Benchmark completed

Details

group                                       main                                   variant-builder-custom-state
-----                                       ----                                   ----------------------------
bench_extend_metadata_builder               1.10     59.9±2.17ms        ? ?/sec    1.00     54.5±2.34ms        ? ?/sec
bench_object_field_names_reverse_order      1.05     20.2±0.50ms        ? ?/sec    1.00     19.2±0.96ms        ? ?/sec
bench_object_list_partially_same_schema     1.00  1268.3±14.63µs        ? ?/sec    1.00  1272.3±16.27µs        ? ?/sec
bench_object_list_same_schema               1.01     25.4±0.26ms        ? ?/sec    1.00     25.1±0.16ms        ? ?/sec
bench_object_list_unknown_schema            1.00     13.5±0.07ms        ? ?/sec    1.00     13.6±0.14ms        ? ?/sec
bench_object_partially_same_schema          1.00      3.3±0.01ms        ? ?/sec    1.01      3.4±0.01ms        ? ?/sec
bench_object_same_schema                    1.00     38.8±0.11ms        ? ?/sec    1.02     39.6±0.13ms        ? ?/sec
bench_object_unknown_schema                 1.00     16.2±0.07ms        ? ?/sec    1.01     16.3±0.13ms        ? ?/sec
iteration/unvalidated_fallible_iteration    1.00      2.6±0.01ms        ? ?/sec    1.04      2.7±0.04ms        ? ?/sec
iteration/validated_iteration               1.01     49.3±0.07µs        ? ?/sec    1.00     49.1±0.05µs        ? ?/sec
validation/unvalidated_construction         1.01      6.7±0.01µs        ? ?/sec    1.00      6.7±0.02µs        ? ?/sec
validation/validated_construction           1.00     60.3±0.10µs        ? ?/sec    1.01     61.1±0.57µs        ? ?/sec
validation/validation_cost                  1.01     53.6±0.08µs        ? ?/sec    1.00     53.2±0.08µs        ? ?/sec

alamb commented Sep 13, 2025

🤖 ./gh_compare_arrow.sh Benchmark Script Running
Linux aal-dev 6.14.0-1014-gcp #15~24.04.1-Ubuntu SMP Fri Jul 25 23:26:08 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing variant-builder-custom-state (f636d6f) to 7e38bbb diff
BENCH_NAME=variant_validation
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental --bench variant_validation
BENCH_FILTER=
BENCH_BRANCH_NAME=variant-builder-custom-state
Results will be posted here when complete

alamb commented Sep 13, 2025

🤖: Benchmark completed

Details

group                               main                                   variant-builder-custom-state
-----                               ----                                   ----------------------------
bench_validate_complex_object       1.05    242.1±0.45µs        ? ?/sec    1.00    229.6±0.27µs        ? ?/sec
bench_validate_large_nested_list    1.00     19.0±0.04ms        ? ?/sec    1.00     18.9±0.03ms        ? ?/sec
bench_validate_large_object         1.00     55.6±0.06ms        ? ?/sec    1.00     55.5±0.08ms        ? ?/sec

alamb commented Sep 13, 2025

Looks great, and performance even seems to be a little faster. 🚀 Thank you so much @scovich

@alamb alamb merged commit 2c79a4f into apache:main Sep 13, 2025
19 checks passed

Labels: parquet (changes to the parquet crate), parquet-variant (parquet-variant* crates)
