-
Notifications
You must be signed in to change notification settings - Fork 1.8k
Enable reading StringViewArray by default from Parquet
#12092
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
StringViewArray by default from Parquet
This comment was marked as outdated.
This comment was marked as outdated.
bc5d7f7 to
27f7d1e
Compare
This comment was marked as outdated.
This comment was marked as outdated.
27f7d1e to
ce5470d
Compare
datafusion/core/src/datasource/physical_plan/parquet/row_group_filter.rs
Show resolved
Hide resolved
This comment was marked as outdated.
This comment was marked as outdated.
2275216 to
e8f7384
Compare
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
|
Benchmarks results clickbench_1:Is slower due to #12771 |
clickbench_extendedLooking good here |
clickbench_partitionedSlowing down also due to #12509 / #12788 |
|
Current status: #11682 (comment) |
6500361 to
5b67afe
Compare
|
My plan is to pull the changes from the following PRs into this PR and rerun the overall perf test
Hopefully that will show the performance we need. Then we just need to get the various PRs actually merged and we can do this 😅 |
5b67afe to
62a738e
Compare
62a738e to
dda0a9a
Compare
dda0a9a to
4a57773
Compare
|
Here are the current performance results: 🚀 Basically 10% faster across all queries and no slow downs 👏 @Rachelint @jayzhan211 @XiangpengHao and many many others. Also for partitioned: |
|
Next steps are to get the various PRs merged, get this one ready, and then write all about it. |
820b9ab to
6112175
Compare
6112175 to
cbdc592
Compare
|
I made a new PR as this one has lots of now irrelevant historical context New PR here: #13101 |
Draft as it builds on:
53.1.0/ fix clippy #12724binary_as_stringparquet option, upgrade to arrow/parquet53.2.0#12816 @goldmedalGroupColumnsupport forStringView/ByteView(faster grouping performance) #12809 from @Rachelinthits_partitionedWhich issue does this PR close?
Closes #11682
Rationale for this change
Reading data as
StringViewArrayis significantly faster thanStringArray. We have been testing this behind a feature flag but it is now stable enough to enable by default.See blog post #11603:
Benchmark Results
What changes are included in this PR?
schema_force_view_typesto trueAre these changes tested?
Yes, by CI tests
Are there any user-facing changes?
If you see an error related to StringView use, you can disable this feature using the schema_force_string_view option
Context
@XiangpengHao debugged these tests previously using #11862