-
Notifications
You must be signed in to change notification settings - Fork 1k
POC: Avoid a copy for uncompressed pages #8756
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| let decompressed_size = uncompressed_page_size - offset; | ||
| let mut decompressed = Vec::with_capacity(uncompressed_page_size); | ||
| decompressed.extend_from_slice(&buffer.as_ref()[..offset]); | ||
| if decompressed_size > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can see here that if decompressed_size is 0, it returns decompressed unmodified and decompressed was just the first offset bytes of the input buffer
I changed the code to return buffer.slice(..offset) in this case
I can add it to this location directly to make the diff smaller, but the logic was already hard to follow and highly indented to I moved it into its own function while I was at it
|
|
||
| let decompressed_size = uncompressed_page_size - offset; | ||
|
|
||
| if decompressed_size == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense to me 👍
| // TODO: page header could be huge because of statistics. We should set a | ||
| // maximum page header size and abort if that is exceeded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not relevant to this PR, but I think this TODO has largely been addressed by #8376 which enabled skipping the decoding of the page statistics.
| /// the provided decompressor if available and applicable. If the buffer is not | ||
| /// compressed, it will be returned as is. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code looks good to me but the I don't know if the comment "not compressed" can be replaced, if decompress_buffer is called and decompressed_size == 0 , seems that it generally means something like "this page only have levels, but not have non-null values"? ( Point me out if I'm wrong)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually also do not know what decompressed_size really means -- you are probably right. I will research it more carefully
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I read https://github.com/apache/parquet-format/blob/master/README.md#data-pages and I agree with your assessment that if the decompressed_size=0 it corresponds to no non-null values
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
If anything the benchmark results show a small, but consistent slowdown with this version |
# Which issue does this PR close? - follow on to #8756 # Rationale for this change @etseidl comments: #8756 (comment) > Not relevant to this PR, but I think this TODO has largely been addressed by #8376 which enabled skipping the decoding of the page statistics. While I was in here, I also wanted to capture the learning based on @mapleFU 's comment #8756 (comment) > The code looks good to me but the I don't know if the comment "not compressed" can be replaced, if decompress_buffer is called and decompressed_size == 0 , seems that it generally means something like "this page only have levels, but not have non-null values"? ( Point me out if I'm wrong) # What changes are included in this PR? Include some comments # Are these changes tested? No (there are no code changes) # Are there any user-facing changes? No, this is internal comments only. No code / behavior changes
Which issue does this PR close?
SerializedPageReader#8745Rationale for this change
While reviewing #8745 I noticed the page was always copied into a new allocation even when there is at least one path that doesn't use it: #8745 (comment)
What changes are included in this PR?
Are these changes tested?
By CI, and I will run benchmarks
I also ran code coverage in the tests to see if this code path (decompressed_size == 0) is hit often:
It does appear to happen, but not often in the tests (happens 3 times in the tests)
Are there any user-facing changes?
No