-
Notifications
You must be signed in to change notification settings - Fork 1k
[Parquet] Return error from RleDecoder::reload rather than panic
#8729
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
536d048 to
6a6fb8a
Compare
|
Hi @alamb @etseidl, this PR is ready for review. PTAL, thanks! Ran local benchmarks and most test cases showed slight improvements while some showed regression, results varied across runs tho. However, I don’t believe this change brings any meaningful performance gain or loss overall, likely measurement noise. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @liamzwbao -- this looks good to me.
I will also run some benchmarks to confirm, but I also don't expect to see anything other than noise
| decoder: DictIndexDecoder::new(data, num_levels, num_values), | ||
| } | ||
| fn new(data: Bytes, num_levels: usize, num_values: Option<usize>) -> Result<Self> { | ||
| Ok(Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I verified this is a crate private structure, so this is not a public API change
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
When I did essentially the same thing as this PR, the only reproducible slowdown was in some of the fixed-length tests. I never really ran that to ground, and am willing to chalk it up to (once again) the arrow_reader bench being too twitchy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @liamzwbao. Just a few (very minor) stylistic nits you are free to ignore 😄
| let bit_reader = self | ||
| .bit_reader | ||
| .as_mut() | ||
| .ok_or_else(|| general_err!("bit_reader should be set"))?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| let bit_reader = self | |
| .bit_reader | |
| .as_mut() | |
| .ok_or_else(|| general_err!("bit_reader should be set"))?; | |
| let Some(bit_reader) = self.bit_reader.as_mut() else { | |
| return Err(general_err!("bit_reader should be set")); | |
| }; |
is a bit more concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, but I'd like to keep this one for consistency as we already have a lot ok_or_else in this file
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I meant all of them 😅. But it compiles down to the same thing so your choice.
| match &mut self.decoder { | ||
| MaybePacked::Packed(d) => d.set_data(encoding, data), | ||
| MaybePacked::Packed(d) => { | ||
| d.set_data(encoding, data); | ||
| Ok(()) | ||
| } | ||
| MaybePacked::Fallback(d) => d.set_data(encoding, data), | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| match &mut self.decoder { | |
| MaybePacked::Packed(d) => d.set_data(encoding, data), | |
| MaybePacked::Packed(d) => { | |
| d.set_data(encoding, data); | |
| Ok(()) | |
| } | |
| MaybePacked::Fallback(d) => d.set_data(encoding, data), | |
| } | |
| Ok(match &mut self.decoder { | |
| MaybePacked::Packed(d) => d.set_data(encoding, data), | |
| MaybePacked::Fallback(d) => d.set_data(encoding, data)?, | |
| }) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will trigger Clippy: passing a unit value to a function. But the following works:
match &mut self.decoder {
MaybePacked::Packed(d) => d.set_data(encoding, data),
MaybePacked::Fallback(d) => d.set_data(encoding, data)?,
};
Ok(())
parquet/src/encodings/rle.rs
Outdated
| } | ||
|
|
||
| let _ = self.reload(); | ||
| self.reload()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still want to convince myself that ignoring the result is ok here, but this isn't a behavior change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My reading of this change is that the old code ignored the result, but the PR proposes propagating the result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The old code ignored the result...the new code ignores the result less obviously and then returns Ok(()). I was wondering if it would make more sense to test the result and error when false, but that would be a behavior change. I'm not familiar enough with this code to suggest doing that. Maybe just leave a comment here that the result is deliberately ignored. Justification for doing so would be the icing on the cake. 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ignores the result less obviously
I think I am going crazy -- this PR changes the function signature to return a Result and then calls self.reload()?
I thought the ? propagates the error now.
How is the error ignored 🤔
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reload() returns a bool, so if it’s Ok(), the return value is simply ignored, but any error will still propagate with ?.
In the old code, the return value was ignored and could potentially cause a panic.
In the new code, the return value is still ignored, but it will return an Err instead of crashing the program.
Might be better to keep the let _ = and add comments here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you -- sorry for my ignorance
| // Initialize decoder state. The boolean only reports whether the first run contained data, | ||
| // and `get`/`get_batch` already interpret that result to drive iteration. We only need | ||
| // errors propagated here, so the flag returned is intentionally ignored. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️❤️❤️❤️❤️
|
Thank you @liamzwbao and @etseidl |
Which issue does this PR close?
RleDecoder::resetrather than panic #8632.Rationale for this change
What changes are included in this PR?
RleDecoder::reloadto return Result instead of panicking.Are these changes tested?
Covered by existing tests
Are there any user-facing changes?
No