-
Couldn't load subscription status.
- Fork 1k
Add with_skip_validation flag to IPC StreamReader, FileReader and FileDecoder
#7120
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
d4102d6 to
59b3033
Compare
7b85e5e to
77d3de5
Compare
|
|
||
| mod private { | ||
| /// A boolean flag that cannot be mutated outside of unsafe code. | ||
| /// A boolean flag that cannot be mutated outside of unsafe code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to make this UnsafeFlag public (and added examples and more docs) so I could use it across the two crates. However, I can also make a private copy of it in arrow-ipc if reviewers feel it would be better to avoid a new API
arrow-ipc/benches/ipc_reader.rs
Outdated
| writer.write(&batch).unwrap(); | ||
| } | ||
| writer.finish().unwrap(); | ||
| let buffer = ipc_stream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added new versions of each benchmark that work with disabled validation
| StructArray::try_new(struct_fields.clone(), struct_arrays, None)? | ||
| }; | ||
| Ok(Arc::new(struct_array)) | ||
| self.create_struct_array(struct_node, null_buffer, struct_fields, struct_arrays) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I refactored this code into its own function so it was eaiser to call StructArray::new_unchecked when validation was disabled
arrow-ipc/src/reader.rs
Outdated
| /// | ||
| /// Relies on the caller only passing a flag with `true` value if they are | ||
| /// certain that the data is valid | ||
| pub fn with_skip_validation(mut self, skip_validation: UnsafeFlag) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that RecordBatchDecoder is not a public API:
https://docs.rs/arrow-ipc/54.1.0/arrow_ipc/reader/struct.StreamReader.html?search=RecordBatchDecoder
| self | ||
| } | ||
|
|
||
| /// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a new public API and follows the same pattern as ArrayData::skip_validation
| &mut self.reader | ||
| } | ||
|
|
||
| /// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new API
| &mut self.reader | ||
| } | ||
|
|
||
| /// Specifies if validation should be skipped when reading data (defaults to `false`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new API
| } | ||
|
|
||
| #[test] | ||
| fn test_invalid_nested_array_ipc_read_errors() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some additional coverage to make sure the flag got passed when decoding multiple nesting levels
| } | ||
|
|
||
| #[test] | ||
| fn test_validation_of_invalid_union_array() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Turns out UnionArray had its own path / handling so new coverage added
|
|
||
| // IPC Stream format | ||
| let buf = write_stream(&rb); // write is ok | ||
| read_stream_skip_validation(&buf).unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now all the validation tests also verify they can read the batch back without error if validation is disabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly just some minor nits
arrow-ipc/benches/ipc_reader.rs
Outdated
| } | ||
|
|
||
| /// Return an IPC stream with ZSTD compression with 10 record batches | ||
| fn ipc_stream_zstd() -> Vec<u8> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor nit, but I wonder if we could have an ipc_stream_options that takes a IpcWriteOptions and then have ipc_stream just call through to this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this was a good suggestion -- I did it in 7d3f020 🧹
Co-authored-by: Raphael Taylor-Davies <[email protected]>
…-rs into alamb/ipc_disable_validation
Co-authored-by: Raphael Taylor-Davies <[email protected]>
Co-authored-by: Raphael Taylor-Davies <[email protected]>
…-rs into alamb/ipc_disable_validation
|
Thank you very much for the review @tustvold |
…nd `FileDecoder` (apache#7120) * Make UnsafeFlag pub * Add `with_skip_validation` flag to IPC `StreamReader`, `FileReader` and `FileDecoder` * Apply suggestions from code review Co-authored-by: Raphael Taylor-Davies <[email protected]> * Update arrow-ipc/src/reader.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * Update arrow-ipc/src/reader.rs Co-authored-by: Raphael Taylor-Davies <[email protected]> * Add notes to RecordBatchDecoder::with_skip_validation flag * remove repetition in read benchmark --------- Co-authored-by: Raphael Taylor-Davies <[email protected]>
Which issue does this PR close?
with_skip_validationflag to IPC readers/writers #7093Rationale for this change
Forcing
Arrayvalidation while reading arrow IPC trusted data is inefficient. Users should be able to avoid doing so if they wantWhat changes are included in this PR?
This PR builds on this PR from @totoroyyb
Are there any user-facing changes?
with_disable_validationAPIs onStreamReader,FileReaderandFileDecoderBenchmark results Mac M3
Observations
mmapis super fast and thus the validation is a much larger portion of the time, resulting in a corresponding 8-9x faster without validationwith_skip_validationStreamReaderStreamReader(zstd)FileReaderFileDecoder+mmapDetails for Mac M3
Benchmarks: GCP
On a GCP
c2-standard-16 (16 vCPUs, 64 GB Memory)with_skip_validationStreamReaderStreamReader(zstd)FileReaderFileDecoder+mmapRaw Results (gpc)