Skip to content

Conversation

@alamb
Copy link
Contributor

@alamb alamb commented Jul 29, 2025

Which issue does this PR close?

Rationale for this change

Right now the InMemoryRowGroup does both the IO and the CPU work to on a call to fetch

in #7997 I am trying to separate the two, so that we can have more control over
when IO happens and when CPU work happens. I figured I could refactor the
InMemoryRowGroup as a separate PR to make it easier to review and convince myself this refactor is correct.

What changes are included in this PR?

Refactor InMemoryRowGroup::fetch into three functions

Are these changes tested?

By existing CI

If tests are not included in your PR, please explain why (for example, are they covered by existing tests)?

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 29, 2025
projection: &ProjectionMask,
selection: Option<&RowSelection>,
) -> Result<()> {
let FetchRanges {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just broke up fetch into three functions

@alamb alamb force-pushed the alamb/split_in_memory_reader branch from 671f698 to c881d34 Compare July 30, 2025 12:22
@alamb alamb marked this pull request as draft August 8, 2025 19:16
@alamb alamb closed this in #7997 Oct 29, 2025
alamb added a commit to alamb/arrow-rs that referenced this pull request Oct 29, 2025
# Which issue does this PR close?

- Part of apache#8000

- closes  apache#7983


# Rationale for this change

This PR is the first part of separating IO and decode operations in the
rust parquet decoder.

Decoupling IO and CPU enables several important usecases:
1. Different IO patterns  (e.g. not buffer the entire row group at once)
2. Different IO APIs e.g. use io_uring, or OpenDAL, etc.
3. Deliberate prefetching within a file
4. Avoid code duplication between the `ParquetRecordBatchStreamBuilder`
and `ParquetRecordBatchReaderBuilder`


# What changes are included in this PR?

1. Add new `ParquetDecoderBuilder`, and `ParquetDecoder` and tests

It is effectively an explicit version of the state machine that is used
in existing async reader (where the state machine is encoded as Rust
`async` / `await` structures)



# Are these changes tested?
Yes -- there are extensive tests for the new code

Note that this PR actually adds a **3rd** path for control flow (when I
claim this will remove duplication!) In follow on PRs I will convert the
existing readers to use this new pattern, similarly to the sequence I
did for the metadata decoder:
- apache#8080
- apache#8340

Here is a preview of a PR that consolidates the async reader to use the
push decoder internally (and removes one duplicate):
- apache#8159

- closes apache#8022

# Are there any user-facing changes?

Yes, a new API, but now changes to the existing APIs

---------

Co-authored-by: Matthijs Brobbel <[email protected]>
Co-authored-by: Adrian Garcia Badaracco <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant