
Conversation

@jonded94 (Contributor) commented Nov 5, 2025

Rationale for this change

When dealing with Parquet files that have an exceedingly large amount of Binary or UTF8 data in one row group, there can be issues when returning a single RecordBatch because of index overflows (#7973).

In pyarrow this is usually solved by representing the data as a pyarrow.Table object whose columns are ChunkedArrays, which are basically just lists of Arrow arrays; equivalently, a pyarrow.Table can be seen as a representation of a list of RecordBatches.

I'd like to build a function in PyO3 that returns a pyarrow.Table, very similar to pyarrow's read_row_group method. With that, we could have feature parity with pyarrow in circumstances of potential index overflows without resorting to type changes (such as reading the data as LargeString or StringView columns).
Currently, AFAIS, there is no way in arrow-pyarrow to export a pyarrow.Table directly. In particular, convenience methods for converting from Vec<RecordBatch> seem to be missing. This PR implements a convenience wrapper that allows exporting a pyarrow.Table directly.

What changes are included in this PR?

A new struct Table is added to the arrow-pyarrow crate; it can be constructed from a Vec<RecordBatch> or from an ArrowArrayStreamReader.
It implements FromPyArrow and IntoPyArrow.

FromPyArrow will support anything that implements the Arrow stream PyCapsule protocol, is a RecordBatchReader, or has a to_reader() method that returns one; pyarrow.Table supports both the protocol and to_reader().
IntoPyArrow will result in a pyarrow.Table on the Python side, constructed through pyarrow.Table.from_batches(...).
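
For illustration, here is a rough sketch of how the wrapper is intended to be used from a PyO3 function. The Table::try_new(record_batches, schema) constructor shown here is an assumption (it mirrors what is discussed later in this thread) and may differ from the final API; error conversion is done manually to keep the example self-contained.

use std::sync::Arc;

use arrow_array::{ArrayRef, Int32Array, RecordBatch};
use arrow_pyarrow::{PyArrowType, Table};
use arrow_schema::{DataType, Field, Schema};
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

#[pyfunction]
fn make_table() -> PyResult<PyArrowType<Table>> {
    // Two small batches that share a single schema.
    let schema = Arc::new(Schema::new(vec![Field::new("ints", DataType::Int32, true)]));
    let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
    let batch = RecordBatch::try_new(schema.clone(), vec![column])
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    // `Table::try_new` validates that every batch matches the given schema (assumed signature).
    let table = Table::try_new(vec![batch.clone(), batch], schema)
        .map_err(|e| PyValueError::new_err(e.to_string()))?;
    // On the Python side this arrives as a `pyarrow.Table`, built via `pyarrow.Table.from_batches(...)`.
    Ok(PyArrowType(table))
}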

Are these changes tested?

Yes, in arrow-pyarrow-integration-tests.

Are there any user-facing changes?

A new Table convenience wrapper is added!

@kylebarron (Member) left a comment

Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.

I don't know what the stance of maintainers is towards including a Table construct for python integration.

FWIW if you wanted to look at external crates, PyTable exists that probably does what you want. (disclosure it's my project). That alternatively might give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons)

@jonded94 (Contributor, Author) commented Nov 6, 2025

Thanks @kylebarron for your very quick review! ❤️

Historically the attitude of this crate has been to avoid "Table" constructs to push users towards streaming approaches.

I don't know what the stance of maintainers is towards including a Table construct for python integration.

Yes, I'm also not too sure about it; that's why I just sketched out a rough implementation without tests so far. One reason why I think this could potentially be nice to have in arrow-pyarrow is that the documentation even mentions that there is no equivalent concept to pyarrow.Table in arrow-pyarrow and that one has to do slight workarounds to use them:

PyArrow has the notion of chunked arrays and tables, but arrow-rs doesn't have these same concepts. A chunked table is instead represented with Vec<RecordBatch>. A pyarrow.Table can be imported to Rust by calling pyarrow.Table.to_reader() and then importing the reader as a [ArrowArrayStreamReader].

At least I personally think having such a wrapper could be nice, since it simplifies stuff a bit when you anyways already have Vec<RecordBatch> on the Rust side somewhere or need to handle a pyarrow.Table on the Python side and want to have an easy method to generate such a thing from Rust. One still could mention in the documentation that generally, streaming approaches are highly preferred, and that the pyarrow.Table convenience wrapper shall only be used in cases where users know what they're doing.

Slightly nicer Python workflow

In our very specific example, we have a Python class with a function such as this one:

class ParquetFile:
  def read_row_group(self, index: int) -> pyarrow.RecordBatch: ...

In the issue I linked, this unfortunately breaks down for a specific parquet file, since a particular row group isn't expressible as a single RecordBatch without changing types somewhere. Either you'd have to change the underlying Arrow types from String to LargeString or StringView, or you change the returned type from pyarrow.RecordBatch to, for example, Iterator[pyarrow.RecordBatch] (or RecordBatchReader or any other streaming-capable object).

The latter comes with some syntactic shortcomings in contexts where you want to apply .to_pylist() to whatever read_row_group(...) returns:

rg: pyarrow.RecordBatch | Iterator[pyarrow.RecordBatch] = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]]
if isinstance(rg, pyarrow.RecordBatch):
  python_objs = rg.to_pylist()
else:
  python_objs = list(itertools.chain.from_iterable(batch.to_pylist() for batch in rg))

With pyarrow.Table, there already exists a thing which simplifies this a lot on the Python side:

rg: pyarrow.RecordBatch | pyarrow.Table = ParquetFile(...).read_row_group(0)
python_objs: list[dict[str, Any]] = rg.to_pylist()

And just for clarity, we unfortunately need to have the entire row group deserialized as Python objects because our data ingestion pipelines that consume this expect to have access to the entire row group in bulk, so streaming approaches are sadly not usable.

FWIW if you wanted to look at external crates, PyTable exists that probably does what you want. (disclosure it's my project). That alternatively might give you ideas for how to handle the Table here if you still want to do that. (It's a separate crate for these reasons)

Yes, in general, I much prefer the approach of arro3 to be totally pyarrow agnostic. In our case unfortunately, we're right now still pretty hardcoded against pyarrow specifics and just use arrow-rs as a means to reduce memory load compared to reading & writing parquet datasets with pyarrow directly.

@kylebarron (Member) commented

one has to do slight workarounds to use them:

I think that's outdated for Python -> Rust. I haven't tried but you should be able to pass a pyarrow.Table directly into an ArrowArrayStreamReader on the Rust side, because it just looks for the __arrow_c_stream__ method that exists either on the Table or the pyarrow.RecordBatchReader.
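
A minimal sketch of what I mean, with a hypothetical #[pyfunction] just for illustration (this assumes the arrow crate's pyarrow feature, or equivalently the arrow-pyarrow crate):

use arrow::ffi_stream::ArrowArrayStreamReader;
use arrow::pyarrow::PyArrowType;
use pyo3::exceptions::PyValueError;
use pyo3::prelude::*;

// Accepts anything exposing `__arrow_c_stream__`, so both a `pyarrow.Table`
// and a `pyarrow.RecordBatchReader` can be passed in directly.
#[pyfunction]
fn count_rows(stream: PyArrowType<ArrowArrayStreamReader>) -> PyResult<usize> {
    let mut total = 0;
    // `ArrowArrayStreamReader` iterates over `Result<RecordBatch, ArrowError>`.
    for batch in stream.0 {
        total += batch.map_err(|e| PyValueError::new_err(e.to_string()))?.num_rows();
    }
    Ok(total)
}

From Python, count_rows(table) and count_rows(table.to_reader()) should then both work without any extra conversion on the caller's side.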

But I assume there's no way today to easily return a Table from Rust to Python.

At least I personally think having such a wrapper could be nice, since it simplifies stuff a bit when you anyways already have Vec<RecordBatch> on the Rust side somewhere or need to handle a pyarrow.Table on the Python side and want to have an easy method to generate such a thing from Rust.

I'm fine with that; and I think other maintainers would probably be fine with that too, since it's only a concept that exists in the Python integration.

I'm not sure I totally get your example. Seems bad to be returning a union of multiple types to Python. But seems reasonable to return a Table there. The alternative is to return a stream and have the user either iterate over it lazily or choose to materialize it with pa.table(ParquetFile.read_row_group(...)).

And just for clarity, we unfortunately need to have the entire row group deserialized as Python objects because our data ingestion pipelines that consume this expect to have access to the entire row group in bulk, so streaming approaches are sadly not usable.

Well there's nothing stopping you from materializing the stream by passing it to pa.table(). You don't have to use the stream as a stream.

Yes, in general, I much prefer the approach of arro3 to be totally pyarrow agnostic. In our case unfortunately, we're right now still pretty hardcoded against pyarrow specifics and just use arrow-rs as a means to reduce memory load compared to reading & writing parquet datasets with pyarrow directly.

You can use pyo3-arrow with pyarrow as well, but I'm not opposed to adding this functionality to arrow-rs as well.

@jonded94 (Contributor, Author) commented Nov 6, 2025

I think that's outdated for Python -> Rust. I haven't tried but you should be able to pass a pyarrow.Table directly into an ArrowArrayStreamReader on the Rust side

Yes, exactly, that's what I even mentioned here in this PR (https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR491-R492 + https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR545-R549):

/// (although technically, since `pyarrow.Table` implements the ArrayStreamReader PyCapsule
/// interface, one could also consume a `PyArrowType<ArrowArrayStreamReader>` instead)

This is even used to convert pyarrow.Table to ArrowArrayStreamReader and eventually to Table: https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR544 + https://github.com/apache/arrow-rs/pull/8790/files#diff-2cc622072ff5fa80cf1a32a161da31ac058336ebedfeadbc8532fa52ea4224faR567

As you said, the opposite direction, namely easily returning a Vec<RecordBatch> as a pyarrow.Table to Python, is what's really missing here and what this PR is mainly about.

I'm not sure I totally get your example. Seems bad to be returning a union of multiple types to Python.

My example wasn't entirely complete, for simplicity (and still isn't); it would be more something like this:

class ParquetFile:
  @overload
  def read_row_group(self, index: int, as_table: Literal[True]) -> pyarrow.Table: ...
  @overload
  def read_row_group(self, index: int, as_table: Literal[False]) -> pyarrow.RecordBatch: ...
  def read_row_group(self, index: int, as_table: bool = False) -> pyarrow.RecordBatch | pyarrow.Table: ...

The advantage of that would be that both pyarrow.RecordBatch and pyarrow.Table implement .to_pylist() -> list[dict[str, Any]]. This is the important bit here, as we later just want to be able to call to_pylist() on whatever singular object read_row_group(...) returns and be guaranteed that the entire row group is deserialized as Python objects in this list. So it could also be expressed in our very specific example as:

class ToListCapable(Protocol):
  def to_pylist(self) -> list[dict[str, Any]]: ...

class ParquetFile:
  def read_row_group(self, index: int, as_table: bool = False) -> ToListCapable: ...

The alternative is to return a stream and have the user either iterate over it lazily or choose to materialize it with pa.table(ParquetFile.read_row_group(...)).

&

Well there's nothing stopping you from materializing the stream by passing it to pa.table(). You don't have to use the stream as a stream.

Yes, sure! We also do that in other places, or have entirely streamable pipelines elsewhere that use the PyCapsule ArrowStream interface. It's just that for this very specific use case, a Vec<RecordBatch> -> pyarrow.Table convenience wrapper perfectly maps to what we need with no required changes in any consuming code, and, as I said, I would be interested in whether maintainers of arrow-pyarrow find that useful for similarly specific niche use cases.

@alamb (Contributor) commented Nov 6, 2025

I am not a Python expert, nor have I fully understood all the discussion on this ticket, but:

At least I personally think having such a wrapper could be nice, since it simplifies stuff a bit when you anyways already have Vec<RecordBatch> on the Rust side somewhere or need to handle a pyarrow.Table on the Python side and want to have an easy method to generate such a thing from Rust. One still could mention in the documentation that generally, streaming approaches are highly preferred, and that the pyarrow.Table convenience wrapper shall only be used in cases where users know what they're doing.

This would be my preferred approach -- make it easy to go from Rust <> Python, while trying to encourage good practices (e.g. streaming). There is no reason to be pedantic and force someone through hoops to make PyTable if that is what they want.

@jonded94 (Contributor, Author) commented Nov 7, 2025

I now added a bunch of tests in arrow-pyarrow-integration-testing and overhauled Table to allow exporting empty pyarrow.Tables. I also edited the arrow-pyarrow crate docstring to signal that there now exists a pyarrow.Table equivalent, but that streaming approaches are generally preferred.

@jonded94 jonded94 force-pushed the implement-pyarrow-table-convenience-class branch from 3836afc to 61464b5 on November 7, 2025 14:50
CQ fixes

CQ fix

CQ fix

Let `Table` be a combination of `Vec<RecordBatch>` and `SchemaRef` instead

`cargo fmt`

Overhauled `Table` definition, Added tests

Add empty `Table` integration test

Update `arrow-pyarrow`'s crate documentation

Overhaul documentation even more

Typo fix
@jonded94 jonded94 force-pushed the implement-pyarrow-table-convenience-class branch from 61464b5 to 37b46be on November 7, 2025 14:52
Comment on lines 59 to 60
//! For example, a `pyarrow.Table` can be imported to Rust through `PyArrowType<ArrowArrayStreamReader>`
//! instead (since `pyarrow.Table` implements the ArrayStream PyCapsule interface).
@kylebarron (Member) commented:

I think it would be good to note here that another advantage of using ArrowArrayStreamReader is that it works with tables and stream input out of the box. It doesn't matter which type the user passes in.

@kylebarron (Member) commented:

Well, actually, a slight correction: assuming PyCapsule interface input, both Table and ArrowArrayStreamReader will work with both table and stream input out of the box; the difference is just whether the Rust code materializes the data.

This is why I have this table in the pyo3-arrow docs:

[table from the pyo3-arrow docs]

@kylebarron (Member) commented:

Also reading through the docs again, I'd suggest making a reference to Box<dyn RecordBatchReader> rather than ArrowArrayStreamReader. The former is a higher level API and much easier to use.

@jonded94 (Contributor, Author) commented:

I think it would be good to note here that another advantage of using ArrowArrayStreamReader is that it works with tables and stream input out of the box.

I added that in the docs.

Also reading through the docs again, I'd suggest making a reference to Box<dyn RecordBatchReader> rather than ArrowArrayStreamReader. The former is a higher level API and much easier to use.

I'm not exactly sure what you mean here. Box<dyn RecordBatchReader> only implements IntoPyArrow, but not FromPyArrow. So in the example I give in the new documentation (that a streaming approach could also be used for consuming a pyarrow.Table in Rust), Box<dyn RecordBatchReader> sadly doesn't help. One has to use ArrowArrayStreamReader, since that properly implements FromPyArrow.

@jonded94 (Contributor, Author) commented Nov 7, 2025

Hey @kylebarron, thanks for the review! I implemented everything as you suggested.

As you can see, the CI is now broken because of a subtle problem that was uncovered. Maybe you can help me, as I'm not too familiar with the FFI logic, and all my sanity checks did not help me:

In the test_table_roundtrip Python test in arrow-pyarrow-integration-test, we're simply handing a pyarrow.Table to Rust and letting it roundtrip through the conversion layers back to a pyarrow.Table. Unfortunately, the conversion from PyArrowType<Table> -> ArrowArrayStreamReader -> Box<dyn RecordBatchReader> -> Table is now failing, specifically the last part (Box<dyn RecordBatchReader> -> Table).

The RecordBatches that are read from the Box<dyn RecordBatchReader> in Table's try_new function seem to be metadata-less. This leads to an error, because try_new validates that the schema of every record batch matches the explicitly given schema.

The schema obtained from the Box<dyn RecordBatchReader> still has the metadata {"key1": "value1"} attached, but the individual RecordBatches do not. I left a somewhat verbose error message in the Rust error:

ValueError: Schema error: All record batches must have the same schema. Expected schema: Schema { fields: [Field { name: "ints", data_type: List(Field { data_type: Int32, nullable: true }), nullable: true }], metadata: {"key1": "value1"} }, got schema: Schema { fields: [Field { name: "ints", data_type: List(Field { data_type: Int32, nullable: true }), nullable: true }], metadata: {} 

This previously worked because I had used an unsafe interface for building a Table, which didn't check schema validity.

Sanity checks:

  • All other roundtrips work without problems; the metadata seems to be handed through all layers otherwise
  • In the failing Python test test_table_roundtrip, I asserted that the pyarrow.Table definitely still has the metadata attached, and in particular all RecordBatches obtained from it
    • Only in the conversion to Rust RecordBatches through the Box<dyn RecordBatchReader> do they somehow seem to lose their metadata. Is this something the FFI interface doesn't guarantee?

EDIT: More importantly, omitting the schema check in Table::try_new actually lets the test test_table_roundtrip succeed. I saw that in pyo3-arrow, you don't check schema equality with schema == record_batch.schema(), but have a custom function schema_equals which seems to be a little more forgiving? Shall we use something similar in Table::try_new?

@jonded94 (Contributor, Author) commented Nov 7, 2025

@kylebarron for now I stole the schema_equals function from pyo3_arrow, and everything seems to work again. I don't quite understand why the RecordBatches are metadata-less when read from Box<dyn RecordBatchReader>, though.

/// TODO: Either remove this check, replace it with something already existing in `arrow-rs`
/// or move it to a central `utils` location.
fn schema_equals(left: &SchemaRef, right: &SchemaRef) -> bool {
left.fields
Member commented:

This impl seems incorrect: the zip() operation does not check that the iterators have the same number of items. It effectively checks that one schema's fields are a prefix of the other's, so if right has one field more than left, this function will still return true.
https://play.rust-lang.org/?version=stable&mode=debug&edition=2024&gist=95c113900129b392365cdfb3b4c2b4e6

@jonded94 (Contributor, Author) commented Nov 10, 2025

In principle, instead of using this schema check method at all, I'd much rather have the underlying issue solved by understanding why either the ArrowStreamReader PyCapsule interface of pyarrow.Table or the Rust Box<dyn RecordBatchReader> part seems to swallow up RecordBatch metadata.

If that issue were fixed, this function could be removed again and a normal schema == record_batch.schema() check could be used. This function is only relevant if it's expected that RecordBatches coming from a stream reader somehow no longer carry metadata.

But in general, your comment would also be relevant for @kylebarron, as he uses this function as-is in his crate pyo3-arrow.
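
For reference, a metadata-insensitive comparison that also checks the field count could look roughly like this (just a sketch with an illustrative name; whether any such check should stay in Table::try_new at all is the open question above):

use arrow_schema::Schema;

/// Compare field count plus each top-level field's name, data type and nullability;
/// schema-level metadata and top-level field metadata are ignored.
fn schema_equals_ignoring_metadata(left: &Schema, right: &Schema) -> bool {
    left.fields().len() == right.fields().len()
        && left
            .fields()
            .iter()
            .zip(right.fields().iter())
            .all(|(l, r)| {
                l.name() == r.name()
                    && l.data_type() == r.data_type()
                    && l.is_nullable() == r.is_nullable()
            })
}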

for record_batch in &record_batches {
if !schema_equals(&schema, &record_batch.schema()) {
return Err(ArrowError::SchemaError(
//"All record batches must have the same schema.".to_owned(),
Member commented:

Suggested change (remove this line):
//"All record batches must have the same schema.".to_owned(),

@jonded94 (Contributor, Author) commented:

I only have the more verbose error message here right now to understand what's going on with the schema mismatch. The shorter message is currently commented out to signal that this is not intended to be merged as-is and that the schema mismatch issue should be understood first.

In general I have no strong opinion about how verbose the error message should be; I'd happily remove whichever variant you dislike.

Co-authored-by: Martin Grigorov <[email protected]>