Adds Partitioned CSV test to object store access tests #18370

BlakeOrth · 2025-10-29T23:50:21Z

Which issue does this PR close?

N/A -- This PR is a supporting effort to:

Rationale for this change

Adding these tests not only improves test coverage/expected output validation, but also gives us a common way to test and talk about object store access for specific query scenarios.

What changes are included in this PR?

- Adds a new test to the object store access integration tests that selects all rows from a set of CSV files under a hive-partitioned directory structure
- Adds a new test harness method to build a partitioned ListingTable backed by CSV data
- Adds a new helper method to build partitioned CSV data and register the table
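
For readers unfamiliar with the term, a hive-partitioned layout encodes partition column values in the directory names as `column=value` segments. The sketch below only illustrates that naming convention; `hive_path`, the `data` root, the column names, and the file name are hypothetical and are not the helper added in this PR.

```rust
/// Build a hive-style object path such as `data/a=1/b=10/c=100/file_1.csv`.
/// `hive_path` is a hypothetical illustration, not the PR's helper method.
fn hive_path(root: &str, partitions: &[(&str, &str)], file_name: &str) -> String {
    let mut segments = vec![root.to_string()];
    // Each partition column contributes one `column=value` directory level.
    for (column, value) in partitions {
        segments.push(format!("{column}={value}"));
    }
    segments.push(file_name.to_string());
    segments.join("/")
}

fn main() {
    // Assumed example values, chosen only for illustration.
    let path = hive_path("data", &[("a", "1"), ("b", "10"), ("c", "100")], "file_1.csv");
    println!("{path}");
}
```

Each partition column adds one directory level, which is why the depth of the tree matters for object store access.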

Are these changes tested?

The changes are tests!

Are there any user-facing changes?

No

cc @alamb


BlakeOrth · 2025-10-29T23:52:39Z

datafusion/core/tests/datasource/object_store_access.rs

+    /// Register a partitioned CSV table at the given path relative to the [`datafusion_test_data`]
+    /// directory


When writing this I noticed the doc comments on these methods don't quite make sense since they're using a mem:// path, but the comment references the datafusion_test_data directory. Unless I'm missing something, I think the existing doc comments are incorrect?

If so, I'd be happy to change this doc comment and update the existing doc comments to be more accurate.


alamb

Thank you @BlakeOrth -- this is great and will make it much easier to understand the impact of other changes.

I left a few suggestions for some more tests, but I think we could add them as a follow-on too.

alamb · 2025-10-30T13:18:50Z

datafusion/core/tests/datasource/object_store_access.rs

+    ------- Object Store Request Summary -------
+    RequestCountingObjectStore()
+    Total Requests: 13
+    - LIST (with delimiter) prefix=data


This makes it super clear what is going on. It is a terrifying number of LIST commands

Yes, one for each directory! I had initially used 3 files in each directory, but I thought this test produced an even more interesting result because there are more list requests than there are data files.

I will say one thing we can't easily see here is the sequencing and parallelism of the list requests. The current implementation does a pretty good job of hiding the latency behind concurrency.
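
To make the "one LIST per directory" observation concrete, here is a rough counting sketch. It assumes a simplified model in which every directory in the tree, including the table root, is listed exactly once, and it uses a hypothetical three-file layout; neither the layout nor the `directories` function is taken from the test itself.

```rust
use std::collections::BTreeSet;

/// Collect every directory prefix that contains one of the given files.
/// Under the assumed model, each returned directory costs one LIST request.
fn directories(files: &[&str]) -> BTreeSet<String> {
    let mut dirs = BTreeSet::new();
    for file in files {
        let segments: Vec<&str> = file.split('/').collect();
        // Every proper prefix of the path (excluding the file name) is a directory.
        for depth in 1..segments.len() {
            dirs.insert(segments[..depth].join("/"));
        }
    }
    dirs
}

fn main() {
    // Hypothetical layout: one file per leaf directory, three partition levels.
    let files = [
        "data/a=1/b=10/c=100/file_1.csv",
        "data/a=2/b=20/c=200/file_2.csv",
        "data/a=3/b=30/c=300/file_3.csv",
    ];
    let dirs = directories(&files);
    println!("{} directories for {} files", dirs.len(), files.len());
}
```

With one file per leaf directory, the number of directories grows with the partition depth, so the listing count can exceed the file count. As noted above, this says nothing about how the requests are sequenced or parallelized.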

alamb · 2025-10-30T13:19:15Z

datafusion/core/tests/datasource/object_store_access.rs

+    /// Register a partitioned CSV table at the given path relative to the [`datafusion_test_data`]
+    /// directory


Yes, please do -- I think the comments are outdated (from an earlier implementation that did in fact use the directory)

alamb · 2025-10-30T13:21:07Z

datafusion/core/tests/datasource/object_store_access.rs

 }

+#[tokio::test]
+async fn query_partitioned_csv_file() {


Could you also please add a test with a query that applies predicates to the three partition columns?

Something like

select * from csv_table_partitioned WHERE a = 2;

-- apply predicate to last in directory
select * from csv_table_partitioned WHERE c = 200;

-- apply predicate to both
select * from csv_table_partitioned WHERE a = 2 AND b = 20;

Absolutely! I'll go ahead and add those onto this PR rather than a follow-on. It should be pretty quick and we can reduce some PR noise by combining them.
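
As a rough illustration of why predicates on partition columns are interesting for these tests: the partition values live in the paths, so an equality predicate can in principle be answered from the path alone. The sketch below is a simplified model with hypothetical names (`partition_values`, `prune`) and a hypothetical layout. It is not how DataFusion implements pruning, and it does not predict how many LIST requests a given predicate saves.

```rust
use std::collections::HashMap;

/// Parse `column=value` path segments into a map; other segments are ignored.
fn partition_values(path: &str) -> HashMap<&str, &str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Keep only files whose partition values satisfy every `column = value` predicate.
fn prune<'a>(files: &[&'a str], predicates: &[(&str, &str)]) -> Vec<&'a str> {
    files
        .iter()
        .copied()
        .filter(|file| {
            let values = partition_values(file);
            predicates
                .iter()
                .all(|(column, expected)| values.get(*column).copied() == Some(*expected))
        })
        .collect()
}

fn main() {
    // Hypothetical layout, chosen only for illustration.
    let files = [
        "data/a=1/b=10/c=100/file_1.csv",
        "data/a=2/b=20/c=200/file_2.csv",
        "data/a=3/b=30/c=300/file_3.csv",
    ];
    // Mirrors `WHERE a = 2 AND b = 20` from the suggestion above.
    let kept = prune(&files, &[("a", "2"), ("b", "20")]);
    println!("{kept:?}");
}
```

A predicate on the first partition column could let a reader skip whole subtrees early, while a predicate only on the last column cannot be checked until the leaf directories are known. That difference is what the suggested test cases should expose in the request summaries.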

Makes doc comments more accurate

BlakeOrth · 2025-10-30T17:14:31Z

@alamb I've added several additional test cases with predicates. Thanks for this suggestion, I think these tests really show some interesting access behavior!

alamb

> @alamb I've added several additional test cases with predicates. Thanks for this suggestion, I think these tests really show some interesting access behavior!

Sounds like a version of "May you live in interesting times" 😆

alamb · 2025-10-30T18:51:23Z

Thank you @BlakeOrth

github-actions bot added the core Core DataFusion crate label Oct 29, 2025

alamb approved these changes Oct 30, 2025

Adds additional test cases

9da561c

Makes doc comments more accurate

alamb approved these changes Oct 30, 2025

alamb added this pull request to the merge queue Oct 30, 2025

Merged via the queue into apache:main with commit 6514ec7 Oct 30, 2025
28 checks passed
