@BlakeOrth
Contributor

Which issue does this PR close?

N/A -- This PR is a supporting effort to:

Rationale for this change

Adding these tests not only improves test coverage/expected output validation, but also gives us a common way to test and talk about object store access for specific query scenarios.

What changes are included in this PR?

  • Adds a new test to the object store access integration tests that selects all rows from a set of CSV files under a Hive-partitioned directory structure
  • Adds a new test harness method to build a partitioned ListingTable backed by CSV data
  • Adds a new helper method to build partitioned CSV data and register the table
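
For readers unfamiliar with the layout, here is a minimal sketch of what a Hive-partitioned object path looks like. The `hive_path` helper, the file name, and the partition values are hypothetical and are not taken from the PR; only the `data` prefix and the column names `a`, `b`, `c` appear in this conversation.

```rust
/// Build a Hive-style object path: `prefix/col1=val1/col2=val2/.../file_name`.
/// Hypothetical helper for illustration only; not part of the test harness.
fn hive_path(prefix: &str, partitions: &[(&str, &str)], file_name: &str) -> String {
    let mut segments = vec![prefix.to_string()];
    for (column, value) in partitions {
        // Each partition column becomes one `column=value` directory level.
        segments.push(format!("{column}={value}"));
    }
    segments.push(file_name.to_string());
    segments.join("/")
}

fn main() {
    // Assumed example values; the real test data may differ.
    let path = hive_path("data", &[("a", "1"), ("b", "10"), ("c", "100")], "file_1.csv");
    println!("{path}");
}
```

Each partition column adds one directory level, which is why the depth of the tree grows with the number of partition columns.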

Are these changes tested?

The changes are tests!

Are there any user-facing changes?

No

cc @alamb

@github-actions bot added the core (Core DataFusion crate) label on Oct 29, 2025
Comment on lines 418 to 419
/// Register a partitioned CSV table at the given path relative to the [`datafusion_test_data`]
/// directory
Contributor Author

When writing this I noticed the doc comments on these methods don't quite make sense since they're using a mem:// path, but the comment references the datafusion_test_data directory. Unless I'm missing something, I think the existing doc comments are incorrect?

If so, I'd be happy to change this doc comment and update the existing doc comments to be more accurate.

Contributor

Yes, please do -- I think the comments are outdated (from an earlier implementation that did in fact use the directory)

Contributor

@alamb left a comment

Thank you @BlakeOrth -- this is great and will make it much easier to understand the impact of other changes.

I left a few suggestions for some more tests, but I think we could add them as a follow-on too.

------- Object Store Request Summary -------
RequestCountingObjectStore()
Total Requests: 13
- LIST (with delimiter) prefix=data
Contributor

This makes it super clear what is going on. It is a terrifying number of LIST commands.

Contributor Author

Yes, one for each directory! I had initially used 3 files in each directory, but I thought this test produced an even more interesting result because there are more list requests than there are data files.

I will say one thing we can't easily see here is the sequencing and parallelism of the list requests. The current implementation does a pretty good job of hiding the latency behind concurrency.
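
To illustrate why a layout can need more LIST requests than it has data files, here is a rough sketch that counts the directories implied by a set of object paths. The paths and the `directories` helper are assumptions made up for this example; the sketch does not issue any object store requests and says nothing about how the listing code sequences them.

```rust
use std::collections::BTreeSet;

/// Collect every directory prefix (including the table root) implied by a
/// set of object paths. Hypothetical helper for illustration only.
fn directories(paths: &[&str]) -> BTreeSet<String> {
    let mut dirs = BTreeSet::new();
    for path in paths {
        let segments: Vec<&str> = path.split('/').collect();
        // Every proper prefix of the path (everything before the file name)
        // is a directory that a per-directory listing would have to visit.
        for end in 1..segments.len() {
            dirs.insert(segments[..end].join("/"));
        }
    }
    dirs
}

fn main() {
    // Assumed layout: one file per leaf directory, three partition columns.
    let files = [
        "data/a=1/b=10/c=100/file_1.csv",
        "data/a=2/b=20/c=200/file_2.csv",
        "data/a=3/b=30/c=300/file_3.csv",
    ];
    let dirs = directories(&files);
    // Prints "3 files, 10 directories".
    println!("{} files, {} directories", files.len(), dirs.len());
}
```

With one file per leaf and three partition levels, each file contributes three directories of its own plus the shared root, so the directory count outpaces the file count.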

}

#[tokio::test]
async fn query_partitioned_csv_file() {
Contributor

Could you also please add a test with a query that applies predicates to the three partition columns?

Something like

select * from csv_table_partitioned WHERE a = 2;
-- apply predicate to the last partition column in the directory path
select * from csv_table_partitioned WHERE c = 200;
-- apply predicates to the first two partition columns
select * from csv_table_partitioned WHERE a = 2 AND b = 20;
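
As a rough illustration of how equality predicates on partition columns relate to the directory layout, the sketch below parses `column=value` segments out of each path and keeps only the matching files. The paths, `partition_values`, and `prune` are hypothetical names invented for this example. It compares values as plain strings and is not how DataFusion evaluates partition filters or decides which requests to issue.

```rust
use std::collections::HashMap;

/// Parse `column=value` directory segments from a Hive-style path.
/// Segments without `=` (the root prefix, the file name) are skipped.
fn partition_values(path: &str) -> HashMap<&str, &str> {
    path.split('/')
        .filter_map(|segment| segment.split_once('='))
        .collect()
}

/// Keep only the paths whose partition values satisfy every
/// `column = value` predicate (string comparison only).
fn prune<'a>(paths: &[&'a str], predicates: &[(&str, &str)]) -> Vec<&'a str> {
    paths
        .iter()
        .copied()
        .filter(|path| {
            let values = partition_values(path);
            predicates
                .iter()
                .all(|&(column, expected)| values.get(column).copied() == Some(expected))
        })
        .collect()
}

fn main() {
    // Assumed layout; the real test data may differ.
    let paths = [
        "data/a=1/b=10/c=100/file_1.csv",
        "data/a=2/b=20/c=200/file_2.csv",
        "data/a=3/b=30/c=300/file_3.csv",
    ];
    // Mirrors: WHERE a = 2 AND b = 20
    let kept = prune(&paths, &[("a", "2"), ("b", "20")]);
    println!("{kept:?}");
}
```

A predicate on `a` could in principle be applied at the first directory level, while a predicate on `c` alone can only be checked once the leaf level is known, which is part of what makes these three queries interesting to compare.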

Contributor Author

Absolutely! I'll go ahead and add those onto this PR rather than a follow-on. It should be pretty quick and we can reduce some PR noise by combining them.

Makes doc comments more accurate
@BlakeOrth
Contributor Author

@alamb I've added several additional test cases with predicates. Thanks for this suggestion, I think these tests really show some interesting access behavior!

Contributor

@alamb left a comment

> @alamb I've added several additional test cases with predicates. Thanks for this suggestion, I think these tests really show some interesting access behavior!

Sounds like a version of "May you live in interesting times" 😆

@alamb added this pull request to the merge queue on Oct 30, 2025
@alamb
Contributor

alamb commented Oct 30, 2025

Thank you @BlakeOrth

Merged via the queue into apache:main with commit 6514ec7 Oct 30, 2025
28 checks passed
