- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.7k
Adds Partitioned CSV test to object store access tests #18370
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds Partitioned CSV test to object store access tests #18370
Conversation
- Adds a new test to the object store access integration tests that selects all rows from a set of CSV files under a hive partitioned directory structure - Adds new test harness method to build a partitioned ListingTable backed by CSV data - Adds a new helper method to build a partitioned csv data and register the table
| /// Register a partitioned CSV table at the given path relative to the [`datafusion_test_data`] | ||
| /// directory | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When writing this I noticed the doc comments on these methods don't quite make sense since they're using a mem:// path, but the comment references the datafusion_test_data directory. Unless I'm missing something, I think the existing doc comments are incorrect?
If so, I'd be happy to change this doc comment and update the existing doc comments to be more accurate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please do -- I think the comments are outdated (from an earlier implementation that did in fact use the directory)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @BlakeOrth -- this is great and will make it much easier to understand the impact of other changes.
I left a few suggestions for some more tests, but I think we could add them as a follow on too
| ------- Object Store Request Summary ------- | ||
| RequestCountingObjectStore() | ||
| Total Requests: 13 | ||
| - LIST (with delimiter) prefix=data | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes it super clear what is going on. It is a terrifying number of LIST commands
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, one for each directory! I had initially used 3 files in each directory, but I thought this test produced an even more interesting result because there are more list requests than there are data files.
I will say one thing we can't easily see here is the sequencing and parallelism of the list requests. The current implementation does a pretty good job of hiding the latency behind concurrency.
| /// Register a partitioned CSV table at the given path relative to the [`datafusion_test_data`] | ||
| /// directory | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please do -- I think the comments are outdated (from an earlier implementation that did in fact use the directory)
| } | ||
|  | ||
| #[tokio::test] | ||
| async fn query_partitioned_csv_file() { | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also please add a test with a query that applies predicates to the three partition columns?
Something like
select * from csv_table_partitioned WHERE a = 2;-- apply predicate to last in directory
select * from csv_table_partitioned WHERE c = 200;-- apply predicate to both
select * from csv_table_partitioned WHERE a = 2 AND b = 20;There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely! I'll go ahead and add those onto this PR rather than a follow-on. It should be pretty quick and we can reduce some PR noise by combining them.
Makes doc comments more accurate
| @alamb I've added several additional tests cases with predicates. Thanks for this suggestion, I think these tests really show some interesting access behavior! | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@alamb I've added several additional tests cases with predicates. Thanks for this suggestion, I think these tests really show some interesting access behavior!
Sounds like a version of "May you live in interesting times" 😆
| Thank you @BlakeOrth | 
Which issue does this PR close?
N/A -- This PR is a supporting effort to:
ListFilesCacheto be available for partitioned tables #17211Rationale for this change
Adding these tests not only improves test coverage/expected output validation, but also gives us a common way to test and talk about object store access for specific query scenarios.
What changes are included in this PR?
Are these changes tested?
The changes are tests!
Are there any user-facing changes?
No
cc @alamb