PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197
Changes from 33 commits
```diff
@@ -191,6 +191,52 @@ enum FieldRepetitionType {
   REPEATED = 2;
 }
 
+/**
+ * A structure for capturing metadata for estimating the unencoded,
+ * uncompressed size of data written. This is useful for readers to estimate
+ * how much memory is needed to reconstruct data in their memory model and for
+ * fine-grained filter pushdown on nested structures (the histograms contained
+ * in this structure can help determine the number of nulls at a particular
+ * nesting level and the maximum length of lists).
+ */
+struct SizeStatistics {
+  /**
+   * The number of physical bytes stored for BYTE_ARRAY data values assuming
+   * no encoding. This is exclusive of the bytes needed to store the length of
+   * each byte array. In other words, this field is equivalent to the `(size
+   * of PLAIN-ENCODING the byte array values) - (4 bytes * number of values
+   * written)`. To determine unencoded sizes of other types readers can use
+   * schema information multiplied by the number of non-null and null values.
+   * The number of null/non-null values can be inferred from the histograms
+   * below.
+   *
+   * For example, if a column chunk is dictionary-encoded with dictionary
+   * ["a", "bc", "cde"], and a data page contains the indices [0, 0, 1, 2],
+   * then this value for that data page should be 7 (1 + 1 + 2 + 3).
+   *
+   * This field should only be set for types that use BYTE_ARRAY as their
+   * physical type.
+   */
+  1: optional i64 unencoded_byte_array_data_bytes;
+
+  /**
+   * When present, there is expected to be one element corresponding to each
+   * repetition (i.e. size=max repetition_level+1) where each element
+   * represents the number of times the repetition level was observed in the
+   * data.
+   *
+   * This field may be omitted if max_repetition_level is 0 without loss
+   * of information.
+   **/
+  2: optional list<i64> repetition_level_histogram;
```
emkornfield marked this conversation as resolved.
```diff
+
+  /**
+   * Same as repetition_level_histogram except for definition levels.
+   *
+   * This field may be omitted if max_definition_level is 0 or 1 without
+   * loss of information.
+   **/
+  3: optional list<i64> definition_level_histogram;
```
Member:
BTW, do we need to add an extra histogram for

Contributor (Author):
I don't think so. But I might not be following what you are suggesting (I gave two examples from Arrow below on usage).

Member:
If we store def_levels and rep_levels separately, how can we derive the number of nulls in each level precisely?

Member:
I am thinking of supporting pushing down filters like

Contributor (Author):
It might pay to illustrate exact queries, but if this is just answering the question of whether there is any null element at a particular nesting level, I think the definition level histogram by itself gives that information. Take nested lists where both lists and elements can be nullable at each level. IIRC, the definition levels would represent as follows:

So if the query is for the top-level list, I believe similar logic holds for

One thing the joint histogram (pairs of rep/def level counts) could give you is the number of first list elements that are null, but I'm not sure how useful that is. I would need to think about other queries the joint histogram would enable (or if you have more examples of supported queries we can figure out if one is needed).
```diff
+}
+
 /**
  * Statistics per row group and per page
  * All fields are optional.
```
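The dictionary-encoding example in the `unencoded_byte_array_data_bytes` comment can be checked with a short sketch. This is an illustrative helper, not part of any Parquet library; it just applies the definition above (PLAIN-encoded size minus the 4-byte length prefix per value):

```python
# Sketch: compute unencoded_byte_array_data_bytes for a dictionary-encoded
# data page, per the definition above. The dictionary and indices are the
# example values from the spec comment.

def unencoded_byte_array_data_bytes(dictionary, indices):
    # Sum of the raw value lengths referenced by the page, excluding the
    # 4-byte length prefix that PLAIN encoding would add per value.
    return sum(len(dictionary[i].encode("utf-8")) for i in indices)

dictionary = ["a", "bc", "cde"]
indices = [0, 0, 1, 2]
print(unencoded_byte_array_data_bytes(dictionary, indices))  # 1 + 1 + 2 + 3 = 7
```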
```diff
@@ -529,7 +575,7 @@ struct DataPageHeader {
   /** Encoding used for repetition levels **/
   4: required Encoding repetition_level_encoding;
 
-  /** Optional statistics for the data in this page**/
+  /** Optional statistics for the data in this page **/
   5: optional Statistics statistics;
 }
```
```diff
@@ -571,19 +617,19 @@ struct DataPageHeaderV2 {
 
   // repetition levels and definition levels are always using RLE (without size in it)
 
-  /** length of the definition levels */
+  /** Length of the definition levels */
   5: required i32 definition_levels_byte_length;
-  /** length of the repetition levels */
+  /** Length of the repetition levels */
   6: required i32 repetition_levels_byte_length;
 
-  /** whether the values are compressed.
+  /** Whether the values are compressed.
   Which means the section of the page between
   definition_levels_byte_length + repetition_levels_byte_length + 1 and compressed_page_size (included)
   is compressed with the compression_codec.
   If missing it is considered compressed */
   7: optional bool is_compressed = true;
 
-  /** optional statistics for the data in this page **/
+  /** Optional statistics for the data in this page **/
   8: optional Statistics statistics;
 }
```
```diff
@@ -596,11 +642,11 @@ union BloomFilterAlgorithm {
 }
 
 /** Hash strategy type annotation. xxHash is an extremely fast non-cryptographic hash
- * algorithm. It uses 64 bits version of xxHash.
+ * algorithm. It uses 64 bits version of xxHash.
 **/
 struct XxHash {}
 
-/**
+/**
  * The hash function used in Bloom filter. This function takes the hash of a column value
  * using plain encoding.
 **/
```
```diff
@@ -764,6 +810,14 @@ struct ColumnMetaData {
    * in a single I/O.
    */
   15: optional i32 bloom_filter_length;
 
+  /**
```
Contributor:
(not related to this line, but to

For completeness reasons, we might also want to add

This way, readers could plan how much memory the dictionary of a column chunk will take. This can help in decisions whether, e.g., to load the dictionary up-front to perform pre-filtering on the dictionary. It also helps to right-size the buffer that will hold the dictionary.

I'm not suggesting that this is a must-have for this commit or at all, so feel free to drop this issue. I just wanted to voice that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for.

Contributor (Author):
I agree this could be useful; my preference is to leave this out for now as it hasn't come up in discussion before. We can always add this as a follow-up.

Member:
These fields are already present in the page header, but it requires an inefficient hop to read them.
```diff
+   * Optional statistics to help estimate total memory when converted to in
+   * memory representations. The histograms contained in these statistics can
+   * also be useful in some cases for more fine-grained nullability/list length
+   * filter pushdown.
+   */
+  16: optional SizeStatistics size_statistics;
 }
```

emkornfield marked this conversation as resolved.
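For a flat column the definition level histogram alone is enough to recover null and non-null counts, which a reader can combine with the fixed type width to size its in-memory buffers. A minimal sketch, assuming a flat optional column (max_definition_level == 1, max_repetition_level == 0) and an Arrow-style layout with one validity bit per value; the function name and histogram values are hypothetical:

```python
# Sketch: estimate in-memory size for a flat fixed-width column from
# SizeStatistics.definition_level_histogram.

def estimate_fixed_width_bytes(definition_level_histogram, byte_width):
    null_count = definition_level_histogram[0]      # def level 0 => value is null
    non_null_count = definition_level_histogram[1]  # max def level => value present
    total_values = null_count + non_null_count
    # One fixed-width slot per value plus one validity bit per value.
    return total_values * byte_width + (total_values + 7) // 8

# 25 nulls and 975 present INT64 values:
# 1000 * 8 data bytes + 125 validity bytes
print(estimate_fixed_width_bytes([25, 975], 8))  # 8125
```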
```diff
@@ -772,7 +826,7 @@ struct EncryptionWithFooterKey {
 struct EncryptionWithColumnKey {
   /** Column path in schema **/
   1: required list<string> path_in_schema
 
   /** Retrieval metadata of column encryption key **/
   2: optional binary key_metadata
 }
```
```diff
@@ -811,7 +865,7 @@ struct ColumnChunk {
 
   /** Crypto metadata of encrypted columns **/
   8: optional ColumnCryptoMetaData crypto_metadata
 
   /** Encrypted column metadata for this chunk **/
   9: optional binary encrypted_column_metadata
 }
```
```diff
@@ -938,6 +992,13 @@ struct OffsetIndex {
    * that page_locations[i].first_row_index < page_locations[i+1].first_row_index.
    */
   1: required list<PageLocation> page_locations
+
+  /**
+   * Unencoded/uncompressed size for BYTE_ARRAY types.
+   *
+   * See documentation for unencoded_byte_array_data_bytes in SizeStatistics
+   * for more details on this field.
+   */
+  2: optional list<i64> unencoded_byte_array_data_bytes
 }
 
 /**
```
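With the per-page list in the OffsetIndex, a reader can estimate the unencoded BYTE_ARRAY bytes needed for a row range by summing the entries for pages that overlap the range. The sketch below uses illustrative names and example numbers, not a real Parquet reader API:

```python
# Sketch: sum OffsetIndex.unencoded_byte_array_data_bytes over the pages
# overlapping a requested row range.

def pages_overlapping(first_row_indexes, num_rows, row_begin, row_end):
    # first_row_indexes[i] corresponds to page_locations[i].first_row_index.
    for i, first in enumerate(first_row_indexes):
        last = (first_row_indexes[i + 1] - 1
                if i + 1 < len(first_row_indexes) else num_rows - 1)
        if first <= row_end and last >= row_begin:
            yield i

def estimate_bytes(first_row_indexes, num_rows, unencoded_sizes, row_begin, row_end):
    return sum(unencoded_sizes[i]
               for i in pages_overlapping(first_row_indexes, num_rows,
                                          row_begin, row_end))

# Three pages starting at rows 0, 100, 200 in a 300-row chunk; rows 150-250
# touch the second and third pages only.
print(estimate_bytes([0, 100, 200], 300, [4096, 8192, 2048], 150, 250))  # 10240
```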
```diff
@@ -977,6 +1038,25 @@ struct ColumnIndex {
 
   /** A list containing the number of null values for each page **/
   5: optional list<i64> null_counts
 
+  /**
+   * Contains repetition level histograms for each page
+   * concatenated together. The repetition_level_histogram field on
+   * SizeStatistics contains more details.
+   *
+   * When present the length should always be (number of pages *
+   * (max_repetition_level + 1)) elements.
+   *
+   * Element 0 is the first element of the histogram for the first page.
+   * Element (max_repetition_level + 1) is the first element of the histogram
+   * for the second page.
```
Comment on lines +1050 to +1052

Contributor:
I'll take a stab (and likely be wrong 😉). I believe the discussion has assumed the histograms form a matrix, with the row index being the page number and the column index being the level. Assuming that, what you have defined is the row-major ordering, where elements of the same row are contiguous in memory, as in a C matrix. What @JFinis and @pitrou seem to prefer is the opposite, where elements of the same column are contiguous in memory, as in a Fortran matrix.

Member:
Well, I guess I'm confused by the terminology as well. Perhaps we can drop "row-major" and "column-major" entirely when discussing this? :-) Personally, what I mean is that the levels of a given page should be contiguous in memory. I suppose that can be called "page-major" but I might be mistaken. (If it was a 2D array or list, you would index first by page and then by level.)

Contributor (Author):
Agreed on moving away from the "major" terminology; just assess whether the comment matches what you want (which I think it does). The other "major" option would be something like

Member:
Yes, it does. That said, I don't have a strong opinion on this anymore. I got confused by the "column-major" terminology and mistakenly assumed that Parquet columns were involved. Actually they're not :-)

Contributor (Author):
I'm listing an option. I agree there is tension between the size-estimation use case and the filtering use case; it really depends on what we want to favor (the plus side of favoring size estimation is that it is a more obvious representation).

Contributor:
Understood. And I'm just voicing a preference 😉 I don't need to squeeze every last microsecond out of size estimation, so am happy to yield to those with more pressing performance constraints.

Contributor:
I would expect the per-level representation to be slightly superior, as it is more useful for filtering. Filtering is a process that might lead to most pages being skipped, so the overall query time might be very short in this case. The most extreme case would be a point look-up where only a single row in a single page survives the filters. In that case, the performance of actually performing the filtering on the page index might have a measurable impact. In contrast, for the size-estimation case, we're estimating the size because we're planning to read the page. The read will take orders of magnitude longer, so it is not too important to avoid every possible cache miss here. That being said, we're talking about micro-optimizations. Even though my gut feeling is that the other ordering would be superior, I don't mind this order. We're not creating hundreds of lists anymore; that's the most important point for performance.

Contributor (Author):
OK, will leave as is, unless we get strong data on the trade-offs here.

Contributor:
Speaking of data, as will surprise no one, the current set of changes has a huge impact on the column index size (at least for the cases with thousands of pages with short histograms). I can post an update to the index size tables if there's interest.
```diff
+   **/
+  6: optional list<i64> repetition_level_histograms;
+
+  /**
+   * Same as repetition_level_histograms except for definition levels.
+   **/
+  7: optional list<i64> definition_level_histograms;
 }
```
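The concatenated ("page-major") layout discussed above means a single page's histogram is one contiguous slice of the flat list. A minimal sketch of the slicing arithmetic, using hypothetical example data:

```python
# Sketch: extract one page's level histogram from the flat, concatenated
# ColumnIndex list described above.

def page_histogram(histograms, page_index, max_level):
    # Each page contributes (max_level + 1) contiguous counts.
    width = max_level + 1
    start = page_index * width
    return histograms[start:start + width]

# Two pages with max_repetition_level = 2 -> 3 counts per page.
flat = [10, 5, 1,
        8, 6, 2]
print(page_histogram(flat, 1, 2))  # [8, 6, 2]
```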
```diff
@@ -985,7 +1065,7 @@ struct AesGcmV1 {
 
   /** Unique file identifier part of AAD suffix **/
   2: optional binary aad_file_unique
 
   /** In files encrypted with AAD prefix without storing it,
   * readers must supply the prefix **/
   3: optional bool supply_aad_prefix
 
@@ -997,7 +1077,7 @@ struct AesGcmCtrV1 {
 
   /** Unique file identifier part of AAD suffix **/
   2: optional binary aad_file_unique
 
   /** In files encrypted with AAD prefix without storing it,
   * readers must supply the prefix **/
   3: optional bool supply_aad_prefix
```