Closed
Labels: parquet (Changes to the parquet crate)
Description
The observation is that MetadataBuilder holds both a Vec and a BTreeMap, so each field name is stored twice: once in the Vec and once in the BTreeMap.
arrow-rs/parquet-variant/src/builder.rs, lines 235 to 237 in 674dc17:

    struct MetadataBuilder {
        field_name_to_id: BTreeMap<String, u32>,
        field_names: Vec<String>,
I think the idea is that if we used an IndexSet, we would not need the second copy of each field name.
If I'm understanding correctly...
- We need a map (not necessarily ordered) in order to cheaply find the field id for a given field name.
- We need a vec that remembers insertion order?
- The map cannot reference strings in the vec, unless we're willing to mess with interior mutability and "fun" like that (but in theory we could use a hashmap that stores indexes into the vec, with a fancy custom indirect hasher)
- The vec cannot reference strings in the map, because there's no stable way to reference the strings it hosts.
Have we considered using an IndexSet? An IndexSet<String> should behave like a Vec<String> but with O(1) cost to return the index of any entry.
Originally posted by @scovich in #7795 (comment)
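To make the duplication concrete, here is a minimal std-only sketch of the two-collection pattern described above. It is a simplified stand-in, not the actual arrow-rs code: the method `upsert_field_name` and the helper `intern_all` are hypothetical names chosen for illustration, and the real builder has more to it than this.

```rust
use std::collections::BTreeMap;

/// Hypothetical, simplified stand-in for the builder quoted in this issue.
#[derive(Default)]
struct MetadataBuilder {
    /// Name -> id lookup.
    field_name_to_id: BTreeMap<String, u32>,
    /// Id -> name, in insertion order.
    field_names: Vec<String>,
}

impl MetadataBuilder {
    /// Returns the id for `name`, assigning the next free id the first time
    /// the name is seen. (Hypothetical method name.)
    fn upsert_field_name(&mut self, name: &str) -> u32 {
        if let Some(&id) = self.field_name_to_id.get(name) {
            return id;
        }
        let id = self.field_names.len() as u32;
        // The duplication in question: the name is allocated twice,
        // once for the map key and once for the vec entry.
        self.field_name_to_id.insert(name.to_string(), id);
        self.field_names.push(name.to_string());
        id
    }
}

/// Usage sketch: interns every name and returns the assigned ids together
/// with the names in insertion order. (Hypothetical helper.)
fn intern_all(names: &[&str]) -> (Vec<u32>, Vec<String>) {
    let mut builder = MetadataBuilder::default();
    let ids = names.iter().map(|n| builder.upsert_field_name(n)).collect();
    (ids, builder.field_names)
}
```

With an `IndexSet<String>` from the `indexmap` crate, the two fields would presumably collapse into one: as far as I know, `insert_full` returns the entry's index along with whether it was newly inserted, and `get_index_of` covers the lookup, so each name would be stored once.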