[Variant] Avoid second copy of field name in MetadataBuilder #7814

@alamb

The observation is that MetadataBuilder has both a Vec and a BTreeMap, so each field name is stored twice: once in the Vec and once in the BTreeMap.

    struct MetadataBuilder {
        field_name_to_id: BTreeMap<String, u32>,
        field_names: Vec<String>,
    }

I think the idea is that if we used an IndexSet, we would not need the second copy of the field name.
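To make the duplication concrete, here is a minimal sketch of the two-collection layout. It is not the crate's actual code: the type name `TwoCopyMetadataBuilder` and the helpers `upsert_field_name` and `stored_name_bytes` are hypothetical and exist only for illustration.

```rust
use std::collections::BTreeMap;

/// Sketch of the two-collection layout (hypothetical, not the crate's code).
#[derive(Default)]
struct TwoCopyMetadataBuilder {
    field_name_to_id: BTreeMap<String, u32>,
    field_names: Vec<String>,
}

impl TwoCopyMetadataBuilder {
    /// Hypothetical helper: returns the id for `field_name`, adding it if needed.
    fn upsert_field_name(&mut self, field_name: &str) -> u32 {
        if let Some(&id) = self.field_name_to_id.get(field_name) {
            return id;
        }
        let id = self.field_names.len() as u32;
        // First owned copy: the map key.
        self.field_name_to_id.insert(field_name.to_string(), id);
        // Second owned copy: the insertion-ordered list.
        self.field_names.push(field_name.to_string());
        id
    }

    /// Total bytes of field-name data held across both collections.
    fn stored_name_bytes(&self) -> usize {
        let in_map: usize = self.field_name_to_id.keys().map(|k| k.len()).sum();
        let in_vec: usize = self.field_names.iter().map(|n| n.len()).sum();
        in_map + in_vec
    }
}
```

With the names "a" and "bb" inserted, `stored_name_bytes` reports 6 bytes for 3 bytes of distinct name data, which is the overhead this issue is about.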

If I'm understanding correctly...

  • We need a map (not necessarily ordered) in order to cheaply find the field id for a given field name.
  • We need a vec that remembers insertion order?
  • The map cannot reference strings in the vec, unless we're willing to mess with interior mutability and "fun" like that (but in theory we could use a hashmap that stores indexes into the vec, with a fancy custom indirect hasher)
  • The vec cannot reference strings in the map, because there's no stable way to reference the strings it hosts.
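The "hashmap that stores indexes into the vec" idea can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not a proposal for the crate: instead of a custom indirect hasher it keys the map by the precomputed hash of each name and resolves collisions by comparing against the strings in the vec. The names `FieldNames`, `hash_name`, `upsert`, and `name` are all hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical std-only stand-in for an insertion-ordered set of strings.
#[derive(Default)]
struct FieldNames {
    // The only owned copy of each field name; position in the vec == field id.
    names: Vec<String>,
    // Hash of a name -> ids of every stored name with that hash.
    buckets: HashMap<u64, Vec<u32>>,
}

fn hash_name(name: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    name.hash(&mut hasher);
    hasher.finish()
}

impl FieldNames {
    /// Returns the id for `name`, inserting it if it is not present yet.
    fn upsert(&mut self, name: &str) -> u32 {
        // Borrow the two fields separately so both can be used at once.
        let Self { names, buckets } = self;
        let bucket = buckets.entry(hash_name(name)).or_default();
        // Resolve hash collisions by comparing against the strings in the vec.
        if let Some(&id) = bucket.iter().find(|&&id| names[id as usize] == name) {
            return id;
        }
        let id = names.len() as u32;
        names.push(name.to_string());
        bucket.push(id);
        id
    }

    /// Looks up the field name for a previously assigned id.
    fn name(&self, id: u32) -> Option<&str> {
        self.names.get(id as usize).map(String::as_str)
    }
}
```

Each name is allocated once, lookups by name are O(1) on average, and ids follow insertion order. A purpose-built ordered set gives the same behavior without the hand-rolled bucket handling.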

Have we considered using an IndexSet? An IndexSet<String> should behave like a Vec<String> but with O(1) cost to return the index of any entry.

Originally posted by @scovich in #7795 (comment)

Labels: parquet (Changes to the parquet crate)