Skip to content

Conversation

@friendlymatthew
Copy link
Contributor

@friendlymatthew friendlymatthew commented Jun 26, 2025

Rationale for this change

This PR simplifies how we build up the internal VariantBuffer by appending to it directly, rather than pre-allocating a buffer filled with zeroes and setting values at indices. This avoids indexing math that can be hard to follow and reason about.

This PR also aims to design a well-defined API for ValueBuffer. My thought here was we should not touch the inner Vec<u8>. It's quite sensitive.

Are there any user-facing changes?

Nope!

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jun 26, 2025
fn make_room_for_header(buffer: &mut Vec<u8>, start_pos: usize, header_size: usize) {
let current_len = buffer.len();
buffer.resize(current_len + header_size, 0);
fn write_header(header_size: usize, header_byte: u8, is_large: bool, num_items: usize) -> Vec<u8> {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noticed the ObjectBuilder and ListBuilder were writing out the header in the same way.

Moved it into a freestanding function called write_header.

@friendlymatthew friendlymatthew force-pushed the friendlymatthew/simplify branch from b85ebc8 to 822924f Compare June 26, 2025 19:03
Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One naming nit and several follow-ups to consider.

Comment on lines 58 to 59
for i in 0..nbytes {
buf[i as usize] = (value >> (i * 8)) as u8;
buf.push((value >> (i * 8)) as u8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: I wonder if it would be cheaper (better compiler codegen) to do something like:

let bytes = value.to_le_bytes();
buf.extend_from_slice(&bytes[..nbytes]);

Copy link
Contributor Author

@friendlymatthew friendlymatthew Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Godbolt showed the same instruction count for both implementations, so appending n bytes at a time makes sense to me 👍

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think extend_from_slice is more "idomatic" rust but I realize that is a opinion and I don't think this needs to be changed (though if I were writing the code I would use extend_from_slice)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR uses extend_from_slice!

/// Write little-endian integer to buffer
fn write_offset(buf: &mut Vec<u8>, value: usize, nbytes: u8) {
let bytes = value.to_le_bytes();
buf.extend_from_slice(&bytes[..nbytes as usize]);
}

Sorry if my comment above was unclear, I was agreeing with @scovich


let mut metadata = Vec::with_capacity(metadata_size);

// Write header: version=1, not sorted, with calculated offset_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we have a TODO for supporting sorted dictionaries? I liked @alamb idea to let callers pre-populate and pre-sort the dictionary if they wanted (before any variant objects reference it), and here we just check whether it's sorted as we write it out. If not sorted, go back and rewrite the header (which is at metadata[0], pretty safe)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think sorted dictionaries are covered by this ticket (and technically it was @zeroshade 's idea :) )

Also after this PR I think it will be rather elegant / easy to add support for sorted dictionaries. Thank you @friendlymatthew

write_offset(&mut metadata, cur_offset, offset_size);

// Write string data
for key in self.field_names.iter() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly...

  • We need a map (not necessarily ordered) in order to cheaply find the field id for a given field name.
  • We need a vec that remembers insertion order?
  • The map cannot reference strings in the vec, unless we're willing to mess with interior mutability and "fun" like that (but in theory we could use a hashmap that stores indexes into the vec, with a fancy custom indirect hasher)
  • The vec cannot reference strings in the map, because there's no stable way to reference the strings it hosts.

Have we considered using an IndexSet? An IndexSet<String> should behave like a Vec<String> but with O(1) cost to return the index of any entry.

Copy link
Contributor

@alamb alamb Jun 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree IndexSet would be a good choice of structure here and we have used it elsewhere to good effect

A good follow on ticket perhaps

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @scovich says, if/when we decide we need to optimize even more and avoid individual allocations for strings, there are fancier structures we could use but I think an IndexSet will get us pretty far

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, I'm a bit confused to why we need IndexSet. From what I understand, IndexSets offer a deterministic order of elements by insertion order and O(1) lookup time.

Our MetadataBuilder has a Vec of field names. We don't support Object field deletion, so don't we already have a deterministic ordering of field names?

Plus, we already have O(1) lookups, since field ids basically serve as indices into the Vec.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Our MetadataBuilder has a Vec of field names. We don't support Object field deletion, so don't we already have a deterministic ordering of field names?

MetadataBuilder has both a vec and a BTreeMap. Each field name is stored twice, once in the Vec and once in the BTreeMap.

struct MetadataBuilder {
field_name_to_id: BTreeMap<String, u32>,
field_names: Vec<String>,

I think the idea is if we used an IndexSet we would not need the second copy of the field

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 716 to 674
}
write_offset(
&mut self.parent_buffer.0[pos..pos + offset_size as usize],
data_size,
offset_size,
);

write_offset(&mut self.parent_buffer.0, data_size, offset_size);

self.parent_buffer.0.extend_from_slice(&self.buffer.0);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make sure I'm understanding:

  • When building a nested list or object, the value bytes go into a separate buffer (self.buffer), and the field ids+offsets are tracked separately as well.
  • When finalizing, we first write out the field ids, then the offsets, and then the value bytes

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this carefully and I think it is a significant improvement in the structure of the code and I think it sets us up nicely to support sorted dictionaries and other things. Thank you @friendlymatthew and @scovich

fyi @PinkCrow007 and @mkarbo as you may be interested in how the VariantBuilder is progressing along

self.append_slice(&micros.to_le_bytes());
}

fn append_timestamp_ntz_micros(&mut self, value: chrono::NaiveDateTime) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the way this code now looks 👍


let mut metadata = Vec::with_capacity(metadata_size);

// Write header: version=1, not sorted, with calculated offset_size
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW I think sorted dictionaries are covered by this ticket (and technically it was @zeroshade 's idea :) )

Also after this PR I think it will be rather elegant / easy to add support for sorted dictionaries. Thank you @friendlymatthew

write_offset(&mut metadata, cur_offset, offset_size);

// Write string data
for key in self.field_names.iter() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As @scovich says, if/when we decide we need to optimize even more and avoid individual allocations for strings, there are fancier structures we could use but I think an IndexSet will get us pretty far

@alamb
Copy link
Contributor

alamb commented Jun 27, 2025

Given there are no blockers to merge and two approvals, I am going to merge this one in to keep the code flowing. We can continue to iterate in follow on PRs.

@alamb alamb merged commit e42df82 into apache:main Jun 27, 2025
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants