Describe the bug
When merging GBs of high-cardinality dictionary data, the size reported by the RowInterner is a significant (multiple GB) undercount.
This leads our system to significantly exceed its configured memory limit in several cases.
I believe the bug is that Bucket::size() does not account for the size of the Bucket embedded in each Slot. I will make a PR shortly.
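To illustrate what I believe is happening, here is a minimal sketch (the struct and field names are assumed for illustration and do not exactly match the arrow-rs definitions): size() counts a bucket's own slot storage but never descends into the child Bucket that a Slot can own, so the memory of nested buckets is never reported.

// Minimal sketch, not the real arrow-rs types: field names are assumed.
struct Slot {
    value: u32,                 // interned key (placeholder type)
    child: Option<Box<Bucket>>, // nested bucket owned by this slot
}

struct Bucket {
    slots: Vec<Slot>,
}

impl Bucket {
    fn size(&self) -> usize {
        std::mem::size_of::<Self>()
            + self.slots.capacity() * std::mem::size_of::<Slot>()
            // the fix: also count every Bucket reachable through a Slot
            + self
                .slots
                .iter()
                .map(|s| s.child.as_ref().map_or(0, |c| c.size()))
                .sum::<usize>()
    }
}

fn main() {
    let bucket = Bucket {
        slots: vec![Slot {
            value: 0,
            child: Some(Box::new(Bucket { slots: Vec::new() })),
        }],
    };
    // Without the recursive term above, the inner Bucket's allocation is invisible
    println!("reported size: {} bytes", bucket.size());
}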
To Reproduce
I can reproduce this when merging GBs of high-cardinality, proprietary dictionary-encoded data.
I tried to write a unit test but could not figure out how. Any thoughts would be appreciated.
#[test]
fn test_intern_sizes() {
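// Note: this assumes the test lives inside arrow-row's interner module, so
// OrderPreservingInterner and Slot are in scope without extra imports.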
let mut interner = OrderPreservingInterner::default();
// Intern num_items values, each size_of::<usize>() bytes large,
// and the interner should report at least that much memory
// ...
let num_items = 3000;
let mut values: Vec<usize> = (0..num_items).collect();
values.reverse();
interner.intern(values.iter().map(|v| Some(v.to_be_bytes())));
let actual_size = interner.size();
let min_expected_size =
// at least space for each item
num_items * std::mem::size_of::<usize>()
// at least one slot for each item
+ num_items * std::mem::size_of::<Slot>();
println!("Actual size: {actual_size}, min {min_expected_size}");
assert!(actual_size > min_expected_size,
"actual_size {actual_size} not larger than min_expected_size: {min_expected_size}")
}
Expected behavior
interner.size() accurately reports the memory used by the interner, including the Buckets embedded in Slots.
Additional context
I found this while testing apache/datafusion#7130 with our internal data -- it did not reduce memory requirements the way I expected. I tracked the root cause down to this undercount.