-
Couldn't load subscription status.
- Fork 13.9k
Description
Inspired by https://users.rust-lang.org/t/hash-prefix-collisions/71823/10?u=scottmcm
Hash::hash_slice has a bunch of text clarifying that h.hash_slice(&[a, b]); h.hash_slice(&[c]); is not guaranteed to be the same as h.hash_slice(&[a]); h.hash_slice(&[b, c]);.
However, Hasher::write is unclear whether that same rule applies to it. It's very clear that .write(&[a]) is not the same as .write_u8(a), but not whether the same sequence of bytes to write is supposed to be the same thing, even if they're in different groupings, like h.write(&[a, b]); h.write(&[c]); vs h.write(&[a]); h.write(&[b, c]);.
This is important for the same kind of things as the VecDeque example mentioned on hash_slice. If I have a circular byte buffer, is it legal for its Hash to just .write the two parts? Or does it need to write_u8 all the individual bytes since two circular buffers should compare equal regardless of where the split happens to be?
Given that Hash for str and Hash for [T] are doing prefix-freedom already, it feels to me like write should not be doing it again.
Also, our SipHasher implementation is going out of its way to maintain the "different chunking of writes is fine":
rust/library/core/src/hash/sip.rs
Lines 264 to 308 in 6bf3008
| fn write(&mut self, msg: &[u8]) { | |
| let length = msg.len(); | |
| self.length += length; | |
| let mut needed = 0; | |
| if self.ntail != 0 { | |
| needed = 8 - self.ntail; | |
| // SAFETY: `cmp::min(length, needed)` is guaranteed to not be over `length` | |
| self.tail |= unsafe { u8to64_le(msg, 0, cmp::min(length, needed)) } << (8 * self.ntail); | |
| if length < needed { | |
| self.ntail += length; | |
| return; | |
| } else { | |
| self.state.v3 ^= self.tail; | |
| S::c_rounds(&mut self.state); | |
| self.state.v0 ^= self.tail; | |
| self.ntail = 0; | |
| } | |
| } | |
| // Buffered tail is now flushed, process new input. | |
| let len = length - needed; | |
| let left = len & 0x7; // len % 8 | |
| let mut i = needed; | |
| while i < len - left { | |
| // SAFETY: because `len - left` is the biggest multiple of 8 under | |
| // `len`, and because `i` starts at `needed` where `len` is `length - needed`, | |
| // `i + 8` is guaranteed to be less than or equal to `length`. | |
| let mi = unsafe { load_int_le!(msg, i, u64) }; | |
| self.state.v3 ^= mi; | |
| S::c_rounds(&mut self.state); | |
| self.state.v0 ^= mi; | |
| i += 8; | |
| } | |
| // SAFETY: `i` is now `needed + len.div_euclid(8) * 8`, | |
| // so `i + left` = `needed + len` = `length`, which is by | |
| // definition equal to `msg.len()`. | |
| self.tail = unsafe { u8to64_le(msg, i, left) }; | |
| self.ntail = left; | |
| } |
So it seems to me like this has been the expected behaviour the whole time. And if not, we should optimize SipHasher to be faster.
cc #80303 which lead to this text in hash_slice.