Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
24 changes: 21 additions & 3 deletions arrow-json/src/reader/tape.rs
Original file line number Diff line number Diff line change
Expand Up @@ -705,9 +705,16 @@ fn err(b: u8, ctx: &str) -> ArrowError {

/// Creates a character from an UTF-16 surrogate pair
fn char_from_surrogate_pair(low: u16, high: u16) -> Result<char, ArrowError> {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}
Comment on lines +708 to +717
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a pair of checked_sub work?

Suggested change
match (low, high) {
(0xDC00..=0xDFFF, 0xD800..=0xDBFF) => {
let n = (((high - 0xD800) as u32) << 10) | ((low - 0xDC00) as u32 + 0x1_0000);
char::from_u32(n)
.ok_or_else(|| ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair {n}")))
}
_ => Err(ArrowError::JsonError(format!(
"Invalid UTF-16 surrogate pair. High: {high:#02X}, Low: {low:#02X}"
))),
}
char_from_surrogate_pair_opt(low, high).ok_or_else(|| {
ArrowError::JsonError(format!("Invalid UTF-16 surrogate pair: {lo:#02X}, {high:#02X}"))
})
}
fn char_from_surrogate_pair_opt(low: u16, high: u16) -> Option<char> {
let high = high.checked_sub(0xD800)? as u32;
let low = low.checked_sub(0xDC00)? as u32;
char::from_u32((high << 10) | (low + 0x1_0000))

Copy link
Contributor Author

@nicklan nicklan Jun 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did consider this, and it does have the benefit of being more succinct. I opted for the range check because if one of the arguments was too high the checked sub would not catch it and I wasn't able to convince myself that only valid inputs would allow char::from_u32 to succeed.

I could dig more into the calling layers to check if that's prevented there, but it still felt like more of a foot-gun than the simple range check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For what it is worth, I think the current way this is encoded is quite understandable and will generate clear error messages. I double checked and it conforms to my reading of UTF-16 on https://en.wikipedia.org/wiki/UTF-16

}

/// Writes `c` as UTF-8 to `out`
Expand Down Expand Up @@ -951,4 +958,15 @@ mod tests {
let err = decoder.finish().unwrap_err().to_string();
assert_eq!(err, "Json error: Encountered truncated UTF-8 sequence");
}

#[test]
fn test_invalid_surrogates() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I verified this test panic's on main:

attempt to subtract with overflow
thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:708:49:
attempt to subtract with overflow

And the test actually fails to detect an error in release mode


---- reader::tape::tests::test_invalid_surrogates stdout ----

thread 'reader::tape::tests::test_invalid_surrogates' panicked at arrow-json/src/reader/tape.rs:959:9:
assertion failed: res.is_err()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

let mut decoder = TapeDecoder::new(16, 2);
let res = decoder.decode(b"{\"test\": \"\\ud800\\ud801\"}");
assert!(res.is_err());

let mut decoder = TapeDecoder::new(16, 2);
let res = decoder.decode(b"{\"test\": \"\\udc00\\udc01\"}");
assert!(res.is_err());
}
}
Loading