-
Notifications
You must be signed in to change notification settings - Fork 15
[PROF-12853] Catch panics inside wrap_with_ffi_result and wrap_with_void_ffi_result
#1334
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…h_void_ffi_result` **What does this PR do?** This PR updates the `wrap_with_ffi_result` and `wrap_with_void_ffi_result` macros to catch any panics that happen inside them, returning them as errors. The error handling is made in such a way (see `handle_panic_error` for details) that it should be able to report back an error even if we fail to do any allocations. Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros. **Motivation:** In <https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing> (Datadog-only link, sorry!) we saw `ddog_prof_Exporter_send` crashing due to what can be summed up as `ddog_prof_Exporter_send` (report a profile) -> hyper-util tries to do dns resolution in a separate thread pool -> tokio failed to create a new thread -> panic and we tear down the app because we can't report a profile This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083. **Additional Notes:** While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the `..._no_catch` variants of the macros. The return type change in `ddog_crasht_CrashInfoBuilder_build` does change the tag enum entries, which unfortunately is a breaking change. Ideas on how to work around this? This makes the following enum entries change: * `DDOG_CRASHT_CRASH_INFO_NEW_RESULT_OK` => `DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_OK_HANDLE_CRASH_INFO` * `DDOG_CRASHT_CRASH_INFO_NEW_RESULT_ERR` => `DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_ERR_HANDLE_CRASH_INFO` **How to test the change?** This change includes test coverage. I've also separately tried to sprinkle a few `panic!` calls manually and tested that it works as expected.
danielsn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the consumer supposed to do if there was a panic caught? We should clearly document that.
| const CANNOT_ALLOCATE: &std::ffi::CStr = | ||
| c"libdatadog failed: (panic) Cannot allocate error message"; | ||
| const CANNOT_ALLOCATE_CHAR_SLICE: CharSlice = unsafe { | ||
| crate::Slice::from_raw_parts( | ||
| CANNOT_ALLOCATE.as_ptr(), | ||
| CANNOT_ALLOCATE.to_bytes_with_nul().len(), | ||
| ) | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a way to run this with MIRI to make sure we get lifetimes right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does miri not run our regular test suite? E.g. I somewhat assumed the new tests in utils.rs would be automatically covered... aren't they? 👀
| /// You probably don't want to use this directly. This constant is used by `handle_panic_error` to | ||
| /// signal that something went wrong, but avoid needing any allocations to represent it. | ||
| pub(crate) const CANNOT_ALLOCATE_ERROR: Error = Error { | ||
| message: Vec::new(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe add a comment for future maintainers that vec::new doesn't actually allocate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Docs improved in d3b2ef9
| match error { | ||
| None => CharSlice::empty(), | ||
| Some(err) => CharSlice::from(err.as_ref()), | ||
| // When the error is empty (CANNOT_ALLOCATE_ERROR) we assume we failed to allocate an actual |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you need to special case ddog_error_drop to avoid dropping the static const CANNOT_ALLOCATE_ERROR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's the case? Specifically ddog_error_drop receives the Error, not the CharSlice we produce here, so the ddog_error_drop should see the vec with capacity == 0, not the string produced here. But I may be missing something...? 👀
| // This pattern of String vs &str comes from | ||
| // https://doc.rust-lang.org/std/panic/struct.PanicHookInfo.html#method.payload | ||
| if let Some(s) = error.downcast_ref::<String>() { | ||
| anyhow::anyhow!("{} failed: (panic) {}", function_name, s) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
anyhow itself can allocate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! That's why there's this second catch_unwind: Either anyhow succeeds and we're able to return this error OR it fails and we fall back to the CANNOT_ALLOCATE_ERROR.
This way we should be able to return the nice error in most cases.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #1334 +/- ##
=======================================
Coverage 70.87% 70.88%
=======================================
Files 385 385
Lines 61838 61924 +86
=======================================
+ Hits 43828 43892 +64
- Misses 18010 18032 +22
🚀 New features to boost your workflow:
|
BenchmarksComparisonBenchmark execution time: 2025-11-14 21:12:01 Comparing candidate commit f8017b0 in PR branch Found 0 performance improvements and 0 performance regressions! Performance is the same for 55 metrics, 2 unstable metrics. CandidateCandidate benchmark detailsGroup 1
Group 2
Group 3
Group 4
Group 5
Group 6
Group 7
Group 8
Group 9
Group 10
Group 11
Group 12
Group 13
Group 14
Group 15
Group 16
Group 17
BaselineOmitted due to size. |
| // A Rust Vec of size 0 [has no allocated memory](https://doc.rust-lang.org/std/vec/struct.Vec.html#guarantees): | ||
| // "In particular, if you construct a Vec with capacity 0 via Vec::new, vec![], | ||
| // Vec::with_capacity(0), or by calling shrink_to_fit on an empty Vec, it will not allocate | ||
| // memory." And as per https://doc.rust-lang.org/nomicon/vec/vec-dealloc.html: | ||
| // "We must not call alloc::dealloc when self.cap == 0, as in this case we haven't actually | ||
| // allocated any memory." | ||
| if self.capacity == 0 { | ||
| return; | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was this necessary in some case?
This should naturally be handled by Drop for std::vec::Vec, because it's safe to drop an empty std::vec::Vec. After creating a std::vec::Vec of its contents, the FFI Vec will now be "dropped" but drops on pointers do nothing. I can't see why it would be necessary to early-return here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good question... At some point in my experiments I was using the mockalloc crate and it complained that we were still trying to drop the vec...
But I'm not confident if that's a real issue with our code or just some artifact of that crate/how I was setting up my test.
As you mentioned the alloc::vec::Vec::from_raw_parts doc kinda states that ptr only needs to be non-null and aligned if capacity is zero (e.g. doesn't need to be allocated using the global allocator) which means it won't touch it.
I'll leave it here to see if anyone has any ideas on why this may or may not be needed; and if nobody else chimes in I'll undo this part of the change.
|
If we do catch_unwind, we lose the rust frames for debugging. Anything we can do about that? |
|
Possibly we should publish a crash report when this panic happens? |
That would be cool! As a bonus we could actually include the panic message, which right now we don't get. |
Artifact Size Benchmark Reportaarch64-alpine-linux-musl
aarch64-unknown-linux-gnu
libdatadog-x64-windows
libdatadog-x86-windows
x86_64-alpine-linux-musl
x86_64-unknown-linux-gnu
|
What does this PR do?
This PR updates the
wrap_with_ffi_resultandwrap_with_void_ffi_resultmacros to catch any panics that happen inside them, returning them as errors.The error handling is made in such a way (see
handle_panic_errorfor details) that it should be able to report back an error even if we fail to do any allocations.Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros.
Motivation
In https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing (Datadog-only link, sorry!) we saw
ddog_prof_Exporter_sendcrashing due to what can be summed up asddog_prof_Exporter_send(report a profile) ->hyper-util tries to do dns resolution in a separate thread pool ->
tokio failed to create a new thread ->
panic and we tear down the app because we can't report a profile
This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083.
Additional Notes
While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the
..._no_catchvariants of the macros.The return type change in
ddog_crasht_CrashInfoBuilder_builddoes change the tag enum entries, which unfortunately is a breaking change.Ideas on how to work around this? This makes the following enum entries change:
DDOG_CRASHT_CRASH_INFO_NEW_RESULT_OK=>DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_OK_HANDLE_CRASH_INFODDOG_CRASHT_CRASH_INFO_NEW_RESULT_ERR=>DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_ERR_HANDLE_CRASH_INFOHow to test the change?
This change includes test coverage. I've also separately tried to sprinkle a few
panic!calls manually and tested that it works as expected.