-
Notifications
You must be signed in to change notification settings - Fork 15
[WIP][Profiling] Prevent panic from crossing FFI boundaries #815
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
BenchmarksComparisonBenchmark execution time: 2025-01-06 14:04:28 Comparing candidate commit d8e4924 in PR branch Found 0 performance improvements and 3 performance regressions! Performance is the same for 48 metrics, 2 unstable metrics. scenario:normalization/normalize_service/normalize_service/Data🐨dog🐶 繋がっ⛰てて
scenario:tags/replace_trace_tags
CandidateCandidate benchmark detailsGroup 1
Group 2
Group 3
Group 4
Group 5
Group 6
Group 7
Group 8
Group 9
Group 10
Group 11
Group 12
BaselineOmitted due to size. |
579730c to
6730452
Compare
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #815 +/- ##
==========================================
- Coverage 71.03% 71.03% -0.01%
==========================================
Files 313 313
Lines 45908 45912 +4
==========================================
+ Hits 32612 32614 +2
- Misses 13296 13298 +2
|
6730452 to
d8e4924
Compare
|
This is a complex subject, but I think we've been avoiding addressing it for a bit too long. The ideal mode of operation would be to have no aborts, but in the real world sometimes one slips by... I think having a "last line of defense" on FFI functions that returns a clean error makes sense. Of course in some cases returning an error means the rust state can be in an inconsistent state, but it seems reasonable to put the burden on the API client to properly handle errors by backing off and stopping e.g. profiling, rather than tearing the whole app down. E.g. an abort is an extreme event, so I think stopping datadog stuff and logging an error (possibly via crashtracking...) is a reasonable outcome to such an extreme event (and is a much better alternative than killing off the host app). We could look into enabling lints to make it easier for us to catch things that may cause abort, but I suspect that will hinder and make really awkward a lot of the code we want to write. The Linux kernel folks have faced similar challenges, so perhaps there's some learnings there (I found this link on a quick googling). TL;DR: If we could write a nice macro to add this feature to most of our FFI, I think it's a reasonable and worth improvement. |
That's why this draft is here :) to talk about that subject. |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. To override this behavior, add the keep-open label or update the PR. |
…h_void_ffi_result` **What does this PR do?** This PR updates the `wrap_with_ffi_result` and `wrap_with_void_ffi_result` macros to catch any panics that happen inside them, returning them as errors. The error handling is made in such a way (see `handle_panic_error` for details) that it should be able to report back an error even if we fail to do any allocations. Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros. **Motivation:** In <https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing> (Datadog-only link, sorry!) we saw `ddog_prof_Exporter_send` crashing due to what can be summed up as `ddog_prof_Exporter_send` (report a profile) -> hyper-util tries to do dns resolution in a separate thread pool -> tokio failed to create a new thread -> panic and we tear down the app because we can't report a profile This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083. **Additional Notes:** While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the `..._no_catch` variants of the macros. **How to test the change?** This change includes test coverage. I've also separately tried to sprinkle a few `panic!` calls manually and tested that it works as expected.
…h_void_ffi_result` **What does this PR do?** This PR updates the `wrap_with_ffi_result` and `wrap_with_void_ffi_result` macros to catch any panics that happen inside them, returning them as errors. The error handling is made in such a way (see `handle_panic_error` for details) that it should be able to report back an error even if we fail to do any allocations. Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros. **Motivation:** In <https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing> (Datadog-only link, sorry!) we saw `ddog_prof_Exporter_send` crashing due to what can be summed up as `ddog_prof_Exporter_send` (report a profile) -> hyper-util tries to do dns resolution in a separate thread pool -> tokio failed to create a new thread -> panic and we tear down the app because we can't report a profile This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083. **Additional Notes:** While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the `..._no_catch` variants of the macros. **How to test the change?** This change includes test coverage. I've also separately tried to sprinkle a few `panic!` calls manually and tested that it works as expected.
…h_void_ffi_result` **What does this PR do?** This PR updates the `wrap_with_ffi_result` and `wrap_with_void_ffi_result` macros to catch any panics that happen inside them, returning them as errors. The error handling is made in such a way (see `handle_panic_error` for details) that it should be able to report back an error even if we fail to do any allocations. Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros. **Motivation:** In <https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing> (Datadog-only link, sorry!) we saw `ddog_prof_Exporter_send` crashing due to what can be summed up as `ddog_prof_Exporter_send` (report a profile) -> hyper-util tries to do dns resolution in a separate thread pool -> tokio failed to create a new thread -> panic and we tear down the app because we can't report a profile This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083. **Additional Notes:** While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the `..._no_catch` variants of the macros. **How to test the change?** This change includes test coverage. I've also separately tried to sprinkle a few `panic!` calls manually and tested that it works as expected.
…h_void_ffi_result` **What does this PR do?** This PR updates the `wrap_with_ffi_result` and `wrap_with_void_ffi_result` macros to catch any panics that happen inside them, returning them as errors. The error handling is made in such a way (see `handle_panic_error` for details) that it should be able to report back an error even if we fail to do any allocations. Important note: Because only the macros have been changed, and ffi APIs that don't use the macros are of course not affected and can still trigger panics. If we like this approach, I'll follow-up with a separate PR to update other APIs to use the new macros. **Motivation:** In <https://docs.google.com/document/d/1weMu9P03KKhPQ-gh9BMqRrEzpa1BnnY0LaSRGJbfc7A/edit?usp=sharing> (Datadog-only link, sorry!) we saw `ddog_prof_Exporter_send` crashing due to what can be summed up as `ddog_prof_Exporter_send` (report a profile) -> hyper-util tries to do dns resolution in a separate thread pool -> tokio failed to create a new thread -> panic and we tear down the app because we can't report a profile This is not good at all, and this PR solves this inspired by earlier work in #815 and #1083. **Additional Notes:** While I don't predict that will happen very often, callers that want to opt-out of the catch unwind behavior can still use the `..._no_catch` variants of the macros. The return type change in `ddog_crasht_CrashInfoBuilder_build` does change the tag enum entries, which unfortunately is a breaking change. Ideas on how to work around this? This makes the following enum entries change: * `DDOG_CRASHT_CRASH_INFO_NEW_RESULT_OK` => `DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_OK_HANDLE_CRASH_INFO` * `DDOG_CRASHT_CRASH_INFO_NEW_RESULT_ERR` => `DDOG_CRASHT_RESULT_HANDLE_CRASH_INFO_ERR_HANDLE_CRASH_INFO` **How to test the change?** This change includes test coverage. I've also separately tried to sprinkle a few `panic!` calls manually and tested that it works as expected.
What does this PR do?
This PR is a communication channel to discuss about preventing panic from crossing FFI boundaries.
The code change here is to discuss, and find the best way to achieve this.
Motivation
Playing around with the new crashtracker api, we got a crash (#756) because there were missing information. The example was not modified nor ran which left with an unusable version of libdatadog (crashtracking)
This may be a misusage and the client's code must be changed, but in no case, this should panic. (at the minimum, there should be tests.
Additional Notes
Anything else we should know when reviewing?
How to test the change?
Describe here in detail how the change can be validated.