-
Notifications
You must be signed in to change notification settings - Fork 5.2k
[release/6.0] Delete dangling thread session states #76431
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[release/6.0] Delete dangling thread session states #76431
Conversation
src/native/eventpipe/ep-session.c
Outdated
| ep_rt_thread_array_iterator_t threads_iterator = ep_rt_thread_array_iterator_begin (&threads); | ||
| while (!ep_rt_thread_array_iterator_end (&threads, &threads_iterator)) { | ||
| EventPipeThread *thread = ep_rt_thread_array_iterator_value (&threads_iterator); | ||
| if (thread) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When can this be null?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it ever can be null, we only ever add non-null threads in ep_thread_get_threads. I think this just keeps getting copy pasted around, anywhere we iterate threads we check for null on each entry
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This check is really not needed (at least not anymore). ep_thread_get_threads is currently only called from ep_session_suspend_write_event (before this change added a new one), and the logic in ep_thread_get_threads already makes sure threads added to the lists are not null, so only way to get null pointers in the list is corruption or future coding errors. Maybe we should change the check to an EP_ASSERT or drop it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
approved. we will take for consideration. please get a code review and best if we can get verification from the partner that the issue is addressed.
|
|
||
| ep_thread_requires_lock_held (thread); | ||
|
|
||
| EP_ASSERT (thread->session_state [ep_session_get_index (session)] != NULL); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you hit this assert? Looking at the callers of ep_thread_get_session_state, you should end up with more asserts in the case this returns NULL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The code in ep_session_remove_dangling_session_states iterates over all threads and under normal circumstances it will return NULL, it will only return non-NULL if a session state is leaked. So we will hit this assert in the code I added
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! I didn't find your call since I didn't search your branch for callers of ep_thread_get_session_state and for the other callers it is critical that this assert holds. Maybe we could move the assert to the callers (only two places) where it should still be true, ep_thread_get_session_state shouldn't return a NULL session state.
src/native/eventpipe/ep-session.c
Outdated
| // has been exceeded, we can leak the EventPipeThreadSessionState* and crash later trying to access | ||
| // the session from the thread session state. Whenever we terminate a session we check to make sure | ||
| // we haven't leaked any thread session states. | ||
| ep_thread_delete_session_state(thread, session); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case where the thread session state ends up in buffer_manager->thread_session_state_list and later deleted when we flush or delete the buffer manager buffers (ep_buffer_manager_deallocate_buffers), I guess the order in ep_session_free will prevent us from having any additional copies of thread session state when we call ep_session_remove_dangling_session_states?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's correct, I made sure to add the cleanup code as the last thing before we free the session object so everything except the leaked session states are cleaned up
|
Could an alternative fix be to just handle the case where we created a new thread session state but fails to add it to buffer manager, triggering the error case? All the logic seems to be located in ep_buffer_manager_write_event where we also detect the case where we drop events that will lead to the issue with leaked session state. We can detect when we create a new session state for current thread as well as failing to add it into buffer managers thread_session_state_list and make sure we get rid of the allocated thread state directly in ep_buffer_manager_write_event if/when we hit that case. |
I thought about that approach but this way seemed easier to reason about for servicing, so it has less risk of introducing bugs. |
|
Please check the test failures. |
OK, handling it on the thread that creates the state before ever getting into threads session list in case we hit failure case without any future needs to do lookup and potentially cleanup on all threads session state lists sounds rather straight forward and removes any potential multithreading issues related to the fix as well, but if you validated the current approach, I agree that we should go with the fix you feel most comfortable with and then we should probably fix it in main by handling the error case and only add the session state to the thread list when we successfully transferred ownership of session state. |
Co-authored-by: Johan Lorensson <[email protected]>
Co-authored-by: Johan Lorensson <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
We have partner teams reporting the same issue on 7.0, and so this change should also be considered for 7.0 as well. |
|
@davmason can you please add the |
|
I am from the partner team @davmason is referencing. We went ahead and applied patched dotnet 6 dlls to half of our machines which had been encountering the issue and left the other half the same. The patched machine are no longer crashing, but the un-patched machines have also stopped crashing which was unexpected. Note the nature of these crashes is transient, so its possible that we just aren't seeing crashes at the moment, but may see unpatched machines start crashing tomorrow... |
|
@jeffschwMSFT - I was able to create a local repro that demonstrates the issue, and my changes fix the issue. Combined with the fact that Zach was able to run the privates I provided him with no issues (so we verified it doesn't break him at least) I think we should proceed with the fix. |
|
CI green, approved, signed-off. No OOB package authoring changes needed. |
As described in #76430, we have a partner team running in to a previously unknown EventPipe issue. We can leak EventPipeThreadSessionState* under certain circumstances, leading to a fatal CLR error and crashing the process.
This fix is a targeted fix for 6.0 where we do a simple loop over the existing threads and manually delete any dangling session states when we delete the session object. The fix for 8.0 will be more of a refactoring to make sure EventPipeThreadSessionStates are owned in a logically consistent manner.
Customer Impact
Customers that have many threads and short lived sessions are vulnerable to this issue and there is no workaround. They will experience random crashes.
Risk
Low, it is a targeted fix that is easy to reason about
Testing
Manual testing by our team, and pending partner validation