[release/6.0] Delete dangling thread session states #76431

davmason · 2022-09-30T09:34:46Z

As described in #76430, we have a partner team running into a previously unknown EventPipe issue. We can leak an EventPipeThreadSessionState* under certain circumstances, leading to a fatal CLR error that crashes the process.

This is a targeted fix for 6.0: when we delete the session object, we do a simple loop over the existing threads and manually delete any dangling session states. The fix for 8.0 will be more of a refactoring to make sure EventPipeThreadSessionStates are owned in a logically consistent manner.
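To illustrate the shape of the cleanup loop, here is a minimal sketch on a simplified model. It is not the runtime code: `thread_t`, `state_t`, `session_t`, `MAX_SESSIONS` and `remove_dangling_session_states` are hypothetical stand-ins, and the per-thread locking that the real code requires is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MAX_SESSIONS 4  /* hypothetical; stands in for the runtime's session limit */

typedef struct { int index; } session_t;               /* simplified session */
typedef struct { const session_t *session; } state_t;  /* simplified per-thread session state */
typedef struct { state_t *session_state[MAX_SESSIONS]; } thread_t;

/* Walk every thread and free any state that is still registered for
 * `session`. Returns how many dangling states were found and freed. */
static size_t remove_dangling_session_states(thread_t *threads, size_t count,
                                             const session_t *session)
{
    size_t removed = 0;
    for (size_t i = 0; i < count; i++) {
        state_t *state = threads[i].session_state[session->index];
        if (state != NULL) {
            /* Normally this slot is already NULL; non-NULL means a leak. */
            free(state);
            threads[i].session_state[session->index] = NULL;
            removed++;
        }
    }
    return removed;
}
```

In this model the function would be called as the last step before the session itself is freed, so anything it finds is by definition leaked.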

Customer Impact

Customers that have many threads and short-lived sessions are vulnerable to this issue, and there is no workaround. They will experience random crashes.

Risk

Low. It is a targeted fix that is easy to reason about.

Testing

Manual testing by our team, and pending partner validation

jkotas · 2022-09-30T13:38:51Z

src/native/eventpipe/ep-session.c

+	ep_rt_thread_array_iterator_t threads_iterator = ep_rt_thread_array_iterator_begin (&threads);
+	while (!ep_rt_thread_array_iterator_end (&threads, &threads_iterator)) {
+		EventPipeThread *thread = ep_rt_thread_array_iterator_value (&threads_iterator);
+		if (thread) {


When can this be null?

I don't think it can ever be null; we only ever add non-null threads in ep_thread_get_threads. I think this check just keeps getting copy-pasted around: anywhere we iterate threads, we check each entry for null.

This check is really not needed (at least not anymore). Before this change added a new call, ep_thread_get_threads was only called from ep_session_suspend_write_event, and the logic in ep_thread_get_threads already makes sure threads added to the list are not null, so the only way to get null pointers in the list is corruption or future coding errors. Maybe we should change the check to an EP_ASSERT or drop it.
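A small sketch of what replacing the null check with an assert looks like in general. It uses the standard `assert` macro in place of EP_ASSERT, and `thread_t` and `sum_thread_ids` are hypothetical names for illustration only.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int id; } thread_t;  /* hypothetical stand-in for EventPipeThread */

/* Entries are non-NULL by contract, so a NULL entry is treated as a bug
 * and trips the assert in debug builds instead of being silently skipped. */
static int sum_thread_ids(thread_t *const *threads, size_t count)
{
    int sum = 0;
    for (size_t i = 0; i < count; i++) {
        assert(threads[i] != NULL);
        sum += threads[i]->id;
    }
    return sum;
}
```

The trade-off is that with asserts compiled out (for example with NDEBUG), a corrupted list would be dereferenced rather than skipped, which is why dropping the check only makes sense if the producer really guarantees non-null entries.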

jeffschwMSFT

Approved. We will take it for consideration. Please get a code review; it would be best if we can get verification from the partner that the issue is addressed.

lateralusX · 2022-10-03T10:18:25Z

src/native/eventpipe/ep-thread.c


 	ep_thread_requires_lock_held (thread);

-	EP_ASSERT (thread->session_state [ep_session_get_index (session)] != NULL);


Did you hit this assert? Looking at the callers of ep_thread_get_session_state, you should end up hitting more asserts if this returns NULL.

The code in ep_session_remove_dangling_session_states iterates over all threads, and under normal circumstances ep_thread_get_session_state returns NULL there; it only returns non-NULL if a session state is leaked. So we would hit this assert in the code I added.

Great! I didn't find your call because I didn't search your branch for callers of ep_thread_get_session_state, and for the other callers it is critical that this assert holds. Maybe we could move the assert to those callers (only two places), where it should still be true that ep_thread_get_session_state doesn't return a NULL session state.
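A rough sketch of moving the assert from the getter to a caller, on a simplified model. The names `get_session_state`, `record_event`, `thread_t`, `state_t` and `MAX_SESSIONS` are hypothetical, and the standard `assert` stands in for EP_ASSERT.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SESSIONS 4  /* hypothetical */

typedef struct { int index; } session_t;
typedef struct { int events_written; } state_t;
typedef struct { state_t *session_state[MAX_SESSIONS]; } thread_t;

/* Plain lookup: may return NULL, which a cleanup pass can rely on. */
static state_t *get_session_state(thread_t *thread, const session_t *session)
{
    return thread->session_state[session->index];
}

/* A caller on the write path, where the state must already exist. */
static void record_event(thread_t *thread, const session_t *session)
{
    state_t *state = get_session_state(thread, session);
    assert(state != NULL);  /* invariant is now checked at the call site */
    state->events_written++;
}
```

This keeps the invariant for the callers that depend on it while letting the cleanup code use the same getter to probe for leaked states.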

lateralusX · 2022-10-03T10:52:49Z

src/native/eventpipe/ep-session.c

+					// has been exceeded, we can leak the EventPipeThreadSessionState* and crash later trying to access 
+					// the session from the thread session state. Whenever we terminate a session we check to make sure
+					// we haven't leaked any thread session states.
+					ep_thread_delete_session_state(thread, session);


In the case where the thread session state ends up in buffer_manager->thread_session_state_list and is later deleted when we flush or delete the buffer manager buffers (ep_buffer_manager_deallocate_buffers), I guess the order in ep_session_free will prevent us from having any additional copies of the thread session state when we call ep_session_remove_dangling_session_states?

That's correct. I made sure to add the cleanup code as the last thing before we free the session object, so everything except the leaked session states is already cleaned up.

lateralusX · 2022-10-03T11:04:43Z

Could an alternative fix be to just handle the case where we create a new thread session state but fail to add it to the buffer manager, triggering the error case? All the logic seems to be located in ep_buffer_manager_write_event, where we also detect the case where we drop events, which leads to the issue with the leaked session state. We can detect when we create a new session state for the current thread and then fail to add it to the buffer manager's thread_session_state_list, and make sure we get rid of the allocated thread state directly in ep_buffer_manager_write_event if/when we hit that case.
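A minimal sketch of that alternative, assuming a simplified model in which registration with the buffer manager can fail because a fixed-size list is full. All names here (`buffer_manager_t`, `manager_try_add`, `get_or_create_state`, `LIST_CAPACITY`) are hypothetical, and the real failure condition in ep_buffer_manager_write_event is not reproduced.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define LIST_CAPACITY 2  /* hypothetical limit that makes registration fail */

typedef struct { int unused; } state_t;
typedef struct { state_t *states[LIST_CAPACITY]; size_t count; } buffer_manager_t;
typedef struct { state_t *session_state; } thread_t;

/* Try to hand ownership of `s` to the manager; fails when the list is full. */
static bool manager_try_add(buffer_manager_t *m, state_t *s)
{
    if (m->count == LIST_CAPACITY)
        return false;
    m->states[m->count++] = s;
    return true;
}

/* Returns the thread's state, creating it on first use. On failure nothing
 * is left behind: the thread slot stays NULL and the allocation is freed. */
static state_t *get_or_create_state(thread_t *thread, buffer_manager_t *m)
{
    if (thread->session_state != NULL)
        return thread->session_state;
    state_t *s = calloc(1, sizeof *s);
    if (s == NULL)
        return NULL;
    if (!manager_try_add(m, s)) {
        free(s);  /* error case handled where the state was created */
        return NULL;
    }
    thread->session_state = s;  /* publish only after ownership transfer */
    return s;
}
```

Because the state is never published on the thread unless the manager accepted it, there is nothing for a later pass over all threads to find.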

davmason · 2022-10-04T03:42:16Z

I thought about that approach but this way seemed easier to reason about for servicing, so it has less risk of introducing bugs.

JulieLeeMSFT · 2022-10-04T06:05:52Z

Please check the test failures.

lateralusX · 2022-10-04T07:48:56Z

OK. Handling it on the thread that creates the state, before the state ever gets into the thread's session list, sounds rather straightforward: in the failure case there would be no future need to look up and potentially clean up every thread's session state list, and it removes any potential multithreading issues related to the fix as well. But if you have validated the current approach, I agree that we should go with the fix you feel most comfortable with. We should then probably fix it in main by handling the error case and only adding the session state to the thread list once we have successfully transferred ownership of the session state.

src/native/eventpipe/ep-session.c

Co-authored-by: Johan Lorensson <[email protected]>

lateralusX

LGTM!

tommcdon · 2022-10-04T14:07:59Z

We have partner teams reporting the same issue on 7.0, so this change should be considered for 7.0 as well.

carlossanlop · 2022-10-05T20:13:29Z

@davmason can you please add the servicing-consider label and send an email to Tactics requesting approval?

ZacWein · 2022-10-05T21:46:06Z

I am from the partner team @davmason is referencing.

We went ahead and applied patched dotnet 6 dlls to half of our machines which had been encountering the issue and left the other half the same.

The patched machines are no longer crashing, but the unpatched machines have also stopped crashing, which was unexpected.

Note that the nature of these crashes is transient, so it's possible that we just aren't seeing crashes at the moment, but may see unpatched machines start crashing tomorrow...

davmason · 2022-10-05T22:31:08Z

@jeffschwMSFT - I was able to create a local repro that demonstrates the issue, and my changes fix the issue. Combined with the fact that Zach was able to run the privates I provided him with no issues (so we verified it doesn't break him at least) I think we should proceed with the fix.

carlossanlop · 2022-10-07T16:11:28Z

CI green, approved, signed-off. No OOB package authoring changes needed.
Checked with David via chat, this is ready to merge (all feedback was addressed).

Delete dangling thread session states

a0f2b9d

davmason added the EventPipe label Sep 30, 2022

davmason added this to the 6.0.x milestone Sep 30, 2022

davmason requested review from a team, lateralusX and noahfalk September 30, 2022 09:34

davmason self-assigned this Sep 30, 2022

ghost added the area-Tracing-coreclr label Sep 30, 2022

Add function prototype

0d36eab

jkotas reviewed Sep 30, 2022

jeffschwMSFT approved these changes Sep 30, 2022

lateralusX reviewed Oct 3, 2022

lateralusX reviewed Oct 4, 2022

lateralusX reviewed Oct 4, 2022

davmason and others added 3 commits October 4, 2022 01:13

Update src/native/eventpipe/ep-session.c

61dee66

Co-authored-by: Johan Lorensson <[email protected]>

Update src/native/eventpipe/ep-session.c

e182575

Co-authored-by: Johan Lorensson <[email protected]>

Code review feedback

86e4601

lateralusX approved these changes Oct 4, 2022

sandersaares mentioned this pull request Oct 4, 2022

There seems to be no way to disable collector recycling djluck/prometheus-net.DotNetRuntime#72

Closed

jeffschwMSFT added the Servicing-consider Issue for next servicing release review label Oct 5, 2022

davmason mentioned this pull request Oct 5, 2022

[Release/7.0] Delete dangling thread session states #76691

Merged

jeffschwMSFT modified the milestones: 6.0.x, 6.0.11 Oct 6, 2022

jeffschwMSFT added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Oct 6, 2022

carlossanlop merged commit cea0fb5 into dotnet:release/6.0 Oct 7, 2022

ghost locked as resolved and limited conversation to collaborators Nov 6, 2022

