Skip to content

Conversation

@davmason
Copy link
Contributor

As described in #76430, we have a partner team running in to a previously unknown EventPipe issue. We can leak EventPipeThreadSessionState* under certain circumstances, leading to a fatal CLR error and crashing the process.

This fix is a targeted fix for 6.0 where we do a simple loop over the existing threads and manually delete any dangling session states when we delete the session object. The fix for 8.0 will be more of a refactoring to make sure EventPipeThreadSessionStates are owned in a logically consistent manner.

Customer Impact

Customers that have many threads and short lived sessions are vulnerable to this issue and there is no workaround. They will experience random crashes.

Risk

Low, it is a targeted fix that is easy to reason about

Testing

Manual testing by our team, and pending partner validation

@davmason davmason added this to the 6.0.x milestone Sep 30, 2022
@davmason davmason requested review from a team, lateralusX and noahfalk September 30, 2022 09:34
@davmason davmason self-assigned this Sep 30, 2022
@ghost ghost added the area-Tracing-coreclr label Sep 30, 2022
ep_rt_thread_array_iterator_t threads_iterator = ep_rt_thread_array_iterator_begin (&threads);
while (!ep_rt_thread_array_iterator_end (&threads, &threads_iterator)) {
EventPipeThread *thread = ep_rt_thread_array_iterator_value (&threads_iterator);
if (thread) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When can this be null?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it ever can be null, we only ever add non-null threads in ep_thread_get_threads. I think this just keeps getting copy pasted around, anywhere we iterate threads we check for null on each entry

Copy link
Member

@lateralusX lateralusX Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This check is really not needed (at least not anymore). ep_thread_get_threads is currently only called from ep_session_suspend_write_event (before this change added a new one), and the logic in ep_thread_get_threads already makes sure threads added to the lists are not null, so only way to get null pointers in the list is corruption or future coding errors. Maybe we should change the check to an EP_ASSERT or drop it.

Copy link
Member

@jeffschwMSFT jeffschwMSFT left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved. we will take for consideration. please get a code review and best if we can get verification from the partner that the issue is addressed.


ep_thread_requires_lock_held (thread);

EP_ASSERT (thread->session_state [ep_session_get_index (session)] != NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you hit this assert? Looking at the callers of ep_thread_get_session_state, you should end up with more asserts in the case this returns NULL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code in ep_session_remove_dangling_session_states iterates over all threads and under normal circumstances it will return NULL, it will only return non-NULL if a session state is leaked. So we will hit this assert in the code I added

Copy link
Member

@lateralusX lateralusX Oct 4, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! I didn't find your call since I didn't search your branch for callers of ep_thread_get_session_state and for the other callers it is critical that this assert holds. Maybe we could move the assert to the callers (only two places) where it should still be true, ep_thread_get_session_state shouldn't return a NULL session state.

// has been exceeded, we can leak the EventPipeThreadSessionState* and crash later trying to access
// the session from the thread session state. Whenever we terminate a session we check to make sure
// we haven't leaked any thread session states.
ep_thread_delete_session_state(thread, session);
Copy link
Member

@lateralusX lateralusX Oct 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case where the thread session state ends up in buffer_manager->thread_session_state_list and later deleted when we flush or delete the buffer manager buffers (ep_buffer_manager_deallocate_buffers), I guess the order in ep_session_free will prevent us from having any additional copies of thread session state when we call ep_session_remove_dangling_session_states?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's correct, I made sure to add the cleanup code as the last thing before we free the session object so everything except the leaked session states are cleaned up

@lateralusX
Copy link
Member

lateralusX commented Oct 3, 2022

Could an alternative fix be to just handle the case where we created a new thread session state but fails to add it to buffer manager, triggering the error case? All the logic seems to be located in ep_buffer_manager_write_event where we also detect the case where we drop events that will lead to the issue with leaked session state. We can detect when we create a new session state for current thread as well as failing to add it into buffer managers thread_session_state_list and make sure we get rid of the allocated thread state directly in ep_buffer_manager_write_event if/when we hit that case.

@davmason
Copy link
Contributor Author

davmason commented Oct 4, 2022

Could an alternative fix be to just handle the case where we created a new thread session state but fails to add it to buffer manager, triggering the error case? All the logic seems to be located in ep_buffer_manager_write_event where we also detect the case where we drop events that will lead to the issue with leaked session state. We can detect when we create a new session state for current thread as well as failing to add it into buffer managers thread_session_state_list and make sure we get rid of the allocated thread state directly in ep_buffer_manager_write_event if/when we hit that case.

I thought about that approach but this way seemed easier to reason about for servicing, so it has less risk of introducing bugs.

@JulieLeeMSFT
Copy link
Member

Please check the test failures.

@lateralusX
Copy link
Member

lateralusX commented Oct 4, 2022

Could an alternative fix be to just handle the case where we created a new thread session state but fails to add it to buffer manager, triggering the error case? All the logic seems to be located in ep_buffer_manager_write_event where we also detect the case where we drop events that will lead to the issue with leaked session state. We can detect when we create a new session state for current thread as well as failing to add it into buffer managers thread_session_state_list and make sure we get rid of the allocated thread state directly in ep_buffer_manager_write_event if/when we hit that case.

I thought about that approach but this way seemed easier to reason about for servicing, so it has less risk of introducing bugs.

OK, handling it on the thread that creates the state before ever getting into threads session list in case we hit failure case without any future needs to do lookup and potentially cleanup on all threads session state lists sounds rather straight forward and removes any potential multithreading issues related to the fix as well, but if you validated the current approach, I agree that we should go with the fix you feel most comfortable with and then we should probably fix it in main by handling the error case and only add the session state to the thread list when we successfully transferred ownership of session state.

Copy link
Member

@lateralusX lateralusX left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tommcdon
Copy link
Member

tommcdon commented Oct 4, 2022

We have partner teams reporting the same issue on 7.0, and so this change should also be considered for 7.0 as well.

@carlossanlop
Copy link
Contributor

@davmason can you please add the servicing-consider label and send an email to Tactics requesting approval?

@jeffschwMSFT jeffschwMSFT added the Servicing-consider Issue for next servicing release review label Oct 5, 2022
@ZacWein
Copy link

ZacWein commented Oct 5, 2022

I am from the partner team @davmason is referencing.

We went ahead and applied patched dotnet 6 dlls to half of our machines which had been encountering the issue and left the other half the same.

The patched machine are no longer crashing, but the un-patched machines have also stopped crashing which was unexpected.

Note the nature of these crashes is transient, so its possible that we just aren't seeing crashes at the moment, but may see unpatched machines start crashing tomorrow...

@davmason
Copy link
Contributor Author

davmason commented Oct 5, 2022

@jeffschwMSFT - I was able to create a local repro that demonstrates the issue, and my changes fix the issue. Combined with the fact that Zach was able to run the privates I provided him with no issues (so we verified it doesn't break him at least) I think we should proceed with the fix.

@jeffschwMSFT jeffschwMSFT modified the milestones: 6.0.x, 6.0.11 Oct 6, 2022
@jeffschwMSFT jeffschwMSFT added Servicing-approved Approved for servicing release and removed Servicing-consider Issue for next servicing release review labels Oct 6, 2022
@carlossanlop
Copy link
Contributor

CI green, approved, signed-off. No OOB package authoring changes needed.
Checked with David via chat, this is ready to merge (all feedback was addressed). :shipit:

@carlossanlop carlossanlop merged commit cea0fb5 into dotnet:release/6.0 Oct 7, 2022
@ghost ghost locked as resolved and limited conversation to collaborators Nov 6, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants