PeterSolMS commented Nov 23, 2022

Fixes #78206

Dumps provided by the customer showed in all cases that the min_overflow_address/max_overflow_address fields had values different from their initial values of MAX_PTR and 0. This implies that a mark stack overflow has occurred, but has not been properly handled.
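To illustrate why non-initial values in these fields indicate an unhandled overflow, here is a minimal sketch of how such a range could be tracked. The `OverflowRange` type, `kMaxPtr`, and the method names are hypothetical stand-ins for illustration and are not the actual GC code; only the field names and their initial values (MAX_PTR and 0) come from the description above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the GC's MAX_PTR sentinel (assumption).
constexpr std::uintptr_t kMaxPtr = UINTPTR_MAX;

// Toy model of the overflow bookkeeping: when the mark stack is full,
// the address of the object that could not be pushed widens the range.
struct OverflowRange {
    std::uintptr_t min_overflow_address = kMaxPtr;
    std::uintptr_t max_overflow_address = 0;

    // Remember an object whose references still need to be traced later.
    void record(std::uintptr_t addr) {
        if (addr < min_overflow_address) min_overflow_address = addr;
        if (addr > max_overflow_address) max_overflow_address = addr;
    }

    // True if the fields differ from their initial values, meaning an
    // overflow happened and has not been processed yet.
    bool overflow_pending() const {
        return !(min_overflow_address == kMaxPtr && max_overflow_address == 0);
    }

    // Called once the recorded range has been rescanned.
    void reset() {
        min_overflow_address = kMaxPtr;
        max_overflow_address = 0;
    }
};
```

In this model, a heap dump taken after marking should always show the reset values; anything else means objects in the recorded range were never rescanned.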

Looking at the code, we realized that we may still have objects in the mark prefetch queue as we enter process_mark_overflow. These objects may cause another mark stack overflow when they are traced. So we need to drain the mark prefetch queue before we check the min_overflow_address/max_overflow_address fields.
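The ordering problem can be sketched with a toy marker. This is a simplified model under stated assumptions, not the real GC implementation: objects are plain indices, the mark stack has a tiny fixed capacity, and the `drain_queue_first` flag exists only to contrast the old and new orderings. The function names mirror those mentioned above, but the signatures and bodies are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Sentinel meaning "no overflow recorded" (plays the role of MAX_PTR).
constexpr std::size_t kNoMin = SIZE_MAX;

// Toy model only: children[i] lists the objects referenced by object i.
struct ToyMarker {
    std::vector<std::vector<std::size_t>> children;
    std::vector<bool> marked;
    std::size_t stack_capacity;
    std::vector<std::size_t> mark_stack;
    std::deque<std::size_t> mark_queue;  // objects whose marking is still pending
    std::size_t min_overflow = kNoMin;
    std::size_t max_overflow = 0;

    ToyMarker(std::vector<std::vector<std::size_t>> graph, std::size_t capacity)
        : children(std::move(graph)),
          marked(children.size(), false),
          stack_capacity(capacity) {}

    bool overflow_recorded() const { return min_overflow != kNoMin; }

    // Mark obj; push it for tracing, or record its address if the stack is full.
    void mark_object(std::size_t obj) {
        if (marked[obj]) return;
        marked[obj] = true;
        if (mark_stack.size() < stack_capacity) {
            mark_stack.push_back(obj);
        } else {
            if (obj < min_overflow) min_overflow = obj;
            if (obj > max_overflow) max_overflow = obj;
        }
    }

    void trace_children(std::size_t obj) {
        for (std::size_t child : children[obj]) mark_object(child);
    }

    void drain_mark_stack() {
        while (!mark_stack.empty()) {
            std::size_t obj = mark_stack.back();
            mark_stack.pop_back();
            trace_children(obj);
        }
    }

    // Marking queued objects can itself overflow the mark stack.
    void drain_mark_queue() {
        while (!mark_queue.empty()) {
            std::size_t obj = mark_queue.front();
            mark_queue.pop_front();
            mark_object(obj);
            drain_mark_stack();
        }
    }

    // Rescan the recorded range until no further overflow is recorded.
    void process_mark_overflow(bool drain_queue_first) {
        if (drain_queue_first) drain_mark_queue();  // the ordering this PR introduces
        while (overflow_recorded()) {
            std::size_t lo = min_overflow;
            std::size_t hi = max_overflow;
            min_overflow = kNoMin;
            max_overflow = 0;
            for (std::size_t obj = lo; obj <= hi; ++obj) {
                if (marked[obj]) {
                    trace_children(obj);
                    drain_mark_stack();
                }
            }
        }
    }
};
```

With a graph where object 0 references 1, 2, and 3, object 3 references 4, and a stack capacity of 1, calling `process_mark_overflow(false)` while object 0 is still queued returns immediately because no overflow has been recorded yet. Draining the queue afterwards records an overflow that nothing processes, and object 4 is never marked. Calling `process_mark_overflow(true)` drains first, so the overflow is seen and handled.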

We provided a private build of clrgc.dll with this fix to the customer reporting the issue, and the customer has validated that the fix resolves the issue.

…with mark stack overflow.

@ghost ghost added the area-GC-coreclr label Nov 23, 2022
@ghost ghost assigned PeterSolMS Nov 23, 2022

ghost commented Nov 23, 2022

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Author: PeterSolMS
Assignees: PeterSolMS
Labels: area-GC-coreclr

Milestone: -

@mangod9 mangod9 left a comment

Thanks for working on this. Assume we will port this to 7?

Maoni0 commented Nov 24, 2022

Oops, I forgot to mention that this means we should not have the drain_mark_queue call after scan_dependent_handles, because we are guaranteed to call process_mark_overflow, which in turn is guaranteed to run drain_mark_queue.

@PeterSolMS

Right, we have the invariant that the mark queue is empty after calling process_mark_overflow, and because of the logic inside both flavors of scan_dependent_handles, also after calling scan_dependent_handles. Thus the calls to drain_mark_queue right after calls to scan_dependent_handles can be eliminated, as well as one call to drain_mark_queue inside the WKS flavor of scan_dependent_handles.
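As a rough illustration of this invariant, the sketch below replaces a redundant drain with an emptiness check. Everything here is a toy model: `ToyMarkQueue`, its `verify_empty` method, and the function bodies are hypothetical and only loosely modeled on the names used in this conversation, not taken from the GC sources.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Toy queue of pending objects (indices stand in for object addresses).
struct ToyMarkQueue {
    std::deque<std::size_t> pending;
    std::size_t drained = 0;  // how many entries have been processed so far

    void drain() {
        while (!pending.empty()) {
            pending.pop_front();
            ++drained;
        }
    }

    bool is_empty() const { return pending.empty(); }

    // Checks the invariant instead of draining again.
    void verify_empty() const { assert(pending.empty()); }
};

// Hypothetical stand-in: always drains the queue before doing anything else.
void process_mark_overflow(ToyMarkQueue& queue) {
    queue.drain();
    // Overflow range processing is elided in this sketch.
}

// Hypothetical stand-in: may enqueue work, but always ends by calling
// process_mark_overflow, so the queue is empty on return.
void scan_dependent_handles(ToyMarkQueue& queue, std::size_t newly_promoted) {
    for (std::size_t i = 0; i < newly_promoted; ++i) queue.pending.push_back(i);
    process_mark_overflow(queue);
}

void mark_phase_step(ToyMarkQueue& queue, std::size_t newly_promoted) {
    scan_dependent_handles(queue, newly_promoted);
    queue.verify_empty();  // an extra drain here would be redundant
}
```

The design point is that the emptiness guarantee lives in one place (the overflow processing), and callers only assert it.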

…ere the mark queue should be empty with asserts.
…_t::verify_empty instead of an assert testing the result from mark_queue_t::get_next_marked().

mangod9 commented Nov 25, 2022

/backport to release/7.0

@github-actions

Started backporting to release/7.0: https://github.com/dotnet/runtime/actions/runs/3549509468

Successfully merging this pull request may close these issues.

Fatal error. Internal CLR error. (0x80131506) in .Net 7