- 
                Notifications
    You must be signed in to change notification settings 
- Fork 849
[sdk-metrics] Fix race condition for MemoryPoint Reclaim #5546
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[sdk-metrics] Fix race condition for MemoryPoint Reclaim #5546
Conversation
| Codecov ReportAll modified and coverable lines are covered by tests ✅ 
 Additional details and impacted files@@            Coverage Diff             @@
##             main    #5546      +/-   ##
==========================================
+ Coverage   83.38%   85.58%   +2.19%     
==========================================
  Files         297      289       -8     
  Lines       12531    12493      -38     
==========================================
+ Hits        10449    10692     +243     
+ Misses       2082     1801     -281     
 Flags with carried forward coverage won't be shown. Click here to find out more. 
 | 
| @Yun-Ting Would you mind spinning up an issue to make sure we have coyote test coverage over this delta metric point reclaim feature? | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Changes
On the main branch:
An update thread does the following:
MetricPointarrayReferenceCountfor theMetricPointat that indexMetricPointis valid for useMetricPointStatustoCollectPendingReferenceCountfor theMetricPointThe collect thread does the following:
MetricPointStatusfor a givenMetricPointisNoCollectPending, then check if it can be marked invalidMetricPointinvalid (this happens by setting theReferenceCounttoint.MinValuewhen no one is using it)MetricPointWhere's the race condition?
Consider this sequence of steps where both the Update thread and the Collect thread are working on the same MetricPoint
MetricPointindexMetricPointsMetricPointStatusfor theMetricPointMetricPointStatusisNoCollectPendingReferenceCountto1MetricPointis valid for useMetricPointStatustoCollectPending(This is the update that we would miss)ReferenceCountto0ReferenceCounttoint.MinValueas it finds theReferenceCountto be0MetricPointand misses the update that happened at T7Fix
I'm introducing a double-checked locking type construct to recheck if the
MetricPointStatuswas changed toCollectPendingbefore the Collect thread could mark the MetricPoint invalid for use. When that happens, we would now callSnapshotfor thatMetricPointand mark it to be reclaimed in the next Collect cycle. Note that theMetricPointwould remain invalid for use until the next Collect cycle reclaims it.