Make rrdd more robust against domain appearing and disappearing #6515
Conversation
The consolidator used to be aware of which domains were paused; this was used exclusively to avoid reporting memory changes for paused domains. Move that responsibility to the domain memory reporter instead: the decision becomes local, which simplifies the code. This is useful to separate the memory code from the rest of rrdd.

Signed-off-by: Pau Ruiz Safont <[email protected]>
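To illustrate the intent (a minimal OCaml sketch with hypothetical names, not the actual xcp-rrdd code): the memory reporter filters out paused domains itself, so the consolidator no longer needs to know about pause state.

    (* Hypothetical domain record; the real code reads this from Xen. *)
    type domain = {domid: int; paused: bool; memory_kib: int64}

    (* The reporter drops paused domains locally, before building datasources,
       instead of letting the consolidator make that decision. *)
    let memory_datasources (domains : domain list) =
      domains
      |> List.filter (fun d -> not d.paused)
      |> List.map (fun d -> (d.domid, d.memory_kib))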
Somewhat confusing to follow (especially on Github, as it does not differentiate between new lines of code and moved lines of code).
Maybe some of the refactoring/reformatting in domain_snapshot could be split into its own commit, making it clearer that the key part (ensuring no metrics are removed because the domain was not noticed or was deleted) is in do_monitor_write?
Other than this, functionally everything looks good.
Currently rrdd needs to know when a metric comes from a newly created domain (after a local migration, for example). This is because when a new domain is created the counters start from zero again. This needs special logic for aggregating metrics, since xcp-rrdd needs to provide continuity of the metrics of a VM with a given UUID even if the domid changes.

Previously rrdd fetched the data about domains before metrics from plugins were collected, and reused that data for self-reported metrics. While this meant that for self-reported metrics it was impossible to miss collected information, for plugin metrics it meant that for created and destroyed domains the mapping between domain id and VM UUID was not available.

With the current change the domain ids and VM UUIDs are collected on every iteration of the monitor loop, and kept for one more iteration, so domains destroyed in the last iteration are remembered and not missed. With this done it's now safe to move the host and memory metrics collection into its own plugin.

Also use sequences more thoroughly in the code for transformations.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Force-pushed from 539ca31 to fd49f35
    let path = Printf.sprintf "/vm/%s/%s" uuid key in
    try Ezxenstore_core.Xenstore.(with_xs (fun xs -> xs.read path))
    with Xs_protocol.Enoent _hint ->
      info "Couldn't read path %s; falling back to actual uuid" path ;
Report __FUNCTION__?
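For context, OCaml's __FUNCTION__ (available since OCaml 4.12) expands to the fully qualified name of the enclosing function, so the suggestion amounts to something like the line below (a sketch of the suggested log line, not the actual patch):

    info "%s: Couldn't read path %s; falling back to actual uuid" __FUNCTION__ path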

Currently rrdd needs to know when a metric comes from a new domain (after a
local migration, for example). This is because when a new domain is created the
counters start from zero again, and so special logic is needed when
aggregating the metrics into rrds.
Previously rrdd collected this information before metrics were collected, which
meant that metrics collected by plugins could be lost if the domain was created
in that small window of time, or if the domain was destroyed after a plugin
collected data about it.
With the current change the domains are collected on every loop and added to the
domains collected in the previous loop, to avoid missing any newly created or
destroyed domains. The current iteration only gets fed data from the last
iteration, to avoid accumulating all domains seen since the start of xcp-rrdd.
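A minimal sketch of this mechanism (hypothetical names, assuming a domid-to-UUID map per iteration): the monitor loop unions the current snapshot with the previous one, and only ever keeps one iteration of history.

    module IntMap = Map.Make (Int)

    (* [read_current_domains ()] returns a map of domid -> VM UUID. *)
    let monitor_loop read_current_domains =
      let rec loop previous =
        let current = read_current_domains () in
        (* Prefer the fresh binding; keep last iteration's entries so a domain
           destroyed since the previous pass can still be resolved. *)
        let known =
          IntMap.union (fun _domid cur _prev -> Some cur) current previous
        in
        (* ... resolve plugin metrics against [known] here ... *)
        ignore known ;
        loop current (* only one iteration of history is retained *)
      in
      loop IntMap.empty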
With this done it's now safe to move the host and memory metrics collection
into its own plugin.
Also use sequences more thoroughly in the code for transformations.
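As a small illustration of the Seq style (a sketch with hypothetical names, not code from the patch), transformations can be chained lazily instead of building intermediate lists:

    type dom = {uuid: string; paused: bool}

    (* Lazily filter and project without allocating intermediate lists. *)
    let running_uuids domains =
      domains
      |> List.to_seq
      |> Seq.filter (fun d -> not d.paused)
      |> Seq.map (fun d -> d.uuid)
      |> List.of_seq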
I've manually tested this change by repeatedly single-host live-migrating a VM and checking that no beats are missed on the graphs.
