Skip to content

Conversation

@psafont
Copy link
Member

@psafont psafont commented Jun 9, 2025

Currently rrdd needs to know when a metric comes from a new domain, (after a
local migration, for example). This is because when a new domain is created the
counters start from zero again, and so this needs special logic to handle when
aggregating the metrics into rrds.

Previously rrdd collected this information before metrics were collected, this means that metrics collected by plugins could be be lost if the
domain was created in that small amount of time, or if the domain was destroyed
after a plugin collected data about it.

With the current change the domains are collected every loop and added to the
domains collected in the previous loop to avoid missing any newly created or
destroyed domains. The current iteration only gets fed data from the last
iteration to avoid accumulating all domains seen since the start of xcp-rrdd.

With this done it's now safe to move the host and memory metrics collection
into its own plugin.

Also use sequences more throroughly in the code for transformations

I've manually tested this change by repeatedly by single-host live-migrating a VM and checking that no beats are missed on the graphs.
Screenshot 2025-06-09 at 15 55 54

The consolidator used to be aware of which domains were paused, this was used
to avoid reporting memory changes for paused domains, exclusively. Move that
responsibility to the domain memory reporter instead, this makes the decision
local, simplifying code.

This is useful to separate the memory code from the rest of rrdd.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Copy link
Contributor

@last-genius last-genius left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Somewhat confusing to follow (especially on Github, as it does not differentiate between new lines of code and moved lines of code).

Maybe some of the refactoring/reformatting in domain_snapshot could be split into its own commit, making it clearer that the key part of ensuring no metrics are removed because the domain was not noticed/deleted is in do_monitor_write?

Other than this, functionally everything looks good.

@psafont
Copy link
Member Author

psafont commented Jun 9, 2025

Somewhat confusing to follow (especially on Github, as it does not differentiate between new lines of code and moved lines of code).

I wish github remembered to hide whitespace for all PRs in xapi, it's much less confusing to review refactorings with changing indentation this way.
image

Currently rrdd needs to know when a metric comes from a newly created domain,
(after a local migration, for example). This is because when a new domain is
created the counters start from zero again. This needs special logic for
aggregating metrics since xcp-rrdd needs to provide continuity of metrics of a
VM with a UUID, even if the domid changes.

Previously rrdd fetched the data about domains before metrics from plugins
were collected, and reused the data for self-reported metrics. While this meant
that for self-reported metrics it was impossible to miss collected information,
for plugin metrics it meant that for created and destroyed domains, the
between between domain id and VM UUID was not available.

With the current change the domain ids and VM UUIDs are collected every
iteration of the monitor loop, and kept for one more iteration, so domains
destroyed in the last iteration are remembered and not missed.

With this done it's now safe to move the host and memory metrics collection
into its own plugin.

Also use sequences more thoroughly in the code for transformations

Signed-off-by: Pau Ruiz Safont <[email protected]>
@psafont psafont force-pushed the private/paus/numa-mem-metrics branch from 539ca31 to fd49f35 Compare June 9, 2025 15:57
let path = Printf.sprintf "/vm/%s/%s" uuid key in
try Ezxenstore_core.Xenstore.(with_xs (fun xs -> xs.read path))
with Xs_protocol.Enoent _hint ->
info "Couldn't read path %s; falling back to actual uuid" path ;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Report __FUNCTION__?

@psafont psafont added this pull request to the merge queue Jun 10, 2025
Merged via the queue into xapi-project:master with commit 6727c4e Jun 10, 2025
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants