Make rrdd more robust against domain appearing and disappearing #6515
Conversation
The consolidator used to be aware of which domains were paused; this was used exclusively to avoid reporting memory changes for paused domains. Move that responsibility to the domain memory reporter instead: the decision becomes local, which simplifies the code. This is useful to separate the memory code from the rest of rrdd.

Signed-off-by: Pau Ruiz Safont <[email protected]>
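To illustrate the intent (a minimal OCaml sketch with hypothetical names, not the actual xcp-rrdd code): the memory reporter filters out paused domains itself, so the consolidator no longer needs to know about pause state.

    (* Hypothetical domain record; the real code reads this from Xen. *)
    type domain = {domid: int; paused: bool; memory_kib: int64}

    (* The reporter drops paused domains locally, before building datasources,
       instead of letting the consolidator make that decision. *)
    let memory_datasources (domains : domain list) =
      domains
      |> List.filter (fun d -> not d.paused)
      |> List.map (fun d -> (d.domid, d.memory_kib))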
Somewhat confusing to follow (especially on Github, as it does not differentiate between new lines of code and moved lines of code).
Maybe some of the refactoring/reformatting in domain_snapshot could be split into its own commit, making it clearer that the key part (ensuring no metrics are removed because the domain was not noticed or was deleted) is in do_monitor_write?
Other than this, functionally everything looks good.
Currently rrdd needs to know when a metric comes from a newly created domain (after a local migration, for example). This is because when a new domain is created the counters start from zero again. This needs special logic for aggregating metrics, since xcp-rrdd needs to provide continuity of the metrics of a VM with a given UUID even if the domid changes.

Previously rrdd fetched the data about domains before metrics from plugins were collected, and reused that data for self-reported metrics. While this meant that for self-reported metrics it was impossible to miss collected information, for plugin metrics it meant that for created and destroyed domains the mapping between domain id and VM UUID was not available.

With the current change the domain ids and VM UUIDs are collected on every iteration of the monitor loop, and kept for one more iteration, so domains destroyed in the last iteration are remembered and not missed. With this done it's now safe to move the host and memory metrics collection into its own plugin.

Also use sequences more thoroughly in the code for transformations.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Force-pushed from 539ca31 to fd49f35
    let path = Printf.sprintf "/vm/%s/%s" uuid key in
    try Ezxenstore_core.Xenstore.(with_xs (fun xs -> xs.read path))
    with Xs_protocol.Enoent _hint ->
      info "Couldn't read path %s; falling back to actual uuid" path ;
Report __FUNCTION__?
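For context, OCaml's __FUNCTION__ (available since OCaml 4.12) expands to the fully qualified name of the enclosing function, so the suggestion amounts to something like the line below (a sketch of the suggested log line, not the actual patch):

    info "%s: Couldn't read path %s; falling back to actual uuid" __FUNCTION__ path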

Currently rrdd needs to know when a metric comes from a new domain (after a
local migration, for example). This is because when a new domain is created the
counters start from zero again, and so special logic is needed when
aggregating the metrics into rrds.
Previously rrdd collected this information before metrics were collected, which
meant that metrics collected by plugins could be lost if the domain was created
in that small window of time, or if the domain was destroyed after a plugin
collected data about it.
With the current change the domains are collected on every loop and added to the
domains collected in the previous loop, to avoid missing any newly created or
destroyed domains. The current iteration only gets fed data from the last
iteration, to avoid accumulating all domains seen since the start of xcp-rrdd.
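A minimal sketch of this mechanism (hypothetical names, assuming a domid-to-UUID map per iteration): the monitor loop unions the current snapshot with the previous one, and only ever keeps one iteration of history.

    module IntMap = Map.Make (Int)

    (* [read_current_domains ()] returns a map of domid -> VM UUID. *)
    let monitor_loop read_current_domains =
      let rec loop previous =
        let current = read_current_domains () in
        (* Prefer the fresh binding; keep last iteration's entries so a domain
           destroyed since the previous pass can still be resolved. *)
        let known =
          IntMap.union (fun _domid cur _prev -> Some cur) current previous
        in
        (* ... resolve plugin metrics against [known] here ... *)
        ignore known ;
        loop current (* only one iteration of history is retained *)
      in
      loop IntMap.empty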
With this done it's now safe to move the host and memory metrics collection
into its own plugin.
Also use sequences more thoroughly in the code for transformations.
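As a small illustration of the Seq style (a sketch with hypothetical names, not code from the patch), transformations can be chained lazily instead of building intermediate lists:

    type dom = {uuid: string; paused: bool}

    (* Lazily filter and project without allocating intermediate lists. *)
    let running_uuids domains =
      domains
      |> List.to_seq
      |> Seq.filter (fun d -> not d.paused)
      |> Seq.map (fun d -> d.uuid)
      |> List.of_seq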
I've manually tested this change by repeatedly single-host live-migrating a VM and checking that no beats are missed on the graphs.
