@changlei-li

No conflict.
Commit 62f8482 adds the new network interface to the CLI command so that the check introduced by #6502 passes.

LunfanZhang and others added 30 commits March 27, 2025 01:33
…onfigure

Add new host object fields:
  - ssh_enabled
  - ssh_enabled_timeout
  - ssh_expiry
  - console_idle_timeout
Add new host/pool API to set a timeout for a temporarily enabled SSH service
  - set_ssh_enabled_timeout
Add new host/pool API to set the console idle timeout
  - set_console_idle_timeout

Signed-off-by: Lunfan Zhang <[email protected]>
…i-project#6388)

This PR introduces support for Dom0 SSH control, providing the following
capabilities:

Query the SSH status.
Configure a temporary SSH enable timeout for a specific host or all
hosts in the pool.
Configure the console idle timeout for a specific host or all hosts in
the pool.

Changes
New Host Object Fields:

- `ssh_enabled`: Indicates whether SSH is enabled.
- `ssh_enabled_timeout`: Specifies the timeout for temporary SSH
enablement.
- `ssh_expiry`: Tracks the expiration time for temporary SSH enablement.
- `console_idle_timeout`: Configures the idle timeout for the console.

New Host/Pool APIs (this PR only includes the data model change; the
implementation of these APIs will be included in the next PR):

- `set_ssh_enabled_timeout`: Allows setting a temporary timeout for
enabling the SSH service.
- `set_console_idle_timeout`: Allows configuring the console idle
timeout.
Add "Changed" records for 2 APIs which were missed.

Fix "param_release" for 3 added parameters.

Signed-off-by: Gang Ji <[email protected]>
During pool join, create a new host obj in the remote pool coordinator
DB with the same SSH settings as pool coordinator.

Also configure SSH service locally before xapi restart which will
persist after xapi restart.

Signed-off-by: Gang Ji <[email protected]>
Useless here, the local DB will be dropped soon as the joiner will
switch to the remote DB of the new coordinator.

And latest_synced_updates_applied will be set to `unknown in host.create
in remote DB as default value.

Signed-off-by: Gang Ji <[email protected]>
After being ejected from a pool, a new host obj will be created with
default settings in DB.

This commit configures SSH service in the ejected host to default state
during pool eject.

Signed-off-by: Gang Ji <[email protected]>
Implemented XAPI APIs:
  - `host.set_console_idle_timeout`
  - `pool.set_console_idle_timeout`

These APIs allow XAPI to configure timeout for idle console sessions.

Signed-off-by: Lunfan Zhang <[email protected]>
Implemented XAPI APIs:
  - `host.set_ssh_enabled_timeout`
  - `pool.set_ssh_enabled_timeout`
These APIs allow XAPI to configure timeout for SSH service.
`host.enable_ssh` now also supports enabling the SSH service with a ssh_enabled_timeout

Signed-off-by: Lunfan Zhang <[email protected]>
Updated `records.ml` file to support `host-param-set/get/list` and `pool-param-set/get/list` for SSH-related fields.

Signed-off-by: Lunfan Zhang <[email protected]>
…ct#6394)

Implemented XAPI APIs:
  - `set_ssh_enabled_timeout`
  - `set_console_idle_timeout`

These APIs allow XAPI to configure timeouts for the SSH service and idle
console sessions from both host and pool level.

Updated `records.ml` to support `host-param-set/get/list` and
`pool-param-set/get/list` for SSH-related fields.
The error set_console_idle_timeout_failed was added in feature branch
while it is not used anywhere. The error used in
set_console_idle_timeout now is invalid_value.

Signed-off-by: Gang Ji <[email protected]>
Signed-off-by: Zeroday BYTE <[email protected]>
 - Ensure host.ssh_enabled reflects the actual SSH service state on startup, in case it was manually changed by the user.

 - Reschedule the "disable SSH" job if:
   - host.ssh_enabled_timeout is set to a positive value, and
   - host.ssh_expiry is in the future.

 - Disable SSH if:
   - host.ssh_enabled_timeout is set to a positive value, and
   - host.ssh_expiry is in the past.
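
The startup rules above can be summarised as a small decision function. This is only a sketch in Python, not the XAPI code: `startup_ssh_action` and its string results are hypothetical names, and treating an expiry exactly equal to the current time as expired is an assumption.

```python
from datetime import datetime, timedelta, timezone


def startup_ssh_action(ssh_enabled_timeout: int, ssh_expiry: datetime, now: datetime) -> str:
    """Return the action to take on startup (hypothetical helper).

    "none"       : no positive timeout is set, nothing to schedule
    "reschedule" : the timeout is positive and the expiry is still in the future
    "disable"    : the timeout is positive and the expiry has passed
    """
    if ssh_enabled_timeout <= 0:
        return "none"
    if ssh_expiry > now:
        return "reschedule"
    # Assumption: an expiry equal to `now` counts as already expired.
    return "disable"
```

For example, with a timeout of 600 and an expiry five minutes ahead of `now`, the function returns `"reschedule"`.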

Signed-off-by: Lunfan Zhang <[email protected]>
Viewing RRDs produced by xcp-rrdd is difficult, because the format is incompatible with rrdtool.
rrdtool has a hardcoded limit of 20 char for RRD names for backward compat with its binary format.

Steps:
* given a directory of xml .gz files containing xcp-rrdd produced rrds
* invokes itself recursively with each file in turn using xargs -P (easy way to parallelize on OCaml 4)
* load all RRDs, and split them into separate files, allowing us to shorten many of their names without conflicts
* some names are still too long, there is a builtin translation table to shorten these
* once split an .rrd file is created using 'rrdtool restore'. This can further be queried/inspected/transformed by rrdtool as needed
* a .sh script is produced that can plot the RRD if desired.
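
The name-shortening step can be illustrated with a short Python sketch. It is not taken from the tool: `shorten_names` is a hypothetical helper, the translation table is passed in by the caller, and the default limit of 19 follows rrdtool's documented data source name length, which is an assumption about what the 20-character figure above refers to.

```python
def shorten_names(names, translations, limit=19):
    """Map each RRD name to a name that fits within `limit` characters.

    Names that already fit are kept. Longer names must appear in
    `translations` (a hypothetical table: long name -> short name).
    """
    result = {}
    used = set()
    for name in names:
        short = name if len(name) <= limit else translations.get(name, name)
        if len(short) > limit:
            raise ValueError(f"{name!r} is too long and has no translation")
        if short in used:
            raise ValueError(f"{short!r} conflicts with another name")
        used.add(short)
        result[name] = short
    return result
```

Splitting the RRDs into separate files first keeps each set of names small, which is what makes conflicts after shortening less likely.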

There are many RRDs so plotting isn't done automatically yet.

RRDs contain min/avg/max usually, so this is drawn as a strong line at the average, and an area in a lighter color for min/max
(especially useful for historic data that has been aggregated).

Caveats:
* we don't know the unit name, that is part of the XAPI metadata, but not the XML apparently?
* separate plots are generated for separate intervals, it'd be nice to join all these into the same graph
* the visualization type is not the best for all RRDs, some might benefit from a smoother line, etc.
* for now the tool is just built, but not installed (that'll require a .spec change too and can be done later)

This is just a starting point to be able to visualize this data somehow, and we can improve the actual plotting later.

Signed-off-by: Edwin Török <[email protected]>
Add 'tyre' dependency.

Signed-off-by: Edwin Török <[email protected]>
 - Refine the exception when host.enable_ssh/host.disable_ssh failed
 - Reset the host.ssh_expiry to default when host.enable_ssh with no
   timeout

Signed-off-by: Lunfan Zhang[Lunfan.Zhang] <[email protected]>
When we have multiple SM plugins in XAPI for the same type (which
happens only because of past problems) and want to remove the obsolete
one, do this by reference. The code so far was assuming only one per
type and looked up the reference by name which was not unique and hence
could end up removing the wrong SM entry.
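
A minimal Python sketch of why removal by reference is safer. The record layout (`type`, `version`) and the helper names are hypothetical and only model the idea; they are not XAPI's data model.

```python
def refs_for_type(sm_records, sm_type):
    """Return every reference whose record has the given type (may be several)."""
    return [ref for ref, rec in sm_records.items() if rec["type"] == sm_type]


def remove_by_ref(sm_records, obsolete_ref):
    """Return a copy of the records without the entry at `obsolete_ref`."""
    remaining = dict(sm_records)
    del remaining[obsolete_ref]
    return remaining
```

When two entries share a type, a lookup by type or name returns both, so only the reference identifies the obsolete one unambiguously.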

Signed-off-by: Christian Lindig <[email protected]>
…i-project#6490)

When we have multiple SM plugins in XAPI for the same type (which
happens only because of past problems) and want to remove the obsolete
one, do this by reference. The code so far was assuming only one per
type and looked up the reference by name which was not unique and hence
could end up removing the wrong SM entry.
When adding a feature, developers had to change the variant, and the list
all_features. Now the list is autogenerated from the variant, and the
compiler will complain if its properties are not defined. Also reduced
complexity of the code in the rest of the module.

Signed-off-by: Pau Ruiz Safont <[email protected]>
I tried sharing more code between hard and soft affinities, but the memory
management of the two cpumaps blows up the number of branches that need to be
taken care of, making it more worthwhile to duplicate a bit of code instead.

Signed-off-by: Pau Ruiz Safont <[email protected]>
No functional change. This prepares pre_build in the domain module to be able
to set the hard affinity mask, without communicating the mask to xenguest
through xenstore.

Signed-off-by: Pau Ruiz Safont <[email protected]>
psafont and others added 26 commits June 5, 2025 14:18
Some rough guidelines on the contribution process for the project.
Intended more as a starting point for a discussion.
I wrote this in ~2022, so I don't fully remember how it all works, but I
tried to document what I know in the commit message, in the CLI flag
docs and here:
```
scp -r root@$YOURBOX:/var/lib/xcp/blobs/rrds /tmp/rrds
dune exec ./rrdview.exe -- /tmp/rrds
bash /tmp/rrds/16db833b-7cd6-4b69-9037-144076c71033.cpu_avg.DERIVE.sh
```

![rrd](https://github.com/user-attachments/assets/b3c98936-cc29-4871-b4ea-1511853d623e)

Viewing RRDs produced by xcp-rrdd is difficult, because the format is
incompatible with rrdtool. rrdtool has a hardcoded limit of 20 char for
RRD names for backward compat with its binary format.

Steps:
* given a directory of xml .gz files containing xcp-rrdd produced rrds
* invokes itself recursively with each file in turn using xargs -P (easy
way to parallelize on OCaml 4)
* load all RRDs, and split them into separate files, allowing us to
shorten many of their names without conflicts
* some names are still too long, there is a builtin translation table to
shorten these
* once split an .rrd file is created using 'rrdtool restore'. This can
further be queried/inspected/transformed by rrdtool as needed
* a .sh script is produced that can plot the RRD if desired.

There are many RRDs so plotting isn't done automatically yet.

RRDs contain min/avg/max usually, so this is drawn as a strong line at
the average, and an area in a lighter color for min/max (especially
useful for historic data that has been aggregated).

Caveats:
* we don't know the unit name, that is part of the XAPI metadata, but
not the XML apparently?
* separate plots are generated for separate intervals, it'd be nice to
join all these into the same graph
* the visualization type is not the best for all RRDs, some might
benefit from a smoother line, etc.
* for now the tool is just built, but not installed (that'll require a
.spec change too and can be done later)
* there is some code there to start parsing the data source definitions,
eventually I wanted to plot the data using OCaml instead of rrdtool
(e.g. generate Vega/Vega-lite graphs), but I can't find which branch I
put *that* code on, what I have here is incomplete (or maybe I never
wrote that part, just thought about it). We could trim the dead code
from here if needed, but it might be useful if we continue improving the
tool later, so for now I left the parsing in.

This is just a starting point to be able to visualize this data somehow,
and we can improve the actual plotting later.
The consolidator used to be aware of which domains were paused, this was used
to avoid reporting memory changes for paused domains, exclusively. Move that
responsibility to the domain memory reporter instead, this makes the decision
local, simplifying code.

This is useful to separate the memory code from the rest of rrdd.

Signed-off-by: Pau Ruiz Safont <[email protected]>
Update all `25.20.0-next` to `25.21.0` in `datamodel_lifecycle.ml`.

Signed-off-by: Bengang Yuan <[email protected]>
The /sys/fs/cgroup/systemd/cgroup.procs file is not always present,
particularly in updated Linux systems with newer cgroup and SystemD.
So fall back to the root /sys/fs/cgroup/cgroup.procs.
Also handle and report errors back to OCaml.

Although SystemD discourages handling cgroups without service
configuration changes, the root cgroup is a bit special as it receives
processes from multiple sources, including the kernel.
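
The fallback order described above can be sketched as follows. This is a Python illustration of the idea, not the helper changed by the commit; `move_to_root_cgroup` is a hypothetical name, and only the two paths come from the commit message.

```python
import os

# Paths from the commit message, tried in this order.
CANDIDATES = (
    "/sys/fs/cgroup/systemd/cgroup.procs",
    "/sys/fs/cgroup/cgroup.procs",
)


def move_to_root_cgroup(pid, candidates=CANDIDATES):
    """Write `pid` to the first existing cgroup.procs file and return its path.

    Errors are not swallowed: a missing file in every location raises
    FileNotFoundError, and a failed write propagates as OSError.
    """
    for path in candidates:
        if os.path.exists(path):
            with open(path, "w") as procs:
                procs.write(f"{pid}\n")
            return path
    raise FileNotFoundError("no cgroup.procs found in: " + ", ".join(candidates))
```

Letting the exceptions propagate mirrors the second point of the commit: the caller learns about the failure instead of it being silently ignored.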

Signed-off-by: Frediano Ziglio <[email protected]>
Update all `25.20.0-next` to `25.21.0` in `datamodel_lifecycle.ml`.
Unfortunately mirage-crypto has accumulated breaking changes:
- Cstructs have been replaced with strings
- The digestif library has replaced ad-hoc hash implementation

A deprecation has happened as well:
- RNG initialization has changed

Signed-off-by: Pau Ruiz Safont <[email protected]>
This API call and corresponding XE implementation calls a host plugin
on the host where a VM is running. It thus takes care of finding the
right host, compared to Host.call_plugin where this would be left to the
user.

Signed-off-by: Christian Lindig <[email protected]>
Add a new function that will invoke a callback every time one of the tasks is
deemed non-pending. This will allow its users to:

1) track the progress of tasks within the submitted batch
2) schedule new tasks to replace the completed ones

Modify wait_for_all_inner so that it adds the tasks returned from the callback
to its internal set on every new task completion.

Signed-off-by: Andrii Sultanov <[email protected]>
With bab83d9, host evacuation was parallelized
by grouping VMs into batches, and starting a new batch once the previous one
has finished. This means that a single slow VM can potentially slow down the
whole evacuation.

Instead use Tasks.wait_for_all_with_callback to schedule a new migration as
soon as any of the previous ones have finished, thus maintaining a constant
flow of n migrations.
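
The scheduling idea can be sketched in Python with `concurrent.futures`. This is an illustration only: the real `Tasks.wait_for_all_with_callback` works on XAPI tasks, while here tasks are futures, `evacuate` and `migrate` are hypothetical stand-ins, and `n` is assumed to be at least 1.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def wait_for_all_with_callback(tasks, callback):
    """Wait for all tasks; call `callback` for each one that completes.

    The callback returns an iterable of new tasks, which are added to the
    set being waited on.
    """
    pending = set(tasks)
    while pending:
        done, pending = wait(pending, return_when=FIRST_COMPLETED)
        for task in done:
            pending |= set(callback(task))


def evacuate(vms, migrate, n):
    """Migrate all VMs while keeping up to `n` migrations in flight."""
    queue = list(vms)
    finished = []
    with ThreadPoolExecutor(max_workers=n) as pool:
        def start_next():
            # Start one more migration if any VM is still queued.
            return [pool.submit(migrate, queue.pop(0))] if queue else []

        def on_done(task):
            finished.append(task.result())
            return start_next()

        initial = [task for _ in range(n) for task in start_next()]
        wait_for_all_with_callback(initial, on_done)
    return finished
```

Because a replacement is started as soon as any migration finishes, one slow VM occupies a single slot instead of holding back a whole batch.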

Signed-off-by: Andrii Sultanov <[email protected]>
This API call and corresponding XE implementation calls a host plugin on
the host where a VM is running. It thus takes care of finding the right
host, compared to Host.call_plugin where this would be left to the user.
at least for a while longer...

Mirrors the changes in the 1.249 LCM branch: xapi-project#6473

Signed-off-by: Pau Ruiz Safont <[email protected]>
Currently rrdd needs to know when a metric comes from a newly created domain,
(after a local migration, for example). This is because when a new domain is
created the counters start from zero again. This needs special logic for
aggregating metrics since xcp-rrdd needs to provide continuity of metrics of a
VM with a UUID, even if the domid changes.

Previously rrdd fetched the data about domains before metrics from plugins
were collected, and reused the data for self-reported metrics. While this meant
that for self-reported metrics it was impossible to miss collected information,
for plugin metrics it meant that for created and destroyed domains, the
mapping between domain id and VM UUID was not available.

With the current change the domain ids and VM UUIDs are collected every
iteration of the monitor loop, and kept for one more iteration, so domains
destroyed in the last iteration are remembered and not missed.
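
A minimal Python sketch of keeping the mapping for one extra iteration. `DomainTracker` and its methods are hypothetical names used for illustration; the actual rrdd code is structured differently.

```python
class DomainTracker:
    """Remember the domid -> VM UUID mapping of the current and previous loop."""

    def __init__(self):
        self._previous = {}
        self._current = {}

    def update(self, domains):
        """Record the domains seen in this iteration of the monitor loop."""
        self._previous = self._current
        self._current = dict(domains)

    def uuid_of(self, domid):
        """Look up a domid among domains seen in this or the last iteration."""
        if domid in self._current:
            return self._current[domid]
        return self._previous.get(domid)
```

After a local migration the VM keeps its UUID but gets a new domid; the old domid still resolves for one more iteration and is forgotten after that, so the table does not grow without bound.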

With this done it's now safe to move the host and memory metrics collection
into its own plugin.

Also use sequences more thoroughly in the code for transformations

Signed-off-by: Pau Ruiz Safont <[email protected]>
The only use of it was a parameter that was not used anywhere

Signed-off-by: Pau Ruiz Safont <[email protected]>
at least for a while longer...

Mirrors the changes in the 1.249 LCM branch: xapi-project#6473
…-project#6515)

Currently rrdd needs to know when a metric comes from a new domain (after a
local migration, for example). This is because when a new domain is created the
counters start from zero again, and so this needs special logic to handle when
aggregating the metrics into rrds.

Previously rrdd collected this information before metrics were collected; this
means that metrics collected by plugins could be lost if the domain was created
in that small amount of time, or if the domain was destroyed after a plugin
collected data about it.

With the current change the domains are collected every loop and added to the
domains collected in the previous loop to avoid missing any newly created or
destroyed domains. The current iteration only gets fed data from the last
iteration to avoid accumulating all domains seen since the start of xcp-rrdd.

With this done it's now safe to move the host and memory metrics collection
into its own plugin.

Also use sequences more thoroughly in the code for transformations

I've manually tested this change by repeatedly single-host live-migrating a VM
and checking that no beats are missed on the graphs.
![Screenshot 2025-06-09 at 15 55 54](https://github.com/user-attachments/assets/8d5dea0a-a1aa-4a49-a712-9512e18036cc)
With bab83d9, host evacuation was parallelized by grouping VMs into batches,
and starting a new batch once the previous one has finished. This means that a
single slow VM can potentially slow down the whole evacuation.

Add a new `Tasks.wait_for_all_with_callback` function that will
invoke a callback every time one of the tasks is
deemed non-pending. This will allow its users to:

1) track the progress of tasks within the submitted batch
2) schedule new tasks to replace the completed ones

Use the new `Tasks.wait_for_all_with_callback` in `xapi_host` to
schedule a new migration as soon as any of the previous ones have
finished, thus maintaining a constant flow of `n` migrations.

Additionally expose the `evacuate-batch-size` parameter in the CLI, this
was missed when it was originally added with the CLI setting it to `0`
(pick the default) all the time.

===

Manually tested multiple times, confirmed to not break anything and to
actually maintain a constant flow of migrations. This should greatly
speed up host evacuations when there is a combination of bigger and
smaller VMs (in terms of memory/disk, or VMs with some other reason for
slow migration) on the host
Unfortunately mirage-crypto has accumulated breaking changes:
- Cstructs have been replaced with strings
- The digestif library has replaced ad-hoc hash implementation

A deprecation has happened as well:
- RNG initialization has changed

Because there are breaking changes, xs-opam changes need to be
introduced at the same time:
xapi-project/xs-opam#731

Only xapi is affected by the breaking builds, so no other toolstack
repositories have incoming PRs.
I've tested builds with smoke and validation tests: SR 218740

This means that the merge will be done as such:
- Both PRs are approved
- First xs-opam will be merged (with failing CI)
- Then this PR will be merged with the merge train that runs tests
before actually merging.
- CI is rerun manually on xs-opam to make it green again

After merging both, xenserver's CI should create a successful build with
both PRs included
The /sys/fs/cgroup/systemd/cgroup.procs file is not always present,
particularly in updated Linux systems with newer cgroup and SystemD. So
fall back to the root /sys/fs/cgroup/cgroup.procs.
Also handle and report errors back to OCaml.

Although SystemD discourages handling cgroups without service
configuration changes, the root cgroup is a bit special as it receives
processes from multiple sources, including the kernel.
Also fix the Makefile, so that 'make clean' also deletes the `.o.d` files.

This avoids accidentally adding these files to git
(although normally dune would invoke make in _build,
 only if you manually invoke it would it create these extra files):
```
A ocaml/forkexecd/helper/close_from.o
A ocaml/forkexecd/helper/close_from.o.d
A ocaml/forkexecd/helper/syslog.o
A ocaml/forkexecd/helper/syslog.o.d
A ocaml/forkexecd/helper/vfork_helper
A ocaml/forkexecd/helper/vfork_helper.o
A ocaml/forkexecd/helper/vfork_helper.o.d
```

Signed-off-by: Edwin Török <[email protected]>
Also fix the Makefile, so that 'make clean' also deletes the `.o.d`
files.

This avoids accidentally adding these files to git (although normally
dune would invoke make in _build,
 only if you manually invoke it would it create these extra files):
```
A ocaml/forkexecd/helper/close_from.o
A ocaml/forkexecd/helper/close_from.o.d
A ocaml/forkexecd/helper/syslog.o
A ocaml/forkexecd/helper/syslog.o.d
A ocaml/forkexecd/helper/vfork_helper
A ocaml/forkexecd/helper/vfork_helper.o
A ocaml/forkexecd/helper/vfork_helper.o.d
```
Add Network.Interface.get_interface_positions.
Generated by `dune runtest --auto-promote`

Signed-off-by: Changlei Li <[email protected]>
@changlei-li changlei-li merged commit 988ece6 into xapi-project:feature/host-network-device-ordering Jun 11, 2025
16 checks passed
@changlei-li changlei-li deleted the private/changleli/sync_feature_branch branch June 11, 2025 09:09