Conversation

@BengangY (Contributor) commented Jun 4, 2025

In XSI-1915, MCS shut down a VM and tried to destroy its VBDs right after it received the event triggered by the VM's power_state change, and the call failed. The failure reasons are below:

  1. The update to the VM's power_state and the updates to its VBDs are not performed in a single transaction, so the client may receive the event for the power_state change and operate on the VBDs before the VBD records have been updated.
  2. The VM is running on a supporter host, so each database operation requires an RPC to the coordinator, which adds latency.
  3. Between the update to the VM's power_state and the updates to the VBDs, xapi also updates the pending_guidances field, which requires at least 8 database operations and further delays the VBD updates.

It's not straightforward to wrap these database operations in a transaction. The workaround is to move the pending_guidances update to the end, after the database operations for VBDs, VIFs, GPUs, etc., so that the VBDs are updated immediately after the VM's power_state.
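To make the failure mode concrete, here is a minimal, hypothetical sketch of the client pattern described above, written against the XenAPI Python bindings (where event.from is called as event_from): react to the VM power_state event alone and immediately destroy the VBDs. The host URL, credentials and VM name are placeholders.

```python
import XenAPI

# Placeholder host and credentials, for illustration only.
session = XenAPI.Session("https://xenserver.example")
session.xenapi.login_with_password("root", "password")
try:
    vm = session.xenapi.VM.get_by_name_label("test-vm")[0]
    token = ""
    done = False
    while not done:
        # Wait up to 30 seconds for the next batch of VM events.
        batch = session.xenapi.event_from(["vm"], token, 30.0)
        token = batch["token"]
        for ev in batch["events"]:
            snapshot = ev.get("snapshot", {})
            if ev["ref"] == vm and snapshot.get("power_state") == "Halted":
                # Racy: the VM is Halted, but xapi may not have updated the
                # VBD records yet, so these destroy calls can still fail.
                for vbd in session.xenapi.VM.get_VBDs(vm):
                    session.xenapi.VBD.destroy(vbd)
                done = True
finally:
    session.xenapi.session.logout()
```

The change in this PR narrows the window in which the destroy calls above can fail, but, as discussed below, it cannot remove the race entirely.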

@BengangY BengangY marked this pull request as ready for review June 4, 2025 08:52
@lindig (Contributor) left a comment

I like the explanation

@BengangY BengangY requested review from minglumlu and robhoes June 4, 2025 09:16
@edwintorok (Contributor)

They should probably wait for the task's completion instead of acting on power state changes. There is always going to be a race, even with this change (this just reduces the race window)
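A hedged sketch of that suggestion, reusing the session and vm names from the sketch above: shut the VM down via an asynchronous task and only destroy the VBDs once the task has completed, instead of reacting to the power_state event. The one-second polling interval is arbitrary.

```python
import time

# Assumes `session` and `vm` from the earlier sketch.
task = session.xenapi.Async.VM.clean_shutdown(vm)
while session.xenapi.task.get_status(task) == "pending":
    time.sleep(1)
try:
    if session.xenapi.task.get_status(task) == "success":
        for vbd in session.xenapi.VM.get_VBDs(vm):
            session.xenapi.VBD.destroy(vbd)
finally:
    # The caller is responsible for destroying tasks it created via Async calls.
    session.xenapi.task.destroy(task)
```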

@edwintorok (Contributor)

The correct way would be to watch for events on VBD allowed operations. But apparently we don't expose 'destroy' as an allowed op, we probably should

@stormi (Contributor) commented Jun 4, 2025

What is MCS, in the context of XAPI?

@BengangY (Contributor, Author) commented Jun 4, 2025

> What is MCS, in the context of XAPI?

MCS means Machine Creation Services. It's a component of CVAD (Citrix Virtual Apps and Desktops).

@lindig (Contributor) commented Jun 4, 2025

MCS is a service that is part of a Citrix product used to deploy VMs. XenServer is only one of its supported backends.

@edwintorok (Contributor)

> The correct way would be to watch for events on VBD allowed operations. But apparently we don't expose 'destroy' as an allowed op, we probably should

Discussed that watching VBD.is_currently_attached to become false would work better here. Then there would be no race.
Some documentation here on how the client should behave would be good, otherwise all they can do is guess (and we've seen that they'd often guess wrong)
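A hedged sketch of that approach, again reusing session and vm from the earlier sketches: watch VBD events and only destroy each VBD once its currently_attached field (the datamodel name of the field referred to above) has become false. An empty token makes the first event.from call return the current records of the requested class as well, so VBDs that are already detached should be picked up too.

```python
# Assumes `session` and `vm` from the earlier sketches, and that the VM has
# already been asked to shut down.
pending = set(session.xenapi.VM.get_VBDs(vm))
token = ""
while pending:
    batch = session.xenapi.event_from(["vbd"], token, 30.0)
    token = batch["token"]
    for ev in batch["events"]:
        snapshot = ev.get("snapshot", {})
        if ev["ref"] in pending and not snapshot.get("currently_attached", True):
            # The VBD is no longer attached, so it is safe to destroy it now.
            session.xenapi.VBD.destroy(ev["ref"])
            pending.discard(ev["ref"])
```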

@stormi (Contributor) commented Jun 4, 2025

Thanks. The commit message clearly describes the problem, but for someone external to XenServer it would be even better if the introduction did not directly refer to XSI or to an internal acronym.

Example of how I'd put it:

Beginning of commit message:

An internal tool at XenServer tried to destroy VBD right after it received the event which came from power_state's change and failed. The failure reason is below:

Then at the end of the commit message, additional useful details for XenServer devs:

This is related to XSI-1915 where internal deploy tool MCS triggered the issue.

It feels more inclusive to the rest of the open source community involved in XAPI, in my opinion.

That was my nitpicking of the day, just sharing my impressions from outside XenServer :)

@BengangY (Contributor, Author) commented Jun 4, 2025

> Thanks. The commit message clearly describes the problem, but for someone external to XenServer it would be even better if the introduction did not directly refer to XSI or to an internal acronym.
>
> Example of how I'd put it:
>
> Beginning of commit message:
>
> An internal tool at XenServer tried to destroy VBD right after it received the event which came from power_state's change and failed. The failure reason is below:
>
> Then at the end of the commit message, additional useful details for XenServer devs:
>
> This is related to XSI-1915 where internal deploy tool MCS triggered the issue.
>
> It feels more inclusive to the rest of the open source community involved in XAPI, in my opinion.
>
> That was my nitpicking of the day, just sharing my impressions from outside XenServer :)

@stormi Thanks for your suggestion! I will update the description.

Fix race condition when destroying VBD after VM power_state change.

A client of XenServer attempted to destroy a VBD immediately after
receiving an event triggered by a VM power_state change, resulting in
a failure. The root causes are below:
1. The update to VM's power_state and the update to VBDs are not
   performed atomically, so the client may receive the event from the
   update to VM's power_state and attempt to operate VBDs before their
   state is updated.
2. If the VM is running on a supporter, database operations require
   sending RPCs to the coordinator, introducing additional latency.
3. Between the updates to the VM's power_state and the VBDs, xapi also
   updates the pending_guidances field, which requires at least eight
   database operations and further delays the VBD update.

It's not straightforward to add transactions for these DB operations.
The workaround is to move the update to pending_guidances to the end
of the relevant database operations (VBDs, VIFs, GPUs, etc.), ensuring
that VBDs are updated immediately after the VM's power_state change.

This is related to XSI-1915 where Citrix deploy tool MCS triggered
the issue.

Signed-off-by: Bengang Yuan <[email protected]>
@BengangY BengangY force-pushed the private/bengangy/CA-411766 branch from 83f3966 to 19e1704 on June 5, 2025 01:54
@BengangY BengangY added this pull request to the merge queue Jun 5, 2025
Merged via the queue into xapi-project:master with commit 6c9167f Jun 5, 2025
16 checks passed