Conversation

@BengangY (Contributor) commented Jun 4, 2025

In XSI-1915, MCS shut down a VM and tried to destroy its VBDs right after it received the event triggered by the VM's power_state change, and the call failed. The failure reasons are below:

  1. The update to the VM's power_state and the updates to its VBDs are not performed in a single transaction, so the client may receive the event for the power_state change and operate on the VBDs before the VBD records have been updated.
  2. The VM is running on a supporter host, so each database operation requires an RPC to the coordinator, which adds latency.
  3. Between the update to the VM's power_state and the updates to the VBDs, xapi also updates the pending_guidances field, which requires at least 8 database operations and further delays the VBD updates.

It's not straightforward to wrap these database operations in a transaction. The workaround is to move the pending_guidances update to the end, after the database operations for VBDs, VIFs, GPUs, etc., so that the VBDs are updated immediately after the VM's power_state.
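To make the failure mode concrete, here is a minimal, hypothetical sketch of the client pattern described above, written against the XenAPI Python bindings (where event.from is called as event_from): react to the VM power_state event alone and immediately destroy the VBDs. The host URL, credentials and VM name are placeholders.

```python
import XenAPI

# Placeholder host and credentials, for illustration only.
session = XenAPI.Session("https://xenserver.example")
session.xenapi.login_with_password("root", "password")
try:
    vm = session.xenapi.VM.get_by_name_label("test-vm")[0]
    token = ""
    done = False
    while not done:
        # Wait up to 30 seconds for the next batch of VM events.
        batch = session.xenapi.event_from(["vm"], token, 30.0)
        token = batch["token"]
        for ev in batch["events"]:
            snapshot = ev.get("snapshot", {})
            if ev["ref"] == vm and snapshot.get("power_state") == "Halted":
                # Racy: the VM is Halted, but xapi may not have updated the
                # VBD records yet, so these destroy calls can still fail.
                for vbd in session.xenapi.VM.get_VBDs(vm):
                    session.xenapi.VBD.destroy(vbd)
                done = True
finally:
    session.xenapi.session.logout()
```

The change in this PR narrows the window in which the destroy calls above can fail, but, as discussed below, it cannot remove the race entirely.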

@BengangY BengangY marked this pull request as ready for review June 4, 2025 08:52
@lindig (Contributor) left a comment

I like the explanation

@BengangY BengangY requested review from minglumlu and robhoes June 4, 2025 09:16
@edwintorok (Contributor)

They should probably wait for the task's completion instead of acting on power state changes. There is always going to be a race, even with this change (this just reduces the race window)
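A hedged sketch of that suggestion, reusing the session and vm names from the sketch above: shut the VM down via an asynchronous task and only destroy the VBDs once the task has completed, instead of reacting to the power_state event. The one-second polling interval is arbitrary.

```python
import time

# Assumes `session` and `vm` from the earlier sketch.
task = session.xenapi.Async.VM.clean_shutdown(vm)
while session.xenapi.task.get_status(task) == "pending":
    time.sleep(1)
try:
    if session.xenapi.task.get_status(task) == "success":
        for vbd in session.xenapi.VM.get_VBDs(vm):
            session.xenapi.VBD.destroy(vbd)
finally:
    # The caller is responsible for destroying tasks it created via Async calls.
    session.xenapi.task.destroy(task)
```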

@edwintorok (Contributor)

The correct way would be to watch for events on VBD allowed operations. But apparently we don't expose 'destroy' as an allowed op, we probably should

@stormi (Contributor) commented Jun 4, 2025

What is MCS, in the context of XAPI?

@BengangY (Contributor, Author) commented Jun 4, 2025

> What is MCS, in the context of XAPI?

MCS means Machine Creation Services. It's a component of CVAD (Citrix Virtual Apps and Desktops).

@lindig (Contributor) commented Jun 4, 2025

MCS is a service that is part of a Citrix product used to deploy VMs. XenServer is only one of its supported backends.

@edwintorok (Contributor)

> The correct way would be to watch for events on VBD allowed operations. But apparently we don't expose 'destroy' as an allowed op, we probably should

Discussed that watching VBD.is_currently_attached to become false would work better here. Then there would be no race.
Some documentation here on how the client should behave would be good, otherwise all they can do is guess (and we've seen that they'd often guess wrong)
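A hedged sketch of that approach, again reusing session and vm from the earlier sketches: watch VBD events and only destroy each VBD once its currently_attached field (the datamodel name of the field referred to above) has become false. An empty token makes the first event.from call return the current records of the requested class as well, so VBDs that are already detached should be picked up too.

```python
# Assumes `session` and `vm` from the earlier sketches, and that the VM has
# already been asked to shut down.
pending = set(session.xenapi.VM.get_VBDs(vm))
token = ""
while pending:
    batch = session.xenapi.event_from(["vbd"], token, 30.0)
    token = batch["token"]
    for ev in batch["events"]:
        snapshot = ev.get("snapshot", {})
        if ev["ref"] in pending and not snapshot.get("currently_attached", True):
            # The VBD is no longer attached, so it is safe to destroy it now.
            session.xenapi.VBD.destroy(ev["ref"])
            pending.discard(ev["ref"])
```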

@stormi (Contributor) commented Jun 4, 2025

Thanks. The commit message clearly describes the problem, but for someone external to XenServer it would be even better if the introduction did not directly refer to XSI or to an internal acronym.

Example of how I'd put it:

Beginning of commit message:

An internal tool at XenServer tried to destroy VBD right after it received the event which came from power_state's change and failed. The failure reason is below:

Then at the end of the commit message, additional useful details for XenServer devs:

This is related to XSI-1915 where internal deploy tool MCS triggered the issue.

It feels more inclusive to the rest of the open source community involved in XAPI, in my opinion.

That was my nitpicking of the day, just sharing my impressions from outside XenServer :)

@BengangY (Contributor, Author) commented Jun 4, 2025

> Thanks. The commit message clearly describes the problem, but for someone external to XenServer it would be even better if the introduction did not directly refer to XSI or to an internal acronym.
>
> Example of how I'd put it:
>
> Beginning of commit message:
>
> An internal tool at XenServer tried to destroy VBD right after it received the event which came from power_state's change and failed. The failure reason is below:
>
> Then at the end of the commit message, additional useful details for XenServer devs:
>
> This is related to XSI-1915 where internal deploy tool MCS triggered the issue.
>
> It feels more inclusive to the rest of the open source community involved in XAPI, in my opinion.
>
> That was my nitpicking of the day, just sharing my impressions from outside XenServer :)

@stormi Thanks for your suggestion! I will update the description.

Fix race condition when destroying VBD after VM power_state change.

A client of XenServer attempted to destroy a VBD immediately after
receiving an event triggered by a VM power_state change, resulting in
a failure. The root causes are below:
1. The update to VM's power_state and the update to VBDs are not
   performed atomically, so the client may receive the event from the
   update to VM's power_state and attempt to operate VBDs before their
   state is updated.
2. If the VM is running on a supporter, database operations require
   sending RPCs to the coordinator, introducing additional latency.
3. Between the updates to the VM's power_state and the VBDs, xapi also
   updates the pending_guidances field, which requires at least eight
   database operations and further delays the VBD update.

It's not straightforward to add transactions for these DB operations.
The workaround is to move the update to pending_guidances to the end
of the relevant database operations (VBDs, VIFs, GPUs, etc.), ensuring
that VBDs are updated immediately after the VM's power_state change.

This is related to XSI-1915 where Citrix deploy tool MCS triggered
the issue.

Signed-off-by: Bengang Yuan <[email protected]>
@BengangY BengangY force-pushed the private/bengangy/CA-411766 branch from 83f3966 to 19e1704 on June 5, 2025 01:54
@BengangY BengangY added this pull request to the merge queue Jun 5, 2025
Merged via the queue into xapi-project:master with commit 6c9167f Jun 5, 2025
16 checks passed