@@ -9,6 +9,12 @@ Title: Storage migration
99 - [ Thought experiments on an alternative design] ( #thought-experiments-on-an-alternative-design )
1010 - [ Design] ( #design )
1111- [ SMAPIv1 migration] ( #smapiv1-migration )
12+ - [ Preparation] ( #preparation )
13+ - [ Establish mirror] ( #establish-mirror )
14+ - [ Mirror] ( #mirror )
15+ - [ Snapshot] ( #snapshot )
16+ - [ Copy and compose] ( #copy-and-compose )
17+ - [ Finish] ( #finish )
1218- [ SMAPIv3 migration] ( #smapiv3-migration )
1319- [ Error Handling] ( #error-handling )
1420 - [ Preparation (SMAPIv1 and SMAPIv3)] ( #preparation-smapiv1-and-smapiv3 )
@@ -122,10 +128,44 @@ it will be handled just as before.
122128
123129## SMAPIv1 migration
124130
131+ This section is about migration from SMAPIv1 SRs to SMAPIv1 or SMAPIv3 SRs. Since
132+ the migration is driven by the source host, it is the source host that
133+ determines most of the logic during a storage migration.
134+
135+ First, let us look at an overview diagram of what happens during SMAPIv1 SXM.
136+ The diagram is labelled with S1, S2, ..., indicating the different stages of the migration.
137+ We will discuss each stage in more detail below.
138+
139+ ![ overview-v1] ( sxm-overview-v1.svg )
140+
141+ ### Preparation
142+
143+ Before the mirror can be established, a number of preparation steps are
144+ needed. For SMAPIv1 this involves:
145+
146+ 1 . Create a new VDI (called the leaf) that will be used as the receiving VDI for all new writes
147+ 2 . Create a dummy snapshot of the VDI above to make sure it is a differencing disk and can be composed later on
148+ 3 . Create a VDI (called the parent) that will receive the existing content of the disk (the snapshot)
149+
150+ Note that the leaf VDI needs to be attached and activated on the destination host (to a non-existing ` mirror_vm ` )
151+ since it will later accept writes mirroring what is written on the source host.
152+
153+ The parent VDI may be created in two different ways: 1. if there is a "similar VDI",
154+ clone it on the destination host and use it as the parent VDI; 2. if there is no
155+ such VDI, create a new blank VDI. The similarity here is defined by the distance
156+ between different VDIs in the VHD tree, which exploits the internal representation
157+ of the storage layer, hence we will not go into too much detail about this here.
158+
159+ Once these preparations are done, a ` mirror_receive_result ` data structure is
160+ passed back to the source host, containing all the necessary information about
161+ these new VDIs.
162+
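To make the shape of this hand-back concrete, here is a minimal Python sketch of such a result record. The field names and the helper are illustrative assumptions, not the actual xapi `mirror_receive_result` type.

```python
from dataclasses import dataclass

# Illustrative sketch only: these field names are assumptions, not the
# real xapi record definition.
@dataclass
class MirrorReceiveResult:
    mirror_vdi: str        # the new leaf VDI that will receive mirrored writes
    mirror_datapath: str   # datapath on which the leaf is attached/activated
    dummy_snapshot: str    # snapshot that makes the leaf a differencing disk
    parent_vdi: str        # VDI that will receive the existing disk content
    parent_is_clone: bool  # True if the parent was cloned from a "similar VDI"

def needs_full_copy(result: MirrorReceiveResult) -> bool:
    """If the parent is a fresh blank VDI, the whole snapshot must be copied;
    if it was cloned from a similar VDI, only the differences are needed."""
    return not result.parent_is_clone
```

The `parent_is_clone` flag captures the two preparation paths described above: cloning a similar VDI lets the copy step transfer only the differing blocks.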
163+ ### Establish mirror
164+
125165At a high level, mirror establishment for SMAPIv1 works as follows:
126166
1271671 . Take a snapshot of a VDI that is attached to VM1. This gives us an immutable
128- copy of the current state of the VDI, with all the data until the point we took
168+ copy of the current state of the VDI, with all the data up until the point we took
129169the snapshot. This is illustrated in the diagram as a VDI and its snapshot connecting
130170to a shared parent, which stores the shared content for the snapshot and the writable
131171VDI from which we took the snapshot (snapshot)
@@ -135,8 +175,79 @@ client VDI will also be written to the mirrored VDI on the remote host (mirror)
1351754 . Compose the mirror and the snapshot to form a single VDI
1361765 . Destroy the snapshot on the local host (cleanup)
137177
178+ #### Mirror
179+
180+ The mirroring process for SMAPIv1 is rather unconventional, so it is worth
181+ documenting how it works. Instead of a conventional client-server architecture,
182+ where the source client connects to the destination server directly through the
183+ NBD protocol in tapdisk, the connection is established in xapi first and then
184+ handed over to tapdisk.
185+
186+ The diagram below illustrates this process. First, xapi on the source host
187+ initiates an http request to the remote xapi. This request contains the necessary
188+ information about the VDI to be mirrored, the SR that contains it, etc. This
189+ information is then passed onto the http handler on the destination host (called
190+ ` nbd_handler ` ), which processes it. Now the unusual step is that
191+ both the source and the destination xapi will pass this connection onto tapdisk,
192+ by sending the fd representing the socket connection to the tapdisk process. On
193+ the source this is the nbd client process of tapdisk, and on the destination
194+ it is the nbd server process of tapdisk. After this step, we can consider
195+ a client-server connection established between the two tapdisks, as if
196+ the tapdisk on the source host had made a request to the tapdisk on the
197+ destination host and initiated the connection. On the diagram, this is indicated
198+ by the dashed lines between the tapdisk processes. Logically, we can view this as
199+ xapi creating the connection and then passing it down into tapdisk.
200+
201+ ![ mirror] ( sxm-mirror-v1.svg )
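The fd-passing mechanism itself can be sketched in a few lines of Python using `SCM_RIGHTS` ancillary data (`socket.send_fds`/`socket.recv_fds`, Python 3.9+, Unix only). This is a toy illustration of the mechanism, not the actual xapi/tapdisk code.

```python
import socket

# Toy sketch of fd passing: a "xapi" process owns an established connection
# and hands its file descriptor to a "tapdisk" process over a Unix socket.

def send_connection(unix_sock: socket.socket, conn_fd: int) -> None:
    # SCM_RIGHTS ancillary data carries the fd into the receiving process
    socket.send_fds(unix_sock, [b"nbd-conn"], [conn_fd])

def receive_connection(unix_sock: socket.socket) -> socket.socket:
    # The receiver (tapdisk in the real system) gets its own copy of the fd
    _msg, fds, _flags, _addr = socket.recv_fds(unix_sock, 1024, 1)
    return socket.socket(fileno=fds[0])
```

After the hand-off, the receiver can read and write on a connection it never opened, which is exactly the property the SXM mirror setup relies on.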
202+
203+ #### Snapshot
204+
205+ The next step is to create a snapshot of the VDI. This is easily done as a
206+ ` VDI.snapshot ` operation. If the VDI is in VHD format, then internally this
207+ creates two children: one for the snapshot, which contains only metadata
208+ and tends to be small, and one for the writable VDI to which all the
209+ new writes will go. The shared base copy contains the shared blocks.
210+
211+ ![ snapshot] ( sxm-snapshot-v1.svg )
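A toy model of this tree transformation (purely illustrative; real VHD chains are managed by the storage layer, and the naming here is invented):

```python
# tree maps each node to its parent (None for a root). Taking a snapshot of
# `vdi` demotes its content to a read-only shared base and creates two
# children: the writable leaf (keeping the original name) and a small,
# metadata-only snapshot node.
def snapshot(tree: dict, vdi: str) -> dict:
    base = vdi + ".base"
    tree = dict(tree)
    tree[base] = tree.pop(vdi)   # original content becomes the shared base
    tree[vdi] = base             # writable leaf: all new writes go here
    tree[vdi + ".snap"] = base   # immutable snapshot child
    return tree
```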
212+
213+ #### Copy and compose
214+
215+ Once the snapshot is created, we can copy it from the source
216+ to the destination. This step is done by ` sparse_dd ` using the nbd protocol, and
217+ it is also the step that takes the most time to complete.
218+
219+ ` sparse_dd ` is a process forked by xapi that does the copying of the disk blocks.
220+ ` sparse_dd ` supports a number of protocols, including nbd. In this case, ` sparse_dd `
221+ will initiate an http put request to the destination host, with a url of the form
222+ ` <address>/services/SM/nbdproxy/<sr>/<vdi> ` . This http request then
223+ gets handled by the http handler on the destination host, which will then spawn
224+ a handler thread. This handler will find the
225+ "generic" nbd server[ ^ 2 ] of either tapdisk or qemu-dp, depending on the destination
226+ SR type, and then start proxying data between the http connection socket and the
227+ socket connected to the nbd server.
228+
229+ [ ^ 2 ] : The server is generic because it does not accept fd passing; I call the
230+ ones that do "special" nbd servers/fd receivers.
231+
232+ ![ sxm new copy] ( sxm-new-copy-v1.svg )
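As a small illustration, the proxy URL from the text can be assembled like this. Only the `/services/SM/nbdproxy/<sr>/<vdi>` path shape comes from the description above; the `https` scheme and percent-encoding are assumptions.

```python
from urllib.parse import quote

# Hypothetical helper: builds the URL that sparse_dd would target on the
# destination host. Path shape from the text; scheme is an assumption.
def nbdproxy_url(address: str, sr: str, vdi: str) -> str:
    return (
        f"https://{address}/services/SM/nbdproxy/"
        f"{quote(sr, safe='')}/{quote(vdi, safe='')}"
    )
```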
233+
234+ Once copying is done, the snapshot and the mirrored VDI can then be composed into a
235+ single VDI.
236+
237+ #### Finish
238+
239+ At this point the VDI is synchronised to the new host! The mirror keeps working,
240+ though, because it will not be destroyed until the VM itself has been migrated
241+ as well. Some cleanup is done at this stage, such as deleting the snapshot
242+ taken on the source, destroying the mirror datapath, etc.
243+
244+ The end result looks like the following. Note that VM2 is drawn with dashed lines
245+ as it has not been created yet. The next step would be to migrate VM1 itself to the
246+ destination as well, but this is part of the VM migration process and will not
247+ be covered here.
248+
249+ ![ final] ( sxm-final-v1.svg )
138250
139- more detail to come...
140251
141252## SMAPIv3 migration
142253
@@ -168,10 +279,10 @@ helps separate the error handling logic into the `with` part of a `try with` blo
168279which is where they are supposed to be. Since we need to accommodate the existing
169280SMAPIv1 migration (which has more stages than SMAPIv3), the following stages are
170281introduced: preparation (v1,v3), snapshot(v1), mirror(v1, v3), copy(v1). Note that
171- each stage also roughly corresponds to a helper function that is called within ` MIRROR .start` ,
282+ each stage also roughly corresponds to a helper function that is called within ` Storage_migrate .start` ,
172283 which is the wrapper function that initiates storage migration. Each helper
173284 function also has its own error handling logic as
174- needed (e.g. see `Storage_smapiv1_migrate.receive_start) to deal with exceptions
285+ needed (e.g. see ` Storage_smapiv1_migrate.receive_start ` ) to deal with exceptions
175286 that happen within each helper function.
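The idea of running stages and cleaning up only what has actually been created can be sketched as follows. This is an illustrative pattern, not the xapi implementation: stage names and the rollback strategy are assumptions.

```python
# Illustrative sketch of staged error handling: each stage records its undo
# action when it completes; on failure, only the completed stages are rolled
# back, in reverse order, and the original error is re-raised.
def run_migration(stages):
    """stages: list of (name, do, undo) tuples of callables."""
    done = []
    try:
        for name, do, undo in stages:
            do()
            done.append((name, undo))
    except Exception:
        for _name, undo in reversed(done):
            undo()  # best-effort cleanup of this stage's resources
        raise
    return [name for name, _ in done]
```

This mirrors the principle described above: the cleanup logic lives in one place (the `with`/handler part), and partially-completed migrations do not leak resources.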
176287
177288### Preparation (SMAPIv1 and SMAPIv3)
@@ -215,6 +326,14 @@ failure during copying.
215326
216327## SMAPIv1 Migration implementation detail
217328
329+ {{% notice info %}}
330+ The following doc refers to a [ version] ( https://github.com/xapi-project/xen-api/blob/v24.37.0/ocaml/xapi/storage_migrate.ml )
331+ of xapi before 24.37, after which point this code structure has undergone
332+ many changes as part of adding support for SMAPIv3 SXM. Therefore the following
333+ tutorial might be less relevant in terms of implementation detail, although
334+ the general principles should remain the same.
335+ {{% /notice %}}
336+
218337``` mermaid
219338sequenceDiagram
220339participant local_tapdisk as local tapdisk