.. _migration:

===================
Migration framework
===================

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations. Saving the state just does
that, saves the state for each device that the guest is running.
Restoring a guest is just the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments
both times. I.e. it can only restore the state into a guest that has
the same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration has
to be exactly the same).

Once we are able to save/restore a guest, a new functionality is
requested: migration. This means that QEMU is able to start on one
machine and be "migrated" to another machine, i.e. moved to
another machine.

Next came the "live migration" functionality. This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all state from one machine to another. Live
migration allows the guest to continue running while the state is
transferred. The guest has to be stopped only while the last part of
the state is transferred. Typically the time that the guest is
unresponsive during live migration is in the low hundreds of milliseconds
(notice that this depends on a lot of things).

.. contents::

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU. QEMU doesn't care how this file descriptor is opened.
- file migration: do the migration using a file that is passed to QEMU
  by path. A file offset option is supported to allow a management
  application to add its own metadata to the start of the file without
  QEMU interference. Note that QEMU does not flush cached file
  data/metadata at the end of migration.

  The file migration also supports using a file that has already been
  opened. A set of file descriptors is passed to QEMU via an "fdset"
  (see add-fd QMP command documentation). This method allows a
  management application to have control over the migration file
  opening operation. There are, however, strict requirements on this
  interface if the multifd capability is enabled:

    - the fdset must contain two file descriptors that are not
      duplicates of each other;
    - if the direct-io capability is to be used, exactly one of the
      file descriptors must have the O_DIRECT flag set;
    - the file must be opened with O_WRONLY on the migration source side
      and O_RDONLY on the migration destination side.

- rdma migration: support is included for migration using RDMA, which
  transports the page data using ``RDMA``, where the hardware takes
  care of transporting the pages, and the load on the CPU is much
  lower. While the internals of RDMA migration are a bit different,
  this isn't really visible outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore the state of devices. This infrastructure is shared with the
savevm/loadvm functionality.

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see ``migration/qemu-file.h``). In most cases this
is connected to a subtype of ``QIOChannel`` (see ``io/``).


Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices. The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables). See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works. That way if you change the implementation
  later the migration stream will stay compatible. That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device. These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since the
  normal response from users is that *migration broke their VM* since it had
  apparently been running fine until then. In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.

- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in the cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code). Check that offsets
  into buffers and the like can't cause overruns. Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent. For example, the order of PCI capabilities
  is required to stay constant across migration. Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration. Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those. For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination. This can result in the
  device state being loaded into the wrong device.

VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from hw/input/pckbd.c)

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd". The ``version_id`` is
3, and there are 4 uint8_t fields in the KBDState structure. We
register this ``VMStateDescription`` with one of the following
functions. The first one will generate a device ``instance_id``
that is different for each registration. Use the second one if you already
have an id that is different for each instance of the device:

.. code:: c

  vmstate_register_any(NULL, &vmstate_kbd, s);
  vmstate_register(NULL, instance_id, &vmstate_kbd, s);

For devices that are ``qdev`` based, we can register the device in the class
init function:

.. code:: c

  dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``). Various other macros exist for special
cases.

Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
ends up with a 4 byte big-endian representation on the wire; in the future
it might be possible to use a more structured format.

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE;
although converting existing code can be tricky, and thus 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the ``save_state``
and ``load_state`` functions. Notice that ``load_state`` receives a version_id
parameter to know what state format it is receiving. ``save_state`` doesn't
have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields. Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information. Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques. In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate. This is best solved using a subsection.

A subsection is "like" a device vmstate, but with a particularity: it
has a Boolean function that tells whether its values need to be sent
or not. If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked for on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration. If we understand all
the subsections, then we load the state with success. There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases. This is OK for data that's
critical, but in some use cases it's preferred that the migration
should succeed even with the data missing. To support this the
subsection can be connected to a device property and from there
to a versioned machine type.

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses. A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (const VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (const VMStateField[]) {
          .... several fields ....
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription * const []) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state. We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks). If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.

Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for
     subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

     ``VMSTATE_UINT32(foo, barstruct)``

     becomes

     ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

     Sometime in the future when we no longer care about the ancient versions these can be killed off.
     Note that for backward compatibility it's important to fill in the structure with
     data that the destination will understand.

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.
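
The fragility of conditional fields can be seen in a small standalone
model. This is not QEMU code: ``BarState``, ``FieldTest``, ``bar_save()``,
``bar_load()`` and the two predicates are hypothetical names, the predicate
signature is simplified, and a plain byte buffer stands in for the migration
stream. The sketch only assumes that fields are written big endian and that a
field is skipped entirely when its predicate returns false.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* Hypothetical device state: "foo" is the field being retired. */
  typedef struct BarState {
      uint32_t foo;
      uint32_t baz;
  } BarState;

  /* Hypothetical, simplified predicate type (an assumption of this sketch). */
  typedef bool (*FieldTest)(void);

  bool foo_is_sent(void) { return true; }
  bool foo_is_dropped(void) { return false; }

  /* Append a 32-bit value in big-endian order; returns the new offset. */
  static size_t put_be32(uint8_t *buf, size_t pos, uint32_t v)
  {
      buf[pos] = (uint8_t)(v >> 24);
      buf[pos + 1] = (uint8_t)(v >> 16);
      buf[pos + 2] = (uint8_t)(v >> 8);
      buf[pos + 3] = (uint8_t)v;
      return pos + 4;
  }

  /* Read a big-endian 32-bit value and advance the offset. */
  static uint32_t get_be32(const uint8_t *buf, size_t *pos)
  {
      uint32_t v = ((uint32_t)buf[*pos] << 24) |
                   ((uint32_t)buf[*pos + 1] << 16) |
                   ((uint32_t)buf[*pos + 2] << 8) |
                   (uint32_t)buf[*pos + 3];
      *pos += 4;
      return v;
  }

  /* Write the fields selected by the predicate; returns the bytes used.
   * buf must have room for 8 bytes. */
  size_t bar_save(const BarState *s, uint8_t *buf, FieldTest foo_exists)
  {
      size_t pos = 0;

      assert(s != NULL && buf != NULL);
      if (foo_exists()) {
          pos = put_be32(buf, pos, s->foo);
      }
      return put_be32(buf, pos, s->baz);
  }

  /* Read the fields selected by the predicate; -1 if the stream is short. */
  int bar_load(BarState *s, const uint8_t *buf, size_t len, FieldTest foo_exists)
  {
      size_t pos = 0;
      bool want_foo = foo_exists();

      if (len < (want_foo ? 8u : 4u)) {
          return -1;
      }
      if (want_foo) {
          s->foo = get_be32(buf, &pos);
      }
      s->baz = get_be32(buf, &pos);
      return 0;
  }

Saving with ``foo_is_sent`` and loading with ``foo_is_dropped`` succeeds in
this model, but the bytes of ``foo`` end up in ``baz``. The opposite mismatch
is at least detectable here, because the stream is shorter than the loader
expects.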

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved. ``save_state`` always saves
the state in the newest version, but ``load_state`` is sometimes able to
load state from an older version.

You can see that there are two version fields:

- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
  for that device.

VMState is able to read versions from minimum_version_id to version_id.

There are *_V* forms of many ``VMSTATE_`` macros for version dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),

only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.

Massaging functions
-------------------

Sometimes it is not enough to be able to save the state directly
from one structure; we need to fill in the correct values there first. One
example is when we are using kvm. Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using. And the
opposite when we are loading the state: we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.

The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.

If you use memory or portio_list API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a ``post_load`` callback.
Examples of such API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()
  - portio_list_set_address()
  - portio_list_set_enabled()

Since the order of device save/restore is not defined, you must
avoid accessing or changing any other device's state in one of these
callbacks. (For instance, don't do anything that calls ``update_irq()``
in a ``post_load`` hook.) Otherwise, restore will not be deterministic,
and this will break execution record/replay.
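
As a rough illustration of the massaging idea, the following standalone sketch
models a timer whose runtime state is an absolute deadline on the local clock,
which is meaningless on another host. Everything in it is an assumption made
for the example: ``TimerState``, ``TIMER_MAX_TICKS``, ``timer_pre_save()`` and
``timer_post_load()`` are hypothetical, and the current time is passed in
explicitly, whereas the real hooks only receive ``opaque`` (and ``version_id``
for ``post_load``).

.. code:: c

  #include <assert.h>
  #include <stddef.h>
  #include <stdint.h>

  #define TIMER_MAX_TICKS INT64_C(1000000)   /* hypothetical device limit */

  typedef struct TimerState {
      int64_t deadline;    /* runtime only: absolute value of the local clock */
      int64_t remaining;   /* migrated: ticks left until the timer fires */
  } TimerState;

  /* pre_save analogue: derive the migrated field from the runtime state. */
  int timer_pre_save(TimerState *t, int64_t now)
  {
      assert(t != NULL);
      t->remaining = t->deadline > now ? t->deadline - now : 0;
      return 0;
  }

  /* post_load analogue: validate the loaded value, then rebuild runtime state. */
  int timer_post_load(TimerState *t, int64_t now)
  {
      assert(t != NULL);
      if (t->remaining < 0 || t->remaining > TIMER_MAX_TICKS) {
          return -1;   /* reject a corrupt or hostile stream */
      }
      t->deadline = now + t->remaining;
      return 0;
  }

Only ``remaining`` would be described to the migration code in such a design;
``deadline`` is recomputed on the destination, and the range check follows the
advice to treat the incoming stream as hostile.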

Iterative device migration
--------------------------

Some devices, such as RAM or certain platform devices,
have large amounts of data that would mean that the CPUs would be
paused for too long if they were sent in one section. For these
devices an *iterative* approach is taken.

The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use
``qemu_put_*``/``qemu_get_*`` macros to read/write data to the stream. Specialist
versions exist for high bandwidth IO.


An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device. In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``state_pending_exact`` function that indicates how much more
    data we must save. The core migration code will use this to
    determine when to pause the CPUs and complete the migration.

  - A ``state_pending_estimate`` function that indicates how much more
    data we must save. When the estimated amount is smaller than the
    threshold, we call ``state_pending_exact``.

  - A ``save_live_iterate`` function that sends a chunk of data until
    the point that stream bandwidth limits tell it to stop. Each call
    generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.
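
How these callbacks interact can be sketched with a toy model of the source
side. This is not the QEMU implementation: ``RamModel`` and every ``model_*``
function are hypothetical, pages are plain flags, nothing is written to a real
stream, and the "guest" always re-dirties the first few pages after each
iteration. The sketch assumes the driver consults the cheap estimate first and
only asks for the exact amount once the estimate is at or below the threshold.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  #define NPAGES 64

  typedef struct RamModel {
      bool dirty[NPAGES];  /* true while a page still has to be sent */
      size_t estimate;     /* cheap upper bound on the dirty page count */
      size_t sent;         /* pages written to the imaginary stream */
  } RamModel;

  /* save_setup analogue: everything has to be sent at least once. */
  void model_save_setup(RamModel *r)
  {
      for (size_t i = 0; i < NPAGES; i++) {
          r->dirty[i] = true;
      }
      r->estimate = NPAGES;
      r->sent = 0;
  }

  /* state_pending_estimate analogue: no scan, may overestimate. */
  size_t model_pending_estimate(const RamModel *r)
  {
      return r->estimate;
  }

  /* state_pending_exact analogue: scan the flags and resync the estimate. */
  size_t model_pending_exact(RamModel *r)
  {
      size_t n = 0;

      for (size_t i = 0; i < NPAGES; i++) {
          n += r->dirty[i] ? 1 : 0;
      }
      r->estimate = n;
      return n;
  }

  /* save_live_iterate analogue: send at most budget pages, return the count. */
  size_t model_iterate(RamModel *r, size_t budget)
  {
      size_t done = 0;

      for (size_t i = 0; i < NPAGES && done < budget; i++) {
          if (r->dirty[i]) {
              r->dirty[i] = false;
              done++;
          }
      }
      r->sent += done;
      r->estimate -= done;
      return done;
  }

  /* The guest keeps running during precopy and writes to the first n pages. */
  void model_guest_writes(RamModel *r, size_t n)
  {
      if (n > NPAGES) {
          n = NPAGES;
      }
      for (size_t i = 0; i < n; i++) {
          r->dirty[i] = true;
      }
      r->estimate += n;
  }

  /* save_live_complete_precopy analogue: guest paused, send what is left. */
  size_t model_complete(RamModel *r)
  {
      return model_iterate(r, NPAGES);
  }

  /* Returns pages sent while paused, or (size_t)-1 if it did not converge. */
  size_t model_migrate(RamModel *r, size_t threshold, size_t budget,
                       size_t guest_rate, size_t max_rounds)
  {
      assert(r != NULL && budget > 0);
      model_save_setup(r);
      for (size_t round = 0; round < max_rounds; round++) {
          if (model_pending_estimate(r) <= threshold &&
              model_pending_exact(r) <= threshold) {
              return model_complete(r);
          }
          model_iterate(r, budget);
          model_guest_writes(r, guest_rate);
      }
      return (size_t)-1;
  }

The value returned by ``model_migrate()`` is a stand-in for downtime: it is the
amount of data that still had to be sent after the guest was paused. When the
guest dirties pages faster than ``budget`` allows them to be sent, the model
never gets below the threshold and gives up after ``max_rounds``.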

Note that the contents of the sections for iterative migration tend
to be open-coded by the devices; care should be taken in parsing
the results and structuring the stream to make them easy to validate.

Device ordering
---------------

There are cases in which the ordering of device loading matters; for
example in some systems where a device may assert an interrupt during loading,
if the interrupt controller is loaded later then it might lose the state.

Some ordering is implicitly provided by the order in which the machine
definition creates devices; however, this is somewhat fragile.

The ``MigrationPriority`` enum provides a means of explicitly enforcing
ordering. Numerically higher priorities are loaded earlier.
The priority is set by setting the ``priority`` field of the top level
``VMStateDescription`` for the device.

Stream structure
================

The stream tries to be word and endian agnostic, allowing migration between hosts
of different characteristics running the same VM.

  - Header

    - Magic
    - Version
    - VM configuration section

      - Machine type
      - Target page bits

  - List of sections
    Each section contains a device, or one iteration of a device save.

    - section type
    - section id
    - ID string (First section of each device)
    - instance id (First section of each device)
    - version id (First section of each device)
    - <device data>
    - Footer mark

  - EOF mark
  - VM Description structure
    Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above. Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that's up to the devices themselves.
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well. Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
for otherwise identically named devices.

Return path
-----------

Only a unidirectional stream is required for normal migration; however, a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

    - Forward path: written by migration thread
    - Return path: opened by main thread, read by return-path thread

  Destination side

    - Forward path: read by main thread
    - Return path: opened by main thread, written by main thread AND postcopy
      thread (protected by rp_mutex)