=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times, i.e. it can only restore the state into a guest that has
the same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration has
to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality
follows: migration.  This means that QEMU is able to start on one
machine and be "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of that state from one machine to another.
Live migration allows the guest to continue running while the state is
transferred.  The guest only has to be stopped while the last part of
the state is transferred.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of milliseconds
(note that this depends on a lot of things).

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower.  While the
internals of RDMA migration are a bit different, this isn't really visible
outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

Debugging
=========

The migration stream can be analyzed with ``scripts/analyze-migration.py``.

Example usage:

.. code-block:: shell

  $ qemu-system-x86_64 -display none -monitor stdio
  (qemu) migrate "exec:cat > mig"
  (qemu) q
  $ ./scripts/analyze-migration.py -f mig
  {
    "ram (3)": {
        "section sizes": {
            "pc.ram": "0x0000000008000000",
  ...

See also ``analyze-migration.py -h`` help for more options.

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see ``migration/qemu-file.h``).  In most cases this
is connected to a subtype of ``QIOChannel`` (see ``io/``).


Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices.  The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables).  See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works.  That way if you change the implementation
  later the migration stream will stay compatible.  That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device.  These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since the
  normal response from users is that *migration broke their VM* since it had
  apparently been running fine until then.  In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.

- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in the cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code).  Check that offsets
  into buffers and the like can't cause overruns.  Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent.  For example, the order of PCI capabilities
  is required to stay constant across migration.  Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration.  Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those.  For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination.  This can result in the
  device state being loaded into the wrong device.

VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from hw/input/pckbd.c):

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,          /* version written when saving */
      .minimum_version_id = 3,  /* oldest version accepted when loading */
      .fields = (VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

This declares the state with the name "pckbd".
The ``version_id`` is 3, and the fields are four uint8_t members of a KBDState
structure.  The description is registered with:

.. code:: c

    vmstate_register(NULL, 0, &vmstate_kbd, s);

For devices that are ``qdev`` based, we can register the device in the class
init function:

.. code:: c

    dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``).  Various other macros exist for special
cases.

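To make "formatted portably" a little more concrete, the following
standalone sketch shows what a big-endian encoding of a 32-bit value looks
like.  It is only an illustration: ``sketch_put_be32`` and
``sketch_get_be32`` are hypothetical helpers written for this example, not
QEMU functions, and the real stream is produced by the VMState machinery
rather than by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a QEMU API): write v into buf as four bytes,
 * most significant byte first, independent of the host byte order.
 * Returns the number of bytes written.
 */
static size_t sketch_put_be32(uint8_t *buf, uint32_t v)
{
    assert(buf != NULL);
    buf[0] = (uint8_t)(v >> 24);
    buf[1] = (uint8_t)(v >> 16);
    buf[2] = (uint8_t)(v >> 8);
    buf[3] = (uint8_t)v;
    return 4;
}

/* Hypothetical helper (not a QEMU API): inverse of sketch_put_be32(). */
static uint32_t sketch_get_be32(const uint8_t *buf)
{
    assert(buf != NULL);
    return ((uint32_t)buf[0] << 24) | ((uint32_t)buf[1] << 16) |
           ((uint32_t)buf[2] << 8) | (uint32_t)buf[3];
}
```

Because the byte order is fixed by the shifts rather than by the host, a
value written on a little-endian source reads back unchanged on a
big-endian destination.
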
Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
ends up with a 4 byte big-endian representation on the wire; in the future
it might be possible to use a more structured format.

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE;
although converting existing code can be tricky, and thus 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the ``save_state``
and ``load_state`` functions.  Notice that ``load_state`` receives a version_id
parameter so that it knows which state format it is receiving.  ``save_state``
doesn't have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques.  In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate.  This is best solved using a subsection.

A subsection is "like" a device vmstate, but with one particularity: it
has a Boolean function that tells whether its values need to be sent
or not.  If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked for on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state successfully.  There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases.  This is OK for data that's
critical, but in some use cases it's preferred that the migration
should succeed even with the data missing.  To support this the
subsection can be connected to a device property and from there
to a versioned machine type.

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.  A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.

Example:

.. code:: c

  /* Only send the PIO state while a PIO transfer is in flight. */
  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (VMStateField[]) {
          /* .... several fields .... */
          VMSTATE_END_OF_LIST()
      },
      /* NULL-terminated list of optional subsections */
      .subsections = (const VMStateDescription*[]) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.

Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

     ``VMSTATE_UINT32(foo, barstruct)``

     becomes

     ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

     Sometime in the future when we no longer care about the ancient versions these can be killed off.
     Note that for backward compatibility it's important to fill in the structure with
     data that the destination will understand.

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved.  ``save_state`` always saves
the state as the newest version, but ``load_state`` is sometimes able to
load state from an older version.

You can see that there are two version fields:

- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
  for that device.

VMState is able to read versions from minimum_version_id to version_id.

There are *_V* forms of many ``VMSTATE_`` macros for version dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),

only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.

Massaging functions
-------------------

Sometimes it is not enough to be able to save the state directly
from one structure; we need to fill in the correct values there first.  One
example is when we are using kvm.  Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using.  The
opposite applies when we are loading the state: we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.

The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a ``post_load`` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

Iterative device migration
--------------------------

Some devices, such as RAM, block storage or certain platform devices,
have such large amounts of data that the CPUs would be
paused for too long if it were all sent in one section.  For these
devices an *iterative* approach is taken.

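The idea behind the iterative approach can be sketched with a toy model:
data is sent in chunks while the guest keeps running (and keeps dirtying
some of it), and the CPUs are only paused once the amount still pending is
small enough.  The code below is a simplified, standalone model under
assumed numbers, not QEMU's algorithm; ``SketchResult`` and
``sketch_precopy`` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

typedef struct SketchResult {
    unsigned rounds;      /* chunks sent while the guest was running */
    uint64_t final_bytes; /* bytes left to send once the CPUs are paused */
} SketchResult;

/*
 * Toy model: each round sends up to 'chunk' bytes, after which the running
 * guest dirties 'dirty_per_round' bytes again.  Iteration stops once the
 * pending amount is at or below 'threshold'.
 */
static SketchResult sketch_precopy(uint64_t pending, uint64_t chunk,
                                   uint64_t dirty_per_round,
                                   uint64_t threshold)
{
    SketchResult r = { 0, 0 };

    /* Both conditions are needed for this toy loop to terminate. */
    assert(chunk > dirty_per_round);
    assert(dirty_per_round <= threshold);

    while (pending > threshold) {
        uint64_t sent = pending < chunk ? pending : chunk;

        pending -= sent;            /* data transferred this round */
        pending += dirty_per_round; /* data the guest dirtied meanwhile */
        r.rounds++;
    }
    r.final_bytes = pending;
    return r;
}
```

In this model the guest is only stopped for the time needed to send
``final_bytes``, rather than for the whole of the initial ``pending``
amount.
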
The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use the
``qemu_put_*``/``qemu_get_*`` helpers to write/read data to/from the stream.
Specialist versions exist for high bandwidth IO.


An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device.  In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``state_pending_exact`` function that indicates how much more
    data we must save.  The core migration code will use this to
    determine when to pause the CPUs and complete the migration.

  - A ``state_pending_estimate`` function that estimates how much more
    data we must save.  When the estimated amount is smaller than the
    threshold, we call ``state_pending_exact``.

  - A ``save_live_iterate`` function that sends a chunk of data until
    the point that stream bandwidth limits tell it to stop.  Each call
    generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.

Note that the contents of the sections for iterative migration tend
to be open-coded by the devices; care should be taken in parsing
the results and structuring the stream to make them easy to validate.

Device ordering
---------------

There are cases in which the ordering of device loading matters; for
example in some systems where a device may assert an interrupt during loading,
if the interrupt controller is loaded later then it might lose the state.

Some ordering is implicitly provided by the order in which the machine
definition creates devices, however this is somewhat fragile.

The ``MigrationPriority`` enum provides a means of explicitly enforcing
ordering.  Numerically higher priorities are loaded earlier.
The priority is set by setting the ``priority`` field of the top level
``VMStateDescription`` for the device.

Stream structure
================

The stream tries to be word-size and endianness agnostic, allowing migration
between hosts of different characteristics running the same VM.

  - Header

    - Magic
    - Version
    - VM configuration section

       - Machine type
       - Target page bits
  - List of sections
    Each section contains a device, or one iteration of a device save.

    - section type
    - section id
    - ID string (First section of each device)
    - instance id (First section of each device)
    - version id (First section of each device)
    - <device data>
    - Footer mark
  - EOF mark
  - VM Description structure
    Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above.
Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that's up to the devices themselves.
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well.  Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
for otherwise identically named devices.

Return path
-----------

Only a unidirectional stream is required for normal migration, however a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound on
the amount of migration traffic and the time it takes; the down side is that
during the postcopy phase, a failure of *either* side or the network connection
causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.


Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the query-migrate QMP command.
The postcopy-blocktime value of the QMP command shows the overlapped blocking
time for all vCPUs; postcopy-vcpu-blocktime shows a list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM
that may trigger faults that have to be resolved by the source.  As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up.  This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

  - Command: 'postcopy listen'
  - The device state

    A series of sections, identical to the precopy stream's device state stream,
    containing everything except postcopiable devices (i.e. RAM)
  - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy),
however when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2     3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE  DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.
   The listen thread loads normal
   background page data (b), but if a fault happens during a device load (5)
   the returned page (c) is loaded by the listen thread, allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)


 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.

Source side page maps
---------------------

The source side keeps two bitmaps during postcopy: the 'migration bitmap'
and the 'unsent map'.  The 'migration bitmap' is basically the same as in
the precopy case, and holds a bit to indicate that a page is 'dirty' -
i.e. needs sending.  During the precopy phase this is updated as the CPU
dirties pages, however during postcopy the CPUs are stopped and nothing
should dirty anything any more.

The 'unsent map' is used for the transition to postcopy. It is a bitmap that
has a bit cleared whenever a page is sent to the destination, however during
the transition to postcopy mode it is combined with the migration bitmap
to form a set of pages that:

   a) Have been sent but then redirtied (which must be discarded)
   b) Have not yet been sent - which also must be discarded to cause any
      transparent huge pages built during precopy to be broken.

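The combination of the two maps can be written as a single bitwise
operation.  The sketch below is a simplified model, not QEMU's actual code:
``toy_discard_set`` and ``toy_test_bit`` are hypothetical helpers, and the
one-bit-per-page layout in 64-bit words is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Build the discard set in place.  A set bit in 'unsent' means the page
 * has not been sent yet; a set bit in 'dirty' means it needs sending. */
static void toy_discard_set(uint64_t *unsent, const uint64_t *dirty,
                            size_t nwords)
{
    assert(unsent && dirty);
    for (size_t i = 0; i < nwords; i++) {
        /* (a) sent then redirtied = ~unsent & dirty
         * (b) never sent          = unsent
         * Their union simplifies to unsent | dirty.  The result
         * overwrites 'unsent', so that map loses its original meaning. */
        unsent[i] |= dirty[i];
    }
}

/* Return 1 if 'page' is set in 'map', else 0. */
static int toy_test_bit(const uint64_t *map, size_t page)
{
    return (int)((map[page / 64] >> (page % 64)) & 1);
}
```

With four pages where pages 2 and 3 are unsent and pages 1 and 2 are dirty,
the resulting set contains pages 1, 2 and 3; page 0 was sent and stayed
clean, so it is kept on the destination.
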
Note that the contents of the unsent map are sacrificed during the calculation
of the discard set and thus aren't valid once in postcopy.  The dirty map
(the migration bitmap) is still valid and is used to ensure that no page is
sent more than once.  Any request for a page that has already been sent is
ignored.  Duplicate requests such as this can happen as a page is sent at
about the same time the destination accesses it.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.

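The figure in (d) follows from simple arithmetic.  The helper below is a
hypothetical illustration (``page_transfer_seconds`` is not a QEMU function)
and assumes an ideal link with no protocol overhead or latency.

```c
#include <assert.h>
#include <stdint.h>

/* Ideal wire time in seconds for one page of 'page_bytes' bytes on a
 * link of 'link_bits_per_sec' bits per second. */
static double page_transfer_seconds(uint64_t page_bytes,
                                    double link_bits_per_sec)
{
    assert(link_bits_per_sec > 0.0);
    return (double)page_bytes * 8.0 / link_bits_per_sec;
}
```

For a 10Gbps link, ``page_transfer_seconds(1ULL << 30, 10e9)`` gives roughly
0.86 seconds for a 1GB page, while ``page_transfer_seconds(2ULL << 20, 10e9)``
gives roughly 1.7 milliseconds for a 2MB page.  A vCPU that faults on a 1GB
page is stalled for at least the first figure.
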

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

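The address conversion that the mapping table makes possible can be
sketched as a range lookup.  The layout below is purely illustrative:
``ToyMapping`` and ``toy_translate`` are hypothetical names and do not
reflect the structures QEMU or vhost-user actually exchange.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping-table entry: one client mapping of one RAMBlock. */
typedef struct ToyMapping {
    uint64_t client_base;   /* start of the mapping in the client's address space */
    uint64_t size;          /* length of the mapping in bytes */
    const char *ramblock;   /* name of the RAMBlock that is mapped */
    uint64_t block_offset;  /* offset of the mapping within that RAMBlock */
} ToyMapping;

/* Translate a client fault address.  Returns the matching entry and stores
 * the RAMBlock offset in *offset, or returns NULL if no mapping covers it. */
static const ToyMapping *toy_translate(const ToyMapping *tab, size_t n,
                                       uint64_t fault_addr, uint64_t *offset)
{
    assert(tab && offset);
    for (size_t i = 0; i < n; i++) {
        if (fault_addr >= tab[i].client_base &&
            fault_addr - tab[i].client_base < tab[i].size) {
            *offset = tab[i].block_offset +
                      (fault_addr - tab[i].client_base);
            return &tab[i];
        }
    }
    return NULL;
}
```

A linear scan is enough for a sketch because a client typically maps only a
handful of regions; the resulting RAMBlock name and offset are what a page
request to the source would need.
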

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Firmware
========

Migration migrates the copies of RAM and ROM, and thus when running
on the destination it includes the firmware from the source. Even after
resetting a VM, the old firmware is used.  Only once QEMU has been restarted
is the new firmware in use.

- Changes in firmware size can cause changes in the required RAMBlock size
  to hold the firmware and thus migration can fail.  In practice it's best
  to pad firmware images to convenient powers of 2 with plenty of space
  for growth.

- Care should be taken with device emulation code so that newer
  emulation code can work with older firmware to allow forward migration.

- Care should be taken with newer firmware so that backward migration
  to older systems with older device emulation code will work.

In some cases it may be best to tie specific firmware versions to specific
versioned machine types to cut down on the combinations that will need
support.  This is also useful when newer versions of firmware outgrow
the padding.

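The padding advice can be illustrated with a small calculation.  Both
functions below are hypothetical helpers written for this example, and the
``headroom`` parameter is an assumed way of expressing "plenty of space for
growth".

```c
#include <assert.h>
#include <stdint.h>

/* Smallest power of two that is >= size (size must be 1..2^63). */
static uint64_t toy_pow2_ceil(uint64_t size)
{
    uint64_t p = 1;
    assert(size != 0 && size <= (UINT64_C(1) << 63));
    while (p < size) {
        p <<= 1;
    }
    return p;
}

/* Padded image size that leaves at least 'headroom' bytes for growth.
 * Assumes image_size + headroom does not overflow. */
static uint64_t toy_padded_firmware_size(uint64_t image_size,
                                         uint64_t headroom)
{
    return toy_pow2_ceil(image_size + headroom);
}
```

A 200KiB image with no headroom pads to 256KiB, so any later build of up to
256KiB keeps the same padded size.  Asking for 100KiB of headroom pads the
same image to 512KiB instead.
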