=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments
both times, i.e. it can only restore the state into a guest that has
the same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration has
to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality
becomes possible: migration.  This means that QEMU can start a guest on
one machine and then have it "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all state from one machine to another.  Live
migration allows the guest to continue running while the state is
transferred.
The guest only has to be stopped while the last part of the state is
transferred.  Typically the time that the guest is unresponsive during
live migration is in the low hundreds of milliseconds (note that this
depends on a lot of things).

.. contents::

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower.  While the
internals of RDMA migration are a bit different, this isn't really visible
outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

Debugging
=========

The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``.

Example usage:

.. code-block:: shell

  $ qemu-system-x86_64 -display none -monitor stdio
  (qemu) migrate "exec:cat > mig"
  (qemu) q
  $ ./scripts/analyze-migration.py -f mig
  {
    "ram (3)": {
        "section sizes": {
            "pc.ram": "0x0000000008000000",
  ...

See also ``analyze-migration.py -h`` help for more options.

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see ``migration/qemu-file.h``).  In most cases this
is connected to a subtype of ``QIOChannel`` (see ``io/``).


Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices.  The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables).  See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works.  That way if you change the implementation
  later the migration stream will stay compatible.  That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device.  These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since the
  normal response from users is that *migration broke their VM* since it had
  apparently been running fine until then.  In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.


- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in a cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code).  Check that offsets
  into buffers and the like can't cause overruns.  Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent.  For example, the order of PCI capabilities
  is required to stay constant across migration.  Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration.  Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those.  For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination.  This can result in the
  device state being loaded into the wrong device.

VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from hw/input/pckbd.c)

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd".  The ``version_id`` is
3, and there are 4 uint8_t fields in the KBDState structure.  We
register this ``VMStateDescription`` with one of the following
functions.  The first one will generate a device ``instance_id``
different for each registration.  Use the second one if you already
have an id that is different for each instance of the device:

.. code:: c

    vmstate_register_any(NULL, &vmstate_kbd, s);
    vmstate_register(NULL, instance_id, &vmstate_kbd, s);

For devices that are ``qdev`` based, we can register the device in the class
init function:

.. code:: c

    dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

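
To make the "formatted portably (normally big endian)" point concrete, the
following standalone sketch encodes a 32-bit value as four big-endian bytes
and decodes it again, independent of the host byte order.  It is only an
illustration: ``put_be32_sketch`` and ``get_be32_sketch`` are hypothetical
helper names invented here, not QEMU functions, and in real device code the
VMState macros do this work for you.

.. code:: c

  #include <stdint.h>

  /* Hypothetical helper (not a QEMU API): store v as 4 big-endian bytes. */
  static void put_be32_sketch(uint8_t out[4], uint32_t v)
  {
      out[0] = (uint8_t)(v >> 24);   /* most significant byte first */
      out[1] = (uint8_t)(v >> 16);
      out[2] = (uint8_t)(v >> 8);
      out[3] = (uint8_t)v;           /* least significant byte last */
  }

  /* Hypothetical helper (not a QEMU API): rebuild the value from the wire. */
  static uint32_t get_be32_sketch(const uint8_t in[4])
  {
      return ((uint32_t)in[0] << 24) |
             ((uint32_t)in[1] << 16) |
             ((uint32_t)in[2] << 8)  |
              (uint32_t)in[3];
  }

Because the byte order is fixed by the shifts rather than by the host's
memory layout, a little-endian source and a big-endian destination agree on
the value that was sent.
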

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``).  Various other macros exist for special
cases.

Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
ends up with a 4 byte big-endian representation on the wire; in the future
it might be possible to use a more structured format.

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMState,
although converting existing code can be tricky, and thus 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the ``save_state``
and ``load_state`` functions.
Notice that ``load_state`` receives a version_id
parameter so that it knows which state format it is receiving.  ``save_state`` doesn't
have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques.  In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate.  This is best solved using a subsection.

A subsection is "like" a device vmstate, but with one particularity: it
has a Boolean function that tells whether its values need to be sent
or not.  If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked for on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state successfully.  There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases.  This is OK for data that's
critical, but in some use cases it's preferred that the migration
should succeed even with the data missing.  To support this the
subsection can be connected to a device property and from there
to a versioned machine type.

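
As a minimal sketch of such a condition function, assume a device state
structure with a boolean member that mirrors a device property.
``FooState``, ``support_foo`` and ``foo_subsection_needed`` are hypothetical
names used only for illustration; the function merely has the same shape as
a subsection's ``needed`` callback (it takes the opaque device pointer and
returns a bool).

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical device state, invented for this sketch. */
  typedef struct FooState {
      bool support_foo;   /* assumed to be set from a device property */
      uint32_t foo;       /* the new data the subsection would carry */
  } FooState;

  /* Candidate for a subsection's .needed hook: send only if enabled. */
  static bool foo_subsection_needed(void *opaque)
  {
      FooState *s = opaque;

      return s->support_foo;
  }

If the property defaults to true and an older machine type sets it to
false, the predicate returns false there and the subsection is left out of
the stream.
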

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.  A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (VMStateField[]) {
          .... several fields ....
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription*[]) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.

Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

     ``VMSTATE_UINT32(foo, barstruct)``

     becomes

     ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

     Sometime in the future when we no longer care about the ancient versions these can be killed off.
     Note that for backward compatibility it's important to fill in the structure with
     data that the destination will understand.

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved.  ``save_state`` always saves
the state as the newest version, but ``load_state`` is sometimes able to
load state from an older version.

You can see that there are two version fields:

- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
  for that device.

VMState is able to read versions from minimum_version_id to version_id.

There are *_V* forms of many ``VMSTATE_`` macros for version dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),


only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.

Massaging functions
-------------------

Sometimes, it is not enough to be able to save the state directly
from one structure; we first need to fill in the correct values there.  One
example is when we are using kvm.  Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using.  Conversely,
when we are loading the state, we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.


- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.

The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a ``post_load`` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

Iterative device migration
--------------------------


Some devices, such as RAM, block storage or certain platform devices,
have large amounts of data; sending it all in one section would pause
the CPUs for too long. For these devices an *iterative* approach is taken.

The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use
qemu_put_*/qemu_get_* macros to write/read data to/from the stream. Specialist
versions exist for high bandwidth IO.

An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device. In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``state_pending_exact`` function that indicates how much more
    data we must save. The core migration code will use this to
    determine when to pause the CPUs and complete the migration.

  - A ``state_pending_estimate`` function that estimates how much more
    data we must save. When the estimated amount is smaller than the
    threshold, we call ``state_pending_exact``.

  - A ``save_live_iterate`` function that sends chunks of data until
    the stream bandwidth limits tell it to stop. Each call
    generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.

Note that the contents of the sections for iterative migration tend
to be open-coded by the devices; care should be taken in parsing
the results and structuring the stream to make them easy to validate.

Device ordering
---------------

There are cases in which the ordering of device loading matters; for
example in some systems where a device may assert an interrupt during loading,
if the interrupt controller is loaded later then it might lose the state.

Some ordering is implicitly provided by the order in which the machine
definition creates devices; however, this is somewhat fragile.

The ``MigrationPriority`` enum provides a means of explicitly enforcing
ordering. Numerically higher priorities are loaded earlier.
The priority is set by setting the ``priority`` field of the top level
``VMStateDescription`` for the device.

Stream structure
================

The stream tries to be word-size and endian agnostic, allowing migration between hosts
of different characteristics running the same VM.

 - Header

   - Magic
   - Version
   - VM configuration section

      - Machine type
      - Target page bits
 - List of sections
   Each section contains a device, or one iteration of a device save.

   - section type
   - section id
   - ID string (First section of each device)
   - instance id (First section of each device)
   - version id (First section of each device)
   - <device data>
   - Footer mark
 - EOF mark
 - VM Description structure
   Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above. Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that's up to the devices themselves.
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well. Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
for otherwise identically named devices.

Return path
-----------

Only a unidirectional stream is required for normal migration; however, a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and time it takes; the down side is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode. Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on. Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to a page fault.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU. These values are calculated on the destination
side.
To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value in its reply shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM
that may trigger faults that have to be resolved by the source. As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up. This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location. This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.

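The scan-restart behaviour described above can be pictured with a small,
self-contained model. This is only a sketch under simplifying assumptions:
``PageScan``, ``page_scan_request`` and ``page_scan_next`` are hypothetical
names invented for this illustration, not QEMU functions, and a tiny fixed
array of flags stands in for the real dirty bitmap.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical, tiny guest size used only for this illustration. */
  #define PAGE_SCAN_NPAGES 16

  typedef struct {
      bool dirty[PAGE_SCAN_NPAGES]; /* true = page still needs sending */
      size_t cursor;                /* where the background scan resumes */
  } PageScan;

  /* A page request from the destination moves the scan position. */
  static void page_scan_request(PageScan *s, size_t page)
  {
      assert(page < PAGE_SCAN_NPAGES);
      s->cursor = page;
  }

  /*
   * Return the next dirty page at or after the cursor, wrapping round
   * once, and clear its dirty bit. Returns PAGE_SCAN_NPAGES when no
   * dirty page is left.
   */
  static size_t page_scan_next(PageScan *s)
  {
      for (size_t i = 0; i < PAGE_SCAN_NPAGES; i++) {
          size_t page = (s->cursor + i) % PAGE_SCAN_NPAGES;

          if (s->dirty[page]) {
              s->dirty[page] = false;
              s->cursor = (page + 1) % PAGE_SCAN_NPAGES;
              return page;
          }
      }
      return PAGE_SCAN_NPAGES;
  }

In this model, with pages 2, 5, 9 and 10 dirty, a request for page 9 that
arrives after page 2 has been sent makes the following calls return 9, then
10, and only afterwards wrap round to page 5. The requested page and its
neighbours jump ahead of the rest of the background scan, which is the effect
the text describes.
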

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

  All the data associated with the package - the ( ... ) section in the diagram -
  is read into memory, and the main thread recurses into qemu_loadvm_state_main
  to process the contents of the package (2) which contains commands (3,6) and
  devices (4...)

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package)

  a new thread (a) is started that takes over servicing the migration stream,
  while the main thread carries on loading the package. The new thread loads
  normal background page data (b) but if during a device load a fault happens (5)
  the returned page (c) is loaded by the listen thread allowing the main
  thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

  letting the destination CPUs start running. At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
  is no longer used by migration, while the listen thread carries on servicing
  page data until the end of migration.

Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling. When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances. Instead, when an issue happens, QEMU on both sides
will go into a paused state. A recovery phase is then needed to continue
the paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMU instances will go into the PAUSED state

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by any access to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the VM
thread will be halted waiting for the page to be migrated, which means it can
be halted until the recovery is complete.

The impact of accessing missing pages can depend on the
configuration of the guest. For example, with async page fault
enabled, logically the guest can proactively schedule out the threads
accessing missing pages.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob. With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running. The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can enter a paused state (normally on both sides when it
    happens), in which all threads are temporarily halted, mostly due to
    network errors. On reaching the paused state, migration makes sure
    that the QEMU processes on both sides maintain the data without
    corrupting the VM. To continue the migration, the admin needs to fix the
    migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command 'migrate'
    again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit, and perform the cleanup of migration
    state; the migration is now complete.

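The progression above can be summarised as a small transition table. The
following is a minimal sketch, assuming only the transitions named in this
section; ``PcState`` and ``pc_transition_ok`` are hypothetical names used for
illustration and are not the identifiers used by QEMU's ``postcopy_state``
code, which may permit transitions not modelled here.

.. code:: c

  #include <assert.h>
  #include <stdbool.h>

  /* Hypothetical state names mirroring the list above. */
  typedef enum {
      PC_ADVISE,
      PC_DISCARD,
      PC_LISTEN,
      PC_RUNNING,
      PC_PAUSED,
      PC_END,
  } PcState;

  /* Return true if 'from' -> 'to' is a transition described in the text. */
  static bool pc_transition_ok(PcState from, PcState to)
  {
      assert(from <= PC_END && to <= PC_END);

      switch (from) {
      case PC_ADVISE:
          return to == PC_DISCARD;
      case PC_DISCARD:
          return to == PC_LISTEN;
      case PC_LISTEN:
          return to == PC_RUNNING;
      case PC_RUNNING:
          /* Either finish, or pause (mostly on network errors). */
          return to == PC_END || to == PC_PAUSED;
      case PC_PAUSED:
          /* A successful recovery resumes the migration. */
          return to == PC_RUNNING;
      case PC_END:
          return false;
      }
      return false;
  }

A checker of this kind makes the ordering explicit: skipping straight from
Advise to Running is rejected, while Running and Paused can alternate for as
long as recovery keeps succeeding.
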

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending. During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the types of
shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault. The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy. In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy Preemption Mode
------------------------

Postcopy preempt is a new capability introduced in the QEMU 8.0 release. It
allows urgent pages (those explicitly requested by the destination QEMU
after a page fault) to be sent in a separate preempt channel, rather than queued in
the background migration channel.
Anyone who cares about latencies of page
faults during a postcopy migration should enable this feature. By default,
it's not enabled.

Firmware
========

Migration migrates the copies of RAM and ROM, and thus when running
on the destination it includes the firmware from the source. Even after
resetting a VM, the old firmware is used. Only once QEMU has been restarted
is the new firmware in use.

- Changes in firmware size can cause changes in the required RAMBlock size
  to hold the firmware and thus migration can fail. In practice it's best
  to pad firmware images to convenient powers of 2 with plenty of space
  for growth.

- Care should be taken with device emulation code so that newer
  emulation code can work with older firmware to allow forward migration.

- Care should be taken with newer firmware so that backward migration
  to older systems with older device emulation code will work.

In some cases it may be best to tie specific firmware versions to specific
versioned machine types to cut down on the combinations that will need
support. This is also useful when newer versions of firmware outgrow
the padding.

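As a small worked example of the padding advice above, the padded size can be
computed by rounding the image size, plus some room for growth, up to the next
power of two. This is only a sketch: ``fw_padded_size`` and its ``headroom``
parameter are hypothetical names chosen for illustration and are not part of
QEMU or of any firmware build system.

.. code:: c

  #include <assert.h>
  #include <stdint.h>

  /*
   * Round (image_size + headroom) up to the next power of two.
   * 'headroom' is the space reserved for future firmware growth.
   */
  static uint64_t fw_padded_size(uint64_t image_size, uint64_t headroom)
  {
      uint64_t needed;
      uint64_t padded = 1;

      /* Keep the arithmetic below free of overflow. */
      assert(image_size <= UINT64_MAX - headroom);
      needed = image_size + headroom;
      assert(needed > 0 && needed <= (UINT64_C(1) << 63));

      while (padded < needed) {
          padded <<= 1;
      }
      return padded;
  }

With these assumptions, a 200KiB image with no headroom pads to 256KiB, while
the same image with 100KiB reserved for growth pads to 512KiB. As long as later
firmware builds stay within the padded size, the RAMBlock size seen by
migration does not change.
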
David Alan Gilbert 9261aefe2caSJuan Quintela 9271aefe2caSJuan QuintelaBackwards compatibility 9281aefe2caSJuan Quintela======================= 9291aefe2caSJuan Quintela 9301aefe2caSJuan QuintelaHow backwards compatibility works 9311aefe2caSJuan Quintela--------------------------------- 9321aefe2caSJuan Quintela 9331aefe2caSJuan QuintelaWhen we do migration, we have two QEMU processes: the source and the 9341aefe2caSJuan Quintelatarget. There are two cases, they are the same version or they are 9351aefe2caSJuan Quinteladifferent versions. The easy case is when they are the same version. 9361aefe2caSJuan QuintelaThe difficult one is when they are different versions. 9371aefe2caSJuan Quintela 9381aefe2caSJuan QuintelaThere are two things that are different, but they have very similar 9391aefe2caSJuan Quintelanames and sometimes get confused: 9401aefe2caSJuan Quintela 9411aefe2caSJuan Quintela- QEMU version 9421aefe2caSJuan Quintela- machine type version 9431aefe2caSJuan Quintela 9441aefe2caSJuan QuintelaLet's start with a practical example, we start with: 9451aefe2caSJuan Quintela 9461aefe2caSJuan Quintela- qemu-system-x86_64 (v5.2), from now on qemu-5.2. 9471aefe2caSJuan Quintela- qemu-system-x86_64 (v5.1), from now on qemu-5.1. 9481aefe2caSJuan Quintela 9491aefe2caSJuan QuintelaRelated to this are the "latest" machine types defined on each of 9501aefe2caSJuan Quintelathem: 9511aefe2caSJuan Quintela 9521aefe2caSJuan Quintela- pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2 9531aefe2caSJuan Quintela- pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1 9541aefe2caSJuan Quintela 9551aefe2caSJuan QuintelaFirst of all, migration is only supposed to work if you use the same 9561aefe2caSJuan Quintelamachine type in both source and destination. The QEMU hardware 9571aefe2caSJuan Quintelaconfiguration needs to be the same also on source and destination. 
Most aspects of the backend configuration can be changed at will,
except for a few cases where the backend features influence frontend
device feature exposure.  But that is not relevant for this section.

I am going to list the combinations that we can have.  Let's
start with the trivial ones, where QEMU is the same on source and
destination:

1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2

  This is the latest QEMU with the latest machine type.
  This has to work, and if it doesn't work it is a bug.

2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

  Exactly the same case as the previous one, but for 5.1.
  Nothing to see here either.

These are the easiest ones, we will not talk more about them in this
section.

Now we start with the more interesting cases.  Consider the case where
we have the same QEMU version on both sides (qemu-5.2) but we are not
using the latest machine type for that version (pc-5.2) but one of an
older QEMU version, in this case pc-5.1.

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

  It needs to use the definition of pc-5.1 and the devices as they
  were configured on 5.1, but this should be easy in the sense that
  both sides are the same QEMU and both sides have exactly the same
  idea of what the pc-5.1 machine is.

4 - qemu-5.1 -M pc-5.2  -> migrates to -> qemu-5.1 -M pc-5.2

  This combination is not possible as qemu-5.1 doesn't understand the
  pc-5.2 machine type.  So there is nothing to worry about here.

Now come the interesting ones, when both QEMU processes are
different.  Notice also that the machine type needs to be pc-5.1,
because we have the limitation that qemu-5.1 doesn't know pc-5.2.  So
the possible cases are:

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

  This migration is known as newer to older.  While developing 5.2 we
  need to take care not to break migration to qemu-5.1.  Notice that
  we can't make updates to qemu-5.1 to understand whatever qemu-5.2
  decides to change, so it is up to the qemu-5.2 side to make the
  relevant changes.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

  This migration is known as older to newer.  We need to make sure
  that we are able to receive migrations from qemu-5.1.  The problem is
  similar to the previous one.

If qemu-5.1 and qemu-5.2 were the same, there would not be any
compatibility problems.  But the reason that we create qemu-5.2 is to
get new features, devices, defaults, etc.

If we get a device that has a new feature, or change a default value,
we have a problem when we try to migrate between different QEMU
versions.

So we need a way to tell qemu-5.2 that when we are using machine type
pc-5.1, it needs to **not** use the feature, to be able to migrate to
real qemu-5.1.

And the equivalent applies when migrating from qemu-5.1 to qemu-5.2.
qemu-5.2 has to expect that it is not going to get data for the new
feature, because qemu-5.1 doesn't know about it.

How do we tell QEMU about these device feature changes?  In
hw/core/machine.c:hw_compat_X_Y arrays.

If we change a default value, we need to put back the old value on
that array.  And the device, during initialization, needs to look at
that array to see what value it needs to get for that feature.  What
we put in that array is the value of a property.

To create a property for a device, we need to use one of the
DEFINE_PROP_*() macros.
See include/hw/qdev-properties.h to find the
macros that exist.  With it, we set the default value for that
property, and that is what it is going to get in the latest released
version.  But if we want a different value for a previous version, we
can change that in the hw_compat_X_Y arrays.

hw_compat_X_Y is an array of records that have the format:

- name_device
- name_property
- value

Let's see a practical example.

In qemu-5.2 virtio-blk-device got multi queue support.  This is a
change that is not backward compatible.  In qemu-5.1 it has one
queue.  In qemu-5.2 it has the same number of queues as the number of
cpus in the system.

When we are doing migration, if we migrate from a device that has 4
queues to a device that has only one queue, we don't know where to
put the extra information for the other 3 queues, and we fail
migration.

There is a similar problem when we migrate from qemu-5.1, which has only
one queue, to qemu-5.2: we only send information for one queue, but the
destination has 4, so we have 3 queues that are not properly initialized
and anything can happen.

So, how can we address this problem?
Easy, just convince qemu-5.2
that when it is running pc-5.1, it needs to set the number of queues
for virtio-blk-devices to 1.

That way we fix the cases 5 and 6.

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

    qemu-5.2 -M pc-5.1 sets number of queues to be 1.
    qemu-5.1 -M pc-5.1 expects number of queues to be 1.

    correct.  migration works.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

    qemu-5.1 -M pc-5.1 sets number of queues to be 1.
    qemu-5.2 -M pc-5.1 expects number of queues to be 1.

    correct.  migration works.

And now the other interesting case, case 3.  In this case we have:

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

    Here we have the same QEMU on both sides.  So it doesn't matter a
    lot if we have set the number of queues to 1 or not, because
    they are the same.

    WRONG!

    Think what happens if we do one of these double migrations:

    A -> migrates -> B -> migrates -> C

    where:

    A: qemu-5.1 -M pc-5.1
    B: qemu-5.2 -M pc-5.1
    C: qemu-5.2 -M pc-5.1

    migration A -> B is case 6, so number of queues needs to be 1.

    migration B -> C is case 3, so we don't care.  But actually we
    care because we haven't started the guest in qemu-5.2, it came
    migrated from qemu-5.1.  So to be on the safe side, we need to
    always use number of queues 1 when we are using pc-5.1.

Now, how was this done in reality?
The following commit shows how it
was done::

    commit 9445e1e15e66c19e42bea942ba810db28052cd05
    Author: Stefan Hajnoczi <stefanha@redhat.com>
    Date:   Tue Aug 18 15:33:47 2020 +0100

        virtio-blk-pci: default num_queues to -smp N

The relevant parts for migration are::

    @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
     #endif
         DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
                         true),
    -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
    +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
    +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
         DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),

It changes the default value of num_queues.  But it fixes it for old
machine types to have the right value::

    @@ -31,6 +31,7 @@
     GlobalProperty hw_compat_5_1[] = {
         ...
    +    { "virtio-blk-device", "num-queues", "1"},
         ...
     };

A device with different features on both sides
----------------------------------------------

Let's assume that we are using the same QEMU binary on both sides,
just to make things easier.
But we have a device that has
different features on both sides of the migration.  That can be
because the devices are different, because the kernel drivers of both
devices have different features, whatever.

How can we get this to work with migration?  The way to do that is
"theoretically" easy.  You have to get the features that the device
has in the source of the migration and the features that the device has
on the target of the migration.  You get the intersection of the
features of both sides, and that is the way that you should launch
QEMU.

Notice that this is not completely related to QEMU.  The most
important thing here is that this should be handled by the managing
application that launches QEMU.  If QEMU is configured correctly, the
migration will succeed.

That said, actually doing it is complicated.  Almost all devices are
bad at being able to be launched with only some features enabled.
With one big exception: cpus.

You can read the documentation for QEMU x86 cpu models here:

https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html

Notice that, when it talks about migration, it recommends choosing the
newest cpu model that is supported by all cpus.

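
The intersection rule described above can be sketched in C.  This is only
an illustration of what a management application might do, not QEMU code:
the names ``has_feature()`` and ``features_to_disable()`` are hypothetical,
and feature lists are assumed to be NULL-terminated arrays of strings.

.. code-block:: c

    #include <stddef.h>
    #include <string.h>

    /* Return 1 if `name` is in the NULL-terminated list `set`, else 0. */
    int has_feature(const char *const *set, const char *name)
    {
        for (size_t i = 0; set[i] != NULL; i++) {
            if (strcmp(set[i], name) == 0) {
                return 1;
            }
        }
        return 0;
    }

    /*
     * Collect every feature of `local` that `remote` lacks.  These are
     * the features that have to be switched off on the local host so
     * that both sides expose only the common subset (hypothetical
     * helper).  Returns the number of entries written to `out`, which
     * is at most `max`.
     */
    size_t features_to_disable(const char *const *local,
                               const char *const *remote,
                               const char **out, size_t max)
    {
        size_t n = 0;

        for (size_t i = 0; local[i] != NULL && n < max; i++) {
            if (!has_feature(remote, local[i])) {
                out[n++] = local[i];
            }
        }
        return n;
    }

Each feature returned for a host would be passed to that host's QEMU as
``feature=off``, so that both sides end up with the intersection of the
two feature sets.
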

Let's say that we have:

Host A:

Device X has the feature Y

Host B:

Device X does not have the feature Y

If we try to migrate without any care from host A to host B, it will
fail because when migration tries to load the feature Y on
destination, it will find that the hardware is not there.

Doing this would be the equivalent of doing with cpus:

Host A:

$ qemu-system-x86_64 -cpu host

Host B:

$ qemu-system-x86_64 -cpu host

When both hosts have different cpu features this is guaranteed to
fail.  Especially if Host B has fewer features than host A.  If host A
has fewer features than host B, sometimes it works.  The important word
in the last sentence is "sometimes".

So, forgetting about cpu models and continuing with the -cpu host
example, let's say that the difference between the cpus is that Host A
and B have the following features:

=======  ======  =======  ========
Host     'pcid'  'stibp'  'taa-no'
=======  ======  =======  ========
Host A   X       X
Host B                    X
=======  ======  =======  ========

And we want to migrate between them.  The way to configure both QEMU
cpus will be:

Host A:

$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off

Host B:

$ qemu-system-x86_64 -cpu host,taa-no=off

And you would be able to migrate between them.  It is the responsibility
of the management application or of the user to make sure that the
configuration is correct.  QEMU doesn't know how to look at this kind
of features in general.

Notice that we don't recommend using -cpu host for migration.  It is
used in this example because it makes the example simpler.

Other devices have worse control over individual features.  If they
want to be able to migrate between hosts that show different features,
the device needs a way to configure which ones it is going to use.

In this section we have considered that we are using the same QEMU
binary on both sides of the migration.
If we use different QEMU
versions, then we need to take into account all other
differences and the examples become even more complicated.

How to mitigate when we have a backward compatibility error
-----------------------------------------------------------

We break migration for old machine types continuously during
development.  But as soon as we find that there is a problem, we fix
it.  The problem is what happens when we detect after we have done a
release that something has gone wrong.

Let's see how it worked with one example.

After the release of qemu-8.0 we found a problem when doing migration
of the machine type pc-7.2.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration fails

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration fails

So clearly something fails when migrating between qemu-7.2 and
qemu-8.0 with machine type pc-7.2.  The error messages and git bisect
pointed to the following commit.

In qemu-8.0 we got this commit::

    commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Date:   Thu Mar 2 13:37:02 2023 +0000

        hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register


The relevant bits of the commit for our example are these ones::

    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_MASK_DEFAULT);
    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_SUPPORTED);

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

The patch changes how we configure PCI space for AER.  But QEMU fails
when the PCI space configuration is different between source and
destination.

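
Why a changed default can break migration may be easier to see with a
simplified model of a check on incoming PCI config space.  The sketch
below is an illustration under stated assumptions, not the QEMU
implementation: ``first_config_mismatch()`` is a hypothetical name,
``cmask`` is assumed to mark the bits that must match, and ``wmask`` and
``w1cmask`` are assumed to mark bits the guest may change and which may
therefore legitimately differ.

.. code-block:: c

    #include <stddef.h>
    #include <stdint.h>

    /*
     * Simplified model of a migration-time config space check.
     *   incoming: bytes received from the source
     *   local:    bytes the destination device initialised itself with
     *   cmask:    bits that are compared (assumption)
     *   wmask:    guest-writable bits, allowed to differ (assumption)
     *   w1cmask:  write-1-to-clear bits, allowed to differ (assumption)
     * Returns the index of the first incompatible byte, or -1 if the
     * two config spaces are compatible.
     */
    long first_config_mismatch(const uint8_t *incoming, const uint8_t *local,
                               const uint8_t *cmask, const uint8_t *wmask,
                               const uint8_t *w1cmask, size_t size)
    {
        for (size_t i = 0; i < size; i++) {
            uint8_t diff = incoming[i] ^ local[i];

            /* Only differences in checked, non-writable bits are fatal. */
            if (diff & cmask[i] & ~wmask[i] & ~w1cmask[i]) {
                return (long)i;
            }
        }
        return -1;
    }

In this model, a byte whose initial value differs between two QEMU
versions is rejected unless the differing bits are marked as writable on
the destination, which is why both sides have to initialise the config
space in the same way for a given machine type.
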

The following commit shows how this got fixed::

    commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
    Author: Leonardo Bras <leobras@redhat.com>
    Date:   Tue May 2 21:27:02 2023 -0300

        hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

        [...]

The relevant parts of the fix in QEMU are as follows:

First, we create a new property for the device to be able to configure
the old behaviour or the new behaviour::

    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
    index 8a87ccc8b0..5153ad63d6 100644
    --- a/hw/pci/pci.c
    +++ b/hw/pci/pci.c
    @@ -79,6 +79,8 @@ static Property pci_props[] = {
         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
                            failover_pair_id),
         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
         DEFINE_PROP_END_OF_LIST()
     };

Notice that we enable the feature for new machine types.

Now we see how the fix is done.
This is going to depend on what kind
of breakage happens, but in this case it is quite simple::

    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
    index 103667c368..374d593ead 100644
    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver,
                                                  uint16_t offset,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_MASK_DEFAULT);
    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_SUPPORTED);
    +
    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_MASK_DEFAULT);
    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_SUPPORTED);
    +    }

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

I.e. if the property bit is enabled, we configure it as we did for
qemu-8.0.  If the property bit is not set, we configure it as it was in 7.2.

All that is missing now is disabling the feature for old
machine types::

    diff --git a/hw/core/machine.c b/hw/core/machine.c
    index 47a34841a5..07f763eb2e 100644
    --- a/hw/core/machine.c
    +++ b/hw/core/machine.c
    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
         { "e1000e", "migrate-timadj", "off" },
         { "virtio-mem", "x-early-migration", "false" },
         { "migration", "x-preempt-pre-7-2", "true" },
    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
     };
     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);

And now, when qemu-8.0.1 is released with this fix, all combinations
are going to work as expected.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)

So normality has been restored and everything is ok, no?

Not really, now our matrix is much bigger.
We started with the easy
cases, migration from the same version to the same version always
works:

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

Now the interesting ones, when the versions of the QEMU processes are
different.  For the first set, they fail and we can do nothing: both
versions are released and we can't change anything.

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

These two are the ones that work.  The whole point of making the
change in the qemu-8.0.1 release was to fix this issue:

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

But now we find that qemu-8.0 can migrate neither to qemu-7.2 nor to
qemu-8.0.1.

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
anything except to qemu-8.0.

Can we do better?

Yes.
If we know that we are going to do this migration:

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

We can launch the appropriate devices with::

    --device...,x-pcie-err-unc-mask=on

And now we can receive a migration from 8.0.  And from now on, we can
do that migration to new machine types if we remember to enable that
property for pc-7.2.  Notice that we need to remember, it is not
enough to know that the source of the migration is qemu-8.0.  Think of
this example:

$ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2

In the second migration, the source is not qemu-8.0, but we still have
that "problem" and have that property enabled.  Notice that we need to
continue having this mark/property until we have this machine
rebooted.  But it is not a normal reboot (that doesn't reload QEMU); we
need the machine to poweroff/poweron on a fixed QEMU.  And from then
on we can use the proper real machine.
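
The way the three sources of a property value combine in this example
(the device default, the hw_compat_X_Y entry of the machine type, and an
explicit option on the command line) can be modelled as follows.  This is
a simplified sketch and not QEMU code: ``struct compat_prop``,
``parse_on_off()`` and ``resolve_bool_prop()`` are hypothetical names, the
driver is matched by exact string rather than by type hierarchy, and the
precedence order (explicit option over compat entry over default) is the
assumption the sketch illustrates.

.. code-block:: c

    #include <stdbool.h>
    #include <stddef.h>
    #include <string.h>

    /* One compat entry, same shape as the hw_compat_X_Y rows. */
    struct compat_prop {
        const char *driver;
        const char *property;
        const char *value;
    };

    /* Parse the "on"/"off" spelling used in this sketch. */
    bool parse_on_off(const char *value)
    {
        return strcmp(value, "on") == 0;
    }

    /*
     * Resolve a boolean device property (hypothetical helper):
     *   1. start from the device's built-in default,
     *   2. apply a matching machine-type compat entry, if any,
     *   3. apply the explicit user setting (NULL when not given).
     */
    bool resolve_bool_prop(const char *driver, const char *property,
                           bool device_default,
                           const struct compat_prop *compat, size_t n_compat,
                           const char *user_value)
    {
        bool value = device_default;

        for (size_t i = 0; i < n_compat; i++) {
            if (strcmp(compat[i].driver, driver) == 0 &&
                strcmp(compat[i].property, property) == 0) {
                value = parse_on_off(compat[i].value);
            }
        }
        if (user_value != NULL) {
            value = parse_on_off(user_value);
        }
        return value;
    }

With a default of ``true`` and a compat entry of ``"off"``, the model
yields ``false`` for the old machine type, and an explicit ``"on"`` brings
it back to ``true``, which corresponds to the manual override used above
to accept a migration from qemu-8.0.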