=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times, i.e. it can only restore the state into a guest that has the
same devices as the one whose state was saved (this last requirement
can be relaxed a bit, but for now we can consider that the
configuration has to be exactly the same).

Once we are able to save/restore a guest, a new feature is requested:
migration.  This means that QEMU is able to start on one machine and
be "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of that state from one machine to
another.  Live migration allows the guest to continue running while
the state is transferred.
Only while the last part of the state is transferred does
the guest have to be stopped.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of
milliseconds (note that this depends on a lot of factors).

Types of migration
==================

There are several ways to do migration:

- tcp migration: do the migration using TCP sockets
- unix migration: do the migration using Unix sockets
- exec migration: do the migration using the stdin/stdout of a process
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

All four of these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

State Live Migration
====================

This is used for RAM and block devices.  It is not yet ported to vmstate.
<Fill more information here>

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see `migration/qemu-file.h`).  In most cases this
is connected to a subtype of ``QIOChannel`` (see `io/`).

Saving the state of one device
==============================

The state of a device is saved using intermediate buffers.  There are
some helper functions to assist this saving.

There is a new concept that we have to explain here: device state
version.  When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  We use the
version to identify each such change.  Each version is associated
with the series of fields that it saves.  `save_state` always saves
the state in the newest version, but `load_state` is sometimes able to
load state from an older version.

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  typedef void SaveStateHandler(QEMUFile *f, void *opaque);
  typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);

  int register_savevm(DeviceState *dev,
                      const char *idstr,
                      int instance_id,
                      int version_id,
                      SaveStateHandler *save_state,
                      LoadStateHandler *load_state,
                      void *opaque);

The important functions for the device state format are `save_state`
and `load_state`.  Notice that `load_state` receives a version_id
parameter to know what state format it is receiving.  `save_state` doesn't
have a version_id parameter because it always uses the latest version.

VMState
-------

The legacy way of saving/loading the state of a device had the problem
that two functions had to be kept in sync.  If a change was made to one
of them but not to the other, migration would fail.

VMState changed the way that state is saved/loaded.  Instead of using
a function to save the state and another to load it, the state is now
described declaratively, as a list of the fields it consists of.
VMState then interprets that description to load/save the state.  As
the state is declared only once, the save and load paths can't go out
of sync.

An example (from hw/input/pckbd.c):

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with the name "pckbd".  The `version_id` is
3, and the fields are four ``uint8_t`` members of a ``KBDState``
structure.  We register this with:

.. code:: c

  vmstate_register(NULL, 0, &vmstate_kbd, s);

Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.

You can search for ``VMSTATE_*`` macros for lots of types used in QEMU in
``include/migration/vmstate.h``.

More about versions
-------------------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backwards-migration
compatibility; in general most changes can be made by adding Subsections
(see below) or _TEST macros (see below), which won't break compatibility.

You can see that there are several version fields:

- `version_id`: the maximum version_id supported by VMState for that device.
- `minimum_version_id`: the minimum version_id that VMState is able to understand
  for that device.
- `minimum_version_id_old`: for devices that could not be ported to vmstate, we can
  assign a function that knows how to read this old state.  This field is
  ignored if there is no `load_state_old` handler.

So, VMState is able to read versions from minimum_version_id to
version_id, and the function ``load_state_old()`` (if present) is able to
load state from minimum_version_id_old to minimum_version_id.  This
function is deprecated and will be removed when no more users are left.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.

Massaging functions
-------------------

Sometimes, it is not enough to be able to save the state directly
from one structure; we need to fill in the correct values first.  One
example is when we are using KVM.  Before saving the CPU state, we
need to ask KVM to copy the state that it is using into QEMU.
Conversely, when loading the state, we need a way to tell KVM to
load the CPU state that we have just read from the QEMUFile.

The functions that do this are hooks inside a vmstate definition:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

Example: You can look at hpet.c, which uses all three functions to
massage the state that is transferred.
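
The ordering of these hooks can be illustrated with a small, self-contained
model.  This is a sketch under stated assumptions, not QEMU code:
``ToyDevice``, ``ToyHooks``, ``toy_save()``, ``toy_load()`` and
``accel_counter`` (standing in for state that lives outside QEMU, for
example in KVM) are all hypothetical names.  It only shows that
``pre_save`` runs before the fields are written and ``post_load`` runs
after they have been read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for state held outside QEMU (e.g. by KVM);
 * not a real QEMU or KVM interface. */
static uint32_t accel_counter;

/* Hypothetical device: only 'counter' travels in the stream. */
typedef struct {
    uint32_t counter;
} ToyDevice;

/* Simplified hook table, loosely modelled on the vmstate callbacks. */
typedef struct {
    int (*pre_save)(void *opaque);
    int (*post_load)(void *opaque, int version_id);
} ToyHooks;

static int toy_pre_save(void *opaque)
{
    ToyDevice *d = opaque;

    d->counter = accel_counter;     /* pull the live value into the struct */
    return 0;
}

static int toy_post_load(void *opaque, int version_id)
{
    ToyDevice *d = opaque;

    (void)version_id;               /* only one format in this toy */
    accel_counter = d->counter;     /* push the loaded value back out */
    return 0;
}

static const ToyHooks toy_hooks = { toy_pre_save, toy_post_load };

/* Run pre_save, then serialise the field; returns bytes written, 0 on error. */
static size_t toy_save(ToyDevice *d, uint8_t *buf)
{
    assert(d != NULL && buf != NULL);
    if (toy_hooks.pre_save && toy_hooks.pre_save(d) < 0) {
        return 0;
    }
    memcpy(buf, &d->counter, sizeof(d->counter));
    return sizeof(d->counter);
}

/* Deserialise the field first, then run post_load. */
static int toy_load(ToyDevice *d, const uint8_t *buf, int version_id)
{
    assert(d != NULL && buf != NULL);
    memcpy(&d->counter, buf, sizeof(d->counter));
    return toy_hooks.post_load ? toy_hooks.post_load(d, version_id) : 0;
}
```

On the source, ``toy_save()`` first lets ``pre_save`` pull the live value
into the structure; on the destination, ``toy_load()`` fills the structure
and only then lets ``post_load`` push the value back out, mirroring the KVM
example above.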

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a `post_load` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

Subsections
-----------

The use of version_id allows migration from older versions
to newer versions of a device, but not the other way around.  This
makes it very complicated to fix bugs in stable branches.  If we need to
add anything to the state to fix a bug, we have to disable migration
to older versions that don't have that bug-fix (i.e. a new field).

But sometimes that bug-fix is only needed in some situations, not always.
For instance, if the device is in the middle of a DMA operation, it is
using a specific functionality, ....

It is impossible to make migration work from any version to
any other version.
But we can do better than only allowing
migration from older versions to newer ones.  For fields that are
only needed sometimes, we add the idea of subsections.  A subsection
is "like" a device vmstate, but with one particularity: it has a Boolean
function that tells whether its values need to be sent or not.  If
this function returns false, the subsection is not sent.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state successfully.

One important note is that the post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (VMStateField[]) {
          /* .... several fields .... */
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription*[]) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If it returns
false, the values in those fields are garbage and don't need to
be sent.

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backwards migration compatibility when
new subsections are added.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL``, e.g. support-foo, and
      default it to true.
   b) Add an entry to the ``HW_COMPAT_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add a pre_load that sets up a default value for 'foo'
      to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type, and the migration stream will be accepted by older
QEMU versions.  pre_load functions can be used to initialise state
on the newer version so that they default to suitable values
when loading streams created by older QEMU versions that do not
generate the subsection.

In some cases subsections are added for data that had been accidentally
omitted by earlier versions; if the missing data causes the migration
process to succeed but the guest to behave badly, then it may be better
to send the subsection and cause the migration to explicitly fail
with the unknown subsection error.  If the bad behaviour only happens
with certain data values, making the subsection conditional on
the data value (rather than the machine type) allows migrations to succeed
in most cases.  In general the preference is to tie the subsection to
the machine type, and allow reliable migrations, unless the behaviour
from omission of the subsection is really bad.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backwards migration compatibility.

The best way is to:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

     ``VMSTATE_UINT32(foo, barstruct)``

     becomes

     ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

     Sometime in the future, when we no longer care about the ancient versions, these can be killed off.

Return path
-----------

In most migration scenarios there is only a single data path that runs
from the source VM to the destination, typically along a single fd (although
possibly with another fd or similar for some fast way of throwing pages across).

However, some uses need two-way communication; in particular the Postcopy
destination needs to be able to request pages on demand from the source.

For these scenarios there is a 'return path' from the destination to the source;
``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound on
the amount of migration traffic and the time it takes; the down side is that during
the postcopy phase, a failure of *either* side or the network connection causes
the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to a page fault.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
Its ``postcopy-blocktime`` value shows the overlapped blocking time for all
vCPUs, and ``postcopy-vcpu-blocktime`` shows a list of blocking times per vCPU.
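
The difference between the per-vCPU values and the overlapped value can be
sketched offline from a list of fault intervals.  This assumes that the
overlapped value means the time during which *all* vCPUs were blocked at
once; it is a simplified model, not a description of how QEMU computes
blocktime, and ``FaultInterval``, ``blocktime()`` and the millisecond units
are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record: vCPU 'cpu' was blocked on a page fault from
 * 'start' until the page arrived at 'end' (assumed milliseconds). */
typedef struct {
    int cpu;
    uint64_t start;
    uint64_t end;
} FaultInterval;

/* One edge of an interval: +1 when a vCPU blocks, -1 when it resumes. */
typedef struct {
    uint64_t t;
    int delta;
} Event;

static int event_cmp(const void *a, const void *b)
{
    const Event *x = a, *y = b;

    if (x->t != y->t) {
        return x->t < y->t ? -1 : 1;
    }
    return x->delta - y->delta;   /* at equal times, resumes sort first */
}

/*
 * Fills per_vcpu[0..n_vcpus-1] with each vCPU's total blocked time and
 * returns the time during which all vCPUs were blocked simultaneously.
 * Assumes a vCPU's own intervals never overlap (it waits for one fault
 * at a time) and that end >= start.
 */
static uint64_t blocktime(const FaultInterval *iv, size_t n,
                          int n_vcpus, uint64_t *per_vcpu)
{
    uint64_t overlapped = 0;
    int blocked = 0;
    Event *ev;

    assert(n_vcpus > 0);
    for (int c = 0; c < n_vcpus; c++) {
        per_vcpu[c] = 0;
    }
    if (n == 0) {
        return 0;
    }
    ev = malloc(2 * n * sizeof(*ev));
    if (ev == NULL) {
        return 0;                 /* allocation failure: report nothing */
    }
    for (size_t i = 0; i < n; i++) {
        assert(iv[i].cpu >= 0 && iv[i].cpu < n_vcpus);
        per_vcpu[iv[i].cpu] += iv[i].end - iv[i].start;
        ev[2 * i] = (Event){ iv[i].start, +1 };
        ev[2 * i + 1] = (Event){ iv[i].end, -1 };
    }
    qsort(ev, 2 * n, sizeof(*ev), event_cmp);

    /* Sweep the timeline: the gap before event i is overlapped time
     * if every vCPU was blocked throughout it. */
    for (size_t i = 0; i < 2 * n; i++) {
        if (i > 0 && blocked == n_vcpus) {
            overlapped += ev[i].t - ev[i - 1].t;
        }
        blocked += ev[i].delta;
    }
    free(ev);
    return overlapped;
}
```

For example, with two vCPUs where vCPU 0 is blocked over [0, 10) and
[20, 30) and vCPU 1 over [5, 15), the per-vCPU totals are 20 and 10, and
the overlapped value is 5, because only [5, 10) has both vCPUs waiting.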

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_speed`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up.  This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered, the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts, the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream,
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::


  ------------------------------------------------------------------------------
                    1            2 3       4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                           |         |
                                   |     (page request)
                                   |        \___
                                   v            \
  listen thread:                   --- page -- page -- page -- page -- page --

                                   a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

  All the data associated with the package, i.e. the ``( ... )`` section in
  the diagram, is read into memory, and the main thread recurses into
  ``qemu_loadvm_state_main`` to process the contents of the package (2),
  which contains commands (3, 6) and devices (4...).

- On receipt of 'postcopy listen' (3), i.e. the first command in the package

  A new thread (a), the 'listen thread', is started and takes over servicing
  the migration stream, while the main thread carries on loading the package.
  The listen thread loads normal background page data (b), but if a fault
  happens during a device load (5), the returned page (c) is also loaded by
  the listen thread, allowing the main thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

  This lets the destination CPUs start running. At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour
  and is no longer used by migration, while the listen thread carries on
  servicing page data until the end of migration.

Postcopy states
---------------

Postcopy moves through a series of states (see ``postcopy_state``) in the
order ADVISE -> DISCARD -> LISTEN -> RUNNING -> END.

 - Advise

   Set at the start of migration if postcopy is enabled, even if postcopy
   has not yet been started. In this state the destination checks that its
   OS has the support needed for postcopy, and performs setup to ensure the
   RAM mappings are suitable for later postcopy. The destination will fail
   early in migration at this point if the required OS support is not
   present. (Triggered by reception of the POSTCOPY_ADVISE command.)

 - Discard

   Entered on receipt of the first 'discard' command. Prior to the first
   discard being performed, hugepages are switched off (using ``madvise``)
   to ensure that no new huge pages are created during the postcopy phase,
   and to cause any huge pages that have discards on them to be broken.

 - Listen

   The first command in the package, POSTCOPY_LISTEN, switches the
   destination state to Listen and starts a new thread (the 'listen thread'),
   which takes over the job of receiving pages off the migration stream,
   while the main thread carries on processing the blob. With this thread
   able to process page reception, the destination now 'sensitises' the RAM
   to detect any access to missing pages (on Linux using the 'userfault'
   system).

 - Running

   POSTCOPY_RUN causes the destination to synchronise all state and start
   the CPUs and IO devices running. The main thread then finishes processing
   the migration package and carries on as it would for normal precopy
   migration (although it can't do the cleanup it would do at the end of a
   normal migration).

 - End

   The listen thread can now quit and perform the cleanup of migration
   state; the migration is now complete.
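
The state sequence above can be summarised as a transition function on the
destination. The following is a hypothetical sketch, not QEMU's code: the
names ``pc_state``, ``pc_cmd``, ``pc_step()`` and ``pc_state_name()`` are
invented for illustration, and ``PC_CMD_FINISH`` stands in for the listen
thread reaching the end of migration. It also assumes that further discard
commands are accepted while in Discard, and that Listen may follow Advise
directly when there is nothing to discard.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical names: they follow the state list above, not QEMU's own
 * definitions. */
typedef enum {
    PC_NONE,     /* plain precopy: postcopy not advised */
    PC_ADVISE,   /* advise received, OS support checked */
    PC_DISCARD,  /* at least one discard command processed */
    PC_LISTEN,   /* listen thread now owns the migration stream */
    PC_RUNNING,  /* destination CPUs and devices started */
    PC_END       /* listen thread finished and cleaned up */
} pc_state;

typedef enum {
    PC_CMD_ADVISE,
    PC_CMD_DISCARD,
    PC_CMD_LISTEN,
    PC_CMD_RUN,
    PC_CMD_FINISH  /* hypothetical: listen thread reaches end of migration */
} pc_cmd;

/* Apply one command. Returns false (leaving *s unchanged) when the command
 * is not valid in the current state. */
static bool pc_step(pc_state *s, pc_cmd cmd)
{
    switch (cmd) {
    case PC_CMD_ADVISE:
        if (*s != PC_NONE) {
            return false;
        }
        *s = PC_ADVISE;
        return true;
    case PC_CMD_DISCARD:
        /* The first discard moves ADVISE -> DISCARD; later ones are no-ops. */
        if (*s != PC_ADVISE && *s != PC_DISCARD) {
            return false;
        }
        *s = PC_DISCARD;
        return true;
    case PC_CMD_LISTEN:
        /* Assumption: with nothing to discard, LISTEN may follow ADVISE. */
        if (*s != PC_ADVISE && *s != PC_DISCARD) {
            return false;
        }
        *s = PC_LISTEN;
        return true;
    case PC_CMD_RUN:
        if (*s != PC_LISTEN) {
            return false;
        }
        *s = PC_RUNNING;
        return true;
    case PC_CMD_FINISH:
        if (*s != PC_RUNNING) {
            return false;
        }
        *s = PC_END;
        return true;
    }
    return false;
}

/* Human-readable state name, e.g. for error messages. */
static const char *pc_state_name(pc_state s)
{
    static const char *const names[] = {
        "NONE", "ADVISE", "DISCARD", "LISTEN", "RUNNING", "END"
    };
    assert((size_t)s < sizeof(names) / sizeof(names[0]));
    return names[s];
}
```

Rejecting out-of-order commands rather than ignoring them is a design choice
in this sketch: a stream that violates the expected order most likely means
the source and destination disagree about the migration, so failing early
seems the safer option.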

Source side page maps
---------------------

The source side keeps two bitmaps during postcopy: the 'migration bitmap'
and the 'unsent map'. The 'migration bitmap' is basically the same as in
the precopy case, and holds a bit to indicate that a page is 'dirty',
i.e. needs sending. During the precopy phase this is updated as the CPU
dirties pages; however, during postcopy the source CPUs are stopped and
nothing should dirty anything any more.

The 'unsent map' is used for the transition to postcopy. It is a bitmap that
has a bit cleared whenever a page is sent to the destination. During the
transition to postcopy mode it is combined with the migration bitmap
to form the set of pages that:

  a) Have been sent but then redirtied (which must be discarded)
  b) Have not yet been sent, which also must be discarded to cause any
     transparent huge pages built during precopy to be broken.

Note that the contents of the unsent map are sacrificed during the calculation
of the discard set and thus aren't valid once in postcopy. The migration
bitmap is still valid and is used to ensure that no page is sent more than
once. Any request for a page that has already been sent is ignored. Duplicate
requests such as this can happen when a page is sent at about the same time
as the destination accesses it.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs-backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, causing (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from QEMU and
from the other processes that share the memory. There are restrictions on
the types of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on
``hugetlbfs`` (although the kernel doesn't provide an equivalent to
``madvise(MADV_DONTNEED)`` for hugetlbfs, which may be a problem in some
configurations).

The vhost-user code in QEMU supports clients that have postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have
changes to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault. The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

.. note::
  There are two future improvements that would be nice:

  a) Some way to make QEMU ignorant of the addresses in the client's
     address space
  b) Avoiding the need for QEMU to perform ufd-wake calls after the
     pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy. In vhost-user, extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example, if the
     guest memory access is made while holding a lock, then all other
     threads waiting for that lock will also be blocked.
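
The mapping table that a shared-memory client passes back to QEMU could be
used roughly as follows to convert a fault address in the client's address
space back to a RAMBlock/offset pair. This is a minimal sketch under stated
assumptions: the ``client_region`` layout and ``client_fault_to_block()`` are
hypothetical, and are not the vhost-user message format or QEMU's actual data
structures; a linear search is used for simplicity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical table entry: where one RAMBlock region appears in the
 * client's own address space. */
typedef struct {
    uint64_t client_base;   /* start address in the client's mapping */
    uint64_t size;          /* length of the region in bytes */
    const char *block_name; /* RAMBlock the region belongs to */
    uint64_t block_offset;  /* offset of the region within that RAMBlock */
} client_region;

/* Translate a fault address reported on the client's userfaultfd into a
 * RAMBlock name and offset. Returns false if no region covers the address. */
static bool client_fault_to_block(const client_region *table, size_t n,
                                  uint64_t fault_addr,
                                  const char **block_name,
                                  uint64_t *offset)
{
    assert(block_name != NULL && offset != NULL);
    for (size_t i = 0; i < n; i++) {
        const client_region *r = &table[i];
        /* Compare the distance from the base, which cannot overflow even
         * for regions near the top of the address space. */
        if (fault_addr >= r->client_base &&
            fault_addr - r->client_base < r->size) {
            *block_name = r->block_name;
            *offset = r->block_offset + (fault_addr - r->client_base);
            return true;
        }
    }
    return false;
}
```

A real implementation would also round the resulting offset down to the
RAMBlock's page size before requesting the page from the source, and might
keep the table sorted so that a binary search can be used when a client maps
many regions.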