=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is just the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times.  I.e. it can only restore the state in a guest that has the
same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration
has to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality
is requested: migration.  This means that QEMU is able to start on one
machine and be "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of that state from one machine to another.
Live migration allows the guest to continue running while the state is
transferred.  The guest only has to be stopped while the last part of
the state is transferred.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of
milliseconds (note that this depends on a lot of things).

Types of migration
==================

Now that we have talked about live migration, there are several ways
to do migration:

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

All four of these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

State Live Migration
====================

This is used for RAM and block devices.  It is not yet ported to vmstate.
<Fill more information here>

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see `migration/qemu-file.h`).  In most cases this
is connected to a subtype of ``QIOChannel`` (see `io/`).

Saving the state of one device
==============================

The state of a device is saved using intermediate buffers.  There are
some helper functions to assist this saving.

There is a new concept that we have to explain here: device state
version.  When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  We use the
version to identify each time that we make a change.  Each version is
associated with a series of fields saved.  The `save_state` always saves
the state as the newest version.  But `load_state` is sometimes able to
load state from an older version.

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm(DeviceState *dev,
                      const char *idstr,
                      int instance_id,
                      int version_id,
                      SaveStateHandler *save_state,
                      LoadStateHandler *load_state,
                      void *opaque);

  typedef void SaveStateHandler(QEMUFile *f, void *opaque);
  typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);

The important functions for the device state format are `save_state`
and `load_state`.  Notice that `load_state` receives a version_id
parameter to know which state format it is receiving.  `save_state` doesn't
have a version_id parameter because it always uses the latest version.
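
The asymmetry between the two handlers can be illustrated with a small,
self-contained sketch.  Everything in it is hypothetical: ``ToyFile`` is a
stand-in for ``QEMUFile``, ``ToyDev`` is an invented device whose
``irq_level`` field is assumed to have been added in version 2, and no real
QEMU function is called.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMUFile: a small byte buffer with a cursor. */
typedef struct ToyFile {
    uint8_t buf[16];
    size_t pos;
} ToyFile;

static void toy_put_byte(ToyFile *f, uint8_t v) { f->buf[f->pos++] = v; }
static uint8_t toy_get_byte(ToyFile *f) { return f->buf[f->pos++]; }

/* Hypothetical device: irq_level joined the state in version 2. */
typedef struct ToyDev {
    uint8_t ctrl;
    uint8_t irq_level;
} ToyDev;

enum { TOY_DEV_VERSION = 2 };

/* Save side: always writes the newest format, so no version_id argument. */
void toy_dev_save(ToyFile *f, void *opaque)
{
    ToyDev *d = opaque;
    toy_put_byte(f, d->ctrl);
    toy_put_byte(f, d->irq_level);
}

/* Load side: has to cope with every format it still accepts. */
int toy_dev_load(ToyFile *f, void *opaque, int version_id)
{
    ToyDev *d = opaque;
    if (version_id < 1 || version_id > TOY_DEV_VERSION) {
        return -1;                  /* unknown format: fail the load */
    }
    d->ctrl = toy_get_byte(f);
    /* Version 1 streams carry no irq_level; fall back to a default. */
    d->irq_level = (version_id >= 2) ? toy_get_byte(f) : 0;
    return 0;
}
```

The loader is the only place that branches on the version, which is why a
change made to one handler but not the other goes unnoticed until a
migration fails.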

VMState
-------

The legacy way of saving/loading the state of a device had the problem
that we had to keep two functions in sync.  If we made a change
in one of them and not in the other, we would get a failed migration.

VMState changed the way that state is saved/loaded.  Instead of using
a function to save the state and another to load it, it was changed to
a declarative description of what the state consists of.  VMState
interprets that definition to load/save the state.  As
the state is declared only once, it can't go out of sync in the
save/load functions.
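
To see why a single declaration cannot drift, here is a deliberately tiny
table-driven sketch.  It is not the VMState implementation: ``ToyField``,
``TOY_UINT8`` and ``ToyKbd`` are invented names, and only one-byte fields
are supported.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical one-byte field descriptor; the real VMStateField is richer. */
typedef struct ToyField {
    const char *name;
    size_t offset;      /* offset of the uint8_t member in the device struct */
} ToyField;

#define TOY_UINT8(member, type) { #member, offsetof(type, member) }
#define TOY_END_OF_LIST()       { NULL, 0 }

typedef struct ToyKbd {
    uint8_t write_cmd;
    uint8_t status;
    uint8_t mode;
} ToyKbd;

/* The state is described once... */
static const ToyField toy_kbd_fields[] = {
    TOY_UINT8(write_cmd, ToyKbd),
    TOY_UINT8(status, ToyKbd),
    TOY_UINT8(mode, ToyKbd),
    TOY_END_OF_LIST()
};

/* ...and both directions walk the same table, so they cannot diverge. */
size_t toy_save(const ToyField *fields, const void *dev, uint8_t *out)
{
    size_t n = 0;
    for (; fields->name != NULL; fields++) {
        out[n++] = *((const uint8_t *)dev + fields->offset);
    }
    return n;
}

size_t toy_load(const ToyField *fields, void *dev, const uint8_t *in)
{
    size_t n = 0;
    for (; fields->name != NULL; fields++) {
        *((uint8_t *)dev + fields->offset) = in[n++];
    }
    return n;
}
```

Adding a field means adding one line to the table; the save and load loops
pick it up together.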

An example (from hw/input/pckbd.c):

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd".
The `version_id` is 3, and there are 4 uint8_t fields in a KBDState structure.
We registered this with:

.. code:: c

    vmstate_register(NULL, 0, &vmstate_kbd, s);

Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.

You can search for ``VMSTATE_*`` macros for lots of types used in QEMU in
include/hw/hw.h.

More about versions
-------------------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backwards-migration
compatibility; in general most changes can be made by adding Subsections
(see below) or _TEST macros (see below) which won't break compatibility.

You can see that there are several version fields:

- `version_id`: the maximum version_id supported by VMState for that device.
- `minimum_version_id`: the minimum version_id that VMState is able to understand
  for that device.
- `minimum_version_id_old`: For devices that could not be fully ported to vmstate, we can
  assign a function that knows how to read this old state. This field is
  ignored if there is no `load_state_old` handler.

So, VMState is able to read versions from minimum_version_id to
version_id.  And the function ``load_state_old()`` (if present) is able to
load state from minimum_version_id_old to minimum_version_id.  This
function is deprecated and will be removed when no more users are left.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.
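
The version ranges described above amount to a three-way decision on the
incoming version.  The sketch below spells that decision out; ``ToyDesc``
and ``toy_pick_load_path()`` are hypothetical names used for illustration
only, and the treatment of the boundary values follows the prose above
rather than any particular QEMU source file.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical subset of the version fields described above. */
typedef struct ToyDesc {
    int version_id;             /* newest format; always used when saving */
    int minimum_version_id;     /* oldest format the field list can load */
    int minimum_version_id_old; /* oldest format load_state_old can load */
    bool has_load_state_old;
} ToyDesc;

typedef enum { TOY_REJECT, TOY_LOAD_FIELDS, TOY_LOAD_OLD } ToyLoadPath;

/* Decide how an incoming section with version `incoming` is handled. */
ToyLoadPath toy_pick_load_path(const ToyDesc *d, int incoming)
{
    if (incoming > d->version_id) {
        return TOY_REJECT;          /* written by a newer QEMU */
    }
    if (incoming >= d->minimum_version_id) {
        return TOY_LOAD_FIELDS;     /* the declarative description applies */
    }
    if (d->has_load_state_old && incoming >= d->minimum_version_id_old) {
        return TOY_LOAD_OLD;        /* deprecated hand-written loader */
    }
    return TOY_REJECT;
}
```

For a description with ``version_id = 3``, ``minimum_version_id = 2`` and
``minimum_version_id_old = 1``, version 4 is rejected, versions 3 and 2 use
the field list, and version 1 is only accepted when an old-style loader
exists.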

Massaging functions
-------------------

Sometimes it is not enough to save the state directly from a
structure; we first need to fill in the correct values there.  One
example is when we are using kvm.  Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using.  And the
opposite when we are loading the state: we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

Example: You can look at hpet.c, which uses all three functions to
massage the state that is transferred.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a `post_load` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()
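
As a rough illustration of the massaging hooks, the sketch below uses an
invented timer device (``ToyTimer``) whose runtime deadline is expressed
in host clock units and therefore cannot be migrated as-is.  The callback
shapes match the ``pre_save`` and ``post_load`` prototypes listed above;
the device, its fields and the ``host_now`` stand-in for a clock read are
all assumptions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical timer device.  Only `remaining` would be listed as a
 * migrated field; `deadline` depends on the local host clock. */
typedef struct ToyTimer {
    uint64_t host_now;   /* stand-in for reading the host clock */
    uint64_t deadline;   /* runtime only, in host clock units */
    uint64_t remaining;  /* migrated, host independent */
} ToyTimer;

/* Same shape as pre_save: runs before the fields are written. */
int toy_timer_pre_save(void *opaque)
{
    ToyTimer *t = opaque;
    t->remaining = t->deadline > t->host_now ? t->deadline - t->host_now : 0;
    return 0;
}

/* Same shape as post_load: runs after the fields have been read. */
int toy_timer_post_load(void *opaque, int version_id)
{
    ToyTimer *t = opaque;
    (void)version_id;    /* a single format in this sketch */
    t->deadline = t->host_now + t->remaining;
    return 0;
}
```

The same pattern applies to the memory API case: the migrated fields hold
the guest-visible configuration, and ``post_load`` re-derives whatever
host-side state follows from it.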

Subsections
-----------

The use of version_id allows migration from older versions to newer
versions of a device, but not the other way around.  This makes it
very complicated to fix bugs in stable branches.  If we need to add
anything to the state to fix a bug, we have to disable migration to
older versions that don't have that bug-fix (i.e. a new field).

But often that bug-fix is only needed in some situations, not always:
for instance, if the device is in the middle of a DMA operation, or if
it is using a specific piece of functionality, ....

It is impossible to make migration from any version to any other
version work.  But we can do better than only allowing migration from
older versions to newer ones.  For those fields that are only needed
sometimes, we add the idea of subsections.  A subsection is "like" a
device vmstate, but with one particularity: it has a Boolean function
that tells whether its values need to be sent or not.  If this function
returns false, the subsection is not sent.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state successfully.

One important note is that the post_load() function is called "after"
loading all subsections, because a newer subsection could change the
same value that it uses.

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (VMStateField[]) {
          .... several fields ....
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription*[]) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.
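
The sending and receiving rules for subsections can be modelled in a few
lines.  The sketch below is a simplification under stated assumptions:
``ToySubsection``, ``ToyIde`` and both helper functions are invented, and
subsections are matched by name only, without any field data.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, much reduced subsection descriptor. */
typedef struct ToySubsection {
    const char *name;
    bool (*needed)(void *opaque);
} ToySubsection;

/* Hypothetical device with one optional piece of state. */
typedef struct ToyIde {
    bool pio_in_progress;
} ToyIde;

static bool toy_pio_needed(void *opaque)
{
    return ((ToyIde *)opaque)->pio_in_progress;
}

static const ToySubsection toy_ide_subsections[] = {
    { "toy_ide/pio_state", toy_pio_needed },
    { NULL, NULL }
};

/* Sender: collect the names of the subsections whose needed() is true.
 * Returns how many names were stored in out. */
size_t toy_subsections_to_send(const ToySubsection *subs, void *opaque,
                               const char **out)
{
    size_t n = 0;
    for (; subs->name != NULL; subs++) {
        if (subs->needed(opaque)) {
            out[n++] = subs->name;
        }
    }
    return n;
}

/* Receiver: an incoming subsection that is missing from the local table
 * makes the load fail (-1); a known one is accepted (0). */
int toy_check_incoming(const ToySubsection *known, const char *incoming)
{
    for (; known->name != NULL; known++) {
        if (strcmp(known->name, incoming) == 0) {
            return 0;
        }
    }
    return -1;
}
```

An idle device therefore produces a stream that an older receiver can load,
while a device caught in the middle of a transfer produces one that the
older receiver refuses outright instead of loading incomplete state.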

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backwards migration compatibility when
new subsections are added.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``HW_COMPAT_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add a pre_load that sets up a default value for 'foo'
      to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.  pre_load functions can be used to initialise state
on the newer version so that it defaults to suitable values
when loading streams created by older QEMU versions that do not
generate the subsection.
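
Steps (c) to (e) can be sketched without any QEMU headers.  In the
fragment below ``ToyFooDev`` is an invented device, ``support_foo`` stands
in for the bool property from step (a), and ``TOY_FOO_DEFAULT`` is an
arbitrary value chosen for the example; the property declaration and the
compat entry from steps (a) and (b) are not shown.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device.  support_foo models a property that defaults to
 * true and is forced to false for older machine types. */
typedef struct ToyFooDev {
    bool support_foo;
    uint32_t foo;        /* carried by the new subsection */
} ToyFooDev;

enum { TOY_FOO_DEFAULT = 7 };   /* arbitrary value for this sketch */

/* Steps (c) and (d): the .needed callback only tests the property. */
bool toy_support_foo(void *opaque)
{
    ToyFooDev *d = opaque;
    return d->support_foo;
}

/* Step (e): pre_load seeds a default that stays in place when the
 * incoming stream carries no 'foo' subsection. */
int toy_foo_pre_load(void *opaque)
{
    ToyFooDev *d = opaque;
    d->foo = TOY_FOO_DEFAULT;
    return 0;
}
```

Because the predicate depends on the machine type rather than on runtime
data, whether the subsection is sent is known before migration starts.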

In some cases subsections are added for data that had been accidentally
omitted by earlier versions; if the missing data causes the migration
process to succeed but the guest to behave badly then it may be better
to send the subsection and cause the migration to explicitly fail
with the unknown subsection error.  If the bad behaviour only happens
with certain data values, making the subsection conditional on
the data value (rather than the machine type) allows migrations to succeed
in most cases.  In general the preference is to tie the subsection to
the machine type, and allow reliable migrations, unless the behaviour
from omission of the subsection is really bad.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backwards migration compatibility.

The best way is to:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

   ``VMSTATE_UINT32(foo, barstruct)``

   becomes

   ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

   Sometime in the future when we no longer care about the ancient versions these can be killed off.
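
A test function such as ``pre_version_baz`` is just a predicate that is
consulted for the field it guards.  The sketch below shows the idea with
invented types: ``ToyBar`` and its ``send_foo`` flag (standing in for the
compat property) are hypothetical, ``toy_bar_save()`` is a toy writer, and
the predicate signature is assumed to take the device pointer and the
version.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device; send_foo models a compat property that is true
 * only on machine types that still expect foo in the stream. */
typedef struct ToyBar {
    bool send_foo;
    uint32_t foo;
    uint32_t other;
} ToyBar;

/* Test callback: true means the field is part of the stream. */
bool pre_version_baz(void *opaque, int version_id)
{
    ToyBar *b = opaque;
    (void)version_id;    /* this sketch decides on the property alone */
    return b->send_foo;
}

/* Toy writer: foo is emitted only when the test passes, other always.
 * Returns the number of 32-bit words written to out. */
size_t toy_bar_save(ToyBar *b, uint32_t *out)
{
    size_t n = 0;
    if (pre_version_baz(b, 1)) {
        out[n++] = b->foo;
    }
    out[n++] = b->other;
    return n;
}
```

On a new machine type the field simply disappears from the stream, while
older machine types keep producing and accepting the original layout.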

Return path
-----------

In most migration scenarios there is only a single data path that runs
from the source VM to the destination, typically along a single fd (although
possibly with another fd or similar for some fast way of throwing pages across).

However, some uses need two-way communication; in particular the Postcopy
destination needs to be able to request pages on demand from the source.

For these scenarios there is a 'return path' from the destination to the source;
``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound on
the amount of migration traffic and the time it takes; the down side is that during
the postcopy phase, a failure of *either* side or the network connection causes
the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value in the reply shows the overlapped blocking
time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limit set using
  ``migrate_set_speed`` is ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  The
migration stream therefore has to be able to respond with page data *during*
the device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.
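
The point of the package is that the whole blob leaves the shared stream
before any device is loaded from it.  The sketch below illustrates that
with a made-up framing: a 4-byte big-endian length followed by the
payload.  That framing, and the names ``toy_read_package()`` and
``stream``, are assumptions for the example and are not a description of
the actual wire format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy a length-prefixed package out of `stream` into a private buffer.
 * Returns a malloc'ed copy (caller frees) and stores its size in
 * *len_out, or returns NULL if the stream is too short. */
uint8_t *toy_read_package(const uint8_t *stream, size_t avail,
                          size_t *len_out)
{
    if (avail < 4) {
        return NULL;
    }
    uint32_t len = ((uint32_t)stream[0] << 24) | ((uint32_t)stream[1] << 16) |
                   ((uint32_t)stream[2] << 8) | (uint32_t)stream[3];
    if (len > avail - 4) {
        return NULL;            /* truncated package */
    }
    uint8_t *copy = malloc(len ? len : 1);
    if (copy == NULL) {
        return NULL;
    }
    memcpy(copy, stream + 4, len);
    *len_out = len;
    return copy;
}
```

Once the copy exists, device loading can parse it at leisure while the
stream itself is free to carry page data for any faults the load triggers.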

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream,
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy).
However, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
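
The restart-on-request scan can be sketched as below.  This is an
illustration only: ``struct scan``, ``page_requested()`` and
``next_page_to_send()`` are hypothetical names, and a flat array of flags
stands in for the real dirty bitmap.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>

  #define NPAGES 16   /* illustrative guest size, in pages */

  /* Hypothetical source-side scan state: dirty flags plus a cursor. */
  struct scan {
      bool dirty[NPAGES];
      size_t cursor;
  };

  /* A page request from the destination moves the cursor to that page. */
  static void page_requested(struct scan *s, size_t page)
  {
      s->cursor = page % NPAGES;
  }

  /*
   * Return the next dirty page at or after the cursor (wrapping around),
   * clear its dirty flag and advance the cursor past it.  Returns -1 once
   * nothing is left to send.
   */
  static long next_page_to_send(struct scan *s)
  {
      for (size_t i = 0; i < NPAGES; i++) {
          size_t page = (s->cursor + i) % NPAGES;
          if (s->dirty[page]) {
              s->dirty[page] = false;
              s->cursor = (page + 1) % NPAGES;
              return (long)page;
          }
      }
      return -1;
  }

With pages 1, 2, 9 and 10 dirty, a request for page 9 arriving after page 1
has been sent makes the order 1, 9, 10, 2: the requested page goes next,
then its neighbour, and the scan wraps around to pick up what it skipped.
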

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into ``qemu_loadvm_state_main``
   to process the contents of the package (2), which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3) (i.e. the first command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.  It loads normal
   background page data (b), but if a fault happens during a device load (5)
   the returned page (c) is loaded by the listen thread, allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END.

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of the POSTCOPY_ADVISE command.)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    then carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
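
As a rough sketch, the state sequence above can be modelled as a forward-only
state machine.  The enum and function names here are illustrative and are not
taken from QEMU's own ``postcopy_state`` definition; the only rule beyond the
linear order is that further discard commands leave the state at Discard.

.. code:: c

  #include <stdbool.h>

  /* Illustrative names; QEMU's own enum may differ. */
  enum pc_state {
      PC_NONE,
      PC_ADVISE,
      PC_DISCARD,
      PC_LISTEN,
      PC_RUNNING,
      PC_END,
  };

  /*
   * Allow only single forward steps along
   * ADVISE->DISCARD->LISTEN->RUNNING->END, plus repeated discards.
   */
  static bool pc_transition_ok(enum pc_state cur, enum pc_state next)
  {
      if (cur == PC_DISCARD && next == PC_DISCARD) {
          return true;    /* further discard commands */
      }
      return cur != PC_END && (int)next == (int)cur + 1;
  }
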

Source side page maps
---------------------

The source side keeps two bitmaps during postcopy: the 'migration bitmap'
and the 'unsent map'.  The 'migration bitmap' is basically the same as in
the precopy case, and holds a bit per page to indicate that the page is
'dirty', i.e. needs sending.  During the precopy phase this is updated as the
CPU dirties pages; during postcopy, however, the CPUs are stopped and nothing
should dirty anything any more.

The 'unsent map' is used for the transition to postcopy.  It is a bitmap that
has a bit cleared whenever a page is sent to the destination.  During
the transition to postcopy mode it is combined with the migration bitmap
to form a set of pages that:

   a) Have been sent but then redirtied (which must be discarded)
   b) Have not yet been sent - which also must be discarded to cause any
      transparent huge pages built during precopy to be broken.

Note that the contents of the unsent map are sacrificed during the calculation
of the discard set and thus aren't valid once in postcopy.  The migration
bitmap is still valid and is used to ensure that no page is sent more than
once.  Any request for a page that has already been sent is ignored.
Duplicate requests such as this can happen as a page is sent at about the same
time the destination accesses it.
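
The discard set described above is the union of the two maps: a page is
discarded if it was never sent, or if it was sent and has been dirtied since.
The sketch below assumes word-sized bitmaps and a hypothetical
``build_discard_set()`` helper; it overwrites the unsent map with the result,
mirroring the way the unsent map's contents are sacrificed.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>

  /*
   * Compute the discard set in place, overwriting unsent[] with the result.
   * Bit set in dirty[]  : page needs sending.
   * Bit set in unsent[] : page has never been sent.
   * A page is discarded if either bit is set.
   */
  static void build_discard_set(uint64_t *unsent, const uint64_t *dirty,
                                size_t nwords)
  {
      for (size_t i = 0; i < nwords; i++) {
          unsent[i] |= dirty[i];
      }
  }
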

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, causing (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
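
As a back-of-the-envelope check of the figure in (d), assuming binary page
sizes, a link running at its full 10 Gbit/s and no protocol overhead:

.. math::

   t_{1\,\mathrm{GiB}} = \frac{2^{30} \times 8\ \mathrm{bit}}{10^{10}\ \mathrm{bit/s}} \approx 0.86\ \mathrm{s},
   \qquad
   t_{2\,\mathrm{MiB}} = \frac{2^{21} \times 8\ \mathrm{bit}}{10^{10}\ \mathrm{bit/s}} \approx 1.7\ \mathrm{ms}

So a faulting thread waits roughly 500 times longer for a 1GB page than for
a 2MB page, before any queueing behind other pages is taken into account.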

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the
types of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
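
The role of the mapping table can be sketched as a simple range lookup.  The
structure layout and the names ``client_region`` and
``client_fault_to_ramblock()`` are assumptions made for this example and do
not describe the vhost-user message format.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /* One entry per region the client has mapped (illustrative layout). */
  struct client_region {
      uint64_t client_addr;   /* start of the mapping in the client */
      uint64_t size;          /* length of the mapping in bytes */
      const char *ramblock;   /* name of the RAMBlock backing it */
      uint64_t offset;        /* offset of the mapping in that RAMBlock */
  };

  /*
   * Translate a fault address from the client's address space into a
   * RAMBlock name and offset.  Returns false if no region covers it.
   */
  static bool client_fault_to_ramblock(const struct client_region *tab,
                                       size_t n, uint64_t fault_addr,
                                       const char **ramblock,
                                       uint64_t *offset)
  {
      for (size_t i = 0; i < n; i++) {
          if (fault_addr >= tab[i].client_addr &&
              fault_addr - tab[i].client_addr < tab[i].size) {
              *ramblock = tab[i].ramblock;
              *offset = tab[i].offset + (fault_addr - tab[i].client_addr);
              return true;
          }
      }
      return false;
  }
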

.. note::
  There are two future improvements that would be nice:

  a) Some way to make QEMU ignorant of the addresses in the client's
     address space
  b) Avoiding the need for QEMU to perform ufd-wake calls after the
     pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.