=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times; i.e. it can only restore the state into a guest that has the
same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration
has to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality
is requested: migration.  This means that QEMU is able to start on one
machine and be "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of the state from one machine to another.
Live migration allows the guest to continue running while the state is
transferred.  The guest only has to be stopped while the last part of
the state is transferred.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of
milliseconds (note that this depends on a lot of things).

.. contents::

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout of a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower.  While the
internals of RDMA migration are a bit different, this isn't really visible
outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

Debugging
=========

The migration stream can be analyzed thanks to ``scripts/analyze-migration.py``.

Example usage:

.. code-block:: shell

  $ qemu-system-x86_64 -display none -monitor stdio
  (qemu) migrate "exec:cat > mig"
  (qemu) q
  $ ./scripts/analyze-migration.py -f mig
  {
    "ram (3)": {
        "section sizes": {
            "pc.ram": "0x0000000008000000",
  ...

See also ``analyze-migration.py -h`` help for more options.

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see ``migration/qemu-file.h``).  In most cases this
is connected to a subtype of ``QIOChannel`` (see ``io/``).


Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices.  The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables).  See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works.  That way if you change the implementation
  later the migration stream will stay compatible.  That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device.  These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since the
  normal response from users is that *migration broke their VM* since it had
  apparently been running fine until then.  In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.

- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in the cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code).  Check that offsets
  into buffers and the like can't cause overruns.  Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent.  For example, the order of PCI capabilities
  is required to stay constant across migration.  Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration.  Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those.  For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination.  This can result in the
  device state being loaded into the wrong device.

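The advice above on treating the incoming stream as hostile can be
sketched as a ``post_load``-style check that validates a loaded offset and
length against the buffer they index before anything uses them.  This is
only an illustration under stated assumptions: ``DemoBufState``, its fields,
``DEMO_BUF_SIZE`` and ``demo_buf_post_load()`` are hypothetical names and
are not taken from any QEMU device; only the callback signature follows the
``post_load`` callback described later in this document.

.. code:: c

  #include <errno.h>
  #include <stdint.h>

  #define DEMO_BUF_SIZE 512

  /* Hypothetical device state; not a real QEMU structure. */
  typedef struct DemoBufState {
      uint8_t buf[DEMO_BUF_SIZE];
      int32_t cur_offset;   /* loaded from the migration stream */
      int32_t cur_len;      /* loaded from the migration stream */
  } DemoBufState;

  /* Returns 0 if the loaded values are safe, -EINVAL to fail the load. */
  static int demo_buf_post_load(void *opaque, int version_id)
  {
      DemoBufState *s = opaque;

      (void)version_id;
      if (s->cur_offset < 0 || s->cur_len < 0) {
          return -EINVAL;
      }
      /* Written so that offset + len cannot overflow. */
      if (s->cur_offset > DEMO_BUF_SIZE ||
          s->cur_len > DEMO_BUF_SIZE - s->cur_offset) {
          return -EINVAL;
      }
      return 0;
  }

Returning an error here fails the incoming migration rather than letting a
corrupted stream cause an overrun later on.
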
VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from hw/input/pckbd.c):

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd".  The ``version_id`` is
3, and there are 4 uint8_t fields in the KBDState structure.  We
register this ``VMStateDescription`` with one of the following
functions.  The first one will generate a device ``instance_id``
different for each registration.  Use the second one if you already
have an id that is different for each instance of the device:

.. code:: c

    vmstate_register_any(NULL, &vmstate_kbd, s);
    vmstate_register(NULL, instance_id, &vmstate_kbd, s);

For devices that are ``qdev`` based, we can register the device in the class
init function:

.. code:: c

    dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``).  Various other macros exist for special
cases.

Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
ends up with a 4 byte big-endian representation on the wire; in the future
it might be possible to use a more structured format.

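To illustrate what "raw" means here, the following standalone sketch packs
a ``uint32_t`` into the 4 big-endian bytes described above and reads it
back.  It is not QEMU's implementation; ``demo_put_be32()`` and
``demo_get_be32()`` are hypothetical helpers written only for this example.

.. code:: c

  #include <stdint.h>

  /* Hypothetical helper: store v as 4 bytes, most significant byte first. */
  static void demo_put_be32(uint8_t out[4], uint32_t v)
  {
      out[0] = (uint8_t)(v >> 24);
      out[1] = (uint8_t)(v >> 16);
      out[2] = (uint8_t)(v >> 8);
      out[3] = (uint8_t)v;
  }

  /* Hypothetical helper: rebuild the value from its 4 big-endian bytes. */
  static uint32_t demo_get_be32(const uint8_t in[4])
  {
      return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
             ((uint32_t)in[2] << 8) | (uint32_t)in[3];
  }

Because the bytes are produced with shifts rather than by copying memory,
the result does not depend on the byte order of the host.
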
Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE;
although converting existing code can be tricky, and thus 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the ``save_state``
and ``load_state`` functions.  Notice that ``load_state`` receives a version_id
parameter to know which state format it is receiving.  ``save_state`` doesn't
have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques.  In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate.  This is best solved using a subsection.

A subsection is "like" a device vmstate, but with one particularity: it
has a Boolean function that tells whether its values need to be sent
or not.  If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked for on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state successfully.  There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases.  This is OK for data that's
critical, but in some use cases it's preferred that the migration
should succeed even with the data missing.  To support this the
subsection can be connected to a device property and from there
to a versioned machine type.

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.  A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (const VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (const VMStateField[]) {
          .... several fields ....
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription * const []) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.

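The flag technique mentioned earlier, for detecting whether a subsection
was loaded, can be sketched in isolation as follows.  This is a standalone
model rather than QEMU code: ``DemoState``, ``foo_loaded``,
``DEMO_FOO_DEFAULT`` and the three callbacks are hypothetical names, and a
real device would attach the callbacks to its ``VMStateDescription`` and to
the subsection's description.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define DEMO_FOO_DEFAULT 7u

  /* Hypothetical device state; not a real QEMU structure. */
  typedef struct DemoState {
      uint32_t foo;      /* carried by the optional subsection */
      bool foo_loaded;   /* runtime-only flag, never migrated */
  } DemoState;

  /* Outer pre_load: runs before any field is loaded. */
  static int demo_pre_load(void *opaque)
  {
      DemoState *s = opaque;

      s->foo_loaded = false;
      return 0;
  }

  /* Subsection post_load: only runs if the subsection was in the stream. */
  static int demo_foo_post_load(void *opaque, int version_id)
  {
      DemoState *s = opaque;

      (void)version_id;
      s->foo_loaded = true;
      return 0;
  }

  /* Outer post_load: runs after all subsections have been loaded. */
  static int demo_post_load(void *opaque, int version_id)
  {
      DemoState *s = opaque;

      (void)version_id;
      if (!s->foo_loaded) {
          /* Stream came without the subsection: fall back to a default. */
          s->foo = DEMO_FOO_DEFAULT;
      }
      return 0;
  }

The sketch relies on the ordering described above: the outer ``post_load``
only sees the final value of the flag because it is called after all
subsections.
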
Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

   ``VMSTATE_UINT32(foo, barstruct)``

   becomes

   ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

   Sometime in the future when we no longer care about the ancient versions these can be killed off.
   Note that for backward compatibility it's important to fill in the structure with
   data that the destination will understand.

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved.  ``save_state``
always saves the state as the newest version, but ``load_state`` is
sometimes able to load state from an older version.

You can see that there are two version fields:

- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
  for that device.

VMState is able to read versions from minimum_version_id to version_id.

There are *_V* forms of many ``VMSTATE_`` macros for version dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),

only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.

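The version rules above can be modelled with a small standalone sketch.
It is not QEMU's loader: ``DemoDesc``, ``DemoField``, ``demo_version_ok()``
and ``demo_field_present()`` are hypothetical names that only mirror the
``version_id``/``minimum_version_id`` range check and the behaviour of a
``_V`` field as described in this section.

.. code:: c

  #include <stdbool.h>

  /* Hypothetical stand-in for the two version fields of a description. */
  typedef struct DemoDesc {
      int version_id;          /* newest version; always used when saving */
      int minimum_version_id;  /* oldest version that can still be loaded */
  } DemoDesc;

  /* Hypothetical stand-in for a field declared with a _V macro. */
  typedef struct DemoField {
      const char *name;
      int version_id;          /* first version that contains this field */
  } DemoField;

  /* An incoming section is accepted only if its version is in range. */
  static bool demo_version_ok(const DemoDesc *d, int incoming)
  {
      return incoming >= d->minimum_version_id && incoming <= d->version_id;
  }

  /* A _V-style field is read only if the stream is new enough to have it. */
  static bool demo_field_present(const DemoField *f, int incoming)
  {
      return incoming >= f->version_id;
  }

In this model, a section saved with version 4 fails the range check on a
loader whose ``version_id`` is 3, which is why bumping the version breaks
backward migration.
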
Massaging functions
-------------------

Sometimes it is not enough to save the state directly from one
structure; we first need to fill in the correct values there.  One
example is when we are using KVM.  Before saving the CPU state, we
need to ask KVM to copy to QEMU the state that it is using.  The
opposite applies when we are loading the state: we need a way to tell
KVM to load the CPU state that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.

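The following self-contained sketch illustrates the idea behind ``pre_save``
and ``post_load``.  It does not use QEMU's real ``VMStateDescription``:
``ToyVMSD``, ``ToyTimer`` and the ``toy_*`` helpers are hypothetical stand-ins
invented for this illustration, and the "stream" is reduced to a single
integer.

.. code:: c

  #include <stdint.h>
  #include <stddef.h>

  /* Hypothetical, heavily simplified stand-in for a vmstate definition. */
  typedef struct ToyVMSD {
      int (*pre_save)(void *opaque);
      int (*post_load)(void *opaque, int version_id);
  } ToyVMSD;

  /* Hypothetical device whose authoritative state lives elsewhere
   * at run time (think of CPU state held by KVM). */
  typedef struct ToyTimer {
      uint64_t live_counter;   /* run-time value, not migrated directly */
      uint64_t saved_counter;  /* the field that goes into the stream */
  } ToyTimer;

  /* Before saving: copy the run-time value into the migrated field. */
  static int toy_timer_pre_save(void *opaque)
  {
      ToyTimer *t = opaque;
      t->saved_counter = t->live_counter;
      return 0;
  }

  /* After loading: push the migrated field back into the run-time value. */
  static int toy_timer_post_load(void *opaque, int version_id)
  {
      ToyTimer *t = opaque;
      (void)version_id;
      t->live_counter = t->saved_counter;
      return 0;
  }

  static const ToyVMSD toy_timer_vmsd = {
      .pre_save = toy_timer_pre_save,
      .post_load = toy_timer_post_load,
  };

  /* Toy "save": run the hook, then emit the migrated field. */
  static int toy_save(const ToyVMSD *d, ToyTimer *t, uint64_t *wire)
  {
      int ret = d->pre_save ? d->pre_save(t) : 0;
      if (ret) {
          return ret;
      }
      *wire = t->saved_counter;
      return 0;
  }

  /* Toy "load": fill the migrated field, then run the hook. */
  static int toy_load(const ToyVMSD *d, ToyTimer *t, uint64_t wire)
  {
      t->saved_counter = wire;
      return d->post_load ? d->post_load(t, 1) : 0;
  }

The point of the design is that the save and load code only ever touches the
migrated field; the hooks are the single place where run-time state is
synchronised with it.
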
The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a ``post_load`` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

Iterative device migration
--------------------------

Some devices, such as RAM, block storage or certain platform devices,
have large amounts of data that would mean that the CPUs would be
paused for too long if they were sent in one section.  For these
devices an *iterative* approach is taken.

The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use
qemu_put_*/qemu_get_* functions to write/read data to/from the stream.
Specialist versions exist for high bandwidth IO.

An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device.  In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``state_pending_exact`` function that indicates how much more
    data we must save.  The core migration code will use this to
    determine when to pause the CPUs and complete the migration.

  - A ``state_pending_estimate`` function that indicates how much more
    data we must save.  When the estimated amount is smaller than the
    threshold, we call ``state_pending_exact``.

  - A ``save_live_iterate`` function that sends a chunk of data until
    the point that stream bandwidth limits tell it to stop.  Each call
    generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.

Note that the contents of the sections for iterative migration tend
to be open-coded by the devices; care should be taken in parsing
the results and structuring the stream to make them easy to validate.

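How the save-side callbacks listed above fit together can be sketched with a
small simulation.  This is not the core migration loop: ``ToyIterDev``, the
``toy_*`` functions, the byte budget and the re-dirtying model are all
assumptions made up for the illustration; only the callback names they mimic
come from the list above.

.. code:: c

  #include <stdint.h>

  /* Hypothetical model of an iterative device. */
  typedef struct ToyIterDev {
      uint64_t dirty_bytes;       /* data still to be sent */
      uint64_t redirty_per_pass;  /* bytes the running guest dirties per pass */
      unsigned sections;          /* sections emitted so far */
  } ToyIterDev;

  /* save_setup: emits the first section describing the device. */
  static void toy_save_setup(ToyIterDev *d)
  {
      d->sections = 1;
  }

  /* In this toy both queries are identical; a real device would make the
   * estimate cheap and reserve the expensive work for the exact query. */
  static uint64_t toy_state_pending_estimate(const ToyIterDev *d)
  {
      return d->dirty_bytes;
  }

  static uint64_t toy_state_pending_exact(const ToyIterDev *d)
  {
      return d->dirty_bytes;
  }

  /* save_live_iterate: send up to 'budget' bytes, one section per call,
   * while the guest keeps running and dirties more data. */
  static void toy_save_live_iterate(ToyIterDev *d, uint64_t budget)
  {
      uint64_t sent = d->dirty_bytes < budget ? d->dirty_bytes : budget;
      d->dirty_bytes -= sent;
      d->dirty_bytes += d->redirty_per_pass;
      d->sections++;
  }

  /* save_live_complete_precopy: CPUs are paused, so everything that is
   * left goes out in the last section and nothing is re-dirtied. */
  static void toy_save_live_complete_precopy(ToyIterDev *d)
  {
      d->dirty_bytes = 0;
      d->sections++;
  }

  /* Returns the number of iterate passes that were needed.  'max_passes'
   * bounds the loop for workloads that dirty data faster than it is sent. */
  static unsigned toy_migrate(ToyIterDev *d, uint64_t budget,
                              uint64_t threshold, unsigned max_passes)
  {
      unsigned passes = 0;

      toy_save_setup(d);
      while (passes < max_passes) {
          /* The exact query only runs once the estimate is below threshold. */
          if (toy_state_pending_estimate(d) < threshold &&
              toy_state_pending_exact(d) < threshold) {
              break;
          }
          toy_save_live_iterate(d, budget);
          passes++;
      }
      toy_save_live_complete_precopy(d);
      return passes;
  }

In this model the loop only converges when the per-pass budget exceeds the
amount the guest re-dirties per pass, which is the same convergence problem
that motivates throttling and postcopy.
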
Device ordering
---------------

There are cases in which the ordering of device loading matters; for
example, on some systems a device may assert an interrupt during loading,
and if the interrupt controller is loaded later it might lose that state.

Some ordering is implicitly provided by the order in which the machine
definition creates devices; however, this is somewhat fragile.

The ``MigrationPriority`` enum provides a means of explicitly enforcing
ordering.  Numerically higher priorities are loaded earlier.
The priority is set by setting the ``priority`` field of the top level
``VMStateDescription`` for the device.

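The rule "numerically higher priorities are loaded earlier" amounts to keeping
the list of registered devices sorted by descending priority.  The sketch
below shows one way to do that while preserving registration order among
equal priorities.  ``ToyEntry``, ``toy_register`` and the two priority values
are hypothetical; they are not the members of ``MigrationPriority``.

.. code:: c

  #include <stddef.h>

  /* Hypothetical priority values, for illustration only. */
  enum {
      TOY_PRIO_DEFAULT = 0,
      TOY_PRIO_INTC    = 10,
  };

  typedef struct ToyEntry {
      const char *name;
      int priority;
  } ToyEntry;

  /* Insert 'e' into list[0..*n) so that the list stays sorted by
   * descending priority.  Entries with equal priority keep their
   * registration order.  The caller guarantees there is room for
   * one more element. */
  static void toy_register(ToyEntry *list, size_t *n, ToyEntry e)
  {
      size_t pos = *n;

      /* Shift every strictly lower-priority entry one slot down. */
      while (pos > 0 && list[pos - 1].priority < e.priority) {
          list[pos] = list[pos - 1];
          pos--;
      }
      list[pos] = e;
      (*n)++;
  }

Registering ``nic`` (default), then ``intc`` (higher), then ``disk``
(default) yields the order ``intc``, ``nic``, ``disk``, so the interrupt
controller is handled before the devices that may raise interrupts.
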
Stream structure
================

The stream tries to be word and endian agnostic, allowing migration between hosts
of different characteristics running the same VM.

  - Header

    - Magic
    - Version
    - VM configuration section

       - Machine type
       - Target page bits
  - List of sections
    Each section contains a device, or one iteration of a device save.

    - section type
    - section id
    - ID string (First section of each device)
    - instance id (First section of each device)
    - version id (First section of each device)
    - <device data>
    - Footer mark
  - EOF mark
  - VM Description structure
    Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above.  Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that's up to the devices themselves.
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well.  Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
for otherwise identically named devices.

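As a rough illustration of the section layout listed above, the sketch below
serialises one complete section into a byte buffer.  The field widths, the
length-prefixed ID string, the numeric values of ``TOY_SECTION_FULL`` and
``TOY_SECTION_FOOTER``, and the repetition of the section id after the footer
mark are assumptions made for this example; the authoritative encoding is
whatever the migration code actually writes.

.. code:: c

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* Assumed values, for illustration only. */
  enum {
      TOY_SECTION_FULL   = 0x04,
      TOY_SECTION_FOOTER = 0x7e,
  };

  /* Store a 32-bit value most significant byte first, so the stream
   * does not depend on host endianness. */
  static size_t toy_put_be32(uint8_t *buf, uint32_t v)
  {
      buf[0] = (uint8_t)(v >> 24);
      buf[1] = (uint8_t)(v >> 16);
      buf[2] = (uint8_t)(v >> 8);
      buf[3] = (uint8_t)v;
      return 4;
  }

  /* Write one section and return the number of bytes produced.
   * The caller guarantees that 'buf' is large enough and that
   * 'idstr' is shorter than 256 bytes. */
  static size_t toy_put_section(uint8_t *buf, uint32_t section_id,
                                const char *idstr, uint32_t instance_id,
                                uint32_t version_id,
                                const uint8_t *data, size_t data_len)
  {
      size_t off = 0;
      size_t id_len = strlen(idstr);

      buf[off++] = TOY_SECTION_FULL;             /* section type */
      off += toy_put_be32(buf + off, section_id);
      buf[off++] = (uint8_t)id_len;              /* ID string length */
      memcpy(buf + off, idstr, id_len);          /* ID string */
      off += id_len;
      off += toy_put_be32(buf + off, instance_id);
      off += toy_put_be32(buf + off, version_id);
      memcpy(buf + off, data, data_len);         /* <device data> */
      off += data_len;
      buf[off++] = TOY_SECTION_FOOTER;           /* footer mark */
      off += toy_put_be32(buf + off, section_id);
      return off;
  }

A reader that consumes too much or too little device data will not find the
footer mark where it expects it, which is the limited protection the text
above refers to.
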
Return path
-----------

Only a unidirectional stream is required for normal migration, however a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Dirty limit
===========

The dirty limit, short for dirty page rate upper limit, is a new capability
introduced in the 8.1 QEMU release that uses a new algorithm based on the KVM
dirty ring to throttle down the guest during live migration.

The algorithm framework is as follows:

::

  ------------------------------------------------------------------------------
  main   --------------> throttle thread ------------> PREPARE(1) <--------
  thread  \                                                |              |
           \                                               |              |
            \                                              V              |
             -\                                        CALCULATE(2)       |
               \                                           |              |
                \                                          |              |
                 \                                         V              |
                  \                                    SET PENALTY(3) -----
                   -\                                      |
                     \                                     |
                      \                                    V
                       -> virtual CPU thread -------> ACCEPT PENALTY(4)
  ------------------------------------------------------------------------------

When the QMP command qmp_set_vcpu_dirty_limit is called for the first time,
the QEMU main thread starts the throttle thread. The throttle thread, once
launched, executes a loop that consists of three steps:

  - PREPARE (1)

     The entire work of PREPARE (1) is preparation for the second stage,
     CALCULATE (2), as the name implies. It involves preparing the dirty
     page rate value and the corresponding upper limit of the VM.
     The dirty page rate is calculated via the KVM dirty ring mechanism,
     which tells QEMU how many dirty pages a virtual CPU has had since the
     last KVM_EXIT_DIRTY_RING_FULL exception. The dirty page rate upper
     limit is specified by the caller, so it is fetched directly.

  - CALCULATE (2)

     Calculate a suitable sleep period for each virtual CPU, which will be
     used to determine the penalty for the target virtual CPU. The
     computation must be done carefully in order to reduce the dirty page
     rate progressively down to the upper limit without oscillation. To
     achieve this, two strategies are provided: the first is to add or
     subtract sleep time based on the ratio of the current dirty page rate
     to the limit, which is used when the current dirty page rate is far
     from the limit; the second is to add or subtract a fixed time when
     the current dirty page rate is close to the limit.

  - SET PENALTY (3)

     Set the sleep time for each virtual CPU that should be penalized based
     on the results of the calculation supplied by step CALCULATE (2).

After completing the three stages above, the throttle thread loops back
to step PREPARE (1) until the dirty limit is reached.

Meanwhile, each virtual CPU thread reads the sleep duration and sleeps in
the path of the KVM_EXIT_DIRTY_RING_FULL exception handler; this is
ACCEPT PENALTY (4). Virtual CPUs running write-heavy processes will
take that exit path and get penalized, whereas virtual CPUs running
read-mostly processes will not.

In summary, thanks to the KVM dirty ring technology, the dirty limit
algorithm restricts virtual CPUs only as needed to keep their dirty page
rate inside the limit. This leads to steadier read performance during
live migration and can help improve the responsiveness of large guests.

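The two strategies of CALCULATE (2) can be sketched as a single function that
picks a proportional step when the rate is far from the limit and a fixed step
when it is close.  This is not QEMU's formula: ``toy_next_sleep_us``, the
meaning of "far" (``TOY_FAR_PCT``), the step sizes and the clamp are all
hypothetical values chosen for the example.

.. code:: c

  #include <stdint.h>

  /* Hypothetical tuning constants. */
  #define TOY_FAR_PCT        50       /* "far" = more than 50% off the limit */
  #define TOY_RATIO_BASE_US  1000     /* scaled by the relative distance */
  #define TOY_FIXED_STEP_US  100      /* used when close to the limit */
  #define TOY_MAX_SLEEP_US   1000000  /* upper bound on the penalty */

  /* Return the new per-vCPU sleep time given the current one, the
   * measured dirty page rate and the configured upper limit. */
  static int64_t toy_next_sleep_us(int64_t sleep_us, uint64_t rate,
                                   uint64_t limit)
  {
      uint64_t diff, pct;
      int64_t step;

      if (limit == 0) {
          return TOY_MAX_SLEEP_US;   /* nothing may be dirtied at all */
      }
      if (rate == limit) {
          return sleep_us;           /* already on target */
      }

      diff = rate > limit ? rate - limit : limit - rate;
      pct = diff * 100 / limit;      /* distance relative to the limit */

      if (pct > TOY_FAR_PCT) {
          /* Far from the limit: step proportional to the distance. */
          step = (int64_t)(TOY_RATIO_BASE_US * pct / 100);
      } else {
          /* Close to the limit: small fixed step to avoid oscillation. */
          step = TOY_FIXED_STEP_US;
      }

      sleep_us += rate > limit ? step : -step;

      if (sleep_us < 0) {
          sleep_us = 0;
      } else if (sleep_us > TOY_MAX_SLEEP_US) {
          sleep_us = TOY_MAX_SLEEP_US;
      }
      return sleep_us;
  }

The large steps bring a badly behaved vCPU near the limit quickly, and the
small fixed steps then let it settle without overshooting in both directions.
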
Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound on
the amount of migration traffic and time it takes; the down side is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to a page fault.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the query-migrate QMP command.
The postcopy-blocktime value in the reply shows the overlapped blocking
time for all vCPUs; postcopy-vcpu-blocktime shows the list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  The
migration stream therefore has to be able to respond with page data *during*
the device load, so the device data has to be read from the stream completely
before the device load begins, to free the stream up.  This is achieved by
'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream
     containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy),
however when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.

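The "restart the scan at the requested page" behaviour can be modelled with a
dirty map and a scan cursor.  The sketch below is a toy: ``ToySource``,
``toy_page_request`` and ``toy_send_next`` are hypothetical names, the map is
a plain array of flags rather than a real bitmap, and "sending" a page just
clears its flag.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>

  #define TOY_NPAGES 16

  typedef struct ToySource {
      bool dirty[TOY_NPAGES];  /* true = page still has to be sent */
      size_t cursor;           /* where the background scan resumes */
  } ToySource;

  /* A page request from the destination moves the scan cursor to the
   * requested page, so it is the next one to go out. */
  static void toy_page_request(ToySource *s, size_t page)
  {
      s->cursor = page % TOY_NPAGES;
  }

  /* Send the next dirty page at or after the cursor, wrapping around.
   * Returns the page index, or -1 when nothing is left to send. */
  static long toy_send_next(ToySource *s)
  {
      for (size_t i = 0; i < TOY_NPAGES; i++) {
          size_t page = (s->cursor + i) % TOY_NPAGES;

          if (s->dirty[page]) {
              s->dirty[page] = false;              /* sent: no longer dirty */
              s->cursor = (page + 1) % TOY_NPAGES; /* continue right after it */
              return (long)page;
          }
      }
      return -1;
  }

With pages 1, 2, 9 and 10 dirty, a request for page 9 that arrives after
page 1 was sent makes the send order 1, 9, 10, 2: the requested page and its
neighbour jump the queue, and the scan then wraps around to what was skipped.
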
Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3) (i.e. the first command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.   It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens, QEMU on both sides
goes into a paused state.  A recovery phase is then needed to continue
the paused postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMU instances go into the PAUSED state

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.

During a paused postcopy migration, the VM can logically still continue
running, and it is not impacted by accesses to pages that were already
migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the
VM thread will be halted waiting for the page to be migrated; this means it
can remain halted until the recovery is complete.

The impact of accessing missing pages depends on the configuration of the
guest.  For example, with async page fault enabled, logically the guest can
proactively schedule out the threads accessing missing pages.

8362e3c8f8dSDr. David Alan GilbertPostcopy states
8372e3c8f8dSDr. David Alan Gilbert---------------
8382e3c8f8dSDr. David Alan Gilbert
8392e3c8f8dSDr. David Alan GilbertPostcopy moves through a series of states (see postcopy_state) from
8402e3c8f8dSDr. David Alan GilbertADVISE->DISCARD->LISTEN->RUNNING->END
8412e3c8f8dSDr. David Alan Gilbert
8422e3c8f8dSDr. David Alan Gilbert - Advise
8432e3c8f8dSDr. David Alan Gilbert
8442e3c8f8dSDr. David Alan Gilbert    Set at the start of migration if postcopy is enabled, even
8452e3c8f8dSDr. David Alan Gilbert    if it hasn't had the start command; here the destination
8462e3c8f8dSDr. David Alan Gilbert    checks that its OS has the support needed for postcopy, and performs
8472e3c8f8dSDr. David Alan Gilbert    setup to ensure the RAM mappings are suitable for later postcopy.
8482e3c8f8dSDr. David Alan Gilbert    The destination will fail early in migration at this point if the
8492e3c8f8dSDr. David Alan Gilbert    required OS support is not present.
8502e3c8f8dSDr. David Alan Gilbert    (Triggered by reception of POSTCOPY_ADVISE command)
8512e3c8f8dSDr. David Alan Gilbert
8522e3c8f8dSDr. David Alan Gilbert - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when
    it happens), where all threads are temporarily halted, mostly due to
    network errors.  When the paused state is reached, migration makes sure
    that the QEMU binaries on both sides keep their data without corrupting
    the VM.  To continue the migration, the admin needs to fix the
    migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command
    'migrate' again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.
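
As a rough illustration of the destination states described above, the
following C sketch models the order in which they are entered.  The enum and
function names are hypothetical and are not QEMU's own identifiers, and the set
of allowed transitions is an assumption drawn only from the descriptions in
this list (earlier states are left out).

```c
#include <stdbool.h>

/* Hypothetical names for illustration; not QEMU's real enum. */
typedef enum {
    DEST_DISCARD,
    DEST_LISTEN,
    DEST_RUNNING,
    DEST_PAUSED,
    DEST_END,
} DestPostcopyState;

/* Returns true if this sketch allows moving from 'from' to 'to'. */
bool dest_transition_ok(DestPostcopyState from, DestPostcopyState to)
{
    switch (from) {
    case DEST_DISCARD:
        return to == DEST_LISTEN;   /* POSTCOPY_LISTEN received */
    case DEST_LISTEN:
        return to == DEST_RUNNING;  /* POSTCOPY_RUN received */
    case DEST_RUNNING:
        /* Either the channel breaks or the migration completes. */
        return to == DEST_PAUSED || to == DEST_END;
    case DEST_PAUSED:
        return to == DEST_RUNNING;  /* channel recovered and resumed */
    case DEST_END:
        return false;               /* terminal state */
    }
    return false;
}
```

For instance, ``dest_transition_ok(DEST_PAUSED, DEST_RUNNING)`` is true in this
model, reflecting that a paused migration continues once the channel has been
recovered and the source has resumed.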

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy:
each bit indicates whether a page is 'dirty', i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages; during postcopy, however, the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.
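
The bookkeeping can be pictured with a small bitmap.  This is only a sketch:
``DirtyMap`` and its helpers are hypothetical names, not QEMU's bitmap API, and
it assumes one bit per guest page.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* One bit per guest page; a set bit means the page still needs sending. */
typedef struct {
    unsigned long *words;
    size_t nr_pages;
} DirtyMap;

/* Precopy: called when the guest dirties a page. */
void dirty_map_set(DirtyMap *m, size_t page)
{
    m->words[page / WORD_BITS] |= 1UL << (page % WORD_BITS);
}

/* Postcopy: clear the bit as the page is sent; returns whether it was dirty. */
bool dirty_map_test_and_clear(DirtyMap *m, size_t page)
{
    unsigned long mask = 1UL << (page % WORD_BITS);
    unsigned long *word = &m->words[page / WORD_BITS];
    bool was_dirty = (*word & mask) != 0;

    *word &= ~mask;
    return was_dirty;
}

/* Number of pages that still need sending. */
size_t dirty_map_count(const DirtyMap *m)
{
    size_t count = 0;

    for (size_t page = 0; page < m->nr_pages; page++) {
        if (m->words[page / WORD_BITS] & (1UL << (page % WORD_BITS))) {
            count++;
        }
    }
    return count;
}
```

In this model the count can only go down once postcopy has started, since
nothing calls ``dirty_map_set()`` while the source CPUs are stopped.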

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, causing (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
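
The figure in (d) is simple arithmetic.  The helper below is a hypothetical
illustration, not QEMU code, and assumes an otherwise idle link with no
protocol overhead.

```c
#include <stdint.h>

/*
 * Seconds needed to move one page of 'page_bytes' over a link running at
 * 'link_bits_per_sec', ignoring protocol overhead and competing traffic.
 */
double page_transfer_seconds(uint64_t page_bytes, double link_bits_per_sec)
{
    return (double)page_bytes * 8.0 / link_bits_per_sec;
}
```

With these assumptions a 1GB (2^30 byte) page on a 10Gbps link takes about
0.86 seconds, while a 2MB page takes under 2 milliseconds, which is why the
smaller page size blocks the faulting thread for far less time.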

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
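
The address conversion that the mapping table enables could look roughly like
the sketch below.  ``ClientRegion`` and ``client_fault_to_ramblock()`` are
hypothetical names chosen for illustration; the actual table layout used by
vhost-user is not described here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One entry of the (hypothetical) mapping table supplied by the client. */
typedef struct {
    uint64_t client_addr;      /* start of the region in the client's address space */
    uint64_t size;             /* length of the region in bytes */
    const char *ramblock;      /* name of the RAMBlock backing the region */
    uint64_t ramblock_offset;  /* offset of the region inside that RAMBlock */
} ClientRegion;

/*
 * Convert a fault address reported on the client's userfaultfd into a
 * RAMBlock name and offset.  Returns false if no region covers the address.
 */
bool client_fault_to_ramblock(const ClientRegion *table, size_t n,
                              uint64_t fault_addr,
                              const char **ramblock, uint64_t *offset)
{
    for (size_t i = 0; i < n; i++) {
        const ClientRegion *r = &table[i];

        if (fault_addr >= r->client_addr &&
            fault_addr - r->client_addr < r->size) {
            *ramblock = r->ramblock;
            *offset = r->ramblock_offset + (fault_addr - r->client_addr);
            return true;
        }
    }
    return false;
}
```

Once the address has been translated this way, the page can be requested from
the source exactly as for a fault taken by QEMU itself.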

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy Preemption Mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those explicitly requested by the destination QEMU
after a page fault) to be sent in a separate preempt channel, rather than
queued in the background migration channel.  Anyone who cares about latencies
of page faults during a postcopy migration should enable this feature.  By
default, it's not enabled.

Firmware
========

Migration migrates the copies of RAM and ROM, and thus when running
on the destination it includes the firmware from the source. Even after
resetting a VM, the old firmware is used.  Only once QEMU has been restarted
is the new firmware in use.

- Changes in firmware size can cause changes in the required RAMBlock size
  to hold the firmware and thus migration can fail.  In practice it's best
  to pad firmware images to convenient powers of 2 with plenty of space
  for growth.

- Care should be taken with device emulation code so that newer
  emulation code can work with older firmware to allow forward migration.

- Care should be taken with newer firmware so that backward migration
  to older systems with older device emulation code will work.

In some cases it may be best to tie specific firmware versions to specific
versioned machine types to cut down on the combinations that will need
support.  This is also useful when newer versions of firmware outgrow
the padding.
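
The padding advice amounts to rounding the image size up to a power of 2.  A
minimal sketch, with a hypothetical helper name and no claim about how any
particular firmware build does it:

```c
#include <stdint.h>

/*
 * Smallest power of 2 that is >= size (1 for size 0).  Returns 0 if the
 * result would not fit in 64 bits.
 */
uint64_t next_pow2(uint64_t size)
{
    uint64_t padded = 1;

    if (size > (UINT64_C(1) << 63)) {
        return 0;
    }
    while (padded < size) {
        padded <<= 1;
    }
    return padded;
}
```

A 1.5MB image would be padded to 2MB under this rule.  Since that is only the
minimum, choosing the next power of 2 above it leaves more room for growth
before the RAMBlock size has to change.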
996edd70806SDr. David Alan Gilbert
9971aefe2caSJuan Quintela
9981aefe2caSJuan QuintelaBackwards compatibility
9991aefe2caSJuan Quintela=======================
10001aefe2caSJuan Quintela
10011aefe2caSJuan QuintelaHow backwards compatibility works
10021aefe2caSJuan Quintela---------------------------------
10031aefe2caSJuan Quintela
10041aefe2caSJuan QuintelaWhen we do migration, we have two QEMU processes: the source and the
10051aefe2caSJuan Quintelatarget.  There are two cases, they are the same version or they are
10061aefe2caSJuan Quinteladifferent versions.  The easy case is when they are the same version.
10071aefe2caSJuan QuintelaThe difficult one is when they are different versions.
10081aefe2caSJuan Quintela
10091aefe2caSJuan QuintelaThere are two things that are different, but they have very similar
10101aefe2caSJuan Quintelanames and sometimes get confused:
10111aefe2caSJuan Quintela
10121aefe2caSJuan Quintela- QEMU version
10131aefe2caSJuan Quintela- machine type version
10141aefe2caSJuan Quintela
10151aefe2caSJuan QuintelaLet's start with a practical example, we start with:
10161aefe2caSJuan Quintela
10171aefe2caSJuan Quintela- qemu-system-x86_64 (v5.2), from now on qemu-5.2.
10181aefe2caSJuan Quintela- qemu-system-x86_64 (v5.1), from now on qemu-5.1.
10191aefe2caSJuan Quintela
10201aefe2caSJuan QuintelaRelated to this are the "latest" machine types defined on each of
10211aefe2caSJuan Quintelathem:
10221aefe2caSJuan Quintela
10231aefe2caSJuan Quintela- pc-q35-5.2 (newer one in qemu-5.2) from now on pc-5.2
10241aefe2caSJuan Quintela- pc-q35-5.1 (newer one in qemu-5.1) from now on pc-5.1
10251aefe2caSJuan Quintela
10261aefe2caSJuan QuintelaFirst of all, migration is only supposed to work if you use the same
10271aefe2caSJuan Quintelamachine type in both source and destination. The QEMU hardware
10281aefe2caSJuan Quintelaconfiguration needs to be the same also on source and destination.
10291aefe2caSJuan QuintelaMost aspects of the backend configuration can be changed at will,
10301aefe2caSJuan Quintelaexcept for a few cases where the backend features influence frontend
10311aefe2caSJuan Quinteladevice feature exposure.  But that is not relevant for this section.

Let's list the combinations that we can have, starting with the
trivial ones, where QEMU is the same on source and
destination:

1 - qemu-5.2 -M pc-5.2  -> migrates to -> qemu-5.2 -M pc-5.2

  This is the latest QEMU with the latest machine type.
  This has to work, and if it doesn't work it is a bug.

2 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

  Exactly the same case as the previous one, but for 5.1.
  Nothing to see here either.

These are the easiest ones, and we will not talk more about them in this
section.

Now we start with the more interesting cases.  Consider the case where
we have the same QEMU version on both sides (qemu-5.2), but we are not
using the latest machine type for that version (pc-5.2) but one from an
older QEMU version, in this case pc-5.1.

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

  It needs to use the definition of pc-5.1 and the devices as they
  were configured on 5.1, but this should be easy in the sense that
  both sides are the same QEMU and both sides have exactly the same
  idea of what the pc-5.1 machine is.

4 - qemu-5.1 -M pc-5.2  -> migrates to -> qemu-5.1 -M pc-5.2

  This combination is not possible, as qemu-5.1 doesn't understand the
  pc-5.2 machine type.  So there is nothing to worry about here.

Now come the interesting ones, where the two QEMU processes are
different versions.  Notice also that the machine type needs to be pc-5.1,
because we have the limitation that qemu-5.1 doesn't know pc-5.2.  So
the possible cases are:

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

  This migration is known as newer to older.  While developing 5.2 we
  need to take care not to break
  migration to qemu-5.1.  Notice that we can't make updates to
  qemu-5.1 to understand whatever qemu-5.2 decides to change, so it is
  up to the qemu-5.2 side to make the relevant changes.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

  This migration is known as older to newer.  We need to make sure
  that we are able to receive migrations from qemu-5.1. The problem is
  similar to the previous one.

If qemu-5.1 and qemu-5.2 were the same, there would not be any
compatibility problems.  But the reason that we create qemu-5.2 is to
get new features, devices, defaults, etc.

If we get a device that has a new feature, or change a default value,
we have a problem when we try to migrate between different QEMU
versions.

So we need a way to tell qemu-5.2 that when we are using machine type
pc-5.1, it needs to **not** use the feature, to be able to migrate to
real qemu-5.1.

The equivalent applies when migrating from qemu-5.1 to qemu-5.2:
qemu-5.2 has to expect that it is not going to get data for the new
feature, because qemu-5.1 doesn't know about it.

How do we tell QEMU about these device feature changes?  In the
hw/core/machine.c:hw_compat_X_Y arrays.

If we change a default value, we need to put back the old value in
that array.  The device, during initialization, needs to look at
that array to see what value it needs to get for that feature.  What
we put in that array is the value of a property.

To create a property for a device, we need to use one of the
DEFINE_PROP_*() macros. See include/hw/qdev-properties.h to find the
macros that exist.  With it, we set the default value for that
property, and that is what it is going to get in the latest released
version.  But if we want a different value for a previous version, we
can change that in the hw_compat_X_Y arrays.

hw_compat_X_Y is an array of records that have the format:

- name_device
- name_property
- value

Let's see a practical example.

In qemu-5.2 virtio-blk-device got multi queue support.  This is a
change that is not backward compatible.  In qemu-5.1 it has one
queue. In qemu-5.2 it has the same number of queues as the number of
cpus in the system.

When we are doing migration, if we migrate from a device that has 4
queues to a device that has only one queue, we don't know where to
put the extra information for the other 3 queues, and we fail
migration.

There is a similar problem when we migrate from qemu-5.1, which has only
one queue, to qemu-5.2: we only send information for one queue, but the
destination has 4, so we have 3 queues that are not properly initialized
and anything can happen.

So, how can we address this problem?  Easy, just convince qemu-5.2
that when it is running pc-5.1, it needs to set the number of queues
for virtio-blk-devices to 1.

That way we fix the cases 5 and 6.

5 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.1 -M pc-5.1

    qemu-5.2 -M pc-5.1 sets number of queues to be 1.
    qemu-5.1 -M pc-5.1 expects number of queues to be 1.

    Correct.  Migration works.

6 - qemu-5.1 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

    qemu-5.1 -M pc-5.1 sets number of queues to be 1.
    qemu-5.2 -M pc-5.1 expects number of queues to be 1.

    Correct.  Migration works.

And now the other interesting case, case 3.  In this case we have:

3 - qemu-5.2 -M pc-5.1  -> migrates to -> qemu-5.2 -M pc-5.1

    Here we have the same QEMU on both sides.  So it doesn't matter a
    lot if we have set the number of queues to 1 or not, because
    they are the same.

    WRONG!

    Think about what happens if we do one of these double migrations:

    A -> migrates -> B -> migrates -> C

    where:

    A: qemu-5.1 -M pc-5.1
    B: qemu-5.2 -M pc-5.1
    C: qemu-5.2 -M pc-5.1

    Migration A -> B is case 6, so number of queues needs to be 1.

    Migration B -> C is case 3, so we don't care.  But actually we do
    care, because we haven't started the guest in qemu-5.2; it came
    migrated from qemu-5.1.  So to be on the safe side, we need to
    always use number of queues 1 when we are using pc-5.1.

Now, how was this done in reality?  The following commit shows how it
was done::

  commit 9445e1e15e66c19e42bea942ba810db28052cd05
  Author: Stefan Hajnoczi <stefanha@redhat.com>
  Date:   Tue Aug 18 15:33:47 2020 +0100

  virtio-blk-pci: default num_queues to -smp N

The relevant parts for migration are::

    @@ -1281,7 +1284,8 @@ static Property virtio_blk_properties[] = {
     #endif
         DEFINE_PROP_BIT("request-merging", VirtIOBlock, conf.request_merging, 0,
                         true),
    -    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues, 1),
    +    DEFINE_PROP_UINT16("num-queues", VirtIOBlock, conf.num_queues,
    +                       VIRTIO_BLK_AUTO_NUM_QUEUES),
         DEFINE_PROP_UINT16("queue-size", VirtIOBlock, conf.queue_size, 256),

It changes the default value of num_queues.  But it fixes it for old
machine types to have the right value::

    @@ -31,6 +31,7 @@
     GlobalProperty hw_compat_5_1[] = {
         ...
    +    { "virtio-blk-device", "num-queues", "1"},
         ...
     };
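
The effect of such a compat entry can be pictured with a small stand-alone
model.  This is a sketch only: ``CompatProp``, ``compat_5_1`` and
``compat_lookup()`` are hypothetical names that mirror the three-field records
described above, not QEMU's real property machinery, and the ``"auto"``
fallback merely stands in for the new default.

```c
#include <stddef.h>
#include <string.h>

/* Same three fields as the records described above. */
typedef struct {
    const char *driver;
    const char *property;
    const char *value;
} CompatProp;

/* Model of the compat list attached to the 5.1 machine type. */
const CompatProp compat_5_1[] = {
    { "virtio-blk-device", "num-queues", "1" },
};

/*
 * Return the value forced by the machine type's compat list, or the
 * device's own default if the list has no matching entry.
 */
const char *compat_lookup(const CompatProp *table, size_t n,
                          const char *driver, const char *property,
                          const char *device_default)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(table[i].driver, driver) == 0 &&
            strcmp(table[i].property, property) == 0) {
            return table[i].value;
        }
    }
    return device_default;
}
```

In this model a pc-5.1 machine resolves ``num-queues`` to ``"1"`` regardless of
which QEMU binary runs it, while a machine type with an empty compat list gets
the device default.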

A device with different features on both sides
----------------------------------------------

Let's assume that we are using the same QEMU binary on both sides,
just to make things easier.  But we have a device that has
different features on both sides of the migration.  That can be
because the devices are different, because the kernel drivers of the two
devices have different features, or for any other reason.

How can we get this to work with migration?  The way to do that is
"theoretically" easy.  You take the features that the device
has on the source of the migration and the features that the device has
on the target of the migration, compute the intersection of the
two sets, and launch QEMU on both sides with exactly those
features.

Notice that this is not completely related to QEMU.  The most
important thing here is that this should be handled by the managing
application that launches QEMU.  If QEMU is configured correctly, the
migration will succeed.

That said, actually doing it is complicated.  Almost all devices are
bad at being launched with only some features enabled.
With one big exception: cpus.

You can read the documentation for QEMU x86 cpu models here:

https://qemu-project.gitlab.io/qemu/system/qemu-cpu-models.html

Notice that when it talks about migration, it recommends choosing the
newest cpu model that is supported for all cpus.

Let's say that we have:

Host A:

Device X has the feature Y

Host B:

Device X does not have the feature Y

If we try to migrate without any care from host A to host B, it will
fail because when migration tries to load the feature Y on
destination, it will find that the hardware is not there.

Doing this would be the equivalent of doing with cpus:

Host A:

$ qemu-system-x86_64 -cpu host

Host B:

$ qemu-system-x86_64 -cpu host

When the two hosts have different cpu features this is guaranteed to
fail, especially if Host B has fewer features than host A.  If host A
has fewer features than host B, sometimes it works.  The important word
in the last sentence is "sometimes".

So, forgetting about cpu models and continuing with the -cpu host
example, let's say that Host A and Host B differ in the following cpu
features:

========  ====  =====  ======
Features  pcid  stibp  taa-no
========  ====  =====  ======
Host A    X     X
Host B                 X
========  ====  =====  ======

If we want to migrate between them, the way to configure the QEMU cpu
on each host is:

Host A:

$ qemu-system-x86_64 -cpu host,pcid=off,stibp=off

Host B:

$ qemu-system-x86_64 -cpu host,taa-no=off

And you would be able to migrate between them.  It is the responsibility
of the management application or of the user to make sure that the
configuration is correct.  QEMU doesn't know how to look at this kind
of feature in general.
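
The computation a management application would perform can be sketched with
bit masks.  The ``FEAT_*`` constants and helper names below are hypothetical
and the bit positions are arbitrary; they do not correspond to real CPUID bits
or to any QEMU interface.

```c
#include <stdint.h>

/* Arbitrary bit assignments, for illustration only. */
enum {
    FEAT_PCID   = 1u << 0,
    FEAT_STIBP  = 1u << 1,
    FEAT_TAA_NO = 1u << 2,
};

/* Features that both hosts can offer to the guest. */
uint32_t features_common(uint32_t host_a, uint32_t host_b)
{
    return host_a & host_b;
}

/* Features a host has but must switch off to match the common set. */
uint32_t features_to_disable(uint32_t host, uint32_t common)
{
    return host & ~common;
}
```

For the table above, the common set is empty, so Host A has to disable
``FEAT_PCID | FEAT_STIBP`` and Host B has to disable ``FEAT_TAA_NO``, which
matches the two command lines shown.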

Notice that we don't recommend using -cpu host for migration.  It is
used in this example because it makes the example simpler.

Other devices have worse control over individual features.  If they
want to be able to migrate between hosts that show different features,
the device needs a way to configure which ones it is going to use.

In this section we have considered that we are using the same QEMU
binary on both sides of the migration.  If we use different QEMU
versions, then we need to take into account all the other
differences and the examples become even more complicated.

How to mitigate when we have a backward compatibility error
-----------------------------------------------------------

We break migration for old machine types regularly during
development.  But as soon as we find that there is a problem, we fix
it.  The problem is what happens when we detect, after we have done a
release, that something has gone wrong.

Let's see how it worked with one example.

After the release of qemu-8.0 we found a problem when doing migration
of the machine type pc-7.2.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration works

- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

  This migration fails

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

  This migration fails

So clearly something fails when migrating between qemu-7.2 and
qemu-8.0 with machine type pc-7.2.  The error messages and git bisect
pointed to the following commit.

In qemu-8.0 we got this commit::

    commit 010746ae1db7f52700cb2e2c46eb94f299cfa0d2
    Author: Jonathan Cameron <Jonathan.Cameron@huawei.com>
    Date:   Thu Mar 2 13:37:02 2023 +0000

    hw/pci/aer: Implement PCI_ERR_UNCOR_MASK register


The relevant bits of the commit for our example are these::

    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,6 +112,10 @@ int pcie_aer_init(PCIDevice *dev,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    +    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_MASK_DEFAULT);
    +    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                 PCI_ERR_UNC_SUPPORTED);

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                     PCI_ERR_UNC_SEVERITY_DEFAULT);

The patch changes how we configure PCI space for AER.  But QEMU fails
when the PCI space configuration is different between source and
destination.

The following commit shows how this got fixed::

    commit 5ed3dabe57dd9f4c007404345e5f5bf0e347317f
    Author: Leonardo Bras <leobras@redhat.com>
    Date:   Tue May 2 21:27:02 2023 -0300

    hw/pci: Disable PCI_ERR_UNCOR_MASK register for machine type < 8.0

    [...]

The relevant parts of the fix in QEMU are as follows.

First, we create a new property for the device so that we can select
either the old behaviour or the new behaviour::

    diff --git a/hw/pci/pci.c b/hw/pci/pci.c
    index 8a87ccc8b0..5153ad63d6 100644
    --- a/hw/pci/pci.c
    +++ b/hw/pci/pci.c
    @@ -79,6 +79,8 @@ static Property pci_props[] = {
         DEFINE_PROP_STRING("failover_pair_id", PCIDevice,
                            failover_pair_id),
         DEFINE_PROP_UINT32("acpi-index",  PCIDevice, acpi_index, 0),
    +    DEFINE_PROP_BIT("x-pcie-err-unc-mask", PCIDevice, cap_present,
    +                    QEMU_PCIE_ERR_UNC_MASK_BITNR, true),
         DEFINE_PROP_END_OF_LIST()
     };

Notice that the property defaults to ``true``, so the feature stays
enabled for new machine types.

Now we see how the fix is done.  This is going to depend on what kind
of breakage happens, but in this case it is quite simple::

    diff --git a/hw/pci/pcie_aer.c b/hw/pci/pcie_aer.c
    index 103667c368..374d593ead 100644
    --- a/hw/pci/pcie_aer.c
    +++ b/hw/pci/pcie_aer.c
    @@ -112,10 +112,13 @@ int pcie_aer_init(PCIDevice *dev, uint8_t cap_ver, uint16_t offset,

         pci_set_long(dev->w1cmask + offset + PCI_ERR_UNCOR_STATUS,
                      PCI_ERR_UNC_SUPPORTED);
    -    pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_MASK_DEFAULT);
    -    pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    -                 PCI_ERR_UNC_SUPPORTED);
    +
    +    if (dev->cap_present & QEMU_PCIE_ERR_UNC_MASK) {
    +        pci_set_long(dev->config + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_MASK_DEFAULT);
    +        pci_set_long(dev->wmask + offset + PCI_ERR_UNCOR_MASK,
    +                     PCI_ERR_UNC_SUPPORTED);
    +    }

         pci_set_long(dev->config + offset + PCI_ERR_UNCOR_SEVER,
                      PCI_ERR_UNC_SEVERITY_DEFAULT);

I.e. if the property bit is set, we configure the register as qemu-8.0
did.  If the property bit is not set, we leave it as it was in qemu-7.2.
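
The gating on the property bit can be pictured with a small standalone
C sketch.  It is only an illustration and does not use QEMU code:
``DemoDevice``, ``DEMO_ERR_UNC_MASK``, ``DEMO_MASK_DEFAULT`` and
``demo_aer_init()`` are hypothetical names, and the register value is
an arbitrary placeholder.

.. code:: c

    #include <stdint.h>

    /* Hypothetical stand-ins, not the real QEMU definitions. */
    enum { DEMO_ERR_UNC_MASK_BITNR = 0 };
    #define DEMO_ERR_UNC_MASK  (1u << DEMO_ERR_UNC_MASK_BITNR)
    #define DEMO_MASK_DEFAULT  0x1u   /* placeholder register value */

    typedef struct DemoDevice {
        uint32_t cap_present;  /* bit field of optional capabilities */
        uint32_t uncor_mask;   /* models the mask register in config space */
    } DemoDevice;

    /* Configure the modelled register only when the property bit is set. */
    static void demo_aer_init(DemoDevice *dev)
    {
        dev->uncor_mask = 0;                      /* old (7.2-like) layout */
        if (dev->cap_present & DEMO_ERR_UNC_MASK) {
            dev->uncor_mask = DEMO_MASK_DEFAULT;  /* new (8.0-like) layout */
        }
    }

Two devices initialised with different values of the bit end up with
different modelled config space, which is exactly the mismatch that
makes the migration fail.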

All that is left is to disable the feature for old machine types::

    diff --git a/hw/core/machine.c b/hw/core/machine.c
    index 47a34841a5..07f763eb2e 100644
    --- a/hw/core/machine.c
    +++ b/hw/core/machine.c
    @@ -48,6 +48,7 @@ GlobalProperty hw_compat_7_2[] = {
         { "e1000e", "migrate-timadj", "off" },
         { "virtio-mem", "x-early-migration", "false" },
         { "migration", "x-preempt-pre-7-2", "true" },
    +    { TYPE_PCI_DEVICE, "x-pcie-err-unc-mask", "off" },
     };
     const size_t hw_compat_7_2_len = G_N_ELEMENTS(hw_compat_7_2);
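
As a rough mental model, a compat table overrides a property default
for the machine types it applies to.  The sketch below is a simplified
illustration and not QEMU's implementation: ``DemoGlobalProperty``,
``demo_compat_7_2`` and ``demo_effective_value()`` are hypothetical
names, drivers are matched by exact string, and only a single table is
consulted.

.. code:: c

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical mirror of a { driver, property, value } compat entry. */
    typedef struct DemoGlobalProperty {
        const char *driver;
        const char *property;
        const char *value;
    } DemoGlobalProperty;

    static const DemoGlobalProperty demo_compat_7_2[] = {
        { "demo-pci-device", "x-pcie-err-unc-mask", "off" },
    };
    static const size_t demo_compat_7_2_len =
        sizeof(demo_compat_7_2) / sizeof(demo_compat_7_2[0]);

    /* Return the compat override if one matches, else the default. */
    static const char *demo_effective_value(const DemoGlobalProperty *compat,
                                            size_t len,
                                            const char *driver,
                                            const char *property,
                                            const char *default_value)
    {
        for (size_t i = 0; i < len; i++) {
            if (strcmp(compat[i].driver, driver) == 0 &&
                strcmp(compat[i].property, property) == 0) {
                return compat[i].value;
            }
        }
        return default_value;
    }

With the table applied, the property resolves to "off".  With an empty
table, which stands in for a new machine type, the default "on" is
kept.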

And now, when qemu-8.0.1 is released with this fix, all combinations
are going to work as expected.

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2 (works)
- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2 (works)

So normality has been restored and everything is OK, no?

Not really, now our matrix is much bigger.  We start with the easy
cases: migration from one version to the same version always works:

- $ qemu-7.2 -M pc-7.2  ->  qemu-7.2 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

Now the interesting ones, where the QEMU versions on the two sides
differ.  The first set fails and we can do nothing about it: both
versions are released and we can't change anything.

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0 -M pc-7.2
- $ qemu-8.0 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

These two are the ones that work.  The whole point of making the
change in the qemu-8.0.1 release was to fix this issue:

- $ qemu-7.2 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-7.2 -M pc-7.2

But now we find that qemu-8.0 can migrate neither to qemu-7.2 nor to
qemu-8.0.1.

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2
- $ qemu-8.0.1 -M pc-7.2  ->  qemu-8.0 -M pc-7.2

So, if we start a pc-7.2 machine in qemu-8.0 we can't migrate it to
anything except qemu-8.0.
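
The whole pc-7.2 matrix can be summarised with a tiny model: a
migration works when both sides agree on whether the mask register is
configured.  The sketch below assumes this single property is the only
difference between the three versions.  ``DemoQemuVersion`` and the
``demo_*`` functions are hypothetical names used for illustration only.

.. code:: c

    #include <stdbool.h>

    typedef enum {
        DEMO_QEMU_7_2,
        DEMO_QEMU_8_0,
        DEMO_QEMU_8_0_1,
    } DemoQemuVersion;

    /*
     * Does this QEMU version configure the mask register for pc-7.2?
     * 7.2 predates the feature, 8.0 enables it unconditionally, and
     * 8.0.1 disables it through the 7.2 compat entry.
     */
    static bool demo_mask_enabled_pc_7_2(DemoQemuVersion v)
    {
        return v == DEMO_QEMU_8_0;
    }

    /* Migration works only if both sides expose the same config space. */
    static bool demo_migration_works(DemoQemuVersion src, DemoQemuVersion dst)
    {
        return demo_mask_enabled_pc_7_2(src) == demo_mask_enabled_pc_7_2(dst);
    }

In this model qemu-8.0 is compatible only with itself, while qemu-7.2
and qemu-8.0.1 can migrate in both directions.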

Can we do better?

Yes.  If we know that we are going to do this migration:

- $ qemu-8.0 -M pc-7.2  ->  qemu-8.0.1 -M pc-7.2

We can launch the appropriate devices on the destination with::

  --device...,x-pcie-err-unc-mask=on

And now we can receive a migration from 8.0.  And from now on, we can
migrate this guest to newer QEMU versions if we remember to enable that
property for pc-7.2.  Notice that we need to remember: it is not
enough to know that the source of the migration is qemu-8.0.  Think of
this example:

- $ qemu-8.0 -M pc-7.2 -> qemu-8.0.1 -M pc-7.2 -> qemu-8.2 -M pc-7.2

In the second migration, the source is not qemu-8.0, but we still have
that "problem" and need to have that property enabled.  Notice that we
need to keep this mark/property until the machine is rebooted.  But a
normal reboot is not enough (it doesn't reload QEMU): we need the
machine to poweroff/poweron on a fixed QEMU.  From then on we can use
the machine type as is, without the extra property.
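
The need to carry the property along a chain of migrations can also be
modelled in a few lines.  This is a sketch under stated assumptions:
``DemoInstance`` and the ``demo_*`` functions are hypothetical, and any
fixed QEMU (8.0.1 or later) is assumed to default to "off" for pc-7.2
unless the property is set explicitly on the command line.

.. code:: c

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct DemoInstance {
        bool default_mask;   /* what this QEMU binary picks for pc-7.2 */
        bool has_override;   /* property given explicitly at launch */
        bool override_mask;  /* value of the explicit property */
    } DemoInstance;

    static bool demo_effective_mask(const DemoInstance *q)
    {
        return q->has_override ? q->override_mask : q->default_mask;
    }

    /* Every hop must keep the setting of the previous QEMU instance. */
    static bool demo_chain_works(const DemoInstance *hops, size_t n)
    {
        for (size_t i = 1; i < n; i++) {
            if (demo_effective_mask(&hops[i - 1]) !=
                demo_effective_mask(&hops[i])) {
                return false;
            }
        }
        return true;
    }

A chain that starts on qemu-8.0 only works if every later hop passes
the override.  Forgetting it on the last hop breaks the migration even
though the direct source is already a fixed QEMU.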