=========
Migration
=========

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state does just
that: it saves the state of each device that the guest is running.
Restoring a guest is the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times.  I.e. the state can only be restored in a guest that has the
same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration
has to be exactly the same).

Once we are able to save/restore a guest, a new functionality is
requested: migration.  This means that QEMU is able to start on one
machine and be "migrated" (i.e. moved) to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of that state from one machine to another.
Live migration allows the guest to continue running while the state is
transferred.  The guest has to be stopped only while the last part of
the state is transferred.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of
milliseconds (note that this depends on a lot of things).

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout through a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.

In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower.  While the
internals of RDMA migration are a bit different, this isn't really visible
outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see `migration/qemu-file.h`).  In most cases this
is connected to a subtype of ``QIOChannel`` (see `io/`).


Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices.  The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables).  See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works.  That way if you change the implementation
  later the migration stream will stay compatible.  That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device.  These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since the
  normal response from users is that *migration broke their VM* since it had
  apparently been running fine until then.  In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.

- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in a cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code).  Check that offsets
  into buffers and the like can't cause overruns.  Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent.  For example, the order of PCI capabilities
  is required to stay constant across migration.  Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration.  Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those.  For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination.  This can result in the
  device state being loaded into the wrong device.
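
As a minimal sketch of the "treat the stream as hostile" advice above, the
following standalone C helper validates an offset and a length read from an
incoming stream before they are used to index a buffer.  The name
``stream_range_ok`` and its parameters are hypothetical and not part of QEMU;
the only point is that the check is written so that it cannot itself overflow.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  /*
   * Hypothetical helper (not a QEMU API): return true only if the range
   * [offset, offset + len) lies entirely inside a buffer of buf_size bytes.
   * The comparison is arranged so that offset + len is never computed,
   * which avoids integer wrap-around on hostile values.
   */
  static bool stream_range_ok(uint64_t offset, uint64_t len, size_t buf_size)
  {
      if (offset > buf_size) {
          return false;
      }
      return len <= buf_size - offset;
  }

A load-side handler could presumably call a check like this and return an
error to fail the incoming migration when it does not hold.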

VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from hw/input/pckbd.c)

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd".
The `version_id` is 3, and there are four uint8_t fields in a KBDState structure.
We register this with:

.. code:: c

    vmstate_register(NULL, 0, &vmstate_kbd, s);

For devices that are `qdev` based, we can register the device in the class
init function:

.. code:: c

    dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``).  Various other macros exist for special
cases.

Note that the format on the wire is still very raw; i.e. a VMSTATE_UINT32
ends up with a 4 byte big-endian representation on the wire; in the future
it might be possible to use a more structured format.
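
To make the "4 byte big-endian representation" concrete, here is a small
standalone sketch of how a 32-bit value could be laid out and read back.  The
helper names ``put_be32_sketch`` and ``get_be32_sketch`` are hypothetical and
are not the functions QEMU uses; they only illustrate the byte order.

.. code:: c

  #include <stdint.h>

  /* Hypothetical helpers, shown only to illustrate the byte order. */

  /* Store a 32-bit value as 4 bytes, most significant byte first. */
  static void put_be32_sketch(uint8_t out[4], uint32_t v)
  {
      out[0] = (uint8_t)(v >> 24);
      out[1] = (uint8_t)(v >> 16);
      out[2] = (uint8_t)(v >> 8);
      out[3] = (uint8_t)v;
  }

  /* Rebuild the value from the same 4 bytes on the receiving side. */
  static uint32_t get_be32_sketch(const uint8_t in[4])
  {
      return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
             ((uint32_t)in[2] << 8) | (uint32_t)in[3];
  }

With this layout the value 0x12345678 is written as the bytes 12 34 56 78
regardless of the endianness of the source or destination host.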

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE,
although converting existing code can be tricky, and thus 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the `save_state`
and `load_state` functions.  Notice that `load_state` receives a version_id
parameter to know what state format it is receiving.  `save_state` doesn't
have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques.  In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate.  This is best solved using a subsection.

A subsection is "like" a device vmstate, but with one particularity: it
has a Boolean function that tells whether its values need to be sent
or not.  If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked for on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state successfully.  There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases.  This is OK for data that's
critical, but in some use cases it's preferred that the migration
should succeed even with the data missing.  To support this the
subsection can be connected to a device property and from there
to a versioned machine type.

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.  A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (VMStateField[]) {
          /* .... several fields .... */
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription*[]) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.
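
The flag technique mentioned earlier in this section (an outer ``pre_load``
and ``post_load`` cooperating to detect whether a subsection was loaded) can
be sketched in standalone C as follows.  The ``DemoState`` structure, its
fields, the default value and the three functions are hypothetical; only the
callback signatures follow the ones documented in this file.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  /* Hypothetical device state; "extra" is carried by a subsection. */
  typedef struct DemoState {
      uint32_t extra;
      bool extra_loaded;   /* runtime-only flag, never migrated */
  } DemoState;

  /* Outer pre_load: assume the subsection is absent until proven otherwise. */
  static int demo_pre_load(void *opaque)
  {
      DemoState *s = opaque;

      s->extra_loaded = false;
      return 0;
  }

  /* Subsection post_load: only runs if the subsection was in the stream. */
  static int demo_extra_post_load(void *opaque, int version_id)
  {
      DemoState *s = opaque;

      (void)version_id;
      s->extra_loaded = true;
      return 0;
  }

  /* Outer post_load: runs after all subsections, so the flag is final. */
  static int demo_post_load(void *opaque, int version_id)
  {
      DemoState *s = opaque;

      (void)version_id;
      if (!s->extra_loaded) {
          s->extra = 0;    /* hypothetical default for older streams */
      }
      return 0;
  }

Because the outer ``post_load`` runs after all subsections, it is the natural
place to apply the fallback.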

Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function.
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.
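
Step c) above might look roughly like the following standalone sketch.
``FooState``, its ``support_foo`` field and ``foo_support_needed`` are
hypothetical names.  In a real device the field would be set through the
property from step a) and the function would be assigned to the subsection's
``.needed`` member as in step d); neither is shown here.

.. code:: c

  #include <stdbool.h>

  /* Hypothetical device state; support_foo mirrors the "support-foo" property. */
  typedef struct FooState {
      bool support_foo;   /* true by default, false on older machine types */
  } FooState;

  /*
   * Condition function with the same shape as the .needed callback in the
   * IDE example: return true when the subsection should be sent.
   */
  static bool foo_support_needed(void *opaque)
  {
      FooState *s = opaque;

      return s->support_foo;
  }

With the property forced to false by the compatibility entry from step b),
this function returns false and the subsection is left out of the stream.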

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

     ``VMSTATE_UINT32(foo, barstruct)``

     becomes

     ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

     Sometime in the future when we no longer care about the ancient versions these can be killed off.
     Note that for backward compatibility it's important to fill in the structure with
     data that the destination will understand.

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved.  The `save_state` always saves
the state as the newest version.  But `load_state` sometimes is able to
load state from an older version.

You can see that there are several version fields:

- `version_id`: the maximum version_id supported by VMState for that device.
- `minimum_version_id`: the minimum version_id that VMState is able to understand
  for that device.
- `minimum_version_id_old`: For devices that were not able to port to vmstate, we can
  assign a function that knows how to read this old state. This field is
  ignored if there is no `load_state_old` handler.

VMState is able to read versions from minimum_version_id to
version_id.  And the function ``load_state_old()`` (if present) is able to
load state from minimum_version_id_old to minimum_version_id.  This
function is deprecated and will be removed when no more users are left.

There are *_V* forms of many ``VMSTATE_`` macros to load version dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),

only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.
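
The version rules above can be summarised in a small standalone sketch.  The
``VersionRange`` type and both functions are hypothetical and only restate the
rules from this section; the deprecated ``load_state_old()`` path is
deliberately left out.

.. code:: c

  #include <stdbool.h>

  /* Hypothetical mirror of the two version fields of a description. */
  typedef struct VersionRange {
      int version_id;          /* newest version, also the one always saved */
      int minimum_version_id;  /* oldest version that can still be loaded */
  } VersionRange;

  /* An incoming section is acceptable if its version is within the range. */
  static bool version_loadable(const VersionRange *r, int stream_version)
  {
      return stream_version >= r->minimum_version_id &&
             stream_version <= r->version_id;
  }

  /* A _V style field is only present in streams at or above its version. */
  static bool field_present(int field_version, int stream_version)
  {
      return stream_version >= field_version;
  }

The upper bound in ``version_loadable`` is what makes a version bump break
backward migration: an older QEMU has a lower ``version_id`` than the section
it receives.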

Massaging functions
-------------------

Sometimes, it is not enough to be able to save the state directly
from one structure; we need to fill in the correct values there.  One
example is when we are using KVM.  Before saving the cpu state, we
need to ask KVM to copy to QEMU the state that it is using.  And the
opposite when we are loading the state: we need a way to tell KVM to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.
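
The ordering rule for ``pre_save`` and ``post_save`` described above can be
illustrated with a standalone sketch.  ``SaveHooks``, ``save_with_hooks``,
``CallLog`` and the ``demo_`` callbacks are all hypothetical and exist only
for this illustration; this is not how QEMU's own save path is written.

.. code:: c

  #include <stddef.h>

  /* Hypothetical hook table using the signatures listed above. */
  typedef struct SaveHooks {
      int (*pre_save)(void *opaque);
      int (*post_save)(void *opaque);
  } SaveHooks;

  /*
   * Hypothetical driver: write_fields stands in for whatever writes the
   * device fields to the stream and returns 0 on success.
   */
  static int save_with_hooks(const SaveHooks *h,
                             int (*write_fields)(void *opaque),
                             void *opaque)
  {
      int ret;

      if (h->pre_save) {
          ret = h->pre_save(opaque);
          if (ret) {
              return ret;          /* post_save is skipped in this case */
          }
      }

      ret = write_fields(opaque);

      if (h->post_save) {
          int ps = h->post_save(opaque);   /* runs even if writing failed */
          if (!ret) {
              ret = ps;
          }
      }
      return ret;
  }

  /* Demo state: return values to inject, plus a counter for post_save. */
  typedef struct CallLog {
      int pre_ret;
      int write_ret;
      int post_calls;
  } CallLog;

  static int demo_pre_save(void *opaque)
  {
      return ((CallLog *)opaque)->pre_ret;
  }

  static int demo_write(void *opaque)
  {
      return ((CallLog *)opaque)->write_ret;
  }

  static int demo_post_save(void *opaque)
  {
      ((CallLog *)opaque)->post_calls++;
      return 0;
  }

In this sketch a failing write still reaches ``demo_post_save``, while a
failing ``demo_pre_save`` returns early and leaves ``post_calls`` at zero.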

The ``VMSTATE_WITH_TMP`` macro may be useful when the migration
data doesn't match the stored device data well; it allows an
intermediate temporary structure to be populated with migration
data and then transferred to the main structure.

If you use memory API functions that update memory layout outside
initialization (i.e., in response to a guest action), this is a strong
indication that you need to call these functions in a `post_load` callback.
Examples of such memory API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()

447edd70806SDr. David Alan GilbertIterative device migration
448edd70806SDr. David Alan Gilbert--------------------------

Some devices, such as RAM, block storage or certain platform devices,
have so much data that sending it all in one section would pause the
CPUs for too long.  For these devices an *iterative* approach is taken.

The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use the
``qemu_put_*``/``qemu_get_*`` functions to write data to and read data
from the stream.  Specialist versions exist for high bandwidth IO.

An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device.  In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``save_live_pending`` function that is called repeatedly and must
    indicate how much more data the iterative device must save.  The core
    migration code will use this to determine when to pause the CPUs
    and complete the migration.

  - A ``save_live_iterate`` function (called after ``save_live_pending``
    when there is significant data still to be sent).  It should send
    a chunk of data until the point that stream bandwidth limits tell it
    to stop.  Each call generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.
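
The way the save-side functions interact can be illustrated with a toy
model.  This is only a sketch under simplifying assumptions: ``ToyIterDev``
and the ``toy_*`` functions are hypothetical, a fixed byte ``budget`` stands
in for the stream bandwidth limit, a fixed ``threshold`` stands in for the
core code's decision about when the remaining data is small enough, and the
guest never dirties new data while the loop runs.

.. code:: c

  #include <stdint.h>

  /* Hypothetical toy device with 'remaining' bytes still to be sent. */
  typedef struct ToyIterDev {
      uint64_t remaining;
      unsigned sections;    /* number of sections generated so far */
  } ToyIterDev;

  /* Plays the role of save_setup: initialise and emit the first section. */
  static void toy_save_setup(ToyIterDev *d, uint64_t total)
  {
      d->remaining = total;
      d->sections = 1;
  }

  /* Plays the role of save_live_pending: how much is left to send. */
  static uint64_t toy_save_pending(const ToyIterDev *d)
  {
      return d->remaining;
  }

  /* Plays the role of save_live_iterate: send at most 'budget' bytes.
   * Each call generates one section. */
  static void toy_save_iterate(ToyIterDev *d, uint64_t budget)
  {
      uint64_t n = d->remaining < budget ? d->remaining : budget;

      d->remaining -= n;
      d->sections++;
  }

  /* Plays the role of save_live_complete_precopy: send whatever is left
   * in one last section. */
  static void toy_save_complete(ToyIterDev *d)
  {
      d->remaining = 0;
      d->sections++;
  }

  /* Iterate while more than 'threshold' bytes are pending, then (with the
   * CPUs notionally paused) complete.  Returns the number of sections. */
  static unsigned toy_migrate(ToyIterDev *d, uint64_t total,
                              uint64_t budget, uint64_t threshold)
  {
      toy_save_setup(d, total);
      /* A zero budget could never make progress, so skip iterating. */
      while (budget > 0 && toy_save_pending(d) > threshold) {
          toy_save_iterate(d, budget);
      }
      toy_save_complete(d);
      return d->sections;
  }

With ``total = 1000``, ``budget = 300`` and ``threshold = 200`` the model
produces one setup section, three iteration sections and one final section.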

Note that the contents of the sections for iterative migration tend
to be open-coded by the devices; care should be taken in parsing
the results and structuring the stream to make them easy to validate.

Device ordering
---------------

There are cases in which the ordering of device loading matters; for
example, in some systems a device may assert an interrupt during loading,
and if the interrupt controller is loaded later then that state might be lost.

Some ordering is implicitly provided by the order in which the machine
definition creates devices; however, this is somewhat fragile.

The ``MigrationPriority`` enum provides a means of explicitly enforcing
ordering.  Numerically higher priorities are loaded earlier.
The priority is set by setting the ``priority`` field of the top level
``VMStateDescription`` for the device.
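
The rule "numerically higher priorities are loaded earlier" can be sketched
as a sort.  The code below is an illustration only: ``ToyDevice`` and the
``toy_*`` functions are hypothetical, and falling back to creation order for
devices of equal priority is an assumption made for the example.

.. code:: c

  #include <stdlib.h>

  /* Hypothetical description of one device to be loaded. */
  typedef struct ToyDevice {
      const char *name;
      int priority;       /* numerically higher loads earlier */
      int creation_idx;   /* order in which the machine created it */
  } ToyDevice;

  /* qsort comparator: descending priority, then ascending creation order. */
  static int toy_load_order_cmp(const void *a, const void *b)
  {
      const ToyDevice *x = a;
      const ToyDevice *y = b;

      if (x->priority != y->priority) {
          return (x->priority < y->priority) - (x->priority > y->priority);
      }
      return (x->creation_idx > y->creation_idx) -
             (x->creation_idx < y->creation_idx);
  }

  static void toy_sort_for_load(ToyDevice *devs, size_t n)
  {
      qsort(devs, n, sizeof(devs[0]), toy_load_order_cmp);
  }

Giving an interrupt controller a higher priority than the devices that
raise interrupts moves it to the front of the load order, whatever order
the machine created them in.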

Stream structure
================

The stream tries to be word-size and endianness agnostic, allowing migration
between hosts of different characteristics running the same VM.

  - Header

    - Magic
    - Version
    - VM configuration section

      - Machine type
      - Target page bits

  - List of sections
    Each section contains a device, or one iteration of a device save.

    - section type
    - section id
    - ID string (First section of each device)
    - instance id (First section of each device)
    - version id (First section of each device)
    - <device data>
    - Footer mark

  - EOF mark
  - VM Description structure
    Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above.  Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that's up to the devices themselves.
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.
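
To make the section layout concrete, here is a sketch of how the header of
the first section of a device could be serialised in a host-independent way.
The field order follows the list above, but the field widths (one byte for
the section type, big-endian 32-bit integers for the ids, a one-byte length
in front of the ID string) are assumptions made for this example, and
``toy_put_be32`` and ``toy_put_section_start`` are hypothetical helpers, not
QEMU functions.

.. code:: c

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* Store a 32-bit value most significant byte first, so the bytes on
   * the wire do not depend on the host's endianness. */
  static size_t toy_put_be32(uint8_t *buf, uint32_t v)
  {
      buf[0] = (uint8_t)(v >> 24);
      buf[1] = (uint8_t)(v >> 16);
      buf[2] = (uint8_t)(v >> 8);
      buf[3] = (uint8_t)v;
      return 4;
  }

  /* Write the header of a device's first section into 'buf'.
   * Returns the number of bytes written, or 0 if it does not fit. */
  static size_t toy_put_section_start(uint8_t *buf, size_t cap,
                                      uint8_t section_type,
                                      uint32_t section_id,
                                      const char *idstr,
                                      uint32_t instance_id,
                                      uint32_t version_id)
  {
      size_t len = strlen(idstr);
      size_t off = 0;

      if (len > 255 || cap < 1 + 4 + 1 + len + 4 + 4) {
          return 0;
      }
      buf[off++] = section_type;
      off += toy_put_be32(buf + off, section_id);
      buf[off++] = (uint8_t)len;          /* assumed length prefix */
      memcpy(buf + off, idstr, len);
      off += len;
      off += toy_put_be32(buf + off, instance_id);
      off += toy_put_be32(buf + off, version_id);
      return off;
  }

The device data and the footer mark would follow these bytes; later
sections of the same device would carry only the section type and section id.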

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well.  Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
to distinguish otherwise identically named devices.

Return path
-----------

Only a unidirectional stream is required for normal migration; however, a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the QEMUFile* for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)

Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound on
the amount of migration traffic and the time it takes; the down side is that during
the postcopy phase, a failure of *either* side or the network connection causes
the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to page faults.
The metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value of the QMP command shows the overlapped
blocking time for all vCPUs; ``postcopy-vcpu-blocktime`` shows a list of
blocking times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_speed`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source.  The
migration stream therefore has to be able to respond with page data *during*
the device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state stream
     containing everything except postcopiable devices (i.e. RAM)

   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.
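
The effect of a page request on the background scan can be sketched with
a cursor over a dirty-page array.  This is a simplified model rather than
the real implementation: ``ToyScan``, ``toy_send_next`` and
``toy_page_request`` are hypothetical, one ``bool`` per page replaces the
bitmap, and wrapping around at the end of RAM is an assumption of the example.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>

  /* Hypothetical scan state on the source. */
  typedef struct ToyScan {
      bool *dirty;      /* one flag per page; true means 'needs sending' */
      size_t npages;
      size_t cursor;    /* where the background scan resumes */
  } ToyScan;

  /* 'Send' the next dirty page at or after the cursor, wrapping around.
   * Returns the page index, or npages if nothing is left to send. */
  static size_t toy_send_next(ToyScan *s)
  {
      for (size_t i = 0; i < s->npages; i++) {
          size_t p = (s->cursor + i) % s->npages;

          if (s->dirty[p]) {
              s->dirty[p] = false;
              s->cursor = (p + 1) % s->npages;
              return p;
          }
      }
      return s->npages;
  }

  /* A page request from the destination restarts the scan at that page,
   * so the page and its neighbours are sent next. */
  static void toy_page_request(ToyScan *s, size_t page)
  {
      if (page < s->npages) {
          s->cursor = page;
      }
  }

A request for a page that has already been sent simply moves the cursor;
the next call then skips the clean pages and continues with the following
dirty one.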

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' (3), i.e. the first command in the package

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.   It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - End

    The listen thread can now quit and perform the cleanup of migration
    state; the migration is now complete.

Source side page maps
---------------------

The source side keeps two bitmaps during postcopy: the 'migration bitmap'
and the 'unsent map'.  The 'migration bitmap' is basically the same as in
the precopy case, and holds a bit to indicate that a page is 'dirty',
i.e. needs sending.  During the precopy phase this is updated as the CPU
dirties pages; however, during postcopy the CPUs are stopped and nothing
should dirty anything any more.

The 'unsent map' is used for the transition to postcopy.  It is a bitmap that
has a bit cleared whenever a page is sent to the destination; during
the transition to postcopy mode it is combined with the migration bitmap
to form a set of pages that:

   a) Have been sent but then redirtied (which must be discarded)
   b) Have not yet been sent - which also must be discarded to cause any
      transparent huge pages built during precopy to be broken.
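
Expressed on bitmap words, the discard set is the union of cases (a) and
(b).  The function below is a sketch with hypothetical names
(``toy_discard_set`` and its parameters); it assumes that a set bit in
``dirty`` means "needs sending" and a set bit in ``unsent`` means "never
sent", as described above, and it writes the result to a separate array
for clarity.

.. code:: c

  #include <stdint.h>
  #include <stddef.h>

  /* Compute the set of pages to discard at the switch to postcopy. */
  static void toy_discard_set(const uint64_t *dirty, const uint64_t *unsent,
                              uint64_t *discard, size_t nwords)
  {
      for (size_t i = 0; i < nwords; i++) {
          /* (a) sent at least once, but dirtied again since */
          uint64_t sent_then_redirtied = ~unsent[i] & dirty[i];
          /* (b) never sent at all */
          uint64_t never_sent = unsent[i];

          discard[i] = sent_then_redirtied | never_sent;
      }
  }

Only pages that were sent and have stayed clean since are left out of the
discard set, which is why the expression reduces to ``dirty | unsent``.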

Note that the contents of the unsent map are sacrificed during the calculation
of the discard set and thus aren't valid once in postcopy.  The migration
bitmap (the 'dirtymap') is still valid and is used to ensure that no page is
sent more than once.  Any request for a page that has already been sent is
ignored.  Duplicate requests such as this can happen as a page is sent at
about the same time the destination accesses it.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, causing (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.
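
The estimate in (d) is simple arithmetic, shown here as a small helper.
``toy_page_transfer_seconds`` is a hypothetical function written for this
example; it ignores protocol overhead and latency, so real transfers take
somewhat longer.

.. code:: c

  /* Seconds needed to move one page of 'page_bytes' bytes over a link
   * that carries 'link_bits_per_sec' bits per second. */
  static double toy_page_transfer_seconds(double page_bytes,
                                          double link_bits_per_sec)
  {
      return page_bytes * 8.0 / link_bits_per_sec;
  }

For a 10Gbps link this gives roughly 0.86 seconds for a 1GB (2^30 byte)
hugepage, against roughly 1.7 milliseconds for a 2MB hugepage, which is the
time a faulting destination thread stays blocked in each case.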

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU.  There are restrictions on the types
of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

.. note::
  There are two future improvements that would be nice:

    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Firmware
========

Migration migrates the copies of RAM and ROM, and thus when running
on the destination it includes the firmware from the source. Even after
resetting a VM, the old firmware is used.  Only once QEMU has been restarted
is the new firmware in use.

- Changes in firmware size can cause changes in the required RAMBlock size
  to hold the firmware and thus migration can fail.  In practice it's best
  to pad firmware images to convenient powers of 2 with plenty of space
  for growth.

- Care should be taken with device emulation code so that newer
  emulation code can work with older firmware to allow forward migration.

- Care should be taken with newer firmware so that backward migration
  to older systems with older device emulation code will work.

In some cases it may be best to tie specific firmware versions to specific
versioned machine types to cut down on the combinations that will need
support.  This is also useful when newer versions of firmware outgrow
the padding.