===================
Migration framework
===================

QEMU has code to load/save the state of the guest that it is running.
These are two complementary operations.  Saving the state just does
that: it saves the state of each device that the guest is running.
Restoring a guest is just the opposite operation: we need to load the
state of each device.

For this to work, QEMU has to be launched with the same arguments both
times, i.e. it can only restore the state into a guest that has
the same devices as the one it was saved from (this last requirement can
be relaxed a bit, but for now we can consider that the configuration has
to be exactly the same).

Once we are able to save/restore a guest, a new piece of functionality
becomes possible: migration.  This means that QEMU is able to start on one
machine and be "migrated", i.e. moved, to another machine.

Next came the "live migration" functionality.  This is important
because some guests run with a lot of state (especially RAM), and it
can take a while to move all of that state from one machine to another.  Live
migration allows the guest to continue running while the state is
transferred.  The guest has to be stopped only while the last part of
the state is transferred.  Typically the time that the guest is
unresponsive during live migration is in the low hundreds of milliseconds
(note that this depends on a lot of things).

.. contents::

Transports
==========

The migration stream is normally just a byte stream that can be passed
over any transport.

- tcp migration: do the migration using tcp sockets
- unix migration: do the migration using unix sockets
- exec migration: do the migration using the stdin/stdout of a process.
- fd migration: do the migration using a file descriptor that is
  passed to QEMU.  QEMU doesn't care how this file descriptor is opened.
- file migration: do the migration using a file that is passed to QEMU
  by path. A file offset option is supported to allow a management
  application to add its own metadata to the start of the file without
  QEMU interference. Note that QEMU does not flush cached file
  data/metadata at the end of migration.

In addition, support is included for migration using RDMA, which
transports the page data using ``RDMA``, where the hardware takes care of
transporting the pages, and the load on the CPU is much lower.  While the
internals of RDMA migration are a bit different, this isn't really visible
outside the RAM migration code.

All these migration protocols use the same infrastructure to
save/restore device state.  This infrastructure is shared with the
savevm/loadvm functionality.

Common infrastructure
=====================

The files, sockets or fd's that carry the migration stream are abstracted by
the ``QEMUFile`` type (see ``migration/qemu-file.h``).  In most cases this
is connected to a subtype of ``QIOChannel`` (see ``io/``).


Saving the state of one device
==============================

For most devices, the state is saved in a single call to the migration
infrastructure; these are *non-iterative* devices.  The data for these
devices is sent at the end of precopy migration, when the CPUs are paused.
There are also *iterative* devices, which contain a very large amount of
data (e.g. RAM or large tables).  See the iterative device section below.

General advice for device developers
------------------------------------

- The migration state saved should reflect the device being modelled rather
  than the way your implementation works.  That way if you change the implementation
  later the migration stream will stay compatible.  That model may include
  internal state that's not directly visible in a register.

- When saving a migration stream the device code may walk and check
  the state of the device.  These checks might fail in various ways (e.g.
  discovering internal state is corrupt or that the guest has done something bad).
  Consider carefully before asserting/aborting at this point, since
  users will typically conclude that *migration broke their VM*, as it had
  apparently been running fine until then.  In these error cases, the device
  should log a message indicating the cause of error, and should consider
  putting the device into an error state, allowing the rest of the VM to
  continue execution.

- The migration might happen at an inconvenient point,
  e.g. right in the middle of the guest reprogramming the device, during
  guest reboot or shutdown or while the device is waiting for external IO.
  It's strongly preferred that migrations do not fail in this situation,
  since in a cloud environment migrations might happen automatically to
  VMs that the administrator doesn't directly control.

- If you do need to fail a migration, ensure that sufficient information
  is logged to identify what went wrong.

- The destination should treat an incoming migration stream as hostile
  (which we do to varying degrees in the existing code).  Check that offsets
  into buffers and the like can't cause overruns.  Fail the incoming migration
  in the case of a corrupted stream like this.

- Take care with internal device state or behaviour that might become
  migration version dependent.  For example, the order of PCI capabilities
  is required to stay constant across migration.  Another example would
  be that a special case handled by subsections (see below) might become
  much more common if a default behaviour is changed.

- The state of the source should not be changed or destroyed by the
  outgoing migration.  Migrations timing out or being failed by
  higher levels of management, or failures of the destination host are
  not unusual, and in that case the VM is restarted on the source.
  Note that the management layer can validly revert the migration
  even though the QEMU level of migration has succeeded as long as it
  does it before starting execution on the destination.

- Buses and devices should be able to explicitly specify addresses when
  instantiated, and management tools should use those.  For example,
  when hot adding USB devices it's important to specify the ports
  and addresses, since implicit ordering based on the command line order
  may be different on the destination.  This can result in the
  device state being loaded into the wrong device.
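
The advice above on treating the incoming stream as hostile can be
illustrated with a minimal, self-contained sketch.  It is not taken from
QEMU: the ``ExampleDev`` structure and ``example_load_chunk()`` are
hypothetical, and the only point is that an offset and length read from the
stream are validated before anything is copied.

.. code:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  #define EXAMPLE_BUF_SIZE 64

  /* Hypothetical device with a fixed-size buffer filled from the stream. */
  typedef struct ExampleDev {
      uint8_t buf[EXAMPLE_BUF_SIZE];
  } ExampleDev;

  /*
   * Copy 'len' bytes of payload to buf[offset .. offset + len).
   * 'offset' and 'len' come from the (untrusted) stream.  Returns 0 on
   * success and -1 if the request would overrun the buffer, in which
   * case the caller should fail the incoming migration.
   */
  static int example_load_chunk(ExampleDev *d, uint32_t offset, uint32_t len,
                                const uint8_t *payload)
  {
      /* Written so that the check itself cannot overflow. */
      if (offset > EXAMPLE_BUF_SIZE || len > EXAMPLE_BUF_SIZE - offset) {
          return -1;
      }
      memcpy(d->buf + offset, payload, len);
      return 0;
  }

Comparing ``len`` against the remaining space, rather than testing
``offset + len``, avoids wrap-around when the stream supplies very large
values.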

VMState
-------

Most device data can be described using the ``VMSTATE`` macros (mostly defined
in ``include/migration/vmstate.h``).

An example (from hw/input/pckbd.c):

.. code:: c

  static const VMStateDescription vmstate_kbd = {
      .name = "pckbd",
      .version_id = 3,
      .minimum_version_id = 3,
      .fields = (const VMStateField[]) {
          VMSTATE_UINT8(write_cmd, KBDState),
          VMSTATE_UINT8(status, KBDState),
          VMSTATE_UINT8(mode, KBDState),
          VMSTATE_UINT8(pending, KBDState),
          VMSTATE_END_OF_LIST()
      }
  };

We are declaring the state with name "pckbd".  The ``version_id`` is
3, and there are 4 uint8_t fields in the KBDState structure.  We
register this ``VMStateDescription`` with one of the following
functions.  The first one will generate a device ``instance_id``
different for each registration.  Use the second one if you already
have an id that is different for each instance of the device:

.. code:: c

    vmstate_register_any(NULL, &vmstate_kbd, s);
    vmstate_register(NULL, instance_id, &vmstate_kbd, s);

For devices that are ``qdev`` based, we can register the device in the class
init function:

.. code:: c

    dc->vmsd = &vmstate_kbd_isa;

The VMState macros take care of ensuring that the device data section
is formatted portably (normally big endian) and make some compile time checks
against the types of the fields in the structures.

VMState macros can include other VMStateDescriptions to store substructures
(see ``VMSTATE_STRUCT_``), arrays (``VMSTATE_ARRAY_``) and variable length
arrays (``VMSTATE_VARRAY_``).  Various other macros exist for special
cases.

Note that the format on the wire is still very raw; e.g. a ``VMSTATE_UINT32``
ends up with a 4 byte big-endian representation on the wire; in the future
it might be possible to use a more structured format.
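
To make the "4 byte big-endian" representation concrete, here is a small
standalone sketch.  The helper names ``example_put_be32()`` and
``example_get_be32()`` are made up for illustration and are not QEMU
functions; the real code writes through a ``QEMUFile``.  The sketch shows
only the byte order.

.. code:: c

  #include <stdint.h>

  /* Store a 32-bit value as 4 bytes, most significant byte first. */
  static void example_put_be32(uint8_t *out, uint32_t v)
  {
      out[0] = (uint8_t)(v >> 24);
      out[1] = (uint8_t)(v >> 16);
      out[2] = (uint8_t)(v >> 8);
      out[3] = (uint8_t)v;
  }

  /* Read the value back; the result does not depend on host endianness. */
  static uint32_t example_get_be32(const uint8_t *in)
  {
      return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
             ((uint32_t)in[2] << 8) | (uint32_t)in[3];
  }

With these helpers the value ``0x12345678`` is laid out as the bytes
``12 34 56 78``, whatever the byte order of the source or destination host.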

Legacy way
----------

This way is going to disappear as soon as all current users are ported to VMSTATE;
although converting existing code can be tricky, and thus 'soon' is relative.

Each device has to register two functions, one to save the state and
another to load the state back.

.. code:: c

  int register_savevm_live(const char *idstr,
                           int instance_id,
                           int version_id,
                           SaveVMHandlers *ops,
                           void *opaque);

Two functions in the ``ops`` structure are the ``save_state``
and ``load_state`` functions.  Notice that ``load_state`` receives a version_id
parameter so that it knows which state format it is receiving.  ``save_state`` doesn't
have a version_id parameter because it always uses the latest version.

Note that because the VMState macros still save the data in a raw
format, in many cases it's possible to replace legacy code
with a carefully constructed VMState description that matches the
byte layout of the existing code.

Changing migration data structures
----------------------------------

When we migrate a device, we save/load the state as a series
of fields.  Sometimes, due to bugs or new functionality, we need to
change the state to store more/different information.  Changing the migration
state saved for a device can break migration compatibility unless
care is taken to use the appropriate techniques.  In general QEMU tries
to maintain forward migration compatibility (i.e. migrating from
QEMU n->n+1) and there are users who benefit from backward compatibility
as well.

Subsections
-----------

The most common structure change is adding new data, e.g. when adding
a newer form of device, or adding state that you previously
forgot to migrate.  This is best solved using a subsection.

A subsection is "like" a device vmstate, but with one particularity: it
has a Boolean function that tells whether its values need to be sent
or not.  If this function returns false, the subsection is not sent.
Subsections have a unique name, which is looked up on the receiving
side.

On the receiving side, if we find a subsection for a device that we
don't understand, we just fail the migration.  If we understand all
the subsections, then we load the state successfully.  There's no check
that a subsection is loaded, so a newer QEMU that knows about a subsection
can (with care) load a stream from an older QEMU that didn't send
the subsection.
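
The receiving-side rules described above can be modelled in a few lines of
plain C.  This is a simplified illustration, not QEMU's loader:
``example_match_subsections()`` and its arguments are hypothetical, and
subsections are reduced to their names.

.. code:: c

  #include <stdbool.h>
  #include <stddef.h>
  #include <string.h>

  /*
   * 'known' lists the subsection names the destination understands and
   * 'incoming' the names found in the stream.  loaded[i] is set for each
   * known subsection that was present.  Returns 0 on success, or -1 if
   * the stream contains a subsection the destination does not know.
   */
  static int example_match_subsections(const char *const *known,
                                       size_t n_known,
                                       const char *const *incoming,
                                       size_t n_incoming, bool *loaded)
  {
      for (size_t i = 0; i < n_known; i++) {
          loaded[i] = false;
      }
      for (size_t j = 0; j < n_incoming; j++) {
          bool found = false;
          for (size_t i = 0; i < n_known; i++) {
              if (strcmp(known[i], incoming[j]) == 0) {
                  loaded[i] = true;
                  found = true;
                  break;
              }
          }
          if (!found) {
              return -1;  /* unknown subsection: fail the migration */
          }
      }
      return 0;           /* absent subsections are not an error */
  }

An unknown name is fatal, while a known subsection that never arrives is
simply left unloaded.  That asymmetry is what lets a newer QEMU accept a
stream from an older one.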

If the new data is only needed in a rare case, then the subsection
can be made conditional on that case and the migration will still
succeed to older QEMUs in most cases.  This is OK for data that's
critical, but in some use cases it's preferred that the migration
should succeed even with the data missing.  To support this the
subsection can be connected to a device property and from there
to a versioned machine type.

The 'pre_load' and 'post_load' functions on subsections are only
called if the subsection is loaded.

One important note is that the outer post_load() function is called "after"
loading all subsections, because a newer subsection could change the same
value that it uses.  A flag, and the combination of outer pre_load and
post_load can be used to detect whether a subsection was loaded, and to
fall back on default behaviour when the subsection isn't present.
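
The flag technique can be sketched as follows.  The device, its ``foo``
field, the default value and all function names are hypothetical; only the
callback signatures follow the ``pre_load``/``post_load`` prototypes used
by VMState, and the callbacks are called by hand here rather than by the
migration code.

.. code:: c

  #include <stdbool.h>
  #include <stdint.h>

  #define EXAMPLE_FOO_DEFAULT 7u

  /* Hypothetical device; 'foo' is carried by an optional subsection. */
  typedef struct ExampleDev {
      uint32_t foo;
      bool foo_loaded;    /* not migrated; only used while loading */
  } ExampleDev;

  /* Outer pre_load: runs before any field or subsection is loaded. */
  static int example_pre_load(void *opaque)
  {
      ExampleDev *d = opaque;

      d->foo_loaded = false;
      return 0;
  }

  /* Subsection post_load: only runs if the subsection was in the stream. */
  static int example_foo_post_load(void *opaque, int version_id)
  {
      ExampleDev *d = opaque;

      (void)version_id;
      d->foo_loaded = true;
      return 0;
  }

  /* Outer post_load: runs after all subsections have been loaded. */
  static int example_post_load(void *opaque, int version_id)
  {
      ExampleDev *d = opaque;

      (void)version_id;
      if (!d->foo_loaded) {
          d->foo = EXAMPLE_FOO_DEFAULT;   /* fall back to the default */
      }
      return 0;
  }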

Example:

.. code:: c

  static bool ide_drive_pio_state_needed(void *opaque)
  {
      IDEState *s = opaque;

      return ((s->status & DRQ_STAT) != 0)
          || (s->bus->error_status & BM_STATUS_PIO_RETRY);
  }

  const VMStateDescription vmstate_ide_drive_pio_state = {
      .name = "ide_drive/pio_state",
      .version_id = 1,
      .minimum_version_id = 1,
      .pre_save = ide_drive_pio_pre_save,
      .post_load = ide_drive_pio_post_load,
      .needed = ide_drive_pio_state_needed,
      .fields = (const VMStateField[]) {
          VMSTATE_INT32(req_nb_sectors, IDEState),
          VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
                               vmstate_info_uint8, uint8_t),
          VMSTATE_INT32(cur_io_buffer_offset, IDEState),
          VMSTATE_INT32(cur_io_buffer_len, IDEState),
          VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
          VMSTATE_INT32(elementary_transfer_size, IDEState),
          VMSTATE_INT32(packet_transfer_size, IDEState),
          VMSTATE_END_OF_LIST()
      }
  };

  const VMStateDescription vmstate_ide_drive = {
      .name = "ide_drive",
      .version_id = 3,
      .minimum_version_id = 0,
      .post_load = ide_drive_post_load,
      .fields = (const VMStateField[]) {
          /* ... several fields ... */
          VMSTATE_END_OF_LIST()
      },
      .subsections = (const VMStateDescription * const []) {
          &vmstate_ide_drive_pio_state,
          NULL
      }
  };

Here we have a subsection for the pio state.  We only need to
save/send this state when we are in the middle of a pio operation
(that is what ``ide_drive_pio_state_needed()`` checks).  If DRQ_STAT is
not enabled, the values in those fields are garbage and don't need to
be sent.

Connecting subsections to properties
------------------------------------

Using a condition function that checks a 'property' to determine whether
to send a subsection allows backward migration compatibility when
new subsections are added, especially when combined with versioned
machine types.

For example:

   a) Add a new property using ``DEFINE_PROP_BOOL`` - e.g. support-foo and
      default it to true.
   b) Add an entry to the ``hw_compat_`` for the previous version that sets
      the property to false.
   c) Add a static bool support_foo function that tests the property.
   d) Add a subsection with a .needed set to the support_foo function
   e) (potentially) Add an outer pre_load that sets up a default value
      for 'foo' to be used if the subsection isn't loaded.

Now that subsection will not be generated when using an older
machine type and the migration stream will be accepted by older
QEMU versions.
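
The way steps (a) to (d) fit together can be modelled without any QEMU
infrastructure.  In this sketch, which is an illustration only, a plain
``bool`` member stands in for the device property and an ordinary function
stands in for the machine type's compat entry; all names are hypothetical.

.. code:: c

  #include <stdbool.h>

  /* Hypothetical device; 'support_foo' stands in for a bool property. */
  typedef struct ExampleFooDev {
      bool support_foo;
  } ExampleFooDev;

  /* Step (a): the property defaults to true on current machine types. */
  static void example_foo_init(ExampleFooDev *d)
  {
      d->support_foo = true;
  }

  /* Step (b): an older machine type's compat entry turns it off. */
  static void example_foo_apply_old_machine_compat(ExampleFooDev *d)
  {
      d->support_foo = false;
  }

  /* Steps (c) and (d): the subsection's .needed callback tests it. */
  static bool example_support_foo_needed(void *opaque)
  {
      ExampleFooDev *d = opaque;

      return d->support_foo;
  }

With the compat function applied, the ``.needed`` callback returns false,
so the subsection would be left out of the stream and an older QEMU would
not encounter a subsection it does not know.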

Not sending existing elements
-----------------------------

Sometimes members of the VMState are no longer needed:

  - removing them will break migration compatibility

  - making them version dependent and bumping the version will break backward migration
    compatibility.

Adding a dummy field into the migration stream is normally the best way to preserve
compatibility.

If the field really does need to be removed then:

  a) Add a new property/compatibility/function in the same way as for subsections above.
  b) Replace the VMSTATE macro with the _TEST version of the macro, e.g.:

   ``VMSTATE_UINT32(foo, barstruct)``

   becomes

   ``VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)``

   Sometime in the future when we no longer care about the ancient versions these can be killed off.
   Note that for backward compatibility it's important to fill in the structure with
   data that the destination will understand.

Any difference in the predicates on the source and destination will end up
with different fields being enabled and data being loaded into the wrong
fields; for this reason conditional fields like this are very fragile.

Versions
--------

Version numbers are intended for major incompatible changes to the
migration of a device, and using them breaks backward-migration
compatibility; in general most changes can be made by adding Subsections
(see above) or _TEST macros (see above) which won't break compatibility.

Each version is associated with a series of fields saved.  The ``save_state`` always saves
the state as the newest version.  But ``load_state`` is sometimes able to
load state from an older version.

You can see that there are two version fields:

- ``version_id``: the maximum version_id supported by VMState for that device.
- ``minimum_version_id``: the minimum version_id that VMState is able to understand
  for that device.

VMState is able to read versions from minimum_version_id to version_id.

There are *_V* forms of many ``VMSTATE_`` macros for version-dependent fields,
e.g.

.. code:: c

   VMSTATE_UINT16_V(ip_id, Slirp, 2),

only loads that field for versions 2 and newer.

Saving state will always create a section with the 'version_id' value
and thus can't be loaded by any older QEMU.
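
The two rules in this section, the accepted version range and the
per-field version gate, can be written down as a small standalone model.
The structure and function names below are hypothetical and only mirror the
``version_id`` and ``minimum_version_id`` fields described above.

.. code:: c

  #include <stdbool.h>

  /* Hypothetical summary of what the loading side supports. */
  typedef struct ExampleVersions {
      int version_id;          /* newest version understood (and saved) */
      int minimum_version_id;  /* oldest version still loadable */
  } ExampleVersions;

  /* A section can be loaded only if its version is within the range. */
  static bool example_version_ok(const ExampleVersions *v, int incoming)
  {
      return incoming >= v->minimum_version_id &&
             incoming <= v->version_id;
  }

  /* A field added in 'field_version' is read only from newer streams. */
  static bool example_field_present(int field_version, int incoming)
  {
      return incoming >= field_version;
  }

For a device with ``version_id = 3`` and ``minimum_version_id = 1``, a
version 4 section is rejected, which is the reason a bumped version cannot
be loaded by an older QEMU.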

Massaging functions
-------------------

Sometimes, it is not enough to be able to save the state directly
from one structure; we need to fill in the correct values there first.  One
example is when we are using kvm.  Before saving the cpu state, we
need to ask kvm to copy to QEMU the state that it is using.  And the
opposite when we are loading the state: we need a way to tell kvm to
load the state for the cpu that we have just loaded from the QEMUFile.

The functions to do that are inside a vmstate definition, and are called:

- ``int (*pre_load)(void *opaque);``

  This function is called before we load the state of one device.

- ``int (*post_load)(void *opaque, int version_id);``

  This function is called after we load the state of one device.

- ``int (*pre_save)(void *opaque);``

  This function is called before we save the state of one device.

- ``int (*post_save)(void *opaque);``

  This function is called after we save the state of one device
  (even upon failure, unless the call to pre_save returned an error).

Example: You can look at hpet.c, which uses the first three functions
to massage the state that is transferred.
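
As a rough standalone illustration of the pattern (not the hpet or kvm
code), the sketch below keeps the live value in a stand-in "backend"
structure.  ``pre_save`` copies it into the structure that would be
described to VMState, and ``post_load`` pushes the loaded value back.  All
type and function names are hypothetical; only the callback signatures
match the prototypes listed above.

.. code:: c

  #include <stdint.h>

  /* Stand-in for state owned by an accelerator or backend. */
  typedef struct ExampleBackend {
      uint64_t counter;
  } ExampleBackend;

  /* Hypothetical device: 'counter' is the field that gets migrated. */
  typedef struct ExampleCpu {
      ExampleBackend *backend;
      uint64_t counter;
  } ExampleCpu;

  /* pre_save: pull the live value into the structure that is saved. */
  static int example_pre_save(void *opaque)
  {
      ExampleCpu *c = opaque;

      c->counter = c->backend->counter;
      return 0;
  }

  /* post_load: push the value just loaded back to where it is used. */
  static int example_post_load(void *opaque, int version_id)
  {
      ExampleCpu *c = opaque;

      (void)version_id;
      c->backend->counter = c->counter;
      return 0;
  }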
433edd70806SDr. David Alan Gilbert
434edd70806SDr. David Alan GilbertThe ``VMSTATE_WITH_TMP`` macro may be useful when the migration
435edd70806SDr. David Alan Gilbertdata doesn't match the stored device data well; it allows an
436edd70806SDr. David Alan Gilbertintermediate temporary structure to be populated with migration
437edd70806SDr. David Alan Gilbertdata and then transferred to the main structure.
438edd70806SDr. David Alan Gilbert
439ad2b6523SBernhard BeschowIf you use memory or portio_list API functions that update memory layout outside
440edd70806SDr. David Alan Gilbertinitialization (i.e., in response to a guest action), this is a strong
4414df3a7bfSPeter Maydellindication that you need to call these functions in a ``post_load`` callback.
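As a rough illustration of this pattern, the standalone sketch below pairs a
``pre_save`` hook that snapshots live state with a ``post_load`` hook that
redoes a layout update.  ``ToyRegion``, ``ToyBridge`` and every ``toy_*``
function are hypothetical stand-ins invented for the example (for instance,
``toy_region_set_enabled()`` plays the role of ``memory_region_set_enabled()``);
only the two hook signatures are taken from the list above.  The point of the
design is that the guest write path and ``post_load`` share one function, so
the layout cannot drift from the migrated register value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for MemoryRegion and memory_region_set_enabled(). */
typedef struct ToyRegion {
    bool enabled;
} ToyRegion;

static void toy_region_set_enabled(ToyRegion *mr, bool enabled)
{
    mr->enabled = enabled;
}

typedef struct ToyBridge {
    uint8_t live_ctrl;   /* live register value, e.g. held by an accelerator */
    uint8_t ctrl;        /* migrated copy; bit 0 enables the window */
    ToyRegion window;    /* layout state, never sent in the stream */
} ToyBridge;

/* Guest write path: the layout changes in response to a guest action. */
static void toy_bridge_write_ctrl(ToyBridge *b, uint8_t val)
{
    b->live_ctrl = val;
    toy_region_set_enabled(&b->window, val & 1);
}

/* pre_save: copy the live value into the field that gets migrated. */
static int toy_bridge_pre_save(void *opaque)
{
    ToyBridge *b = opaque;

    assert(b);
    b->ctrl = b->live_ctrl;
    return 0;
}

/* post_load: push the loaded value back and redo the layout update. */
static int toy_bridge_post_load(void *opaque, int version_id)
{
    ToyBridge *b = opaque;

    assert(b);
    (void)version_id;   /* unused in this sketch */
    toy_bridge_write_ctrl(b, b->ctrl);
    return 0;
}
```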
Examples of such API functions are:

  - memory_region_add_subregion()
  - memory_region_del_subregion()
  - memory_region_set_readonly()
  - memory_region_set_nonvolatile()
  - memory_region_set_enabled()
  - memory_region_set_address()
  - memory_region_set_alias_offset()
  - portio_list_set_address()
  - portio_list_set_enabled()

Iterative device migration
--------------------------

Some devices, such as RAM or certain platform devices, have so much
data that the CPUs would be paused for too long if it were all sent in
one section.  For these devices an *iterative* approach is taken.

The iterative devices generally don't use VMState macros
(although it may be possible in some cases) and instead use the
``qemu_put_*``/``qemu_get_*`` functions to write data to and read data
from the stream.  Specialist versions exist for high bandwidth IO.

An iterative device must provide:

  - A ``save_setup`` function that initialises the data structures and
    transmits a first section containing information on the device.  In the
    case of RAM this transmits a list of RAMBlocks and sizes.

  - A ``load_setup`` function that initialises the data structures on the
    destination.

  - A ``state_pending_exact`` function that indicates how much more
    data we must save.  The core migration code will use this to
    determine when to pause the CPUs and complete the migration.

  - A ``state_pending_estimate`` function that estimates how much more
    data we must save.  When the estimated amount is smaller than the
    threshold, we call ``state_pending_exact``.

  - A ``save_live_iterate`` function that sends a chunk of data until
    the point that stream bandwidth limits tell it to stop.  Each call
    generates one section.

  - A ``save_live_complete_precopy`` function that must transmit the
    last section for the device containing any remaining data.

  - A ``load_state`` function used to load sections generated by
    any of the save functions that generate sections.

  - ``cleanup`` functions for both save and load that are called
    at the end of migration.

Note that the contents of the sections for iterative migration tend
to be open-coded by the devices; care should be taken in parsing
the results and structuring the stream to make them easy to validate.

Device ordering
---------------

There are cases in which the ordering of device loading matters; for
example, in some systems a device may assert an interrupt during loading,
and if the interrupt controller is loaded later then it might lose the state.

Some ordering is implicitly provided by the order in which the machine
definition creates devices; however, this is somewhat fragile.

The ``MigrationPriority`` enum provides a means of explicitly enforcing
ordering.  Numerically higher priorities are loaded earlier.
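The effect of this rule can be sketched with a small standalone example that
sorts a table of devices for loading.  The ``ToyPriority`` values, the
``ToyEntry`` structure and the ``toy_*`` functions are hypothetical and do not
reflect the real ``MigrationPriority`` values or how the core code maintains
its handler list; the sketch only assumes that devices of equal priority keep
their creation order, which matches the implicit ordering described above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical priority values; the real MigrationPriority enum differs. */
enum ToyPriority {
    TOY_PRI_DEFAULT = 0,
    TOY_PRI_BUS = 1,
    TOY_PRI_INTC = 2,
};

typedef struct ToyEntry {
    const char *name;
    int priority;
    int created;   /* creation order, used to break ties */
} ToyEntry;

static int toy_cmp(const void *a, const void *b)
{
    const ToyEntry *x = a;
    const ToyEntry *y = b;

    if (x->priority != y->priority) {
        return y->priority - x->priority;   /* higher priority first */
    }
    return x->created - y->created;         /* then creation order */
}

/* Reorder the table into the order in which devices would be loaded. */
static void toy_sort_for_load(ToyEntry *e, size_t n)
{
    assert(e != NULL || n == 0);
    qsort(e, n, sizeof(*e), toy_cmp);
}
```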
The priority is set by setting the ``priority`` field of the top level
``VMStateDescription`` for the device.

Stream structure
================

The stream tries to be word-size and endian agnostic, allowing migration between hosts
of different characteristics running the same VM.

  - Header

    - Magic
    - Version
    - VM configuration section

       - Machine type
       - Target page bits
  - List of sections
    Each section contains a device, or one iteration of a device save.

    - section type
    - section id
    - ID string (First section of each device)
    - instance id (First section of each device)
    - version id (First section of each device)
    - <device data>
    - Footer mark
  - EOF mark
  - VM Description structure
    Consisting of a JSON description of the contents for analysis only

The ``device data`` in each section consists of the data produced
by the code described above.  Non-iterative devices have a single
section; iterative devices have an initial and last section and a set
of parts in between.
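To illustrate how an iterative device ends up with an initial section, a set
of parts and a last section, the standalone sketch below simulates the save
side with a hypothetical ``ToyIter`` device.  The ``toy_*`` names, the byte
counter standing in for dirty data, the 4 KiB rounding in the estimate and the
``<=`` threshold comparison are all assumptions made for the example.  Real
devices also have to cope with the guest dirtying more data while the loop
runs, which this sketch does not model.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical iterative device: a byte counter stands in for dirty data. */
typedef struct ToyIter {
    uint64_t remaining;   /* bytes still to send */
    unsigned sections;    /* sections emitted so far */
} ToyIter;

static void toy_save_setup(ToyIter *s, uint64_t total)
{
    s->remaining = total;
    s->sections = 1;      /* setup emits the initial section */
}

/* Cheap guess; here it deliberately rounds up to a 4 KiB multiple. */
static uint64_t toy_state_pending_estimate(const ToyIter *s)
{
    return (s->remaining + 4095) & ~UINT64_C(4095);
}

static uint64_t toy_state_pending_exact(const ToyIter *s)
{
    return s->remaining;
}

/* Send at most 'budget' bytes (the bandwidth limit); one section per call. */
static void toy_save_live_iterate(ToyIter *s, uint64_t budget)
{
    uint64_t n = s->remaining < budget ? s->remaining : budget;

    s->remaining -= n;
    s->sections++;
}

/* CPUs are paused by now: flush whatever is left in the last section. */
static void toy_save_live_complete_precopy(ToyIter *s)
{
    s->remaining = 0;
    s->sections++;
}

/* Stand-in for the core loop; returns the number of sections emitted. */
static unsigned toy_migrate_iterative(uint64_t total, uint64_t budget,
                                      uint64_t threshold)
{
    ToyIter s;

    assert(budget > 0);
    toy_save_setup(&s, total);
    for (;;) {
        /* Only ask for the exact figure once the estimate looks small. */
        if (toy_state_pending_estimate(&s) <= threshold &&
            toy_state_pending_exact(&s) <= threshold) {
            break;
        }
        toy_save_live_iterate(&s, budget);
    }
    toy_save_live_complete_precopy(&s);
    return s.sections;
}
```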
Note that there is very little checking by the common code of the integrity
of the ``device data`` contents; that's up to the devices themselves.
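As a hedged illustration of the per-section framing in the list above, the
following standalone sketch writes one complete section for a non-iterative
device into a byte buffer.  The ``ToyBuf`` type, the ``toy_*`` helpers, the
marker values and the field widths (one byte for the section type and string
length, 32-bit big-endian integers for the ids, and the section id repeated
after the footer mark) are assumptions made for the example rather than
something stated in this document.  Writing every integer most significant
byte first is what keeps such a layout independent of host word size and
endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative marker values; assumptions for this sketch. */
enum { TOY_SECTION_FULL = 0x04, TOY_SECTION_FOOTER = 0x7e };

typedef struct ToyBuf {
    uint8_t data[256];
    size_t len;
} ToyBuf;

static void toy_put_byte(ToyBuf *b, uint8_t v)
{
    assert(b->len < sizeof(b->data));
    b->data[b->len++] = v;
}

/* Fixed width, most significant byte first: independent of the host. */
static void toy_put_be32(ToyBuf *b, uint32_t v)
{
    toy_put_byte(b, (uint8_t)(v >> 24));
    toy_put_byte(b, (uint8_t)(v >> 16));
    toy_put_byte(b, (uint8_t)(v >> 8));
    toy_put_byte(b, (uint8_t)v);
}

static uint32_t toy_get_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Emit one complete section for a non-iterative device. */
static void toy_put_section(ToyBuf *b, uint32_t section_id, const char *idstr,
                            uint32_t instance_id, uint32_t version_id,
                            const uint8_t *payload, size_t payload_len)
{
    size_t i;
    size_t n = strlen(idstr);

    assert(n <= UINT8_MAX);
    toy_put_byte(b, TOY_SECTION_FULL);       /* section type */
    toy_put_be32(b, section_id);             /* section id */
    toy_put_byte(b, (uint8_t)n);             /* ID string, length-prefixed */
    for (i = 0; i < n; i++) {
        toy_put_byte(b, (uint8_t)idstr[i]);
    }
    toy_put_be32(b, instance_id);            /* instance id */
    toy_put_be32(b, version_id);             /* version id */
    for (i = 0; i < payload_len; i++) {      /* <device data> */
        toy_put_byte(b, payload[i]);
    }
    toy_put_byte(b, TOY_SECTION_FOOTER);     /* footer mark */
    toy_put_be32(b, section_id);             /* id repeated (assumption) */
}
```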
The ``footer mark`` provides a little bit of protection for the case where
the receiving side reads more or less data than expected.

The ``ID string`` is normally unique, having been formed from a bus name
and device address; PCI devices and storage devices hung off PCI controllers
fit this pattern well.  Some devices are fixed single instances (e.g. "pc-ram").
Others (especially either older devices or system devices which for
some reason don't have a bus concept) make use of the ``instance id``
for otherwise identically named devices.

Return path
-----------

Only a unidirectional stream is required for normal migration; however, a
``return path`` can be created when bidirectional communication is desired.
This is primarily used by postcopy, but is also used to return a success
flag to the source at the end of migration.

``qemu_file_get_return_path(QEMUFile* fwdpath)`` gives the ``QEMUFile *`` for the return
path.

  Source side

     Forward path - written by migration thread
     Return path  - opened by main thread, read by return-path thread

  Destination side

     Forward path - read by main thread
     Return path  - opened by main thread, written by main thread AND postcopy
     thread (protected by rp_mutex)