=====================
VFIO device Migration
=====================

Migration of a virtual machine involves saving the state of each device that
the guest is using on the source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.

Migration of VFIO devices consists of two phases: the optional pre-copy phase
and the stop-and-copy phase. The pre-copy phase is iterative and accommodates
VFIO devices that have a large amount of data to transfer. Because the guest
continues to run while the VFIO device state is transferred to the
destination, the iterative pre-copy phase helps to reduce the total downtime
of the VM. A VFIO device can skip the pre-copy phase by returning
pending_bytes as zero during that phase.

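As a rough illustration of the two phases, the sketch below models a device
with a fixed amount of pending data. ``FakeVfioDevice``, ``migrate``, the
threshold and the chunk size are hypothetical names and values chosen for this
example, not QEMU interfaces. The model also ignores device state that may
change again while the guest keeps running.

```python
from dataclasses import dataclass


@dataclass
class FakeVfioDevice:
    """Hypothetical stand-in for a vendor driver; not a QEMU type."""
    pending_bytes: int            # data the driver still has to save
    supports_precopy: bool = True

    def report_pending(self, precopy: bool) -> int:
        # A device skips pre-copy by reporting zero pending bytes in that phase.
        if precopy and not self.supports_precopy:
            return 0
        return self.pending_bytes

    def read_chunk(self, max_bytes: int) -> int:
        sent = min(max_bytes, self.pending_bytes)
        self.pending_bytes -= sent
        return sent


def migrate(dev, threshold, chunk):
    """Return (bytes sent while the guest runs, bytes sent during downtime).

    Both threshold and chunk are assumed to be positive.
    """
    precopy_sent = 0
    # Pre-copy: the guest keeps running; iterate while enough data is pending.
    while dev.report_pending(precopy=True) >= threshold:
        precopy_sent += dev.read_chunk(chunk)
    # Stop-and-copy: the guest is stopped; drain everything that is left.
    downtime_sent = 0
    while dev.report_pending(precopy=False) > 0:
        downtime_sent += dev.read_chunk(chunk)
    return precopy_sent, downtime_sent
```

With 1000 pending bytes, a threshold of 150 and a chunk size of 300, this model
moves 900 bytes during pre-copy and only 100 bytes during downtime. A device
that opts out of pre-copy moves all 1000 bytes during downtime.
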
A detailed description of the UAPI for VFIO device migration can be found in
the comment for the ``vfio_device_migration_info`` structure in the header
file linux-headers/linux/vfio.h.

VFIO implements the device hooks for the iterative approach as follows:

* A ``save_setup`` function that sets up the migration region and sets the
  _SAVING flag in the VFIO device state.

* A ``load_setup`` function that sets up the migration region on the
  destination and sets the _RESUMING flag in the VFIO device state.

* A ``save_live_pending`` function that reads pending_bytes from the vendor
  driver, which indicates the amount of data that the vendor driver has yet to
  save for the VFIO device.

* A ``save_live_iterate`` function that reads the VFIO device's data from the
  vendor driver through the migration region during the iterative phase.

* A ``save_state`` function to save the device config space if it is present.

* A ``save_live_complete_precopy`` function that clears the _RUNNING flag in
  the VFIO device state and iteratively copies the remaining data for the VFIO
  device until the vendor driver indicates that no data remains (pending bytes
  is zero).

* A ``load_state`` function that loads the config section and the data
  sections that are generated by the save functions above.

* ``cleanup`` functions for both save and load that perform any
  migration-related cleanup, including unmapping the migration region.


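The relationship between the save hooks and ``load_state`` can be pictured as
a tagged stream of sections. The sketch below is a simplified model only: the
``CONFIG`` and ``DATA`` tags, the function names ``save_side`` and
``load_state`` as free functions, and the ordering of the sections are
assumptions made for this example and do not reflect QEMU's wire format.

```python
CONFIG, DATA = "config", "data"   # hypothetical section tags


def save_side(device_data, config, chunk):
    """Emit the sections a source might produce for one device."""
    stream = []
    # save_live_iterate / save_live_complete_precopy: device data in chunks.
    for off in range(0, len(device_data), chunk):
        stream.append((DATA, device_data[off:off + chunk]))
    # save_state: config space, only if the device has one.
    if config is not None:
        stream.append((CONFIG, config))
    return stream


def load_state(stream):
    """Consume both kinds of section and rebuild (device data, config)."""
    data = bytearray()
    config = None
    for tag, payload in stream:
        if tag == DATA:
            data += payload
        elif tag == CONFIG:
            config = payload
        else:
            raise ValueError(f"unknown section tag: {tag!r}")
    return bytes(data), config
```
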
The VFIO migration code uses a VM state change handler to change the VFIO
device state when the VM state changes from running to not-running, and
vice versa.

Similarly, a migration state change handler is used to trigger a transition of
the VFIO device state when certain changes of the migration state occur. For
example, the VFIO device state is transitioned back to _RUNNING if a
migration fails or is canceled.

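The two handlers can be modelled as small functions over a bit field. This is
a minimal sketch: the bit values are assumed to match the flag layout described
in linux-headers/linux/vfio.h and should be checked there, while the function
names and the status strings ``"failed"`` and ``"canceled"`` are hypothetical.

```python
# Device state bits. The values are an assumption of this sketch; consult
# linux-headers/linux/vfio.h for the authoritative definitions.
RUNNING = 1 << 0
SAVING = 1 << 1
RESUMING = 1 << 2


def vm_state_change(dev_state, vm_running):
    """VM state change handler: mirror the VM run state in the _RUNNING bit."""
    if vm_running:
        return dev_state | RUNNING
    return dev_state & ~RUNNING


def migration_state_change(dev_state, status):
    """Migration state change handler: roll back on failure or cancel."""
    if status in ("failed", "canceled"):
        # Drop the migration-related bits and let the device run again.
        return (dev_state & ~(SAVING | RESUMING)) | RUNNING
    return dev_state
```
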
System memory dirty pages tracking
----------------------------------

The ``log_global_start`` and ``log_global_stop`` memory listener callbacks
inform the VFIO IOMMU module to start and stop dirty page tracking. The
``log_sync`` memory listener callback marks as dirty those system memory pages
that are used for DMA by the VFIO device. The dirty pages bitmap is queried per
container. All pages pinned by the vendor driver through external APIs have to
be marked as dirty during migration. CPU dirty page tracking can identify pages
dirtied by CPU writes, but any page pinned by the vendor driver can also be
written by the device. There is currently no device or IOMMU support for dirty
page tracking in hardware.

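The effect of a sync can be sketched with bitmaps held in plain integers, where
bit ``n`` stands for page ``n``. The helper names below are hypothetical, and
the model assumes that every pinned page is conservatively reported as dirty
because the device may have written to it.

```python
def pages_to_bitmap(pages):
    """Build a bitmap with one bit set per page number."""
    bitmap = 0
    for page in pages:
        bitmap |= 1 << page
    return bitmap


def bitmap_to_pages(bitmap):
    """List the page numbers whose bit is set, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


def log_sync(cpu_dirty_bitmap, pinned_pages):
    """Merge CPU-dirtied pages with pages pinned for device DMA."""
    return cpu_dirty_bitmap | pages_to_bitmap(pinned_pages)
```

For example, if the CPU dirtied pages 0 and 2 and the vendor driver pinned
pages 1 and 4, the merged bitmap covers pages 0, 1, 2 and 4.
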
By default, dirty pages are tracked during both the pre-copy and the
stop-and-copy phase. So, a page pinned by the vendor driver will be copied to
the destination in both phases. Copying dirty pages in the pre-copy phase helps
QEMU to predict whether it can achieve its downtime tolerances. If QEMU keeps
finding dirty pages continuously during the pre-copy phase, it can expect to
find dirty pages in the stop-and-copy phase as well and predict the downtime
accordingly.

QEMU also provides a per-device opt-out option, ``pre-copy-dirty-page-tracking``,
which disables querying the dirty bitmap during the pre-copy phase. If it is
set to off, all dirty pages will be copied to the destination in the
stop-and-copy phase only.

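The trade-off behind this option can be estimated with a toy calculation. The
function below is hypothetical and assumes that each dirty bitmap query during
pre-copy reports every pinned page as dirty, so each pinned page is copied
again after every query.

```python
def pinned_page_dirty_copies(pinned_pages, precopy_syncs,
                             pre_copy_dirty_page_tracking=True):
    """Return (copies during pre-copy, copies during stop-and-copy)."""
    if pre_copy_dirty_page_tracking:
        precopy_copies = pinned_pages * precopy_syncs
    else:
        # The dirty bitmap is not queried during pre-copy at all.
        precopy_copies = 0
    # Pinned pages are always copied once the device has stopped.
    return precopy_copies, pinned_pages
```
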
System memory dirty pages tracking when vIOMMU is enabled
---------------------------------------------------------

With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
phase of migration. In that case, the unmap ioctl returns any dirty pages in
that range and QEMU reports the corresponding guest physical pages as dirty.
During the stop-and-copy phase, an IOMMU notifier is used to get a callback for
mapped pages, and the dirty pages bitmap is then fetched from the VFIO IOMMU
module for those mapped ranges.

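The unmap case can be sketched as a translation from a per-range dirty bitmap
to guest physical addresses. Everything here is an assumption made for
illustration: the ``Mapping`` type, the 4 KiB page size, and the idea that each
IOVA range maps linearly onto one contiguous guest physical range.

```python
from typing import NamedTuple

PAGE_SIZE = 4096   # assumed page size for this sketch


class Mapping(NamedTuple):
    """Hypothetical record of one IOVA range and its backing guest memory."""
    iova: int
    gpa: int
    npages: int


def unmap_and_collect_dirty(mappings, iova, dirty_bitmap):
    """Remove a mapping and return the guest physical pages it dirtied.

    Bit i of dirty_bitmap stands for page i of the unmapped IOVA range.
    """
    m = mappings.pop(iova)
    return [m.gpa + i * PAGE_SIZE
            for i in range(m.npages)
            if (dirty_bitmap >> i) & 1]
```
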
Flow of state changes during Live migration
===========================================

Below is the flow of state changes during live migration.
The values in the brackets represent the VM state, the migration state, and
the VFIO device state, respectively.

Live migration save path
------------------------

::

                        QEMU normal running state
                        (RUNNING, _NONE, _RUNNING)
                                  |
                     migrate_init spawns migration_thread
                Migration thread then calls each device's .save_setup()
                    (RUNNING, _SETUP, _RUNNING|_SAVING)
                                  |
                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
             If device is active, get pending_bytes by .save_live_pending()
          If total pending_bytes >= threshold_size, call .save_live_iterate()
                  Data of VFIO device for pre-copy phase is copied
        Iterate till total pending bytes converge and are less than threshold
                                  |
  On migration completion, vCPU stops and calls .save_live_complete_precopy for
   each active device. The VFIO device is then transitioned into _SAVING state
                   (FINISH_MIGRATE, _DEVICE, _SAVING)
                                  |
     For the VFIO device, iterate in .save_live_complete_precopy until
                         pending data is 0
                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
                                  |
                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
             Migration thread schedules cleanup bottom half and exits

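The save path can also be replayed as a short trace of (VM state, migration
state, VFIO device state) tuples. The sketch below is a simplified model: the
bit values are assumed, and the helper names ``fmt`` and ``save_path_trace``
are hypothetical.

```python
RUNNING, SAVING = 1 << 0, 1 << 1   # assumed device state bits


def fmt(dev_state):
    """Render device state bits the way the diagram writes them."""
    names = [name for bit, name in ((RUNNING, "_RUNNING"), (SAVING, "_SAVING"))
             if dev_state & bit]
    return "|".join(names) or "_STOPPED"


def save_path_trace():
    vm, mig, dev = "RUNNING", "_NONE", RUNNING
    trace = [(vm, mig, fmt(dev))]

    # .save_setup() sets the _SAVING flag.
    mig, dev = "_SETUP", dev | SAVING
    trace.append((vm, mig, fmt(dev)))

    # Iterative pre-copy runs with the guest still executing.
    mig = "_ACTIVE"
    trace.append((vm, mig, fmt(dev)))

    # vCPUs stop; the VM state change handler clears _RUNNING.
    vm, mig, dev = "FINISH_MIGRATE", "_DEVICE", dev & ~RUNNING
    trace.append((vm, mig, fmt(dev)))

    # .save_live_complete_precopy() has drained all pending data.
    dev &= ~SAVING
    trace.append((vm, mig, fmt(dev)))

    mig = "_COMPLETED"
    trace.append((vm, mig, fmt(dev)))
    return trace
```
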
Live migration resume path
--------------------------

::

              Incoming migration calls .load_setup for each device
                       (RESTORE_VM, _ACTIVE, _STOPPED)
                                 |
       For each device, .load_state is called for that device section data
                       (RESTORE_VM, _ACTIVE, _RESUMING)
                                 |
    At the end, .load_cleanup is called for each device and vCPUs are started
                       (RUNNING, _NONE, _RUNNING)

Postcopy
========

Postcopy migration is currently not supported for VFIO devices.