=====================
VFIO device Migration
=====================

Migration of a virtual machine involves saving the state of each device that
the guest is running on the source host and restoring this saved state on the
destination host. This document details how saving and restoring of VFIO
devices is done in QEMU.

Migration of VFIO devices consists of two phases: the optional pre-copy phase
and the stop-and-copy phase. The pre-copy phase is iterative and accommodates
VFIO devices that have a large amount of data to transfer. It allows the guest
to continue running while the VFIO device state is transferred to the
destination, which helps to reduce the total downtime of the VM. VFIO devices
can choose to skip the pre-copy phase of migration by returning pending_bytes
as zero during the pre-copy phase.

A detailed description of the UAPI for VFIO device migration can be found in
the comment for the ``vfio_device_migration_info`` structure in the header
file linux-headers/linux/vfio.h.

VFIO implements the device hooks for the iterative approach as follows:

* A ``save_setup`` function that sets up the migration region and sets the
  _SAVING flag in the VFIO device state.

* A ``load_setup`` function that sets up the migration region on the
  destination and sets the _RESUMING flag in the VFIO device state.

* A ``save_live_pending`` function that reads pending_bytes from the vendor
  driver, which indicates the amount of data that the vendor driver has yet to
  save for the VFIO device.

* A ``save_live_iterate`` function that reads the VFIO device's data from the
  vendor driver through the migration region during the iterative phase.

* A ``save_state`` function to save the device config space if it is present.

* A ``save_live_complete_precopy`` function that clears the _RUNNING flag in
  the VFIO device state and iteratively copies the remaining data for the VFIO
  device until the vendor driver indicates that no data remains (pending bytes
  is zero).

* A ``load_state`` function that loads the config section and the data
  sections that are generated by the save functions above.

* ``cleanup`` functions for both save and load that perform any
  migration-related cleanup, including unmapping the migration region.

The VFIO migration code uses a VM state change handler to change the VFIO
device state when the VM state changes from running to not-running, and
vice versa.
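
As a rough illustration of how the save-side hooks described above interact,
the following toy model may help. It is only a sketch: ``ToyVfioDevice``,
``chunk_size`` and ``migrate`` are hypothetical names that do not exist in
QEMU, and the model assumes that pending data only shrinks, whereas a running
guest can keep producing new device state during pre-copy.

```python
from dataclasses import dataclass, field


@dataclass
class ToyVfioDevice:
    """Hypothetical stand-in for a VFIO device and its vendor driver."""

    pending_bytes: int              # data the driver has yet to save
    chunk_size: int = 4096          # bytes copied per iteration (assumed)
    saved: list = field(default_factory=list)

    def save_live_pending(self) -> int:
        # Plays the role of save_live_pending: report the remaining data.
        return self.pending_bytes

    def save_live_iterate(self) -> int:
        # Plays the role of save_live_iterate: copy one chunk of data.
        n = min(self.chunk_size, self.pending_bytes)
        self.pending_bytes -= n
        self.saved.append(n)
        return n

    def save_live_complete_precopy(self) -> int:
        # The device has stopped running: drain until nothing is pending.
        total = 0
        while self.save_live_pending() > 0:
            total += self.save_live_iterate()
        return total


def migrate(dev: ToyVfioDevice, threshold: int) -> tuple:
    """Return (bytes copied in pre-copy, bytes copied in stop-and-copy)."""
    precopy = 0
    while True:
        pending = dev.save_live_pending()
        # Leave pre-copy once the pending data drops below the threshold.
        if pending == 0 or pending < threshold:
            break
        precopy += dev.save_live_iterate()
    stopcopy = dev.save_live_complete_precopy()
    return precopy, stopcopy
```

With 10000 pending bytes, 4096-byte chunks and a threshold of 4096, this model
copies 8192 bytes while the guest runs and the remaining 1808 bytes after it
stops.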

Similarly, a migration state change handler is used to trigger a transition of
the VFIO device state when certain changes of the migration state occur. For
example, the VFIO device state is transitioned back to _RUNNING in case a
migration failed or was canceled.

System memory dirty pages tracking
----------------------------------

The ``log_global_start`` and ``log_global_stop`` memory listener callbacks
inform the VFIO IOMMU module to start and stop dirty page tracking. A
``log_sync`` memory listener callback marks as dirty those system memory pages
that are used for DMA by the VFIO device. The dirty pages bitmap is queried per
container. All pages pinned by the vendor driver through external APIs have to
be marked as dirty during migration. When there are CPU writes, CPU dirty page
tracking can identify dirtied pages, but any page pinned by the vendor driver
can also be written by the device. There is currently no device or IOMMU
support for dirty page tracking in hardware.

By default, dirty pages are tracked when the device is in the pre-copy phase as
well as the stop-and-copy phase. So, a page pinned by the vendor driver will be
copied to the destination in both phases. Copying dirty pages in the pre-copy
phase helps QEMU to predict whether it can achieve its downtime tolerances. If
QEMU keeps finding dirty pages during the pre-copy phase, it understands that
it is likely to find dirty pages in the stop-and-copy phase as well and can
predict the downtime accordingly.

QEMU also provides a per-device opt-out option,
``pre-copy-dirty-page-tracking``, which disables querying the dirty bitmap
during the pre-copy phase. If it is set to off, all dirty pages will be copied
to the destination in the stop-and-copy phase only.

System memory dirty pages tracking when vIOMMU is enabled
---------------------------------------------------------

With vIOMMU, an IO virtual address range can get unmapped while in the pre-copy
phase of migration. In that case, the unmap ioctl returns any dirty pages in
that range and QEMU reports the corresponding guest physical pages as dirty.
During the stop-and-copy phase, an IOMMU notifier is used to get a callback for
mapped pages, and the dirty pages bitmap is then fetched from the VFIO IOMMU
modules for those mapped ranges.

Flow of state changes during Live migration
===========================================

Below is the flow of state change during live migration.
The values in the brackets represent the VM state, the migration state, and
the VFIO device state, respectively.
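
The device-state entry of each triple is a combination of flags such as
_RUNNING and _SAVING. A minimal sketch of how such triples could be modelled
is given here. The bit values are an assumption patterned on the v1 VFIO
migration UAPI, and ``DeviceState``, ``clear_running`` and ``describe`` are
hypothetical names that are not part of QEMU.

```python
from enum import IntFlag


class DeviceState(IntFlag):
    """VFIO device state flags (bit values are an assumption)."""

    STOPPED = 0
    RUNNING = 1 << 0
    SAVING = 1 << 1
    RESUMING = 1 << 2


# Single-bit flags, in the order they are printed.
_FLAGS = (DeviceState.RUNNING, DeviceState.SAVING, DeviceState.RESUMING)


def clear_running(dev: DeviceState) -> DeviceState:
    """Drop the RUNNING bit, as happens when the vCPUs are stopped."""
    return dev & ~DeviceState.RUNNING


def describe(vm_state: str, migration_state: str, dev: DeviceState) -> str:
    """Format one (VM state, migration state, device state) triple."""
    names = [f"_{flag.name}" for flag in _FLAGS if flag in dev]
    dev_text = "|".join(names) if names else "_STOPPED"
    return f"({vm_state}, {migration_state}, {dev_text})"


if __name__ == "__main__":
    precopy = DeviceState.RUNNING | DeviceState.SAVING
    print(describe("RUNNING", "_ACTIVE", precopy))
    print(describe("FINISH_MIGRATE", "_DEVICE", clear_running(precopy)))
```

Treating the state as a bit-field makes the move from pre-copy to
stop-and-copy a single bit clear: ``_RUNNING|_SAVING`` becomes ``_SAVING``.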

Live migration save path
------------------------

::

                        QEMU normal running state
                        (RUNNING, _NONE, _RUNNING)
                                  |
                     migrate_init spawns migration_thread
                Migration thread then calls each device's .save_setup()
                    (RUNNING, _SETUP, _RUNNING|_SAVING)
                                  |
                    (RUNNING, _ACTIVE, _RUNNING|_SAVING)
             If device is active, get pending_bytes by .save_live_pending()
          If total pending_bytes >= threshold_size, call .save_live_iterate()
                  Data of VFIO device for pre-copy phase is copied
        Iterate till total pending bytes converge and are less than threshold
                                  |
  On migration completion, vCPU stops and calls .save_live_complete_precopy for
   each active device. The VFIO device is then transitioned into _SAVING state
                   (FINISH_MIGRATE, _DEVICE, _SAVING)
                                  |
     For the VFIO device, iterate in .save_live_complete_precopy until
                         pending data is 0
                   (FINISH_MIGRATE, _DEVICE, _STOPPED)
                                  |
                 (FINISH_MIGRATE, _COMPLETED, _STOPPED)
             Migration thread schedules cleanup bottom half and exits

Live migration resume path
--------------------------

::

              Incoming migration calls .load_setup for each device
                       (RESTORE_VM, _ACTIVE, _STOPPED)
                                 |
       For each device, .load_state is called for that device section data
                       (RESTORE_VM, _ACTIVE, _RESUMING)
                                 |
    At the end, .load_cleanup is called for each device and vCPUs are started
                       (RUNNING, _NONE, _RUNNING)

Postcopy
========

Postcopy migration is currently not supported for VFIO devices.