Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge). Its plus side is that there is an upper bound on
the amount of migration traffic and time it takes; the down side is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.
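
The two monitor commands above are HMP commands. When scripting, the same
steps are usually sent over QMP. The following is a minimal sketch that only
builds the JSON messages, assuming the QMP commands
``migrate-set-capabilities`` and ``migrate-start-postcopy`` as the
counterparts of the HMP commands; the helper names are hypothetical and the
connection to the QMP socket is left out.

```python
import json


def set_capability(name: str, enabled: bool) -> dict:
    """Build a QMP 'migrate-set-capabilities' request for one capability."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {"capabilities": [{"capability": name, "state": enabled}]},
    }


def start_postcopy() -> dict:
    """Build the QMP request that switches a running precopy to postcopy."""
    return {"execute": "migrate-start-postcopy"}


# Sent to both source and destination before the migration starts.
enable_msg = set_capability("postcopy-ram", True)
# Sent to the source at any point after the migration has started.
switch_msg = start_postcopy()

print(json.dumps(enable_msg))
print(json.dumps(switch_msg))
```

Each message would be written as one JSON object to the QMP channel of the
relevant QEMU instance.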

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to a page fault.
That metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU. These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved with the query-migrate QMP command.
The postcopy-blocktime value of the QMP command shows the overlapped blocking
time for all vCPUs; postcopy-vcpu-blocktime shows the list of blocking
times per vCPU.

.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
  the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM
that may trigger faults that have to be resolved by the source. As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up.  This is achieved by
'packaging' the device data into a blob that's read in one go.
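
The idea of reading the whole package before processing it can be sketched as
follows. This is only an illustration: the 4-byte length prefix and the
payload contents are assumptions made for the example and are not QEMU's wire
format, and ``write_package`` / ``read_package`` are hypothetical names.

```python
import io
import struct


def write_package(stream: io.BytesIO, payload: bytes) -> None:
    """Frame the payload with a 4-byte big-endian length (illustrative only)."""
    stream.write(struct.pack(">I", len(payload)))
    stream.write(payload)


def read_package(stream: io.BytesIO) -> io.BytesIO:
    """Read the whole package in one go and return it as an in-memory stream."""
    (length,) = struct.unpack(">I", stream.read(4))
    blob = stream.read(length)
    if len(blob) != length:
        raise EOFError("truncated package")
    return io.BytesIO(blob)


wire = io.BytesIO()
write_package(wire, b"LISTEN|device-state|RUN")
wire.write(b"page-data")  # page data that follows the package on the stream
wire.seek(0)

package = read_package(wire)
package_bytes = package.getvalue()
# The package can now be parsed from memory, while the wire stream is
# already positioned at the page data that follows it.
remaining = wire.read()
```

Because the device data is consumed from the in-memory copy, whoever reads
the stream next sees page data immediately, even while the device load is
still in progress.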

Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

 - Command: 'postcopy listen'
 - The device state

   A series of sections, identical to the precopy stream's device state stream
   containing everything except postcopiable devices (i.e. RAM)
 - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy),
however when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly in the hope that those pages are likely to be used
by the destination soon.
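
The scanning behaviour described above can be modelled with a toy scanner
over a list of dirty flags. This is a sketch rather than QEMU's
implementation: the class and method names are hypothetical, and wrapping
around to the start of the bitmap after reaching the end is an assumption made
to keep the example complete.

```python
from typing import List, Optional


class DirtyScanner:
    """Toy model of the source-side scan over a dirty bitmap."""

    def __init__(self, dirty: List[bool]) -> None:
        self.dirty = dirty
        self.pos = 0  # index where the next scan starts

    def request(self, page: int) -> None:
        """A page request from the destination restarts the scan at that page."""
        self.pos = page

    def next_page(self) -> Optional[int]:
        """Return the next dirty page at or after pos (wrapping), clearing its bit."""
        n = len(self.dirty)
        for step in range(n):
            idx = (self.pos + step) % n
            if self.dirty[idx]:
                self.dirty[idx] = False
                self.pos = (idx + 1) % n
                return idx
        return None  # nothing left to send


scanner = DirtyScanner([True] * 8)
sent = [scanner.next_page(), scanner.next_page()]  # background sending
scanner.request(5)                                 # destination faults on page 5
sent += [scanner.next_page() for _ in range(3)]    # requested page and its followers
sent += [scanner.next_page() for _ in range(4)]    # remaining pages, then exhausted
print(sent)
```

The requested page is sent first after the request, followed by the pages
directly after it, which is the locality bet the text describes.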

Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

- On receipt of ``CMD_PACKAGED`` (1)

   All the data associated with the package - the ( ... ) section in the diagram -
   is read into memory, and the main thread recurses into qemu_loadvm_state_main
   to process the contents of the package (2) which contains commands (3,6) and
   devices (4...)

- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)

   a new thread (a) is started that takes over servicing the migration stream,
   while the main thread carries on loading the package.
   It loads normal
   background page data (b) but if during a device load a fault happens (5)
   the returned page (c) is loaded by the listen thread allowing the main
   thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)

   letting the destination CPUs start running.  At the end of the
   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
   is no longer used by migration, while the listen thread carries on servicing
   page data until the end of migration.

Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both source and destination
QEMU instances.  Instead, when an issue happens QEMU on both sides
will go into a paused state.  A recovery phase is needed to continue a
paused postcopy migration.

The recovery phase normally contains a few steps:

 - When a network issue occurs, both QEMU instances will go into PAUSED state

 - When the network is recovered (or a new network is provided), the admin
   can set up the new channel for migration using the QMP command
   'migrate-recover' on the destination node, preparing for a resume.

 - On the source host, the admin can continue the interrupted postcopy
   migration using the QMP command 'migrate' with the resume=true flag set.

 - After the connection is re-established, QEMU will continue the postcopy
   migration on both sides.

During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by any access to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM, the VM
thread will be halted waiting for the page to be migrated, which means it can
be halted until the recovery is complete.

The impact of accessing missing pages can depend on the
configuration of the guest.  For example, with async page fault
enabled, logically the guest can proactively schedule out the threads
accessing missing pages.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    now carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when it
    happens), where all threads will be temporarily halted, mostly due to
    network errors.  When reaching the paused state, migration will make sure
    the qemu binary on both sides maintains the data without corrupting
    the VM.  To continue the migration, the admin needs to fix the
    migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command 'migrate'
    again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit, and perform the cleanup of migration
    state; the migration is now complete.

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages, however during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.

Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the type of
memory that userfault can support shared.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have Postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.
The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

.. note::
  There are two future improvements that would be nice:
    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:
  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implication understood; for example if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.
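
The mapping table mentioned earlier in this section, which converts fault
addresses in the client's address space back to RAMBlock/offsets, can be
pictured with a small lookup. This is a sketch under stated assumptions: the
``Mapping`` fields, the ``translate`` helper, the addresses and the RAMBlock
name are all hypothetical and do not reflect the vhost-user message layout.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Mapping:
    client_base: int   # start address in the client's address space
    size: int          # length of the mapped region in bytes
    ramblock: str      # name of the RAMBlock backing the region
    block_offset: int  # offset of the region's start within the RAMBlock


def translate(table: List[Mapping], fault_addr: int) -> Optional[Tuple[str, int]]:
    """Convert a client fault address to (RAMBlock name, offset), if mapped."""
    for m in table:
        if m.client_base <= fault_addr < m.client_base + m.size:
            return m.ramblock, m.block_offset + (fault_addr - m.client_base)
    return None  # the address is not in any registered region


table = [
    Mapping(client_base=0x7F0000000000, size=0x40000000,
            ramblock="pc.ram", block_offset=0),
    Mapping(client_base=0x7F0080000000, size=0x10000000,
            ramblock="pc.ram", block_offset=0x40000000),
]

hit = translate(table, 0x7F0080001000)
miss = translate(table, 0x1000)
print(hit, miss)
```

With such a table, a fault reported on the client's userfaultfd can be turned
into a page request expressed in the terms the source understands.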

Postcopy Preemption Mode
------------------------

Postcopy preempt is a new capability introduced in the QEMU 8.0 release. It
allows urgent pages (those explicitly requested by the destination QEMU
through page faults) to be sent in a separate preempt channel, rather than
queued in the background migration channel.  Anyone who cares about latencies
of page faults during a postcopy migration should enable this feature.  By
default, it's not enabled.
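
Since the feature is off by default, it has to be switched on explicitly. A
minimal sketch of a single QMP request that enables it together with postcopy
itself is shown below. It assumes that the capability is exposed under the
name ``postcopy-preempt`` and that the QMP command is
``migrate-set-capabilities``; the helper name is hypothetical and only the
JSON message is built.

```python
import json
from typing import Dict


def capabilities_request(states: Dict[str, bool]) -> dict:
    """Build one 'migrate-set-capabilities' request for several capabilities."""
    return {
        "execute": "migrate-set-capabilities",
        "arguments": {
            "capabilities": [
                {"capability": name, "state": state}
                for name, state in states.items()
            ]
        },
    }


# Assumed capability names; both are set before the migration starts.
preempt_msg = capabilities_request({"postcopy-ram": True, "postcopy-preempt": True})
print(json.dumps(preempt_msg, indent=2))
```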