Postcopy bounds the amount of migration traffic and the time it takes; the downside is that during
the postcopy phase, a failure of *either* side causes the guest to be lost.

In postcopy the destination CPUs are started before all the memory has been
transferred; an access to a page that has not yet been transferred causes
a fault that QEMU translates into a request to the source QEMU.
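
The fault-to-request path can be pictured as a small userfaultfd loop. The sketch
below is illustrative only, not QEMU's actual code; ``send_page_request()`` is a
hypothetical helper standing in for the return path to the source::

  #include <linux/userfaultfd.h>
  #include <stdint.h>
  #include <unistd.h>

  /* Hypothetical helper: ask the source QEMU for the page backing 'addr'. */
  void send_page_request(uint64_t addr);

  /* Destination fault loop: block on the userfaultfd, then turn each
   * missing-page fault into a request to the source. */
  void fault_loop(int uffd, long page_size)
  {
      struct uffd_msg msg;

      while (read(uffd, &msg, sizeof(msg)) == sizeof(msg)) {
          if (msg.event != UFFD_EVENT_PAGEFAULT) {
              continue;
          }
          /* Round the faulting address down to a page boundary. */
          uint64_t addr = msg.arg.pagefault.address & ~((uint64_t)page_size - 1);
          send_page_request(addr);
          /* The faulting vCPU stays blocked until the page arrives and is
           * placed with UFFDIO_COPY (not shown here). */
      }
  }
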
Postcopy can be combined with precopy (i.e. normal migration), so that if precopy
doesn't finish in a given time the switch is made to postcopy.

To enable postcopy, issue the ``migrate_set_capability postcopy-ram on`` command
on the monitor (both source and destination) prior to the start of migration.

The normal commands are then used to start a migration, which is still
started in precopy mode. Issuing ``migrate_start_postcopy`` will then cause the
transition from precopy to postcopy; it can be issued immediately after the
migration is started or at any time later on. Issuing it after the end of a
migration is harmless.
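
The same setup can also be driven over QMP. Below is a minimal sketch assuming a
QMP server listening on a UNIX socket; the socket path and the destination URI are
placeholders, replies are read and discarded, and error handling is minimal::

  #include <string.h>
  #include <sys/socket.h>
  #include <sys/un.h>
  #include <unistd.h>

  /* Send one QMP command and drain whatever reply text comes back. */
  static void qmp(int fd, const char *json)
  {
      char reply[4096];

      if (write(fd, json, strlen(json)) > 0) {
          (void)read(fd, reply, sizeof(reply));   /* reply is ignored here */
      }
  }

  int main(void)
  {
      struct sockaddr_un sa = { .sun_family = AF_UNIX,
                                .sun_path = "/tmp/qmp.sock" };  /* placeholder */
      char greeting[4096];
      int fd = socket(AF_UNIX, SOCK_STREAM, 0);

      if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
          return 1;
      }
      (void)read(fd, greeting, sizeof(greeting));   /* QMP greeting banner */

      qmp(fd, "{\"execute\": \"qmp_capabilities\"}\n");
      /* Equivalent of the HMP 'migrate_set_capability postcopy-ram on'. */
      qmp(fd, "{\"execute\": \"migrate-set-capabilities\", \"arguments\": "
              "{\"capabilities\": [{\"capability\": \"postcopy-ram\", "
              "\"state\": true}]}}\n");
      /* Start the migration; it begins in precopy mode (URI is illustrative). */
      qmp(fd, "{\"execute\": \"migrate\", \"arguments\": "
              "{\"uri\": \"tcp:dest-host:4444\"}}\n");
      /* Later, trigger the switch from precopy to postcopy. */
      qmp(fd, "{\"execute\": \"migrate-start-postcopy\"}\n");

      close(fd);
      return 0;
  }
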
Blocktime is a postcopy live migration metric intended to show how
long a vCPU was in a state of interruptible sleep due to a page fault.

During the postcopy phase, the bandwidth limits set with
``migrate_set_parameter`` are ignored (to avoid delaying requested pages that
the destination is waiting for).

Advise: set at the start of migration if postcopy is enabled, even
if it hasn't had the start command; here the destination
checks that its OS has the support needed for postcopy, and performs
setup to ensure the RAM mappings are suitable for later postcopy.
The destination will fail early in migration at this point if the
required OS support is not present.
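
On Linux that OS support is the userfaultfd API. A minimal sketch of the kind of
probe involved (the real check in QEMU also validates specific feature bits,
which are omitted here)::

  #include <fcntl.h>
  #include <linux/userfaultfd.h>
  #include <stdbool.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  /* Probe whether the host kernel offers the userfault support that the
   * postcopy destination relies on.  Returns true if it looks usable. */
  static bool host_supports_userfault(void)
  {
      int uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
      if (uffd < 0) {
          return false;              /* no userfaultfd support at all */
      }

      struct uffdio_api api = { .api = UFFD_API, .features = 0 };
      bool ok = ioctl(uffd, UFFDIO_API, &api) == 0;

      close(uffd);
      return ok;
  }
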
Discard: entered on receipt of the first 'discard' command; prior to
the first discard being performed, hugepages are switched off (using madvise)
to ensure that no new huge pages are created
during the postcopy phase, and to cause any huge pages that
have discards on them to be broken up.
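
A sketch of the idea for one RAM block; the function name and the exact
``madvise`` sequence are illustrative, not QEMU's actual calls::

  #include <stddef.h>
  #include <sys/mman.h>

  /* Stop huge pages being formed on this RAM block, then discard one range
   * the source told us to throw away; discarding a range inside a huge page
   * also forces that huge page to be broken up. */
  static int prepare_ramblock_for_postcopy(void *block_host, size_t block_len,
                                           size_t discard_off, size_t discard_len)
  {
      if (madvise(block_host, block_len, MADV_NOHUGEPAGE) < 0) {
          return -1;
      }
      if (madvise((char *)block_host + discard_off, discard_len,
                  MADV_DONTNEED) < 0) {
          return -1;
      }
      return 0;
  }
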
Listen: the first command in the package, POSTCOPY_LISTEN, switches
the destination state to Listen and starts a new thread
(the 'listen thread') which takes over the job of receiving
pages off the migration stream, while the main thread carries
on processing the blob. With this thread able to process page
reception, the destination now 'sensitises' the RAM to detect
any access to missing pages (on Linux using the 'userfault' system).
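
A sketch of the 'sensitising' step on Linux, registering one RAM range for
missing-page faults; error handling and QEMU's own wrappers are omitted::

  #include <linux/userfaultfd.h>
  #include <stddef.h>
  #include <sys/ioctl.h>

  /* Register one RAM range so that any access to a not-yet-populated page
   * generates a fault event on 'uffd' instead of being zero-filled. */
  static int sensitise_range(int uffd, void *host_addr, size_t len)
  {
      struct uffdio_register reg = {
          .range = { .start = (unsigned long)host_addr, .len = len },
          .mode  = UFFDIO_REGISTER_MODE_MISSING,
      };
      return ioctl(uffd, UFFDIO_REGISTER, &reg);
  }
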
Running: POSTCOPY_RUN causes the destination to synchronise all
state and start the CPUs and IO devices running. The main
thread now finishes processing the migration package and then
carries on as it would after a normal precopy migration
(although it can't do the cleanup it would do as it
finishes a normal migration).

End: the listen thread can now quit and perform the cleanup of migration
state; the migration is now complete.

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source. As such,
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream completely
before the device load begins, to free the stream up. This is achieved by
'packaging' the device data into a blob that's read in one go.
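
A sketch of the packaging idea from the reading side, using a simplified
length-prefixed framing (the real framing is QEMU's ``CMD_PACKAGED`` command,
described below) and a hypothetical ``load_device_state_from_buffer()`` loader::

  #include <stdint.h>
  #include <stdlib.h>
  #include <unistd.h>

  /* Hypothetical loader that consumes device state from a memory buffer. */
  void load_device_state_from_buffer(const uint8_t *buf, size_t len);

  /* Read the whole device-state blob off the stream first, so the socket is
   * free to carry page data while the (potentially slow) device load runs. */
  static int read_and_load_package(int stream_fd)
  {
      uint32_t len;
      if (read(stream_fd, &len, sizeof(len)) != sizeof(len)) {
          return -1;
      }

      uint8_t *blob = malloc(len);
      if (!blob) {
          return -1;
      }
      for (size_t got = 0; got < len; ) {
          ssize_t n = read(stream_fd, blob + got, len - got);
          if (n <= 0) {
              free(blob);
              return -1;
          }
          got += n;
      }

      /* The stream is no longer needed for the device load itself. */
      load_device_state_from_buffer(blob, len);
      free(blob);
      return 0;
  }
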
Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts, the source sends the page discard data and then
forms the 'package' containing:

- Command: 'postcopy listen'

- The device state

  A series of sections, identical to the precopy stream's device state stream,
  containing everything except postcopiable devices (i.e. RAM)

- Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location. This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.
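
A sketch of that scanning policy over a bit-per-page dirty bitmap; the helper
functions are illustrative stand-ins for QEMU's bitmap and request-queue code::

  #include <stdbool.h>
  #include <stddef.h>

  /* Illustrative stand-ins for QEMU's dirty-bitmap and request-queue code. */
  extern size_t nr_pages;
  size_t count_dirty_pages(void);
  size_t find_next_dirty(size_t start);        /* returns nr_pages if none */
  void   clear_dirty(size_t page);
  bool   dequeue_page_request(size_t *page);   /* pages the destination wants */
  void   send_page(size_t page);

  static void postcopy_send_loop(void)
  {
      size_t cursor = 0;
      size_t remaining = count_dirty_pages();

      while (remaining > 0) {
          size_t requested;

          /* A request from the destination restarts the scan there, so the
           * requested page, and the pages right after it, go out first. */
          if (dequeue_page_request(&requested)) {
              cursor = requested;
          }

          cursor = find_next_dirty(cursor);
          if (cursor >= nr_pages) {
              cursor = 0;                /* ran off the end: wrap around */
              continue;
          }

          send_page(cursor);
          clear_dirty(cursor);           /* sent pages are no longer dirty */
          remaining--;
          cursor++;
      }
  }
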
Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

All the data associated with the package - the ( ... ) section in the diagram -
is read into memory, and the main thread recurses into ``qemu_loadvm_state_main``
to process the contents of the package (2), which contains commands (3,6) and
device state.

- On receipt of 'postcopy listen' - 3 - (i.e. the 1st command in the package),
  a new thread (a) is started that takes over servicing the migration stream,
  while the main thread carries on loading the package. It loads normal
  background page data (b), but if a fault happens during a device load (5),
  the returned page (c) is loaded by the listen thread, allowing the main
  thread's device load to carry on.

- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6),
  letting the destination CPUs start running. At the end of the
  ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
  is no longer used by migration, while the listen thread carries on servicing
  page data until the end of migration.

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty', i.e. needs
sending. During the precopy phase this is updated as the CPU dirties
pages; however, during postcopy the CPUs are stopped and nothing should
dirty anything any more. Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

When an issue (e.g. a network failure) happens between the two
QEMU instances, QEMU on both sides will go into a paused state, waiting for
the migration to be recovered.

The recovery phase normally contains a few steps:

- When the network is recovered (or a new network is provided), the admin
  can set up a new channel for the migration using the QMP command
  ``migrate-recover`` on the destination node, preparing for the resume of
  the postcopy migration.

- On the source host, the admin can continue the interrupted postcopy
  migration using the QMP command ``migrate`` with the ``resume`` flag set,
  so that the source QEMU will try to
  re-establish the channels (see the sketch after this list).

- Once both sides reconnect over the new channel, a handshake
  procedure will be needed to properly synchronize the VM states between
  the two QEMUs to continue the postcopy migration. For example, pages
  can be sent right during the window when the network is
  interrupted; the handshake guarantees that pages lost in-flight
  will be resent.

- After a proper handshake synchronization, QEMU will continue the
  postcopy migration on both sides.
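
A sketch of the two QMP calls involved, assuming a hypothetical ``qmp()`` helper
like the one in the earlier enable-postcopy sketch; socket handling and the URIs
are placeholders::

  /* Hypothetical helper like the qmp() function in the earlier sketch. */
  void qmp(int fd, const char *json);

  /* fd_dst / fd_src are QMP connections to the destination and source QEMU. */
  static void resume_paused_postcopy(int fd_dst, int fd_src)
  {
      /* Destination: offer a fresh channel for the paused migration. */
      qmp(fd_dst, "{\"execute\": \"migrate-recover\", \"arguments\": "
                  "{\"uri\": \"tcp:0:4444\"}}\n");

      /* Source: resume the interrupted postcopy over the new channel
       * (URI is illustrative). */
      qmp(fd_src, "{\"execute\": \"migrate\", \"arguments\": "
                  "{\"uri\": \"tcp:dest-host:4444\", \"resume\": true}}\n");
  }
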
During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by accesses to pages that
were already migrated to the destination VM before the interruption happened.
However, if any of the missing pages gets accessed on the destination VM, the
accessing VM thread will be halted waiting for the page to be migrated, which
means it can stay halted until the recovery is complete.

The impact of accessing missing pages depends on the
configuration of the guest. For example, with async page fault
enabled, the guest can proactively schedule out the threads
that access missing pages.

Postcopy also works with hugetlbfs-backed memory, subject to a few constraints:

a) The Linux kernel on the destination must support userfault on hugepages.
b) The huge-page configuration on the source and destination VMs must be
   identical; i.e. RAMBlocks on both sides must use the same page size.
c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
   RAM if there aren't enough hugepages, causing (b) to fail;
   using ``-mem-prealloc`` enforces the allocation using hugepages.
d) Care should be taken with the size of hugepage used; postcopy with 2MB
   hugepages works well, but larger hugepages (e.g. 1GB) are likely to be
   problematic since the whole hugepage must be transferred to satisfy a fault,
   and until the full page is transferred the destination thread is blocked
   (see the arithmetic below).
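
(For a rough sense of scale: a 1GB hugepage is about 8.6 gigabits of data, so
even on a 10Gbps link the transfer alone takes roughly 0.86 seconds, and any
vCPU faulting on that page cannot make progress for that whole time.)
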
Postcopy migration with shared memory needs explicit support both from the other
processes that share the memory and from QEMU, and there are restrictions on the
type of memory that userfault can support when it is shared.

The Linux kernel userfault support works on ``/dev/shm`` memory and on ``hugetlbfs``
(although the kernel doesn't provide an equivalent to ``madvise(MADV_DONTNEED)``
for hugetlbfs, which may be a problem in some configurations).

The vhost-user code in QEMU supports clients that have postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have changes
to support it.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault. The client must then pass the
userfaultfd back to QEMU, together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets. The client's userfaultfd is added to the postcopy
fault thread, and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.
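
A sketch of the arrival path for one page, assuming it has just been received
into ``page_buf``; the function and parameter names are illustrative, and in the
shared-memory case the placement only has to be done once by QEMU::

  #include <linux/userfaultfd.h>
  #include <stdint.h>
  #include <sys/ioctl.h>

  /* Place a freshly received page into guest RAM via QEMU's own
   * userfaultfd, then wake a client that is blocked on the same page. */
  static int place_and_wake(int qemu_uffd, int client_uffd,
                            uint64_t qemu_addr, uint64_t client_addr,
                            const void *page_buf, uint64_t pagesize)
  {
      struct uffdio_copy copy = {
          .dst = qemu_addr,
          .src = (uint64_t)(uintptr_t)page_buf,
          .len = pagesize,
          .mode = 0,
      };
      if (ioctl(qemu_uffd, UFFDIO_COPY, &copy) < 0) {
          return -1;
      }

      struct uffdio_range range = { .start = client_addr, .len = pagesize };
      return ioctl(client_uffd, UFFDIO_WAKE, &range);
  }
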
Two future improvements would be nice:

a) Some way to make QEMU ignorant of the addresses in the client's
   address space.
b) Avoiding the need for QEMU to perform ufd-wake calls after the
   pages have arrived.

Retro-fitting postcopy to existing clients is possible:

a) A mechanism is needed for the registration with userfault as above,
   and the registration needs to be coordinated with the phases of
   postcopy; in vhost-user, extra messages are added to the existing
   control channel.
b) Any thread that can block on a guest memory access must be
   identified and the implication understood; for example, if the
   access is made while holding a lock, then all other threads waiting
   for that lock will also be blocked.

Postcopy preemption mode lets urgent page requests from the destination be
serviced on a dedicated channel instead of being queued behind
the background migration channel. Anyone who cares about latencies of page
faults during a postcopy migration should enable this feature.