Postcopy
========

'Postcopy' migration is a way to deal with migrations that refuse to converge
(or take too long to converge).  Its plus side is that there is an upper bound
on the amount of migration traffic and time it takes; the down side is that
during the postcopy phase, a failure of *either* side causes the guest to be
lost.

In postcopy the destination CPUs are started before all the memory has been
transferred, and accesses to pages that are yet to be transferred cause
a fault that's translated by QEMU into a request to the source QEMU.

Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
doesn't finish in a given time the switch is made to postcopy.

Enabling postcopy
-----------------

To enable postcopy, issue this command on the monitor (both source and
destination) prior to the start of migration:

``migrate_set_capability postcopy-ram on``

The normal commands are then used to start a migration, which is still
started in precopy mode.  Issuing:

``migrate_start_postcopy``

will now cause the transition from precopy to postcopy.
It can be issued immediately after migration is started or any
time later on.  Issuing it after the end of a migration is harmless.

Blocktime is a postcopy live migration metric, intended to show how
long the vCPU was in a state of interruptible sleep due to pagefaults.
That metric is calculated both for all vCPUs as an overlapped value, and
separately for each vCPU.  These values are calculated on the destination
side.  To enable postcopy blocktime calculation, enter the following
command on the destination monitor:

``migrate_set_capability postcopy-blocktime on``

Postcopy blocktime can be retrieved by the ``query-migrate`` QMP command.
The ``postcopy-blocktime`` value of the QMP command will show the
overlapped blocking time for all vCPUs, and ``postcopy-vcpu-blocktime``
will show the list of blocking times per vCPU.

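The "overlapped" total can be pictured as the length of the union of each
vCPU's blocked intervals, so that time when several vCPUs were faulting
simultaneously is counted only once.  A minimal sketch of that idea
(illustrative only, not QEMU's actual accounting code):

```python
def overlapped_blocktime(per_vcpu_intervals):
    """Sum the union of all vCPUs' (start, end) blocked intervals, so time
    when several vCPUs were faulting at once is counted only once."""
    events = sorted(i for vcpu in per_vcpu_intervals for i in vcpu)
    total, cur_start, cur_end = 0, None, None
    for start, end in events:
        if cur_end is None or start > cur_end:   # disjoint: flush previous
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:                                    # overlapping: extend
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total
```

For example, one vCPU blocked for 0-10ms and another for 5-20ms gives an
overlapped blocktime of 20ms, while the per-vCPU values remain 10 and 15.
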
.. note::
  During the postcopy phase, the bandwidth limits set using
  ``migrate_set_parameter`` are ignored (to avoid delaying requested pages
  that the destination is waiting for).

Postcopy device transfer
------------------------

Loading of device data may cause the device emulation to access guest RAM,
which may trigger faults that have to be resolved by the source; as such
the migration stream has to be able to respond with page data *during* the
device load, and hence the device data has to be read from the stream
completely before the device load begins, to free the stream up.  This is
achieved by 'packaging' the device data into a blob that's read in one go.

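The packaging idea can be sketched as a length-prefixed blob: the receiver
pulls the whole package off the stream first, so the stream is immediately
free to carry page data while the devices are loaded from memory.  (A sketch
only; the real ``CMD_PACKAGED`` framing is part of the migration stream
format and carries more than a bare length.)

```python
import io
import struct

def write_package(stream, device_blob):
    # Length-prefix the device state so the receiver knows how much to
    # pull off the stream in one go.
    stream.write(struct.pack(">I", len(device_blob)))
    stream.write(device_blob)

def read_package(stream):
    # Read the whole blob *before* loading any device: from here on the
    # live stream only carries page data for faults raised during the load.
    (size,) = struct.unpack(">I", stream.read(4))
    return io.BytesIO(stream.read(size))
```
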
Source behaviour
----------------

Until postcopy is entered the migration stream is identical to normal
precopy, except for the addition of a 'postcopy advise' command at
the beginning, to tell the destination that postcopy might happen.
When postcopy starts the source sends the page discard data and then
forms the 'package' containing:

   - Command: 'postcopy listen'
   - The device state

     A series of sections, identical to the precopy stream's device state
     stream, containing everything except postcopiable devices (i.e. RAM)
   - Command: 'postcopy run'

The 'package' is sent as the data part of a Command: ``CMD_PACKAGED``, and the
contents are formatted in the same way as the main migration stream.

During postcopy the source scans the list of dirty pages and sends them
to the destination without being requested (in much the same way as precopy);
however, when a page request is received from the destination, the dirty page
scanning restarts from the requested location.  This causes requested pages
to be sent quickly, and also causes pages directly after the requested page
to be sent quickly, in the hope that those pages are likely to be used
by the destination soon.

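The restart-on-request behaviour can be sketched as a toy model (this is not
QEMU's RAM save code): a linear scan over the dirty pages, where an incoming
request moves the scan cursor to the requested page.

```python
def postcopy_send_order(dirty_pages, nr_pages, requests):
    """Toy model of the source's send order.  `requests` maps a scan step to
    an urgent page number requested by the destination at that point."""
    dirty = set(dirty_pages)
    order, cursor, step = [], 0, 0
    while dirty:
        if step in requests:                   # urgent request arrives:
            cursor = requests[step]            # restart scan at that page
        while cursor % nr_pages not in dirty:  # skip already-sent pages
            cursor += 1
        page = cursor % nr_pages
        order.append(page)
        dirty.discard(page)
        cursor += 1
        step += 1
    return order
```

With six dirty pages and a request for page 4 arriving at step 2, the send
order becomes 0, 1, 4, 5, 2, 3: the requested page and its neighbours jump
ahead of the plain linear scan.
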
Destination behaviour
---------------------

Initially the destination looks the same as precopy, with a single thread
reading the migration stream; the 'postcopy advise' and 'discard' commands
are processed to change the way RAM is managed, but don't affect the stream
processing.

::

  ------------------------------------------------------------------------------
                          1      2   3     4 5                      6   7
  main -----DISCARD-CMD_PACKAGED ( LISTEN  DEVICE     DEVICE DEVICE RUN )
  thread                             |       |
                                     |     (page request)
                                     |        \___
                                     v            \
  listen thread:                     --- page -- page -- page -- page -- page --

                                     a   b        c
  ------------------------------------------------------------------------------

111*bfb4c7cdSPeter Xu- On receipt of ``CMD_PACKAGED`` (1)
112*bfb4c7cdSPeter Xu
113*bfb4c7cdSPeter Xu   All the data associated with the package - the ( ... ) section in the diagram -
114*bfb4c7cdSPeter Xu   is read into memory, and the main thread recurses into qemu_loadvm_state_main
115*bfb4c7cdSPeter Xu   to process the contents of the package (2) which contains commands (3,6) and
116*bfb4c7cdSPeter Xu   devices (4...)
117*bfb4c7cdSPeter Xu
118*bfb4c7cdSPeter Xu- On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
119*bfb4c7cdSPeter Xu
120*bfb4c7cdSPeter Xu   a new thread (a) is started that takes over servicing the migration stream,
121*bfb4c7cdSPeter Xu   while the main thread carries on loading the package.   It loads normal
122*bfb4c7cdSPeter Xu   background page data (b) but if during a device load a fault happens (5)
123*bfb4c7cdSPeter Xu   the returned page (c) is loaded by the listen thread allowing the main
124*bfb4c7cdSPeter Xu   threads device load to carry on.
125*bfb4c7cdSPeter Xu
126*bfb4c7cdSPeter Xu- The last thing in the ``CMD_PACKAGED`` is a 'RUN' command (6)
127*bfb4c7cdSPeter Xu
128*bfb4c7cdSPeter Xu   letting the destination CPUs start running.  At the end of the
129*bfb4c7cdSPeter Xu   ``CMD_PACKAGED`` (7) the main thread returns to normal running behaviour and
130*bfb4c7cdSPeter Xu   is no longer used by migration, while the listen thread carries on servicing
131*bfb4c7cdSPeter Xu   page data until the end of migration.
132*bfb4c7cdSPeter Xu
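The hand-off between the two threads can be sketched as follows - a minimal
simulation of the idea, not QEMU's actual thread code: the listen thread
installs incoming pages and wakes any waiter, while a device load that
touches a missing page simply blocks until the page arrives.

```python
import threading
import queue

incoming = queue.Queue()            # stands in for the migration stream
ram = {}                            # page number -> contents
page_ready = threading.Condition()

def listen_thread():
    # Takes over servicing the stream: every item is a page; install it
    # and wake anyone blocked in a fault on it.  None ends the migration.
    while True:
        item = incoming.get()
        if item is None:
            return
        page, data = item
        with page_ready:
            ram[page] = data
            page_ready.notify_all()

def fault_in(page):
    # A device load touching a missing page blocks here until the listen
    # thread has installed it - steps (5) and (c) in the diagram.
    with page_ready:
        page_ready.wait_for(lambda: page in ram)
    return ram[page]
```
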
Postcopy Recovery
-----------------

Compared to precopy, postcopy is special in its error handling.  When any
error happens (in this case, mostly network errors), QEMU cannot easily
fail a migration because VM data resides in both the source and destination
QEMU instances.  Instead, when an issue happens, QEMU on both sides will go
into a paused state, and a recovery phase is needed to continue the paused
postcopy migration.

The recovery phase normally contains a few steps:

  - When a network issue occurs, both QEMU instances will go into the
    PAUSED state

  - When the network is recovered (or a new network is provided), the admin
    can set up the new channel for migration using the QMP command
    'migrate-recover' on the destination node, preparing for a resume.

  - On the source host, the admin can continue the interrupted postcopy
    migration using the QMP command 'migrate' with the resume=true flag set.

  - After the connection is re-established, QEMU will continue the postcopy
    migration on both sides.

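Over QMP, the two commands issued by the admin look like this (the TCP
address is a made-up placeholder for whatever new channel is available):

```python
# Issued on the DESTINATION to listen again on a (possibly new) channel.
recover_cmd = {
    "execute": "migrate-recover",
    "arguments": {"uri": "tcp:192.0.2.1:4444"},
}

# Issued on the SOURCE to resume the interrupted postcopy migration.
resume_cmd = {
    "execute": "migrate",
    "arguments": {"uri": "tcp:192.0.2.1:4444", "resume": True},
}
```
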
During a paused postcopy migration, the VM can logically still continue
running, and it will not be impacted by accesses to pages that were
already migrated to the destination VM before the interruption happened.
However, if any of the missing pages is accessed on the destination VM,
the VM thread will be halted waiting for the page to be migrated, which
means it can be halted until the recovery is complete.

The impact of accessing missing pages can vary with the configuration of
the guest.  For example, with async page faults enabled, the guest can
proactively schedule out the threads accessing missing pages.

Postcopy states
---------------

Postcopy moves through a series of states (see postcopy_state) from
ADVISE->DISCARD->LISTEN->RUNNING->END

 - Advise

    Set at the start of migration if postcopy is enabled, even
    if it hasn't had the start command; here the destination
    checks that its OS has the support needed for postcopy, and performs
    setup to ensure the RAM mappings are suitable for later postcopy.
    The destination will fail early in migration at this point if the
    required OS support is not present.
    (Triggered by reception of the POSTCOPY_ADVISE command)

 - Discard

    Entered on receipt of the first 'discard' command; prior to
    the first Discard being performed, hugepages are switched off
    (using madvise) to ensure that no new huge pages are created
    during the postcopy phase, and to cause any huge pages that
    have discards on them to be broken.

 - Listen

    The first command in the package, POSTCOPY_LISTEN, switches
    the destination state to Listen, and starts a new thread
    (the 'listen thread') which takes over the job of receiving
    pages off the migration stream, while the main thread carries
    on processing the blob.  With this thread able to process page
    reception, the destination now 'sensitises' the RAM to detect
    any access to missing pages (on Linux using the 'userfault'
    system).

 - Running

    POSTCOPY_RUN causes the destination to synchronise all
    state and start the CPUs and IO devices running.  The main
    thread now finishes processing the migration package and
    carries on as it would for normal precopy migration
    (although it can't do the cleanup it would do as it
    finishes a normal migration).

 - Paused

    Postcopy can run into a paused state (normally on both sides when it
    happens), where all threads will be temporarily halted, mostly due to
    network errors.  When reaching the paused state, migration will make
    sure the QEMU binaries on both sides maintain the data without
    corrupting the VM.  To continue the migration, the admin needs to fix
    the migration channel using the QMP command 'migrate-recover' on the
    destination node, then resume the migration using the QMP command
    'migrate' again on the source node, with the resume=true flag set.

 - End

    The listen thread can now quit, and perform the cleanup of migration
    state; the migration is now complete.

Source side page map
--------------------

The 'migration bitmap' in postcopy is basically the same as in precopy,
where each bit indicates that a page is 'dirty' - i.e. needs
sending.  During the precopy phase this is updated as the CPU dirties
pages, however during postcopy the CPUs are stopped and nothing should
dirty anything any more.  Instead, dirty bits are cleared when the relevant
pages are sent during postcopy.

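The postcopy-side invariant - bits are only ever cleared, never set - can be
sketched as follows (illustrative only, not QEMU's bitmap code):

```python
def send_pages(bitmap, visit_order):
    """During postcopy nothing sets bits any more, so walking the bitmap in
    any order (including restarts caused by page requests) simply drains it;
    a clear bit always means 'already on the destination'."""
    sent = []
    for page in visit_order:
        if bitmap[page]:          # still dirty: send it and clear the bit
            bitmap[page] = 0
            sent.append(page)
    return sent
```
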
Postcopy with hugepages
-----------------------

Postcopy now works with hugetlbfs backed memory:

  a) The Linux kernel on the destination must support userfault on hugepages.
  b) The huge-page configuration on the source and destination VMs must be
     identical; i.e. RAMBlocks on both sides must use the same page size.
  c) Note that ``-mem-path /dev/hugepages`` will fall back to allocating normal
     RAM if it doesn't have enough hugepages, triggering (b) to fail.
     Using ``-mem-prealloc`` enforces the allocation using hugepages.
  d) Care should be taken with the size of hugepage used; postcopy with 2MB
     hugepages works well, however 1GB hugepages are likely to be problematic
     since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
     and until the full page is transferred the destination thread is blocked.

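The ~1 second figure in (d) follows from simple arithmetic (ignoring protocol
overhead):

```python
def transfer_seconds(page_bytes, link_gbps):
    # Time one huge page occupies the wire; a destination thread faulting
    # on that page is blocked for the whole transfer.
    return page_bytes * 8 / (link_gbps * 1e9)
```

A 1GB page on a 10Gbps link takes about 0.86 seconds, versus under 2ms for
a 2MB page, which is why the smaller size behaves so much better under
postcopy.
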
Postcopy with shared memory
---------------------------

Postcopy migration with shared memory needs explicit support from the other
processes that share memory and from QEMU. There are restrictions on the
types of shared memory that userfault can support.

The Linux kernel userfault support works on ``/dev/shm`` memory and on
``hugetlbfs`` (although the kernel doesn't provide an equivalent to
``madvise(MADV_DONTNEED)`` for hugetlbfs, which may be a problem in some
configurations).

The vhost-user code in QEMU supports clients that have postcopy support,
and the ``vhost-user-bridge`` (in ``tests/``) and the DPDK package have
changes to support postcopy.

The client needs to open a userfaultfd and register the areas
of memory that it maps with userfault.  The client must then pass the
userfaultfd back to QEMU together with a mapping table that allows
fault addresses in the client's address space to be converted back to
RAMBlock/offsets.  The client's userfaultfd is added to the postcopy
fault-thread and page requests are made on behalf of the client by QEMU.
QEMU performs 'wake' operations on the client's userfaultfd to allow it
to continue after a page has arrived.

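The mapping table's job can be sketched like this (the region layout and
names below are made up for illustration; the real vhost-user messages carry
the equivalent information):

```python
def make_fault_translator(mapping_table):
    """mapping_table: (client_base, size, ramblock_name) triples describing
    where each shared region sits in the client's address space."""
    regions = sorted(mapping_table)
    def translate(fault_addr):
        # Convert a fault address reported on the client's userfaultfd
        # back into a (RAMBlock, offset) pair QEMU can request pages for.
        for base, size, name in regions:
            if base <= fault_addr < base + size:
                return name, fault_addr - base
        raise KeyError(hex(fault_addr))
    return translate
```
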
.. note::
  There are two future improvements that would be nice:

    a) Some way to make QEMU ignorant of the addresses in the client's
       address space
    b) Avoiding the need for QEMU to perform ufd-wake calls after the
       pages have arrived

Retro-fitting postcopy to existing clients is possible:

  a) A mechanism is needed for the registration with userfault as above,
     and the registration needs to be coordinated with the phases of
     postcopy.  In vhost-user extra messages are added to the existing
     control channel.
  b) Any thread that can block due to guest memory accesses must be
     identified and the implications understood; for example, if the
     guest memory access is made while holding a lock then all other
     threads waiting for that lock will also be blocked.

Postcopy Preemption Mode
------------------------

Postcopy preempt is a capability introduced in the QEMU 8.0 release.  It
allows urgent pages (those whose page faults were explicitly requested by
the destination QEMU) to be sent over a separate preempt channel, rather
than queued in the background migration channel.  Anyone who cares about
the latency of page faults during a postcopy migration should enable this
feature.  It is not enabled by default.
305