qemu/docs/rdma.txt

10 linked on the QEMU wiki above.
29 of the significantly lower latency and higher throughput over TCP/IP. This is
30 because the RDMA I/O architecture reduces the number of interrupts and
31 data copies by bypassing the host networking stack. In particular, a TCP-based
33 unpredictable amount of time to complete the migration if the amount of
35 with the rate of dirty memory produced by the workload.
40 the use of the OpenFabrics OFED software stack that abstracts out the
41 programming model irrespective of the underlying hardware.
44 an understanding of how to verify that you have the OFED software stack
46 against the "librdmacm" and "libibverbs" libraries and development headers
53 with the hardware. This means that memory must be physically resident
54 before the hardware can transmit that memory to another machine.
55 If this is not acceptable for your application or product, then the use
57 software on the machine if there is not sufficient memory available to
58 relocate the entire footprint of the virtual machine. If so, then the
64 be pinned and resident in memory. This feature mostly affects the
65 bulk-phase round of the migration and can be enabled for extremely
66 high-performance RDMA hardware using the following command:
74 On the other hand, this will also significantly speed up the bulk round
75 of the migration, which can greatly reduce the "total" time of your migration.
76 Example performance of this using an idle VM in the previous example
77 can be found in the "Performance" section.
80 *all* of the memory of your virtual machine in the kernel is very expensive
81 and may extend the initial bulk iteration time by many seconds,
82 thus extending the total migration time. However, this will not
83 affect the determinism or predictability of your migration: you will
84 still gain the benefits of advanced pinning with RDMA.
89 First, set the migration speed to match your hardware's capabilities:
92 $ migrate_set_parameter max-bandwidth 40g # or whatever is the MAX of your RDMA device
94 Next, on the destination machine, add the following to the QEMU command line:
98 Finally, perform the actual migration on the source machine:
110 Using the following command:
119 For example, in the same 8GB RAM example with all 8GB of memory in
120 active use and the VM itself completely idle, using the same 40 gbps
130 migration *downtime*. This is because, without this feature, all of the
132 the bulk round and does not need to be re-registered during the successive
140 1. The transmission of the pages using RDMA
146 An infiniband SEND message is the standard ibverbs
148 The only difference between a SEND message and an RDMA
150 to be posted to the completion queue (CQ) on the
156 1. registration of the memory that will be transmitted
158    sides of the network before the actual transmission
161 RDMA messages are much easier to deal with. Once the memory
162 on the receiver side is registered and pinned, we're
163 basically done. All that is required is for the sender
164 side to start dumping bytes onto the link.
166 (Memory is not released from pinning until the migration
169 SEND messages require more coordination because the
171 work request) on the receive queue (RQ) before QEMUFileRDMA
172 can start using them to carry all the bytes as
175 To begin the migration, the initial connection setup is
191     * Length               (of the data portion, uint32, network byte order)
195 The 'Repeat' field is here to support future multiple page registrations
196 in a single message without any need to change the protocol itself
197 so that the protocol is compatible across multiple versions of QEMU.
198 Version #1 requires that all server implementations of the protocol
199 check this field and register all requests found in the array of commands located
200 in the data portion and return an equal number of results in the response.
201 The maximum number of repeats is hard-coded to 4096. This is a conservative
202 limit based on the maximum size of a SEND message along with empirical
203 observations on the maximum future benefit of simultaneous page registrations.
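To make the header layout concrete, here is a minimal Python sketch (not
QEMU's actual C code). It assumes the header is exactly three uint32 fields
in the order length, type, repeat, all in network byte order; the text above
confirms the width and byte order only for the length field. The names
pack_control_header, unpack_control_header and MAX_REPEAT are hypothetical.

```python
import struct

MAX_REPEAT = 4096               # hard-coded limit stated in the text above
HEADER = struct.Struct("!III")  # ASSUMED layout: length, type, repeat (big-endian)


def pack_control_header(length, cmd_type, repeat=1):
    """Serialize a control header, refusing repeat counts above the limit."""
    if repeat > MAX_REPEAT:
        raise ValueError("repeat exceeds MAX_REPEAT")
    return HEADER.pack(length, cmd_type, repeat)


def unpack_control_header(raw):
    """Parse a control header and validate the repeat field on receipt."""
    length, cmd_type, repeat = HEADER.unpack(raw[:HEADER.size])
    if repeat > MAX_REPEAT:
        raise ValueError("repeat exceeds MAX_REPEAT")
    return length, cmd_type, repeat
```

Checking the repeat field on both the sending and the receiving side mirrors
the requirement that every server implementation inspect this field.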
205 The 'type' field has 12 different command values:
207      2. Error                      (sent to the source during bad things)
219 A single control message, as hinted above, can contain within the data
220 portion an array of many commands of the same type. If there is more than
221 one command, then the 'repeat' field will be greater than 1.
224 information and optionally pin all the memory if requested by the user.
228 using the above list of values:
234 1. We transmit a READY command to let the sender know that
235    we are *ready* to receive some data bytes on the control channel.
236 2. Before attempting to receive the expected command, we post another
237    RQ work request to replace the one we just used up.
238 3. Block on a CQ event channel and wait for the SEND to arrive.
239 4. When the send arrives, librdmacm will unblock us.
240 5. Verify that the command-type and version received match the ones we expected.
244 1. Block on the CQ event channel waiting for a READY command
245    from the receiver to tell us that the receiver
247 2. Optionally: if we are expecting a response from the command
250 3. When the READY arrives, librdmacm will
252    to replace the one we just used up.
253 4. Now, we can actually post the work request to SEND
254    the requested command type of the header we were asked for.
256    we block again and wait for that response using the additional
258    'Register result' commands #6 back to the sender which
259    hold the rkey needed to perform RDMA. Note that the virtual address
260    corresponding to this rkey was already exchanged at the beginning
261    of the connection (described below).
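The ordering imposed by the two functions above can be illustrated with a
toy, single-threaded Python model. This is only a sketch of the READY
handshake: two in-memory queues stand in for the control channel, no real
work requests or completion queues are involved, and every name in it
(ControlChannel, receiver_signal_ready, sender_send, receiver_take, and the
command label used in the usage note) is hypothetical.

```python
from collections import deque

READY = "ready"  # hypothetical label for the READY command


class ControlChannel:
    """Toy stand-in for the two directions of the control channel."""

    def __init__(self):
        self.to_sender = deque()
        self.to_receiver = deque()


def receiver_signal_ready(chan):
    # Receiver side: announce that it is ready to receive a command.
    chan.to_sender.append(READY)


def sender_send(chan, cmd_type, payload=b""):
    # Sender side: a command may only be sent once a READY has arrived.
    if not chan.to_sender or chan.to_sender.popleft() != READY:
        raise RuntimeError("no READY received; refusing to SEND")
    chan.to_receiver.append((cmd_type, payload))


def receiver_take(chan, expected_type):
    # Receiver side: take the SEND and verify the command type.
    cmd_type, payload = chan.to_receiver.popleft()
    if cmd_type != expected_type:
        raise ValueError("unexpected command type: %r" % (cmd_type,))
    return payload
```

For instance, calling sender_send() on a fresh channel raises an error,
whereas receiver_signal_ready() followed by sender_send(chan, "register
request", b"chunk 7") lets receiver_take(chan, "register request") return
the payload.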
263 All of the remaining command types (not including 'ready')
264 described above use the aforementioned two functions to do the hard work:
267    this protocol before the actual migration begins. This information includes
268    a description of each RAMBlock on the server side as well as the virtual addresses
269    and lengths of each RAMBlock. This is used by the client to determine the
271    before performing the RDMA operations.
273    be sent with RDMA, the registration commands are used to ask the
274    other side to register the memory for this chunk and respond
275    with the result (rkey) of the registration.
276 3. The QEMUFile interfaces also call these functions (described below)
278    its own protocol information during the migration process.
282    the "Compress" command listed above. If the page *has* been registered
283    then we check the entire chunk for zero. Only if the entire chunk is
284    zero do we send a compress command to zap the page on the other side.
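The whole-chunk zero check for an already-registered chunk can be sketched
in a few lines of Python. This is an illustration only, not QEMU's
implementation; the function names and the returned action labels are
hypothetical.

```python
def chunk_is_zero(chunk):
    """Return True only if every byte in the chunk is zero."""
    return not any(chunk)


def action_for_registered_chunk(chunk):
    # Only an entirely zero chunk is replaced by a compress command;
    # a single non-zero byte means the chunk is written with RDMA as usual.
    return "compress" if chunk_is_zero(chunk) else "rdma_write"
```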
288 The current version of the protocol is version #1.
290 The same version applies to both protocol traffic and capabilities
294 librdmacm provides the user with a 'private data' area to be exchanged
304 no length field. The maximum size of the 'private data' section
305 is only 192 bytes per the Infiniband specification, so it's not
309 versioning because the user does not need to register memory to
315 If the version is invalid, we throw an error.
317 If the version is new, we only negotiate the capabilities that the
318 requested version is able to perform and ignore the rest.
322 Finally: Negotiation happens with the Flags field: If the primary-VM
323 sets a flag, but the destination does not support this capability, it
324 will return a zero-bit for that flag and the primary-VM will understand
326 capability on the primary-VM side.
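The flag negotiation described above amounts to a bitwise AND on each side.
The following Python sketch shows the idea under stated assumptions: the
flag names and bit positions are made up for illustration, and the helper
functions are hypothetical rather than taken from QEMU.

```python
CAP_A = 0x01  # hypothetical capability bit
CAP_B = 0x02  # hypothetical capability bit


def destination_reply(requested_flags, supported_flags):
    """Destination returns a zero bit for every capability it lacks."""
    return requested_flags & supported_flags


def source_enabled(requested_flags, reply_flags):
    """Primary-VM keeps only the capabilities the destination confirmed."""
    return requested_flags & reply_flags
```

If the primary-VM requests CAP_A | CAP_B and the destination supports only
CAP_B, the reply carries CAP_B alone and the primary-VM disables CAP_A.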
336 These two functions are very short and simply use the protocol
337 described above to deliver bytes without changing the upper-level
340 Finally, how do we hand off the actual bytes to get_buffer()?
344 to hold on to the bytes received from control-channel's SEND
348 message, the bytes from SEND are copied into a small local holding area.
350 Then, we return the number of bytes requested by get_buffer()
351 and leave the remaining bytes in the holding area until get_buffer()
354 If the buffer is empty, then we follow the same steps
356 asking for a new SEND message to re-fill the buffer.
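The holding-area behaviour can be modelled in Python as follows. This is a
sketch, not the QEMUFileRDMA code: the class name HoldingArea and the
fetch_send callable (which stands in for receiving the next SEND message
from the control channel) are hypothetical.

```python
class HoldingArea:
    """Buffers the bytes of one SEND and hands them out on request."""

    def __init__(self, fetch_send):
        self._fetch_send = fetch_send  # hypothetical: returns bytes of next SEND
        self._buf = b""

    def get_buffer(self, size):
        # If nothing is left over, ask for a new SEND to refill the buffer.
        if not self._buf:
            self._buf = self._fetch_send()
        # Return at most `size` bytes and keep the remainder for later calls.
        out, self._buf = self._buf[:size], self._buf[size:]
        return out
```

A caller that asks for fewer bytes than one SEND carried simply receives the
rest on its next call, without another round trip on the control channel.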
361 At the beginning of the migration (migration-rdma.c),
362 the sender and the receiver populate the list of RAMBlocks
364 Then, using the aforementioned protocol, they exchange a
366 during the iteration of main memory. This description includes
367 a list of all the RAMBlocks, their offsets and lengths, virtual
369 page registration was disabled on the server-side, otherwise not.
371 Main memory is not migrated with the aforementioned protocol,
378 When a chunk is full (or a flush() occurs), the memory backed by
379 the chunk is registered with librdmacm and is pinned in memory on
380 both sides using the aforementioned protocol.
382 for the entire chunk.
385 do not request that the hardware signal the completion queue
386 for the completion of *every* chunk. The current batch size
388 Only the last chunk in a batch must be signaled.
390 and helps keep the hardware busy performing RDMA operations.
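The batching rule can be sketched in Python as a function that decides, for
each chunk, whether its write should request a completion signal. The
function name is hypothetical and the batch size is left as a parameter.
Signaling the final chunk of a partial batch is an assumption of this
sketch, made so that the last writes are not left without a completion.

```python
def signaled_flags(num_chunks, batch_size):
    """Return one boolean per chunk: True if that chunk is signaled."""
    flags = []
    for i in range(num_chunks):
        last_in_batch = (i + 1) % batch_size == 0
        last_overall = i == num_chunks - 1  # ASSUMPTION: flush a partial batch
        flags.append(last_in_batch or last_overall)
    return flags
```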
396 link (one of 4 choices). This is the mode in which
400 the decision is to abort the migration entirely and
401 clean up all the RDMA descriptors and unregister all
402 the memory.
404 After cleanup, the Virtual Machine is returned to normal
405 operation the same way that would happen if the TCP
412    an aborted migration (but with the source VM left unaffected).
413 2. Use of the recent /proc/<pid>/pagemap would likely speed up
414    the use of KSM and ballooning while using RDMA.
419 5. Expose UNREGISTER support to the user by way of workload-specific