Lines Matching full:the

14 of a generic VFIO device type, living inside the VMM, which we call the client,
15 and the core device implementation, living outside the VMM, which we call the
18 The vfio-user specification is partly based on the
21 VFIO is a mature and stable API, backed by an extensively used framework. The
24 particular implementation. None of the VFIO kernel modules are required for
25 supporting the protocol, on either the client or server side. Some source
28 The main idea is to allow a virtual device to function in a separate process in
29 the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
33 * Sharing of client memory for DMA with the server.
34 * Sharing of server memory with the client for fast MMIO.
37 Other socket types could be used which allow the server to run in a separate
38 guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically
39 the underlying transport does not necessarily have to be a socket, however we do
41 domain socket and introduce basic support for the other two types of sockets
45 is not necessary for either the client or the server in order to implement the
52 to a user space process; the device-specific kernel driver does not drive the
53 device at all. Typically, the user space process is a VMM and the device is
55 and the required functionality in the kernel. QEMU has adopted VFIO to allow a
59 vfio-user reuses the core VFIO concepts defined in its API, but implements them
60 as messages to be sent over a socket. It does not change the kernel-based VFIO
61 in any way; in fact, none of the VFIO kernel modules need to be loaded to use
62 vfio-user. It is also possible for the client to concurrently use the current
68 A device under VFIO presents a standard interface to the user process. Many of
69 the VFIO operations in the existing interface use the ``ioctl()`` system call, and
70 references to the existing interface are called the ``ioctl()`` implementation in
73 The following sections describe the set of messages that implement the vfio-user
74 interface over a socket. In many cases, the messages are analogous to data
75 structures used in the ``ioctl()`` implementation. Messages derived from the
76 ``ioctl()`` will have a name derived from the ``ioctl()`` command name. E.g., the
78 ``VFIO_USER_DEVICE_GET_INFO`` message. The purpose of this reuse is to share as
79 much code as feasible with the ``ioctl()`` implementation.
84 After the client connects to the server, the initial client message is
86 apply to the session. The server replies with a compatible version and set of
87 capabilities it supports, or closes the connection if it cannot support the
93 The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
94 information about the device. This information includes:
96 * The device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
97 * the number of device regions, and
98 * the number of interrupt types the device
104 The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
105 server for information about the device's regions. This information describes:
111 When a device region can be mapped by the client, the server provides a file
112 descriptor which the client can ``mmap()``. The server is responsible for
119 by the region info data structure. These capabilities are returned in the
126 region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP``
127 capability is included in the region info reply. This capability describes
128 which portions can be mapped by the client.
132 that accesses to the NVMe registers (found in the beginning of BAR0) are
133 trapped (an infrequent event), while allowing direct access to the doorbells
135 BAR0), found in the next page after the NVMe registers in BAR0.
140 A device can define regions additional to the standard ones (e.g. PCI indexes
142 in the region info reply of a device-specific region. Such regions are reflected
149 For unmapped regions, region I/O from the client is done via
153 configuring the returned file descriptors as ioeventfds or ioregionfds, the
155 trip through the client.
160 The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
161 for the device's interrupt types. The interrupt types are specific to the bus
162 the device is attached to, and the client is expected to know the capabilities
163 of each interrupt type. The server can signal an interrupt by directly injecting
164 interrupts into the guest via an event file descriptor. The client configures
165 how the server signals an interrupt with ``VFIO_USER_DEVICE_SET_IRQS`` messages.
170 When the guest executes load or store operations to an unmapped device region,
171 the client forwards these operations to the server with
172 ``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
173 will reply with data from the device on read operations or an acknowledgement on
179 The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
180 inform the server of the valid DMA ranges that the server can access on behalf
181 of a device (typically, VM guest memory). DMA memory may be accessed by the
182 server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
183 socket. In this case, the "DMA" part of the naming is a misnomer.
185 Actual direct memory access of client memory from the server is possible if the
186 client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
187 privileges cannot be revoked by the client, therefore file descriptors should
188 only be exported in environments where the client trusts the server not to
204 The current protocol specification requires a dedicated socket per
206 single server handles multiple virtual devices from the same or multiple
207 clients. The location of the socket is implementation-specific. Multiplexing
208 clients, devices, and servers over the same socket is not supported in this
209 version of the protocol.
214 For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
215 therefore it is up to the management layer to set up the socket as required.
217 mechanism. Defining that mechanism is deferred to a future version of the
224 replies. The server will process commands in the order they are received. A
225 consequence of this is if a client issues a command with the *No_reply* bit,
226 then subsequently issues a command without *No_reply*, the older command will
227 have been processed before the reply to the younger command is sent by the
228 server. The client must be aware of the device's capability to process
230 multiple client threads to concurrently access device regions; the client must
233 An example is a frame buffer device, where the device may allow concurrent
237 Note that unrelated messages sent from the server to the client can appear in
247 The server and the client can disconnect from each other, either intentionally
248 or unexpectedly. Both the client and the server need to know how to handle such
253 A server disconnecting from the client may indicate that:
259 It is impossible for the client to know whether or not a failure is
260 intermittent or innocuous and should be retried; therefore, the client should
261 reset the VFIO device when it detects the socket has been disconnected.
262 Error recovery will be driven by the guest's device error handling
267 The client disconnecting from the server primarily means that the client
268 has exited. Currently, this means that the guest is shut down so the device is
269 no longer needed; therefore, the server can automatically exit. However, there
276 Therefore in order for the protocol to be forward compatible, the server should
281 - all IRQ file descriptors passed from the old client are closed
282 - the device state should otherwise be retained
284 The expectation is that when a client reconnects, it will re-establish IRQ and
287 If anything happens to the client (such as QEMU actually exiting), the control
300 responded to with an error code. Failure to send the command in the first place
301 (e.g. because the socket is disconnected) is a different type of error examined
302 earlier in the disconnect section.
309 Defining a retry and timeout scheme is deferred to a future version of the
315 Some requests have an ``argsz`` field. In a request, it defines the maximum
316 expected reply payload size, which should be at least the size of the fixed
317 reply payload headers defined here. The *request* payload size is defined by the
318 usual ``msg_size`` field in the header, not the ``argsz`` field.
320 In a reply, the server sets ``argsz`` field to the size needed for a full
321 payload size. This may be less than the requested maximum size. This may be
322 larger than the requested maximum size: in that case, the full payload is not
323 included in the reply, but the ``argsz`` field in the reply indicates the needed
324 size, allowing a client to allocate a larger buffer for holding the reply before
327 In addition, during negotiation (see `Version`_), the client and server may
328 each specify a ``max_data_xfer_size`` value; this defines the maximum data that
329 may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
335 To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
336 with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
337 endianness of the host system, although this may be relaxed in future
338 revisions in cases where the client and server run on different hosts
347 The following table lists the VFIO message command IDs, and whether the
348 message command is sent from the client or the server.
373 16-byte header that contains basic information about the message. The header is
374 followed by message-specific data described in the sections below.
402 * *Message ID* identifies the message, and is echoed in the command's reply
403 message. Message IDs belong entirely to the sender, can be re-used (even
404 concurrently) and the receiver must not make any assumptions about their
406 * *Command* specifies the command to be executed, listed in Commands_. It is
407 also set in the reply header.
408 * *Message size* contains the size of the entire message, including the header.
409 * *Flags* contains attributes of the message:
411 * The *Type* bits indicate the message type.
415 command with the same message ID.
418 the last needs acknowledgement.
419 * *Error* in a reply message indicates the command being acknowledged had
420 an error. In this case, the *Error* field will be valid.
423 even if the Error bit is set in Flags. It is reserved in a command message.
426 unless the message sets the *No_Reply* bit. The reply consists of the header
427 with the *Reply* bit set, plus any additional data.
429 If an error occurs, the reply message must only include the reply header.
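
As a rough illustration of the header fields listed above, the following Python sketch packs a 16-byte header and derives a reply header from a request. The field widths (16-bit message ID, 16-bit command, 32-bit message size, flags, and error) and the flag bit positions are assumptions chosen to add up to the stated 16 bytes; they do not appear in the lines shown here. The helper names are hypothetical.

```python
import struct

# Assumed layout (host endianness, no padding): message ID (u16), command (u16),
# message size (u32), flags (u32), error (u32) = 16 bytes in total.
HEADER = struct.Struct("=HHIII")

# Assumed flag encoding: the low 4 bits hold the Type (0 = command, 1 = reply),
# bit 4 is No_reply, bit 5 is Error.
TYPE_MASK = 0xF
TYPE_COMMAND = 0
TYPE_REPLY = 1
FLAG_NO_REPLY = 1 << 4
FLAG_ERROR = 1 << 5


def pack_header(msg_id, command, payload_len, flags=TYPE_COMMAND, error=0):
    """Build a header; the message size covers the header plus the payload."""
    return HEADER.pack(msg_id, command, HEADER.size + payload_len, flags, error)


def make_reply_header(request_header, payload_len=0, error=0):
    """Echo the message ID and command, and mark the message as a reply."""
    msg_id, command, _size, _flags, _err = HEADER.unpack(request_header)
    flags = TYPE_REPLY
    if error:
        # An error reply carries only the header, so any payload is dropped.
        flags |= FLAG_ERROR
        payload_len = 0
    return pack_header(msg_id, command, payload_len, flags, error)
```

A receiver would read 16 bytes first, unpack them, and then read `message size - 16` further bytes of payload.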
431 As the header is standard in both requests and replies, it is not included in
432 the command-specific specifications below; each message definition should be
433 appended to the standard header, and the offsets are given from the end of the
441 This is the initial message sent by the client after the socket connection is
442 established; the same format is used for the server's reply.
444 Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
445 message proposing a protocol version and a set of capabilities. The server
446 compares these with the versions and capabilities it supports and sends a
447 ``VFIO_USER_VERSION`` reply according to the following rules.
449 * The major version in the reply must be the same as proposed. If the server
450 does not support the proposed major, it closes the connection.
451 * The minor version in the reply must be equal to or less than the minor
453 * The capability list must be a subset of those proposed. If the server
454 requires a capability the client did not include, it closes the connection.
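
The reply rules above can be sketched as a small function. This is only an illustration under stated assumptions: the dictionary shape, the `required` key, and the function name are hypothetical, and closing the connection is modelled by returning `None`.

```python
def negotiate_version(proposed, supported):
    """Return the server's reply as a dict, or None to close the connection.

    Both arguments are hypothetical dicts with the keys "major", "minor" and
    "capabilities" (a set of names). The server's dict may also carry
    "required": capabilities the server cannot operate without.
    """
    # The major version in the reply must equal the proposed one.
    if proposed["major"] != supported["major"]:
        return None
    # Close the connection if a required capability was not proposed.
    if not supported.get("required", set()) <= proposed["capabilities"]:
        return None
    return {
        "major": proposed["major"],
        # The reply minor is never higher than the proposed minor.
        "minor": min(proposed["minor"], supported["minor"]),
        # The reply capabilities are a subset of those proposed.
        "capabilities": proposed["capabilities"] & supported["capabilities"],
    }
```

For example, a client proposing version 0.3 to a server that implements 0.1 would receive a 0.1 reply listing only the capabilities both sides named.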
456 The protocol major version will only change when incompatible protocol changes
457 are made, such as changing the message format. The minor version may change
459 Both the client and server must support all minor versions less than the
463 When making a change to this specification, the protocol version number must
464 be included in the form "added in version X.Y"
477 The version data is an optional UTF-8 encoded JSON byte array with the following
484 | | | the sender supports. Optional. |
493 | | | received by the sender in one message. |
494 | | | Optional. If not specified then the receiver |
510 | | | then migration is not supported by the sender. |
513 | | | are supported if the value is ``true``. |
516 The migration capability contains the following name/value pairs:
521 | pgsize | number | Page size of dirty pages bitmap. The smallest |
522 | | | between the client and the server is used. |
532 The same message format is used in the server's reply with the semantics
538 This command message is sent by the client to the server to inform it of the
539 memory regions the server can access. It must be sent before the server can
540 perform any DMA to the client. It is normally sent directly after the version
541 handshake is completed, but may also occur when memory is added to the client,
542 or if the client uses a vIOMMU.
547 The request payload for this message is a structure of the following format:
571 * *argsz* is the size of the above structure. Note there is no reply payload,
573 * *flags* contains the following region attributes:
575 * *readable* indicates that the region can be read from.
577 * *writeable* indicates that the region can be written to.
579 * *offset* is the file offset of the region with respect to the associated file
580 descriptor, or zero if the region is not mappable
581 * *address* is the base DMA address of the region.
582 * *size* is the size of the region.
584 This structure is 32 bytes in size, so the message size is 16 + 32 bytes.
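
A minimal sketch of that arithmetic, assuming the five fields appear in the order listed above with two 32-bit fields followed by three 64-bit fields (the widths are inferred from the 32-byte total, and the flag bit positions are assumptions, not taken from the lines shown here):

```python
import struct

# argsz (u32), flags (u32), offset (u64), address (u64), size (u64) = 32 bytes.
# Field widths are an assumption chosen to match the stated structure size.
DMA_MAP = struct.Struct("=IIQQQ")
HEADER_SIZE = 16

# Assumed bit positions for the region attributes.
DMA_FLAG_READABLE = 1 << 0
DMA_FLAG_WRITEABLE = 1 << 1


def pack_dma_map(address, size, offset=0,
                 flags=DMA_FLAG_READABLE | DMA_FLAG_WRITEABLE):
    """Pack a DMA map request payload; argsz is the size of the structure."""
    return DMA_MAP.pack(DMA_MAP.size, flags, offset, address, size)


# Hypothetical region: 2 MiB of guest memory at DMA address 4 GiB.
payload = pack_dma_map(address=0x1_0000_0000, size=0x20_0000)
message_size = HEADER_SIZE + len(payload)  # 16 + 32
```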
586 If the DMA region being added can be directly mapped by the server, a file
587 descriptor must be sent as part of the message meta-data. The region can be
588 mapped via the mmap() system call. On ``AF_UNIX`` sockets, the file descriptor
589 must be passed as ``SCM_RIGHTS`` type ancillary data. Otherwise, if the DMA
590 region cannot be directly mapped by the server, a file descriptor must not be sent
591 as part of the message meta-data and the DMA region can be accessed by the
594 region must be failed by the server with ``EEXIST`` set in error field in the
600 There is no payload in the reply message.
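
The descriptor passing described for mappable DMA regions can be tried out locally. The sketch below is not a vfio-user implementation: it uses a socket pair in place of a real client and server, a temporary file in place of guest memory, and placeholder bytes in place of a real message. It relies on `socket.send_fds()` and `socket.recv_fds()`, which need Python 3.9 or later on a Unix system.

```python
import mmap
import os
import socket
import tempfile

REGION_SIZE = mmap.PAGESIZE

# A connected AF_UNIX pair stands in for the client/server connection.
client, server = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)

# The "client" backs its DMA region with a file and writes a marker into it.
backing = tempfile.TemporaryFile()
backing.truncate(REGION_SIZE)
client_view = mmap.mmap(backing.fileno(), REGION_SIZE)
client_view[:5] = b"hello"

# Send placeholder message bytes with the descriptor as SCM_RIGHTS data.
socket.send_fds(client, [b"dma-map"], [backing.fileno()])

# The "server" receives the bytes plus its own copy of the descriptor,
# then maps the same memory and reads what the client wrote.
data, fds, _flags, _addr = socket.recv_fds(server, 1024, 1)
server_view = mmap.mmap(fds[0], REGION_SIZE)
seen_by_server = bytes(server_view[:5])

# Release everything that was opened above.
server_view.close()
client_view.close()
os.close(fds[0])
backing.close()
client.close()
server.close()
```

Because both mappings are shared mappings of the same file, the server observes the client's write without any further message exchange.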
605 This command message is sent by the client to the server to inform it that a
608 subtracted from the client or if the client uses a vIOMMU. The DMA region is
609 described by the following structure:
614 The request payload for this message is a structure of the following format:
628 * *argsz* is the maximum size of the reply payload.
630 * *address* is the base DMA address of the DMA region.
631 * *size* is the size of the DMA region.
633 The address and size of the DMA region being unmapped must match exactly a
639 Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
640 mapped then the server must release all references to that DMA region before
643 The server responds with the original DMA entry in the request.
649 This command message is sent by the client to the server to query for basic
650 information about the device.
675 * *argsz* is the maximum size of the reply payload
701 * *argsz* is the size required for the full reply payload (16 bytes today)
702 * *flags* contains the following device attributes.
704 * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
706 * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.
708 * *num_regions* is the number of memory regions that the device exposes.
709 * *num_irqs* is the number of distinct interrupt types that the device supports.
711 This version of the protocol only supports PCI devices. Additional devices may
717 This command message is sent by the client to the server to query for
718 information about device regions. The VFIO region info structure is defined in
740 * *argsz* is the maximum size of the reply payload
741 * *index* is the index of the memory region being queried; it is the only field
742 that is required to be set in the command message.
777 * *argsz* is the size required for the full reply payload (region info structure
778 plus the size of any region capabilities)
779 * *flags* are attributes of the region:
781 * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
782 * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
783 * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can mmap() the region.
784 When this flag is set, the reply will include a file descriptor in its
785 meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed as
787 * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
790 * *index* is the index of the memory region being queried; it is the only field
791 that is required to be set in the command message.
793 cap_offset is relative to the beginning of the VFIO region info structure.
794 The data structure it points to is a VFIO cap header defined in
796 * *size* is the size of the region.
797 * *offset* is the offset that should be given to the mmap() system call for
798 regions with the MMAP attribute. It is also used as the base offset when
804 The VFIO region information can also include a capabilities list. This list is
806 identifies a capability and where the next capability in the list can be found.
807 The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
823 * *id* is the capability identity.
825 * *next* specifies the offset of the next capability in the capability list. It
826 is relative to the beginning of the VFIO region info structure.
843 This capability is defined when only a subrange of the region supports
844 direct access by the client via mmap(). The VFIO sparse mmap area is defined in
865 * *nr_areas* is the number of sparse mmap areas in the region.
866 * *offset* and *size* describe a single area that can be mapped by the client.
867 There will be *nr_areas* pairs of offset and size. The offset will be added to
868 the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` to form the
869 offset argument of the subsequent mmap() call.
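
The offset computation described above amounts to one addition per area. A minimal sketch, with a hypothetical function name and made-up example values:

```python
def sparse_mmap_offsets(region_offset, areas):
    """Return (mmap_offset, length) pairs, one per sparse mmap area.

    region_offset is the base offset from the region info reply; areas is a
    list of (offset, size) pairs taken from the sparse mmap capability.
    """
    return [(region_offset + offset, size) for offset, size in areas]


# Hypothetical region with base offset 0x10000 and a single mappable 4 KiB
# area that starts 4 KiB into the region.
mappings = sparse_mmap_offsets(0x10000, [(0x1000, 0x1000)])
```

Each resulting pair supplies the offset and length arguments for one `mmap()` call on the region's file descriptor.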
871 The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
879 ``mmap()`` of a file descriptor provided by the server.
884 across such file descriptors to the vfio-user server, without needing to
885 round-trip through the client.
887 The server returns an array of sub-regions for the requested region. Each
888 sub-region describes a span (offset and size) of a region, along with the
889 requested file descriptor notification mechanism to use. Each sub-region in the
890 response message may choose to use a different method, as defined below. The
893 The server in addition returns a file descriptor in the ancillary data; clients
894 are expected to configure each sub-region's file descriptor with the requested
895 notification method. For example, a client could configure KVM with the
913 * *argsz* is the maximum size of the reply payload
914 * *index* is the index of memory region being queried
917 The client must set ``flags`` to zero and specify the region being queried in
918 the ``index``.
937 * *argsz* is the size of the region IO FD info structure plus the
938 total size of the sub-region array. Thus, each array entry "i" is at offset
940 FD types, but this is not to be relied on. As elsewhere, this indicates the
943 * *index* is the index of memory region being queried
944 * *count* is the number of sub-regions in the array
945 * *sub-regions* is the array of Sub-Region IO FD info structures
947 The reply message will additionally include at least one file descriptor in the
948 ancillary data. Note that more than one sub-region may share the same file
951 Note that it is the client's responsibility to verify the requested values (for
952 example, that the requested offset does not exceed the region's bounds).
954 Each sub-region given in the response has one of two possible structures,
979 * *offset* is the offset of the start of the sub-region within the region
980 requested ("physical address offset" for the region)
981 * *size* is the length of the sub-region. This may be zero if the access size is
983 * *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
992 * *datamatch* is the datamatch value if needed
995 KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
1018 * *offset* is the offset of the start of the sub-region within the region
1019 requested ("physical address offset" for the region)
1020 * *size* is the length of the sub-region. This may be zero if the access size is
1023 * *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
1031 * *user_data* is an opaque value passed back to the server via a message on the
1034 For further information on the ioregionfd-specific fields, see:
1042 This command message is sent by the client to the server to query for
1043 information about device interrupt types. The VFIO IRQ info structure is
1073 * *argsz* is the maximum size of the reply payload (16 bytes today)
1074 * *index* is the index of the IRQ type being queried (e.g. ``VFIO_PCI_MSIX_IRQ_INDEX``)
1104 * *argsz* is the size required for the full reply payload (16 bytes today)
1107 * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
1109 * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
1111 * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
1112 triggered, and the client must send an ``UNMASK`` action to receive new
1116 the entire type.
1117 * *index* is the index of the IRQ type being queried
1118 * *count* describes the number of interrupts of the queried type.
1123 This command message is sent by the client to the server to set actions for
1124 device interrupt types. The VFIO IRQ set structure is defined in
1162 * *argsz* is the size of the VFIO IRQ set request payload, including any *data*
1165 * *flags* defines the action performed on the interrupt range. The ``DATA``
1166 flags describe the data field sent in the message; the ``ACTION`` flags
1167 describe the action to be performed. The flags are mutually exclusive for
1170 * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command.
1171 The action is performed unconditionally.
1172 * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean
1173 bytes. The action is performed if the corresponding boolean is true.
1175 was sent in the message meta-data. These descriptors will be signalled when
1176 the action defined by the action flags occurs. In ``AF_UNIX`` sockets, the
1178 If no file descriptors are provided, this de-assigns the specified
1182 or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the guest masks
1183 the interrupt.
1186 interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
1187 guest unmasks the interrupt.
1190 interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
1191 server triggers the interrupt.
1193 * *index* is the index of the IRQ type being set up.
1194 * *start* is the start of the sub-index being set.
1195 * *count* describes the number of sub-indexes being set. As a special case, a
1197 all interrupts of the index.
1198 * *data* is an optional field included when the
1200 that specify whether the action is to be performed on the corresponding
1201 index. It's used when the action is only performed on a subset of the range
1205 The client must know the capabilities of the device and IRQ index before it
1210 1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with
1212 with an eventfd. This associates the IRQ with a particular eventfd on the
1215 #. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with
1217 with another eventfd. This associates the given eventfd with the
1218 mask/unmask state on the server side.
1220 #. The server may trigger the IRQ by writing 1 to the eventfd.
1222 #. The server may mask/unmask an IRQ which will write 1 to the corresponding
1229 6. A client may mask or unmask the IRQ, by sending a
1236 There is no payload in the reply.
1240 Note that all of these operations must be supported by the client and/or server,
1241 even if the corresponding memory or device region has been shared as mappable.
1243 The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the
1249 If a device region is not mappable, it's not directly accessible by the client
1250 via ``mmap()`` of the underlying file descriptor. In this case, a client can
1266 * *offset* into the region being accessed.
1267 * *region* is the index of the region being accessed.
1268 * *count* is the size of the data to be transferred.
1285 * *offset* into the region accessed.
1286 * *region* is the index of the region accessed.
1287 * *count* is the size of the data transferred.
1288 * *data* is the data that was read from the device region.
1293 If a device region is not mappable, it's not directly accessible by the client
1294 via ``mmap()`` of the underlying file descriptor. In this case, a client can write to a device
1312 * *offset* into the region being accessed.
1313 * *region* is the index of the region being accessed.
1314 * *count* is the size of the data to be transferred.
1315 * *data* is the data to write
1330 * *offset* into the region accessed.
1331 * *region* is the index of the region accessed.
1332 * *count* is the size of the data transferred.
1337 If the client has not shared mappable memory, the server can use this message to
1351 * *address* is the client DMA memory address being accessed. This address must have
1352 been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
1353 * *count* is the size of the data to be transferred.
1368 * *address* is the client DMA memory address being accessed.
1369 * *count* is the size of the data transferred.
1370 * *data* is the data read.
1375 If the client has not shared mappable memory, the server can use this message to
1391 * *address* is the client DMA memory address being accessed. This address must have
1392 been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
1393 * *count* is the size of the data to be transferred.
1394 * *data* is the data to write
1407 * *address* is the client DMA memory address being accessed.
1408 * *count* is the size of the data transferred.
1413 This command message is sent from the client to the server to reset the device.
1414 Neither the request nor the reply has a payload.
1420 into a single message. It is only used as an optimization when the
1434 * *wr_cnt* is the number of device writes coalesced in the message
1452 * *offset* into the region being accessed.
1453 * *region* is the index of the region being accessed.
1454 * *count* is the size of the data to be transferred. This format can
1456 * *data* is the data to write.
1467 * *wr_cnt* is the number of device writes completed.
1476 The following VFIO commands do not have an equivalent vfio-user command:
1488 of the above commands may have to be handled by the client in their
1495 The current VFIO implementation includes group and container idioms that
1496 describe how a device relates to the host IOMMU. In the vfio-user
1497 implementation, the IOMMU is implemented in SW by the client, and is not
1498 visible to the server. The simplest idea would be that the client put each
1504 vfio-user backend program conventions are based on the vhost-user ones.
1506 * The backend program must not daemonize itself.
1507 * No assumptions must be made as to what access the backend program has on the
1511 * The backend program must honor the SIGTERM signal.
1512 * The backend program must accept the following command line options:
1517 * The backend program must be accompanied by a JSON file stored under