.. include:: <isonum.txt>
.. SPDX-License-Identifier: GPL-2.0-or-later

================================
vfio-user Protocol Specification
================================

.. contents:: Table of Contents

Introduction
============
vfio-user is a protocol that allows a device to be emulated in a separate
process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
of a generic VFIO device type, living inside the VMM, which we call the client,
and the core device implementation, living outside the VMM, which we call the
server.

The vfio-user specification is partly based on the
`Linux VFIO ioctl interface <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_.

VFIO is a mature and stable API, backed by an extensively used framework. The
existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely
re-used, though there is nothing in this specification that requires that
particular implementation. None of the VFIO kernel modules are required for
supporting the protocol, on either the client or server side. Some source
definitions in VFIO are re-used for vfio-user.

The main idea is to allow a virtual device to function in a separate process in
the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
chosen because file descriptors can be trivially sent over it, which in turn
allows:

* Sharing of client memory for DMA with the server.
* Sharing of server memory with the client for fast MMIO.
* Efficient sharing of eventfd's for triggering interrupts.

Other socket types could be used which allow the server to run in a separate
guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically
the underlying transport does not necessarily have to be a socket, however we do
not examine such alternatives.
In this protocol version we focus on using a UNIX
domain socket and introduce basic support for the other two types of sockets
without considering performance implications.

While passing of file descriptors is desirable for performance reasons, support
is not necessary for either the client or the server in order to implement the
protocol. There is always an in-band, message-passing fallback mechanism.

Overview
========

VFIO is a framework that allows a physical device to be securely passed through
to a user space process; the device-specific kernel driver does not drive the
device at all. Typically, the user space process is a VMM and the device is
passed through to it in order to achieve high performance. VFIO provides an API
and the required functionality in the kernel. QEMU has adopted VFIO to allow a
guest to directly access physical devices, instead of emulating them in
software.

vfio-user reuses the core VFIO concepts defined in its API, but implements them
as messages to be sent over a socket. It does not change the kernel-based VFIO
in any way; in fact, none of the VFIO kernel modules need to be loaded to use
vfio-user. It is also possible for the client to concurrently use the current
kernel-based VFIO for one device, and vfio-user for another device.

VFIO Device Model
-----------------

A device under VFIO presents a standard interface to the user process. Many of
the VFIO operations in the existing interface use the ``ioctl()`` system call, and
references to the existing interface are called the ``ioctl()`` implementation in
this document.

The following sections describe the set of messages that implement the vfio-user
interface over a socket. In many cases, the messages are analogous to data
structures used in the ``ioctl()`` implementation. Messages derived from the
``ioctl()`` will have a name derived from the ``ioctl()`` command name.
E.g., the
``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
``VFIO_USER_DEVICE_GET_INFO`` message. The purpose of this reuse is to share as
much code as feasible with the ``ioctl()`` implementation.

Connection Initiation
^^^^^^^^^^^^^^^^^^^^^

After the client connects to the server, the initial client message is
``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
apply to the session. The server replies with a compatible version and set of
capabilities it supports, or closes the connection if it cannot support the
advertised version.

Device Information
^^^^^^^^^^^^^^^^^^

The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
information about the device. This information includes:

* the device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
* the number of device regions, and
* the number of interrupt types the device supports.

Region Information
^^^^^^^^^^^^^^^^^^

The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
server for information about the device's regions. This information describes:

* Read and write permissions, whether it can be memory mapped, and whether it
  supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
* Region index, size, and offset.

When a device region can be mapped by the client, the server provides a file
descriptor which the client can ``mmap()``. The server is responsible for
polling for client updates to memory mapped regions.

Region Capabilities
"""""""""""""""""""

Some regions have additional capabilities that cannot be described adequately
by the region info data structure. These capabilities are returned in the
region info reply in a list similar to PCI capabilities in a PCI device's
configuration space.
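
As a rough illustration of such a list, the following Python sketch walks a
chain of capability headers inside a region info reply payload. It assumes the
header layout of ``struct vfio_info_cap_header`` from ``<linux/vfio.h>`` (16-bit
``id``, 16-bit ``version``, 32-bit ``next``), that offsets are counted from the
start of the reply payload, and that a ``next`` value of zero ends the chain.
The function name ``walk_capabilities`` is hypothetical.

```python
import struct

# Assumed layout of struct vfio_info_cap_header in <linux/vfio.h>:
# u16 id, u16 version, u32 next ("=" means host byte order, no padding).
CAP_HEADER = struct.Struct("=HHI")


def walk_capabilities(reply, cap_offset):
    """Yield (id, version, offset) for each capability in the chain.

    `reply` is the region info reply payload and `cap_offset` the offset
    of the first capability; an offset of zero means "no capability".
    """
    seen = set()
    offset = cap_offset
    while offset != 0:
        # Never trust the peer: reject loops and out-of-bounds offsets.
        if offset in seen or offset + CAP_HEADER.size > len(reply):
            raise ValueError("malformed capability chain")
        seen.add(offset)
        cap_id, version, next_offset = CAP_HEADER.unpack_from(reply, offset)
        yield cap_id, version, offset
        offset = next_offset
```

The loop and bounds checks reflect the advice in `Security Considerations`_
that neither side should trust input from the other.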

Sparse Regions
""""""""""""""
A region can be memory-mappable in whole or in part. When only a subset of a
region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP``
capability is included in the region info reply. This capability describes
which portions can be mapped by the client.

.. Note::
   For example, in a virtual NVMe controller, sparse regions can be used so
   that accesses to the NVMe registers (found in the beginning of BAR0) are
   trapped (an infrequent event), while allowing direct access to the doorbells
   (an extremely frequent event as every I/O submission requires a write to
   BAR0), found in the next page after the NVMe registers in BAR0.

Device-Specific Regions
"""""""""""""""""""""""

A device can define regions additional to the standard ones (e.g. PCI indexes
0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability
in the region info reply of a device-specific region. Such regions are reflected
in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this
value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``.

Region I/O via file descriptors
-------------------------------

For unmapped regions, region I/O from the client is done via
``VFIO_USER_REGION_READ/WRITE``. As an optimization, ioeventfds or ioregionfds
may be configured for sub-regions of some regions. A client may request
information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by
configuring the returned file descriptors as ioeventfds or ioregionfds, the
server can be directly notified of I/O (for example, by KVM) without taking a
trip through the client.

Interrupts
^^^^^^^^^^

The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
for the device's interrupt types.
The interrupt types are specific to the bus
the device is attached to, and the client is expected to know the capabilities
of each interrupt type. The server can signal an interrupt by directly injecting
interrupts into the guest via an event file descriptor. The client configures
how the server signals an interrupt with ``VFIO_USER_DEVICE_SET_IRQS`` messages.

Device Read and Write
^^^^^^^^^^^^^^^^^^^^^

When the guest executes load or store operations to an unmapped device region,
the client forwards these operations to the server with
``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
will reply with data from the device on read operations or an acknowledgement on
write operations. See `Read and Write Operations`_.

Client memory access
--------------------

The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
inform the server of the valid DMA ranges that the server can access on behalf
of a device (typically, VM guest memory). DMA memory may be accessed by the
server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
socket. In this case, the "DMA" part of the naming is a misnomer.

Actual direct memory access of client memory from the server is possible if the
client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
privileges cannot be revoked by the client, therefore file descriptors should
only be exported in environments where the client trusts the server not to
corrupt guest memory.

See `Read and Write Operations`_.

Client/server interactions
==========================

Socket
------

A server can serve:

1) one or more clients, and/or
2) one or more virtual devices, belonging to one or more clients.

The current protocol specification requires a dedicated socket per
client/server connection.
It is a server-side implementation detail whether a
single server handles multiple virtual devices from the same or multiple
clients. The location of the socket is implementation-specific. Multiplexing
clients, devices, and servers over the same socket is not supported in this
version of the protocol.

Authentication
--------------

For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
therefore it is up to the management layer to set up the socket as required.
Socket types that span guests or hosts will require a proper authentication
mechanism. Defining that mechanism is deferred to a future version of the
protocol.

Command Concurrency
-------------------

A client may pipeline multiple commands without waiting for previous command
replies. The server will process commands in the order they are received. A
consequence of this is that if a client issues a command with the *No_reply*
bit, then subsequently issues a command without *No_reply*, the older command
will have been processed before the reply to the younger command is sent by the
server. The client must be aware of the device's capability to process
concurrent commands if pipelining is used. For example, pipelining allows
multiple client threads to concurrently access device regions; the client must
ensure these accesses obey device semantics.

An example is a frame buffer device, where the device may allow concurrent
access to different areas of video memory, but may have indeterminate behavior
if concurrent accesses are performed to command or status registers.

Note that unrelated messages sent from the server to the client can appear in
between a client to server request/reply and vice versa.

Implementers should be prepared for certain commands to exhibit potentially
unbounded latencies.
For example, ``VFIO_USER_DEVICE_RESET`` may take an
arbitrarily long time to complete; clients should take care not to block
unnecessarily.

Socket Disconnection Behavior
-----------------------------
The server and the client can disconnect from each other, either intentionally
or unexpectedly. Both the client and the server need to know how to handle such
events.

Server Disconnection
^^^^^^^^^^^^^^^^^^^^
A server disconnecting from the client may indicate that:

1) A virtual device has been restarted, either intentionally (e.g. because of a
   device update) or unintentionally (e.g. because of a crash).
2) A virtual device has been shut down with no intention to be restarted.

It is impossible for the client to know whether or not a failure is
intermittent or innocuous and should be retried; therefore, the client should
reset the VFIO device when it detects the socket has been disconnected.
Error recovery will be driven by the guest's device error handling
behavior.

Client Disconnection
^^^^^^^^^^^^^^^^^^^^
The client disconnecting from the server primarily means that the client
has exited. Currently, this means that the guest is shut down, so the device is
no longer needed and the server can automatically exit. However, there
can be cases where a client disconnection should not result in a server exit:

1) A single server serving multiple clients.
2) A multi-process QEMU upgrading itself step by step, which is not yet
   implemented.

Therefore, in order for the protocol to be forward compatible, the server should
respond to a client disconnection as follows:

 - all client memory regions are unmapped and cleaned up (including closing any
   passed file descriptors)
 - all IRQ file descriptors passed from the old client are closed
 - the device state should otherwise be retained

The expectation is that when a client reconnects, it will re-establish IRQ and
client memory mappings.

If anything happens to the client (for example, QEMU really did exit), the
control stack will know about it and can clean up resources accordingly.

Security Considerations
-----------------------

Speaking generally, vfio-user clients should not trust servers, and vice versa.
Standard tools and mechanisms should be used on both sides to validate input and
protect against denial of service scenarios, buffer overflow, etc.

Request Retry and Response Timeout
----------------------------------
A failed command is a command that has been successfully sent and has been
responded to with an error code. Failure to send the command in the first place
(e.g. because the socket is disconnected) is a different type of error examined
earlier in the disconnect section.

.. Note::
   QEMU's VFIO retries certain operations if they fail. While this makes sense
   for real HW, we don't know for sure whether it makes sense for virtual
   devices.

Defining a retry and timeout scheme is deferred to a future version of the
protocol.

Message sizes
-------------

Some requests have an ``argsz`` field. In a request, it defines the maximum
expected reply payload size, which should be at least the size of the fixed
reply payload headers defined here. The *request* payload size is defined by the
usual ``msg_size`` field in the header, not the ``argsz`` field.

In a reply, the server sets the ``argsz`` field to the size needed for the full
payload. This may be less than the requested maximum size. It may also be
larger than the requested maximum size: in that case, the full payload is not
included in the reply, but the ``argsz`` field in the reply indicates the needed
size, allowing a client to allocate a larger buffer for holding the reply before
trying again.

In addition, during negotiation (see `Version`_), the client and server may
each specify a ``max_data_xfer_size`` value; this defines the maximum data that
may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
messages; see `Read and Write Operations`_.

Protocol Specification
======================

To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
endianness of the host system, although this may be relaxed in future
revisions in cases where the client and server run on different hosts
with different endianness.

Unless otherwise specified, all sizes should be presumed to be in bytes.

.. _Commands:

Commands
--------
The following table lists the VFIO message command IDs, and whether the
message command is sent from the client or the server.

====================================== ========= =================
Name                                   Command   Request Direction
====================================== ========= =================
``VFIO_USER_VERSION``                  1         client -> server
``VFIO_USER_DMA_MAP``                  2         client -> server
``VFIO_USER_DMA_UNMAP``                3         client -> server
``VFIO_USER_DEVICE_GET_INFO``          4         client -> server
``VFIO_USER_DEVICE_GET_REGION_INFO``   5         client -> server
``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` 6         client -> server
``VFIO_USER_DEVICE_GET_IRQ_INFO``      7         client -> server
``VFIO_USER_DEVICE_SET_IRQS``          8         client -> server
``VFIO_USER_REGION_READ``              9         client -> server
``VFIO_USER_REGION_WRITE``             10        client -> server
``VFIO_USER_DMA_READ``                 11        server -> client
``VFIO_USER_DMA_WRITE``                12        server -> client
``VFIO_USER_DEVICE_RESET``             13        client -> server
``VFIO_USER_REGION_WRITE_MULTI``       15        client -> server
====================================== ========= =================

Header
------

All messages, both command messages and reply messages, are preceded by a
16-byte header that contains basic information about the message. The header is
followed by message-specific data described in the sections below.

+----------------+--------+-------------+
| Name           | Offset | Size        |
+================+========+=============+
| Message ID     | 0      | 2           |
+----------------+--------+-------------+
| Command        | 2      | 2           |
+----------------+--------+-------------+
| Message size   | 4      | 4           |
+----------------+--------+-------------+
| Flags          | 8      | 4           |
+----------------+--------+-------------+
|                | +-----+------------+ |
|                | | Bit | Definition | |
|                | +=====+============+ |
|                | | 0-3 | Type       | |
|                | +-----+------------+ |
|                | | 4   | No_reply   | |
|                | +-----+------------+ |
|                | | 5   | Error      | |
|                | +-----+------------+ |
+----------------+--------+-------------+
| Error          | 12     | 4           |
+----------------+--------+-------------+
| <message data> | 16     | variable    |
+----------------+--------+-------------+

* *Message ID* identifies the message, and is echoed in the command's reply
  message. Message IDs belong entirely to the sender, can be re-used (even
  concurrently) and the receiver must not make any assumptions about their
  uniqueness.
* *Command* specifies the command to be executed, listed in Commands_. It is
  also set in the reply header.
* *Message size* contains the size of the entire message, including the header.
* *Flags* contains attributes of the message:

  * The *Type* bits indicate the message type.

    * *Command* (value 0x0) indicates a command message.
    * *Reply* (value 0x1) indicates a reply message acknowledging a previous
      command with the same message ID.
  * *No_reply* in a command message indicates that no reply is needed for this
    command. This is commonly used when multiple commands are sent, and only
    the last needs acknowledgement.
  * *Error* in a reply message indicates the command being acknowledged had
    an error. In this case, the *Error* field will be valid.

* *Error* in a reply message is an optional UNIX errno value.
  It may be zero
  even if the Error bit is set in Flags. It is reserved in a command message.

Each command message in Commands_ must be replied to with a reply message,
unless the message sets the *No_reply* bit. The reply consists of the header
with the *Reply* bit set, plus any additional data.

If an error occurs, the reply message must only include the reply header.

As the header is standard in both requests and replies, it is not included in
the command-specific specifications below; each message definition should be
appended to the standard header, and the offsets are given from the end of the
standard header.

``VFIO_USER_VERSION``
---------------------

.. _Version:

This is the initial message sent by the client after the socket connection is
established; the same format is used for the server's reply.

Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
message proposing a protocol version and a set of capabilities. The server
compares these with the versions and capabilities it supports and sends a
``VFIO_USER_VERSION`` reply according to the following rules.

* The major version in the reply must be the same as proposed. If the server
  does not support the proposed major, it closes the connection.
* The minor version in the reply must be equal to or less than the minor
  version proposed.
* The capability list must be a subset of those proposed. If the server
  requires a capability the client did not include, it closes the connection.

The protocol major version will only change when incompatible protocol changes
are made, such as changing the message format. The minor version may change
when compatible changes are made, such as adding new messages or capabilities.
Both the client and server must support all minor versions less than the
maximum minor version they support.
E.g., an implementation that supports
version 1.3 must also support 1.0 through 1.2.

When making a change to this specification, the protocol version number must
be included in the form "added in version X.Y".

Request
^^^^^^^

==============  ======  ====
Name            Offset  Size
==============  ======  ====
version major   0       2
version minor   2       2
version data    4       variable (including terminating NUL). Optional.
==============  ======  ====

The version data is an optional UTF-8 encoded JSON byte array with the following
format:

+--------------+--------+-----------------------------------+
| Name         | Type   | Description                       |
+==============+========+===================================+
| capabilities | object | Contains common capabilities that |
|              |        | the sender supports. Optional.    |
+--------------+--------+-----------------------------------+

Capabilities:

+--------------------+---------+------------------------------------------------+
| Name               | Type    | Description                                    |
+====================+=========+================================================+
| max_msg_fds        | number  | Maximum number of file descriptors that can be |
|                    |         | received by the sender in one message.         |
|                    |         | Optional. If not specified then the receiver   |
|                    |         | must assume a value of ``1``.                  |
+--------------------+---------+------------------------------------------------+
| max_data_xfer_size | number  | Maximum ``count`` for data transfer messages;  |
|                    |         | see `Read and Write Operations`_. Optional,    |
|                    |         | with a default value of 1048576 bytes.         |
+--------------------+---------+------------------------------------------------+
| pgsizes            | number  | Page sizes supported in DMA map operations     |
|                    |         | or'ed together. Optional, with a default value |
|                    |         | of supporting only 4k pages.                   |
+--------------------+---------+------------------------------------------------+
| max_dma_maps       | number  | Maximum number of DMA map windows that can be  |
|                    |         | valid simultaneously. Optional, with a         |
|                    |         | default value of 65535 (64k-1).                |
+--------------------+---------+------------------------------------------------+
| migration          | object  | Migration capability parameters. If missing    |
|                    |         | then migration is not supported by the sender. |
+--------------------+---------+------------------------------------------------+
| write_multiple     | boolean | ``VFIO_USER_REGION_WRITE_MULTI`` messages      |
|                    |         | are supported if the value is ``true``.        |
+--------------------+---------+------------------------------------------------+

The migration capability contains the following name/value pairs:

+-----------------+--------+--------------------------------------------------+
| Name            | Type   | Description                                      |
+=================+========+==================================================+
| pgsize          | number | Page size of dirty pages bitmap. The smallest    |
|                 |        | between the client and the server is used.       |
+-----------------+--------+--------------------------------------------------+
| max_bitmap_size | number | Maximum bitmap size in ``VFIO_USER_DIRTY_PAGES`` |
|                 |        | and ``VFIO_USER_DMA_UNMAP`` messages. Optional,  |
|                 |        | with a default value of 256MB.                   |
+-----------------+--------+--------------------------------------------------+

Reply
^^^^^

The same message format is used in the server's reply with the semantics
described above.

``VFIO_USER_DMA_MAP``
---------------------

This command message is sent by the client to the server to inform it of the
memory regions the server can access. It must be sent before the server can
perform any DMA to the client.
It is normally sent directly after the version
handshake is completed, but may also occur when memory is added to the client,
or if the client uses a vIOMMU.

Request
^^^^^^^

The request payload for this message is a structure of the following format:

+-------------+--------+-------------+
| Name        | Offset | Size        |
+=============+========+=============+
| argsz       | 0      | 4           |
+-------------+--------+-------------+
| flags       | 4      | 4           |
+-------------+--------+-------------+
|             | +-----+------------+ |
|             | | Bit | Definition | |
|             | +=====+============+ |
|             | | 0   | readable   | |
|             | +-----+------------+ |
|             | | 1   | writeable  | |
|             | +-----+------------+ |
+-------------+--------+-------------+
| offset      | 8      | 8           |
+-------------+--------+-------------+
| address     | 16     | 8           |
+-------------+--------+-------------+
| size        | 24     | 8           |
+-------------+--------+-------------+

* *argsz* is the size of the above structure. Note there is no reply payload,
  so this field differs from other message types.
* *flags* contains the following region attributes:

  * *readable* indicates that the region can be read from.

  * *writeable* indicates that the region can be written to.

* *offset* is the file offset of the region with respect to the associated file
  descriptor, or zero if the region is not mappable.
* *address* is the base DMA address of the region.
* *size* is the size of the region.

This structure is 32 bytes in size, so the message size is 16 + 32 bytes.

If the DMA region being added can be directly mapped by the server, a file
descriptor must be sent as part of the message meta-data. The region can be
mapped via the ``mmap()`` system call. On ``AF_UNIX`` sockets, the file
descriptor must be passed as ``SCM_RIGHTS`` type ancillary data.
Otherwise, if the DMA
region cannot be directly mapped by the server, a file descriptor must not be
sent as part of the message meta-data and the DMA region can be accessed by the
server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages,
explained in `Read and Write Operations`_. A command to map over an existing
region must be failed by the server with ``EEXIST`` set in the error field of
the reply.

Reply
^^^^^

There is no payload in the reply message.

``VFIO_USER_DMA_UNMAP``
-----------------------

This command message is sent by the client to the server to inform it that a
DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
message, is no longer available for DMA. It typically occurs when memory is
removed from the client or if the client uses a vIOMMU. The DMA region is
described by the following structure:

Request
^^^^^^^

The request payload for this message is a structure of the following format:

+--------------+--------+------------------------+
| Name         | Offset | Size                   |
+==============+========+========================+
| argsz        | 0      | 4                      |
+--------------+--------+------------------------+
| flags        | 4      | 4                      |
+--------------+--------+------------------------+
| address      | 8      | 8                      |
+--------------+--------+------------------------+
| size         | 16     | 8                      |
+--------------+--------+------------------------+

* *argsz* is the maximum size of the reply payload.
* *flags* is unused in this version.
* *address* is the base DMA address of the DMA region.
* *size* is the size of the DMA region.

The address and size of the DMA region being unmapped must match exactly a
previous mapping.
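
To make the layout concrete, the following Python sketch packs a complete
``VFIO_USER_DMA_UNMAP`` command message: the 16-byte header from `Header`_
followed by the 24-byte request payload. The helper name ``build_dma_unmap`` is
hypothetical, and setting *argsz* to the size of the request structure is an
assumption made for the example (it presumes the reply payload is no larger
than the request).

```python
import struct

# Header fields: message ID (2), command (2), message size (4),
# flags (4), error (4). "=" selects host byte order without padding.
HEADER = struct.Struct("=HHIII")
# Request payload fields: argsz (4), flags (4), address (8), size (8).
DMA_UNMAP = struct.Struct("=IIQQ")

VFIO_USER_DMA_UNMAP = 3  # command ID from the Commands table
TYPE_COMMAND = 0x0       # Type bits of the header flags: command message


def build_dma_unmap(msg_id, address, size):
    """Return the bytes of a VFIO_USER_DMA_UNMAP command message."""
    # Assumption: argsz is set to the size of this structure; flags is
    # unused in this version and therefore left as zero.
    payload = DMA_UNMAP.pack(DMA_UNMAP.size, 0, address, size)
    # The message size covers the header and the payload; the error
    # field is reserved in a command message and left as zero.
    header = HEADER.pack(msg_id, VFIO_USER_DMA_UNMAP,
                         HEADER.size + len(payload), TYPE_COMMAND, 0)
    return header + payload
```

For instance, ``build_dma_unmap(7, 0x10000, 0x2000)`` yields a 40-byte message
whose *Message size* field is 16 + 24.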

Reply
^^^^^

Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
mapped then the server must release all references to that DMA region before
replying, which potentially includes in-flight DMA transactions.

The server responds with the original DMA entry in the request.


``VFIO_USER_DEVICE_GET_INFO``
-----------------------------

This command message is sent by the client to the server to query for basic
information about the device.

Request
^^^^^^^

+-------------+--------+--------------------------+
| Name        | Offset | Size                     |
+=============+========+==========================+
| argsz       | 0      | 4                        |
+-------------+--------+--------------------------+
| flags       | 4      | 4                        |
+-------------+--------+--------------------------+
|             | +-----+-------------------------+ |
|             | | Bit | Definition              | |
|             | +=====+=========================+ |
|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
|             | +-----+-------------------------+ |
|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
|             | +-----+-------------------------+ |
+-------------+--------+--------------------------+
| num_regions | 8      | 4                        |
+-------------+--------+--------------------------+
| num_irqs    | 12     | 4                        |
+-------------+--------+--------------------------+

* *argsz* is the maximum size of the reply payload.
* all other fields must be zero.
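
A minimal Python sketch of this request payload follows. It packs the four
32-bit fields in host byte order with only *argsz* set, and includes a small
decoder for the two flag bits shown in the table. Both helper names are
hypothetical, and using the 16-byte structure size as *argsz* is an assumption
made for the example.

```python
import struct

# argsz, flags, num_regions, num_irqs: four 32-bit fields, host byte order.
DEVICE_INFO = struct.Struct("=IIII")

# Flag bits from the table above.
VFIO_DEVICE_FLAGS_RESET = 1 << 0
VFIO_DEVICE_FLAGS_PCI = 1 << 1


def build_device_get_info_request():
    """Hypothetical helper: request payload with only argsz set."""
    # Assumption: argsz is the size of this structure; every other
    # field must be zero in the request.
    return DEVICE_INFO.pack(DEVICE_INFO.size, 0, 0, 0)


def flag_names(flags):
    """Hypothetical helper: names of the flag bits set in `flags`."""
    names = []
    if flags & VFIO_DEVICE_FLAGS_RESET:
        names.append("VFIO_DEVICE_FLAGS_RESET")
    if flags & VFIO_DEVICE_FLAGS_PCI:
        names.append("VFIO_DEVICE_FLAGS_PCI")
    return names
```

The payload is appended to the standard header with the command set to 4
(``VFIO_USER_DEVICE_GET_INFO``).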

Reply
^^^^^

+-------------+--------+--------------------------+
| Name        | Offset | Size                     |
+=============+========+==========================+
| argsz       | 0      | 4                        |
+-------------+--------+--------------------------+
| flags       | 4      | 4                        |
+-------------+--------+--------------------------+
|             | +-----+-------------------------+ |
|             | | Bit | Definition              | |
|             | +=====+=========================+ |
|             | | 0   | VFIO_DEVICE_FLAGS_RESET | |
|             | +-----+-------------------------+ |
|             | | 1   | VFIO_DEVICE_FLAGS_PCI   | |
|             | +-----+-------------------------+ |
+-------------+--------+--------------------------+
| num_regions | 8      | 4                        |
+-------------+--------+--------------------------+
| num_irqs    | 12     | 4                        |
+-------------+--------+--------------------------+

* *argsz* is the size required for the full reply payload (16 bytes today).
* *flags* contains the following device attributes.

  * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
    ``VFIO_USER_DEVICE_RESET`` message.
  * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.

* *num_regions* is the number of memory regions that the device exposes.
* *num_irqs* is the number of distinct interrupt types that the device supports.

This version of the protocol only supports PCI devices. Additional devices may
be supported in future versions.

``VFIO_USER_DEVICE_GET_REGION_INFO``
------------------------------------

This command message is sent by the client to the server to query for
information about device regions. The VFIO region info structure is defined in
``<linux/vfio.h>`` (``struct vfio_region_info``).

Request
^^^^^^^

+------------+--------+------------------------------+
| Name       | Offset | Size                         |
+============+========+==============================+
| argsz      | 0      | 4                            |
+------------+--------+------------------------------+
| flags      | 4      | 4                            |
+------------+--------+------------------------------+
| index      | 8      | 4                            |
+------------+--------+------------------------------+
| cap_offset | 12     | 4                            |
+------------+--------+------------------------------+
| size       | 16     | 8                            |
+------------+--------+------------------------------+
| offset     | 24     | 8                            |
+------------+--------+------------------------------+

* *argsz* is the maximum size of the reply payload
* *index* is the index of the memory region being queried; it is the only field
  that is required to be set in the command message.
* all other fields must be zero.

Reply
^^^^^

+------------+--------+------------------------------+
| Name       | Offset | Size                         |
+============+========+==============================+
| argsz      | 0      | 4                            |
+------------+--------+------------------------------+
| flags      | 4      | 4                            |
+------------+--------+------------------------------+
|            | +-----+-----------------------------+ |
|            | | Bit | Definition                  | |
|            | +=====+=============================+ |
|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
|            | +-----+-----------------------------+ |
|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
|            | +-----+-----------------------------+ |
|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
|            | +-----+-----------------------------+ |
|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
|            | +-----+-----------------------------+ |
+------------+--------+------------------------------+
| index      | 8      | 4                            |
+------------+--------+------------------------------+
| cap_offset | 12     | 4                            |
+------------+--------+------------------------------+
| size       | 16     | 8                            |
+------------+--------+------------------------------+
| offset     | 24     | 8                            |
+------------+--------+------------------------------+

* *argsz* is the size required for the full reply payload (region info structure
  plus the size of any region capabilities)
* *flags* are attributes of the region:

  * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
  * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
  * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies the client can ``mmap()`` the
    region. When this flag is set, the reply will include a file descriptor in
    its meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed
    as ``SCM_RIGHTS`` type ancillary data.
  * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
    reply.

* *index* is the index of the memory region being queried; it is the only field
  that is required to be set in the command message.
* *cap_offset* describes where additional region capabilities can be found.
  cap_offset is relative to the beginning of the VFIO region info structure.
  The data structure it points to is a VFIO cap header defined in
  ``<linux/vfio.h>``.
* *size* is the size of the region.
* *offset* is the offset that should be given to the ``mmap()`` system call for
  regions with the MMAP attribute. It is also used as the base offset when
  mapping a VFIO sparse mmap area, described below.

VFIO region capabilities
""""""""""""""""""""""""

The VFIO region information can also include a capabilities list. This list is
similar to a PCI capability list - each entry has a common header that
identifies a capability and where the next capability in the list can be found.
The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
vfio_info_cap_header``).
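
Walking the capability list can be sketched as follows. The sketch assumes
little-endian fields, the 8-byte layout of ``struct vfio_info_cap_header``
(16-bit id, 16-bit version, 32-bit next), and the Linux VFIO convention that a
*next* value of zero ends the list. ``walk_region_caps`` and the sample reply
are hypothetical and only serve to illustrate the traversal.

```python
import struct

# argsz, flags, index, cap_offset, size, offset (32 bytes in total).
REGION_INFO = struct.Struct("<IIIIQQ")
# id, version, next (8 bytes in total).
CAP_HEADER = struct.Struct("<HHI")

VFIO_REGION_INFO_FLAG_CAPS = 1 << 3  # bit 3 in the reply flags table


def walk_region_caps(reply):
    """Yield (id, version, offset) for each capability in a region info reply.

    Offsets are relative to the start of the region info structure.
    """
    argsz, flags, _index, cap_offset, _size, _offset = \
        REGION_INFO.unpack_from(reply)
    if not flags & VFIO_REGION_INFO_FLAG_CAPS:
        return
    limit = min(argsz, len(reply))
    seen = set()
    while cap_offset:
        # Reject chains that loop or point outside the capability area.
        if (cap_offset in seen or cap_offset < REGION_INFO.size
                or cap_offset + CAP_HEADER.size > limit):
            raise ValueError("malformed capability chain")
        seen.add(cap_offset)
        cap_id, version, next_offset = CAP_HEADER.unpack_from(reply, cap_offset)
        yield cap_id, version, cap_offset
        cap_offset = next_offset


# Hypothetical reply: one capability (id 1, version 1) placed directly after
# the 32-byte region info structure.
sample = REGION_INFO.pack(40, VFIO_REGION_INFO_FLAG_CAPS, 0, 32, 0x1000, 0)
sample += CAP_HEADER.pack(1, 1, 0)
caps = list(walk_region_caps(sample))
```

The bounds and loop checks are a defensive choice of this sketch: the offsets
come from the peer, so they are validated before being used to index the
payload.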

VFIO cap header format
""""""""""""""""""""""

+---------+--------+------+
| Name    | Offset | Size |
+=========+========+======+
| id      | 0      | 2    |
+---------+--------+------+
| version | 2      | 2    |
+---------+--------+------+
| next    | 4      | 4    |
+---------+--------+------+

* *id* is the capability identity.
* *version* is a capability-specific version number.
* *next* specifies the offset of the next capability in the capability list. It
  is relative to the beginning of the VFIO region info structure.

VFIO sparse mmap cap header
"""""""""""""""""""""""""""

+------------------+----------------------------------+
| Name             | Value                            |
+==================+==================================+
| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
+------------------+----------------------------------+
| version          | 0x1                              |
+------------------+----------------------------------+
| next             | <next>                           |
+------------------+----------------------------------+
| sparse mmap info | VFIO region info sparse mmap     |
+------------------+----------------------------------+

This capability is defined when only a subrange of the region supports
direct access by the client via mmap(). The VFIO sparse mmap area is defined in
``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area`` and ``struct
vfio_region_info_cap_sparse_mmap``).

VFIO region info cap sparse mmap
""""""""""""""""""""""""""""""""

+----------+--------+------+
| Name     | Offset | Size |
+==========+========+======+
| nr_areas | 0      | 4    |
+----------+--------+------+
| reserved | 4      | 4    |
+----------+--------+------+
| offset   | 8      | 8    |
+----------+--------+------+
| size     | 16     | 8    |
+----------+--------+------+
| ...      |        |      |
+----------+--------+------+

* *nr_areas* is the number of sparse mmap areas in the region.
* *offset* and *size* describe a single area that can be mapped by the client.
  There will be *nr_areas* pairs of offset and size. The offset will be added to
  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` reply to
  form the offset argument of the subsequent ``mmap()`` call.

The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
vfio_region_info_cap_sparse_mmap``).


``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
--------------------------------------

Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if provided, by
``mmap()`` of a file descriptor provided by the server.

``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access mechanism via
file descriptors. This is an optional feature intended for performance
improvements where an underlying sub-system (such as KVM) supports communication
across such file descriptors to the vfio-user server, without needing to
round-trip through the client.

The server returns an array of sub-regions for the requested region. Each
sub-region describes a span (offset and size) of a region, along with the
requested file descriptor notification mechanism to use. Each sub-region in the
response message may choose to use a different method, as defined below. The
two mechanisms supported in this specification are ioeventfds and ioregionfds.

The server in addition returns a file descriptor in the ancillary data; clients
are expected to configure each sub-region's file descriptor with the requested
notification method. For example, a client could configure KVM with the
requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``.

Request
^^^^^^^

+-------------+--------+------+
| Name        | Offset | Size |
+=============+========+======+
| argsz       | 0      | 4    |
+-------------+--------+------+
| flags       | 4      | 4    |
+-------------+--------+------+
| index       | 8      | 4    |
+-------------+--------+------+
| count       | 12     | 4    |
+-------------+--------+------+

* *argsz* is the maximum size of the reply payload
* *index* is the index of the memory region being queried
* all other fields must be zero

The client must set ``flags`` to zero and specify the region being queried in
the ``index``.

Reply
^^^^^

+-------------+--------+------+
| Name        | Offset | Size |
+=============+========+======+
| argsz       | 0      | 4    |
+-------------+--------+------+
| flags       | 4      | 4    |
+-------------+--------+------+
| index       | 8      | 4    |
+-------------+--------+------+
| count       | 12     | 4    |
+-------------+--------+------+
| sub-regions | 16     | ...  |
+-------------+--------+------+

* *argsz* is the size of the region IO FD info structure plus the
  total size of the sub-region array. As the sub-region array starts at offset
  16, each array entry "i" is at offset i * ((argsz - 16) / count) from the
  start of the array. Note that the entry size is currently 40 bytes for both
  IO FD types, but this is not to be relied on. As elsewhere, *argsz* indicates
  the full reply payload size needed.
* *flags* must be zero
* *index* is the index of the memory region being queried
* *count* is the number of sub-regions in the array
* *sub-regions* is the array of Sub-Region IO FD info structures

The reply message will additionally include at least one file descriptor in the
ancillary data. Note that more than one sub-region may share the same file
descriptor.

Note that it is the client's responsibility to verify the requested values (for
example, that the requested offset does not exceed the region's bounds).

Each sub-region given in the response has one of two possible structures,
depending on whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:

Sub-Region IO FD info format (ioeventfd)
""""""""""""""""""""""""""""""""""""""""

+-----------+--------+------+
| Name      | Offset | Size |
+===========+========+======+
| offset    | 0      | 8    |
+-----------+--------+------+
| size      | 8      | 8    |
+-----------+--------+------+
| fd_index  | 16     | 4    |
+-----------+--------+------+
| type      | 20     | 4    |
+-----------+--------+------+
| flags     | 24     | 4    |
+-----------+--------+------+
| padding   | 28     | 4    |
+-----------+--------+------+
| datamatch | 32     | 8    |
+-----------+--------+------+

* *offset* is the offset of the start of the sub-region within the region
  requested ("physical address offset" for the region)
* *size* is the length of the sub-region. This may be zero if the access size is
  not relevant, which may allow for optimizations
* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
  notification; it may be shared.
* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD``
* *flags* is any of:

  * ``KVM_IOEVENTFD_FLAG_DATAMATCH``
  * ``KVM_IOEVENTFD_FLAG_PIO``
  * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?)

* *datamatch* is the datamatch value if needed

See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59
KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
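
A client-side decoder for the reply payload might look like the sketch below.
It takes the 16-byte fixed part from the reply table (sub-regions start at
offset 16), derives the per-entry stride from *argsz* instead of hard-coding
40 bytes, and reads every entry with the ioeventfd layout. Little-endian
fields, the function name, and the sample values (including the *type* value
of 0) are assumptions made for illustration.

```python
import struct

# argsz, flags, index, count (fixed part of the reply, 16 bytes).
IO_FDS_HEADER = struct.Struct("<IIII")
# offset, size, fd_index, type, flags, padding, datamatch (40 bytes).
IOEVENTFD_ENTRY = struct.Struct("<QQIIIIQ")


def parse_io_fds_reply(payload):
    """Return (index, entries); each entry is a dict of ioeventfd fields."""
    argsz, flags, index, count = IO_FDS_HEADER.unpack_from(payload)
    if flags != 0:
        raise ValueError("flags must be zero")
    if count == 0:
        return index, []
    # Per-entry stride, derived from argsz as the text recommends.
    stride, remainder = divmod(argsz - IO_FDS_HEADER.size, count)
    if remainder or stride < IOEVENTFD_ENTRY.size or argsz > len(payload):
        raise ValueError("inconsistent argsz/count")
    entries = []
    for i in range(count):
        base = IO_FDS_HEADER.size + i * stride
        (offset, size, fd_index, fd_type, fd_flags, _padding,
         datamatch) = IOEVENTFD_ENTRY.unpack_from(payload, base)
        entries.append({
            "offset": offset, "size": size, "fd_index": fd_index,
            "type": fd_type, "flags": fd_flags, "datamatch": datamatch,
        })
    return index, entries


# Hypothetical reply: two sub-regions of region 2 sharing file descriptor 0.
sample = IO_FDS_HEADER.pack(16 + 2 * 40, 0, 2, 2)
sample += IOEVENTFD_ENTRY.pack(0x0, 4, 0, 0, 0, 0, 0)
sample += IOEVENTFD_ENTRY.pack(0x8, 4, 0, 0, 0, 0, 0)
region_index, sub_regions = parse_io_fds_reply(sample)
```

A real client would additionally check each *offset* and *size* against the
region bounds and map *fd_index* onto the descriptors received as ancillary
data.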

Sub-Region IO FD info format (ioregionfd)
"""""""""""""""""""""""""""""""""""""""""

+-----------+--------+------+
| Name      | Offset | Size |
+===========+========+======+
| offset    | 0      | 8    |
+-----------+--------+------+
| size      | 8      | 8    |
+-----------+--------+------+
| fd_index  | 16     | 4    |
+-----------+--------+------+
| type      | 20     | 4    |
+-----------+--------+------+
| flags     | 24     | 4    |
+-----------+--------+------+
| padding   | 28     | 4    |
+-----------+--------+------+
| user_data | 32     | 8    |
+-----------+--------+------+

* *offset* is the offset of the start of the sub-region within the region
  requested ("physical address offset" for the region)
* *size* is the length of the sub-region. This may be zero if the access size is
  not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES``
  must be set in *flags* in this case
* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
  messages; it may be shared
* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD``
* *flags* is any of:

  * ``KVM_IOREGION_PIO``
  * ``KVM_IOREGION_POSTED_WRITES``

* *user_data* is an opaque value passed back to the server via a message on the
  file descriptor

For further information on the ioregionfd-specific fields, see:
https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/

(FIXME: update with final API docs.)

``VFIO_USER_DEVICE_GET_IRQ_INFO``
---------------------------------

This command message is sent by the client to the server to query for
information about device interrupt types. The VFIO IRQ info structure is
defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).

Request
^^^^^^^

+-------+--------+---------------------------+
| Name  | Offset | Size                      |
+=======+========+===========================+
| argsz | 0      | 4                         |
+-------+--------+---------------------------+
| flags | 4      | 4                         |
+-------+--------+---------------------------+
|       | +-----+--------------------------+ |
|       | | Bit | Definition               | |
|       | +=====+==========================+ |
|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
|       | +-----+--------------------------+ |
|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
|       | +-----+--------------------------+ |
|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
|       | +-----+--------------------------+ |
|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
|       | +-----+--------------------------+ |
+-------+--------+---------------------------+
| index | 8      | 4                         |
+-------+--------+---------------------------+
| count | 12     | 4                         |
+-------+--------+---------------------------+

* *argsz* is the maximum size of the reply payload (16 bytes today)
* *index* is the index of the IRQ type being queried (e.g.
  ``VFIO_PCI_MSIX_IRQ_INDEX``)
* all other fields must be zero

Reply
^^^^^

+-------+--------+---------------------------+
| Name  | Offset | Size                      |
+=======+========+===========================+
| argsz | 0      | 4                         |
+-------+--------+---------------------------+
| flags | 4      | 4                         |
+-------+--------+---------------------------+
|       | +-----+--------------------------+ |
|       | | Bit | Definition               | |
|       | +=====+==========================+ |
|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
|       | +-----+--------------------------+ |
|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
|       | +-----+--------------------------+ |
|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
|       | +-----+--------------------------+ |
|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
|       | +-----+--------------------------+ |
+-------+--------+---------------------------+
| index | 8      | 4                         |
+-------+--------+---------------------------+
| count | 12     | 4                         |
+-------+--------+---------------------------+

* *argsz* is the size required for the full reply payload (16 bytes today)
* *flags* defines IRQ attributes:

  * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
    signalling.
  * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
    and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message.
  * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
    triggered, and the client must send an ``UNMASK`` action to receive new
    interrupts.
  * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_DEVICE_SET_IRQS`` operations
    set up interrupts as a set, and new sub-indexes cannot be enabled without
    disabling the entire type.

* *index* is the index of the IRQ type being queried
* *count* describes the number of interrupts of the queried type.
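
The request and reply above share one 16-byte layout, so a client can build the
query and decode the answer with the same structure. The sketch below shows one
way to do this; little-endian fields, the helper names and the sample values
(IRQ index 2, 2048 interrupts) are assumptions made for illustration.

```python
import struct

# argsz, flags, index, count: four 32-bit fields (16 bytes).
IRQ_INFO = struct.Struct("<IIII")

# Bit positions taken from the flags table above.
IRQ_INFO_FLAGS = {
    0: "VFIO_IRQ_INFO_EVENTFD",
    1: "VFIO_IRQ_INFO_MASKABLE",
    2: "VFIO_IRQ_INFO_AUTOMASKED",
    3: "VFIO_IRQ_INFO_NORESIZE",
}


def pack_irq_info_request(index):
    """Request payload: argsz and index set, flags and count zero."""
    return IRQ_INFO.pack(IRQ_INFO.size, 0, index, 0)


def decode_irq_info(payload):
    """Decode a reply payload, naming every flag bit that is set."""
    _argsz, flags, index, count = IRQ_INFO.unpack_from(payload)
    names = [name for bit, name in IRQ_INFO_FLAGS.items() if flags & (1 << bit)]
    return {"index": index, "count": count, "flags": names}


request = pack_irq_info_request(2)
# Hypothetical reply: eventfd-capable, NORESIZE, 2048 interrupts.
sample = IRQ_INFO.pack(16, (1 << 0) | (1 << 3), 2, 2048)
irq = decode_irq_info(sample)
```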

``VFIO_USER_DEVICE_SET_IRQS``
-----------------------------

This command message is sent by the client to the server to set actions for
device interrupt types. The VFIO IRQ set structure is defined in
``<linux/vfio.h>`` (``struct vfio_irq_set``).

Request
^^^^^^^

+-------+--------+------------------------------+
| Name  | Offset | Size                         |
+=======+========+==============================+
| argsz | 0      | 4                            |
+-------+--------+------------------------------+
| flags | 4      | 4                            |
+-------+--------+------------------------------+
|       | +-----+-----------------------------+ |
|       | | Bit | Definition                  | |
|       | +=====+=============================+ |
|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
|       | +-----+-----------------------------+ |
|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
|       | +-----+-----------------------------+ |
|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
|       | +-----+-----------------------------+ |
|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
|       | +-----+-----------------------------+ |
|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
|       | +-----+-----------------------------+ |
|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
|       | +-----+-----------------------------+ |
+-------+--------+------------------------------+
| index | 8      | 4                            |
+-------+--------+------------------------------+
| start | 12     | 4                            |
+-------+--------+------------------------------+
| count | 16     | 4                            |
+-------+--------+------------------------------+
| data  | 20     | variable                     |
+-------+--------+------------------------------+

* *argsz* is the size of the VFIO IRQ set request payload, including any *data*
  field. Note there is no reply payload, so this field differs from other
  message types.
* *flags* defines the action performed on the interrupt range. The ``DATA``
  flags describe the data field sent in the message; the ``ACTION`` flags
  describe the action to be performed. The flags are mutually exclusive for
  both sets.

  * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command.
    The action is performed unconditionally.
  * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean
    bytes. The action is performed if the corresponding boolean is true.
  * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descriptors
    was sent in the message meta-data. These descriptors will be signalled when
    the action defined by the action flags occurs. On ``AF_UNIX`` sockets, the
    descriptors are sent as ``SCM_RIGHTS`` type ancillary data.
    If no file descriptors are provided, this de-assigns the specified
    previously configured interrupts.
  * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used with
    ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an interrupt,
    or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the guest masks
    the interrupt.
  * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be used
    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmask an
    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
    guest unmasks the interrupt.
  * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can be used
    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigger an
    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
    server triggers the interrupt.

* *index* is the index of the IRQ type being set up.
* *start* is the start of the sub-index being set.
* *count* describes the number of sub-indexes being set. As a special case, a
  count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables
  all interrupts of the index.
* *data* is an optional field included when the
  ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans
  that specify whether the action is to be performed on the corresponding
  index. It's used when the action is only performed on a subset of the range
  specified.

Not all interrupt types support every combination of data and action flags.
The client must know the capabilities of the device and IRQ index before it
sends a ``VFIO_USER_DEVICE_SET_IRQS`` message.

In typical operation, a specific IRQ may operate as follows:

1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along
   with an eventfd. This associates the IRQ with a particular eventfd on the
   server side.

#. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along
   with another eventfd. This associates the given eventfd with the
   mask/unmask state on the server side.

#. The server may trigger the IRQ by writing 1 to the eventfd.

#. The server may mask/unmask an IRQ which will write 1 to the corresponding
   mask/unmask eventfd, if there is one.

#. A client may trigger a device IRQ itself, by sending a
   ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``.

#. A client may mask or unmask the IRQ, by sending a
   ``VFIO_USER_DEVICE_SET_IRQS`` message with
   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``.

Reply
^^^^^

There is no payload in the reply.

.. _Read and Write Operations:

Note that all of these operations must be supported by the client and/or server,
even if the corresponding memory or device region has been shared as mappable.

The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the
peer, for both reads and writes.

``VFIO_USER_REGION_READ``
-------------------------

If a device region is not mappable, it's not directly accessible by the client
via ``mmap()`` of the underlying file descriptor. In this case, a client can
read from a device region with this message.

Request
^^^^^^^

+--------+--------+----------+
| Name   | Offset | Size     |
+========+========+==========+
| offset | 0      | 8        |
+--------+--------+----------+
| region | 8      | 4        |
+--------+--------+----------+
| count  | 12     | 4        |
+--------+--------+----------+

* *offset* into the region being accessed.
* *region* is the index of the region being accessed.
* *count* is the size of the data to be transferred.

Reply
^^^^^

+--------+--------+----------+
| Name   | Offset | Size     |
+========+========+==========+
| offset | 0      | 8        |
+--------+--------+----------+
| region | 8      | 4        |
+--------+--------+----------+
| count  | 12     | 4        |
+--------+--------+----------+
| data   | 16     | variable |
+--------+--------+----------+

* *offset* into the region accessed.
* *region* is the index of the region accessed.
* *count* is the size of the data transferred.
* *data* is the data that was read from the device region.

``VFIO_USER_REGION_WRITE``
--------------------------

If a device region is not mappable, it's not directly accessible by the client
via ``mmap()`` of the underlying file descriptor. In this case, a client can
write to a device region with this message.
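
One way a sender might respect the ``max_data_xfer_size`` limit is to split a
large region write into several requests. The sketch below packs each chunk
with the offset/region/count fields used by the region access messages,
followed by the data. Little-endian fields, the helper name and the sample
limit of 4 bytes are assumptions, and whether a device tolerates a write being
split in this way is device-specific.

```python
import struct

# offset (64-bit), region (32-bit), count (32-bit); data follows at offset 16.
REGION_ACCESS = struct.Struct("<QII")


def region_write_requests(region, offset, data, max_data_xfer_size):
    """Yield request payloads whose count never exceeds max_data_xfer_size."""
    if max_data_xfer_size <= 0:
        raise ValueError("max_data_xfer_size must be positive")
    for start in range(0, len(data), max_data_xfer_size):
        chunk = data[start:start + max_data_xfer_size]
        header = REGION_ACCESS.pack(offset + start, region, len(chunk))
        yield header + chunk


# Ten bytes written at offset 0x100 of region 0, with a 4-byte transfer limit.
payloads = list(region_write_requests(
    region=0, offset=0x100, data=bytes(10), max_data_xfer_size=4))
```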

Request
^^^^^^^

+--------+--------+----------+
| Name   | Offset | Size     |
+========+========+==========+
| offset | 0      | 8        |
+--------+--------+----------+
| region | 8      | 4        |
+--------+--------+----------+
| count  | 12     | 4        |
+--------+--------+----------+
| data   | 16     | variable |
+--------+--------+----------+

* *offset* into the region being accessed.
* *region* is the index of the region being accessed.
* *count* is the size of the data to be transferred.
* *data* is the data to write

Reply
^^^^^

+--------+--------+----------+
| Name   | Offset | Size     |
+========+========+==========+
| offset | 0      | 8        |
+--------+--------+----------+
| region | 8      | 4        |
+--------+--------+----------+
| count  | 12     | 4        |
+--------+--------+----------+

* *offset* into the region accessed.
* *region* is the index of the region accessed.
* *count* is the size of the data transferred.

``VFIO_USER_DMA_READ``
----------------------

If the client has not shared mappable memory, the server can use this message to
read from guest memory.

Request
^^^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| address | 0      | 8        |
+---------+--------+----------+
| count   | 8      | 8        |
+---------+--------+----------+

* *address* is the client DMA memory address being accessed. This address must have
  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
* *count* is the size of the data to be transferred.

Reply
^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| address | 0      | 8        |
+---------+--------+----------+
| count   | 8      | 8        |
+---------+--------+----------+
| data    | 16     | variable |
+---------+--------+----------+

* *address* is the client DMA memory address being accessed.
* *count* is the size of the data transferred.
* *data* is the data read.

``VFIO_USER_DMA_WRITE``
-----------------------

If the client has not shared mappable memory, the server can use this message to
write to guest memory.

Request
^^^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| address | 0      | 8        |
+---------+--------+----------+
| count   | 8      | 8        |
+---------+--------+----------+
| data    | 16     | variable |
+---------+--------+----------+

* *address* is the client DMA memory address being accessed. This address must have
  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
* *count* is the size of the data to be transferred.
* *data* is the data to write

Reply
^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| address | 0      | 8        |
+---------+--------+----------+
| count   | 8      | 8        |
+---------+--------+----------+

* *address* is the client DMA memory address being accessed.
* *count* is the size of the data transferred.

``VFIO_USER_DEVICE_RESET``
--------------------------

This command message is sent from the client to the server to reset the device.
Neither the request nor the reply has a payload.

``VFIO_USER_REGION_WRITE_MULTI``
--------------------------------

This message can be used to coalesce multiple device write operations
into a single message. It is only used as an optimization when the
outgoing message queue is relatively full.

Request
^^^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| wr_cnt  | 0      | 8        |
+---------+--------+----------+
| wrs     | 8      | variable |
+---------+--------+----------+

* *wr_cnt* is the number of device writes coalesced in the message
* *wrs* is an array of device writes defined below

Single Device Write Format
""""""""""""""""""""""""""

+--------+--------+----------+
| Name   | Offset | Size     |
+========+========+==========+
| offset | 0      | 8        |
+--------+--------+----------+
| region | 8      | 4        |
+--------+--------+----------+
| count  | 12     | 4        |
+--------+--------+----------+
| data   | 16     | 8        |
+--------+--------+----------+

* *offset* into the region being accessed.
* *region* is the index of the region being accessed.
* *count* is the size of the data to be transferred. This format can
  only describe writes of 8 bytes or less.
* *data* is the data to write.

Reply
^^^^^

+---------+--------+----------+
| Name    | Offset | Size     |
+=========+========+==========+
| wr_cnt  | 0      | 8        |
+---------+--------+----------+

* *wr_cnt* is the number of device writes completed.
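
Coalescing several small writes into one ``VFIO_USER_REGION_WRITE_MULTI``
request payload could be sketched as below. Each entry uses the fixed 24-byte
single device write format. Little-endian fields, zero-filling of the unused
*data* bytes, and the helper name are assumptions of this sketch rather than
requirements stated above.

```python
import struct

WR_CNT = struct.Struct("<Q")            # wr_cnt
SINGLE_WRITE = struct.Struct("<QII8s")  # offset, region, count, data


def pack_write_multi(writes):
    """Pack (offset, region, data) tuples, each with 1 to 8 bytes of data."""
    body = bytearray()
    wr_cnt = 0
    for offset, region, data in writes:
        if not 1 <= len(data) <= 8:
            raise ValueError("a coalesced write carries between 1 and 8 bytes")
        # The "8s" field is padded with zero bytes when data is shorter.
        body += SINGLE_WRITE.pack(offset, region, len(data), bytes(data))
        wr_cnt += 1
    return WR_CNT.pack(wr_cnt) + bytes(body)


# Two hypothetical writes to region 0: 2 bytes at 0x10 and 8 bytes at 0x18.
payload = pack_write_multi([(0x10, 0, b"\x01\x02"), (0x18, 0, b"\xff" * 8)])
```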


Appendices
==========

Unused VFIO ``ioctl()`` commands
--------------------------------

The following VFIO commands do not have an equivalent vfio-user command:

* ``VFIO_GET_API_VERSION``
* ``VFIO_CHECK_EXTENSION``
* ``VFIO_SET_IOMMU``
* ``VFIO_GROUP_GET_STATUS``
* ``VFIO_GROUP_SET_CONTAINER``
* ``VFIO_GROUP_UNSET_CONTAINER``
* ``VFIO_GROUP_GET_DEVICE_FD``
* ``VFIO_IOMMU_GET_INFO``

However, once support for live migration for VFIO devices is finalized some
of the above commands may have to be handled by the client in their
corresponding vfio-user form. This will be addressed in a future protocol
version.

VFIO groups and containers
^^^^^^^^^^^^^^^^^^^^^^^^^^

The current VFIO implementation includes group and container idioms that
describe how a device relates to the host IOMMU. In the vfio-user
implementation, the IOMMU is implemented in software by the client, and is not
visible to the server. The simplest approach would be for the client to put
each device into its own group and container.

Backend Program Conventions
---------------------------

vfio-user backend program conventions are based on the vhost-user ones.

* The backend program must not daemonize itself.
* No assumptions must be made as to what access the backend program has on the
  system.
* File descriptors 0, 1 and 2 must exist, must have regular
  stdin/stdout/stderr semantics, and can be redirected.
* The backend program must honor the SIGTERM signal.
* The backend program must accept the following command-line options:

  * ``--socket-path=PATH``: path to UNIX domain socket,
  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
    ``--socket-path``

* The backend program must be accompanied by a JSON file stored under
  ``/usr/share/vfio-user``.

TODO add schema similar to docs/interop/vhost-user.json.
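
The command-line conventions above could be handled along the following lines.
This is only a sketch: the program name ``vfio-user-backend`` and the socket
path are hypothetical, and requiring exactly one of the two options is an
assumption of the sketch (the conventions only state that the options are
incompatible with each other).

```python
import argparse


def parse_backend_args(argv):
    """Parse the two transport options; they are mutually exclusive."""
    parser = argparse.ArgumentParser(prog="vfio-user-backend")
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--socket-path", metavar="PATH",
                       help="path to UNIX domain socket")
    group.add_argument("--fd", metavar="FDNUM", type=int,
                       help="file descriptor for UNIX domain socket")
    return parser.parse_args(argv)


args = parse_backend_args(["--socket-path=/tmp/vfio-user.sock"])
fd_args = parse_backend_args(["--fd=3"])
```

When both options are given, ``argparse`` reports an error and exits, which
matches the stated incompatibility between ``--fd`` and ``--socket-path``.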