docs/devel/multi-process.rst

6   This is the design document for multi-process QEMU. It does not
7   necessarily reflect the status of the current implementation, which
10   the goals and general direction of this feature.
12   Please refer to the following wiki for latest details:
15 QEMU is often used as the hypervisor for virtual machines running in the
16 Oracle cloud. Since one of the advantages of cloud computing is the
17 ability to run many VMs from different tenants in the same cloud
19 potentially use the hypervisor's access privileges to access data it is
23 monolithic program that provides many features to the VMs it services.
26 attack. Separating QEMU reduces the attack surface by helping to
27 limit each component in the system to only access the resources that
35 destroyed. A second is to emulate the CPU instructions within the VM,
37 extensions. Finally, it provides IO services to the VM by emulating HW
44 host processes. Each of these processes can be given only the privileges
46 access only to the disk images it provides, and not be allowed to
49 devices beyond what the disk service was given access to.
52 have no direct interfaces to the VM. During VM execution, it would still
53 provide the user interface to hot-plug devices or live migrate the VM.
56 from the main QEMU program, which would continue to provide CPU
57 emulation; i.e., the control process would also be the CPU emulation
58 process. In a later phase, CPU emulation could be separated from the
65 begin for a couple of reasons. One is the sheer number of IO devices QEMU
67 be exploited, and, indeed, have been a source of exploits in the past.
68 Another is that the modular nature of QEMU device emulation code provides
69 interface points where the QEMU functions that perform device emulation
70 can be separated from the QEMU functions that manage the emulation of
71 guest CPU instructions. The devices emulated in the separate process are
78 Configured objects are all compiled into the QEMU binary, then objects
79 are instantiated by name when used by the guest VM. For example, the
81 instantiation code is only run when the device is included in the target
82 VM (e.g., via the QEMU command line as *-device foo*).
84 The object model is hierarchical, so device emulation code names its
86 instantiate a parent object before calling the device's instantiation
92 In order to separate the device emulation code from the CPU emulation
93 code, the device object code must run in a different process. There are
95 separately from the main QEMU process. These are examined below.
102 device drivers in the guest and vhost user device objects in QEMU, but
103 once the QEMU vhost user code has configured the vhost user application,
104 mission-mode IO is performed by the application. The vhost user
111 As mentioned above, one of the tasks of the vhost device object within
112 QEMU is to contact the vhost application and send it configuration
113 information about this device instance. As part of the configuration
114 process, the application can also be sent other file descriptors over
115 the socket, which then can be used by the vhost user application in
121 VMs are often run using HW virtualization features via the KVM kernel
122 driver. This driver allows QEMU to accelerate the emulation of guest CPU
123 instructions by running the guest in a virtual HW mode. When the guest
125 execution returns to the KVM driver so it can inform QEMU to emulate the
128 One of the events that can cause a return to QEMU is when a guest device
129 driver accesses an IO location. QEMU then dispatches the memory
130 operation to the corresponding QEMU device object. In the case of a
131 vhost user device, the memory operation would need to be sent over a
132 socket to the vhost application. This path is accelerated by the QEMU
133 virtio code by setting up an eventfd file descriptor through which the vhost
134 application can directly receive MMIO store notifications from the KVM
135 driver, instead of needing them to be sent to the QEMU process first.
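The notification semantics of an eventfd can be sketched in isolation. The following is a minimal illustration, assuming a Linux host; it does not involve KVM at all, and the function name ``kick_and_drain()`` is hypothetical. It only shows that an eventfd carries a counter and nothing else, so several signals coalesce into one wakeup.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Signal an eventfd `kicks` times, then drain it once.
 * Returns the counter value seen by the reader, or 0 on error
 * (including the case where nothing was signalled).
 */
uint64_t kick_and_drain(unsigned kicks)
{
    /* Non-blocking, so that reading an unsignalled eventfd fails fast. */
    int efd = eventfd(0, EFD_CLOEXEC | EFD_NONBLOCK);
    uint64_t one = 1, seen = 0;

    if (efd < 0) {
        return 0;
    }
    /* Producer side: each 8-byte write adds to the kernel counter. */
    for (unsigned i = 0; i < kicks; i++) {
        if (write(efd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
            close(efd);
            return 0;
        }
    }
    /* Consumer side: one read returns the sum and resets the counter. */
    if (read(efd, &seen, sizeof(seen)) != (ssize_t)sizeof(seen)) {
        seen = 0;
    }
    close(efd);
    return seen;
}
```

In the vhost case the producer is the KVM driver rather than a ``write()`` call, but the consumer sees the same thing: a count of events with no address or data attached.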
140 Another optimization used by the vhost application is the ability to
141 directly inject interrupts into the VM via the KVM driver, again,
142 bypassing the need to send the interrupt back to the QEMU process first.
143 The QEMU virtio setup code configures the KVM driver with an eventfd
144 that triggers the device interrupt in the guest when the eventfd is
145 written. This irqfd file descriptor is then passed to the vhost user
151 The vhost application is also allowed to directly access guest memory,
152 instead of needing to send the data as messages to QEMU. This is also
153 done with file descriptors sent to the vhost user application by QEMU.
154 These descriptors can be passed to ``mmap()`` by the vhost application
155 to map the guest address space into the vhost application.
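As a rough sketch of this mechanism, the example below passes a file descriptor over a UNIX domain socket with ``SCM_RIGHTS`` and maps it on both ends. It is not vhost-user protocol code: the temporary file stands in for guest RAM, both ends live in one process for brevity, and the helper names (``send_fd()``, ``recv_fd()``, ``shared_ram_roundtrip()``) are hypothetical.

```c
#define _GNU_SOURCE
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer, aligned for struct cmsghdr. */
union fd_cmsg {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
};

/* Send one file descriptor as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    memset(&u, 0, sizeof(u));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one file descriptor; returns it, or -1 on failure. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union fd_cmsg u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Returns 0 if a byte written via one mapping is visible via the other. */
int shared_ram_roundtrip(void)
{
    char path[] = "/tmp/guest-ram-XXXXXX";   /* stand-in for guest RAM */
    const size_t len = 4096;
    int sv[2], fd, rfd, ret = -1;

    fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);
    if (ftruncate(fd, len) < 0 || socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        close(fd);
        return -1;
    }
    if (send_fd(sv[0], fd) == 0 && (rfd = recv_fd(sv[1])) >= 0) {
        unsigned char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        unsigned char *b = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, rfd, 0);

        if (a != MAP_FAILED && b != MAP_FAILED) {
            a[128] = 0x5a;                       /* "QEMU" writes guest memory */
            ret = (b[128] == 0x5a) ? 0 : -1;     /* "application" observes it */
        }
        if (a != MAP_FAILED) {
            munmap(a, len);
        }
        if (b != MAP_FAILED) {
            munmap(b, len);
        }
        close(rfd);
    }
    close(sv[0]);
    close(sv[1]);
    close(fd);
    return ret;
}
```

The received descriptor is a new number in the receiving process but refers to the same open file, which is why both mappings share pages.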
157 IOMMUs introduce another level of complexity, since the address given to
158 the guest virtio device to DMA to or from is not a guest physical
160 as a listener for IOMMU mapping changes. The vhost application maintains
168 Much of the vhost model can be re-used by separated device emulation. In
169 particular, the ideas of using a socket between QEMU and the device
171 the VM via KVM, and allowing the application to ``mmap()`` the guest
175 application works and the needs of separated device emulation. The most
180 progress cannot continue until the load has been emulated. By contrast,
181 stores are asynchronous: the guest can continue after the store event
182 has been sent to the vhost application.
184 Another difference is that in the vhost user model, a single daemon can
185 support multiple QEMU instances. This is contrary to the security regime
186 desired, in which the emulation application should only be allowed to
187 access the files or devices the VM it's running on behalf of can access.
190 ``qemu-io`` is a test harness used to test changes to the QEMU block backend
191 object code (e.g., the code that implements disk images for disk driver
193 does compile the QEMU block objects into a separate binary from the main
195 emulation applications will need to include the QEMU block objects.
200 A different model based on proxy objects in the QEMU program
202 while minimizing the changes needed to the device emulation code. The
209 The remote emulation process will run the QEMU object hierarchy without
210 modification. The device emulation objects will also be based on the
211 QEMU code, because for anything but the simplest device, it would not be
212 tractable to re-implement both the object model and the many device
215 The processes will communicate with the QEMU process over UNIX domain
216 sockets. The processes can be executed either as standalone processes,
217 or be executed by QEMU. In both cases, the host backends the emulation
230 configuration might be to put all controllers of the same device class
232 the same type can be managed by a single QMP monitor.
237 The first argument to the remote emulation process will be a UNIX domain
238 socket that connects with the Proxy object. This is a required argument.
248 itself. The QMP monitor socket is specified the same as for a QEMU
255 can be monitored over the UNIX socket path */tmp/disk-mon*.
260 Each remote device emulated in a remote process on the host is
262 sub-option to this option specifies the UNIX socket that connects
263 to the remote process. An *id* sub-option is required, and it should
264 be the same id as used in the remote process.
276 QEMU is not aware of the type of the remote PCI device. It is
285 The primary channel (referred to as com in the code) is used to bootstrap
286 the remote process. It is also used to pass on device-agnostic commands
293 channel. The proxy object sets up this channel using the primary
299 QEMU has an object model based on sub-classes inherited from the
300 "object" super-class. The sub-classes that are of interest here are the
301 "device" and "bus" sub-classes whose child sub-classes make up the
304 The proxy object model will use device proxy objects to replace the
305 device emulation code within the QEMU process. These objects will live
306 in the same place in the object and bus hierarchies as the objects they
307 replace. i.e., the proxy object for an LSI SCSI controller will be a
308 sub-class of the "pci-device" class, and will have the same PCI bus
309 parent and the same SCSI bus child objects as the LSI controller object
312 It is worth noting that the same proxy object is used to mediate with
318 The Proxy device objects are initialized in the exact same manner in
321 In addition, the Proxy objects perform the following two tasks:
322 - Parse the "socket" sub-option and connect to the remote process
324 - Use the "id" sub-option to connect to the emulated device on the
330 The ``class_init()`` method of a proxy object will, in general, behave
331 similarly to the object it replaces, including setting any static
332 properties and methods needed by the proxy.
337 The ``instance_init()`` and ``realize()`` functions would only need to
343 will initialize the PCI config space in order to make a valid PCI device
344 tree within the QEMU process.
350 or ports. The QEMU device emulation code uses QEMU's memory region
352 functions that QEMU will invoke when the guest accesses the device's
353 areas of the IO address space. When a guest driver does access the
354 device, the VM will exit HW virtualization mode and return to QEMU,
355 which will then look up and execute the corresponding callback function.
357 A proxy object would need to mirror the memory region calls the actual
360 they will forward the operation to the device emulation process.
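The forwarding idea can be sketched with a toy message format. Everything below is an assumption made for illustration: the ``ProxyMsg`` layout, the command values, and the function names are hypothetical and are not the wire protocol used by QEMU. The remote side is modelled as an array of 64-bit registers, and both ends run in one process over a ``socketpair()``.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

enum { MSG_MMIO_READ = 1, MSG_MMIO_WRITE = 2 };   /* hypothetical commands */

typedef struct {
    uint32_t cmd;    /* MSG_MMIO_READ or MSG_MMIO_WRITE */
    uint32_t size;   /* access size in bytes (unused by the toy device) */
    uint64_t addr;   /* offset within the device's region */
    uint64_t val;    /* store data, or load result in a reply */
} ProxyMsg;

/* Proxy side: what a memory region callback would send. */
static int proxy_send(int sock, uint32_t cmd, uint64_t addr, uint64_t val,
                      uint32_t size)
{
    ProxyMsg m = { .cmd = cmd, .size = size, .addr = addr, .val = val };

    return write(sock, &m, sizeof(m)) == (ssize_t)sizeof(m) ? 0 : -1;
}

/* Proxy side: a load must wait for the reply before the guest resumes. */
static int proxy_recv_reply(int sock, uint64_t *val)
{
    ProxyMsg m;

    if (read(sock, &m, sizeof(m)) != (ssize_t)sizeof(m)) {
        return -1;
    }
    *val = m.val;
    return 0;
}

/* Remote side: apply one forwarded operation to a toy register file. */
static int remote_handle_one(int sock, uint64_t *regs, size_t nregs)
{
    ProxyMsg m;
    size_t idx;

    if (read(sock, &m, sizeof(m)) != (ssize_t)sizeof(m)) {
        return -1;
    }
    idx = m.addr / sizeof(uint64_t);
    if (idx >= nregs) {
        return -1;
    }
    if (m.cmd == MSG_MMIO_WRITE) {
        regs[idx] = m.val;          /* stores need no reply in this sketch */
        return 0;
    }
    if (m.cmd == MSG_MMIO_READ) {
        m.val = regs[idx];
        return write(sock, &m, sizeof(m)) == (ssize_t)sizeof(m) ? 0 : -1;
    }
    return -1;
}

/* Store a value through the proxy, load it back, and return what was read. */
uint64_t forward_demo(void)
{
    int sv[2];
    uint64_t regs[4] = { 0 };
    uint64_t val = 0;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return 0;
    }
    if (proxy_send(sv[0], MSG_MMIO_WRITE, 8, 0xabcd, 8) == 0 &&
        remote_handle_one(sv[1], regs, 4) == 0 &&
        proxy_send(sv[0], MSG_MMIO_READ, 8, 0, 8) == 0 &&
        remote_handle_one(sv[1], regs, 4) == 0) {
        proxy_recv_reply(sv[0], &val);
    }
    close(sv[0]);
    close(sv[1]);
    return val;
}
```

Note the asymmetry: the store is fire-and-forget, while the load blocks the caller until the reply arrives.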
365 PCI devices also have a configuration space that can be accessed by the
366 guest driver. Guest accesses to this space are not handled by the device
369 need to be propagated to the emulation process.
375 "pci-device-proxy" class that can serve as the parent of a PCI device
377 override the PCI parent's ``config_read()`` and ``config_write()``
378 methods with ones that forward these operations to the emulation
385 socket to receive interrupt indications from the emulation process. An
387 be injected into the guest. For example, a PCI device object may use
393 The proxy will register to save and restore any *vmstate* it needs over
394 a live migration event. The device proxy does not need to manage the
395 remote device's *vmstate*; that will be handled by the remote process
401 Generic device operations, such as DMA, will be performed by the remote
402 process proxy by sending messages to the remote process.
408 the initial messages sent to the emulation process is a guest memory
410 that the emulation process can ``mmap()`` to directly access guest
414 as RAM for the machine.
419 When the emulated system includes an IOMMU, the remote process proxy in
420 QEMU will need to create a socket for IOMMU requests from the emulation
423 unmaps, the remote process proxy will also register as a listener on the
425 within the DMA address space, an IOMMU notifier for unmaps will be added
426 to the memory region that will forward unmaps to the emulation process
427 over the IOMMU socket.
433 process. It will also have a "rid" option to the command, just as the
434 *-device* command line option does. The remote process may either be one
435 started at QEMU startup, or be one added by the "add-process" QMP
436 command described above. In either case, the remote process proxy will
437 forward the new device's JSON description to the corresponding emulation
443 The remote process proxy will also register for live migration
445 the proxy will send the remote process a secondary socket file
446 descriptor to save the remote process's device *vmstate* over. The
447 incoming byte stream length and data will be saved as the proxy's
448 *vmstate*. When the proxy is resumed on its new host, this *vmstate*
450 to the new remote process through which it receives the *vmstate* in
451 order to restore the devices there.
456 The parts of QEMU that the emulation program will need include the
457 object model; the memory emulation objects; the device emulation objects
458 of the targeted device, and any dependent devices; and, the device's
459 backends. It will also need code to set up the machine environment,
460 handle requests from the QEMU process, and route machine-level requests
461 (such as interrupts or IOMMU mappings) back to the QEMU process.
466 The process initialization sequence will follow the same sequence
467 followed by QEMU. It will first initialize the backend objects, then
468 device emulation objects. The JSON descriptions sent by the QEMU process
473 Before the device objects are created, the initial address spaces and
482 ``memory_region_allocate_system_memory()``. The file descriptors needed
483 will be supplied by the guest memory table from above. Those RAM regions
484 would then be added to the *system\_memory* memory region with
489 IO initialization will be driven by the JSON descriptions sent from the
492 and added to the *system\_memory* memory region with
493 ``memory_region_add_subregion_overlap()``. The overlap version is
499 The device emulation objects will use ``memory_region_init_io()`` to
503 In order to use ``address_space_rw()`` in the emulation process to
504 handle MMIO requests from QEMU, the PCI physical addresses must be the
505 same in the QEMU process and the device emulation process. In order to
507 to the emulation process.
512 When device emulation wants to inject an interrupt into the VM, the
513 request climbs the device's bus object hierarchy until the point where a
514 bus object knows how to signal the interrupt to the guest. The details
515 depend on the type of interrupt being raised.
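The "climb until some bus can signal it" pattern can be shown with a toy hierarchy. This is a sketch only: the ``Bus`` type and the function names are hypothetical and do not correspond to QEMU's bus or qdev structures.

```c
#include <stddef.h>

typedef struct Bus Bus;
struct Bus {
    Bus *parent;                                   /* NULL at the root */
    int (*set_irq)(Bus *bus, int irq, int level);  /* NULL if this bus cannot signal */
    int last_irq;
    int last_level;
};

/* Walk towards the root until a bus knows how to deliver the interrupt. */
int bus_raise_irq(Bus *bus, int irq, int level)
{
    for (; bus != NULL; bus = bus->parent) {
        if (bus->set_irq) {
            return bus->set_irq(bus, irq, level);
        }
    }
    return -1;   /* nobody in the hierarchy could signal it */
}

/* Toy root handler: just records the request. */
static int root_set_irq(Bus *bus, int irq, int level)
{
    bus->last_irq = irq;
    bus->last_level = level;
    return 0;
}

/* Raise irq 11 from a leaf bus; return the irq the root recorded. */
int irq_demo(void)
{
    Bus root = { .parent = NULL, .set_irq = root_set_irq, .last_irq = -1 };
    Bus bridge = { .parent = &root };
    Bus scsi = { .parent = &bridge };

    if (bus_raise_irq(&scsi, 11, 1) != 0) {
        return -1;
    }
    return root.last_level ? root.last_irq : -1;
}
```

In an emulation process, the root handler would be the natural place to turn the request into a message to QEMU instead of recording it locally.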
519 On x86 systems, there is an emulated IOAPIC object attached to the root
520 PCI bus object, and the root PCI object forwards interrupt requests to
521 it. The IOAPIC object, in turn, calls the KVM driver to inject the
522 corresponding interrupt into the VM. The simplest way to handle this in
523 an emulation process would be to set up the root PCI bus driver (via
524 ``pci_bus_irqs()``) to send an interrupt request back to the QEMU
525 process, and have the device proxy object reflect it up the PCI tree
532 these DMA writes, then calls into the KVM driver to inject the interrupt
533 into the VM. A simple emulation process implementation would be to send
534 the MSI DMA address from QEMU as a message at initialization, then
535 install an address space handler at that address which forwards the MSI
542 first must use ``dma_memory_map()`` to convert the DMA address to a local
543 virtual address. The emulation process memory region objects set up above
544 will be used to translate the DMA address to a local virtual address the
551 regions to translate the DMA address to a guest physical address before
552 that physical address can be translated to a local virtual address. The
557 The emulation process will maintain a cache of recent IOMMU translations
558 (the IOTLB). When the ``translate()`` callback of an IOMMU memory region is
559 invoked, the IOTLB cache will be searched for an entry that will map the
561 to QEMU requesting the corresponding translation entry, which will both be
562 used to return a guest address and be added to the cache.
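A translation cache of this kind might look like the sketch below. The structure, its size, the 4 KiB page assumption, and the round-robin replacement are all illustrative choices, not taken from QEMU. The ``ask_qemu`` callback stands in for the message round trip to QEMU on a miss, and ``iotlb_unmap()`` shows how a forwarded unmap would invalidate entries.

```c
#include <stdbool.h>
#include <stdint.h>

#define IOTLB_ENTRIES 8
#define IOTLB_PAGE_SHIFT 12                       /* assumes 4 KiB pages */
#define IOTLB_PAGE_MASK ((1ULL << IOTLB_PAGE_SHIFT) - 1)

typedef struct {
    bool valid;
    uint64_t iova_page;   /* DMA (IO virtual) page number */
    uint64_t gpa_page;    /* guest physical page number */
} IotlbEntry;

typedef struct {
    IotlbEntry e[IOTLB_ENTRIES];
    unsigned next;        /* round-robin replacement cursor */
    unsigned misses;      /* number of requests that went to QEMU */
} Iotlb;

/* Stand-in for the "ask QEMU for a translation" message exchange. */
typedef uint64_t (*IotlbMissFn)(uint64_t iova_page);

uint64_t iotlb_translate(Iotlb *t, uint64_t iova, IotlbMissFn ask_qemu)
{
    uint64_t page = iova >> IOTLB_PAGE_SHIFT;
    IotlbEntry *slot;

    for (unsigned i = 0; i < IOTLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].iova_page == page) {
            return (t->e[i].gpa_page << IOTLB_PAGE_SHIFT) |
                   (iova & IOTLB_PAGE_MASK);
        }
    }
    /* Miss: fetch the entry, use it for this request, and cache it. */
    t->misses++;
    slot = &t->e[t->next++ % IOTLB_ENTRIES];
    slot->valid = true;
    slot->iova_page = page;
    slot->gpa_page = ask_qemu(page);
    return (slot->gpa_page << IOTLB_PAGE_SHIFT) | (iova & IOTLB_PAGE_MASK);
}

/* Drop every cached entry that overlaps [iova, iova + len). */
void iotlb_unmap(Iotlb *t, uint64_t iova, uint64_t len)
{
    uint64_t first, last;

    if (len == 0) {
        return;
    }
    first = iova >> IOTLB_PAGE_SHIFT;
    last = (iova + len - 1) >> IOTLB_PAGE_SHIFT;
    for (unsigned i = 0; i < IOTLB_ENTRIES; i++) {
        if (t->e[i].valid && t->e[i].iova_page >= first &&
            t->e[i].iova_page <= last) {
            t->e[i].valid = false;
        }
    }
}

/* Toy mapping used by the demo: guest page = DMA page + 0x100. */
static uint64_t demo_miss(uint64_t iova_page)
{
    return iova_page + 0x100;
}

/* Translate one address using a fresh cache. */
uint64_t iotlb_demo_translate(uint64_t iova)
{
    Iotlb t = { 0 };

    return iotlb_translate(&t, iova, demo_miss);
}

/* Two hits on one page, an unmap, then a re-fetch: returns the miss count. */
unsigned iotlb_demo_misses(void)
{
    Iotlb t = { 0 };

    iotlb_translate(&t, 0x5123, demo_miss);   /* miss */
    iotlb_translate(&t, 0x5fff, demo_miss);   /* hit, same page */
    iotlb_unmap(&t, 0x5000, 0x1000);
    iotlb_translate(&t, 0x5000, demo_miss);   /* miss again after unmap */
    return t.misses;
}
```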
566 The IOMMU emulation will also need to act on unmap requests from QEMU.
567 These happen when the guest IOMMU driver purges an entry from the
574 will set up a channel using the received file descriptor with
577 the process's device state back to QEMU. This method will be reversed on
578 restore - the channel will be passed to ``qemu_loadvm_state()`` to
579 restore the device state.
584 The messages that are required to be sent between QEMU and the emulation
585 process can add considerable latency to IO operations. The optimizations
586 described below attempt to ameliorate this effect by allowing the
587 emulation process to communicate directly with the kernel KVM driver.
588 The KVM file descriptors created would be passed to the emulation process
589 via initialization messages, much as the guest memory table is.
593 from KVM. The issue with the eventfd mechanism used by vhost user is
594 that it does not pass any data with the event indication, so it cannot
598 The expanded idea would require a new type of KVM device:
601 that the emulation process can use to receive MMIO notifications. QEMU
602 would create both descriptors using the KVM driver, and pass the slave
603 descriptor to the emulation process via an initialization message.
610 The guest physical range structure describes the address range that a
611 device will respond to. It includes the base and length of the range, as
612 well as which bus the range resides on (e.g., on an x86 machine, it can
613 specify whether the range refers to memory or IO addresses).
616 a PCI device can have multiple BARs), so the structure will also include
617 an enumerated identifier to specify which of the device's ranges is
635 physical range the MMIO was within, the offset within that range, the
637 includes a sequence number that can be used to reply to the MMIO, and
638 the CPU that issued the MMIO.
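One possible shape for these two structures is sketched below. The type and field names are hypothetical; the ``type``, ``len`` and ``data`` fields of the request are assumptions added so that the example is usable, since the text here only names the range, offset, sequence number and CPU. The helper shows how a guest physical address would be matched against a range to fill in the request.

```c
#include <stdint.h>

enum user_bus { USER_BUS_MEM, USER_BUS_IO };         /* hypothetical */
enum user_mmio_type { USER_MMIO_LOAD, USER_MMIO_STORE };

/* Guest physical range a device responds to. */
struct user_pa_range {
    uint64_t base;   /* first guest physical address of the range */
    uint64_t len;    /* length of the range in bytes */
    uint32_t bus;    /* enum user_bus: memory or IO address space */
    uint32_t id;     /* which of the device's ranges this is (e.g. a BAR) */
};

/* One MMIO operation sent to the emulation program. */
struct user_mmio_request {
    uint32_t range_id;  /* range the MMIO was within */
    uint32_t type;      /* assumed: enum user_mmio_type */
    uint64_t offset;    /* offset within that range */
    uint32_t len;       /* assumed: access size in bytes */
    uint32_t cpu;       /* CPU that issued the MMIO */
    uint64_t data;      /* assumed: store data */
    uint64_t seq;       /* sequence number used to match the reply */
};

/*
 * If gpa on the given bus falls inside the range, fill in range_id and
 * offset and return 1; otherwise return 0 and leave req untouched.
 */
int user_range_match(const struct user_pa_range *r, uint32_t bus,
                     uint64_t gpa, struct user_mmio_request *req)
{
    if (bus != r->bus || gpa < r->base || gpa - r->base >= r->len) {
        return 0;
    }
    req->range_id = r->id;
    req->offset = gpa - r->base;
    return 1;
}

/* Offset of gpa within a sample 4 KiB memory range, or UINT64_MAX. */
uint64_t range_demo_offset(uint64_t gpa)
{
    const struct user_pa_range bar0 = {
        .base = 0xfe000000, .len = 0x1000, .bus = USER_BUS_MEM, .id = 0,
    };
    struct user_mmio_request req = { 0 };

    return user_range_match(&bar0, USER_BUS_MEM, gpa, &req) ? req.offset
                                                            : UINT64_MAX;
}
```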
659 are two queues: the pending queue is for MMIOs that haven't been read by the
660 emulation program, and the sent queue is for MMIOs that haven't been
661 acknowledged. The main use of the second queue is to validate MMIO
662 replies from the emulation program.
666 Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
668 threads may be waiting for MMIO replies. The scoreboard would contain a
669 wait queue and sequence number for the per-CPU threads, allowing them to
670 be individually woken when the MMIO reply is received from the emulation
671 program. It also tracks the number of posted MMIO stores to the device
672 that haven't been replied to, in order to satisfy the PCI constraint
679 completed without sending an MMIO request to the emulation program if the
680 emulation program shares a shadow image of the device's memory image
681 with the KVM driver.
683 The emulation program will ask the KVM driver to allocate memory for the
684 shadow image, and will then use ``mmap()`` to directly access it. The
685 emulation program can control KVM access to the shadow image by sending
686 KVM an access map telling it which areas of the image have no
688 MMIO request to the emulation program. The access map can also inform
689 the KVM driver which access sizes are allowed for the image.
694 The master descriptor is used by QEMU to configure the new KVM device.
695 The descriptor would be returned by the KVM driver when QEMU issues a
701 The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
702 ``kvm_register_device_ops()`` call when the KVM system is initialized by
703 ``kvm_init()``. These device ops are called by the KVM driver when QEMU
711 initialize a KVM user device specific data structure, and assign the
716 This routine is invoked when QEMU issues an ``ioctl()`` on the master
717 descriptor. The ``ioctl()`` commands supported are defined by the KVM
720 *KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
721 be passed to the device emulation program. Only one slave can be created
722 by each master descriptor. The file operations performed by this
725 The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
726 address range that the slave descriptor will receive MMIO notifications
727 for. The range is specified by a guest physical range structure
729 command can be executed while the guest is running, such as the case
733 register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
734 performs an MMIO operation within the range. When a range is changed,
735 ``kvm_io_bus_unregister_dev()`` is used to remove the previous
739 how long KVM will wait for the emulation process to respond to an MMIO
744 This routine is called when the VM instance is destroyed. It will need
745 to destroy the slave descriptor and free any memory allocated by the
746 driver, as well as the *kvm\_device* structure itself.
751 The slave descriptor will have its own file operations vector, which
752 responds to system calls on the descriptor performed by the device
757 A read returns any pending MMIO requests from the KVM driver as MMIO
759 multiple MMIO operations pending. The MMIO requests are moved from the
760 pending queue to the sent queue, and if there are threads waiting for
761 space in the pending queue to add new MMIO operations, they will be woken
767 the MMIO requests in the sent queue. Matches are removed from the sent
768 queue, and any threads waiting for the reply are woken. If a store is
769 removed, then the number of posted stores in the per-CPU scoreboard is
770 decremented. When the number is zero, and a non side-effect load was
771 waiting for posted stores to complete, the load is continued.
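The bookkeeping behind these read and write operations can be sketched in user space. This is a simplified model under stated assumptions: the types and names are hypothetical, the queues are small fixed arrays, there is a single posted-store counter rather than one per CPU, and sleeping and waking of threads is left out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QLEN 4

typedef struct { uint64_t seq; bool is_store; } Req;
typedef struct { Req r[QLEN]; size_t n; } Queue;

typedef struct {
    Queue pending;           /* not yet read by the emulation program */
    Queue sent;              /* read, but not yet acknowledged */
    unsigned posted_stores;  /* stores with no reply yet */
    uint64_t next_seq;
} Dev;

static bool q_push(Queue *q, Req r)
{
    if (q->n == QLEN) {
        return false;
    }
    q->r[q->n++] = r;
    return true;
}

static void q_remove(Queue *q, size_t i)
{
    memmove(&q->r[i], &q->r[i + 1], (q->n - i - 1) * sizeof(Req));
    q->n--;
}

/* Guest MMIO arrives. Returns its sequence number, or 0 if pending is full. */
uint64_t dev_post(Dev *d, bool is_store)
{
    Req r = { .seq = d->next_seq + 1, .is_store = is_store };

    if (!q_push(&d->pending, r)) {
        return 0;            /* a real driver would sleep here */
    }
    d->next_seq = r.seq;
    if (is_store) {
        d->posted_stores++;
    }
    return r.seq;
}

/* read(): hand requests to the emulation program, moving them to sent. */
size_t dev_read(Dev *d, Req *out, size_t max)
{
    size_t n = 0;

    while (n < max && d->pending.n > 0 && d->sent.n < QLEN) {
        out[n] = d->pending.r[0];
        q_push(&d->sent, out[n]);
        q_remove(&d->pending, 0);
        n++;
    }
    return n;
}

/* write(): accept a reply only if it matches a request in the sent queue. */
bool dev_reply(Dev *d, uint64_t seq)
{
    for (size_t i = 0; i < d->sent.n; i++) {
        if (d->sent.r[i].seq == seq) {
            if (d->sent.r[i].is_store) {
                d->posted_stores--;
            }
            q_remove(&d->sent, i);
            return true;
        }
    }
    return false;            /* unknown sequence number: reject */
}

/* Exercise the flow; returns the number of unexpected results (0 is good). */
int queue_demo(void)
{
    Dev d = { 0 };
    Req batch[QLEN];
    int errors = 0;
    uint64_t st = dev_post(&d, true);    /* posted store */
    uint64_t ld = dev_post(&d, false);   /* load */

    errors += d.posted_stores != 1;
    errors += dev_read(&d, batch, QLEN) != 2;
    errors += dev_reply(&d, 99);         /* bogus reply must be rejected */
    errors += !dev_reply(&d, st);
    errors += d.posted_stores != 0;
    errors += !dev_reply(&d, ld);
    errors += d.sent.n != 0;
    return errors;
}
```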
775 There are several ``ioctl()``\ s that can be performed on the slave
778 A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
779 allocate memory for the shadow image. This memory can later be
780 ``mmap()``\ ed by the emulation process to share the emulation's view of
781 device memory with the KVM driver.
783 A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
784 shadow image. It will send the KVM driver a shadow control map, which
785 specifies which areas of the image can complete guest loads without
786 sending the load request to the emulation program. It will also specify
787 the size of load operations that are allowed.
791 An emulation program will use the ``poll()`` call with a *POLLIN* flag
793 return if the pending MMIO request queue is not empty.
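From the emulation program's side, waiting for work would be an ordinary ``poll()`` loop. The sketch below uses a pipe as a stand-in for the slave descriptor, since the KVM device described here is a proposal; the function names are hypothetical.

```c
#define _GNU_SOURCE
#include <poll.h>
#include <unistd.h>

/* Returns 1 if fd is readable within timeout_ms, 0 on timeout, -1 on error. */
int wait_for_request(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLIN };
    int rc = poll(&p, 1, timeout_ms);

    if (rc < 0) {
        return -1;
    }
    return (rc > 0 && (p.revents & POLLIN)) ? 1 : 0;
}

/*
 * Demo with a pipe in place of the slave descriptor: nothing is readable
 * until a "request" is queued. Returns 1 if both observations match.
 */
int poll_demo(void)
{
    int pfd[2];
    char c = 'x';
    int before, after;

    if (pipe(pfd) < 0) {
        return -1;
    }
    before = wait_for_request(pfd[0], 0);      /* queue empty */
    if (write(pfd[1], &c, 1) != 1) {
        close(pfd[0]);
        close(pfd[1]);
        return -1;
    }
    after = wait_for_request(pfd[0], 0);       /* one request pending */
    close(pfd[0]);
    close(pfd[1]);
    return before == 0 && after == 1;
}
```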
797 This call allows the emulation program to directly access the shadow
798 image allocated by the KVM driver. As device emulation updates device
799 memory, changes with no side-effects will be reflected in the shadow,
800 and the KVM driver can satisfy guest loads from the shadow image without
801 needing to wait for the emulation program.
806 Each KVM per-CPU thread can handle MMIO operations on behalf of the guest
807 VM. KVM will use the MMIO's guest physical address to search for a
808 matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
809 driver instead of exiting back to QEMU. If a match is found, the
814 This callback is invoked when the guest performs a load to the device.
815 Loads with side-effects must be handled synchronously, with the KVM
816 driver putting the QEMU thread to sleep waiting for the emulation
817 process reply before re-starting the guest. Loads that do not have
818 side-effects may be optimized by satisfying them from the shadow image,
819 if there are no outstanding stores to the device by this CPU. PCI memory
821 the same device have been completed.
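The decision of whether a load may be satisfied from the shadow image can be sketched as a pure function. The layout below is an assumption for illustration only: a per-byte side-effect map, a bitmask of allowed access sizes, a little-endian register layout, and a single posted-store counter. None of these names or encodings come from an existing KVM interface.

```c
#include <stdbool.h>
#include <stdint.h>

#define SHADOW_SIZE 64

typedef struct {
    uint8_t image[SHADOW_SIZE];           /* shadow copy of device memory */
    uint8_t no_side_effect[SHADOW_SIZE];  /* 1 = byte may be read from shadow */
    uint8_t allowed_sizes;                /* OR of permitted sizes: 1, 2, 4, 8 */
    unsigned posted_stores;               /* stores not yet acknowledged */
} Shadow;

/*
 * Try to satisfy a load from the shadow image. Returns false when the load
 * must instead be sent to the emulation program.
 */
bool shadow_load(const Shadow *s, uint64_t off, unsigned size, uint64_t *val)
{
    uint64_t v = 0;

    /* Size must be a permitted power of two. */
    if (size == 0 || size > 8 || (size & (size - 1)) ||
        !(s->allowed_sizes & size)) {
        return false;
    }
    if (off > SHADOW_SIZE - size) {
        return false;
    }
    /* Ordering: a load may not pass earlier stores to the device. */
    if (s->posted_stores != 0) {
        return false;
    }
    for (unsigned i = 0; i < size; i++) {
        if (!s->no_side_effect[off + i]) {
            return false;
        }
        v |= (uint64_t)s->image[off + i] << (8 * i);   /* little-endian */
    }
    *val = v;
    return true;
}

/* Exercise the checks; returns the number of unexpected results. */
int shadow_demo(void)
{
    Shadow s = { .allowed_sizes = 4 };
    const uint8_t reg[4] = { 0x78, 0x56, 0x34, 0x12 };
    uint64_t v = 0;
    int errors = 0;

    for (unsigned i = 0; i < 4; i++) {
        s.image[8 + i] = reg[i];
        s.no_side_effect[8 + i] = 1;
    }
    errors += !shadow_load(&s, 8, 4, &v) || v != 0x12345678;
    errors += shadow_load(&s, 8, 2, &v);     /* size not allowed */
    errors += shadow_load(&s, 12, 4, &v);    /* area has side effects */
    s.posted_stores = 1;
    errors += shadow_load(&s, 8, 4, &v);     /* must wait for the store */
    return errors;
}
```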
825 Stores can be handled asynchronously unless the pending MMIO request
826 queue is full. In this case, the QEMU thread must sleep waiting for
827 space in the queue. Stores will increment the number of posted stores in
828 the per-CPU scoreboard, in order to implement the PCI ordering
835 application does, where the QEMU process sets up *eventfds* that cause
836 the device's corresponding interrupt to be triggered by the KVM driver.
837 These irq file descriptors are sent to the emulation process at
838 initialization, and are used when the emulation code raises a device
846 the emulation program. This second file descriptor allows multiple
847 devices sharing an irq to be notified when the interrupt has been
848 acknowledged by the guest, so they can re-trigger the interrupt if their
854 The irq descriptors are created by the proxy object
855 using ``event_notifier_init()`` to create the irq and re-sampling
857 The interrupt route can be found with
863 INTx routing can be changed when the guest programs the APIC the device
864 pin is connected to. The proxy object in QEMU will use
866 changes to the route. This handler will broadly follow the VFIO
867 interrupt logic to change the route: de-assigning the existing irq
868 descriptor from its route, then assigning it the new route. (see
874 MSI/X interrupts are sent as DMA transactions to the host. The interrupt
875 data contains a vector that is programmed by the guest. A device may have
877 may need to be sent to the emulation program.
882 This case will also follow the VFIO example. For each MSI/X interrupt,
884 ``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
885 the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.
890 The guest may dynamically update several MSI-related tables in the
893 config space. Much like the BAR case above, the proxy object must look
894 at guest config space programming to keep the MSI interrupt state
895 consistent between QEMU and the emulation program.
903 separate a process to handle CPU instruction emulation from the main
905 code, so the first task would be to create one.
910 Separating QEMU relies on the host OS's access restriction mechanisms to
911 enforce that the differing processes can only access the objects they
921 one for the same user ID, the second for user IDs with the same group
922 ID, and the third for all other user IDs. Each device instance would
929 Mandatory access control allows the OS to add an additional set of
930 controls on top of the discretionary access controls. It also
941 advantage of type enforcement by running the emulation processes with
942 different types, both from the main QEMU process, and from the emulation
946 types separate from the main QEMU process and non-disk emulation
947 processes, and the type rules could prevent processes other than disk
949 emulation processes can have a type separate from the main QEMU process
950 and non-network emulation processes, and only that type can access the
957 the process or file. The process is granted access to the file if the
958 process's set is a superset of the file's set. This enforcement can be
959 used to separate multiple instances of devices in the same class.
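The superset rule is easy to state as a bit operation. The sketch below models a category set as a 64-bit mask, one bit per category; the type and function names are hypothetical, and a real MAC implementation is not limited to 64 categories.

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t CategorySet;   /* one bit per category (illustrative) */

/* Access is granted only if the process holds every category on the file. */
bool category_access_allowed(CategorySet process, CategorySet file)
{
    return (process & file) == file;
}
```

For example, if two disk emulation processes are given categories ``0x1`` and ``0x2`` respectively, neither can open a disk image labelled with the other's category, even though both run with the same type.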
963 category. The different device emulation processes would not be able to
966 Alternatively, categories could be used in lieu of the type enforcement