docs/devel/multi-process.rst

12   Please refer to the following wiki for latest details:
17 ability to run many VMs from different tenants in the same cloud
19 potentially use the hypervisor's access privileges to access data it is
22 QEMU can be susceptible to security attacks because it is a large,
23 monolithic program that provides many features to the VMs it services.
26 attack. Separating QEMU reduces the attack surface by helping to
27 limit each component in the system to only access the resources that
28 it needs to perform its job.
35 destroyed. A second is to emulate the CPU instructions within the VM,
37 extensions. Finally, it provides IO services to the VM by emulating HW
45 it needs to provide its service, e.g., a disk service could be given
46 access only to the disk images it provides, and not be allowed to
48 this service would not be able to use this exploit to access files or
49 devices beyond what the disk service was given access to.
52 have no direct interfaces to the VM. During VM execution, it would still
53 provide the user interface to hot-plug devices or live migrate the VM.
55 A first step in creating a multi-process QEMU is to separate IO services
56 from the main QEMU program, which would continue to provide CPU
64 Separating IO services into individual host processes is a good place to
72 referred to as remote devices.
80 code to emulate a device named "foo" is always present in QEMU, but its
92 In order to separate the device emulation code from the CPU emulation
100 Virtio guest device drivers can be connected to vhost user applications
101 in order to perform their IO operations. This model uses special virtio
112 QEMU is to contact the vhost application and send it configuration
122 driver. This driver allows QEMU to accelerate the emulation of guest CPU
125 execution returns to the KVM driver so it can inform QEMU to emulate the
128 One of the events that can cause a return to QEMU is when a guest device
130 operation to the corresponding QEMU device object. In the case of a
131 vhost user device, the memory operation would need to be sent over a
132 socket to the vhost application. This path is accelerated by the QEMU
135 driver, instead of needing them to be sent to the QEMU process first.
140 Another optimization used by the vhost application is the ability to
142 bypassing the need to send the interrupt back to the QEMU process first.
145 written. This irqfd file descriptor is then passed to the vhost user
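The counter semantics that an irqfd relies on can be sketched with a plain Linux ``eventfd``. This is a minimal illustration only: the helper names ``irqfd_signal()`` and ``irqfd_drain()`` are hypothetical, and binding the descriptor to a guest interrupt through KVM is not shown.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: signal the descriptor once. Each 8-byte write adds
 * its value to the eventfd counter; a consumer waiting on the descriptor
 * is notified when the counter becomes non-zero.
 */
int irqfd_signal(int fd)
{
    uint64_t one = 1;

    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Hypothetical helper: read and reset the counter. Returns the number of
 * signals accumulated since the last drain, or -1 if the read failed
 * (for a non-blocking descriptor, this includes "nothing pending").
 */
int64_t irqfd_drain(int fd)
{
    uint64_t count;

    if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }
    return (int64_t)count;
}
```

A caller would create the descriptor with ``eventfd(0, EFD_NONBLOCK)``, hand it to the process that raises interrupts, and keep the consuming side in whichever component delivers them.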
148 vhost access to guest memory
151 The vhost application is also allowed to directly access guest memory,
152 instead of needing to send the data as messages to QEMU. This is also
153 done with file descriptors sent to the vhost user application by QEMU.
154 These descriptors can be passed to ``mmap()`` by the vhost application
155 to map the guest address space into the vhost application.
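The descriptor hand-off described above can be sketched with ``SCM_RIGHTS`` ancillary data on a Unix socket. This is an illustration under stated assumptions, not QEMU's or vhost-user's actual code: the temporary file stands in for guest RAM, both ends of the socket live in one process, and all function names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Control-message buffer sized and aligned for exactly one descriptor. */
typedef union {
    char buf[CMSG_SPACE(sizeof(int))];
    struct cmsghdr align;
} FdCmsg;

/* Send one descriptor over a Unix socket as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 0;                    /* one data byte carries the cmsg */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    FdCmsg u;
    struct msghdr msg;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one descriptor; returns the new descriptor or -1. */
static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    FdCmsg u;
    struct msghdr msg;
    int fd;

    memset(&u, 0, sizeof(u));
    memset(&msg, 0, sizeof(msg));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    if (recvmsg(sock, &msg, 0) != 1) {
        return -1;
    }
    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
        cmsg->cmsg_type != SCM_RIGHTS) {
        return -1;
    }
    memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}

/* Returns 0 if a store through one mapping is visible through the other. */
int share_memory_demo(void)
{
    const size_t size = 4096;
    int sv[2], rfd, ok = 0;
    FILE *backing = tmpfile();        /* stand-in for guest RAM backing */

    if (!backing || ftruncate(fileno(backing), (off_t)size) < 0 ||
        socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        return -1;
    }
    char *owner = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fileno(backing), 0);
    if (owner == MAP_FAILED || send_fd(sv[0], fileno(backing)) < 0 ||
        (rfd = recv_fd(sv[1])) < 0) {
        return -1;
    }
    char *peer = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, rfd, 0);
    if (peer != MAP_FAILED) {
        strcpy(owner, "guest data");  /* written via the sender's mapping */
        ok = strcmp(peer, "guest data") == 0;
        munmap(peer, size);
    }
    munmap(owner, size);
    close(rfd);
    close(sv[0]);
    close(sv[1]);
    fclose(backing);
    return ok ? 0 : -1;
}
```

In a real deployment the sender and receiver would be separate processes, and the message carrying the descriptor would also describe the guest address range that the mapping covers.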
157 IOMMUs introduce another level of complexity, since the address given to
158 the guest virtio device to DMA to or from is not a guest physical
161 a cache of IOMMU translations: sending translation requests back to
165 applicability to device separation
170 emulation application, using a file descriptor to inject interrupts into
171 the VM via KVM, and allowing the application to ``mmap()`` the guest
182 has been sent to the vhost application.
185 support multiple QEMU instances. This is contrary to the security regime
186 desired, in which the emulation application should only be allowed to
190 ``qemu-io`` is a test harness used to test changes to the QEMU block backend
195 emulation applications will need to include the QEMU block objects.
202 while minimizing the changes needed to the device emulation code. The
212 tractable to re-implement both the object model and the many device
230 configuration might be to put all controllers of the same device class
237 The first argument to the remote emulation process will be a Unix domain
247 Remote emulation processes can be monitored via QMP, similar to QEMU
262 sub-option to this option specifies the Unix socket that connects
263 to the remote process. An *id* sub-option is required, and it should
270 can be used to add a device emulated in a remote process
285 The primary channel (referred to as com in the code) is used to bootstrap
286 the remote process. It is also used to pass on device-agnostic commands
304 The proxy object model will use device proxy objects to replace the
312 It is worth noting that the same proxy object is used to mediate with
322 - Parses the "socket" sub option and connects to the remote process
324 - Uses the "id" sub-option to connect to the emulated device on the
331 similarly to the object it replaces, including setting any static
337 The ``instance_init()`` and ``realize()`` functions would only need to
338 perform tasks related to being a proxy, such as registering its own
340 attached to later.
343 will initialize the PCI config space in order to make a valid PCI device
349 Most devices are driven by guest device driver accesses to IO addresses
351 function calls (such as ``memory_region_init_io()``) to add callback
354 device, the VM will exit HW virtualization mode and return to QEMU,
357 A proxy object would need to mirror the memory region calls the actual
360 they will forward the operation to the device emulation process.
366 guest driver. Guest accesses to this space are not handled by the device
369 need to be propagated to the emulation process.
374 One way to propagate guest PCI config accesses is to create a
378 methods with ones that forward these operations to the emulation
384 A proxy for a device that generates interrupts will need to create a
385 socket to receive interrupt indications from the emulation process. An
386 incoming interrupt indication would then be sent up to its bus parent to
393 The proxy will register to save and restore any *vmstate* it needs over
394 a live migration event. The device proxy does not need to manage the
402 process proxy by sending messages to the remote process.
408 the initial messages sent to the emulation process is a guest memory
410 that the emulation process can ``mmap()`` to directly access guest
411 memory, similar to ``vhost_user_set_mem_table()``. Note guest memory
420 QEMU will need to create a socket for IOMMU requests from the emulation
422 ``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
426 to the memory region that will forward unmaps to the emulation process
433 process. It will also have a "rid" option to the command, just as the
437 forward the new device's JSON description to the corresponding emulation
444 notifications with ``vmstate_register()``. When called to save state,
446 descriptor to save the remote process's device *vmstate* over. The
450 to the new remote process through which it receives the *vmstate* in
451 order to restore the devices there.
459 backends. It will also need code to set up the machine environment,
461 (such as interrupts or IOMMU mappings) back to the QEMU process.
469 will drive which objects need to be created.
484 would then be added to the *system\_memory* memory region with
490 QEMU process. For a PCI device, a PCI bus will need to be created with
491 ``pci_root_bus_new()``, and a PCI memory region will need to be created
492 and added to the *system\_memory* memory region with
499 The device emulation objects will use ``memory_region_init_io()`` to
500 install their MMIO handlers, and ``pci_register_bar()`` to associate
503 In order to use ``address_space_rw()`` in the emulation process to
505 same in the QEMU process and the device emulation process. In order to
507 to the emulation process.
512 When device emulation wants to inject an interrupt into the VM, the
514 bus object knows how to signal the interrupt to the guest. The details
519 On x86 systems, there is an emulated IOAPIC object attached to the root
520 PCI bus object, and the root PCI object forwards interrupt requests to
521 it. The IOAPIC object, in turn, calls the KVM driver to inject the
522 corresponding interrupt into the VM. The simplest way to handle this in
523 an emulation process would be to set up the root PCI bus driver (via
524 ``pci_bus_irqs()``) to send an interrupt request back to the QEMU
530 PCI MSI/X interrupts are implemented in HW as DMA writes to a
532 these DMA writes, then calls into the KVM driver to inject the interrupt
533 into the VM. A simple emulation process implementation would be to send
536 message back to QEMU.
541 When an emulation object wants to DMA into or out of guest memory, it
542 first must use ``dma_memory_map()`` to convert the DMA address to a local
544 will be used to translate the DMA address to a local virtual address the
551 regions to translate the DMA address to a guest physical address before
552 that physical address can be translated to a local virtual address. The
560 DMA address to a guest PA. On a cache miss, a message will be sent back
561 to QEMU requesting the corresponding translation entry, which will both be
562 used to return a guest address and be added to the cache.
566 The IOMMU emulation will also need to act on unmap requests from QEMU.
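The lookup, miss, and unmap behaviour described here can be sketched as a small direct-mapped cache. This is a minimal illustration with hypothetical names throughout: the ``IotlbMissFn`` callback stands in for the message round trip to QEMU, the 4 KiB page size and slot count are arbitrary choices, and ``demo_miss()`` is a stand-in translation used only to exercise the cache.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define IOTLB_SLOTS      64
#define IOTLB_PAGE_SHIFT 12
#define IOTLB_PAGE_MASK  ((UINT64_C(1) << IOTLB_PAGE_SHIFT) - 1)

typedef struct {
    bool valid;
    uint64_t iova_page;   /* DMA (IO virtual) page number */
    uint64_t gpa_page;    /* guest physical page number */
} IotlbEntry;

/* Hypothetical stand-in for asking QEMU to translate one page. */
typedef bool (*IotlbMissFn)(uint64_t iova_page, uint64_t *gpa_page,
                            void *opaque);

typedef struct {
    IotlbEntry slots[IOTLB_SLOTS];
    IotlbMissFn miss;
    void *opaque;
    unsigned misses;      /* number of round trips taken so far */
} IotlbCache;

void iotlb_init(IotlbCache *c, IotlbMissFn miss, void *opaque)
{
    memset(c, 0, sizeof(*c));
    c->miss = miss;
    c->opaque = opaque;
}

/* Translate a DMA address to a guest PA, filling the cache on a miss. */
bool iotlb_translate(IotlbCache *c, uint64_t iova, uint64_t *gpa)
{
    uint64_t page = iova >> IOTLB_PAGE_SHIFT;
    IotlbEntry *e = &c->slots[page % IOTLB_SLOTS];

    if (!e->valid || e->iova_page != page) {
        uint64_t gpa_page;

        c->misses++;
        if (!c->miss(page, &gpa_page, c->opaque)) {
            return false;             /* no mapping for this address */
        }
        e->valid = true;
        e->iova_page = page;
        e->gpa_page = gpa_page;
    }
    *gpa = (e->gpa_page << IOTLB_PAGE_SHIFT) | (iova & IOTLB_PAGE_MASK);
    return true;
}

/* Act on an unmap request: drop cached entries inside [iova, iova + len). */
void iotlb_unmap(IotlbCache *c, uint64_t iova, uint64_t len)
{
    if (len == 0) {
        return;
    }
    uint64_t first = iova >> IOTLB_PAGE_SHIFT;
    uint64_t last = (iova + len - 1) >> IOTLB_PAGE_SHIFT;

    for (int i = 0; i < IOTLB_SLOTS; i++) {
        IotlbEntry *e = &c->slots[i];

        if (e->valid && e->iova_page >= first && e->iova_page <= last) {
            e->valid = false;
        }
    }
}

/* Demo-only translation: pages below 0x1000 map to page + 0x100. */
bool demo_miss(uint64_t iova_page, uint64_t *gpa_page, void *opaque)
{
    (void)opaque;
    if (iova_page >= 0x1000) {
        return false;
    }
    *gpa_page = iova_page + 0x100;
    return true;
}
```

A direct-mapped layout keeps the sketch short; a real cache would more likely size and index its entries according to the page sizes the IOMMU reports.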
575 ``qio_channel_socket_new_fd()``. This channel will be used to create a
576 *QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
577 the process's device state back to QEMU. This method will be reversed on
578 restore - the channel will be passed to ``qemu_loadvm_state()`` to
584 The messages that are required to be sent between QEMU and the emulation
585 process can add considerable latency to IO operations. The optimizations
586 described below attempt to ameliorate this effect by allowing the
587 emulation process to communicate directly with the kernel KVM driver.
588 The KVM file descriptors created would be passed to the emulation process
596 could, however, be expanded to cover more cases.
601 that the emulation process can use to receive MMIO notifications. QEMU
603 descriptor to the emulation process via an initialization message.
611 device will respond to. It includes the base and length of the range, as
613 specify whether the range refers to memory or IO addresses).
615 A device can have multiple physical address ranges it responds to (e.g.,
617 an enumerated identifier to specify which of the device's ranges is
618 being referred to.
637 includes a sequence number that can be used to reply to the MMIO, and
661 acknowledged. The main use of the second queue is to validate MMIO
667 MMIOs may be waiting to be consumed by an emulation program and multiple
669 wait queue and sequence number for the per-CPU threads, allowing them to
671 program. It also tracks the number of posted MMIO stores to the device
672 that haven't been replied to, in order to satisfy the PCI constraint
673 that a load to a device will not complete until all previous stores to
679 completed without sending an MMIO request to the emulation program if the
683 The emulation program will ask the KVM driver to allocate memory for the
684 shadow image, and will then use ``mmap()`` to directly access it. The
685 emulation program can control KVM access to the shadow image by sending
688 MMIO request to the emulation program. The access map can also inform
689 the KVM driver which size accesses are allowed to the image.
694 The master descriptor is used by QEMU to configure the new KVM device.
712 *kvm\_device* private field to it.
721 be passed to the device emulation program. Only one slave can be created
728 argument. For buses that assign addresses to devices dynamically, this
732 *KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
733 register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
735 ``kvm_io_bus_unregister_dev()`` is used to remove the previous
739 how long KVM will wait for the emulation process to respond to an MMIO
745 to destroy the slave descriptor; and free any memory allocated by the
752 responds to system calls on the descriptor performed by the device
760 pending queue to the sent queue, and if there are threads waiting for
761 space in the pending queue to add new MMIO operations, they will be woken
766 A write also consists of a set of MMIO requests. They are compared to
771 waiting for posted stores to complete, the load is continued.
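The ordering rule for posted stores and loads can be sketched as a per-CPU counter. This is a minimal illustration, not the proposed KVM driver code: the ``CpuScoreboard`` type and the function names are hypothetical, and no locking or wait queue is shown.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-CPU state for one remote device. */
typedef struct {
    unsigned posted_stores;   /* stores sent but not yet replied to */
    bool load_waiting;        /* a load is blocked behind those stores */
} CpuScoreboard;

/* Record that a posted store has been queued for the emulation program. */
void sb_store_posted(CpuScoreboard *sb)
{
    sb->posted_stores++;
}

/*
 * A load may only proceed once every older store has completed.
 * Returns true if the load can be issued now; otherwise marks it waiting.
 */
bool sb_load_may_proceed(CpuScoreboard *sb)
{
    if (sb->posted_stores == 0) {
        return true;
    }
    sb->load_waiting = true;
    return false;
}

/*
 * Record a reply for one posted store. Returns true when this was the
 * last outstanding store and a waiting load should now be continued.
 */
bool sb_store_acked(CpuScoreboard *sb)
{
    assert(sb->posted_stores > 0);
    sb->posted_stores--;
    if (sb->posted_stores == 0 && sb->load_waiting) {
        sb->load_waiting = false;
        return true;
    }
    return false;
}
```

In a driver, the thread that sees ``sb_load_may_proceed()`` return false would sleep, and the path that handles replies would wake it when ``sb_store_acked()`` returns true.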
778 A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
780 ``mmap()``\ ed by the emulation process to share the emulation's view of
783 A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
786 sending the load request to the emulation program. It will also specify
792 to determine if there are MMIO requests waiting to be read. It will
797 This call allows the emulation program to directly access the shadow
801 needing to wait for the emulation program.
807 VM. KVM will use the MMIO's guest physical address to search for a
808 matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
809 driver instead of exiting back to QEMU. If a match is found, the
814 This callback is invoked when the guest performs a load to the device.
816 driver putting the QEMU thread to sleep waiting for the emulation
819 if there are no outstanding stores to the device by this CPU. PCI memory
820 ordering demands that a load cannot complete before all older stores to
828 the per-CPU scoreboard, in order to implement the PCI ordering
836 the device's corresponding interrupt to be triggered by the KVM driver.
837 These irq file descriptors are sent to the emulation process at
844 Traditional PCI pin interrupts are level based, so, in addition to an
845 irq file descriptor, a re-sampling file descriptor needs to be sent to
847 devices sharing an irq to be notified when the interrupt has been
855 using ``event_notifier_init()`` to create the irq and re-sampling
856 *eventfds*, and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt.
864 pin is connected to. The proxy object in QEMU will use
865 ``pci_device_set_intx_routing_notifier()`` to be informed of any guest
866 changes to the route. This handler will broadly follow the VFIO
867 interrupt logic to change the route: de-assigning the existing irq
874 MSI/X interrupts are sent as DMA transactions to the host. The interrupt
877 may need to be sent to the emulation program.
884 ``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
894 at guest config space programming to keep the MSI interrupt state
902 After IO services have been disaggregated, a second phase would be to
903 separate a process to handle CPU instruction emulation from the main
905 code, so the first task would be to create one.
910 Separating QEMU relies on the host OS's access restriction mechanisms to
912 are entitled to. There are a couple of types of mechanisms usually provided
918 Discretionary access control allows each user to control who can access
923 need a separate user ID to provide access control, which is likely to be
929 Mandatory access control allows the OS to add an additional set of
930 controls on top of discretionary access for the OS to control. It also
931 adds other attributes to processes and files such as types, roles, and
938 Type enforcement assigns a *type* attribute to processes and files, and
939 allows rules to be written on what operations a process with a given
951 host tun/tap device used to provide guest networking.
956 Category enforcement assigns a set of numbers within a given range to
957 the process or file. The process is granted access to the file if the
959 used to separate multiple instances of devices in the same class.
961 For example, if there are multiple disk devices provided to a guest,
963 category. The different device emulation processes would not be able to
968 used to prevent device emulation processes in different classes from
969 accessing resources assigned to other classes.
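Category enforcement can be illustrated with bit sets. This sketch assumes the usual multi-category rule that a process may access a file only when the process's category set contains every category assigned to the file; that rule, the 64-category limit, and the function names are assumptions made for illustration, not taken from a specific policy.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per category; 64 categories is an arbitrary limit for the sketch. */
typedef uint64_t CategorySet;

/* Build a set containing the single category number c (0..63). */
CategorySet category_bit(unsigned c)
{
    assert(c < 64);
    return UINT64_C(1) << c;
}

/*
 * Assumed rule: access is granted when the process's set is a superset
 * of the file's set.
 */
bool category_access_allowed(CategorySet process, CategorySet file)
{
    return (process & file) == file;
}
```

With one category per disk image, an emulation process labelled for the first disk is refused access to the second disk's image, while a process holding both categories can reach either.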