xref: /qemu/docs/devel/multi-process.rst (revision 8684f1be6f2235a7672a9256b4494cb5d3ef292b)
This is the design document for multi-process QEMU. It does not
necessarily reflect the status of the current implementation, which
may lack features or be considerably different from what is described
in this document. This document is still useful as a description of
the goals and general direction of this feature.

Please refer to the following wiki for the latest details:
https://wiki.qemu.org/Features/MultiProcessQEMU

Multi-process QEMU
===================

QEMU is often used as the hypervisor for virtual machines running in the
Oracle cloud. Since one of the advantages of cloud computing is the
ability to run many VMs from different tenants in the same cloud
infrastructure, a guest that compromised its hypervisor could
potentially use the hypervisor's access privileges to access data it is
not authorized for.

QEMU can be susceptible to security attacks because it is a large,
monolithic program that provides many features to the VMs it services.
Many of these features can be configured out of QEMU, but even a reduced
configuration QEMU has a large amount of code a guest can potentially
attack. Separating QEMU into multiple processes reduces the attack
surface by helping to limit each component in the system to accessing
only the resources it needs to perform its job.

QEMU services
-------------

QEMU can be broadly described as providing three main services. One is a
VM control point, where VMs can be created, migrated, re-configured, and
destroyed. A second is to emulate the CPU instructions within the VM,
often accelerated by HW virtualization features such as Intel's VT
extensions. Finally, it provides IO services to the VM by emulating HW
IO devices, such as disk and network devices.

A multi-process QEMU
~~~~~~~~~~~~~~~~~~~~

A multi-process QEMU involves separating QEMU services into separate
host processes. Each of these processes can be given only the privileges
it needs to provide its service, e.g., a disk service could be given
access only to the disk images it provides, and not be allowed to
access other files, or any network devices. An attacker who compromised
this service would not be able to use this exploit to access files or
devices beyond what the disk service was given access to.

A QEMU control process would remain, but in multi-process mode, it
would have no direct interfaces to the VM. During VM execution, it would
still provide the user interface to hot-plug devices or live migrate the
VM.

A first step in creating a multi-process QEMU is to separate IO services
from the main QEMU program, which would continue to provide CPU
emulation; i.e., the control process would also be the CPU emulation
process. In a later phase, CPU emulation could be separated from the
control process.

Separating IO services
----------------------

Separating IO services into individual host processes is a good place to
begin for a couple of reasons. One is that the sheer number of IO
devices QEMU can emulate provides a large surface of interfaces which
could potentially be exploited, and, indeed, has been a source of
exploits in the past. Another is that the modular nature of QEMU device
emulation code provides interface points where the QEMU functions that
perform device emulation can be separated from the QEMU functions that
manage the emulation of guest CPU instructions. The devices emulated in
the separate process are referred to as remote devices.

QEMU device emulation
~~~~~~~~~~~~~~~~~~~~~

QEMU uses an object-oriented SW architecture for device emulation code.
Configured objects are all compiled into the QEMU binary, then objects
are instantiated by name when used by the guest VM. For example, the
code to emulate a device named "foo" is always present in QEMU, but its
instantiation code is only run when the device is included in the target
VM (e.g., via the QEMU command line as *-device foo*).

The object model is hierarchical, so device emulation code names its
parent object (such as "pci-device" for a PCI device) and QEMU will
instantiate a parent object before calling the device's instantiation
code.

Current separation models
~~~~~~~~~~~~~~~~~~~~~~~~~

In order to separate the device emulation code from the CPU emulation
code, the device object code must run in a different process. There are
a couple of existing QEMU features that can run emulation code
separately from the main QEMU process. These are examined below.

vhost user model
^^^^^^^^^^^^^^^^

Virtio guest device drivers can be connected to vhost user applications
in order to perform their IO operations. This model uses special virtio
device drivers in the guest and vhost user device objects in QEMU, but
once the QEMU vhost user code has configured the vhost user application,
mission-mode IO is performed by the application. The vhost user
application is a daemon process that can be contacted via a known UNIX
domain socket.

vhost socket
''''''''''''

As mentioned above, one of the tasks of the vhost device object within
QEMU is to contact the vhost application and send it configuration
information about this device instance. As part of the configuration
process, the application can also be sent other file descriptors over
the socket, which then can be used by the vhost user application in
various ways, some of which are described below.

vhost MMIO store acceleration
'''''''''''''''''''''''''''''

VMs are often run using HW virtualization features via the KVM kernel
driver. This driver allows QEMU to accelerate the emulation of guest CPU
instructions by running the guest in a virtual HW mode. When the guest
executes instructions that cannot be executed by virtual HW mode,
execution returns to the KVM driver so it can inform QEMU to emulate the
instructions in SW.

One of the events that can cause a return to QEMU is when a guest device
driver accesses an IO location. QEMU then dispatches the memory
operation to the corresponding QEMU device object. In the case of a
vhost user device, the memory operation would need to be sent over a
socket to the vhost application. This path is accelerated by the QEMU
virtio code, which sets up an eventfd file descriptor from which the
vhost application can directly receive MMIO store notifications from
the KVM driver, instead of needing them to be sent to the QEMU process
first.

vhost interrupt acceleration
''''''''''''''''''''''''''''

Another optimization used by the vhost application is the ability to
directly inject interrupts into the VM via the KVM driver, again,
bypassing the need to send the interrupt back to the QEMU process first.
The QEMU virtio setup code configures the KVM driver with an eventfd
that triggers the device interrupt in the guest when the eventfd is
written. This irqfd file descriptor is then passed to the vhost user
application program.

vhost access to guest memory
''''''''''''''''''''''''''''

The vhost application is also allowed to directly access guest memory,
instead of needing to send the data as messages to QEMU. This is also
done with file descriptors sent to the vhost user application by QEMU.
These descriptors can be passed to ``mmap()`` by the vhost application
to map the guest address space into the vhost application.

IOMMUs introduce another level of complexity, since the address given to
the guest virtio device to DMA to or from is not a guest physical
address. This case is handled by having vhost code within QEMU register
as a listener for IOMMU mapping changes. The vhost application maintains
a cache of IOMMU translations: sending translation requests back to
QEMU on cache misses, and in turn receiving flush requests from QEMU
when mappings are purged.

applicability to device separation
''''''''''''''''''''''''''''''''''

Much of the vhost model can be re-used by separated device emulation. In
particular, the ideas of using a socket between QEMU and the device
emulation application, using a file descriptor to inject interrupts into
the VM via KVM, and allowing the application to ``mmap()`` guest memory
should be re-used.

There are, however, some notable differences between how a vhost
application works and the needs of separated device emulation. The most
basic is that vhost uses custom virtio device drivers which always
trigger IO with MMIO stores. A separated device emulation model must
work with existing IO device models and guest device drivers. MMIO loads
break vhost store acceleration since they are synchronous - guest
progress cannot continue until the load has been emulated. By contrast,
stores are asynchronous; the guest can continue after the store event
has been sent to the vhost application.

Another difference is that in the vhost user model, a single daemon can
support multiple QEMU instances. This is contrary to the security regime
desired, in which the emulation application should only be allowed to
access the files or devices the VM it's running on behalf of can access.

qemu-io model
^^^^^^^^^^^^^

Qemu-io is a test harness used to test changes to the QEMU block backend
object code (e.g., the code that implements disk images for disk driver
emulation). Qemu-io is not a device emulation application per se, but it
does compile the QEMU block objects into a separate binary from the main
QEMU one. This could be useful for disk device emulation, since its
emulation applications will need to include the QEMU block objects.

New separation model based on proxy objects
-------------------------------------------

A different model based on proxy objects in the QEMU program
communicating with remote emulation programs could provide separation
while minimizing the changes needed to the device emulation code. The
rest of this section is a discussion of how a proxy object model would
work.

Remote emulation processes
~~~~~~~~~~~~~~~~~~~~~~~~~~

The remote emulation process will run the QEMU object hierarchy without
modification. The device emulation objects will also be based on the
QEMU code, because for anything but the simplest device, it would not be
tractable to re-implement both the object model and the many device
backends that QEMU has.

The processes will communicate with the QEMU process over UNIX domain
sockets. The processes can be executed either as standalone processes,
or be executed by QEMU. In both cases, the host backends the emulation
processes will provide are specified on their command lines, as they
would be for QEMU. For example:

::

    disk-proc -blockdev driver=file,node-name=file0,filename=disk-file0 \
        -blockdev driver=qcow2,node-name=drive0,file=file0

would indicate process *disk-proc* uses a qcow2 disk image (blockdev
node *drive0*) backed by the file *disk-file0*.

Emulation processes may emulate more than one guest controller. A common
configuration might be to put all controllers of the same device class
(e.g., disk, network, etc.) in a single process, so that all backends of
the same type can be managed by a single QMP monitor.

communication with QEMU
^^^^^^^^^^^^^^^^^^^^^^^

The first argument to the remote emulation process will be a UNIX domain
socket that connects with the Proxy object. This is a required argument.

::

    disk-proc <socket number> <backend list>

remote process QMP monitor
^^^^^^^^^^^^^^^^^^^^^^^^^^

Remote emulation processes can be monitored via QMP, similar to QEMU
itself. The QMP monitor socket is specified the same as for a QEMU
process:

::

    disk-proc -qmp unix:/tmp/disk-mon,server

can be monitored over the UNIX socket path */tmp/disk-mon*.

QEMU command line
~~~~~~~~~~~~~~~~~

Each remote device emulated in a remote process on the host is
represented as a *-device* of type *pci-proxy-dev*. A socket
sub-option to this option specifies the UNIX socket that connects
to the remote process. An *id* sub-option is required, and it should
be the same id as used in the remote process.

::

    qemu-system-x86_64 ... -device pci-proxy-dev,id=lsi0,socket=3

can be used to add a device emulated in a remote process.


QEMU management of remote processes
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU is not aware of the type of the remote PCI device. It is a
pass-through device as far as QEMU is concerned.

communication with emulation process
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

primary channel
'''''''''''''''

The primary channel (referred to as ``com`` in the code) is used to
bootstrap the remote process. It is also used to pass on device-agnostic
commands like reset.

per-device channels
'''''''''''''''''''

Each remote device communicates with QEMU using a dedicated communication
channel. The proxy object sets up this channel using the primary
channel during its initialization.

QEMU device proxy objects
~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has an object model based on sub-classes inherited from the
"object" super-class. The sub-classes that are of interest here are the
"device" and "bus" sub-classes whose child sub-classes make up the
device tree of a QEMU emulated system.

The proxy object model will use device proxy objects to replace the
device emulation code within the QEMU process. These objects will live
in the same place in the object and bus hierarchies as the objects they
replace. That is, the proxy object for an LSI SCSI controller will be a
sub-class of the "pci-device" class, and will have the same PCI bus
parent and the same SCSI bus child objects as the LSI controller object
it replaces.

It is worth noting that the same proxy object is used to mediate with
all types of remote PCI devices.

object initialization
^^^^^^^^^^^^^^^^^^^^^

The Proxy device objects are initialized in the exact same manner in
which any other QEMU device would be initialized.

In addition, the Proxy objects perform the following two tasks:

- Parse the "socket" sub-option and connect to the remote process
  using this channel
- Use the "id" sub-option to connect to the emulated device on the
  separate process

class\_init
'''''''''''

The ``class_init()`` method of a proxy object will, in general, behave
similarly to the object it replaces, including setting any static
properties and methods needed by the proxy.

instance\_init / realize
''''''''''''''''''''''''

The ``instance_init()`` and ``realize()`` functions would only need to
perform tasks related to being a proxy, such as registering its own
MMIO handlers, or creating a child bus that other proxy devices can be
attached to later.

Other tasks will be device-specific. For example, PCI device objects
will initialize the PCI config space in order to make a valid PCI device
tree within the QEMU process.

address space registration
^^^^^^^^^^^^^^^^^^^^^^^^^^

Most devices are driven by guest device driver accesses to IO addresses
or ports. The QEMU device emulation code uses QEMU's memory region
function calls (such as ``memory_region_init_io()``) to add callback
functions that QEMU will invoke when the guest accesses the device's
areas of the IO address space. When a guest driver does access the
device, the VM will exit HW virtualization mode and return to QEMU,
which will then look up and execute the corresponding callback function.

A proxy object would need to mirror the memory region calls the actual
device emulator would perform in its initialization code, but with its
own callbacks. When invoked by QEMU as a result of a guest IO operation,
they will forward the operation to the device emulation process.

PCI config space
^^^^^^^^^^^^^^^^

PCI devices also have a configuration space that can be accessed by the
guest driver. Guest accesses to this space are not handled by the device
emulation object, but by its PCI parent object. Much of this space is
read-only, but certain registers (especially BAR and MSI-related ones)
need to be propagated to the emulation process.

PCI parent proxy
''''''''''''''''

One way to propagate guest PCI config accesses is to create a
"pci-device-proxy" class that can serve as the parent of a PCI device
proxy object. This class's parent would be "pci-device" and it would
override the PCI parent's ``config_read()`` and ``config_write()``
methods with ones that forward these operations to the emulation
program.

interrupt receipt
^^^^^^^^^^^^^^^^^

A proxy for a device that generates interrupts will need to create a
socket to receive interrupt indications from the emulation process. An
incoming interrupt indication would then be sent up to its bus parent to
be injected into the guest. For example, a PCI device object may use
``pci_set_irq()``.

live migration
^^^^^^^^^^^^^^

The proxy will register to save and restore any *vmstate* it needs over
a live migration event. The device proxy does not need to manage the
remote device's *vmstate*; that will be handled by the remote process
proxy (see below).

QEMU remote device operation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Generic device operations, such as DMA, will be performed by the remote
process proxy by sending messages to the remote process.

DMA operations
^^^^^^^^^^^^^^

DMA operations would be handled much like vhost applications do. One of
the initial messages sent to the emulation process is a guest memory
table. Each entry in this table consists of a file descriptor and size
that the emulation process can ``mmap()`` to directly access guest
memory, similar to ``vhost_user_set_mem_table()``. Note that guest
memory must be backed by file descriptors, such as when QEMU is given
the *-mem-path* command line option.

IOMMU operations
^^^^^^^^^^^^^^^^

When the emulated system includes an IOMMU, the remote process proxy in
QEMU will need to create a socket for IOMMU requests from the emulation
process. It will handle those requests with an
``address_space_get_iotlb_entry()`` call. In order to handle IOMMU
unmaps, the remote process proxy will also register as a listener on the
device's DMA address space. When an IOMMU memory region is created
within the DMA address space, an IOMMU notifier for unmaps will be added
to the memory region that will forward unmaps to the emulation process
over the IOMMU socket.

device hot-plug via QMP
^^^^^^^^^^^^^^^^^^^^^^^

A QMP "device\_add" command can add a device emulated by a remote
process. It will also take a "rid" option, just as the *-device*
command line option does. The remote process may either be one started
at QEMU startup, or be one added by the "add-process" QMP command
described above. In either case, the remote process proxy will forward
the new device's JSON description to the corresponding emulation
process.

live migration
^^^^^^^^^^^^^^

The remote process proxy will also register for live migration
notifications with ``vmstate_register()``. When called to save state,
the proxy will send the remote process a secondary socket file
descriptor to save the remote process's device *vmstate* over. The
incoming byte stream length and data will be saved as the proxy's
*vmstate*. When the proxy is resumed on its new host, this *vmstate*
will be extracted, and a secondary socket file descriptor will be sent
to the new remote process through which it receives the *vmstate* in
order to restore the devices there.

device emulation in remote process
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The parts of QEMU that the emulation program will need include the
object model; the memory emulation objects; the device emulation objects
of the targeted device, and any dependent devices; and, the device's
backends. It will also need code to set up the machine environment,
handle requests from the QEMU process, and route machine-level requests
(such as interrupts or IOMMU mappings) back to the QEMU process.

460*8684f1beSJohn G Johnsoninitialization
461*8684f1beSJohn G Johnson^^^^^^^^^^^^^^
462*8684f1beSJohn G Johnson
463*8684f1beSJohn G JohnsonThe process initialization sequence will follow the same sequence
464*8684f1beSJohn G Johnsonfollowed by QEMU. It will first initialize the backend objects, then
465*8684f1beSJohn G Johnsondevice emulation objects. The JSON descriptions sent by the QEMU process
466*8684f1beSJohn G Johnsonwill drive which objects need to be created.

-  address spaces

Before the device objects are created, the initial address spaces and
memory regions must be configured with ``memory_map_init()``. This
creates a RAM memory region object (*system\_memory*) and an IO memory
region object (*system\_io*).

-  RAM

RAM memory region creation will follow how ``pc_memory_init()`` creates
them, but must use ``memory_region_init_ram_from_fd()`` instead of
``memory_region_allocate_system_memory()``. The file descriptors needed
will be supplied by the guest memory table from above. Those RAM regions
would then be added to the *system\_memory* memory region with
``memory_region_add_subregion()``.
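The fd-based sharing that ``memory_region_init_ram_from_fd()`` provides
can be illustrated with a short Python sketch; here an ordinary
temporary file stands in for the guest memory file descriptor passed
over the comms channel.

```python
import mmap
import tempfile

RAM_SIZE = 4096  # illustrative guest RAM size

# QEMU side: back guest RAM with a file descriptor that can be passed
# to the emulation process.
backing = tempfile.TemporaryFile()
backing.truncate(RAM_SIZE)

# Each process maps the same descriptor, so stores made through one
# mapping are visible through the other without copying.
qemu_view = mmap.mmap(backing.fileno(), RAM_SIZE)
remote_view = mmap.mmap(backing.fileno(), RAM_SIZE)

qemu_view[0:5] = b"hello"   # a guest store lands in shared RAM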

-  PCI

IO initialization will be driven by the JSON descriptions sent from the
QEMU process. For a PCI device, a PCI bus will need to be created with
``pci_root_bus_new()``, and a PCI memory region will need to be created
and added to the *system\_memory* memory region with
``memory_region_add_subregion_overlap()``. The overlap version is
required for architectures where PCI memory overlaps with RAM memory.

MMIO handling
^^^^^^^^^^^^^

The device emulation objects will use ``memory_region_init_io()`` to
install their MMIO handlers, and ``pci_register_bar()`` to associate
those handlers with a PCI BAR, as they do within QEMU currently.

In order to use ``address_space_rw()`` in the emulation process to
handle MMIO requests from QEMU, the PCI physical addresses must be the
same in the QEMU process and the device emulation process. To
accomplish that, guest BAR programming must also be forwarded from QEMU
to the emulation process.
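This forwarding could be modeled as in the sketch below, where
``RemoteProcess.set_bar`` stands in for the message carrying guest BAR
programming to the emulation process; all names and the address are
hypothetical.

```python
class RemoteProcess:
    """Stand-in for the device emulation process."""
    def __init__(self):
        self.bars = {}

    def set_bar(self, bar: int, addr: int) -> None:
        # In the design, this state arrives as a message from QEMU.
        self.bars[bar] = addr

class ProxyDevice:
    """Stand-in for the proxy object in the QEMU process."""
    def __init__(self, remote: RemoteProcess):
        self.remote = remote
        self.bars = {}

    def guest_writes_bar(self, bar: int, addr: int) -> None:
        # Guest BAR programming is applied locally *and* forwarded, so
        # MMIO addresses resolve identically in both processes.
        self.bars[bar] = addr
        self.remote.set_bar(bar, addr)

remote = RemoteProcess()
proxy = ProxyDevice(remote)
proxy.guest_writes_bar(0, 0xFEB00000)  # illustrative BAR address
```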

interrupt injection
^^^^^^^^^^^^^^^^^^^

When device emulation wants to inject an interrupt into the VM, the
request climbs the device's bus object hierarchy until the point where a
bus object knows how to signal the interrupt to the guest. The details
depend on the type of interrupt being raised.

-  PCI pin interrupts

On x86 systems, there is an emulated IOAPIC object attached to the root
PCI bus object, and the root PCI object forwards interrupt requests to
it. The IOAPIC object, in turn, calls the KVM driver to inject the
corresponding interrupt into the VM. The simplest way to handle this in
an emulation process would be to set up the root PCI bus driver (via
``pci_bus_irqs()``) to send an interrupt request back to the QEMU
process, and have the device proxy object reflect it up the PCI tree
there.

-  PCI MSI/X interrupts

PCI MSI/X interrupts are implemented in HW as DMA writes to a
CPU-specific PCI address. In QEMU on x86, a KVM APIC object receives
these DMA writes, then calls into the KVM driver to inject the interrupt
into the VM. A simple emulation process implementation would be to send
the MSI DMA address from QEMU as a message at initialization, then
install an address space handler at that address which forwards the MSI
message back to QEMU.
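A sketch of that flow: the dictionary-based address space below is a
simplified stand-in for QEMU's address space API, and the MSI address
and vector value are illustrative.

```python
MSI_ADDR = 0xFEE00000  # example x86 MSI address; illustrative

class AddressSpace:
    """Toy address space dispatching stores to registered handlers."""
    def __init__(self):
        self.handlers = {}

    def register(self, addr, handler):
        self.handlers[addr] = handler

    def write(self, addr, data):
        if addr in self.handlers:
            self.handlers[addr](data)

forwarded = []  # MSI messages sent back to the QEMU process

aspace = AddressSpace()
# Installed when QEMU sends the MSI DMA address at initialization.
aspace.register(MSI_ADDR, lambda data: forwarded.append(data))

# Device emulation raises an MSI by storing the vector to that address;
# the handler forwards the message to QEMU for injection.
aspace.write(MSI_ADDR, 0x42)
```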

DMA operations
^^^^^^^^^^^^^^

When an emulation object wants to DMA into or out of guest memory, it
must first use ``dma_memory_map()`` to convert the DMA address to a local
virtual address. The emulation process memory region objects set up above
will be used to translate the DMA address to a local virtual address the
device emulation code can access.

IOMMU
^^^^^

When an IOMMU is in use in QEMU, DMA translation uses IOMMU memory
regions to translate the DMA address to a guest physical address before
that physical address can be translated to a local virtual address. The
emulation process will need similar functionality.

-  IOTLB cache

The emulation process will maintain a cache of recent IOMMU translations
(the IOTLB). When the ``translate()`` callback of an IOMMU memory region
is invoked, the IOTLB cache will be searched for an entry that will map
the DMA address to a guest PA. On a cache miss, a message will be sent
back to QEMU requesting the corresponding translation entry, which will
both be used to return a guest address and be added to the cache.

-  IOTLB purge

The IOMMU emulation will also need to act on unmap requests from QEMU.
These happen when the guest IOMMU driver purges an entry from the
guest's translation table.
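Both behaviours, fill-on-miss and purge-on-unmap, can be sketched
together; ``ask_qemu`` stands in for the translation-request message
back to QEMU, and the page size and mapping below are illustrative.

```python
class IOTLB:
    """Per-process cache of IOMMU translations (DMA addr -> guest PA)."""
    PAGE = 4096

    def __init__(self, ask_qemu):
        self.ask_qemu = ask_qemu  # miss handler: message back to QEMU
        self.cache = {}

    def translate(self, dma_addr):
        page = dma_addr & ~(self.PAGE - 1)
        if page not in self.cache:       # miss: request the entry
            self.cache[page] = self.ask_qemu(page)
        return self.cache[page] + (dma_addr & (self.PAGE - 1))

    def unmap(self, dma_addr):
        # Guest IOMMU driver purged this entry; drop any cached copy.
        self.cache.pop(dma_addr & ~(self.PAGE - 1), None)

requests = []
def ask_qemu(page):
    requests.append(page)
    return page + 0x100000   # pretend guest PA; illustrative mapping

tlb = IOTLB(ask_qemu)
pa1 = tlb.translate(0x2010)  # miss: one message sent to QEMU
pa2 = tlb.translate(0x2020)  # hit: served from the cache
tlb.unmap(0x2000)            # purge on guest unmap
```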

live migration
^^^^^^^^^^^^^^

When a remote process receives a live migration indication from QEMU, it
will set up a channel using the received file descriptor with
``qio_channel_socket_new_fd()``. This channel will be used to create a
*QEMUfile* that can be passed to ``qemu_save_device_state()`` to send
the process's device state back to QEMU. This method will be reversed on
restore - the channel will be passed to ``qemu_loadvm_state()`` to
restore the device state.

Accelerating device emulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The messages that must be sent between QEMU and the emulation process
can add considerable latency to IO operations. The optimizations
described below attempt to ameliorate this effect by allowing the
emulation process to communicate directly with the kernel KVM driver.
The KVM file descriptors created would be passed to the emulation process
via initialization messages, much like the way the guest memory table is
passed.

MMIO acceleration
^^^^^^^^^^^^^^^^^

Vhost user applications can receive guest virtio driver stores directly
from KVM. The issue with the eventfd mechanism used by vhost user is
that it does not pass any data with the event indication, so it cannot
handle guest loads or guest stores that carry store data. This concept
could, however, be expanded to cover more cases.

The expanded idea would require a new type of KVM device:
*KVM\_DEV\_TYPE\_USER*. This device has two file descriptors: a master
descriptor that QEMU can use for configuration, and a slave descriptor
that the emulation process can use to receive MMIO notifications. QEMU
would create both descriptors using the KVM driver, and pass the slave
descriptor to the emulation process via an initialization message.

data structures
^^^^^^^^^^^^^^^

-  guest physical range

The guest physical range structure describes the address range that a
device will respond to. It includes the base and length of the range, as
well as which bus the range resides on (e.g., on an x86 machine, it can
specify whether the range refers to memory or IO addresses).

A device can have multiple physical address ranges it responds to (e.g.,
a PCI device can have multiple BARs), so the structure will also include
an enumerated identifier to specify which of the device's ranges is
being referred to.

+--------+----------------------------+
| Name   | Description                |
+========+============================+
| addr   | range base address         |
+--------+----------------------------+
| len    | range length               |
+--------+----------------------------+
| bus    | addr type (memory or IO)   |
+--------+----------------------------+
| id     | range ID (e.g., PCI BAR)   |
+--------+----------------------------+

-  MMIO request structure

This structure describes an MMIO operation. It includes which guest
physical range the MMIO was within, the offset within that range, the
MMIO type (e.g., load or store), and its length and data. It also
includes a sequence number that can be used to reply to the MMIO, and
the CPU that issued the MMIO.

+----------+------------------------+
| Name     | Description            |
+==========+========================+
| rid      | range MMIO is within   |
+----------+------------------------+
| offset   | offset within *rid*    |
+----------+------------------------+
| type     | e.g., load or store    |
+----------+------------------------+
| len      | MMIO length            |
+----------+------------------------+
| data     | store data             |
+----------+------------------------+
| seq      | sequence ID            |
+----------+------------------------+

-  MMIO request queues

MMIO request queues are FIFO arrays of MMIO request structures. There
are two queues: the pending queue holds MMIOs that haven't been read by
the emulation program, and the sent queue holds MMIOs that haven't been
acknowledged. The main use of the second queue is to validate MMIO
replies from the emulation program.
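The two queues and the reply-validation step can be sketched as follows;
field names follow the tables above, while the class and method names
are illustrative.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class MMIORequest:
    """Mirrors the MMIO request structure fields listed above."""
    rid: int      # guest physical range the MMIO is within
    offset: int   # offset within rid
    type: str     # load or store
    len: int      # MMIO length
    data: int     # store data
    seq: int      # sequence ID, used to match the reply

class MMIOQueues:
    """Pending: unread by emulation; sent: read but unacknowledged."""
    def __init__(self):
        self.pending = deque()  # a real implementation would bound this
        self.sent = {}          # seq -> request, awaiting a reply

    def post(self, req):
        """KVM side: queue a guest MMIO for the emulation program."""
        self.pending.append(req)

    def read(self):
        """Emulation-program read(): drain pending, remember as sent."""
        reqs = list(self.pending)
        self.pending.clear()
        for r in reqs:
            self.sent[r.seq] = r
        return reqs

    def reply(self, seq):
        """Emulation-program write(): valid only for an outstanding seq."""
        return self.sent.pop(seq, None)

q = MMIOQueues()
q.post(MMIORequest(rid=0, offset=0x10, type="store", len=4, data=7, seq=1))
reqs = q.read()      # moved from pending to sent
acked = q.reply(1)   # matches the outstanding request
bogus = q.reply(99)  # unknown sequence: rejected
```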

-  scoreboard

Each CPU in the VM is emulated in QEMU by a separate thread, so multiple
MMIOs may be waiting to be consumed by an emulation program and multiple
threads may be waiting for MMIO replies. The scoreboard would contain a
wait queue and sequence number for the per-CPU threads, allowing them to
be individually woken when the MMIO reply is received from the emulation
program. It also tracks the number of posted MMIO stores to the device
that haven't been replied to, in order to satisfy the PCI constraint
that a load to a device will not complete until all previous stores to
that device have been completed.
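The posted-store tracking could look like the following sketch; the
class and method names are illustrative.

```python
class CPUScoreboard:
    """Per-CPU count of posted stores not yet acknowledged.

    PCI ordering: a load from a device must not complete while any
    older store from the same CPU to that device is outstanding.
    """
    def __init__(self):
        self.posted_stores = 0
        self.next_seq = 0

    def post_store(self):
        self.posted_stores += 1
        self.next_seq += 1
        return self.next_seq      # sequence number to match the reply

    def store_acked(self):
        self.posted_stores -= 1

    def load_may_use_shadow(self):
        # Only a side-effect-free load with no outstanding stores can
        # be satisfied locally from the shadow image.
        return self.posted_stores == 0

sb = CPUScoreboard()
sb.post_store()
blocked = sb.load_may_use_shadow()  # store still posted
sb.store_acked()
ready = sb.load_may_use_shadow()    # all stores acknowledged
```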

-  device shadow memory

Some MMIO loads do not have device side-effects. These MMIOs can be
completed without sending a MMIO request to the emulation program if the
emulation program shares a shadow image of the device's memory image
with the KVM driver.

The emulation program will ask the KVM driver to allocate memory for the
shadow image, and will then use ``mmap()`` to directly access it. The
emulation program can control KVM access to the shadow image by sending
KVM an access map telling it which areas of the image have no
side-effects (and can be completed immediately), and which require a
MMIO request to the emulation program. The access map can also inform
the KVM driver which size accesses are allowed to the image.
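A model of the shadow image plus its access map; the byte-granular map
here is a simplification of whatever granularity the real interface
would use, and all names are illustrative.

```python
class ShadowImage:
    """Model of the device shadow image and its access map."""
    def __init__(self, size):
        self.mem = bytearray(size)   # shared via mmap() in the design
        self.no_side_effect = set()  # offsets safe to read locally

    def allow(self, start, length):
        # Emulation program marks a side-effect-free region.
        self.no_side_effect.update(range(start, start + length))

    def guest_load(self, offset, length):
        # KVM completes the load from the shadow only if every byte is
        # marked side-effect free; otherwise it must forward an MMIO
        # request to the emulation program.
        if all(o in self.no_side_effect
               for o in range(offset, offset + length)):
            return bytes(self.mem[offset:offset + length])
        return None  # forward to the emulation program

shadow = ShadowImage(64)
shadow.mem[0:4] = b"\x01\x02\x03\x04"
shadow.allow(0, 16)
fast = shadow.guest_load(0, 4)   # completed from the shadow
slow = shadow.guest_load(32, 4)  # must go to the emulation program
```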

master descriptor
^^^^^^^^^^^^^^^^^

The master descriptor is used by QEMU to configure the new KVM device.
The descriptor would be returned by the KVM driver when QEMU issues a
*KVM\_CREATE\_DEVICE* ``ioctl()`` with a *KVM\_DEV\_TYPE\_USER* type.

KVM\_DEV\_TYPE\_USER device ops
'''''''''''''''''''''''''''''''

The *KVM\_DEV\_TYPE\_USER* operations vector will be registered by a
``kvm_register_device_ops()`` call when the KVM system is initialized by
``kvm_init()``. These device ops are called by the KVM driver when QEMU
executes certain ``ioctl()`` operations on its KVM file descriptor. They
include:

-  create

This routine is called when QEMU issues a *KVM\_CREATE\_DEVICE*
``ioctl()`` on its per-VM file descriptor. It will allocate and
initialize a KVM user device specific data structure, and assign the
*kvm\_device* private field to it.

-  ioctl

This routine is invoked when QEMU issues an ``ioctl()`` on the master
descriptor. The ``ioctl()`` commands supported are defined by the KVM
device type. *KVM\_DEV\_TYPE\_USER* ones will need several commands:

*KVM\_DEV\_USER\_SLAVE\_FD* creates the slave file descriptor that will
be passed to the device emulation program. Only one slave can be created
by each master descriptor. The file operations performed by this
descriptor are described below.

The *KVM\_DEV\_USER\_PA\_RANGE* command configures a guest physical
address range that the slave descriptor will receive MMIO notifications
for. The range is specified by a guest physical range structure
argument. For buses that assign addresses to devices dynamically, this
command can be executed while the guest is running, such as the case
when a guest changes a device's PCI BAR registers.

*KVM\_DEV\_USER\_PA\_RANGE* will use ``kvm_io_bus_register_dev()`` to
register *kvm\_io\_device\_ops* callbacks to be invoked when the guest
performs a MMIO operation within the range. When a range is changed,
``kvm_io_bus_unregister_dev()`` is used to remove the previous
instantiation.

*KVM\_DEV\_USER\_TIMEOUT* will configure a timeout value that specifies
how long KVM will wait for the emulation process to respond to a MMIO
indication.

-  destroy

This routine is called when the VM instance is destroyed. It will need
to destroy the slave descriptor and free any memory allocated by the
driver, as well as the *kvm\_device* structure itself.

slave descriptor
^^^^^^^^^^^^^^^^

The slave descriptor will have its own file operations vector, which
responds to system calls on the descriptor performed by the device
emulation program.

-  read

A read returns any pending MMIO requests from the KVM driver as MMIO
request structures. Multiple structures can be returned if there are
multiple MMIO operations pending. The MMIO requests are moved from the
pending queue to the sent queue, and if there are threads waiting for
space in the pending queue to add new MMIO operations, they will be
woken here.

-  write

A write also consists of a set of MMIO requests. They are compared to
the MMIO requests in the sent queue. Matches are removed from the sent
queue, and any threads waiting for the reply are woken. If a store is
removed, then the number of posted stores in the per-CPU scoreboard is
decremented. When the number reaches zero, and a non-side-effect load
was waiting for posted stores to complete, the load is continued.

-  ioctl

There are several ``ioctl()`` commands that can be performed on the
slave descriptor.

A *KVM\_DEV\_USER\_SHADOW\_SIZE* ``ioctl()`` causes the KVM driver to
allocate memory for the shadow image. This memory can later be
``mmap()``\ ed by the emulation process to share the emulation's view of
device memory with the KVM driver.

A *KVM\_DEV\_USER\_SHADOW\_CTRL* ``ioctl()`` controls access to the
shadow image. It will send the KVM driver a shadow control map, which
specifies which areas of the image can complete guest loads without
sending the load request to the emulation program. It will also specify
the size of load operations that are allowed.

-  poll

An emulation program will use the ``poll()`` call with a *POLLIN* flag
to determine if there are MMIO requests waiting to be read. It will
return if the pending MMIO request queue is not empty.

-  mmap

This call allows the emulation program to directly access the shadow
image allocated by the KVM driver. As device emulation updates device
memory, changes with no side-effects will be reflected in the shadow,
and the KVM driver can satisfy guest loads from the shadow image without
needing to wait for the emulation program.

kvm\_io\_device ops
^^^^^^^^^^^^^^^^^^^

Each KVM per-CPU thread can handle MMIO operations on behalf of the
guest VM. KVM will use the MMIO's guest physical address to search for a
matching *kvm\_io\_device* to see if the MMIO can be handled by the KVM
driver instead of exiting back to QEMU. If a match is found, the
corresponding callback will be invoked.

-  read

This callback is invoked when the guest performs a load to the device.
Loads with side-effects must be handled synchronously, with the KVM
driver putting the QEMU thread to sleep waiting for the emulation
process reply before re-starting the guest. Loads that do not have
side-effects may be optimized by satisfying them from the shadow image,
if there are no outstanding stores to the device by this CPU. PCI memory
ordering demands that a load cannot complete before all older stores to
the same device have been completed.

-  write

Stores can be handled asynchronously unless the pending MMIO request
queue is full. In this case, the QEMU thread must sleep waiting for
space in the queue. Stores will increment the number of posted stores in
the per-CPU scoreboard, in order to implement the PCI ordering
constraint above.

interrupt acceleration
^^^^^^^^^^^^^^^^^^^^^^

This performance optimization would work much like a vhost user
application does, where the QEMU process sets up *eventfds* that cause
the device's corresponding interrupt to be triggered by the KVM driver.
These irq file descriptors are sent to the emulation process at
initialization, and are used when the emulation code raises a device
interrupt.

intx acceleration
'''''''''''''''''

Traditional PCI pin interrupts are level based, so, in addition to an
irq file descriptor, a re-sampling file descriptor needs to be sent to
the emulation program. This second file descriptor allows multiple
devices sharing an irq to be notified when the interrupt has been
acknowledged by the guest, so they can re-trigger the interrupt if their
device has not de-asserted its interrupt.
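The resampling protocol for a shared line can be sketched as follows;
the classes are illustrative models, not QEMU objects.

```python
class IrqLine:
    """Shared irq line; counts triggers delivered to the guest."""
    def __init__(self):
        self.triggers = 0

    def trigger(self):
        self.triggers += 1

class IntxDevice:
    """Level-triggered interrupt source sharing an irq line."""
    def __init__(self, line):
        self.line = line
        self.asserted = False

    def raise_irq(self):
        self.asserted = True
        self.line.trigger()

    def resample(self):
        # Called via the re-sampling descriptor after guest EOI:
        # re-trigger only if this device still asserts its interrupt.
        if self.asserted:
            self.line.trigger()

line = IrqLine()
dev_a, dev_b = IntxDevice(line), IntxDevice(line)
dev_a.raise_irq()             # one trigger reaches the guest
for dev in (dev_a, dev_b):    # guest EOI: resample fans out to both
    dev.resample()            # dev_a re-triggers; dev_b stays quiet
```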

intx irq descriptor
"""""""""""""""""""

The irq descriptors are created by the proxy object using
``event_notifier_init()`` to create the irq and re-sampling *eventfds*,
and ``kvm_vm_ioctl(KVM_IRQFD)`` to bind them to an interrupt. The
interrupt route can be found with ``pci_device_route_intx_to_irq()``.

intx routing changes
""""""""""""""""""""

Intx routing can be changed when the guest programs the APIC the device
pin is connected to. The proxy object in QEMU will use
``pci_device_set_intx_routing_notifier()`` to be informed of any guest
changes to the route. This handler will broadly follow the VFIO
interrupt logic to change the route: de-assigning the existing irq
descriptor from its route, then assigning it the new route. (see
``vfio_intx_update()``)

MSI/X acceleration
''''''''''''''''''

MSI/X interrupts are sent as DMA transactions to the host. The interrupt
data contains a vector that is programmed by the guest. A device may have
multiple MSI interrupts associated with it, so multiple irq descriptors
may need to be sent to the emulation program.

MSI/X irq descriptor
""""""""""""""""""""

This case will also follow the VFIO example. For each MSI/X interrupt,
an *eventfd* is created, a virtual interrupt is allocated by
``kvm_irqchip_add_msi_route()``, and the virtual interrupt is bound to
the eventfd with ``kvm_irqchip_add_irqfd_notifier()``.

MSI/X config space changes
""""""""""""""""""""""""""

The guest may dynamically update several MSI-related tables in the
device's PCI config space. These include per-MSI interrupt enables and
vector data. Additionally, MSI/X tables exist in device memory space, not
config space. Much like the BAR case above, the proxy object must look
at guest config space programming to keep the MSI interrupt state
consistent between QEMU and the emulation program.

--------------

Disaggregated CPU emulation
---------------------------

After IO services have been disaggregated, a second phase would be to
separate a process to handle CPU instruction emulation from the main
QEMU control function. There are no object separation points for this
code, so the first task would be to create one.

Host access controls
--------------------

Separating QEMU relies on the host OS's access restriction mechanisms to
enforce that the differing processes can only access the objects they
are entitled to. There are a couple of types of mechanisms usually
provided by general-purpose OSes.

Discretionary access control
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Discretionary access control allows each user to control who can access
their files. In Linux, this type of control is usually too coarse for
QEMU separation, since it only provides three separate access controls:
one for the same user ID, the second for user IDs with the same group
ID, and the third for all other user IDs. Each device instance would
need a separate user ID to provide access control, which is likely to be
unwieldy for dynamically created VMs.

Mandatory access control
~~~~~~~~~~~~~~~~~~~~~~~~

Mandatory access control allows the OS to impose an additional set of
access controls on top of the discretionary ones. It also adds other
attributes to processes and files, such as types, roles, and
categories, and can establish rules for how processes and files can
interact.
931*8684f1beSJohn G Johnson
Type enforcement
^^^^^^^^^^^^^^^^

Type enforcement assigns a *type* attribute to processes and files, and
allows rules to be written on what operations a process with a given
type can perform on a file with a given type. QEMU separation could take
advantage of type enforcement by running each class of emulation process
with its own type, distinct both from the main QEMU process's type and
from the types used by other classes of devices.

For example, guest disk images and disk emulation processes could have
types separate from the main QEMU process and non-disk emulation
processes, and the type rules could prevent processes other than disk
emulation ones from accessing guest disk images. Similarly, network
emulation processes can have a type separate from the main QEMU process
and non-network emulation processes, and only that type can access the
host tun/tap device used to provide guest networking.
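
On an SELinux host, such a rule could look like the following policy
fragment. This is only a sketch: the type names ``qemu_disk_t`` and
``guest_img_t`` are invented for illustration and are not taken from any
shipped policy::

    # Hypothetical type-enforcement rule: only the disk emulation
    # domain may open guest disk images. Under SELinux's default-deny
    # model, the main QEMU domain and the network emulation domains
    # are denied access simply by having no rule of their own.
    allow qemu_disk_t guest_img_t:file { open read write getattr };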

Category enforcement
^^^^^^^^^^^^^^^^^^^^

Category enforcement assigns a set of numbers within a given range to
the process or file. The process is granted access to the file if the
process's set is a superset of the file's set. This enforcement can be
used to separate multiple instances of devices in the same class.

For example, if there are multiple disk devices provided to a guest,
each device emulation process could be provisioned with a separate
category. The different device emulation processes would not be able to
access each other's backing disk images.
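
On an SELinux host with Multi-Category Security (MCS), this could be
arranged roughly as follows. This is only a sketch: the paths, category
numbers, and emulation binary name are invented for illustration::

    # Label each backing image with a distinct category.
    chcon -l s0:c10 /vm/disk1.img
    chcon -l s0:c11 /vm/disk2.img

    # Launch each disk emulation process with only its own category.
    # Its category set is then a superset of its own image's set, but
    # not of the other image's, so cross-instance access is denied.
    runcon -l s0:c10 qemu-disk-emu --image /vm/disk1.img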

Alternatively, categories could be used in lieu of the type enforcement
scheme described above. In this scenario, different categories would be
used to prevent device emulation processes in different classes from
accessing resources assigned to other classes.