# Virtual IOMMU

## Rationales

Exposing a virtual IOMMU to the guest can be useful to support specific use
cases. That being said, it is important to keep in mind that a virtual IOMMU
can impact the performance of the attached devices, which is why one should
be careful when enabling this feature.

### Protect nested virtual machines

The first reason why one might want to expose a virtual IOMMU to the guest is
to increase the security of the memory accesses performed by the virtual
devices (VIRTIO devices) on behalf of the guest drivers.

With a virtual IOMMU, the VMM stands between the guest driver and its device
counterpart, validating and translating every address before accessing the
guest memory. This is standard interposition, performed here by the VMM.

The increased security does not apply to the simple case where we have one VM
per VMM. Because the guest cannot be trusted, as we always consider it could
be malicious and gain unauthorized privileges inside the VM, preventing some
devices from accessing the entire guest memory is pointless.

But let's take the interesting case of nested virtualization, and let's assume
we have a VMM running a first layer VM. This L1 guest is fully trusted as the
user intends to run multiple VMs from this L1. We can end up with multiple L2
VMs running on a single L1 VM. In this particular case, and without exposing a
virtual IOMMU to the L1 guest, it would be possible for any L2 guest to use the
device implementation from the host VMM to access the entire guest L1 memory.
The virtual IOMMU prevents this kind of trouble as it validates the addresses
the device is authorized to access.

### Achieve VFIO nested

Another reason for having a virtual IOMMU is to allow passing physical devices
from the host through multiple layers of virtualization. Let's take as an
example a system with a physical IOMMU running a VM with a virtual IOMMU. The
implementation of the virtual IOMMU is responsible for updating the physical
DMA Remapping table (DMAR) every time the DMA mapping changes. This must happen
through the VFIO framework on the host as this is the only userspace interface
to interact with a physical IOMMU.

Relying on this update mechanism, it is possible to attach physical devices to
the virtual IOMMU, which allows these devices to be passed from L1 to another
layer of virtualization.

## Why virtio-iommu?

The Cloud Hypervisor project decided to implement the brand new virtio-iommu
device in order to provide a virtual IOMMU to its users. The reason is the
simplicity brought by the paravirtualized solution. By having one side
handled by the guest itself, it removes the complexity of trapping memory
page accesses and shadowing them. This is why the project will not try to
implement a full emulation of a physical IOMMU.

## Pre-requisites

### Kernel

Since virtio-iommu has only partially landed in version 5.3 of the Linux
kernel, a special branch is needed to get things working with Cloud Hypervisor.
"Partially" refers to x86 specifically, as it is already fully functional on
ARM architectures.

## Usage

In order to expose a virtual IOMMU to the guest, it is required to create a
virtio-iommu device and expose it through the ACPI IORT table. This is
achieved simply by attaching at least one device to the virtual IOMMU.

To expose a specific device to the guest as sitting behind this IOMMU,
explicitly tag it on the command line with the option `iommu=on`.

Not all devices support this extra option, and the default value will always
be `off` since we want to avoid the performance impact for most users who don't
need this.

Refer to the command line `--help` to find out which devices can be attached
to the virtual IOMMU.

Below is a simple example exposing the `virtio-blk` device as attached to the
virtual IOMMU:

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=512M \
    --disk path=focal-server-cloudimg-amd64.raw,iommu=on \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw"
```

From a guest perspective, it is easy to verify whether the device is protected
by the virtual IOMMU. Check the directories listed under
`/sys/kernel/iommu_groups`:

```bash
ls /sys/kernel/iommu_groups
0
```

In this case, only one IOMMU group should be created. Under this group, it is
possible to find the b/d/f of the device(s) that are part of this group.

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:03.0
```

You can then validate that the device is the one we expect by running `lspci`:

```bash
lspci
00:00.0 Host bridge: Intel Corporation Device 0d57
00:01.0 Unassigned class [ffff]: Red Hat, Inc. Device 1057
00:02.0 Unassigned class [ffff]: Red Hat, Inc. Virtio console
00:03.0 Mass storage controller: Red Hat, Inc. Virtio block device
00:04.0 Unassigned class [ffff]: Red Hat, Inc. Virtio RNG
```
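
The checks above can be combined into a small helper. The following is a
minimal sketch, not part of Cloud Hypervisor: the function name
`list_iommu_groups` and its optional root-directory argument are assumptions
introduced for illustration. It walks `/sys/kernel/iommu_groups` and prints one
line per device found in each group.

```bash
# List every IOMMU group and the PCI devices (b/d/f) it contains.
# The optional first argument overrides the sysfs root (hypothetical
# convenience, e.g. for testing); it defaults to the real sysfs path.
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local group dev
    for group in "$root"/*/; do
        # Skip the unexpanded pattern when there are no IOMMU groups.
        [ -d "$group" ] || continue
        for dev in "$group"devices/*; do
            # Skip the unexpanded pattern when a group has no devices.
            [ -e "$dev" ] || continue
            printf 'group %s: %s\n' "$(basename "$group")" "$(basename "$dev")"
        done
    done
}
```

Called without arguments in the guest from the example above, this would be
expected to print `group 0: 0000:00:03.0`.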

### Work with FDT on AArch64

On the AArch64 architecture, the virtual IOMMU can still be used even if ACPI
is not enabled. But the effect is different from what the aforementioned test
showed.

When ACPI is disabled, the virtual IOMMU is supported through the Flattened
Device Tree (FDT). In this case, the guest kernel cannot tell which devices
should be IOMMU-attached and which should not. No matter how many devices you
attach to the virtual IOMMU by setting the `iommu=on` option, all the devices
on the PCI bus will be attached to the virtual IOMMU (except the IOMMU itself).
Each of the devices will be added to its own IOMMU group.

As a result, the directory content of `/sys/kernel/iommu_groups` would be:

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:02.0
ls /sys/kernel/iommu_groups/1/devices/
0000:00:03.0
ls /sys/kernel/iommu_groups/2/devices/
0000:00:04.0
```

## Faster mappings

By default, the guest memory is mapped with 4k pages and no huge pages, which
causes the virtual IOMMU device to be asked for 4k mappings only. This
configuration slows down the setup of the physical IOMMU as a large number
of requests need to be issued in order to create large mappings.

One use case is even more impacted by the slowdown: the nested VFIO case. When
passing a device through to an L2 guest, the VFIO driver running in L1 will
update the DMAR entries for the specific device. Because VFIO pins the entire
guest memory, this means the entire mapping of the L2 guest needs to be stored
as multiple 4k mappings. Obviously, the bigger the L2 guest RAM is, the longer
the update of the mappings will take. There is an additional problem in this
case: if the L2 guest RAM is quite large, it will require a large number of
mappings, which might exceed the VFIO limit set on the host. The default
value is 65536, which can easily be reached with 256MiB of RAM.

The way to solve both problems, the slowdown and the limit being exceeded, is
to reduce the number of requests needed to describe those same large mappings.
This can be achieved by using 2MiB pages, known as huge pages. By seeing the
guest RAM as larger pages, and because the virtual IOMMU device supports it,
the guest will require fewer mappings, which prevents the limit from being
exceeded and also takes less time to process on the host. That's how using
huge pages as much as possible can speed up VM boot time.
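
To get a feel for the numbers, the sketch below counts how many mappings are
needed to cover a given amount of RAM with a given page size. The helper name
`mappings_needed` is hypothetical, and the calculation assumes one mapping per
page, which is the worst case described above.

```bash
# Number of fixed-size mappings needed to cover a RAM region.
# Both arguments are in bytes; the result is rounded up.
# Assumption: one mapping per page (worst case).
mappings_needed() {
    local ram_bytes="$1" page_bytes="$2"
    echo $(( (ram_bytes + page_bytes - 1) / page_bytes ))
}

ram=$(( 256 * 1024 * 1024 ))                    # 256MiB of guest RAM
mappings_needed "$ram" $(( 4 * 1024 ))          # 4k pages: prints 65536
mappings_needed "$ram" $(( 2 * 1024 * 1024 ))   # 2MiB pages: prints 128
```

With 4k pages, 256MiB of RAM already needs 65536 mappings, while the same
region is covered by 128 mappings of 2MiB each.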

### Basic usage

Let's look at an example of how to run a guest with huge pages.

First, make sure your system has enough pages to cover the entire guest RAM:

```bash
# This example creates 4096 hugepages
echo 4096 > /proc/sys/vm/nr_hugepages
```
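
How many huge pages are enough depends on the guest RAM size. The sketch
below assumes 2MiB huge pages and uses hypothetical helper names that are not
part of any tool: `hugepages_needed` computes the number of pages required for
a guest of a given size, and `hugepages_free` reads the number of currently
free huge pages from `/proc/meminfo`.

```bash
# Number of 2MiB huge pages needed to back a guest of the given size in MiB.
# Assumption: the host huge page size is 2MiB.
hugepages_needed() {
    local guest_mib="$1" page_mib=2
    echo $(( (guest_mib + page_mib - 1) / page_mib ))
}

# Print the number of free huge pages reported by a meminfo-style file.
# The optional argument overrides the file path (defaults to /proc/meminfo).
hugepages_free() {
    local meminfo="${1:-/proc/meminfo}"
    awk '/^HugePages_Free:/ { print $2 }' "$meminfo"
}
```

For instance, `hugepages_needed 8192` prints `4096`, which is the number of
pages needed to back an 8G guest entirely with 2MiB huge pages.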

The next step is simply to create the VM. Two things are important: first, we
want the VM RAM to be mapped on huge pages by backing it with `/dev/hugepages`.
Second, we need to create some huge pages in the guest itself so they can be
consumed.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw hugepagesz=2M hugepages=2048" \
    --net tap=,mac=,iommu=on
```

### Nested usage

Let's now look at the specific example of nested virtualization. In order to
reach optimal performance, the L2 guest also needs to be mapped with huge
pages. Here is how to achieve this, assuming the physical device you are
passing through is `0000:00:01.0`.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw kvm-intel.nested=1 vfio_iommu_type1.allow_unsafe_interrupts hugepagesz=2M hugepages=2048" \
    --device path=/sys/bus/pci/devices/0000:00:01.0,iommu=on
```

Once the L1 VM is running, unbind the device from the default driver in the
guest, and bind it to VFIO (it should appear as `0000:00:04.0`).

```bash
echo 0000:00:04.0 > /sys/bus/pci/devices/0000\:00\:04.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/bind
```
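
The `8086 1502` pair written to `new_id` is the vendor and device ID of the
device used in this example, so it will differ for other hardware. As a
sketch, the IDs can be read from the `vendor` and `device` files that sysfs
exposes for each PCI device. The helper name `pci_ids` and its optional second
argument are hypothetical and only serve the illustration.

```bash
# Print "<vendor> <device>" (hex, without the 0x prefix) for a PCI device.
# $1: b/d/f of the device, e.g. 0000:00:04.0
# $2: sysfs devices directory (optional, defaults to the real sysfs path)
pci_ids() {
    local bdf="$1" root="${2:-/sys/bus/pci/devices}"
    local vendor device
    vendor=$(cat "$root/$bdf/vendor") || return 1
    device=$(cat "$root/$bdf/device") || return 1
    # sysfs reports the IDs as 0x-prefixed hex; strip the prefix to get
    # the same form as the "8086 1502" pair used above.
    echo "${vendor#0x} ${device#0x}"
}
```

The output can then replace the hardcoded pair, for example
`pci_ids 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/new_id`.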

The last step is to start the L2 guest with the huge pages memory backend.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --device path=/sys/bus/pci/devices/0000:00:04.0
```

### Dedicated IOMMU PCI segments

To facilitate hotplug of devices that need to be behind an IOMMU, it is
possible to mark entire PCI segments as behind the IOMMU.

This is accomplished through `--platform
num_pci_segments=<number_of_segments>,iommu_segments=<range of segments>` or
via the equivalents in `PlatformConfig` for the API.

For example:

```bash
./cloud-hypervisor \
    --api-socket=/tmp/api \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --platform num_pci_segments=2,iommu_segments=1
```

This adds a second PCI segment to the platform behind the IOMMU. A VFIO device
requiring the IOMMU may then be hotplugged:

```bash
./ch-remote --api-socket=/tmp/api add-device path=/sys/bus/pci/devices/0000:00:04.0,iommu=on,pci_segment=1
```

Devices that cannot be placed behind an IOMMU (e.g. lacking an `iommu=` option)
cannot be placed on the IOMMU segments.
269