# Virtual IOMMU

## Rationales

Exposing a virtual IOMMU to the guest can be useful to support specific use
cases. That being said, it is important to keep in mind that a virtual IOMMU
can impact the performance of the attached devices, which is why one should be
careful when enabling this feature.

### Protect nested virtual machines

The first reason why one might want to expose a virtual IOMMU to the guest is
to increase the security of the memory accesses performed by the virtual
devices (VIRTIO devices) on behalf of the guest drivers.

With a virtual IOMMU, the VMM stands between the guest driver and its device
counterpart, validating and translating every address before trying to access
the guest memory. This is standard interposition, performed here by the VMM.

The increased security does not apply to the simple case where we have one VM
per VMM. Because the guest cannot be trusted, as we always consider it could
be malicious and gain unauthorized privileges inside the VM, preventing some
devices from accessing the entire guest memory is pointless.

But let's take the interesting case of nested virtualization, and let's assume
we have a VMM running a first layer VM. This L1 guest is fully trusted as the
user intends to run multiple VMs from this L1. We can end up with multiple L2
VMs running on a single L1 VM. In this particular case, and without exposing a
virtual IOMMU to the L1 guest, it would be possible for any L2 guest to use the
device implementation from the host VMM to access the entire guest L1 memory.
The virtual IOMMU prevents this kind of trouble, as it validates the addresses
the device is authorized to access.

### Achieve VFIO nested

Another reason for having a virtual IOMMU is to allow passing physical devices
from the host through multiple layers of virtualization. Let's take as an
example a system with a physical IOMMU running a VM with a virtual IOMMU. The
implementation of the virtual IOMMU is responsible for updating the physical
DMA Remapping table (DMAR) every time the DMA mapping changes. This must happen
through the VFIO framework on the host, as this is the only userspace interface
to interact with a physical IOMMU.

Relying on this update mechanism, it is possible to attach physical devices to
the virtual IOMMU, which allows these devices to be passed from L1 to another
layer of virtualization.

## Why virtio-iommu?

The Cloud Hypervisor project decided to implement the virtio-iommu device in
order to provide a virtual IOMMU to its users. The reason is the simplicity
brought by the paravirtualized solution. Having one side handled by the guest
itself removes the complexity of trapping memory page accesses and shadowing
them. This is why the project will not try to implement a full emulation of a
physical IOMMU.

## Pre-requisites

### Kernel

As of Linux kernel 5.14, virtio-iommu is available for both x86-64 and AArch64.

## Usage

In order to expose a virtual IOMMU to the guest, it is required to create a
virtio-iommu device and expose it through the ACPI IORT table. This is
achieved simply by attaching at least one device to the virtual IOMMU.

To expose a specific device to the guest as sitting behind this IOMMU,
explicitly tag it on the command line with the option `iommu=on`.

Not all devices support this extra option, and the default value will always
be `off` since we want to avoid the performance impact for most users who don't
need this.

Refer to the command line `--help` to find out which devices support being
attached to the virtual IOMMU.

Below is a simple example exposing the `virtio-blk` device as attached to the
virtual IOMMU:

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=512M \
    --disk path=focal-server-cloudimg-amd64.raw,iommu=on \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw"
```

From a guest perspective, it is easy to verify whether the device is protected
by the virtual IOMMU. Check the directories listed under
`/sys/kernel/iommu_groups`:

```bash
ls /sys/kernel/iommu_groups
0
```

In this case, only one IOMMU group should be created. Under this group, it is
possible to find the b/d/f of the device(s) that are part of this group.

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:03.0
```

You can then validate that the device is the one we expect by running `lspci`:

```bash
lspci
00:00.0 Host bridge: Intel Corporation Device 0d57
00:01.0 Unassigned class [ffff]: Red Hat, Inc. Device 1057
00:02.0 Unassigned class [ffff]: Red Hat, Inc. Virtio console
00:03.0 Mass storage controller: Red Hat, Inc. Virtio block device
00:04.0 Unassigned class [ffff]: Red Hat, Inc. Virtio RNG
```
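
When several devices are attached, walking each group by hand gets tedious.
The following is a minimal sketch, not part of Cloud Hypervisor, that prints
every IOMMU group together with the b/d/f of its devices. The function name
`list_iommu_groups` and its optional root argument are assumptions made for
illustration.

```bash
# Print "<group> <b/d/f>" for every device in every IOMMU group.
# The optional argument overrides the sysfs root, which is only useful for
# testing (hypothetical helper, not a Cloud Hypervisor tool).
list_iommu_groups() {
    local root="${1:-/sys/kernel/iommu_groups}"
    local dev group
    for dev in "$root"/*/devices/*; do
        # An unmatched glob stays literal; skip it when no group exists.
        [ -e "$dev" ] || continue
        group="${dev%/devices/*}"
        printf '%s %s\n' "${group##*/}" "${dev##*/}"
    done
}

list_iommu_groups
```

On the guest from the example above, this should print a single line,
`0 0000:00:03.0`.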

### Work with FDT on AArch64

On the AArch64 architecture, the virtual IOMMU can still be used even if ACPI
is not enabled. But the effect is different from what the aforementioned test
showed.

When ACPI is disabled, the virtual IOMMU is supported through Flattened Device
Tree (FDT). In this case, the guest kernel cannot tell which devices should be
IOMMU-attached and which should not. No matter how many devices you attach to
the virtual IOMMU by setting the `iommu=on` option, all the devices on the PCI
bus will be attached to the virtual IOMMU (except the IOMMU itself). Each of
the devices will be added to an IOMMU group.

As a result, the directory content of `/sys/kernel/iommu_groups` would be:

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:02.0
ls /sys/kernel/iommu_groups/1/devices/
0000:00:03.0
ls /sys/kernel/iommu_groups/2/devices/
0000:00:04.0
```

## Faster mappings

By default, the guest memory is mapped with 4k pages and no huge pages, which
causes the virtual IOMMU device to be asked for 4k mappings only. This
configuration slows down the setup of the physical IOMMU, as a large number of
requests need to be issued in order to create large mappings.

One use case is even more impacted by the slowdown: the nested VFIO case. When
passing a device through to an L2 guest, the VFIO driver running in L1 will
update the DMAR entries for the specific device. Because VFIO pins the entire
guest memory, the entire mapping of the L2 guest needs to be described by
multiple 4k mappings. Obviously, the bigger the L2 guest RAM is, the longer the
update of the mappings will take. There is an additional problem in this case:
if the L2 guest RAM is quite large, it will require a large number of
mappings, which might exceed the VFIO limit set on the host. The default
value is 65536, which is already reached with 256MiB of RAM.

The way to solve both problems, the slowdown and the limit being exceeded, is
to reduce the number of requests needed to describe those same large mappings.
This can be achieved by using 2MiB pages, known as huge pages. By seeing the
guest RAM as larger pages, and because the virtual IOMMU device supports it,
the guest will require fewer mappings, which prevents the limit from being
exceeded and also takes less time to process on the host. That's how using
huge pages as much as possible can speed up VM boot time.
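
The arithmetic behind both problems can be sketched with plain shell integer
arithmetic. This assumes the worst case of one mapping per page, and
`mappings_needed` is a hypothetical helper written for illustration, not a
Cloud Hypervisor or VFIO tool.

```bash
# Number of mappings needed to cover a RAM size with a given page size,
# assuming one mapping per page (worst case). Both arguments are in KiB so
# that shell integer arithmetic is enough.
mappings_needed() {
    local ram_kib="$1" page_kib="$2"
    # Round up so a partially filled last page still counts as one mapping.
    echo $(( (ram_kib + page_kib - 1) / page_kib ))
}

mappings_needed $((256 * 1024)) 4            # 256MiB, 4k pages: prints 65536
mappings_needed $((8 * 1024 * 1024)) 4       # 8GiB, 4k pages: prints 2097152
mappings_needed $((8 * 1024 * 1024)) 2048    # 8GiB, 2MiB pages: prints 4096
```

With 4k pages, a 256MiB guest already needs 65536 mappings, while an 8GiB
guest backed by 2MiB pages needs only 4096.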

### Basic usage

Let's look at an example of how to run a guest with huge pages.

First, make sure your system has enough pages to cover the entire guest RAM:

```bash
# This example creates 4096 hugepages
echo 4096 > /proc/sys/vm/nr_hugepages
```
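
Before starting the VM, it can help to confirm that enough huge pages are
actually free. The sketch below reads the `HugePages_Free` field from
`/proc/meminfo` and compares it with what a guest of a given size would need.
The helper names `free_hugepages` and `enough_hugepages` are assumptions made
for illustration, as is the default huge page size of 2MiB.

```bash
# Number of free huge pages reported by the kernel.
# The optional argument overrides the meminfo path, which is only useful for
# testing (hypothetical helper).
free_hugepages() {
    awk '/^HugePages_Free:/ { print $2 }' "${1:-/proc/meminfo}"
}

# Succeeds when enough huge pages are free to back a guest of the given size.
# Arguments: guest RAM in MiB, huge page size in MiB (assumed to be 2 by
# default), meminfo path.
enough_hugepages() {
    local ram_mib="$1" page_mib="${2:-2}" meminfo="${3:-/proc/meminfo}"
    local needed=$(( (ram_mib + page_mib - 1) / page_mib ))
    local free
    free="$(free_hugepages "$meminfo")"
    # A missing field is treated as zero free huge pages.
    [ "${free:-0}" -ge "$needed" ]
}

if enough_hugepages 8192; then
    echo "enough huge pages for an 8G guest"
else
    echo "not enough free huge pages for an 8G guest"
fi
```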

The next step is simply to create the VM. Two things are important. First, we
want the VM RAM to be mapped on huge pages by backing it with `/dev/hugepages`.
Second, we need to create some huge pages in the guest itself so they can be
consumed.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw hugepagesz=2M hugepages=2048" \
    --net tap=,mac=,iommu=on
```

### Nested usage

Let's now look at the specific example of nested virtualization. In order to
reach optimal performance, the L2 guest also needs to be mapped with huge
pages. Here is how to achieve this, assuming the physical device you are
passing through is `0000:00:01.0`.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw kvm-intel.nested=1 vfio_iommu_type1.allow_unsafe_interrupts hugepagesz=2M hugepages=2048" \
    --device path=/sys/bus/pci/devices/0000:00:01.0,iommu=on
```

Once the L1 VM is running, unbind the device from the default driver in the
guest, and bind it to VFIO (it should appear as `0000:00:04.0`).

```bash
echo 0000:00:04.0 > /sys/bus/pci/devices/0000\:00\:04.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/bind
```
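
The `8086 1502` pair written to `new_id` is the vendor and device ID of the
device used in this example, so it will differ for other hardware. As a hedged
sketch, the IDs can be read from the `vendor` and `device` files that sysfs
exposes for each PCI device. The function name `pci_ids` and its optional root
argument are assumptions made for illustration.

```bash
# Print "<vendor> <device>" for a PCI device, in the same form as the
# new_id line above. The optional second argument overrides the sysfs root,
# which is only useful for testing (hypothetical helper).
pci_ids() {
    local bdf="$1" root="${2:-/sys/bus/pci/devices}"
    local vendor device
    vendor="$(cat "$root/$bdf/vendor")" || return 1
    device="$(cat "$root/$bdf/device")" || return 1
    # sysfs reports the IDs with a 0x prefix (e.g. 0x8086); strip it.
    printf '%s %s\n' "${vendor#0x}" "${device#0x}"
}

# Example, to be run in the L1 guest:
#   pci_ids 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/new_id
```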

The last step is to start the L2 guest with the huge pages memory backend.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --device path=/sys/bus/pci/devices/0000:00:04.0
```

### Dedicated IOMMU PCI segments

To facilitate hotplug of devices that need to be behind an IOMMU, it is
possible to mark entire PCI segments as being behind the IOMMU.

This is accomplished through `--platform
num_pci_segments=<number_of_segments>,iommu_segments=<range of segments>` or
via the equivalents in `PlatformConfig` for the API.

For example:

```bash
./cloud-hypervisor \
    --api-socket=/tmp/api \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --platform num_pci_segments=2,iommu_segments=1
```

This adds a second PCI segment to the platform behind the IOMMU. A VFIO device
requiring the IOMMU may then be hotplugged, for example:

```bash
./ch-remote --api-socket=/tmp/api add-device path=/sys/bus/pci/devices/0000:00:04.0,iommu=on,pci_segment=1
```

Devices that cannot be placed behind an IOMMU (e.g. lacking an `iommu=` option)
cannot be placed on the IOMMU segments.