# Virtual IOMMU

## Rationales

Exposing a virtual IOMMU to the guest can be useful to support specific use
cases. That being said, it is always important to keep in mind that a virtual
IOMMU can impact the performance of the attached devices, which is why one
should be careful when enabling this feature.

### Protect nested virtual machines

The first reason why one might want to expose a virtual IOMMU to the guest is
to increase the security of the memory accesses performed by the virtual
devices (VIRTIO devices) on behalf of the guest drivers.

With a virtual IOMMU, the VMM stands between the guest driver and its device
counterpart, validating and translating every address before attempting to
access the guest memory. This is standard interposition performed by the VMM.

The increased security does not apply to the simple case of one VM per VMM.
Because the guest cannot be trusted (we always consider it could be malicious
and gain unauthorized privileges inside the VM), preventing some devices from
accessing the entire guest memory is pointless.

But let's take the interesting case of nested virtualization, and let's assume
we have a VMM running a first layer VM. This L1 guest is fully trusted, as the
user intends to run multiple VMs from this L1. We can end up with multiple L2
VMs running on a single L1 VM. In this particular case, and without exposing a
virtual IOMMU to the L1 guest, it would be possible for any L2 guest to use the
device implementation from the host VMM to access the entire L1 guest memory.
The virtual IOMMU prevents this kind of problem, as it validates the addresses
each device is authorized to access.

### Achieve VFIO nested

Another reason for having a virtual IOMMU is to allow passing physical devices
from the host through multiple layers of virtualization. Take, as an example,
a system with a physical IOMMU running a VM with a virtual IOMMU. The
implementation of the virtual IOMMU is responsible for updating the physical
DMA Remapping table (DMAR) every time the DMA mapping changes. This must happen
through the VFIO framework on the host, as this is the only userspace interface
to interact with a physical IOMMU.

Relying on this update mechanism, it is possible to attach physical devices to
the virtual IOMMU, which allows these devices to be passed from L1 to another
layer of virtualization.

## Why virtio-iommu?

The Cloud Hypervisor project decided to implement the brand new virtio-iommu
device in order to provide a virtual IOMMU to its users, the reason being the
simplicity brought by this paravirtualized solution. By having one side handled
by the guest itself, it removes the complexity of trapping memory page accesses
and shadowing them. This is why the project will not try to implement a full
emulation of a physical IOMMU.

## Pre-requisites

### Kernel

Since virtio-iommu has only partially landed in version 5.3 of the Linux
kernel, a special branch is needed to get things working with Cloud Hypervisor.
By partially, we mean x86 specifically, as the support is already fully
functional for ARM architectures.

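To check whether a given guest kernel has the virtio-iommu driver at all, one
simple sanity check (assuming the kernel configuration is available under
`/boot`, which depends on your distribution) is to look for the
`CONFIG_VIRTIO_IOMMU` option:

```bash
# Adjust the path to wherever your kernel configuration lives.
grep CONFIG_VIRTIO_IOMMU /boot/config-$(uname -r)
```
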
## Usage

In order to expose a virtual IOMMU to the guest, it is required to create a
virtio-iommu device and expose it through the ACPI IORT table. This can be
simply achieved by attaching at least one device to the virtual IOMMU.

The way to expose a specific device to the guest as sitting behind this IOMMU
is to explicitly tag it from the command line with the option `iommu=on`.

Not all devices support this extra option, and the default value will always
be `off`, since we want to avoid the performance impact for most users who
don't need this.

Refer to the command line `--help` to find out which devices support being
attached to the virtual IOMMU.

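For instance, one quick way to spot those devices (simply grepping over the
help output, no dedicated flag is assumed here) is:

```bash
./cloud-hypervisor --help | grep -i iommu
```
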
Below is a simple example exposing the `virtio-blk` device as attached to the
virtual IOMMU:

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=512M \
    --disk path=focal-server-cloudimg-amd64.raw,iommu=on \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw"
```

From a guest perspective, it is easy to verify whether the device is protected
by the virtual IOMMU. Check the directories listed under
`/sys/kernel/iommu_groups`:

```bash
ls /sys/kernel/iommu_groups
0
```

In this case, only one IOMMU group should be created. Under this group, it is
possible to find out the b/d/f of the device(s) that are part of this group.

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:03.0
```

And you can validate the device is the one we expect by running `lspci`:

```bash
lspci
00:00.0 Host bridge: Intel Corporation Device 0d57
00:01.0 Unassigned class [ffff]: Red Hat, Inc. Device 1057
00:02.0 Unassigned class [ffff]: Red Hat, Inc. Virtio console
00:03.0 Mass storage controller: Red Hat, Inc. Virtio block device
00:04.0 Unassigned class [ffff]: Red Hat, Inc. Virtio RNG
```

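The mapping can also be checked the other way around, starting from the
device's b/d/f and following its `iommu_group` symlink in sysfs:

```bash
# Should resolve to iommu_groups/0 for the virtio-block device above.
readlink /sys/bus/pci/devices/0000:00:03.0/iommu_group
```
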
## Faster mappings

By default, the guest memory is mapped with 4k pages and no huge pages, which
causes the virtual IOMMU device to be asked for 4k mappings only. This
configuration slows down the setup of the physical IOMMU, as a large number of
requests needs to be issued in order to create large mappings.

One use case is even more impacted by this slowdown: the nested VFIO case. When
passing a device through to an L2 guest, the VFIO driver running in L1 will
update the DMAR entries for the specific device. Because VFIO pins the entire
guest memory, the entire mapping of the L2 guest needs to be stored as multiple
4k mappings. Obviously, the bigger the L2 guest RAM is, the longer the update
of the mappings will last. There is an additional problem in this case: if the
L2 guest RAM is quite large, it requires a large number of mappings, which
might exceed the VFIO limit set on the host. The default value is 65536, which
can be reached with as little as 256MiB of RAM (256MiB / 4k = 65536 mappings).

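On recent kernels, this limit corresponds to the `dma_entry_limit` parameter of
the `vfio_iommu_type1` module (an assumption worth verifying on your specific
host). As a workaround, it can be inspected and raised:

```bash
# Current maximum number of user DMA mappings per VFIO container.
cat /sys/module/vfio_iommu_type1/parameters/dma_entry_limit
# One way to raise it (example value), taken into account the next time the
# module is loaded:
echo "options vfio_iommu_type1 dma_entry_limit=262144" > /etc/modprobe.d/vfio.conf
```
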
The way to solve both problems, the slowdown and the limit being exceeded, is
to reduce the number of requests needed to describe those same large mappings.
This can be achieved by using 2MiB pages, known as huge pages. By seeing the
guest RAM as larger pages, and because the virtual IOMMU device supports it,
the guest will require fewer mappings, which prevents the limit from being
exceeded and also takes less time to process them on the host. For instance, an
8GiB guest needs 2,097,152 mappings with 4k pages but only 4,096 with 2MiB
pages. That's how using huge pages as much as possible can speed up the VM boot
time.

### Basic usage

Let's look at an example of how to run a guest with huge pages.

First, make sure your system has enough pages to cover the entire guest RAM:

```bash
# This example creates 4096 huge pages
echo 4096 > /proc/sys/vm/nr_hugepages
```

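You can verify that the pages were actually allocated (assuming the default
2MiB huge page size, so 4096 pages cover 8GiB of guest RAM) by looking at
`/proc/meminfo`:

```bash
# HugePages_Total should report 4096 and Hugepagesize 2048 kB.
grep -i huge /proc/meminfo
```
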
The next step is simply to create the VM. Two things are important: first, we
want the VM RAM to be mapped on huge pages by backing it with `/dev/hugepages`;
and second, we need to create some huge pages in the guest itself so they can
be consumed.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw hugepagesz=2M hugepages=2048" \
    --net tap=,mac=,iommu=on
```

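Inside the guest, you can then check that the huge pages requested through the
kernel command line were actually created (assuming the 2MiB page size set by
`hugepagesz=2M`):

```bash
# Should report the 2048 pages requested with hugepages=2048.
cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
```
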
### Nested usage

Let's now look at the specific example of nested virtualization. In order to
reach optimal performance, the L2 guest also needs to be backed by huge pages.
Here is how to achieve this, assuming the physical device you are passing
through is `0000:00:01.0`.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw kvm-intel.nested=1 vfio_iommu_type1.allow_unsafe_interrupts hugepagesz=2M hugepages=2048" \
    --device path=/sys/bus/pci/devices/0000:00:01.0,iommu=on
```

Once the L1 VM is running, unbind the device from the default driver in the
guest, and bind it to VFIO (it should appear as `0000:00:04.0`).

```bash
echo 0000:00:04.0 > /sys/bus/pci/devices/0000\:00\:04.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/bind
```

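Note that `8086 1502` is only an example of a vendor and device ID pair; the
actual values for your device can be retrieved with `lspci -n -s 00:04.0`. Once
done, you can confirm that the device is now handled by the `vfio-pci` driver:

```bash
# "Kernel driver in use" should now report vfio-pci.
lspci -k -s 00:04.0
```
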
The last thing is to start the L2 guest with the huge pages memory backend.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --device path=/sys/bus/pci/devices/0000:00:04.0
```