# Virtual IOMMU

## Rationales

Exposing a virtual IOMMU to the guest can be useful to support specific use
cases. That being said, it is always important to keep in mind that a virtual
IOMMU can impact the performance of the attached devices, which is why one
should be careful when enabling this feature.

### Protect nested virtual machines

The first reason why one might want to expose a virtual IOMMU to the guest is
to increase the security of the memory accesses performed by the virtual
devices (VIRTIO devices) on behalf of the guest drivers.

With a virtual IOMMU, the VMM stands between the guest driver and its device
counterpart, validating and translating every address before trying to access
the guest memory. This is standard interposition, performed here by the VMM.

The increased security does not apply to the simple case of one VM per VMM.
Because the guest cannot be trusted, as we always consider it could be
malicious and gain unauthorized privileges inside the VM, preventing some
devices from accessing the entire guest memory is pointless.

But let's take the interesting case of nested virtualization, and let's assume
we have a VMM running a first layer VM. This L1 guest is fully trusted, as the
user intends to run multiple VMs from this L1. We can end up with multiple L2
VMs running on a single L1 VM. In this particular case, and without exposing a
virtual IOMMU to the L1 guest, it would be possible for any L2 guest to use the
device implementation from the host VMM to access the entire L1 guest memory.
The virtual IOMMU prevents this kind of problem, as it validates the addresses
each device is authorized to access.

### Achieve VFIO nested

Another reason for having a virtual IOMMU is to allow passing physical devices
from the host through multiple layers of virtualization. Let's take as an
example a system with a physical IOMMU running a VM with a virtual IOMMU. The
implementation of the virtual IOMMU is responsible for updating the physical
DMA Remapping table (DMAR) every time the DMA mapping changes. This must happen
through the VFIO framework on the host, as this is the only userspace interface
to interact with a physical IOMMU.

Relying on this update mechanism, it is possible to attach physical devices to
the virtual IOMMU, which allows these devices to be passed from L1 to another
layer of virtualization.

## Why virtio-iommu?

The Cloud Hypervisor project decided to implement the brand new virtio-iommu
device in order to provide a virtual IOMMU to its users, the reason being the
simplicity brought by this paravirtualized solution. By having one side handled
by the guest itself, it removes the complexity of trapping memory page accesses
and shadowing them. This is why the project will not try to implement a full
emulation of a physical IOMMU.

## Pre-requisites

### Kernel

Since virtio-iommu has only partially landed in version 5.3 of the Linux
kernel, a special branch is needed to get things working with Cloud Hypervisor.
By partially, we mean x86 specifically, as the support is already fully
functional for ARM architectures.

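As a quick sanity check, you can verify that the guest kernel you plan to boot
was built with virtio-iommu support. This is only a suggested check; the
location of the kernel config varies between distributions and custom builds:

```bash
# Inside the guest (or against the config used to build custom-vmlinux):
# CONFIG_VIRTIO_IOMMU must be enabled for the guest to drive the virtual IOMMU.
grep CONFIG_VIRTIO_IOMMU "/boot/config-$(uname -r)" \
    || zcat /proc/config.gz | grep CONFIG_VIRTIO_IOMMU
```
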
## Usage

In order to expose a virtual IOMMU to the guest, it is required to create a
virtio-iommu device and expose it through the ACPI IORT table. This can simply
be achieved by attaching at least one device to the virtual IOMMU.

The way to expose a specific device to the guest as sitting behind this IOMMU
is to explicitly tag it from the command line with the option `iommu=on`.

Not all devices support this extra option, and the default value will always
be `off` since we want to avoid the performance impact for most users who don't
need this.

Refer to the command line `--help` to find out which devices support being
attached to the virtual IOMMU.

Below is a simple example exposing the `virtio-blk` device as attached to the
virtual IOMMU:

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=512M \
    --disk path=focal-server-cloudimg-amd64.raw,iommu=on \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw"
```

From a guest perspective, it is easy to verify if the device is protected by
the virtual IOMMU. Check the directories listed under
`/sys/kernel/iommu_groups`:

```bash
ls /sys/kernel/iommu_groups
0
```

In this case, only one IOMMU group should be created. Under this group, it is
possible to find out the B/D/F of the device(s) that are part of this group.

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:03.0
```

And you can validate the device is the one we expect by running `lspci`:

```bash
lspci
00:00.0 Host bridge: Intel Corporation Device 0d57
00:01.0 Unassigned class [ffff]: Red Hat, Inc. Device 1057
00:02.0 Unassigned class [ffff]: Red Hat, Inc. Virtio console
00:03.0 Mass storage controller: Red Hat, Inc. Virtio block device
00:04.0 Unassigned class [ffff]: Red Hat, Inc. Virtio RNG
```

### Work with FDT on AArch64

On the AArch64 architecture, the virtual IOMMU can still be used even if ACPI
is not enabled, but the effect is different from what the aforementioned test
showed.

When ACPI is disabled, the virtual IOMMU is supported through the Flattened
Device Tree (FDT). In this case, the guest kernel cannot tell which devices
should be IOMMU-attached and which should not. No matter how many devices you
attach to the virtual IOMMU by setting the `iommu=on` option, all the devices
on the PCI bus will be attached to the virtual IOMMU (except the IOMMU itself).
Each of the devices will be added to an IOMMU group.

As a result, the directory content of `/sys/kernel/iommu_groups` would be:

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:02.0
ls /sys/kernel/iommu_groups/1/devices/
0000:00:03.0
ls /sys/kernel/iommu_groups/2/devices/
0000:00:04.0
```

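Whether the groups come from ACPI or FDT, you can also look up the group of a
particular device directly from its sysfs entry. As an illustration, assuming
the Virtio block device at `0000:00:03.0` from the examples above:

```bash
# The iommu_group symlink under a device's sysfs entry points back to the
# group it belongs to; the basename of the target is the group number.
readlink /sys/bus/pci/devices/0000:00:03.0/iommu_group
```
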
## Faster mappings

By default, the guest memory is mapped with 4k pages and no huge pages, which
causes the virtual IOMMU device to be asked for 4k mappings only. This
configuration slows down the setup of the physical IOMMU, as a large number of
requests needs to be issued in order to create large mappings.

One use case is even more impacted by this slowdown: the nested VFIO case. When
passing a device through to an L2 guest, the VFIO driver running in L1 will
update the DMAR entries for the specific device. Because VFIO pins the entire
guest memory, this means the entire mapping of the L2 guest needs to be stored
as multiple 4k mappings. Obviously, the bigger the L2 guest RAM is, the longer
the update of the mappings will last.

There is an additional problem in this case: if the L2 guest RAM is quite
large, it will require a large number of mappings, which might exceed the VFIO
limit set on the host. The default value is 65536 mappings, which is reached
with as little as 256MiB of RAM when using 4k pages.

The way to solve both problems, the slowdown and the limit being exceeded, is
to reduce the number of requests needed to describe those same large mappings.
This can be achieved by using 2MiB pages, known as huge pages. By seeing the
guest RAM as larger pages, and because the virtual IOMMU device supports it,
the guest will require fewer mappings, which prevents the limit from being
exceeded and also takes less time to process on the host. That is how using
huge pages as much as possible can speed up VM boot time.

### Basic usage

Let's look at an example of how to run a guest with huge pages.

First, make sure your system has enough pages to cover the entire guest RAM:

```bash
# This example creates 4096 hugepages
echo 4096 > /proc/sys/vm/nr_hugepages
```

The next step is simply to create the VM. Two things are important: first, we
want the VM RAM to be mapped on huge pages by backing it with `/dev/hugepages`;
second, we need to create some huge pages in the guest itself so they can be
consumed.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw hugepagesz=2M hugepages=2048" \
    --net tap=,mac=,iommu=on
```

### Nested usage

Let's now look at the specific example of nested virtualization. In order to
reach optimal performance, the L2 guest also needs to be backed by huge pages.
Here is how to achieve this, assuming the physical device you are passing
through is `0000:00:01.0`.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw kvm-intel.nested=1 vfio_iommu_type1.allow_unsafe_interrupts hugepagesz=2M hugepages=2048" \
    --device path=/sys/bus/pci/devices/0000:00:01.0,iommu=on
```

Once the L1 VM is running, unbind the device from the default driver in the
guest, and bind it to VFIO (it should appear as `0000:00:04.0`).

```bash
echo 0000:00:04.0 > /sys/bus/pci/devices/0000\:00\:04.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/bind
```

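Before moving on, it can be worth double checking in the L1 guest that the
`vfio-pci` driver actually claimed the device. This is only a suggested
verification step, reusing the `0000:00:04.0` address from above:

```bash
# "Kernel driver in use: vfio-pci" should appear in the lspci output, and a
# VFIO group node should now exist under /dev/vfio/.
lspci -nnk -s 00:04.0
ls /dev/vfio/
```
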
The last thing is to start the L2 guest with the huge pages memory backend.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --device path=/sys/bus/pci/devices/0000:00:04.0
```

### Dedicated IOMMU PCI segments

To facilitate hotplug of devices that require being behind an IOMMU, it is
possible to mark entire PCI segments as being behind the IOMMU.

This is accomplished through `--platform
num_pci_segments=<number_of_segments>,iommu_segments=<range of segments>` or
via the equivalents in `PlatformConfig` for the API.

e.g.

```bash
./cloud-hypervisor \
    --api-socket=/tmp/api \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --platform num_pci_segments=2,iommu_segments=1
```

This adds a second PCI segment to the platform, placed behind the IOMMU. A VFIO
device requiring the IOMMU may then be hotplugged:

e.g.

```bash
./ch-remote --api-socket=/tmp/api add-device path=/sys/bus/pci/devices/0000:00:04.0,iommu=on,pci_segment=1
```

Devices that cannot be placed behind an IOMMU (e.g. lacking an `iommu=` option)
cannot be placed on the IOMMU segments.

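If the hotplugged device needs to be removed later, one option is to give it an
identifier at hotplug time and remove it by that identifier. The `id` value
used below (`vfio1`) is just an arbitrary example:

```bash
# Hotplug the device with an explicit id...
./ch-remote --api-socket=/tmp/api add-device path=/sys/bus/pci/devices/0000:00:04.0,iommu=on,pci_segment=1,id=vfio1
# ...and remove it again by referencing that id.
./ch-remote --api-socket=/tmp/api remove-device vfio1
```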