# Virtual IOMMU

## Rationales

Having the possibility to expose a virtual IOMMU to the guest can be
interesting to support specific use cases. That being said, it is always
important to keep in mind that a virtual IOMMU can impact the performance of
the attached devices, which is why one should be careful when enabling this
feature.

### Protect nested virtual machines

The first reason why one might want to expose a virtual IOMMU to the guest is
to increase the security of the memory accesses performed by the virtual
devices (VIRTIO devices) on behalf of the guest drivers.

With a virtual IOMMU, the VMM stands between the guest driver and its device
counterpart, validating and translating every address before attempting to
access the guest memory. This is standard interposition, performed here by the
VMM.

The increased security does not apply to the simple case where we have one VM
per VMM. Because the guest cannot be trusted, as we always consider it could
be malicious and gain unauthorized privileges inside the VM, preventing some
devices from accessing the entire guest memory is pointless.

But let's take the interesting case of nested virtualization, and let's assume
we have a VMM running a first layer VM. This L1 guest is fully trusted as the
user intends to run multiple VMs from this L1. We can end up with multiple L2
VMs running on a single L1 VM. In this particular case, and without exposing a
virtual IOMMU to the L1 guest, it would be possible for any L2 guest to use the
device implementation from the host VMM to access the entire L1 guest memory.
The virtual IOMMU prevents this kind of attack, as it validates the addresses
the device is authorized to access.

### Achieve VFIO nested

Another reason for having a virtual IOMMU is to allow passing physical devices
from the host through multiple layers of virtualization. Let's take as an
example a system with a physical IOMMU running a VM with a virtual IOMMU. The
implementation of the virtual IOMMU is responsible for updating the physical
DMA Remapping table (DMAR) every time the DMA mapping changes. This must happen
through the VFIO framework on the host, as this is the only userspace interface
to interact with a physical IOMMU.
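For this to work, the physical IOMMU must be enabled on the host in the first
place (for instance with `intel_iommu=on` on the host kernel command line for
Intel platforms). A quick sanity check, assuming a standard Linux host, is to
verify that the kernel exposes IOMMU groups:

```bash
# A non-empty listing indicates the host IOMMU is enabled and usable by VFIO.
ls /sys/kernel/iommu_groups/
```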
Relying on this DMAR update mechanism, it is possible to attach physical
devices to the virtual IOMMU, which allows these devices to be passed from L1
to another layer of virtualization.

## Why virtio-iommu?

The Cloud Hypervisor project decided to implement the brand new virtio-iommu
device in order to provide a virtual IOMMU to its users, the reason being the
simplicity brought by this paravirtualized solution. By having one side handled
by the guest itself, it removes the complexity of trapping memory page accesses
and shadowing them. This is why the project will not try to implement a full
emulation of a physical IOMMU.

## Pre-requisites

### Kernel

As of kernel 5.14, virtio-iommu is available for both x86-64 and AArch64.

## Usage

In order to expose a virtual IOMMU to the guest, it is required to create a
virtio-iommu device and expose it through the ACPI IORT table. This can be
simply achieved by attaching at least one device to the virtual IOMMU.

The way to expose a specific device to the guest as sitting behind this IOMMU
is to explicitly tag it from the command line with the option `iommu=on`.

Not all devices support this extra option, and the default value will always
be `off`, since we want to avoid the performance impact for most users who
don't need this.

Refer to the command line `--help` to find out which devices support being
attached to the virtual IOMMU.

Below is a simple example exposing the `virtio-blk` device as attached to the
virtual IOMMU:

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=512M \
    --disk path=focal-server-cloudimg-amd64.raw,iommu=on \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw"
```

From a guest perspective, it is easy to verify if the device is protected by
the virtual IOMMU. Check the directories listed under
`/sys/kernel/iommu_groups`:

```bash
ls /sys/kernel/iommu_groups
0
```

In this case, only one IOMMU group should be created.
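Each PCI device also exposes an `iommu_group` symlink in sysfs, which gives a
quick per-device check (the `0000:00:03.0` address below corresponds to the
`virtio-blk` device from the example above, and the symlink should resolve to
group `0`):

```bash
# Resolve which IOMMU group the virtio-blk device belongs to.
readlink -f /sys/bus/pci/devices/0000:00:03.0/iommu_group
/sys/kernel/iommu_groups/0
```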
Under this group, it is possible to find out the B/D/F of the device(s) that
are part of this group.

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:03.0
```

And you can validate the device is the one we expect by running `lspci`:

```bash
lspci
00:00.0 Host bridge: Intel Corporation Device 0d57
00:01.0 Unassigned class [ffff]: Red Hat, Inc. Device 1057
00:02.0 Unassigned class [ffff]: Red Hat, Inc. Virtio console
00:03.0 Mass storage controller: Red Hat, Inc. Virtio block device
00:04.0 Unassigned class [ffff]: Red Hat, Inc. Virtio RNG
```

### Work with FDT on AArch64

On the AArch64 architecture, the virtual IOMMU can still be used even if ACPI
is not enabled, but the behavior differs from what the aforementioned test
showed.

When ACPI is disabled, the virtual IOMMU is supported through the Flattened
Device Tree (FDT). In this case, the guest kernel cannot tell which devices
should be IOMMU-attached and which should not. No matter how many devices you
attach to the virtual IOMMU by setting the `iommu=on` option, all the devices
on the PCI bus will be attached to the virtual IOMMU (except the IOMMU
itself). Each of these devices will be added to its own IOMMU group.

As a result, the directory content of `/sys/kernel/iommu_groups` would be:

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:02.0
ls /sys/kernel/iommu_groups/1/devices/
0000:00:03.0
ls /sys/kernel/iommu_groups/2/devices/
0000:00:04.0
```

## Faster mappings

By default, the guest memory is mapped with 4k pages and no huge pages, which
causes the virtual IOMMU device to be asked for 4k mappings only. This
configuration slows down the setup of the physical IOMMU, as a large number of
requests needs to be issued in order to create large mappings.

One use case is even more impacted by this slowdown: the nested VFIO case.
When passing a device through to an L2 guest, the VFIO driver running in L1
will update the DMAR entries for the specific device. Because VFIO pins the
entire guest memory, the entire mapping of the L2 guest needs to be stored as
multiple 4k mappings. Obviously, the bigger the L2 guest RAM is, the longer
the update of the mappings will last.
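To give a rough idea of the scale, here is a back-of-the-envelope calculation
assuming a 4GiB L2 guest (the 2MiB case anticipates the huge pages discussed
below):

```bash
# Number of IOMMU mappings needed to cover 4GiB of L2 guest RAM.
echo $((4 * 1024 * 1024 / 4))  # with 4KiB pages -> 1048576 mappings
echo $((4 * 1024 / 2))         # with 2MiB pages -> 2048 mappings
```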
There is an additional problem in this case: if the L2 guest RAM is quite
large, it will require a large number of mappings, which might exceed the VFIO
limit set on the host. The default value is 65536, which can already be
reached with only 256MiB of RAM.

The way to solve both problems, the slowdown and the limit being exceeded, is
to reduce the number of requests needed to describe those same large mappings.
This can be achieved by using 2MiB pages, known as huge pages. By seeing the
guest RAM as larger pages, and because the virtual IOMMU device supports it,
the guest will require fewer mappings, which prevents the limit from being
exceeded and also takes less time to process on the host. That's how using
huge pages as much as possible can speed up the VM boot time.

### Basic usage

Let's look at an example of how to run a guest with huge pages.

First, make sure your system has enough pages to cover the entire guest RAM:

```bash
# This example creates 4096 hugepages
echo 4096 > /proc/sys/vm/nr_hugepages
```

The next step is simply to create the VM. Two things are important: first, we
want the VM RAM to be mapped on huge pages by backing it with `/dev/hugepages`;
second, we need to create some huge pages in the guest itself so they can be
consumed.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw hugepagesz=2M hugepages=2048" \
    --net tap=,mac=,iommu=on
```

### Nested usage

Let's now look at the specific case of nested virtualization. In order to
reach optimal performance, the L2 guest memory also needs to be backed by huge
pages. Here is how to achieve this, assuming the physical device you are
passing through is `0000:00:01.0`.
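Note that the physical device must already be bound to the `vfio-pci` driver
on the host before it can be handed over to the L1 guest. If it is still bound
to its default driver, it can be switched over in the same way as shown for
the guest further below (the `8086 1502` vendor/device ID is only an example
and must match your actual device):

```bash
# On the host: detach the device from its current driver and hand it to vfio-pci.
modprobe vfio-pci
echo 0000:00:01.0 > /sys/bus/pci/devices/0000:00:01.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id
```

The L1 guest can then be started: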
```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw kvm-intel.nested=1 vfio_iommu_type1.allow_unsafe_interrupts hugepagesz=2M hugepages=2048" \
    --device path=/sys/bus/pci/devices/0000:00:01.0,iommu=on
```

Once the L1 VM is running, unbind the device from the default driver in the
guest, and bind it to VFIO (it should appear as `0000:00:04.0`).

```bash
echo 0000:00:04.0 > /sys/bus/pci/devices/0000:00:04.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/bind
```

The last step is to start the L2 guest with the huge pages memory backend.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --device path=/sys/bus/pci/devices/0000:00:04.0
```

## Dedicated IOMMU PCI segments

To facilitate hotplug of devices that require being behind an IOMMU, it is
possible to mark entire PCI segments as being behind the IOMMU.

This is accomplished through `--platform
num_pci_segments=<number_of_segments>,iommu_segments=<range of segments>` or
via the equivalents in `PlatformConfig` for the API.

e.g.

```bash
./cloud-hypervisor \
    --api-socket=/tmp/api \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --platform num_pci_segments=2,iommu_segments=1
```

This adds a second PCI segment to the platform, placed behind the IOMMU.
A VFIO device requiring the IOMMU may then be hotplugged:

e.g.

```bash
./ch-remote --api-socket=/tmp/api add-device path=/sys/bus/pci/devices/0000:00:04.0,iommu=on,pci_segment=1
```

Devices that cannot be placed behind an IOMMU (e.g. lacking an `iommu=` option)
cannot be placed on the IOMMU segments.
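Once the hotplug has succeeded, the device can be checked from inside the
guest in the same way as earlier in this document. Since it sits on PCI
segment 1, it shows up under PCI domain `0001`; the exact address below is
hypothetical and depends on the slot it was assigned:

```bash
# List devices with their full domain:bus:device.function address.
lspci -D | grep "^0001:"
# The hot-plugged device belongs to an IOMMU group, confirming it sits behind
# the virtual IOMMU.
readlink -f /sys/bus/pci/devices/0001:00:01.0/iommu_group
```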