# Virtual IOMMU

## Rationales

Exposing a virtual IOMMU to the guest can be useful for supporting specific
use cases. That being said, it is always important to keep in mind that a
virtual IOMMU can impact the performance of the attached devices, which is why
one should be careful when enabling this feature.

### Protect nested virtual machines

The first reason why one might want to expose a virtual IOMMU to the guest is
to increase the security of the memory accesses performed by the virtual
devices (VIRTIO devices) on behalf of the guest drivers.

With a virtual IOMMU, the VMM stands between the guest driver and its device
counterpart, validating and translating every address before attempting to
access the guest memory. This is standard interposition, performed here by the
VMM.

The increased security does not apply to the simple case of one VM per VMM.
Because the guest cannot be trusted, as we always consider it could be
malicious and gain unauthorized privileges inside the VM, preventing some
devices from accessing the entire guest memory is pointless.

But let's take the interesting case of nested virtualization, and let's
assume we have a VMM running a first layer VM. This L1 guest is fully trusted,
as the user intends to run multiple VMs from this L1. We can end up with
multiple L2 VMs running on a single L1 VM. In this particular case, and
without exposing a virtual IOMMU to the L1 guest, it would be possible for any
L2 guest to use the device implementation from the host VMM to access the
entire L1 guest memory. The virtual IOMMU prevents this kind of trouble, as it
validates the addresses each device is authorized to access.

### Achieve VFIO nested

Another reason for having a virtual IOMMU is to allow passing physical
devices from the host through multiple layers of virtualization. Let's take as
an example a system with a physical IOMMU running a VM with a virtual IOMMU.
The implementation of the virtual IOMMU is responsible for updating the
physical DMA Remapping table (DMAR) every time the DMA mapping changes. This
must happen through the VFIO framework on the host, as this is the only
userspace interface to interact with a physical IOMMU.

Relying on this update mechanism, it is possible to attach physical devices
to the virtual IOMMU, which allows these devices to be passed from L1 to
another layer of virtualization.

## Why virtio-iommu?

The Cloud Hypervisor project decided to implement the new virtio-iommu device
in order to provide a virtual IOMMU to its users, the reason being the
simplicity brought by this paravirtualized solution. By having one side
handled by the guest itself, it removes the complexity of trapping memory page
accesses and shadowing them. This is why the project will not try to implement
a full emulation of a physical IOMMU.

## Pre-requisites

### Kernel

As of kernel 5.14, virtio-iommu is available for both x86-64 and AArch64.

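You can quickly check whether the guest kernel was built with virtio-iommu
support (the config file location may vary across distributions):

```bash
# Look for CONFIG_VIRTIO_IOMMU=y (built-in) or =m (module)
grep CONFIG_VIRTIO_IOMMU /boot/config-$(uname -r)
```
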
## Usage

In order to expose a virtual IOMMU to the guest, it is required to create a
virtio-iommu device and expose it through the ACPI IORT table. This can simply
be achieved by attaching at least one device to the virtual IOMMU.

The way to expose a specific device to the guest as sitting behind this IOMMU
is to explicitly tag it from the command line with the option `iommu=on`.

Not all devices support this extra option, and the default value will always
be `off`, since we want to avoid the performance impact for most users who
don't need this.

Refer to the command line `--help` to find out which devices support being
attached to the virtual IOMMU.

Below is a simple example exposing the `virtio-blk` device as attached to the
virtual IOMMU:

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=512M \
    --disk path=focal-server-cloudimg-amd64.raw,iommu=on \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw"
```

From a guest perspective, it is easy to verify whether the device is protected
by the virtual IOMMU. Check the directories listed under
`/sys/kernel/iommu_groups`:

```bash
ls /sys/kernel/iommu_groups
0
```

In this case, only one IOMMU group should be created. Under this group, it is
possible to find the b/d/f of the device(s) that are part of this group.

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:03.0
```

And you can validate the device is the one we expect by running `lspci`:

```bash
lspci
00:00.0 Host bridge: Intel Corporation Device 0d57
00:01.0 Unassigned class [ffff]: Red Hat, Inc. Device 1057
00:02.0 Unassigned class [ffff]: Red Hat, Inc. Virtio console
00:03.0 Mass storage controller: Red Hat, Inc. Virtio block device
00:04.0 Unassigned class [ffff]: Red Hat, Inc. Virtio RNG
```

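The group membership can also be checked from the device side, through the
`iommu_group` symlink exposed in sysfs (a quick sanity check, using the same
b/d/f as above):

```bash
readlink /sys/bus/pci/devices/0000:00:03.0/iommu_group
# Expected output similar to: ../../../kernel/iommu_groups/0
```
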
### Work with FDT on AArch64

On the AArch64 architecture, the virtual IOMMU can still be used even if ACPI
is not enabled, but the effect differs from what the aforementioned test
showed.

When ACPI is disabled, the virtual IOMMU is supported through the Flattened
Device Tree (FDT). In this case, the guest kernel cannot tell which devices
should be IOMMU-attached and which should not. No matter how many devices you
attach to the virtual IOMMU with the `iommu=on` option, all the devices on the
PCI bus will be attached to the virtual IOMMU (except the IOMMU itself), and
each of them will be added to its own IOMMU group.

As a result, the directory content of `/sys/kernel/iommu_groups` would be:

```bash
ls /sys/kernel/iommu_groups/0/devices/
0000:00:02.0
ls /sys/kernel/iommu_groups/1/devices/
0000:00:03.0
ls /sys/kernel/iommu_groups/2/devices/
0000:00:04.0
```

## Faster mappings

By default, the guest memory is mapped with 4k pages and no huge pages, which
causes the virtual IOMMU device to be asked for 4k mappings only. This
configuration slows down the setup of the physical IOMMU, as a large number of
requests need to be issued in order to create large mappings.

One use case is even more impacted by this slowdown: nested VFIO. When
passing a device through to an L2 guest, the VFIO driver running in L1 will
update the DMAR entries for the specific device. Because VFIO pins the entire
guest memory, the entire mapping of the L2 guest needs to be stored in
multiple 4k mappings. Obviously, the bigger the L2 guest RAM is, the longer
the update of the mappings will take. An additional problem occurs in this
case: if the L2 guest RAM is quite large, it will require a large number of
mappings, which might exceed the VFIO limit set on the host. The default
value is 65536 mappings, which is already reached with only 256MiB of RAM
(65536 x 4KiB = 256MiB).

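To get a feel for the numbers, here is a quick back-of-the-envelope
calculation in shell arithmetic, assuming an 8GiB L2 guest:

```bash
# Mappings needed to cover 8GiB of guest RAM with 4KiB pages
echo $((8 * 1024 * 1024 / 4))   # 2097152 mappings, far above the 65536 limit

# Mappings needed to cover the same 8GiB with 2MiB huge pages
echo $((8 * 1024 / 2))          # 4096 mappings, well within the limit
```
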
The way to solve both problems, the slowdown and the limit being exceeded, is
to reduce the number of requests needed to describe those same large mappings.
This can be achieved by using 2MiB pages, known as huge pages. Because the
guest RAM is seen as larger pages, and because the virtual IOMMU device
supports it, the guest will require fewer mappings. This prevents the limit
from being exceeded, and it also takes less time to process the mappings on
the host. That is how using huge pages as much as possible can speed up the
VM boot time.

### Basic usage

Let's look at an example of how to run a guest with huge pages.

First, make sure your system has enough huge pages to cover the entire guest
RAM:
```bash
# This example creates 4096 hugepages (4096 x 2MiB = 8GiB, assuming the
# default 2MiB hugepage size on x86-64)
echo 4096 > /proc/sys/vm/nr_hugepages
```

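You can verify that the pages were actually reserved through `/proc/meminfo`:

```bash
grep HugePages_ /proc/meminfo
# HugePages_Total:    4096
# HugePages_Free:     4096
```
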
The next step is simply to create the VM. Two things are important: first, we
want the VM RAM to be mapped on huge pages by backing it with
`/dev/hugepages`; and second, we need to create some huge pages in the guest
itself so they can be consumed.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw hugepagesz=2M hugepages=2048" \
    --net tap=,mac=,iommu=on
```

### Nested usage

Let's now look at the specific example of nested virtualization. In order to
reach optimal performance, the L2 guest also needs to be mapped on huge pages.
Here is how to achieve this, assuming the physical device you are passing
through is `0000:00:01.0`.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=8G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw kvm-intel.nested=1 vfio_iommu_type1.allow_unsafe_interrupts hugepagesz=2M hugepages=2048" \
    --device path=/sys/bus/pci/devices/0000:00:01.0,iommu=on
```

Once the L1 VM is running, unbind the device from its default driver in the
guest, and bind it to VFIO (it should appear as `0000:00:04.0`). Make sure the
`vfio-pci` module is loaded if it is not built into the guest kernel.

```bash
echo 0000:00:04.0 > /sys/bus/pci/devices/0000\:00\:04.0/driver/unbind
echo 8086 1502 > /sys/bus/pci/drivers/vfio-pci/new_id
echo 0000:00:04.0 > /sys/bus/pci/drivers/vfio-pci/bind
```
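
Note that `8086 1502` is the vendor/device ID pair of the device used in this
example; replace it with the IDs of your own device, as reported by
`lspci -n`. You can double check that the device is now bound to `vfio-pci`:

```bash
# The 'Kernel driver in use' line should report vfio-pci
lspci -k -s 00:04.0
```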

The last thing is to start the L2 guest with the huge pages memory backend.

```bash
./cloud-hypervisor \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --device path=/sys/bus/pci/devices/0000:00:04.0
```

### Dedicated IOMMU PCI segments

To facilitate hotplug of devices that require being behind an IOMMU, it is
possible to mark entire PCI segments as being behind the IOMMU.

This is accomplished through `--platform
num_pci_segments=<number_of_segments>,iommu_segments=<range of segments>` or
via the equivalents in `PlatformConfig` for the API.

e.g.

```bash
./cloud-hypervisor \
    --api-socket=/tmp/api \
    --cpus boot=1 \
    --memory size=4G,hugepages=on \
    --disk path=focal-server-cloudimg-amd64.raw \
    --kernel custom-vmlinux \
    --cmdline "console=ttyS0 console=hvc0 root=/dev/vda1 rw" \
    --platform num_pci_segments=2,iommu_segments=1
```

This adds a second PCI segment to the platform, behind the IOMMU. A VFIO
device requiring the IOMMU may then be hotplugged:

e.g.

```bash
./ch-remote --api-socket=/tmp/api add-device path=/sys/bus/pci/devices/0000:00:04.0,iommu=on,pci_segment=1
```

Devices that cannot be placed behind an IOMMU (e.g. those lacking an `iommu=`
option) cannot be placed on the IOMMU segments.
