# Cloud Hypervisor VFIO HOWTO

VFIO (Virtual Function I/O) is a kernel framework that exposes direct device
access to userspace. `cloud-hypervisor`, like many VMMs, uses the VFIO
framework to assign host physical devices directly to guest workloads.

## Direct Device Assignment with Cloud Hypervisor

To assign a device to a `cloud-hypervisor` guest, the device needs to be managed
by the VFIO kernel drivers. However, by default, a host device is bound to
its native driver, which is not the VFIO one.

As a consequence, a device must be unbound from its native driver before being
passed to `cloud-hypervisor` for assignment to a guest.

### Example

In this example we are going to assign a PCI memory card (SD, MMC, etc.) reader
from the host to a Cloud Hypervisor guest.

`cloud-hypervisor` only supports assigning PCI devices to its guests. `lspci`
helps with identifying PCI devices on the host:

```
$ lspci
[...]
01:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
[...]
```

Here we see that our device is on bus 1, slot 0 and function 0 (`01:00.0`).
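Note that sysfs names PCI devices with a four-digit domain prefix, so the
`01:00.0` printed by `lspci` appears as `0000:01:00.0` under
`/sys/bus/pci/devices`. A minimal sketch of that mapping (the helper name is
hypothetical, for illustration only):

```shell
# Build the sysfs path for a device from the short BDF printed by lspci.
# lspci usually omits the PCI domain; sysfs always includes it (0000 on
# most single-domain hosts).
bdf_to_sysfs() {
    bdf="$1"
    case "$bdf" in
        *:*:*.*) ;;               # already domain-qualified
        *) bdf="0000:$bdf" ;;     # prepend the default domain
    esac
    printf '/sys/bus/pci/devices/%s\n' "$bdf"
}

bdf_to_sysfs 01:00.0    # -> /sys/bus/pci/devices/0000:01:00.0
```

This domain-qualified form is the one expected by the sysfs unbind/bind files
and by the `--device` option used later in this document.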

Now that we have identified the device, we must unbind it from its native driver
(`rtsx_pci`) and bind it to the VFIO driver instead (`vfio_pci`).

First we add VFIO support to the host:

```
# modprobe -r vfio_pci
# modprobe -r vfio_iommu_type1
# modprobe vfio_iommu_type1 allow_unsafe_interrupts
# modprobe vfio_pci
```

In case the VFIO drivers are built in, enable unsafe interrupts with:

```
# echo 1 > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
```

Then we unbind it from its native driver:

```
# echo 0000:01:00.0 > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind
```

And finally we bind it to the VFIO driver.
To do that, we first need to get the
device's vendor ID and device ID:

```
$ lspci -n -s 01:00.0
01:00.0 ff00: 10ec:525a (rev 01)

# echo 10ec 525a > /sys/bus/pci/drivers/vfio-pci/new_id
```

If you have more than one device with the same vendor ID/device ID, then
starting with the second device, the binding is performed as follows:

```
# echo 0000:02:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
```

Now the device is managed by the VFIO framework.

The final step is to give that device to `cloud-hypervisor` so that it can be
assigned to the guest. This is done with the `--device` command line option,
which takes the device's sysfs path as an argument.
In our example it is
`/sys/bus/pci/devices/0000:01:00.0/`:

```
./target/debug/cloud-hypervisor \
    --kernel ~/vmlinux \
    --disk path=~/focal-server-cloudimg-amd64.raw \
    --console off \
    --serial tty \
    --cmdline "console=ttyS0 root=/dev/vda1 rw" \
    --cpus 4 \
    --memory size=512M \
    --device path=/sys/bus/pci/devices/0000:01:00.0/
```

The guest kernel will then detect the card reader on its PCI bus and, provided
that support for this device is enabled, probe and enable it for the guest to
use.

In case you want to pass multiple devices, here is the correct syntax:

```
--device path=/sys/bus/pci/devices/0000:01:00.0/ path=/sys/bus/pci/devices/0000:02:00.0/
```

### Multiple devices in the same IOMMU group

There are cases where multiple devices can be found under the same IOMMU group.
This often happens with graphics cards that embed an audio controller.

```
$ lspci
[...]
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
[...]
```

This is usually exposed through `sysfs` as follows:

```
$ ls /sys/kernel/iommu_groups/22/devices/
0000:01:00.0  0000:01:00.1
```

This means these two devices are under the same IOMMU group 22. In such a case,
it is important to bind both devices to VFIO and pass them both to the
VM, otherwise this could cause functional and security issues.

### Advanced Configuration Options

When using NVIDIA GPUs in a VFIO passthrough configuration, advanced
configuration options are supported to enable GPUDirect P2P DMA over
PCIe. When enabled, loads and stores between GPUs use native PCIe
peer-to-peer transactions instead of a shared memory buffer. This drastically
decreases P2P latency between GPUs. This functionality is supported by
cloud-hypervisor on NVIDIA Turing, Ampere, Hopper, and Lovelace GPUs.

The NVIDIA driver does not enable GPUDirect P2P over PCIe within guests
by default because hardware support for routing P2P TLPs between PCIe root
ports is optional. PCIe P2P should always be supported between devices
on the same PCIe switch. The `x_nv_gpudirect_clique` config argument may
be used to signal support for PCIe P2P traffic between NVIDIA VFIO endpoints.
The guest driver assumes that P2P traffic is supported between all endpoints
that are part of the same clique.

```
--device path=/sys/bus/pci/devices/0000:01:00.0/,x_nv_gpudirect_clique=0
```

The following command can be run in the guest to verify that GPUDirect P2P is
correctly enabled:

```
nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       OK      OK      OK      OK      OK      OK      OK
 GPU1   OK      X       OK      OK      OK      OK      OK      OK
 GPU2   OK      OK      X       OK      OK      OK      OK      OK
 GPU3   OK      OK      OK      X       OK      OK      OK      OK
 GPU4   OK      OK      OK      OK      X       OK      OK      OK
 GPU5   OK      OK      OK      OK      OK      X       OK      OK
 GPU6   OK      OK      OK      OK      OK      OK      X       OK
 GPU7   OK      OK      OK      OK      OK      OK      OK      X
```

Some VFIO devices have a 32-bit mmio BAR. When using many such devices, it is
possible to exhaust the 32-bit mmio space available on a PCI segment. The
following example shows a device with a 16 MiB 32-bit mmio BAR:

```
lspci -s 0000:01:00.0 -v
0000:01:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
        [...]
        Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 46000000000 (64-bit, prefetchable) [size=64G]
        Memory at 48040000000 (64-bit, prefetchable) [size=32M]
        [...]
```

When using multiple PCI segments, the 32-bit mmio address space available to
VFIO devices is split equally between all PCI segments by default. This can be
tuned with the `--pci-segment` flag. The following example demonstrates a guest
with two PCI segments: 2/3 of the 32-bit mmio address space is available to
devices on PCI segment 0, and 1/3 of the 32-bit mmio address space is available
to devices on PCI segment 1.

```
--platform num_pci_segments=2
--pci-segment pci_segment=0,mmio32_aperture_weight=2
--pci-segment pci_segment=1,mmio32_aperture_weight=1
```
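
As a rough sketch of the arithmetic behind the weights, each segment receives
its weight divided by the sum of all weights. The aperture size used below is
purely illustrative, not a value Cloud Hypervisor guarantees:

```shell
# Illustrative only: split a hypothetical 32-bit mmio aperture between
# two PCI segments using the weights from the --pci-segment flags above.
total_mib=2048                 # assumed aperture size in MiB (made up)
w0=2; w1=1                     # mmio32_aperture_weight for segments 0 and 1
sum=$((w0 + w1))
seg0_mib=$((total_mib * w0 / sum))
seg1_mib=$((total_mib * w1 / sum))
echo "segment 0: ${seg0_mib} MiB"   # 2/3 of the aperture
echo "segment 1: ${seg1_mib} MiB"   # 1/3 of the aperture
```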