# Cloud Hypervisor VFIO HOWTO

VFIO (Virtual Function I/O) is a kernel framework that exposes direct device
access to userspace. `cloud-hypervisor`, as many VMMs do, uses the VFIO
framework to directly assign host physical devices to the guest workloads.

## Direct Device Assignment with Cloud Hypervisor

To assign a device to a `cloud-hypervisor` guest, the device needs to be managed
by the VFIO kernel drivers. However, by default, a host device will be bound to
its native driver, which is not the VFIO one.

As a consequence, a device must be unbound from its native driver before it can
be passed to `cloud-hypervisor` and assigned to a guest.

### Example

In this example we're going to assign a PCI memory card reader (SD, MMC, etc.)
from the host to a `cloud-hypervisor` guest.

`cloud-hypervisor` only supports assigning PCI devices to its guests. `lspci`
helps with identifying PCI devices on the host:

```
$ lspci
[...]
01:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
[...]
```

Here we see that our device is on bus 1, slot 0, and function 0 (`01:00.0`).

Now that we have identified the device, we must unbind it from its native driver
(`rtsx_pci`) and bind it to the VFIO driver instead (`vfio_pci`).

First we add VFIO support to the host:

```
# modprobe -r vfio_pci
# modprobe -r vfio_iommu_type1
# modprobe vfio_iommu_type1 allow_unsafe_interrupts
# modprobe vfio_pci
```

In case the VFIO drivers are built-in, enable unsafe interrupts with:

```
# echo 1 > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
```

Then we unbind the device from its native driver:

```
# echo 0000:01:00.0 > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind
```

And finally we bind it to the VFIO driver. To do that we first need the device's
Vendor ID and Device ID:

```
$ lspci -n -s 01:00.0
01:00.0 ff00: 10ec:525a (rev 01)

# echo 10ec 525a > /sys/bus/pci/drivers/vfio-pci/new_id
```

If you have more than one device with the same `vendorID`/`deviceID`, the binding
of the second and subsequent devices is performed as follows:

```
# echo 0000:02:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
```

Now the device is managed by the VFIO framework.

The final step is to give that device to `cloud-hypervisor` to assign it to the
guest. This is done by using the `--device` command line option. This option
takes the device's sysfs path as an argument. In our example it is
`/sys/bus/pci/devices/0000:01:00.0/`:

```
./target/debug/cloud-hypervisor \
    --kernel ~/vmlinux \
    --disk path=~/focal-server-cloudimg-amd64.raw \
    --console off \
    --serial tty \
    --cmdline "console=ttyS0 root=/dev/vda1 rw" \
    --cpus 4 \
    --memory size=512M \
    --device path=/sys/bus/pci/devices/0000:01:00.0/
```

The guest kernel will then detect the card reader on its PCI bus and, provided
that support for this device is enabled, it will probe and enable it for the
guest to use.

In case you want to pass multiple devices, here is the correct syntax:

```
--device path=/sys/bus/pci/devices/0000:01:00.0/ path=/sys/bus/pci/devices/0000:02:00.0/
```

### Multiple devices in the same IOMMU group

There are cases where multiple devices can be found under the same IOMMU group.
This often happens with graphics cards that embed an audio controller.

```
$ lspci
[...]
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
[...]
```

This is usually exposed as follows through `sysfs`:

```
$ ls /sys/kernel/iommu_groups/22/devices/
0000:01:00.0  0000:01:00.1
```

This means these two devices are under the same IOMMU group 22. In such a case,
it is important to bind both devices to VFIO and pass them both through to the
VM, otherwise this could cause functional and security issues.
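
As a sketch of how this can be scripted, the loop below rebinds every device in
IOMMU group 22 (the group from the example above) to `vfio-pci`. It uses the
standard `driver_override` sysfs mechanism rather than the `new_id` method shown
earlier, so it works regardless of the devices' vendor/device IDs; adjust the
group number to match your system.

```
#!/bin/bash
# Rebind every device in IOMMU group 22 to vfio-pci (run as root).
for dev in /sys/kernel/iommu_groups/22/devices/*; do
    bdf=$(basename "$dev")
    # Unbind from the native driver, if one is currently bound.
    if [ -e "/sys/bus/pci/devices/$bdf/driver" ]; then
        echo "$bdf" > "/sys/bus/pci/devices/$bdf/driver/unbind"
    fi
    # Ask the PCI core to bind this device to vfio-pci on the next probe.
    echo vfio-pci > "/sys/bus/pci/devices/$bdf/driver_override"
    echo "$bdf" > /sys/bus/pci/drivers_probe
done
```

Both devices can then be passed to the guest with a single `--device` option:

```
--device path=/sys/bus/pci/devices/0000:01:00.0/ path=/sys/bus/pci/devices/0000:01:00.1/
```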
### Advanced Configuration Options

When using NVIDIA GPUs in a VFIO passthrough configuration, advanced
configuration options are supported to enable GPUDirect P2P DMA over
PCIe. When enabled, loads and stores between GPUs use native PCIe
peer-to-peer transactions instead of a shared memory buffer. This drastically
decreases P2P latency between GPUs. This functionality is supported by
cloud-hypervisor on NVIDIA Turing, Ampere, Hopper, and Lovelace GPUs.

The NVIDIA driver does not enable GPUDirect P2P over PCIe within guests
by default because hardware support for routing P2P TLPs between PCIe root
ports is optional. PCIe P2P should always be supported between devices
on the same PCIe switch. The `x_nv_gpudirect_clique` config argument may
be used to signal support for PCIe P2P traffic between NVIDIA VFIO endpoints.
The guest driver assumes that P2P traffic is supported between all endpoints
that are part of the same clique.
```
--device path=/sys/bus/pci/devices/0000:01:00.0/,x_nv_gpudirect_clique=0
```

The following command can be run on the guest to verify that GPUDirect P2P is
correctly enabled.
```
nvidia-smi topo -p2p r
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    GPU6    GPU7
 GPU0   X       OK      OK      OK      OK      OK      OK      OK
 GPU1   OK      X       OK      OK      OK      OK      OK      OK
 GPU2   OK      OK      X       OK      OK      OK      OK      OK
 GPU3   OK      OK      OK      X       OK      OK      OK      OK
 GPU4   OK      OK      OK      OK      X       OK      OK      OK
 GPU5   OK      OK      OK      OK      OK      X       OK      OK
 GPU6   OK      OK      OK      OK      OK      OK      X       OK
 GPU7   OK      OK      OK      OK      OK      OK      OK      X
```

Some VFIO devices have a 32-bit mmio BAR. When using many such devices, it is
possible to exhaust the 32-bit mmio space available on a PCI segment. The
following example shows a device with a 16 MiB 32-bit mmio BAR.
```
lspci -s 0000:01:00.0 -v
0000:01:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
        [...]
        Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
        Memory at 46000000000 (64-bit, prefetchable) [size=64G]
        Memory at 48040000000 (64-bit, prefetchable) [size=32M]
        [...]
```

When using multiple PCI segments, the 32-bit mmio address space available to
be allocated to VFIO devices is equally split between all PCI segments by
default. This can be tuned with the `--pci-segment` flag. The following example
demonstrates a guest with two PCI segments. 2/3 of the 32-bit mmio address
space is available for use by devices on PCI segment 0 and 1/3 of the 32-bit
mmio address space is available for use by devices on PCI segment 1.
```
--platform num_pci_segments=2
--pci-segment pci_segment=0,mmio32_aperture_weight=2
--pci-segment pci_segment=1,mmio32_aperture_weight=1
```
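
Putting the pieces together, a guest with two weighted PCI segments and one VFIO
device on each segment could be launched along the following lines. This is only
a sketch: the device paths are the ones from the earlier examples, and the
`pci_segment` parameter of `--device` is used to place each device on its
segment.

```
./target/debug/cloud-hypervisor \
    --kernel ~/vmlinux \
    --disk path=~/focal-server-cloudimg-amd64.raw \
    --cmdline "console=ttyS0 root=/dev/vda1 rw" \
    --cpus 4 \
    --memory size=512M \
    --platform num_pci_segments=2 \
    --pci-segment pci_segment=0,mmio32_aperture_weight=2 \
    --pci-segment pci_segment=1,mmio32_aperture_weight=1 \
    --device path=/sys/bus/pci/devices/0000:01:00.0/,pci_segment=0 \
             path=/sys/bus/pci/devices/0000:02:00.0/,pci_segment=1
```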