# Cloud Hypervisor VFIO HOWTO

VFIO (Virtual Function I/O) is a kernel framework that exposes direct device
access to userspace. `cloud-hypervisor`, like many VMMs, uses the VFIO
framework to directly assign host physical devices to guest workloads.

## Direct Device Assignment with Cloud Hypervisor

To assign a device to a `cloud-hypervisor` guest, the device needs to be managed
by the VFIO kernel drivers. However, by default, a host device will be bound to
its native driver, which is not the VFIO one.

As a consequence, a device must be unbound from its native driver before being
passed to `cloud-hypervisor` for assignment to a guest.

### Example

In this example we're going to assign a PCI memory card reader (SD, MMC, etc.)
from the host to a Cloud Hypervisor guest.

`cloud-hypervisor` only supports assigning PCI devices to its guests. `lspci`
helps with identifying PCI devices on the host:

```
$ lspci
[...]
01:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
[...]
```

Here we see that our device is on bus 1, slot 0 and function 0 (`01:00.0`).
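
The `--device` option used later takes the full sysfs path, which also
includes the PCI domain (usually `0000`). If in doubt, `lspci -D` prints the
domain as part of the address:

```
$ lspci -D -s 01:00.0
0000:01:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
```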

Now that we have identified the device, we must unbind it from its native driver
(`rtsx_pci`) and bind it to the VFIO driver instead (`vfio_pci`).

First we add VFIO support to the host:

```
# modprobe -r vfio_pci
# modprobe -r vfio_iommu_type1
# modprobe vfio_iommu_type1 allow_unsafe_interrupts
# modprobe vfio_pci
```

In case the VFIO drivers are built-in, enable unsafe interrupts with:

```
# echo 1 > /sys/module/vfio_iommu_type1/parameters/allow_unsafe_interrupts
```
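
To keep this setting across reboots, one option is a `modprobe` configuration
fragment (the file name below is arbitrary):

```
# cat /etc/modprobe.d/vfio.conf
options vfio_iommu_type1 allow_unsafe_interrupts=1
```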

Then we unbind it from its native driver:

```
# echo 0000:01:00.0 > /sys/bus/pci/devices/0000\:01\:00.0/driver/unbind
```
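
At this point no driver claims the device; reading the `driver` symlink under
its sysfs directory prints nothing once the unbind has succeeded:

```
$ readlink /sys/bus/pci/devices/0000:01:00.0/driver
```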

And finally we bind it to the VFIO driver. To do that we first need to get the
device's vendor ID and device ID:

```
$ lspci -n -s 01:00.0
01:00.0 ff00: 10ec:525a (rev 01)

# echo 10ec 525a > /sys/bus/pci/drivers/vfio-pci/new_id
```
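
Writing to `new_id` makes `vfio-pci` claim any unbound device with that
vendor/device ID pair. The binding can be verified by reading the `driver`
symlink again (the exact relative path may differ):

```
$ readlink /sys/bus/pci/devices/0000:01:00.0/driver
../../../bus/pci/drivers/vfio-pci
```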

If you have more than one device with the same `vendorID`/`deviceID`, the
binding for the second and subsequent devices is performed as follows:

```
# echo 0000:02:00.0 > /sys/bus/pci/drivers/vfio-pci/bind
```
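
Alternatively, recent kernels expose a per-device `driver_override` attribute,
which binds a single device without matching every device that shares the same
IDs. A sketch of the same binding using that mechanism:

```
# echo vfio-pci > /sys/bus/pci/devices/0000:02:00.0/driver_override
# echo 0000:02:00.0 > /sys/bus/pci/drivers_probe
```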

Now the device is managed by the VFIO framework.

The final step is to give that device to `cloud-hypervisor` to assign it to the
guest. This is done by using the `--device` command line option. This option
takes the device's sysfs path as an argument. In our example it is
`/sys/bus/pci/devices/0000:01:00.0/`:

```
./target/debug/cloud-hypervisor \
    --kernel ~/vmlinux \
    --disk path=~/focal-server-cloudimg-amd64.raw \
    --console off \
    --serial tty \
    --cmdline "console=ttyS0 root=/dev/vda1 rw" \
    --cpus 4 \
    --memory size=512M \
    --device path=/sys/bus/pci/devices/0000:01:00.0/
```

The guest kernel will then detect the card reader on its PCI bus and, provided
that support for this device is enabled, will probe and enable it for the
guest to use.
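
This can be checked from within the guest; the guest-side address below is
illustrative and will generally differ from the host's:

```
$ lspci
[...]
00:05.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS525A PCI Express Card Reader (rev 01)
[...]
```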

In case you want to pass multiple devices, here is the correct syntax:

```
--device path=/sys/bus/pci/devices/0000:01:00.0/ path=/sys/bus/pci/devices/0000:02:00.0/
```

### Multiple devices in the same IOMMU group

There are cases where multiple devices belong to the same IOMMU group. This
often happens with graphics cards that embed an audio controller.

```
$ lspci
[...]
01:00.0 VGA compatible controller: NVIDIA Corporation GK208B [GeForce GT 710] (rev a1)
01:00.1 Audio device: NVIDIA Corporation GK208 HDMI/DP Audio Controller (rev a1)
[...]
```

This is usually exposed as follows through `sysfs`:

```
$ ls /sys/kernel/iommu_groups/22/devices/
0000:01:00.0  0000:01:00.1
```

This means these two devices are under the same IOMMU group 22. In such a
case, it is important to bind both devices to VFIO and pass them both through
to the VM; otherwise, this could cause functional and security issues. A loop
over the group members, as sketched below, applies the per-device binding
steps to each one.
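
A minimal sketch, assuming the `driver_override` mechanism shown earlier and
IOMMU group 22 from this example:

```
# Run as root: bind every device in IOMMU group 22 to vfio-pci.
for dev in /sys/kernel/iommu_groups/22/devices/*; do
    bdf=$(basename "$dev")
    echo vfio-pci > "/sys/bus/pci/devices/$bdf/driver_override"
    # Unbind from the current driver, if any.
    if [ -e "/sys/bus/pci/devices/$bdf/driver" ]; then
        echo "$bdf" > "/sys/bus/pci/devices/$bdf/driver/unbind"
    fi
    # Re-probe; driver_override makes vfio-pci claim the device.
    echo "$bdf" > /sys/bus/pci/drivers_probe
done
```

Both devices can then be passed to the guest with a single `--device` option,
as shown earlier.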

### Advanced Configuration Options

When using NVIDIA GPUs in a VFIO passthrough configuration, advanced
configuration options are supported to enable GPUDirect P2P DMA over
PCIe. When enabled, loads and stores between GPUs use native PCIe
peer-to-peer transactions instead of a shared memory buffer. This drastically
decreases P2P latency between GPUs. This functionality is supported by
cloud-hypervisor on NVIDIA Turing, Ampere, Hopper, and Ada Lovelace GPUs.

The NVIDIA driver does not enable GPUDirect P2P over PCIe within guests
by default because hardware support for routing P2P TLPs between PCIe root
ports is optional. PCIe P2P should always be supported between devices
on the same PCIe switch. The `x_nv_gpudirect_clique` config argument may
be used to signal support for PCIe P2P traffic between NVIDIA VFIO endpoints.
The guest driver assumes that P2P traffic is supported between all endpoints
that are part of the same clique.
```
--device path=/sys/bus/pci/devices/0000:01:00.0/,x_nv_gpudirect_clique=0
```
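
To place several GPUs in the same clique, pass the same clique ID for each
device; the addresses below are illustrative:

```
--device path=/sys/bus/pci/devices/0000:01:00.0/,x_nv_gpudirect_clique=0 path=/sys/bus/pci/devices/0000:02:00.0/,x_nv_gpudirect_clique=0
```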

The following command can be run in the guest to verify that GPUDirect P2P is
correctly enabled:
```
nvidia-smi topo -p2p r
 	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7
 GPU0	X	OK	OK	OK	OK	OK	OK	OK
 GPU1	OK	X	OK	OK	OK	OK	OK	OK
 GPU2	OK	OK	X	OK	OK	OK	OK	OK
 GPU3	OK	OK	OK	X	OK	OK	OK	OK
 GPU4	OK	OK	OK	OK	X	OK	OK	OK
 GPU5	OK	OK	OK	OK	OK	X	OK	OK
 GPU6	OK	OK	OK	OK	OK	OK	X	OK
 GPU7	OK	OK	OK	OK	OK	OK	OK	X
```

Some VFIO devices have a 32-bit mmio BAR. When using many such devices, it is
possible to exhaust the 32-bit mmio space available on a PCI segment. The
following output shows a device with a 16 MiB 32-bit mmio BAR:
```
lspci -s 0000:01:00.0 -v
0000:01:00.0 3D controller: NVIDIA Corporation Device 26b9 (rev a1)
    [...]
    Memory at f9000000 (32-bit, non-prefetchable) [size=16M]
    Memory at 46000000000 (64-bit, prefetchable) [size=64G]
    Memory at 48040000000 (64-bit, prefetchable) [size=32M]
    [...]
```

When using multiple PCI segments, the 32-bit mmio address space available to
be allocated to VFIO devices is equally split between all PCI segments by
default. This can be tuned with the `--pci-segment` flag. The following example
demonstrates a guest with two PCI segments. 2/3 of the 32-bit mmio address
space is available for use by devices on PCI segment 0 and 1/3 of the 32-bit
mmio address space is available for use by devices on PCI segment 1:
```
--platform num_pci_segments=2
--pci-segment pci_segment=0,mmio32_aperture_weight=2
--pci-segment pci_segment=1,mmio32_aperture_weight=1
```
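
A device can then be placed on a specific segment with the `pci_segment`
device parameter; a sketch, assuming the weighted configuration above:

```
--device path=/sys/bus/pci/devices/0000:01:00.0/,pci_segment=1
```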