==============
NVMe Emulation
==============

QEMU provides NVMe emulation through the ``nvme``, ``nvme-ns`` and
``nvme-subsys`` devices.

See the following sections for specific information on

  * `Adding NVMe Devices`_, `additional namespaces`_ and `NVM subsystems`_.
  * Configuration of `Optional Features`_ such as `Controller Memory Buffer`_,
    `Simple Copy`_, `Zoned Namespaces`_, `metadata`_ and `End-to-End Data
    Protection`_.

Adding NVMe Devices
===================

Controller Emulation
--------------------

The QEMU emulated NVMe controller implements version 1.4 of the NVM Express
specification. All mandatory features are implemented, with a couple of
exceptions and limitations:

  * Accounting numbers in the SMART/Health log page are reset when the device
    is power cycled.
  * Interrupt Coalescing is not supported and is disabled by default.

The simplest way to attach an NVMe controller on the QEMU PCI bus is to add the
following parameters:

.. code-block:: console

   -drive file=nvm.img,if=none,id=nvm
   -device nvme,serial=deadbeef,drive=nvm

There are a number of optional general parameters for the ``nvme`` device. Some
are mentioned here, but see ``-device nvme,help`` to list all possible
parameters.

``max_ioqpairs=UINT32`` (default: ``64``)
  Set the maximum number of allowed I/O queue pairs. This replaces the
  deprecated ``num_queues`` parameter.

``msix_qsize=UINT16`` (default: ``65``)
  The number of MSI-X vectors that the device should support.

``mdts=UINT8`` (default: ``7``)
  Set the Maximum Data Transfer Size of the device.

``use-intel-id`` (default: ``off``)
  Since QEMU 5.2, the device uses a QEMU allocated "Red Hat" PCI Device and
  Vendor ID. Set this to ``on`` to revert to the unallocated Intel ID
  previously used.

``ocp`` (default: ``off``)
  The Open Compute Project defines the Datacenter NVMe SSD Specification that
  sits on top of NVMe.
  It describes additional commands and NVMe behaviors
  specific for the Datacenter. When this option is ``on``, OCP features such as
  the SMART / Health information extended log become available in the
  controller. We emulate version 5 of this log page.

Additional Namespaces
---------------------

In the simplest possible invocation sketched above, the device only supports a
single namespace with the namespace identifier ``1``. To support multiple
namespaces and additional features, the ``nvme-ns`` device must be used.

.. code-block:: console

   -device nvme,id=nvme-ctrl-0,serial=deadbeef
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2

The namespaces defined by the ``nvme-ns`` device will attach to the most
recently defined ``nvme-bus`` that is created by the ``nvme`` device. Namespace
identifiers are allocated automatically, starting from ``1``.

There are a number of parameters available:

``nsid`` (default: ``0``)
  Explicitly set the namespace identifier.

``uuid`` (default: *autogenerated*)
  Set the UUID of the namespace. This will be reported as a "Namespace UUID"
  descriptor in the Namespace Identification Descriptor List.

``nguid``
  Set the NGUID of the namespace. This will be reported as a "Namespace
  Globally Unique Identifier" descriptor in the Namespace Identification
  Descriptor List. It is specified as a string of hexadecimal digits
  containing exactly 16 bytes or "auto" for a random value. An optional '-'
  separator may be used to group bytes. If not specified, the NGUID will
  remain all zeros.

``eui64``
  Set the EUI-64 of the namespace. This will be reported as an "IEEE Extended
  Unique Identifier" descriptor in the Namespace Identification Descriptor
  List. Since machine type 6.1, a non-zero default value is used if the
  parameter is not provided.
  For earlier machine types the field defaults to 0.

``bus``
  If there are more ``nvme`` devices defined, this parameter may be used to
  attach the namespace to a specific ``nvme`` device (identified by an ``id``
  parameter on the controller device).

NVM Subsystems
--------------

Additional features become available if the controller device (``nvme``) is
linked to an NVM Subsystem device (``nvme-subsys``).

The NVM Subsystem emulation allows features such as shared namespaces and
multipath I/O.

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0

This will create an NVM subsystem with two controllers. Having controllers
linked to an ``nvme-subsys`` device allows additional ``nvme-ns`` parameters:

``shared`` (default: ``on`` since 6.2)
  Specifies that the namespace will be attached to all controllers in the
  subsystem. If set to ``off``, the namespace will remain a private namespace
  and may only be attached to a single controller at a time. Shared namespaces
  are always automatically attached to all controllers (also when controllers
  are hotplugged).

``detached`` (default: ``off``)
  If set to ``on``, the namespace will be available in the subsystem, but
  not attached to any controllers initially. A shared namespace with this set
  to ``on`` will never be automatically attached to controllers.

Thus, adding

.. code-block:: console

   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,nsid=1
   -drive file=nvm-2.img,if=none,id=nvm-2
   -device nvme-ns,drive=nvm-2,nsid=3,shared=off,detached=on

will cause NSID 1 to be a shared namespace that is initially attached to both
controllers.
NSID 3 will be a private namespace due to ``shared=off`` and only
attachable to a single controller at a time. Additionally, it will not be
attached to any controller initially (due to ``detached=on``), nor to
hotplugged controllers.

Optional Features
=================

Controller Memory Buffer
------------------------

``nvme`` device parameters related to the Controller Memory Buffer support:

``cmb_size_mb=UINT32`` (default: ``0``)
  This adds a Controller Memory Buffer of the given size at offset zero in
  BAR 2.

``legacy-cmb`` (default: ``off``)
  By default, the device uses the "v1.4 scheme" for the Controller Memory
  Buffer support (i.e., the CMB is initially disabled and must be explicitly
  enabled by the host). Set this to ``on`` to behave as a v1.3 device with
  respect to the CMB.

Simple Copy
-----------

The device includes support for TP 4065 ("Simple Copy Command"). A number of
additional ``nvme-ns`` device parameters may be used to control the Copy
command limits:

``mssrl=UINT16`` (default: ``128``)
  Set the Maximum Single Source Range Length (``MSSRL``). This is the maximum
  number of logical blocks that may be specified in each source range.

``mcl=UINT32`` (default: ``128``)
  Set the Maximum Copy Length (``MCL``). This is the maximum number of logical
  blocks that may be specified in a Copy command (the total for all source
  ranges).

``msrc=UINT8`` (default: ``127``)
  Set the Maximum Source Range Count (``MSRC``). This is the maximum number of
  source ranges that may be used in a Copy command. This is a 0's based value.

Zoned Namespaces
----------------

A namespace may be "Zoned" as defined by TP 4053 ("Zoned Namespaces"). Set
``zoned=on`` on an ``nvme-ns`` device to configure it as a zoned namespace.
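
For example, assuming a backing image named ``zns.img`` (the image name is
illustrative), a zoned namespace is created as follows:

.. code-block:: console

   -drive file=zns.img,if=none,id=zns
   -device nvme,serial=deadbeef
   -device nvme-ns,drive=zns,zoned=on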

The namespace may be configured with additional parameters:

``zoned.zone_size=SIZE`` (default: ``128MiB``)
  Define the zone size (``ZSZE``).

``zoned.zone_capacity=SIZE`` (default: ``0``)
  Define the zone capacity (``ZCAP``). If left at the default (``0``), the
  zone capacity will equal the zone size.

``zoned.descr_ext_size=UINT32`` (default: ``0``)
  Set the Zone Descriptor Extension Size (``ZDES``). Must be a multiple of 64
  bytes.

``zoned.cross_read=BOOL`` (default: ``off``)
  Set to ``on`` to allow reads to cross zone boundaries.

``zoned.max_active=UINT32`` (default: ``0``)
  Set the maximum number of active resources (``MAR``). The default (``0``)
  allows all zones to be active.

``zoned.max_open=UINT32`` (default: ``0``)
  Set the maximum number of open resources (``MOR``). The default (``0``)
  allows all zones to be open. If ``zoned.max_active`` is specified, this
  value must be less than or equal to that.

``zoned.zasl=UINT8`` (default: ``0``)
  Set the maximum data transfer size for the Zone Append command. Like
  ``mdts``, the value is specified as a power of two (2^n) and is in units of
  the minimum memory page size (CAP.MPSMIN). The default value (``0``)
  has this property inherit the ``mdts`` value.

Flexible Data Placement
-----------------------

The device may be configured to support TP 4146 ("Flexible Data Placement") by
enabling it (``fdp=on``) on the subsystem::

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16

The subsystem emulates a single Endurance Group, on which Flexible Data
Placement will be supported. Also note that the device emulation deviates
slightly from the specification by always enabling the "FDP Mode" feature on
the controller if the subsystem is configured for Flexible Data Placement.
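
Putting the pieces together, a minimal sketch of an FDP-enabled setup (the
image name ``nvm-1.img`` is illustrative) might look like this; note that the
semicolons in the reclaim unit handle list may need quoting in a shell:

.. code-block:: console

   -device nvme-subsys,id=nvme-subsys-0,nqn=subsys0,fdp=on,fdp.nruh=16
   -device nvme,serial=deadbeef,subsys=nvme-subsys-0
   -drive file=nvm-1.img,if=none,id=nvm-1
   -device nvme-ns,drive=nvm-1,fdp.ruhs=0;1;2;3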

Enabling Flexible Data Placement on the subsystem enables the following
parameters:

``fdp.nrg`` (default: ``1``)
  Set the number of Reclaim Groups.

``fdp.nruh`` (default: ``0``)
  Set the number of Reclaim Unit Handles. This is a mandatory parameter and
  must be non-zero.

``fdp.runs`` (default: ``96M``)
  Set the Reclaim Unit Nominal Size. Defaults to 96 MiB.

Namespaces within this subsystem may request Reclaim Unit Handles::

   -device nvme-ns,drive=nvm-1,fdp.ruhs=RUHLIST

The ``RUHLIST`` is a semicolon separated list (e.g. ``0;1;2;3``) and may
include ranges (e.g. ``0;8-15``). If no reclaim unit handle list is specified,
the controller will assign the controller-specified reclaim unit handle to
placement handle identifier 0.

Metadata
--------

The virtual namespace device supports LBA metadata in the form of separate
metadata (``MPTR``-based) and extended LBAs.

``ms=UINT16`` (default: ``0``)
  Defines the number of metadata bytes per LBA.

``mset=UINT8`` (default: ``0``)
  Set to ``1`` to enable extended LBAs.

End-to-End Data Protection
--------------------------

The virtual namespace device supports DIF- and DIX-based protection
information (depending on ``mset``).

``pi=UINT8`` (default: ``0``)
  Enable protection information of the specified type (type ``1``, ``2`` or
  ``3``).

``pil=UINT8`` (default: ``0``)
  Controls the location of the protection information within the metadata. Set
  to ``1`` to transfer protection information as the first bytes of metadata.
  Otherwise, the protection information is transferred as the last bytes of
  metadata.

``pif=UINT8`` (default: ``0``)
  By default, the namespace device uses the 16 bit guard protection
  information format (``pif=0``). Set to ``2`` to enable the 64 bit guard
  protection information format. This requires at least 16 bytes of metadata.
  Note that ``pif=1`` (32 bit guards) is currently not supported.

Virtualization Enhancements and SR-IOV (Experimental Support)
-------------------------------------------------------------

The ``nvme`` device supports Single Root I/O Virtualization and Sharing
along with Virtualization Enhancements. The controller has to be linked to
an NVM Subsystem device (``nvme-subsys``) for use with SR-IOV.

A number of parameters are present (**please note that they may be
subject to change**):

``sriov_max_vfs`` (default: ``0``)
  Indicates the maximum number of PCIe virtual functions supported
  by the controller. Specifying a non-zero value enables reporting of both
  SR-IOV and ARI (Alternative Routing-ID Interpretation) capabilities
  by the NVMe device. Virtual function controllers will not report SR-IOV.

``sriov_vq_flexible``
  Indicates the total number of flexible queue resources assignable to all
  the secondary controllers. Implicitly sets the number of the primary
  controller's private resources to ``(max_ioqpairs - sriov_vq_flexible)``.

``sriov_vi_flexible``
  Indicates the total number of flexible interrupt resources assignable to
  all the secondary controllers. Implicitly sets the number of the primary
  controller's private resources to ``(msix_qsize - sriov_vi_flexible)``.

``sriov_max_vi_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual interrupt resources assignable
  to a secondary controller. The default ``0`` resolves to
  ``(sriov_vi_flexible / sriov_max_vfs)``.

``sriov_max_vq_per_vf`` (default: ``0``)
  Indicates the maximum number of virtual queue resources assignable to
  a secondary controller. The default ``0`` resolves to
  ``(sriov_vq_flexible / sriov_max_vfs)``.

The simplest possible invocation enables the capability to set up one VF
controller and assign an admin queue, an I/O queue, and an MSI-X interrupt:

.. code-block:: console

   -device nvme-subsys,id=subsys0
   -device nvme,serial=deadbeef,subsys=subsys0,sriov_max_vfs=1,sriov_vq_flexible=2,sriov_vi_flexible=1

The minimum steps required to configure a functional NVMe secondary
controller are:

  * unbind flexible resources from the primary controller

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 0 -r 1 -a 1 -n 0
   nvme virt-mgmt /dev/nvme0 -c 0 -r 0 -a 1 -n 0

  * perform a Function Level Reset on the primary controller to actually
    release the resources

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/reset

  * enable VF

.. code-block:: console

   echo 1 > /sys/bus/pci/devices/0000:01:00.0/sriov_numvfs

  * assign the flexible resources to the VF and set it ONLINE

.. code-block:: console

   nvme virt-mgmt /dev/nvme0 -c 1 -r 1 -a 8 -n 1
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 8 -n 2
   nvme virt-mgmt /dev/nvme0 -c 1 -r 0 -a 9 -n 0

  * bind the NVMe driver to the VF

.. code-block:: console

   echo 0000:01:00.1 > /sys/bus/pci/drivers/nvme/bind
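
Once the driver has bound to the VF, the state of the secondary controllers
can be inspected with nvme-cli (assuming it is installed), for example:

.. code-block:: console

   nvme list-secondary /dev/nvme0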