QEMU Virtual NVDIMM
===================

This document explains the usage of the virtual NVDIMM (vNVDIMM)
feature, which is available since QEMU v2.6.0.

The current QEMU only implements the persistent memory mode of the
vNVDIMM device and not the block window mode.

Basic Usage
-----------

The storage of a vNVDIMM device in QEMU is provided by the memory
backend (i.e. memory-backend-file and memory-backend-ram). A simple
way to create a vNVDIMM device at startup time is via the following
command line options:

 -machine pc,nvdimm
 -m $RAM_SIZE,slots=$N,maxmem=$MAX_SIZE
 -object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE
 -device nvdimm,id=nvdimm1,memdev=mem1

Where,

 - the "nvdimm" machine option enables the vNVDIMM feature.

 - "slots=$N" should be equal to or larger than the total number of
   normal RAM devices and vNVDIMM devices, e.g. $N should be >= 2 here.

 - "maxmem=$MAX_SIZE" should be equal to or larger than the total size
   of normal RAM devices and vNVDIMM devices, e.g. $MAX_SIZE should be
   >= $RAM_SIZE + $NVDIMM_SIZE here.

 - "object memory-backend-file,id=mem1,share=on,mem-path=$PATH,size=$NVDIMM_SIZE"
   creates a backend storage of size $NVDIMM_SIZE on a file $PATH. All
   accesses to the virtual NVDIMM device go to the file $PATH.

   "share=on/off" controls the visibility of guest writes. If
   "share=on", then guest writes will be applied to the backend
   file. If another guest uses the same backend file with option
   "share=on", then the above writes will be visible to it as well. If
   "share=off", then guest writes won't be applied to the backend
   file and thus will be invisible to other guests.

 - "device nvdimm,id=nvdimm1,memdev=mem1" creates a virtual NVDIMM
   device whose storage is provided by the above memory backend device.

Multiple vNVDIMM devices can be created if multiple pairs of "-object"
and "-device" are provided.

For the above command line options, if the guest OS has the proper
NVDIMM driver, it should be able to detect an NVDIMM device which is
in the persistent memory mode and whose size is $NVDIMM_SIZE.

Note:

1. Prior to QEMU v2.8.0, if memory-backend-file is used and the actual
   backend file size is not equal to the size given by the "size"
   option, QEMU will truncate the backend file by ftruncate(2), which
   will corrupt the existing data in the backend file, especially for
   the shrink case.

   QEMU v2.8.0 and later check the backend file size and the "size"
   option. If they do not match, QEMU will report errors and abort in
   order to avoid the data corruption.

2. QEMU v2.6.0 only puts a basic alignment requirement on the "size"
   option of memory-backend-file, e.g. 4KB alignment on x86. However,
   QEMU v2.7.0 puts an additional alignment requirement, which may
   require a larger value than the basic one, e.g. 2MB on x86. This
   change breaks the usage of memory-backend-file that only satisfies
   the basic alignment.

   QEMU v2.8.0 and later remove the additional alignment on non-s390x
   architectures, so the broken memory-backend-file can work again.

Label
-----

QEMU v2.7.0 and later implement the label support for vNVDIMM devices.
To enable labels on vNVDIMM devices, users can simply add the
"label-size=$SZ" option to "-device nvdimm", e.g.

 -device nvdimm,id=nvdimm1,memdev=mem1,label-size=128K

Note:

1. The minimal label size is 128KB.

2. QEMU v2.7.0 and later store labels at the end of backend storage.
   If a memory backend file, which was previously used as the backend
   of a vNVDIMM device without labels, is now used for a vNVDIMM
   device with labels, the data in the label area at the end of the
   file will be inaccessible to the guest. If any useful data (e.g.
   the meta-data of the file system) was stored there, the latter
   usage may result in guest data corruption (e.g. breakage of the
   guest file system).

Hotplug
-------

QEMU v2.8.0 and later implement the hotplug support for vNVDIMM
devices. Similarly to the RAM hotplug, the vNVDIMM hotplug is
accomplished by two monitor commands "object_add" and "device_add".

For example, the following commands add another 4GB vNVDIMM device to
the guest:

 (qemu) object_add memory-backend-file,id=mem2,share=on,mem-path=new_nvdimm.img,size=4G
 (qemu) device_add nvdimm,id=nvdimm2,memdev=mem2

Note:

1. Each hotplugged vNVDIMM device consumes one memory slot. Users
   should always ensure the memory option "-m ...,slots=N" specifies
   enough slots, i.e.
     N >= number of RAM devices +
          number of statically plugged vNVDIMM devices +
          number of hotplugged vNVDIMM devices

2. A similar constraint applies to the memory option
   "-m ...,maxmem=M", i.e.
     M >= size of RAM devices +
          size of statically plugged vNVDIMM devices +
          size of hotplugged vNVDIMM devices

Alignment
---------

QEMU uses mmap(2) to map vNVDIMM backends and aligns the mapping
address to the page size (getpagesize(2)) by default. However, some
types of backends may require an alignment different from the page
size. In that case, QEMU v2.12.0 and later provide the 'align' option
to memory-backend-file to allow users to specify the proper alignment.
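
The decision described above (rely on the page-size default, or pass
'align' explicitly) can be sketched as a small helper. This is only an
illustration, not part of QEMU: the function names, the default id
"mem1" and the idea of generating the option string from a script are
assumptions made for this example.

```python
import mmap


def format_size(n):
    """Render a byte count with a K/M/G suffix when it divides evenly.

    QEMU size options treat these suffixes as powers of 1024.
    """
    for suffix, unit in (("G", 1 << 30), ("M", 1 << 20), ("K", 1 << 10)):
        if n % unit == 0:
            return f"{n // unit}{suffix}"
    return str(n)


def backend_option(mem_path, size, required_align,
                   page_size=mmap.PAGESIZE, backend_id="mem1"):
    """Build the value for "-object" (hypothetical helper).

    'align' is appended only when the backend needs an alignment that
    differs from the host page size, which is the default alignment.
    """
    if size <= 0 or required_align <= 0:
        raise ValueError("size and alignment must be positive")
    opt = (f"memory-backend-file,id={backend_id},share=on,"
           f"mem-path={mem_path},size={format_size(size)}")
    if required_align != page_size:
        opt += f",align={format_size(required_align)}"
    return opt
```

With a 4 KB page size, a backend that needs 4 KB alignment yields no
'align' option, while one that needs 2 MB alignment yields
",align=2M" at the end of the string.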

For example, device DAX requires 2 MB alignment, so we can use the
following QEMU command line options to use it (/dev/dax0.0) as the
backend of a vNVDIMM:

 -object memory-backend-file,id=mem1,share=on,mem-path=/dev/dax0.0,size=4G,align=2M
 -device nvdimm,id=nvdimm1,memdev=mem1

Guest Data Persistence
----------------------

Though QEMU supports multiple types of vNVDIMM backends on Linux,
currently the only one that can guarantee the guest write persistence
is the device DAX on the real NVDIMM device (e.g., /dev/dax0.0), to
which guest accesses do not involve any host-side kernel cache.

When using other types of backends, it's suggested to set the
'unarmed' option of '-device nvdimm' to 'on', which sets the unarmed
flag of the guest NVDIMM region mapping structure. This unarmed flag
indicates to guest software that this vNVDIMM device contains a region
that cannot accept persistent writes. As a result, for example, the
guest Linux NVDIMM driver marks such a vNVDIMM device as read-only.

Platform Capabilities
---------------------

ACPI 6.2 Errata A added support for a new Platform Capabilities Structure
which allows the platform to communicate what features it supports related to
NVDIMM data durability. Users can provide a capabilities value to a guest via
the optional "nvdimm-cap" machine command line option:

 -machine pc,accel=kvm,nvdimm,nvdimm-cap=2

This "nvdimm-cap" field is an integer, and is the combined value of the
various capability bits defined in table 5-137 of the ACPI 6.2 Errata A spec.

Here is a quick summary of the three bits that are defined as of that spec:

Bit[0] - CPU Cache Flush to NVDIMM Durability on Power Loss Capable.
Bit[1] - Memory Controller Flush to NVDIMM Durability on Power Loss Capable.
         Note: If bit 0 is set to 1 then this bit shall be set to 1 as well.
Bit[2] - Byte Addressable Persistent Memory Hardware Mirroring Capable.

So, a "nvdimm-cap" value of 2 would mean that the platform supports Memory
Controller Flush on Power Loss, a value of 3 would mean that the platform
supports CPU Cache Flush and Memory Controller Flush on Power Loss, etc.

For a complete list of the flags available and for more detailed descriptions,
please consult the ACPI spec.
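
As a rough illustration of how the bits above combine into a single
"nvdimm-cap" integer, the following sketch ORs the selected bits
together and rejects combinations that set bit 0 without bit 1. The
constant and function names are made up for this example and do not
come from QEMU or the ACPI spec.

```python
# Bit positions as summarized above (names are illustrative only).
CPU_CACHE_FLUSH = 1 << 0   # Bit[0]
MEM_CTRL_FLUSH = 1 << 1    # Bit[1]
HW_MIRRORING = 1 << 2      # Bit[2]


def nvdimm_cap(*bits):
    """Combine capability bits into the integer for "nvdimm-cap"."""
    value = 0
    for bit in bits:
        value |= bit
    # If bit 0 is set then bit 1 shall be set as well.
    if value & CPU_CACHE_FLUSH and not value & MEM_CTRL_FLUSH:
        raise ValueError("bit 0 requires bit 1 to be set as well")
    return value


def machine_option(value):
    """Format the machine option string used in the example above."""
    return f"pc,accel=kvm,nvdimm,nvdimm-cap={value}"
```

For instance, nvdimm_cap(MEM_CTRL_FLUSH) gives 2 and
nvdimm_cap(CPU_CACHE_FLUSH, MEM_CTRL_FLUSH) gives 3, matching the
values discussed above.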