1======================================== 2zram: Compressed RAM-based block devices 3======================================== 4 5Introduction 6============ 7 8The zram module creates RAM-based block devices named /dev/zram<id> 9(<id> = 0, 1, ...). Pages written to these disks are compressed and stored 10in memory itself. These disks allow very fast I/O and compression provides 11good amounts of memory savings. Some of the use cases include /tmp storage, 12use as swap disks, various caches under /var and maybe many more. :) 13 14Statistics for individual zram devices are exported through sysfs nodes at 15/sys/block/zram<id>/ 16 17Usage 18===== 19 20There are several ways to configure and manage zram device(-s): 21 22a) using zram and zram_control sysfs attributes 23b) using zramctl utility, provided by util-linux (util-linux@vger.kernel.org). 24 25In this document we will describe only 'manual' zram configuration steps, 26IOW, zram and zram_control sysfs attributes. 27 28In order to get a better idea about zramctl please consult util-linux 29documentation, zramctl man-page or `zramctl --help`. Please be informed 30that zram maintainers do not develop/maintain util-linux or zramctl, should 31you have any questions please contact util-linux@vger.kernel.org 32 33Following shows a typical sequence of steps for using zram. 34 35WARNING 36======= 37 38For the sake of simplicity we skip error checking parts in most of the 39examples below. However, it is your sole responsibility to handle errors. 40 41zram sysfs attributes always return negative values in case of errors. 42The list of possible return codes: 43 44======== ============================================================= 45-EBUSY an attempt to modify an attribute that cannot be changed once 46 the device has been initialised. Please reset device first. 47-ENOMEM zram was not able to allocate enough memory to fulfil your 48 needs. 49-EINVAL invalid input has been provided. 50-EAGAIN re-try operation later (e.g. when attempting to run recompress 51 and writeback simultaneously). 52======== ============================================================= 53 54If you use 'echo', the returned value is set by the 'echo' utility, 55and, in general case, something like:: 56 57 echo foo > /sys/block/zram0/comp_algorithm 58 if [ $? -ne 0 ]; then 59 handle_error 60 fi 61 62should suffice. 63 641) Load Module 65============== 66 67:: 68 69 modprobe zram num_devices=4 70 71This creates 4 devices: /dev/zram{0,1,2,3} 72 73num_devices parameter is optional and tells zram how many devices should be 74pre-created. Default: 1. 75 762) Select compression algorithm 77=============================== 78 79Using comp_algorithm device attribute one can see available and 80currently selected (shown in square brackets) compression algorithms, 81or change the selected compression algorithm (once the device is initialised 82there is no way to change compression algorithm). 83 84Examples:: 85 86 #show supported compression algorithms 87 cat /sys/block/zram0/comp_algorithm 88 lzo [lz4] 89 90 #select lzo compression algorithm 91 echo lzo > /sys/block/zram0/comp_algorithm 92 93For the time being, the `comp_algorithm` content shows only compression 94algorithms that are supported by zram. 95 963) Set compression algorithm parameters: Optional 97================================================= 98 99Compression algorithms may support specific parameters which can be 100tweaked for particular dataset. ZRAM has an `algorithm_params` device 101attribute which provides a per-algorithm params configuration. 102 103For example, several compression algorithms support `level` parameter. 104In addition, certain compression algorithms support pre-trained dictionaries, 105which significantly change algorithms' characteristics. In order to configure 106compression algorithm to use external pre-trained dictionary, pass full 107path to the `dict` along with other parameters:: 108 109 #pass path to pre-trained zstd dictionary 110 echo "algo=zstd dict=/etc/dictionary" > /sys/block/zram0/algorithm_params 111 112 #same, but using algorithm priority 113 echo "priority=1 dict=/etc/dictionary" > \ 114 /sys/block/zram0/algorithm_params 115 116 #pass path to pre-trained zstd dictionary and compression level 117 echo "algo=zstd level=8 dict=/etc/dictionary" > \ 118 /sys/block/zram0/algorithm_params 119 120Parameters are algorithm specific: not all algorithms support pre-trained 121dictionaries, not all algorithms support `level`. Furthermore, for certain 122algorithms `level` controls the compression level (the higher the value the 123better the compression ratio, it even can take negatives values for some 124algorithms), for other algorithms `level` is acceleration level (the higher 125the value the lower the compression ratio). 126 1274) Set Disksize 128=============== 129 130Set disk size by writing the value to sysfs node 'disksize'. 131The value can be either in bytes or you can use mem suffixes. 132Examples:: 133 134 # Initialize /dev/zram0 with 50MB disksize 135 echo $((50*1024*1024)) > /sys/block/zram0/disksize 136 137 # Using mem suffixes 138 echo 256K > /sys/block/zram0/disksize 139 echo 512M > /sys/block/zram0/disksize 140 echo 1G > /sys/block/zram0/disksize 141 142Note: 143There is little point creating a zram of greater than twice the size of memory 144since we expect a 2:1 compression ratio. Note that zram uses about 0.1% of the 145size of the disk when not in use so a huge zram is wasteful. 146 1475) Set memory limit: Optional 148============================= 149 150Set memory limit by writing the value to sysfs node 'mem_limit'. 151The value can be either in bytes or you can use mem suffixes. 152In addition, you could change the value in runtime. 153Examples:: 154 155 # limit /dev/zram0 with 50MB memory 156 echo $((50*1024*1024)) > /sys/block/zram0/mem_limit 157 158 # Using mem suffixes 159 echo 256K > /sys/block/zram0/mem_limit 160 echo 512M > /sys/block/zram0/mem_limit 161 echo 1G > /sys/block/zram0/mem_limit 162 163 # To disable memory limit 164 echo 0 > /sys/block/zram0/mem_limit 165 1666) Activate 167=========== 168 169:: 170 171 mkswap /dev/zram0 172 swapon /dev/zram0 173 174 mkfs.ext4 /dev/zram1 175 mount /dev/zram1 /tmp 176 1777) Add/remove zram devices 178========================== 179 180zram provides a control interface, which enables dynamic (on-demand) device 181addition and removal. 182 183In order to add a new /dev/zramX device, perform a read operation on the hot_add 184attribute. This will return either the new device's device id (meaning that you 185can use /dev/zram<id>) or an error code. 186 187Example:: 188 189 cat /sys/class/zram-control/hot_add 190 1 191 192To remove the existing /dev/zramX device (where X is a device id) 193execute:: 194 195 echo X > /sys/class/zram-control/hot_remove 196 1978) Stats 198======== 199 200Per-device statistics are exported as various nodes under /sys/block/zram<id>/ 201 202A brief description of exported device attributes follows. For more details 203please read Documentation/ABI/testing/sysfs-block-zram. 204 205====================== ====== =============================================== 206Name access description 207====================== ====== =============================================== 208disksize RW show and set the device's disk size 209initstate RO shows the initialization state of the device 210reset WO trigger device reset 211mem_used_max WO reset the `mem_used_max` counter (see later) 212mem_limit WO specifies the maximum amount of memory ZRAM can 213 use to store the compressed data 214writeback_limit WO specifies the maximum amount of write IO zram 215 can write out to backing device as 4KB unit 216writeback_limit_enable RW show and set writeback_limit feature 217comp_algorithm RW show and change the compression algorithm 218algorithm_params WO setup compression algorithm parameters 219compact WO trigger memory compaction 220debug_stat RO this file is used for zram debugging purposes 221backing_dev RW set up backend storage for zram to write out 222idle WO mark allocated slot as idle 223====================== ====== =============================================== 224 225 226User space is advised to use the following files to read the device statistics. 227 228File /sys/block/zram<id>/stat 229 230Represents block layer statistics. Read Documentation/block/stat.rst for 231details. 232 233File /sys/block/zram<id>/io_stat 234 235The stat file represents device's I/O statistics not accounted by block 236layer and, thus, not available in zram<id>/stat file. It consists of a 237single line of text and contains the following stats separated by 238whitespace: 239 240 ============= ============================================================= 241 failed_reads The number of failed reads 242 failed_writes The number of failed writes 243 invalid_io The number of non-page-size-aligned I/O requests 244 notify_free Depending on device usage scenario it may account 245 246 a) the number of pages freed because of swap slot free 247 notifications 248 b) the number of pages freed because of 249 REQ_OP_DISCARD requests sent by bio. The former ones are 250 sent to a swap block device when a swap slot is freed, 251 which implies that this disk is being used as a swap disk. 252 253 The latter ones are sent by filesystem mounted with 254 discard option, whenever some data blocks are getting 255 discarded. 256 ============= ============================================================= 257 258File /sys/block/zram<id>/mm_stat 259 260The mm_stat file represents the device's mm statistics. It consists of a single 261line of text and contains the following stats separated by whitespace: 262 263 ================ ============================================================= 264 orig_data_size uncompressed size of data stored in this disk. 265 Unit: bytes 266 compr_data_size compressed size of data stored in this disk 267 mem_used_total the amount of memory allocated for this disk. This 268 includes allocator fragmentation and metadata overhead, 269 allocated for this disk. So, allocator space efficiency 270 can be calculated using compr_data_size and this statistic. 271 Unit: bytes 272 mem_limit the maximum amount of memory ZRAM can use to store 273 the compressed data 274 mem_used_max the maximum amount of memory zram has consumed to 275 store the data 276 same_pages the number of same element filled pages written to this disk. 277 No memory is allocated for such pages. 278 pages_compacted the number of pages freed during compaction 279 huge_pages the number of incompressible pages 280 huge_pages_since the number of incompressible pages since zram set up 281 ================ ============================================================= 282 283File /sys/block/zram<id>/bd_stat 284 285The bd_stat file represents a device's backing device statistics. It consists of 286a single line of text and contains the following stats separated by whitespace: 287 288 ============== ============================================================= 289 bd_count size of data written in backing device. 290 Unit: 4K bytes 291 bd_reads the number of reads from backing device 292 Unit: 4K bytes 293 bd_writes the number of writes to backing device 294 Unit: 4K bytes 295 ============== ============================================================= 296 2979) Deactivate 298============== 299 300:: 301 302 swapoff /dev/zram0 303 umount /dev/zram1 304 30510) Reset 306========= 307 308 Write any positive value to 'reset' sysfs node:: 309 310 echo 1 > /sys/block/zram0/reset 311 echo 1 > /sys/block/zram1/reset 312 313 This frees all the memory allocated for the given device and 314 resets the disksize to zero. You must set the disksize again 315 before reusing the device. 316 317Optional Feature 318================ 319 320writeback 321--------- 322 323With CONFIG_ZRAM_WRITEBACK, zram can write idle/incompressible page 324to backing storage rather than keeping it in memory. 325To use the feature, admin should set up backing device via:: 326 327 echo /dev/sda5 > /sys/block/zramX/backing_dev 328 329before disksize setting. It supports only partitions at this moment. 330If admin wants to use incompressible page writeback, they could do it via:: 331 332 echo huge > /sys/block/zramX/writeback 333 334To use idle page writeback, first, user need to declare zram pages 335as idle:: 336 337 echo all > /sys/block/zramX/idle 338 339From now on, any pages on zram are idle pages. The idle mark 340will be removed until someone requests access of the block. 341IOW, unless there is access request, those pages are still idle pages. 342Additionally, when CONFIG_ZRAM_TRACK_ENTRY_ACTIME is enabled pages can be 343marked as idle based on how long (in seconds) it's been since they were 344last accessed:: 345 346 echo 86400 > /sys/block/zramX/idle 347 348In this example all pages which haven't been accessed in more than 86400 349seconds (one day) will be marked idle. 350 351Admin can request writeback of those idle pages at right timing via:: 352 353 echo idle > /sys/block/zramX/writeback 354 355With the command, zram will writeback idle pages from memory to the storage. 356 357Additionally, if a user choose to writeback only huge and idle pages 358this can be accomplished with:: 359 360 echo huge_idle > /sys/block/zramX/writeback 361 362If a user chooses to writeback only incompressible pages (pages that none of 363algorithms can compress) this can be accomplished with:: 364 365 echo incompressible > /sys/block/zramX/writeback 366 367If an admin wants to write a specific page in zram device to the backing device, 368they could write a page index into the interface:: 369 370 echo "page_index=1251" > /sys/block/zramX/writeback 371 372If there are lots of write IO with flash device, potentially, it has 373flash wearout problem so that admin needs to design write limitation 374to guarantee storage health for entire product life. 375 376To overcome the concern, zram supports "writeback_limit" feature. 377The "writeback_limit_enable"'s default value is 0 so that it doesn't limit 378any writeback. IOW, if admin wants to apply writeback budget, they should 379enable writeback_limit_enable via:: 380 381 $ echo 1 > /sys/block/zramX/writeback_limit_enable 382 383Once writeback_limit_enable is set, zram doesn't allow any writeback 384until admin sets the budget via /sys/block/zramX/writeback_limit. 385 386(If admin doesn't enable writeback_limit_enable, writeback_limit's value 387assigned via /sys/block/zramX/writeback_limit is meaningless.) 388 389If admin wants to limit writeback as per-day 400M, they could do it 390like below:: 391 392 $ MB_SHIFT=20 393 $ 4K_SHIFT=12 394 $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ 395 /sys/block/zram0/writeback_limit. 396 $ echo 1 > /sys/block/zram0/writeback_limit_enable 397 398If admins want to allow further write again once the budget is exhausted, 399they could do it like below:: 400 401 $ echo $((400<<MB_SHIFT>>4K_SHIFT)) > \ 402 /sys/block/zram0/writeback_limit 403 404If an admin wants to see the remaining writeback budget since last set:: 405 406 $ cat /sys/block/zramX/writeback_limit 407 408If an admin wants to disable writeback limit, they could do:: 409 410 $ echo 0 > /sys/block/zramX/writeback_limit_enable 411 412The writeback_limit count will reset whenever you reset zram (e.g., 413system reboot, echo 1 > /sys/block/zramX/reset) so keeping how many of 414writeback happened until you reset the zram to allocate extra writeback 415budget in next setting is user's job. 416 417If admin wants to measure writeback count in a certain period, they could 418know it via /sys/block/zram0/bd_stat's 3rd column. 419 420recompression 421------------- 422 423With CONFIG_ZRAM_MULTI_COMP, zram can recompress pages using alternative 424(secondary) compression algorithms. The basic idea is that alternative 425compression algorithm can provide better compression ratio at a price of 426(potentially) slower compression/decompression speeds. Alternative compression 427algorithm can, for example, be more successful compressing huge pages (those 428that default algorithm failed to compress). Another application is idle pages 429recompression - pages that are cold and sit in the memory can be recompressed 430using more effective algorithm and, hence, reduce zsmalloc memory usage. 431 432With CONFIG_ZRAM_MULTI_COMP, zram supports up to 4 compression algorithms: 433one primary and up to 3 secondary ones. Primary zram compressor is explained 434in "3) Select compression algorithm", secondary algorithms are configured 435using recomp_algorithm device attribute. 436 437Example::: 438 439 #show supported recompression algorithms 440 cat /sys/block/zramX/recomp_algorithm 441 #1: lzo lzo-rle lz4 lz4hc [zstd] 442 #2: lzo lzo-rle lz4 [lz4hc] zstd 443 444Alternative compression algorithms are sorted by priority. In the example 445above, zstd is used as the first alternative algorithm, which has priority 446of 1, while lz4hc is configured as a compression algorithm with priority 2. 447Alternative compression algorithm's priority is provided during algorithms 448configuration::: 449 450 #select zstd recompression algorithm, priority 1 451 echo "algo=zstd priority=1" > /sys/block/zramX/recomp_algorithm 452 453 #select deflate recompression algorithm, priority 2 454 echo "algo=deflate priority=2" > /sys/block/zramX/recomp_algorithm 455 456Another device attribute that CONFIG_ZRAM_MULTI_COMP enables is recompress, 457which controls recompression. 458 459Examples::: 460 461 #IDLE pages recompression is activated by `idle` mode 462 echo "type=idle" > /sys/block/zramX/recompress 463 464 #HUGE pages recompression is activated by `huge` mode 465 echo "type=huge" > /sys/block/zram0/recompress 466 467 #HUGE_IDLE pages recompression is activated by `huge_idle` mode 468 echo "type=huge_idle" > /sys/block/zramX/recompress 469 470The number of idle pages can be significant, so user-space can pass a size 471threshold (in bytes) to the recompress knob: zram will recompress only pages 472of equal or greater size::: 473 474 #recompress all pages larger than 3000 bytes 475 echo "threshold=3000" > /sys/block/zramX/recompress 476 477 #recompress idle pages larger than 2000 bytes 478 echo "type=idle threshold=2000" > /sys/block/zramX/recompress 479 480It is also possible to limit the number of pages zram re-compression will 481attempt to recompress::: 482 483 echo "type=huge_idle max_pages=42" > /sys/block/zramX/recompress 484 485Recompression of idle pages requires memory tracking. 486 487During re-compression for every page, that matches re-compression criteria, 488ZRAM iterates the list of registered alternative compression algorithms in 489order of their priorities. ZRAM stops either when re-compression was 490successful (re-compressed object is smaller in size than the original one) 491and matches re-compression criteria (e.g. size threshold) or when there are 492no secondary algorithms left to try. If none of the secondary algorithms can 493successfully re-compressed the page such a page is marked as incompressible, 494so ZRAM will not attempt to re-compress it in the future. 495 496This re-compression behaviour, when it iterates through the list of 497registered compression algorithms, increases our chances of finding the 498algorithm that successfully compresses a particular page. Sometimes, however, 499it is convenient (and sometimes even necessary) to limit recompression to 500only one particular algorithm so that it will not try any other algorithms. 501This can be achieved by providing a `algo` or `priority` parameter::: 502 503 #use zstd algorithm only (if registered) 504 echo "type=huge algo=zstd" > /sys/block/zramX/recompress 505 506 #use zstd algorithm only (if zstd was registered under priority 1) 507 echo "type=huge priority=1" > /sys/block/zramX/recompress 508 509memory tracking 510=============== 511 512With CONFIG_ZRAM_MEMORY_TRACKING, user can know information of the 513zram block. It could be useful to catch cold or incompressible 514pages of the process with*pagemap. 515 516If you enable the feature, you could see block state via 517/sys/kernel/debug/zram/zram0/block_state". The output is as follows:: 518 519 300 75.033841 .wh... 520 301 63.806904 s..... 521 302 63.806919 ..hi.. 522 303 62.801919 ....r. 523 304 146.781902 ..hi.n 524 525First column 526 zram's block index. 527Second column 528 access time since the system was booted 529Third column 530 state of the block: 531 532 s: 533 same page 534 w: 535 written page to backing store 536 h: 537 huge page 538 i: 539 idle page 540 r: 541 recompressed page (secondary compression algorithm) 542 n: 543 none (including secondary) of algorithms could compress it 544 545First line of above example says 300th block is accessed at 75.033841sec 546and the block's state is huge so it is written back to the backing 547storage. It's a debugging feature so anyone shouldn't rely on it to work 548properly. 549 550Nitin Gupta 551ngupta@vflare.org 552