.. SPDX-License-Identifier: GPL-2.0
.. _iomap_design:

..
        Dumb style notes to maintain the author's sanity:
        Please try to start sentences on separate lines so that
        sentence changes don't bleed colors in diff.
        Heading decorations are documented in sphinx.rst.

==============
Library Design
==============

.. contents:: Table of Contents
   :local:

Introduction
============

iomap is a filesystem library for handling common file operations.
The library has two layers:

 1. A lower layer that provides an iterator over ranges of file offsets.
    This layer tries to obtain mappings of each file range to storage
    from the filesystem, but the storage information is not necessarily
    required.

 2. An upper layer that acts upon the space mappings provided by the
    lower layer iterator.

The iteration can involve mappings of a file's logical offset ranges to
physical extents, but the storage layer information is not necessarily
required, e.g. for walking cached file information.
The library exports various APIs for implementing file operations such
as:

 * Pagecache reads and writes
 * Folio write faults to the pagecache
 * Writeback of dirty folios
 * Direct I/O reads and writes
 * fsdax I/O reads, writes, loads, and stores
 * FIEMAP
 * lseek ``SEEK_DATA`` and ``SEEK_HOLE``
 * swapfile activation

The origin of this library is the file I/O path that XFS once used; it
has now been extended to cover several other operations.

Who Should Read This?
=====================

The target audience for this document is filesystem, storage, and
pagecache programmers and code reviewers.

If you are working on PCI, machine architectures, or device drivers, you
are most likely in the wrong place.

How Is This Better?
===================

Unlike the classic Linux I/O model which breaks file I/O into small
units (generally memory pages or blocks) and looks up space mappings on
the basis of that unit, the iomap model asks the filesystem for the
largest space mappings that it can create for a given file operation and
initiates operations on that basis.
This strategy improves the filesystem's visibility into the size of the
operation being performed, which enables it to combat fragmentation with
larger space allocations when possible.
Larger space mappings improve runtime performance by amortizing the cost
of mapping function calls into the filesystem across a larger amount of
data.

At a high level, an iomap operation `looks like this
<https://lore.kernel.org/all/ZGbVaewzcCysclPt@dread.disaster.area/>`_:

1. For each byte in the operation range...

   1. Obtain a space mapping via ``->iomap_begin``

   2. For each sub-unit of work...

      1. Revalidate the mapping and go back to (1) above, if necessary.
         So far only the pagecache operations need to do this.

      2. Do the work

   3. Increment operation cursor

   4. Release the mapping via ``->iomap_end``, if necessary

Each iomap operation will be covered in more detail below.
This library was covered previously by an `LWN article
<https://lwn.net/Articles/935934/>`_ and a `KernelNewbies page
<https://kernelnewbies.org/KernelProjects/iomap>`_.

The goal of this document is to provide a brief discussion of the
design and capabilities of iomap, followed by a more detailed catalog
of the interfaces presented by iomap.
If you change iomap, please update this design document.

File Range Iterator
===================

Definitions
-----------

 * **buffer head**: Shattered remnants of the old buffer cache.

 * ``fsblock``: The block size of a file, also known as ``i_blocksize``.

 * ``i_rwsem``: The VFS ``struct inode`` rwsemaphore.
   Processes hold this in shared mode to read file state and contents.
   Some filesystems may allow shared mode for writes.
   Processes often hold this in exclusive mode to change file state and
   contents.

 * ``invalidate_lock``: The pagecache ``struct address_space``
   rwsemaphore that protects against folio insertion and removal for
   filesystems that support punching out folios below EOF.
   Processes wishing to insert folios must hold this lock in shared
   mode to prevent removal, though concurrent insertion is allowed.
   Processes wishing to remove folios must hold this lock in exclusive
   mode to prevent insertions.
   Concurrent removals are not allowed.

 * ``dax_read_lock``: The RCU read lock that dax takes to prevent a
   device pre-shutdown hook from returning before other threads have
   released resources.

 * **filesystem mapping lock**: This synchronization primitive is
   internal to the filesystem and must protect the file mapping data
   from updates while a mapping is being sampled.
   The filesystem author must determine how this coordination should
   happen; it does not need to be an actual lock.

 * **iomap internal operation lock**: This is a general term for
   synchronization primitives that iomap functions take while holding a
   mapping.
   A specific example would be taking the folio lock while reading or
   writing the pagecache.

 * **pure overwrite**: A write operation that does not require any
   metadata or zeroing operations to perform during either submission
   or completion.
   This implies that the filesystem must have already allocated space
   on disk as ``IOMAP_MAPPED`` and the filesystem must not place any
   constraints on IO alignment or size.
   The only constraints on I/O alignment are device level (minimum I/O
   size and alignment, typically sector size).

``struct iomap``
----------------

The filesystem communicates to the iomap iterator the mapping of
byte ranges of a file to byte ranges of a storage device with the
structure below:

.. code-block:: c

 struct iomap {
     u64                 addr;
     loff_t              offset;
     u64                 length;
     u16                 type;
     u16                 flags;
     struct block_device *bdev;
     struct dax_device   *dax_dev;
     void                *inline_data;
     void                *private;
     const struct iomap_folio_ops *folio_ops;
     u64                 validity_cookie;
 };

The fields are as follows:

 * ``offset`` and ``length`` describe the range of file offsets, in
   bytes, covered by this mapping.
   These fields must always be set by the filesystem.

 * ``type`` describes the type of the space mapping:

   * **IOMAP_HOLE**: No storage has been allocated.
     This type must never be returned in response to an ``IOMAP_WRITE``
     operation because writes must allocate and map space, and return
     the mapping.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.
     iomap does not support writing (whether via pagecache or direct
     I/O) to a hole.

   * **IOMAP_DELALLOC**: A promise to allocate space at a later time
     ("delayed allocation").
     If the filesystem returns ``IOMAP_F_NEW`` here and the write fails, the
     ``->iomap_end`` function must delete the reservation.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

   * **IOMAP_MAPPED**: The file range maps to specific space on the
     storage device.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.

   * **IOMAP_UNWRITTEN**: The file range maps to specific space on the
     storage device, but the space has not yet been initialized.
     The device is returned in ``bdev`` or ``dax_dev``.
     The device address, in bytes, is returned via ``addr``.
     Reads from this type of mapping will return zeroes to the caller.
     For a write or writeback operation, the ioend should update the
     mapping to ``IOMAP_MAPPED``.
     Refer to the sections about ioends for more details.

   * **IOMAP_INLINE**: The file range maps to the memory buffer
     specified by ``inline_data``.
     For a write operation, the ``->iomap_end`` function presumably
     handles persisting the data.
     The ``addr`` field must be set to ``IOMAP_NULL_ADDR``.

 * ``flags`` describe the status of the space mapping.
   These flags should be set by the filesystem in ``->iomap_begin``:

   * **IOMAP_F_NEW**: The space under the mapping is newly allocated.
     Areas that will not be written to must be zeroed.
     If a write fails and the mapping is a space reservation, the
     reservation must be deleted.

   * **IOMAP_F_DIRTY**: The inode will have uncommitted metadata needed
     to access any data written.
     fdatasync is required to commit these changes to persistent
     storage.
     This needs to take into account metadata changes that *may* be made
     at I/O completion, such as file size updates from direct I/O.

   * **IOMAP_F_SHARED**: The space under the mapping is shared.
     Copy on write is necessary to avoid corrupting other file data.

   * **IOMAP_F_BUFFER_HEAD**: This mapping requires the use of buffer
     heads for pagecache operations.
     Do not add more uses of this.

   * **IOMAP_F_MERGED**: Multiple contiguous block mappings were
     coalesced into this single mapping.
     This is only useful for FIEMAP.

   * **IOMAP_F_XATTR**: The mapping is for extended attribute data, not
     regular file data.
     This is only useful for FIEMAP.

   * **IOMAP_F_PRIVATE**: Starting with this value, the upper bits can
     be set by the filesystem for its own purposes.

   * **IOMAP_F_ANON_WRITE**: Indicates that (write) I/O does not have a target
     block assigned to it yet and the file system will do that in the bio
     submission handler, splitting the I/O as needed.

   These flags can be set by iomap itself during file operations.
   The filesystem should supply an ``->iomap_end`` function if it needs
   to observe these flags:

   * **IOMAP_F_SIZE_CHANGED**: The file size has changed as a result of
     using this mapping.

   * **IOMAP_F_STALE**: The mapping was found to be stale.
     iomap will call ``->iomap_end`` on this mapping and then
     ``->iomap_begin`` to obtain a new mapping.

   Currently, these flags are only set by pagecache operations.

 * ``addr`` describes the device address, in bytes.

 * ``bdev`` describes the block device for this mapping.
   This only needs to be set for mapped or unwritten operations.

 * ``dax_dev`` describes the DAX device for this mapping.
   This only needs to be set for mapped or unwritten operations, and
   only for an fsdax operation.

 * ``inline_data`` points to a memory buffer for I/O involving
   ``IOMAP_INLINE`` mappings.
   This value is ignored for all other mapping types.

 * ``private`` is a pointer to `filesystem-private information
   <https://lore.kernel.org/all/20180619164137.13720-7-hch@lst.de/>`_.
   This value will be passed unchanged to ``->iomap_end``.

 * ``folio_ops`` will be covered in the section on pagecache operations.

 * ``validity_cookie`` is a magic freshness value set by the filesystem
   that should be used to detect stale mappings.
   For pagecache operations this is critical for correct operation
   because page faults can occur, which implies that filesystem locks
   should not be held between ``->iomap_begin`` and ``->iomap_end``.
   Filesystems with completely static mappings need not set this value.
   Only pagecache operations revalidate mappings; see the section about
   ``iomap_valid`` for details.
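
One way to picture the ``validity_cookie`` mechanism is the userspace
model below.
It is only a sketch, not kernel code: it assumes the filesystem keeps a
per-file sequence counter that it bumps on every mapping change, which
is one possible scheme among many.
``struct model_iomap`` mirrors just a few fields of ``struct iomap``,
and every ``model_*`` name, including the ``map_seq`` counter, is
hypothetical.

.. code-block:: c

 #include <assert.h>
 #include <stdbool.h>
 #include <stdint.h>

 /* Trimmed-down userspace stand-in for a few struct iomap fields. */
 struct model_iomap {
     uint64_t addr;
     int64_t  offset;
     uint64_t length;
     uint64_t validity_cookie;
 };

 /* Hypothetical per-file state: a single extent plus a change counter. */
 struct model_inode {
     uint64_t disk_addr;
     uint64_t size;
     uint64_t map_seq;   /* assumption: bumped on every mapping change */
 };

 /*
  * Sample the mapping and record the current counter as the cookie.
  * A real filesystem would hold its mapping lock while doing this.
  */
 void model_sample_mapping(const struct model_inode *ip,
                           struct model_iomap *iomap)
 {
     iomap->offset = 0;
     iomap->length = ip->size;
     iomap->addr = ip->disk_addr;
     iomap->validity_cookie = ip->map_seq;
 }

 /* Model a mapping change that races with a user of an older sample. */
 void model_remap(struct model_inode *ip, uint64_t new_addr)
 {
     ip->disk_addr = new_addr;
     ip->map_seq++;
 }

 /* A sampled mapping is stale once the counter has moved on. */
 bool model_mapping_is_valid(const struct model_inode *ip,
                             const struct model_iomap *iomap)
 {
     assert(ip && iomap);
     return iomap->validity_cookie == ip->map_seq;
 }

In this model, a caller that finds the mapping invalid would discard it
and sample again, which corresponds to the ``IOMAP_F_STALE`` handling
described above.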

``struct iomap_ops``
--------------------

Every iomap function requires the filesystem to pass an operations
structure to obtain a mapping and (optionally) to release the mapping:

.. code-block:: c

 struct iomap_ops {
     int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
                        unsigned flags, struct iomap *iomap,
                        struct iomap *srcmap);

     int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
                      ssize_t written, unsigned flags,
                      struct iomap *iomap);
 };

``->iomap_begin``
~~~~~~~~~~~~~~~~~

iomap operations call ``->iomap_begin`` to obtain one file mapping for
the range of bytes specified by ``pos`` and ``length`` for the file
``inode``.
This mapping should be returned through the ``iomap`` pointer.
The mapping must cover at least the first byte of the supplied file
range, but it does not need to cover the entire requested range.

Each iomap operation describes the requested operation through the
``flags`` argument.
The exact value of ``flags`` will be documented in the
operation-specific sections below.
These flags can, at least in principle, apply generally to iomap
operations:

 * ``IOMAP_DIRECT`` is set when the caller wishes to issue file I/O to
   block storage.

 * ``IOMAP_DAX`` is set when the caller wishes to issue file I/O to
   memory-like storage.

 * ``IOMAP_NOWAIT`` is set when the caller wishes to perform a best
   effort attempt to avoid any operation that would result in blocking
   the submitting task.
   This is similar in intent to ``O_NONBLOCK`` for network APIs - it is
   intended for asynchronous applications to keep doing other work
   instead of waiting for the specific unavailable filesystem resource
   to become available.
   Filesystems implementing ``IOMAP_NOWAIT`` semantics need to use
   trylock algorithms.
   They need to be able to satisfy the entire I/O request range with a
   single iomap mapping.
   They need to avoid reading or writing metadata synchronously.
   They need to avoid blocking memory allocations.
   They need to avoid waiting on transaction reservations to allow
   modifications to take place.
   They probably should not be allocating new space.
   And so on.
   If there is any doubt in the filesystem developer's mind as to
   whether any specific ``IOMAP_NOWAIT`` operation may end up blocking,
   then they should return ``-EAGAIN`` as early as possible rather than
   start the operation and force the submitting task to block.
   ``IOMAP_NOWAIT`` is often set on behalf of ``IOCB_NOWAIT`` or
   ``RWF_NOWAIT``.

 * ``IOMAP_DONTCACHE`` is set when the caller wishes to perform
   buffered file I/O and would like the kernel to drop the pagecache
   after the I/O completes, if it isn't already being used by another
   thread.

If it is necessary to read existing file contents from a `different
<https://lore.kernel.org/all/20191008071527.29304-9-hch@lst.de/>`_
device or address range on a device, the filesystem should return that
information via ``srcmap``.
Only pagecache and fsdax operations support reading from one mapping and
writing to another.

``->iomap_end``
~~~~~~~~~~~~~~~

After the operation completes, the ``->iomap_end`` function, if present,
is called to signal that iomap is finished with a mapping.
Typically, implementations will use this function to tear down any
context that was set up in ``->iomap_begin``.
For example, a write might wish to commit the reservations for the bytes
that were operated upon and unreserve any space that was not operated
upon.
``written`` might be zero if no bytes were touched.
``flags`` will contain the same value passed to ``->iomap_begin``.
iomap ops for reads are not likely to need to supply this function.
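
The interplay between the two callbacks can be sketched with a small
userspace model.
This is not the kernel iterator: the signatures are simplified, every
``model_*`` name is hypothetical, and the model assumes that each
mapping is fully processed before it is released.
It only illustrates that a mapping must cover the first requested byte,
may stop short of the requested range, and is released through an
optional end callback.

.. code-block:: c

 #include <assert.h>
 #include <stddef.h>
 #include <stdint.h>

 /* Simplified stand-in for struct iomap: just the file range. */
 struct model_iomap {
     int64_t  offset;
     uint64_t length;
 };

 /* Simplified stand-in for struct iomap_ops; ->end may be NULL. */
 struct model_ops {
     int (*begin)(void *fs, int64_t pos, int64_t length, unsigned flags,
                  struct model_iomap *iomap);
     int (*end)(void *fs, int64_t pos, int64_t length, int64_t written,
                unsigned flags, struct model_iomap *iomap);
 };

 /*
  * Walk [pos, pos + length) one mapping at a time.
  * Returns the number of bytes processed, or a negative errno.
  */
 int64_t model_iterate(void *fs, int64_t pos, int64_t length,
                       unsigned flags, const struct model_ops *ops)
 {
     int64_t done = 0;

     while (done < length) {
         struct model_iomap iomap = { 0, 0 };
         int64_t cur = pos + done;
         int64_t remaining = length - done;
         int64_t chunk;
         int ret;

         ret = ops->begin(fs, cur, remaining, flags, &iomap);
         if (ret < 0)
             return ret;

         /* The mapping must cover at least the first requested byte... */
         assert(iomap.offset <= cur);
         assert(cur < iomap.offset + (int64_t)iomap.length);

         /* ...but it may end before the requested range does. */
         chunk = iomap.offset + (int64_t)iomap.length - cur;
         if (chunk > remaining)
             chunk = remaining;

         /* "Do the work" would go here; this model only counts bytes. */

         if (ops->end != NULL) {
             ret = ops->end(fs, cur, chunk, chunk, flags, &iomap);
             if (ret < 0)
                 return ret;
         }
         done += chunk;
     }
     return done;
 }

 /* Hypothetical filesystem whose mappings never cross a 4096-byte boundary. */
 struct model_fs {
     int     begin_calls;
     int64_t bytes_ended;
 };

 int model_fs_begin(void *fs, int64_t pos, int64_t length, unsigned flags,
                    struct model_iomap *iomap)
 {
     struct model_fs *mfs = fs;

     (void)length;
     (void)flags;
     mfs->begin_calls++;
     iomap->offset = pos - (pos % 4096);
     iomap->length = 4096;
     return 0;
 }

 int model_fs_end(void *fs, int64_t pos, int64_t length, int64_t written,
                  unsigned flags, struct model_iomap *iomap)
 {
     struct model_fs *mfs = fs;

     (void)pos;
     (void)length;
     (void)flags;
     (void)iomap;
     mfs->bytes_ended += written;
     return 0;
 }

With this model filesystem, a request for 10000 bytes starting at
offset 1000 is served by three mappings, because each mapping stops at
the next 4096-byte boundary.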

Both ``->iomap_begin`` and ``->iomap_end`` should return a negative
errno code on error, or zero on success.

Preparing for File Operations
=============================

iomap only handles mapping and I/O.
Filesystems must still call out to the VFS to check input parameters
and file state before initiating an I/O operation.
iomap does not handle obtaining filesystem freeze protection, updating of
timestamps, stripping privileges, or access control.

Locking Hierarchy
=================

iomap requires that filesystems supply their own locking model.
There are three categories of synchronization primitives, as far as
iomap is concerned:

 * The **upper** level primitive is provided by the filesystem to
   coordinate access to different iomap operations.
   The exact primitive is specific to the filesystem and operation,
   but is often a VFS inode, pagecache invalidation, or folio lock.
   For example, a filesystem might take ``i_rwsem`` before calling
   ``iomap_file_buffered_write`` and ``iomap_file_unshare`` to prevent
   these two file operations from clobbering each other.
   Pagecache writeback may lock a folio to prevent other threads from
   accessing the folio until writeback is underway.

 * The **lower** level primitive is taken by the filesystem in the
   ``->iomap_begin`` and ``->iomap_end`` functions to coordinate
   access to the file space mapping information.
   The fields of the iomap object should be filled out while holding
   this primitive.
   The upper level synchronization primitive, if any, remains held
   while acquiring the lower level synchronization primitive.
   For example, XFS takes ``ILOCK_EXCL`` and ext4 takes ``i_data_sem``
   while sampling mappings.
   Filesystems with immutable mapping information may not require
   synchronization here.

 * The **operation** primitive is taken by an iomap operation to
   coordinate access to its own internal data structures.
   The upper level synchronization primitive, if any, remains held
   while acquiring this primitive.
   The lower level primitive is not held while acquiring this
   primitive.
   For example, pagecache write operations will obtain a file mapping,
   then grab and lock a folio to copy new contents.
   It may also lock an internal folio state object to update metadata.

The exact locking requirements are specific to the filesystem; for
certain operations, some of these locks can be elided.
All further mentions of locking are *recommendations*, not mandates.
Each filesystem author must figure out the locking for themself.

Bugs and Limitations
====================

 * No support for fscrypt.
 * No support for compression.
 * No support for fsverity yet.
 * Strong assumptions that IO should work the way it does on XFS.
 * Does iomap *actually* work for non-regular file data?

Patches welcome!