.. SPDX-License-Identifier: GPL-2.0

===================================
Network Filesystem Services Library
===================================

.. Contents:

 - Overview.
 - Requests and streams.
 - Subrequests.
 - Result collection and retry.
 - Local caching.
 - Content encryption (fscrypt).
 - Per-inode context.
   - Inode context helper functions.
   - Inode locking.
   - Inode writeback.
 - High-level VFS API.
   - Unlocked read/write iter.
   - Pre-locked read/write iter.
   - Memory-mapped I/O API.
   - Monolithic files API.
 - High-level VM API.
   - Deprecated PG_private_2 API.
 - I/O request API.
   - Request structure.
   - Stream structure.
   - Subrequest structure.
   - Filesystem methods.
   - Terminating a subrequest.
 - Local cache API.
 - API function reference.


Overview
========

The network filesystem services library, netfslib, is a set of functions
designed to aid a network filesystem in implementing VM/VFS API operations.
It takes over the normal buffered read, readahead, write and writeback paths
and also handles unbuffered and direct I/O.

The library provides support for (re-)negotiation of I/O sizes and for
retrying failed I/O, as well as local caching, and will, in the future,
provide content encryption.

It insulates the filesystem from VM interface changes as much as possible and
handles VM features such as large multipage folios.  The filesystem basically
just has to provide a way to perform read and write RPC calls.

I/O inside netfslib is organised around a number of objects:

 * A *request*.  A request is used to track the progress of the I/O overall
   and to hold on to resources.  The collection of results is done at the
   request level.  The I/O within a request is divided into a number of
   parallel streams of subrequests.

 * A *stream*.  A non-overlapping series of subrequests.  The subrequests
   within a stream do not have to be contiguous.
 * A *subrequest*.  This is the basic unit of I/O.  It represents a single
   RPC call or a single cache I/O operation.  The library passes these to the
   filesystem and the cache to perform.

Requests and Streams
--------------------

When actually performing I/O (as opposed to just copying into the pagecache),
netfslib will create one or more requests to track the progress of the I/O
and to hold resources.

A read operation will have a single stream and the subrequests within that
stream may be of mixed origins, for instance mixing RPC subrequests and cache
subrequests.

On the other hand, a write operation may have multiple streams, where each
stream targets a different destination.  For instance, there may be one
stream writing to the local cache and one to the server.  Currently, only two
streams are allowed, but this could be increased if parallel writes to
multiple servers are desired.

The subrequests within a write stream do not need to match alignment or size
with the subrequests in another write stream, and netfslib performs the
tiling of subrequests in each stream over the source buffer independently.
Further, each stream may contain holes that don't correspond to holes in the
other stream.

In addition, the subrequests do not need to correspond to the boundaries of
the folios or vectors in the source/destination buffer.  The library handles
the collection of results and the wrangling of folio flags and references.

Subrequests
-----------

Subrequests are at the heart of the interaction between netfslib and the
filesystem using it.  Each subrequest is expected to correspond to a single
read or write RPC or cache operation.  The library will stitch together the
results from a set of subrequests to provide a higher level operation.

Netfslib has two interactions with the filesystem or the cache when setting
up a subrequest.
First, there's an optional preparatory step that allows the filesystem to
negotiate the limits on the subrequest, both in terms of the maximum number
of bytes and the maximum number of vectors (e.g. for RDMA).  This may involve
negotiating with the server (e.g. cifs needing to acquire credits).

Secondly, there's the issuing step in which the subrequest is handed off to
the filesystem to perform.

Note that these two steps are done slightly differently between read and
write:

 * For reads, the VM/VFS tells us how much is being requested up front, so
   the library can preset maximum values that the cache and then the
   filesystem can reduce.  The cache also gets consulted first on whether it
   wants to do a read before the filesystem is consulted.

 * For writeback, it is unknown how much there will be to write until the
   pagecache is walked, so no limit is set by the library.

Once a subrequest is completed, the filesystem or cache informs the library
of the completion and then collection is invoked.  Depending on whether the
request is synchronous or asynchronous, the collection of results will be
done in either the application thread or in a work queue.

Result Collection and Retry
---------------------------

As subrequests complete, the results are collected and collated by the
library and folio unlocking is performed progressively (if appropriate).
Once the request is complete, async completion will be invoked (again, if
appropriate).  It is possible for the filesystem to provide interim progress
reports to the library to cause folio unlocking to happen earlier if
possible.

If any subrequests fail, netfslib can retry them.  It will wait until all
subrequests are completed, offer the filesystem the opportunity to fiddle
with the resources/state held by the request and poke at the subrequests
before re-preparing and re-issuing the subrequests.
This allows the tiling of contiguous sets of failed subrequests within a
stream to be changed, adding more subrequests or ditching excess ones as
necessary (for instance, if the network sizes change or the server decides it
wants smaller chunks).

Further, if one or more contiguous cache-read subrequests fail, the library
will pass them to the filesystem to perform instead, renegotiating and
retiling them as necessary to fit with the filesystem's parameters rather
than those of the cache.

Local Caching
-------------

One of the services netfslib provides, via ``fscache``, is the option to
cache on local disk a copy of the data obtained from/written to a network
filesystem.  The library will manage the storing, retrieval and some
invalidation of data automatically on behalf of the filesystem if a cookie is
attached to the ``netfs_inode``.

Note that local caching used to use the PG_private_2 flag (aliased as
PG_fscache) to keep track of a page that was being written to the cache, but
this is now deprecated as PG_private_2 will be removed.

Instead, folios that are read from the server for which there was no data in
the cache will be marked as dirty and will have ``folio->private`` set to a
special value (``NETFS_FOLIO_COPY_TO_CACHE``) and left for writeback to
write.  If the folio is modified before that happens, the special value will
be cleared and the folio will become normally dirty.

When writeback occurs, folios that are so marked will only be written to the
cache and not to the server.  Writeback handles mixed cache-only writes and
server-and-cache writes by using two streams, sending one to the cache and
one to the server.  The server stream will have gaps in it corresponding to
those folios.
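
Attaching the cookie itself is filesystem-specific.  As a minimal sketch
(assuming fscache support is built in and a cookie has already been acquired
elsewhere; the helper name is hypothetical)::

    /* Hypothetical helper: enable local caching on an inode by attaching
     * an already-acquired fscache cookie to the netfs context.  Note that
     * the ->cache field only exists when fscache support is enabled. */
    static void myfs_set_cache(struct netfs_inode *ictx,
                               struct fscache_cookie *cookie)
    {
        ictx->cache = cookie;
    }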
Content Encryption (fscrypt)
----------------------------

Though it does not do so yet, at some point netfslib will acquire the ability
to do client-side content encryption on behalf of the network filesystem
(Ceph, for example).  fscrypt can be used for this if appropriate (it may not
be - cifs, for example).

The data will be stored encrypted in the local cache using the same manner of
encryption as the data written to the server, and the library will impose
bounce buffering and RMW cycles as necessary.


Per-Inode Context
=================

The network filesystem helper library needs a place to store a bit of state
for its use on each netfs inode it is helping to manage.  To this end, a
context structure is defined::

    struct netfs_inode {
        struct inode inode;
        const struct netfs_request_ops *ops;
        struct fscache_cookie *cache;
        loff_t remote_i_size;
        unsigned long flags;
        ...
    };

A network filesystem that wants to use netfslib must place one of these in
its inode wrapper struct instead of the VFS ``struct inode``.  This can be
done in a way similar to the following::

    struct my_inode {
        struct netfs_inode netfs; /* Netfslib context and vfs inode */
        ...
    };

This allows netfslib to find its state by using ``container_of()`` from the
inode pointer, thereby allowing the netfslib helper functions to be pointed
to directly by the VFS/VM operation tables.

The structure contains the following fields that are of interest to the
filesystem:

 * ``inode``

   The VFS inode structure.

 * ``ops``

   The set of operations provided by the network filesystem to netfslib.

 * ``cache``

   Local caching cookie, or NULL if no caching is enabled.  This field does
   not exist if fscache is disabled.

 * ``remote_i_size``

   The size of the file on the server.
   This differs from inode->i_size if local modifications have been made but
   not yet written back.

 * ``flags``

   A set of flags, some of which the filesystem might be interested in:

   * ``NETFS_ICTX_MODIFIED_ATTR``

     Set if netfslib modifies mtime/ctime.  The filesystem is free to ignore
     this or clear it.

   * ``NETFS_ICTX_UNBUFFERED``

     Do unbuffered I/O upon the file.  This is like direct I/O, but without
     the alignment limitations.  RMW will be performed if necessary.  The
     pagecache will not be used unless mmap() is also used.

   * ``NETFS_ICTX_WRITETHROUGH``

     Do writethrough caching upon the file.  I/O will be set up and
     dispatched as buffered writes are made to the pagecache.  mmap() does
     the normal writeback thing.

   * ``NETFS_ICTX_SINGLE_NO_UPLOAD``

     Set if the file has monolithic content that must be read entirely in a
     single go and must not be written back to the server, though it can be
     cached (e.g. AFS directories).

Inode Context Helper Functions
------------------------------

To help deal with the per-inode context, a number of helper functions are
provided.
Firstly, there is a function to perform basic initialisation on a context and
set the operations table pointer::

    void netfs_inode_init(struct netfs_inode *ctx,
                          const struct netfs_request_ops *ops);

then a function to cast from the VFS inode structure to the netfs context::

    struct netfs_inode *netfs_inode(struct inode *inode);

and finally, a function to get the cache cookie pointer from the context
attached to an inode (or NULL if fscache is disabled)::

    struct fscache_cookie *netfs_i_cookie(struct netfs_inode *ctx);

Inode Locking
-------------

A number of functions are provided to manage the locking of i_rwsem for I/O
and to effectively extend it to provide more separate classes of exclusion::

    int netfs_start_io_read(struct inode *inode);
    void netfs_end_io_read(struct inode *inode);
    int netfs_start_io_write(struct inode *inode);
    void netfs_end_io_write(struct inode *inode);
    int netfs_start_io_direct(struct inode *inode);
    void netfs_end_io_direct(struct inode *inode);

The exclusion breaks down into four separate classes:

 1) Buffered reads and writes.

    Buffered reads can run concurrently with each other and with buffered
    writes, but buffered writes cannot run concurrently with each other.

 2) Direct reads and writes.

    Direct (and unbuffered) reads and writes can run concurrently since they
    do not share local buffering (i.e. the pagecache) and, in a network
    filesystem, are expected to have exclusion managed on the server (though
    this may not be the case for, say, Ceph).

 3) Other major inode modifying operations (e.g. truncate, fallocate).

    These should just access i_rwsem directly.

 4) mmap().

    mmap'd accesses might operate concurrently with any of the other classes.
    They might form the buffer for an intra-file loopback DIO read/write.
    They might be permitted on unbuffered files.
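
As an illustration, a filesystem read path that needs to do its own work
under the buffered-read exclusion might wrap the helpers as in this minimal
sketch (the function name is hypothetical)::

    static ssize_t myfs_read_iter(struct kiocb *iocb, struct iov_iter *iter)
    {
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        ret = netfs_start_io_read(inode);
        if (ret < 0)
            return ret;
        /* Filesystem-specific work under the read exclusion goes here. */
        ret = filemap_read(iocb, iter, 0);
        netfs_end_io_read(inode);
        return ret;
    }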
Inode Writeback
---------------

Netfslib will pin resources on an inode for future writeback (such as pinning
use of an fscache cookie) when an inode is dirtied.  However, this pinning
needs careful management.  To manage the pinning, the following sequence
occurs:

 1) An inode state flag ``I_PINNING_NETFS_WB`` is set by netfslib when the
    pinning begins (when a folio is dirtied, for example) if the cache is
    active, to stop the cache structures from being discarded and the cache
    space from being culled.  This also prevents re-getting of cache
    resources if the flag is already set.

 2) This flag is then cleared inside the inode lock during inode writeback in
    the VM - and the fact that it was set is transferred to
    ``->unpinned_netfs_wb`` in ``struct writeback_control``.

 3) If ``->unpinned_netfs_wb`` is now set, the write_inode procedure is
    forced.

 4) The filesystem's ``->write_inode()`` function is invoked to do the
    cleanup.

 5) The filesystem invokes netfs to do its cleanup.

To do the cleanup, netfslib provides a function to do the resource
unpinning::

    int netfs_unpin_writeback(struct inode *inode, struct writeback_control *wbc);

If the filesystem doesn't need to do anything else, this may be set as its
``.write_inode`` method.

Further, if an inode is deleted, the filesystem's write_inode method may not
get called, so::

    void netfs_clear_inode_writeback(struct inode *inode, const void *aux);

must be called from ``->evict_inode()`` *before* ``clear_inode()`` is called.


High-Level VFS API
==================

Netfslib provides a number of sets of API calls for the filesystem to
delegate VFS operations to.  Netfslib, in turn, will call out to the
filesystem and the cache to negotiate I/O sizes, issue RPCs and provide
places for it to intervene at various times.
Unlocked Read/Write Iter
------------------------

The first API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS read/write_iter methods::

    ssize_t netfs_file_read_iter(struct kiocb *iocb, struct iov_iter *iter);
    ssize_t netfs_file_write_iter(struct kiocb *iocb, struct iov_iter *from);
    ssize_t netfs_buffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
    ssize_t netfs_unbuffered_read_iter(struct kiocb *iocb, struct iov_iter *iter);
    ssize_t netfs_unbuffered_write_iter(struct kiocb *iocb, struct iov_iter *from);

They can be assigned directly to ``.read_iter`` and ``.write_iter``.  They
perform the inode locking themselves and the first two will switch between
buffered I/O and DIO as appropriate.

Pre-Locked Read/Write Iter
--------------------------

The second API set is for the delegation of operations to netfslib when the
filesystem is called through the standard VFS methods, but needs to do some
other stuff before or after calling netfslib whilst still inside the locked
section (e.g. Ceph negotiating caps).  The unbuffered read function is::

    ssize_t netfs_unbuffered_read_iter_locked(struct kiocb *iocb, struct iov_iter *iter);

This must not be assigned directly to ``.read_iter`` and the filesystem is
responsible for performing the inode locking before calling it.  In the case
of buffered read, the filesystem should use ``filemap_read()``.
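
For example, the locked unbuffered read might be used like this minimal
sketch, assuming the filesystem must negotiate server state inside the
locked section (the function and the ``myfs_get_caps()`` helper are
hypothetical)::

    static ssize_t myfs_direct_read(struct kiocb *iocb, struct iov_iter *iter)
    {
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        ret = netfs_start_io_direct(inode);
        if (ret < 0)
            return ret;
        ret = myfs_get_caps(inode); /* e.g. negotiate caps with the server */
        if (ret == 0)
            ret = netfs_unbuffered_read_iter_locked(iocb, iter);
        netfs_end_io_direct(inode);
        return ret;
    }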
There are three functions for writes::

    ssize_t netfs_buffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *from,
                                             struct netfs_group *netfs_group);
    ssize_t netfs_perform_write(struct kiocb *iocb, struct iov_iter *iter,
                                struct netfs_group *netfs_group);
    ssize_t netfs_unbuffered_write_iter_locked(struct kiocb *iocb, struct iov_iter *iter,
                                               struct netfs_group *netfs_group);

These must not be assigned directly to ``.write_iter`` and the filesystem is
responsible for performing the inode locking before calling them.

The first two functions are for buffered writes; the first just adds some
standard write checks and jumps to the second, but if the filesystem wants to
do the checks itself, it can use the second directly.  The third function is
for unbuffered or DIO writes.

All three write functions take a writeback group pointer (which should be
NULL if the filesystem doesn't use writeback groups).  Writeback groups are
set on folios when they're modified.  If a folio to-be-modified is already
marked with a different group, it is flushed first.  The writeback API allows
writing back a specific group.

Memory-Mapped I/O API
---------------------

An API for support of mmap()'d I/O is provided::

    vm_fault_t netfs_page_mkwrite(struct vm_fault *vmf, struct netfs_group *netfs_group);

This allows the filesystem to delegate ``.page_mkwrite`` to netfslib.  The
filesystem should not take the inode lock before calling it, but, as with the
locked write functions above, this does take a writeback group pointer.  If
the page to be made writable is in a different group, it will be flushed
first.

Monolithic Files API
--------------------

There is also a special API set for files for which the content must be read
in a single RPC (and not written back) and is maintained as a monolithic blob
(e.g.
an AFS directory), though it can be stored and updated in the local cache::

    ssize_t netfs_read_single(struct inode *inode, struct file *file, struct iov_iter *iter);
    void netfs_single_mark_inode_dirty(struct inode *inode);
    int netfs_writeback_single(struct address_space *mapping,
                               struct writeback_control *wbc,
                               struct iov_iter *iter);

The first function reads from a file into the given buffer, reading from the
cache in preference if the data is cached there; the second function allows
the inode to be marked dirty, causing a later writeback; and the third
function can be called from the writeback code to write the data to the
cache, if there is one.

The inode should be marked ``NETFS_ICTX_SINGLE_NO_UPLOAD`` if this API is to
be used.  The writeback function requires the buffer to be of ITER_FOLIOQ
type.

High-Level VM API
=================

Netfslib also provides a number of sets of API calls for the filesystem to
delegate VM operations to.  Again, netfslib, in turn, will call out to the
filesystem and the cache to negotiate I/O sizes, issue RPCs and provide
places for it to intervene at various times::

    void netfs_readahead(struct readahead_control *);
    int netfs_read_folio(struct file *, struct folio *);
    int netfs_writepages(struct address_space *mapping,
                         struct writeback_control *wbc);
    bool netfs_dirty_folio(struct address_space *mapping, struct folio *folio);
    void netfs_invalidate_folio(struct folio *folio, size_t offset, size_t length);
    bool netfs_release_folio(struct folio *folio, gfp_t gfp);

These are ``address_space_operations`` methods and can be set directly in the
operations table.
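
For instance, the helpers might be wired up directly like this minimal
sketch (the table name is hypothetical and any other methods the filesystem
needs are elided)::

    const struct address_space_operations myfs_aops = {
        .readahead        = netfs_readahead,
        .read_folio       = netfs_read_folio,
        .writepages       = netfs_writepages,
        .dirty_folio      = netfs_dirty_folio,
        .invalidate_folio = netfs_invalidate_folio,
        .release_folio    = netfs_release_folio,
        /* ... plus whatever else the filesystem requires ... */
    };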
Deprecated PG_private_2 API
---------------------------

There is also a deprecated function for filesystems that still use the
``->write_begin`` method::

    int netfs_write_begin(struct netfs_inode *inode, struct file *file,
                          struct address_space *mapping, loff_t pos, unsigned int len,
                          struct folio **_folio, void **_fsdata);

It uses the deprecated PG_private_2 flag and so should not be used.


I/O Request API
===============

The I/O request API comprises a number of structures and a number of
functions that the filesystem may need to use.

Request Structure
-----------------

The request structure manages the request as a whole, holding some resources
and state on behalf of the filesystem and tracking the collection of
results::

    struct netfs_io_request {
        enum netfs_io_origin origin;
        struct inode *inode;
        struct address_space *mapping;
        struct netfs_group *group;
        struct netfs_io_stream io_streams[NR_IO_STREAMS];
        void *netfs_priv;
        void *netfs_priv2;
        unsigned long long start;
        unsigned long long len;
        unsigned long long i_size;
        unsigned int debug_id;
        unsigned long flags;
        ...
    };

Many of the fields are for internal use, but the fields shown here are of
interest to the filesystem:

 * ``origin``

   The origin of the request (readahead, read_folio, DIO read, writeback,
   ...).

 * ``inode``
 * ``mapping``

   The inode and the address space of the file being read from.  The mapping
   may or may not point to inode->i_data.

 * ``group``

   The writeback group this request is dealing with or NULL.  This holds a
   ref on the group.

 * ``io_streams``

   The parallel streams of subrequests available to the request.  Currently
   two are available, but this may be made extensible in future.
   ``NR_IO_STREAMS`` indicates the size of the array.
 * ``netfs_priv``
 * ``netfs_priv2``

   The network filesystem's private data.  The value for this can be passed
   in to the helper functions or set during the request.

 * ``start``
 * ``len``

   The file position of the start of the read request and the length.  These
   may be altered by the ->expand_readahead() op.

 * ``i_size``

   The size of the file at the start of the request.

 * ``debug_id``

   A number allocated to this operation that can be displayed in trace lines
   for reference.

 * ``flags``

   Flags for managing and controlling the operation of the request.  Some of
   these may be of interest to the filesystem:

   * ``NETFS_RREQ_RETRYING``

     Netfslib sets this when generating retries.

   * ``NETFS_RREQ_PAUSE``

     The filesystem can set this to ask the library to pause its subrequest
     issuing loop - but care needs to be taken as netfslib may also set it.

   * ``NETFS_RREQ_NONBLOCK``
   * ``NETFS_RREQ_BLOCKED``

     Netfslib sets the first to indicate that non-blocking mode was set by
     the caller and the filesystem can set the second to indicate that it
     would have had to block.

   * ``NETFS_RREQ_USE_PGPRIV2``

     The filesystem can set this if it wants to use PG_private_2 to track
     whether a folio is being written to the cache.  This is deprecated as
     PG_private_2 is going to go away.

If the filesystem wants more private data than is afforded by this structure,
then it should wrap it and provide its own allocator.

Stream Structure
----------------

A request is comprised of one or more parallel streams and each stream may be
aimed at a different target.

For read requests, only stream 0 is used.  This can contain a mixture of
subrequests aimed at different sources.  For write requests, stream 0 is used
for the server and stream 1 is used for the cache.
For buffered writeback, stream 0 is not enabled unless a normal dirty folio
is encountered, at which point ->begin_writeback() will be invoked and the
filesystem can mark the stream available.

The stream struct looks like::

    struct netfs_io_stream {
        unsigned char stream_nr;
        bool avail;
        size_t sreq_max_len;
        unsigned int sreq_max_segs;
        unsigned int submit_extendable_to;
        ...
    };

A number of members are available for access/use by the filesystem:

 * ``stream_nr``

   The number of the stream within the request.

 * ``avail``

   True if the stream is available for use.  The filesystem should set this
   on stream zero if in ->begin_writeback().

 * ``sreq_max_len``
 * ``sreq_max_segs``

   These are set by the filesystem or the cache in ->prepare_read() or
   ->prepare_write() for each subrequest to indicate the maximum number of
   bytes and, optionally, the maximum number of segments (if not 0) that the
   subrequest can support.

 * ``submit_extendable_to``

   The size that a subrequest can be rounded up to beyond the EOF, given the
   available buffer.  This allows the cache to work out if it can do a DIO
   read or write that straddles the EOF marker.

Subrequest Structure
--------------------

Individual units of I/O are managed by the subrequest structure.  These
represent slices of the overall request and run independently::

    struct netfs_io_subrequest {
        struct netfs_io_request *rreq;
        struct iov_iter io_iter;
        unsigned long long start;
        size_t len;
        size_t transferred;
        unsigned long flags;
        short error;
        unsigned short debug_index;
        unsigned char stream_nr;
        ...
    };

Each subrequest is expected to access a single source, though the library
will handle falling back from one source type to another.  The members are:

 * ``rreq``

   A pointer to the read request.
 * ``io_iter``

   An I/O iterator representing a slice of the buffer to be read into or
   written from.

 * ``start``
 * ``len``

   The file position of the start of this slice of the read request and the
   length.

 * ``transferred``

   The amount of data transferred so far for this subrequest.  The length of
   the transfer made by this issuance of the subrequest should be added to
   this.  If this is less than ``len`` then the subrequest may be reissued
   to continue.

 * ``flags``

   Flags for managing the subrequest.  A number of these are of interest to
   the filesystem or cache:

   * ``NETFS_SREQ_MADE_PROGRESS``

     Set by the filesystem to indicate that at least one byte of data was
     read or written.

   * ``NETFS_SREQ_HIT_EOF``

     The filesystem should set this if a read hit the EOF on the file (in
     which case ``transferred`` should stop at the EOF).  Netfslib may
     expand the subrequest out to the size of the folio containing the EOF
     on the off chance that a third party change happened or a DIO read may
     have asked for more than is available.  The library will clear any
     excess pagecache.

   * ``NETFS_SREQ_CLEAR_TAIL``

     The filesystem can set this to indicate that the remainder of the
     slice, from transferred to len, should be cleared.  Do not set this if
     HIT_EOF is set.

   * ``NETFS_SREQ_NEED_RETRY``

     The filesystem can set this to tell netfslib to retry the subrequest.

   * ``NETFS_SREQ_BOUNDARY``

     This can be set by the filesystem on a subrequest to indicate that it
     ends at a boundary with the filesystem structure (e.g. at the end of a
     Ceph object).  It tells netfslib not to retile subrequests across it.

 * ``error``

   This is for the filesystem to store the result of the subrequest.  It
   should be set to 0 if successful and to a negative error code otherwise.
 * ``debug_index``
 * ``stream_nr``

   A number allocated to this slice that can be displayed in trace lines for
   reference and the number of the request stream that it belongs to.

If necessary, the filesystem can get and put extra refs on the subrequest it
is given::

    void netfs_get_subrequest(struct netfs_io_subrequest *subreq,
                              enum netfs_sreq_ref_trace what);
    void netfs_put_subrequest(struct netfs_io_subrequest *subreq,
                              enum netfs_sreq_ref_trace what);

using netfs trace codes to indicate the reason.  Care must be taken, however,
as once control of the subrequest is returned to netfslib, the same
subrequest can be reissued/retried.

Filesystem Methods
------------------

The filesystem sets a table of operations in ``netfs_inode`` for netfslib to
use::

    struct netfs_request_ops {
        mempool_t *request_pool;
        mempool_t *subrequest_pool;
        int (*init_request)(struct netfs_io_request *rreq, struct file *file);
        void (*free_request)(struct netfs_io_request *rreq);
        void (*free_subrequest)(struct netfs_io_subrequest *rreq);
        void (*expand_readahead)(struct netfs_io_request *rreq);
        int (*prepare_read)(struct netfs_io_subrequest *subreq);
        void (*issue_read)(struct netfs_io_subrequest *subreq);
        void (*done)(struct netfs_io_request *rreq);
        void (*update_i_size)(struct inode *inode, loff_t i_size);
        void (*post_modify)(struct inode *inode);
        void (*begin_writeback)(struct netfs_io_request *wreq);
        void (*prepare_write)(struct netfs_io_subrequest *subreq);
        void (*issue_write)(struct netfs_io_subrequest *subreq);
        void (*retry_request)(struct netfs_io_request *wreq,
                              struct netfs_io_stream *stream);
        void (*invalidate_cache)(struct netfs_io_request *wreq);
    };

The table starts with a pair of optional pointers to memory pools from which
requests and subrequests can be allocated.
If these are not given, netfslib has default pools that it will use instead.
If the filesystem wraps the netfs structs in its own larger structs, then it
will need to use its own pools.  Netfslib will allocate directly from the
pools.

The methods defined in the table are:

 * ``init_request()``
 * ``free_request()``
 * ``free_subrequest()``

   [Optional] A filesystem may implement these to initialise or clean up any
   resources that it attaches to the request or subrequest.

 * ``expand_readahead()``

   [Optional] This is called to allow the filesystem to expand the size of a
   readahead request.  The filesystem gets to expand the request in both
   directions, though it must retain the initial region as that may
   represent an allocation already made.  If local caching is enabled, it
   gets to expand the request first.

   Expansion is communicated by changing ->start and ->len in the request
   structure.  Note that if any change is made, ->len must be increased by
   at least as much as ->start is reduced.

 * ``prepare_read()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as is required by RDMA.  This information should be set on
   stream zero in::

        rreq->io_streams[0].sreq_max_len
        rreq->io_streams[0].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has
   to be split across multiple servers or to put multiple reads in flight.

   Zero should be returned on success and an error code otherwise.

 * ``issue_read()``

   [Required] Netfslib calls this to dispatch a subrequest to the server for
   reading.  In the subrequest, ->start, ->len and ->transferred indicate
   what data should be read from the server and ->io_iter indicates the
   buffer to be used.
   There is no return value; the ``netfs_read_subreq_terminated()`` function
   should be called to indicate that the subrequest completed either way.
   ->error, ->transferred and ->flags should be updated before completing.
   The termination can be done asynchronously.

   Note: the filesystem must not deal with setting folios uptodate,
   unlocking them or dropping their refs - the library deals with this as it
   may have to stitch together the results of multiple subrequests that
   variously overlap the set of folios.

 * ``done()``

   [Optional] This is called after the folios in a read request have all
   been unlocked (and marked uptodate if applicable).

 * ``update_i_size()``

   [Optional] This is invoked by netfslib at various points during the write
   paths to ask the filesystem to update its idea of the file size.  If not
   given, netfslib will set i_size and i_blocks and update the local cache
   cookie.

 * ``post_modify()``

   [Optional] This is called after netfslib writes to the pagecache or when
   it allows an mmap'd page to be marked as writable.

 * ``begin_writeback()``

   [Optional] Netfslib calls this when processing a writeback request if it
   finds a dirty page that isn't simply marked NETFS_FOLIO_COPY_TO_CACHE,
   indicating it must be written to the server.  This allows the filesystem
   to only set up writeback resources when it knows it's going to have to
   perform a write.

 * ``prepare_write()``

   [Optional] This is called to allow the filesystem to limit the size of a
   subrequest.  It may also limit the number of individual regions in the
   iterator, such as is required by RDMA.
   This information should be set on
   the stream to which the subrequest belongs::

        rreq->io_streams[subreq->stream_nr].sreq_max_len
        rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The filesystem can use this, for example, to chop up a request that has to
   be split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error. Instead, in the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the server for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the server and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()``
   function should be called to indicate that the subrequest completed either
   way. ->error, ->transferred and ->flags should be updated before
   completing. The termination can be done asynchronously.

   Note: the filesystem must not deal with removing the dirty or writeback
   marks on folios involved in the operation and should not take refs or pins
   on them, but should leave retention to netfslib.

 * ``retry_request()``

   [Optional] Netfslib calls this at the beginning of a retry cycle. This
   allows the filesystem to examine the state of the request and of the
   subrequests in the indicated stream, as well as its own data, and to make
   adjustments or renegotiate resources.

 * ``invalidate_cache()``

   [Optional] This is called by netfslib to invalidate data stored in the
   local cache in the event that writing to the local cache fails, providing
   updated coherency data that netfslib can't provide itself.

Terminating a subrequest
------------------------

When a subrequest completes, there are a number of functions that the cache or
the filesystem can call to inform netfslib of the status change.
One function is
provided to terminate a write subrequest at the preparation stage and acts
synchronously:

 * ``void netfs_prepare_write_failed(struct netfs_io_subrequest *subreq);``

   Indicate that the ->prepare_write() call failed. The ``error`` field should
   have been updated.

Note that ->prepare_read() can return an error as a read can simply be
aborted. Dealing with writeback failure is trickier.

The other functions are used for subrequests that got as far as being issued:

 * ``void netfs_read_subreq_terminated(struct netfs_io_subrequest *subreq);``

   Tell netfslib that a read subrequest has terminated. The ``error``,
   ``flags`` and ``transferred`` fields should have been updated.

 * ``void netfs_write_subrequest_terminated(void *_op, ssize_t transferred_or_error);``

   Tell netfslib that a write subrequest has terminated. Either the amount of
   data processed or the negative error code can be passed in. This can be
   used as a kiocb completion function.

 * ``void netfs_read_subreq_progress(struct netfs_io_subrequest *subreq);``

   This is provided to optionally update netfslib on the incremental progress
   of a read, allowing some folios to be unlocked early; it does not actually
   terminate the subrequest. The ``transferred`` field should have been
   updated.

Local Cache API
---------------

Netfslib provides a separate API for a local cache to implement, though some
of its routines are quite similar to those of the filesystem request API.

Firstly, the netfs_io_request object contains a place for the cache to hang
its state::

        struct netfs_cache_resources {
                const struct netfs_cache_ops    *ops;
                void                            *cache_priv;
                void                            *cache_priv2;
                unsigned int                    debug_id;
                unsigned int                    inval_counter;
        };

This contains an operations table pointer and two private pointers, plus the
debug ID of the fscache cookie for tracing purposes and an invalidation
counter that is cranked by calls to ``fscache_invalidate()``, allowing cache
subrequests to be invalidated after completion.

The cache operation table looks like the following::

        struct netfs_cache_ops {
                void (*end_operation)(struct netfs_cache_resources *cres);
                void (*expand_readahead)(struct netfs_cache_resources *cres,
                                         loff_t *_start, size_t *_len,
                                         loff_t i_size);
                enum netfs_io_source (*prepare_read)(struct netfs_io_subrequest *subreq,
                                                     loff_t i_size);
                int (*read)(struct netfs_cache_resources *cres,
                            loff_t start_pos,
                            struct iov_iter *iter,
                            bool seek_data,
                            netfs_io_terminated_t term_func,
                            void *term_func_priv);
                void (*prepare_write_subreq)(struct netfs_io_subrequest *subreq);
                void (*issue_write)(struct netfs_io_subrequest *subreq);
        };

With a termination handler function pointer::

        typedef void (*netfs_io_terminated_t)(void *priv,
                                              ssize_t transferred_or_error,
                                              bool was_async);

The methods defined in the table are:

 * ``end_operation()``

   [Required] Called to clean up the resources at the end of the read request.

 * ``expand_readahead()``

   [Optional] Called at the beginning of a readahead operation to allow the
   cache to expand a request in either direction. This allows the cache to
   size the request appropriately for the cache granularity.

 * ``prepare_read()``

   [Required] Called to configure the next slice of a request.
   ->start and
   ->len in the subrequest indicate where and how big the next slice can be;
   the cache gets to reduce the length to match its granularity requirements.

   The function is passed pointers to the start and length in its parameters,
   plus the size of the file for reference, and adjusts the start and length
   appropriately. It should return one of:

   * ``NETFS_FILL_WITH_ZEROES``
   * ``NETFS_DOWNLOAD_FROM_SERVER``
   * ``NETFS_READ_FROM_CACHE``
   * ``NETFS_INVALID_READ``

   to indicate whether the slice should just be cleared or whether it should
   be downloaded from the server or read from the cache - or whether slicing
   should be given up at the current point.

 * ``read()``

   [Required] Called to read from the cache. The start file offset is given
   along with an iterator to read to, which gives the length also. It can be
   given a hint requesting that it seek forward from that start position for
   data.

   Also provided is a pointer to a termination handler function and private
   data to pass to that function. The termination function should be called
   with the number of bytes transferred or an error code, plus a flag
   indicating whether the termination is definitely happening in the caller's
   context.

 * ``prepare_write_subreq()``

   [Required] This is called to allow the cache to limit the size of a
   subrequest. It may also limit the number of individual regions in the
   iterator, as may be required by DIO/DMA. This information should be set on
   the stream to which the subrequest belongs::

        rreq->io_streams[subreq->stream_nr].sreq_max_len
        rreq->io_streams[subreq->stream_nr].sreq_max_segs

   The cache can use this, for example, to chop up a request that has to be
   split across multiple servers or to put multiple writes in flight.

   This is not permitted to return an error.
   In the event of failure,
   ``netfs_prepare_write_failed()`` must be called.

 * ``issue_write()``

   [Required] This is used to dispatch a subrequest to the cache for writing.
   In the subrequest, ->start, ->len and ->transferred indicate what data
   should be written to the cache and ->io_iter indicates the buffer to be
   used.

   There is no return value; the ``netfs_write_subrequest_terminated()``
   function should be called to indicate that the subrequest completed either
   way. ->error, ->transferred and ->flags should be updated before
   completing. The termination can be done asynchronously.


API Function Reference
======================

.. kernel-doc:: include/linux/netfs.h
.. kernel-doc:: fs/netfs/buffered_read.c
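As a closing illustration of the cache's slice-selection decision described
earlier, here is a hedged userspace sketch of a ``prepare_read()``-style
helper. The granule size and the block-map test are invented stand-ins, and
the enumerators merely mirror the netfslib return values; this is not kernel
code:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical userspace model of a cache's prepare_read() decision:
 * choose a source for the next slice of a read and trim the slice to
 * the cache's granularity.  All names and the granule logic here are
 * invented for illustration.
 */
enum demo_io_source {
	DEMO_FILL_WITH_ZEROES,		/* slice lies beyond EOF */
	DEMO_DOWNLOAD_FROM_SERVER,	/* cache holds no data here */
	DEMO_READ_FROM_CACHE,		/* cache holds this granule */
};

#define DEMO_GRANULE_SIZE 256		/* invented cache granularity */

static bool demo_cache_has_granule(long long start)
{
	/* Stand-in block map: pretend even-numbered granules are cached. */
	return (start / DEMO_GRANULE_SIZE) % 2 == 0;
}

static enum demo_io_source demo_prepare_read(long long start, size_t *len,
					     long long i_size)
{
	if (start >= i_size)
		return DEMO_FILL_WITH_ZEROES;

	/* Trim the slice so it does not cross a granule boundary. */
	size_t to_boundary = DEMO_GRANULE_SIZE -
			     (size_t)(start % DEMO_GRANULE_SIZE);
	if (*len > to_boundary)
		*len = to_boundary;

	return demo_cache_has_granule(start) ?
		DEMO_READ_FROM_CACHE : DEMO_DOWNLOAD_FROM_SERVER;
}
```

The three return paths correspond to clearing the buffer, downloading from
the server and reading from the cache; a real cache would consult its actual
block map and might also give up slicing with an invalid-read result.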