Lines Matching +refs:is +refs:direct +refs:push

23 The purpose of this document is threefold:
26 feature is, and issues about which they should be aware.
34 As the online fsck code is merged, the links in this document to topic branches
37 This document is licensed under the terms of the GNU General Public License, v2.
38 The primary author is Darrick J. Wong.
40 This design document is split into seven parts.
43 and how it is tested to ensure correct functionality.
54 1. What is a Filesystem Check?
83 As a word of caution -- the primary goal of most Linux fsck tools is to restore
92 it is now possible to regenerate data structures when non-catastrophic errors
108 Code is posted to the kernel.org git trees as follows:
127 program is now deprecated and will not be discussed further.
135 The most important feature of this tool is its ability to respond to
152 3. **Users** experience a **total loss of service** if the filesystem is taken
162 while the filesystem is online.
172 benefit, the proposed solution is a third fsck tool that acts on a running
178 ``xfs_scrub`` is the name of the driver program.
190 | The kernel portion of online fsck that validates metadata is called |
191 | "online scrub", and portion of the kernel that fixes metadata is called |
195 The naming hierarchy is broken up into objects known as directories and files
196 and the physical space is split into pieces known as allocation groups.
203 While this is going on, other parts continue processing IO requests.
210 is running.
217 Because it is necessary for online fsck to lock and scan live metadata objects,
219 The first is the userspace driver program ``xfs_scrub``, which is responsible
234 philosophy, which is to say that each item should handle one aspect of a
244 If these errors cause the next mount to fail, offline fsck is the only
247 A second limitation of online fsck is that it must follow the same resource
251 In other words, online fsck is not a complete replacement for offline fsck, and
274 Each metadata structure is scheduled as a separate scrub item.
275 If corruption is found in the inode header or inode btree and ``xfs_scrub``
276 is permitted to perform repairs, then those scrub items are repaired to
279 resubmit the kernel scrub call with the repair flag enabled; this is
284 Each metadata structure is also scheduled as a separate scrub item.
285 If repairs are needed and ``xfs_scrub`` is permitted to perform repairs,
296 Unsuccessful repairs are requeued as long as forward progress on repairs is
298 Free space in the filesystem is trimmed at the end of phase 4 if the
299 filesystem is clean.
311 The ability to use hardware-assisted data file integrity checking is new
327 1. The scrub item of interest is checked for corruptions; opportunities for
330 If the item is not corrupt or does not need optimization, resources are
332 If the item is corrupt or could be optimized but the caller does not permit
337 2. The repair function is called to rebuild the data structure.
351 Each type of metadata object (and therefore each type of scrub item) is
381 owns the item being scrubbed is locked to guard against concurrent updates.
394 this is effectively an offline repair operation performed on a subset of the
396 This minimizes the complexity of the repair code because it is not necessary to
397 handle concurrent updates from other threads, nor is it necessary to access
403 Despite these limitations, the advantage that online repair holds is clear:
407 This mechanism is described in section 2.1 ("Off-Line Algorithm") of
413 in-memory array prior to formatting the new ondisk structure, which is very
417 duration of the repair is *always* an offline algorithm.
433 This class of metadata is difficult for scrub to process because scrub attaches
449 The next step is to release all locks and start the filesystem scan.
452 While the filesystem scan is in progress, the repair function hooks the
455 Once the scan is done, the owning object is re-locked, the live data is used to
457 The hooks are disabled and the staging area is freed.
495 file updates are elided when the record ID for the update is greater than the
500 In other words, there is no attempt to expose the keyspace of the new index
501 while repair is running.
511 However, doing that is a fair amount more work than what the checking functions
553 Delta tracking is necessary for dquots because the index builder scans inodes,
554 whereas the data structure being rebuilt is an index of dquots.
578 Systematic fuzz testing (detailed in the next section) is employed by the
584 disables building of the ``xfs_scrub`` binary, though this is not a risk
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
601 background service is configured to run with only the privileges required.
610 disclosure is somehow of some social benefit.
611 In the view of this author, the benefit is realized only when the fuzz
612 operators help to **fix** the flaws, but this opinion apparently is not
616 Automated testing should front-load some of the risk while the feature is
620 Despite this, it is hoped that this new functionality will prove useful in
647 The primary goal of any free software QA effort is to make testing as
751 This is quite the combinatoric explosion!
777 A requirement unique to online fsck is the ability to operate on a filesystem
779 Although it is of course impossible to run ``xfs_scrub`` with *zero* observable
800 Success is defined by the ability to run all of these tests without observing
812 The primary user of online fsck is the system administrator, just like offline
830 A new feature of ``xfs_scrub`` is the ``-x`` option, which employs the error
832 The media scan is not enabled by default because it may dramatically increase
835 The output of a foreground invocation is captured in the system log.
854 The output of the background service is also captured in the system log.
863 The decision to enable the background scan is left to the system administrator.
869 This automatic weekly scan is configured out of the box to perform an
871 This is less foolproof than, say, storing file data block checksums, but much
897 The information is updated whenever ``xfs_scrub`` is run, or whenever
927 is running.
942 the current filesystem, and that the information contained in the block is
948 Whenever a file system operation modifies a block, the change is submitted
970 The original design of XFS (circa 1993) is an improvement upon 1980s Unix
989 However, it has two critical advantages: first, the reverse index is key to
999 | A criticism of adding the secondary index is that it does nothing to |
1001 | This is a valid point, but adding a new index for file data block |
1007 | usage is much less than adding volume management and storage device |
1013 The information captured in a reverse space mapping record is as follows:
1032 is this an attribute fork extent? A file mapping btree extent? Or an
1046 * The absence of an entry in the reference count data if the file is not
1057 2. Proving the consistency of secondary metadata with the primary metadata is
1059 which is very time intensive.
1067 required locking order is not the same order used by regular filesystem
1073 mapping data cannot be guaranteed if system load is heavy.
1083 The first step of checking a metadata structure is to examine every record
1094 - Is there so much damage around the filesystem that cross-referencing is not
1098 - Does the structure contain data that is not inconsistent but deserves review
1114 This assumes that metadata blocks only have one owner, which is always true
1118 scrub is expecting?
1125 establish that the filesystem code is reasonably free of gross corruption bugs
1126 and that the storage system is reasonably competent at retrieval.
1131 Every online fsck scrubbing function is expected to read every ondisk metadata
1135 failure to cross-reference once the full examination is complete.
1142 After the buffer cache, the next level of metadata protection is the internal
1147 The scope of checking is still internal to the block.
1150 - Does the type of data stored in the block match what scrub is expecting?
1156 - If the block tracks internal free space information, is it consistent with
1171 debugging is enabled or a write is about to occur.
1179 that a value is within the possible range.
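As a hedged illustration, a block verifier is wired to the buffer cache roughly as follows; the structure shape matches the kernel's ``struct xfs_buf_ops``, but the verifier functions named here are hypothetical::

    /* Sketch: read/write verifiers attached to one ondisk block type. */
    static void example_agf_read_verify(struct xfs_buf *bp)
    {
            /* check magic, CRC, ownership, and obviously bad fields */
    }

    static void example_agf_write_verify(struct xfs_buf *bp)
    {
            /* re-check fields and recompute the CRC before writeback */
    }

    const struct xfs_buf_ops example_agf_buf_ops = {
            .name         = "example_agf",
            .verify_read  = example_agf_read_verify,
            .verify_write = example_agf_write_verify,
    };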
1199 After internal block checks, the next higher level of checking is
1201 For regular runtime code, the cost of these checks is considered to be
1202 prohibitively expensive, but as scrub is dedicated to rooting out
1204 The exact set of cross-referencing is highly dependent on the context of the
1210 keyspace is fully, sparsely, or not at all mapped to records.
1211 For the reverse mapping btree, it is possible to mask parts of the key for the
1218 - Does the type of data stored in the block match what scrub is expecting?
1301 - If this is a CoW fork mapping, does it correspond to a CoW entry in the
1308 - Within the space subkeyspace of the rmap btree (that is to say, all
1343 Block 0 in the attribute fork is always the top of the structure, but otherwise
1351 rooted at block 0) is created to map hashes of the attribute names to leaf
1354 Checking an extended attribute structure is not so straightforward due to the
1371 If the value is stored in a remote block, this also validates the
1377 The filesystem directory tree is a directed acyclic graph structure, with files
1393 exists as post-EOF extents) is populated with a block containing free space
1397 If this second partition grows beyond one block, the third partition is
1401 beyond one block, then a dabtree is used to map hashes of dirent names to
1404 Checking a directory is pretty straightforward:
1411 Each dirent is checked as follows:
1419 d. If a file type is included in the dirent, does it match the type of the
1422 e. If the child is a subdirectory, does the child's dotdot pointer point
1451 Checking and cross-referencing the dabtree is very similar to what is done for
1454 - Does the type of data stored in the block match what scrub is expecting?
1501 After performing a repair, the checking code is run a second time to validate
1504 This step is critical for enabling system administrators to monitor the status
1506 For developers, it is a useful means to judge the efficacy of error detection
1520 the metadata are temporarily inconsistent with each other, and rebuilding is
1528 The count should be bumped whenever a new item is added to the chain.
1534 If the count is zero, proceed with the checking operation.
1535 If it is nonzero, cycle the buffer locks to allow the chain to make forward
1553 The root cause of these reports is the eventual consistency model introduced by
1581 When the log is persisted to disk, the EFI item is written into the ondisk
1590 Attached to the transaction is an extent free done (EFD) log item.
1594 If the system goes down after transaction #1 is written back to the filesystem
1595 but before #2 is committed, a scan of the filesystem metadata would show
1614 but as long as the first subtlety is handled, this should not affect the
1653 chain, but it is theoretically possible if space is very tight.
1654 For copy-on-write updates this is even worse, because this must be done once to
1670 If the main lock for a space btree is an AG header buffer lock, scrub may have
1671 interrupted another thread that is midway through finishing a chain.
1676 If a repair is attempted in this state, the results will be catastrophic!
1683 This would be very difficult to implement in practice because it is
1692 This would introduce a lot of complexity into the coordinator since it is
1702 This solution is a nonstarter because it is *extremely* invasive to the main
1713 First, the counter is incremented when a deferred work item is *queued* to a
1714 transaction, and it is decremented after the associated intent done log item is
1716 The second property is that deferred work can be added to a transaction without
1721 is an explicit deprioritization of online fsck to benefit file operations.
1722 The second property of the drain is key to the correct coordination of scrub,
1723 since scrub will always be able to decide if a conflict is possible.
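A minimal sketch of the drain idea, assuming an atomic counter paired with a waitqueue; the real ``struct xfs_defer_drain`` may differ in detail::

    struct example_defer_drain {
            atomic_t                dr_count;   /* intents queued but unfinished */
            struct wait_queue_head  dr_waiters;
    };

    /* Called when an intent item is queued to a transaction. */
    static void example_drain_grab(struct example_defer_drain *dr)
    {
            atomic_inc(&dr->dr_count);
    }

    /* Called after the matching intent done item commits. */
    static void example_drain_rele(struct example_defer_drain *dr)
    {
            if (atomic_dec_and_test(&dr->dr_count))
                    wake_up(&dr->dr_waiters);
    }

    /* Scrub waits here until no intent chains target the object. */
    static int example_drain_wait(struct example_defer_drain *dr)
    {
            return wait_event_killable(dr->dr_waiters,
                            atomic_read(&dr->dr_count) == 0);
    }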
1748 2. If the counter is zero (``xfs_defer_drain_busy`` returns false), there are no
1759 The proposed patchset is the
1771 later, live update hooks) where it is useful for the online fsck code to know
1773 Since it is not expected that online fsck will be constantly running in the
1774 background, it is very important to minimize the runtime overhead imposed by
1775 these hooks when online fsck is compiled into the kernel but not actively
1778 to find that no further action is necessary is expensive -- on the author's
1787 When online fsck enables the static key, the sled is replaced with an
1789 The switchover is quite expensive (~22000ns) but is paid entirely by the
1801 filesystem operations when xfs_scrub is not running, the intended usage
1811 scrub-only hook code if the static key is not enabled.
1825 static key; the ``TRY_HARDER`` flag is useful here.
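For illustration only, the hook call sites follow the kernel's usual static key pattern; the key and helper names below are hypothetical::

    DEFINE_STATIC_KEY_FALSE(example_scrub_hooks_key);

    /* Hot path: compiles to a no-op sled until scrub patches it live. */
    void example_metadata_update(struct example_update *upd)
    {
            if (static_branch_unlikely(&example_scrub_hooks_key))
                    example_call_scrub_hooks(upd);  /* hypothetical hook call */
            /* ...the regular update continues here... */
    }

    /* Scrub pays the expensive code-patching cost at setup and teardown. */
    void example_scrub_hooks_enable(void)
    {
            static_branch_inc(&example_scrub_hooks_key);
    }

    void example_scrub_hooks_disable(void)
    {
            static_branch_dec(&example_scrub_hooks_key);
    }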
1859 * Allocating a contiguous region of memory to create a C array is very
1862 * Linked lists of records introduce double pointer overhead which is very high
1865 * Kernel memory is pinned, which can drive the system into OOM conditions.
1877 that usage precedent is already established.
1915 The fifth use case is discussed in the :ref:`realtime summary <rtsummary>` case
1918 XFS is very record-based, which suggests that the ability to load and store
1919 complete records is important.
1923 in this manner is an acceptable behavior because the only reaction is to abort
1926 However, no discussion of file access idioms is complete without answering the
1928 It is convenient to access storage directly with pointers, just like userspace
1932 tmpfs can only push a pagecache folio to the swap cache if the folio is neither
1935 Short-term direct access to xfile contents is done by locking the pagecache
1938 long-term direct access to xfile contents is done by bumping the folio refcount,
1969 If an xfile is shared between threads to stage repairs, the caller must provide
1991 methods of the xfile directly, it is simpler for callers if there is a
2003 Iteration of records is assumed to be necessary for all cases and will be
2011 Access to array elements is performed programmatically via ``xfarray_load`` and
2022 The typical use case here is rebuilding space btrees and key/value btrees.
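A short sketch of the load/store idiom; the signatures shown are close to the kernel's xfarray API but should be treated as approximate::

    struct example_rec {
            uint64_t        key;
            uint64_t        value;
    };

    static int example_xfarray_demo(void)
    {
            struct example_rec      rec = { .key = 42, .value = 7 };
            struct xfarray          *array;
            int                     error;

            error = xfarray_create("example recs", 0, sizeof(rec), &array);
            if (error)
                    return error;

            /* store record 0, then read it back */
            error = xfarray_store(array, 0, &rec);
            if (!error)
                    error = xfarray_load(array, 0, &rec);

            xfarray_destroy(array);
            return error;
    }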
2030 The third type of caller uses the xfarray as a bag, which is useful for counting records.
2031 The typical use case here is constructing space extent reference counts from
2034 at any time, and uniqueness of records is left to callers.
2035 The ``xfarray_store_anywhere`` function is used to insert a record in any
2039 The proposed patchset is the
2084 The btree insertion code in XFS is responsible for maintaining correct ordering
2091 The sorting algorithm used in the xfarray is actually a combination of adaptive
2100 Both algorithms are (in general) O(n * lg(n)), but there is a wide performance
2115 Choosing a quicksort pivot is a tricky business.
2117 behavior that is crucial to O(n * lg(n)) performance.
2138 The partitioning of quicksort is fairly textbook -- rearrange the record
2167 function frees them all because compaction is not needed.
2173 file, which is why compaction is not required.
2175 The proposed patchset is at the start of the
2192 unbounded memory consumption if the rest of the system is very busy.
2193 Another option is to skip the side-log and commit live updates from the
2199 Given that indexed lookups of scan data are required for both strategies, online
2217 The proposed patchset is the
2226 The first is to make it possible for the ``struct xfs_buftarg`` structure to
2229 The second change is to modify the buffer ``ioapply`` function to "read" cached
2231 Concurrent access to individual buffers is controlled by the ``xfs_buf`` lock,
2243 Space management for an xfile is very simple -- each btree block is one memory
2246 block verifiers ignore the checksums, assuming that xfile memory is no more
2248 Reusing existing code here is more important than absolute memory efficiency.
2300 Although it is a clever hack to reuse the rmap btree code to handle the staging
2357 The zeroth step of bulk loading is to assemble the entire record set that will
2361 This information is required for resource reservation.
2366 Roughly speaking, the maximum number of records is::
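    maxrecs = (block_size - header_size) / record_size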
2371 which means the minimum number of records is half of maxrecs::
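    minrecs = maxrecs / 2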
2375 The next variable to determine is the desired loading factor.
2377 Choosing minrecs is undesirable because it wastes half the block.
2378 Choosing maxrecs is also undesirable because adding a single record to each
2386 If space is tight, the loading factor will be set to maxrecs to try to avoid
2391 Load factor is computed for btree node blocks using the combined size of the
2404 is computed as::
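    blocks_for_level = (records_for_level + desired - 1) / desired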
2409 The entire computation is performed recursively until the current level only
2411 The resulting geometry is as follows:
2413 - For AG-rooted btrees, this level is the root level, so the height of the new
2414 tree is ``level + 1`` and the space needed is the summation of the number of
2418 inode fork area, the height is ``level + 2``, the space needed is the
2424 height is ``level + 1``, and the space needed is one less than the summation
2427 an inode, which is a future patchset and only included here for completeness.
2436 Each reserved extent is tracked separately by the btree builder state data.
2448 While repair is writing these new btree blocks, the EFIs created for the space
2450 It's possible that other parts of the system will remain busy and push the head
2455 mechanism is reused here to commit a transaction at the log head containing an
2473 This part is pretty simple -- the btree builder (``xfs_btree_bulkload``) claims
2483 Sibling pointers are set every time a new block is added to the level::
2505 When it reaches the root level, it is ready to commit the new btree!::
2522 The first step to commit the new btree is to persist the btree blocks to disk
2524 This is a little complicated because a new btree block could have been freed
2570 The high level process to rebuild the inode index btree is:
2580 If the free inode btree is enabled, call it again to estimate the
2587 If the free inode btree is enabled, call it again to load the finobt.
2602 A cluster is the smallest number of ondisk inodes that can be allocated or
2603 freed in a single transaction; it is never smaller than 1 fs block or 4 inodes.
2610 ondisk inodes and to decide if the file is allocated
2612 Accumulate the results of successive inode cluster buffer reads until there is
2613 enough information to fill a single inode chunk record, which is 64 consecutive
2615 If the chunk is sparse, the chunk record may include holes.
2619 This xfarray is walked twice during the btree creation step -- once to populate
2622 The number of records for the inode btree is the number of xfarray records,
2626 The proposed patchset is the
2640 From the diagram below, it is apparent that a reference count record must start
2642 In other words, the record emission stimulus is level-triggered::
2658 The high level process to rebuild the reference count btree is:
2686 Details are as follows; the same algorithm is used by ``xfs_repair`` to
2709 - If the size of the bag changed and is greater than one, create a new
2713 The bag-like structure in this case is a type 2 xfarray as discussed in the
2719 The proposed patchset is the
2727 The high level process to rebuild a data/attr fork mapping btree is:
2760 The proposed patchset is the
2770 Whenever online fsck builds a new data structure to replace one that is
2771 suspect, there is a question of how to find and dispose of the blocks that
2773 The laziest method of course is not to deal with them at all, but this slowly
2782 to find space that is owned by the corresponding rmap owner yet truly free.
2783 Cross-referencing rmap records with other rmap records is necessary because
2786 Permitting the block allocator to hand them out again will not push the system
2794 the same rmap owner code is used to denote all of the objects being rebuilt.
2797 same ``XFS_RMAP_OWN_*`` number for the metadata that is being preserved.
2810 The process for disposing of old extents is as follows:
2818 - If not, the block is part of a crosslinked structure and must not be
2824 6. If the region is crosslinked, delete the reverse mapping entry for the
2827 7. If the region is to be freed, mark any corresponding buffers in the buffer
2832 However, there is one complication to this procedure.
2837 a. EFIs logged on behalf of space that is no longer occupied
2841 This is also a window in which a crash during the reaping process can leak
2846 The proposed patchset is the
2857 Creating a list of extents to reap the old btree blocks is quite simple,
2869 old data structures and hence is a candidate for reaping.
2873 If it is possible to maintain the AGF lock throughout the repair (which is the
2880 The high level process to rebuild the free space indices is:
2899 7. Reap the old btree blocks by looking for space that is not recorded by the
2905 First, free space is not explicitly tracked in the reverse mapping records.
2911 This is impossible when repairing the free space btrees themselves.
2915 It is not necessary to back each reserved extent with an EFI because the new
2916 free space btrees are constructed in what the ondisk filesystem thinks is
2921 reservation is sufficient.
2926 is atomic, similar to the other btree repair functions.
2928 Third, finding the blocks to reap after the repair is not overly
2934 This ownership is retained when blocks move from the AGFL into the free space
2941 When the walk is complete, the bitmap disunion operation ``(ag_owner_bitmap &
2946 The proposed patchset is the
2963 discussion is that the new rmap btree will not contain any records for the old
2965 The list of candidate reaping blocks is computed by setting the bits
2972 The rest of the process of rebuilding the reverse mapping btree is discussed
2975 The proposed patchset is the
2983 The allocation group free block list (AGFL) is repaired as follows:
2985 1. Create a bitmap for all the space that the reverse mapping data claims is
2990 3. Subtract any space that the reverse mapping data claims is owned by any
2993 4. Once the AGFL is full, reap any blocks leftover.
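In rough pseudocode, with hypothetical bitmap helpers standing in for the kernel's actual functions::

    /* 1: start with everything rmap says is owned by XFS_RMAP_OWN_AG */
    example_bitmap_set(&agfl_space, own_ag_extents);

    /* 2: knock out blocks in use by the free space and rmap btrees */
    example_bitmap_clear(&agfl_space, freesp_btree_blocks);
    example_bitmap_clear(&agfl_space, rmap_btree_blocks);

    /* 3: knock out anything with another claimed owner (crosslinked) */
    example_bitmap_clear(&agfl_space, other_owner_blocks);

    /* 4: fill the AGFL from what remains, then reap the surplus */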
3004 There is a very high potential for cache coherency issues if online fsck is not
3005 careful to access the ondisk metadata *only* when the ondisk metadata is so
3009 representation *or* a lock on whichever object is necessary to prevent any
3013 is necessary to get the in-core structure loaded.
3014 This means fixing whatever is caught by the inode cluster buffer and inode fork
3018 Once the in-memory representation is loaded, repair can lock the inode and can
3022 Dealing with the data and attr fork extent counts and the file block counts is
3027 The proposed patchset is the
3041 whatever is necessary to get the in-core structure loaded.
3042 Once the in-memory representation is loaded, the only attributes needing
3048 The proposed patchset is the
3061 but this is a slow process, so XFS maintains a copy in the ondisk superblock
3068 It is therefore only necessary to serialize on the superblock when the
3069 superblock is being committed to disk.
3074 The only time XFS commits the summary counters is at filesystem unmount.
3075 To reduce contention even further, the incore counter is implemented as a
3076 percpu counter, which means that each CPU is allocated a batch of blocks from a
3080 online fsck to check them, since there is no way to quiesce a percpu counter
3081 while the system is running.
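The sketch below, using the generic ``<linux/percpu_counter.h>`` API, shows why: updates usually touch only the local CPU's batch, and an exact value requires an expensive summation that is stale as soon as it returns::

    static int example_summary_counter_demo(void)
    {
            struct percpu_counter   free_blocks;
            s64                     exact;
            int                     error;

            error = percpu_counter_init(&free_blocks, 0, GFP_KERNEL);
            if (error)
                    return error;

            /* Fast path: usually touches only this CPU's batch. */
            percpu_counter_add(&free_blocks, 16);

            /*
             * An exact read must visit every CPU's batch, and the result
             * can be stale before this returns -- hence the freeze
             * described below.
             */
            exact = percpu_counter_sum(&free_blocks);
            pr_debug("free blocks: %lld\n", exact);

            percpu_counter_destroy(&free_blocks);
            return 0;
    }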
3085 the time the walk is complete.
3087 scan flag, but this is not a satisfying outcome for a system administrator.
3099 This is very similar to a filesystem freeze, though not all of the pieces are
3102 - The final freeze state is set one higher than ``SB_FREEZE_COMPLETE`` to
3108 With this code in place, it is now possible to pause the filesystem for just
3116 | With the filesystem frozen, it is possible to resolve the counter values |
3130 | This can happen if the filesystem is unmounted while the underlying |
3148 The proposed patchset is the
3161 However, it is not practical to shut down the entire filesystem to examine
3167 - How does scrub manage the scan while it is collecting data?
3181 This system is described by J. Lions, `"inode (5659)"
3198 Because this keyspace is sparse, this cursor contains two parts.
3202 the keyspace have already been visited, which is critical for deciding if a
3206 Advancing the scan cursor is a multi-step process encapsulated in
3233 The scan is now complete.
3235 4. Otherwise, there is at least one more inode to scan in this AG:
3241 the examination cursor is now.
3259 If one is provided:
3274 Obviously, it is an absolute requirement that the inode metadata be consistent
3276 Second, if the incore inode is stuck in some intermediate state, the scan
3277 coordinator must release the AGI and push the main filesystem to get the inode
3284 The first user of the new functionality is the
3296 However, it is important to note that references to incore inodes obtained as
3318 If the inode is unlinked (or unconnected after a file handle operation), the
3339 8. Incore dquot references, if a file is being repaired.
3350 Resources are often released in the reverse order, though this is not required.
3352 an object that normally is acquired in a later stage of the locking order, and
3353 then decide to cross-reference the object with an object that is acquired
3367 When the VFS ``iput`` function is given a linked inode with no other
3378 On the other hand, if there is no scrub transaction, it is desirable to drop
3410 If the directory tree is corrupt because it contains a cycle, ``xfs_scrub``
3414 Solving both of these problems is straightforward -- any time online fsck
3419 Trylock loops enable scrub to check for pending fatal signals, which is how
3439 The child directory is kept locked to prevent updates to the dotdot dirent, but
3442 If the dotdot entry changes while the directory is unlocked, then a move or
3446 The proposed patchset is the
3457 filesystem scan is the ability to stay informed about updates being made by
3465 In this case, the downstream consumer is always an online fsck function.
3479 keys are a more performant combination; more study is needed here.
3496 the filesystem update is committed to the transaction.
3517 - Online fsck must call ``xfs_hooks_del`` to disable the hook once the scan is
3522 zero when online fsck is not running.
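A hedged sketch of the hook lifecycle, assuming a notifier-chain-style interface behind ``xfs_hooks_add`` and ``xfs_hooks_del``; the handler and field names here are hypothetical::

    /* Scrub-side handler; runs under the same locks as the update. */
    static int example_scan_hook_fn(struct notifier_block *nb,
                    unsigned long action, void *data)
    {
            /* apply this live update to the scan's staging data, if in scope */
            return NOTIFY_DONE;
    }

    /* before the scan: attach the handler to the mount's hook chain */
    xfs_hooks_add(&mp->m_example_hooks, &scan->example_hook);

    /* ...coordinated scan runs, with live updates applied as they land... */

    /* after the scan: detach, so hot path overhead returns to zero */
    xfs_hooks_del(&mp->m_example_hooks, &scan->example_hook);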
3574 transaction that it is running.
3585 returns zero if there is nothing left to scan
3589 This is critical for hook functions to decide if they need to update the
3597 This functionality is also a part of the
3607 It is useful to compare the mount time quotacheck code to the online repair
3620 If the incore dquot is not being flushed, add the ondisk buffer backing the
3630 once the scan is complete.
3631 Handling transactional updates is tricky because quota resource usage updates
3638 a. The dquot is locked.
3640 b. A quota reservation is added to the dquot's resource usage.
3641 The reservation is recorded in the transaction.
3643 c. The dquot is unlocked.
3647 4. At transaction commit time, each dquot is examined again:
3649 a. The dquot is locked again.
3651 b. Quota usage changes are logged and unused reservation is given back to
3654 c. The dquot is unlocked.
3660 Notice that both hooks are called with the inode locked, which is how the
3689 The proposed patchset is the
3700 The coordinated inode scanner is used to visit all directories on the
3706 1. If the entry is a dotdot (``'..'``) entry of the root directory, the
3707 directory's parent link count is bumped because the root directory's dotdot
3708 entry is self-referential.
3710 2. If the entry is a dotdot entry of a subdirectory, the parent's backref
3711 count is bumped.
3713 3. If the entry is neither a dot nor a dotdot entry, the target file's parent
3714 count is bumped.
3716 4. If the target is a subdirectory, the parent's child link count is bumped.
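As an illustration (the record layout here is hypothetical), the observed counts can be stashed per inode, with the four rules above deciding which counter each dirent bumps::

    struct example_nlink_rec {
            uint64_t        parents;    /* rules 1 and 3: dirents naming this file */
            uint64_t        backrefs;   /* rule 2: children's dotdot entries pointing here */
            uint64_t        children;   /* rule 4: subdirectories linked from this dir */
    };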
3719 with the live update hooks is that the scan cursor tracks which *parent*
3723 Furthermore, a subdirectory A with a dotdot entry pointing back to B is
3730 For any file, the correct link count is the number of parents plus the number
3733 The backref information is used to detect inconsistencies in the number of
3739 A second coordinated inode scan cursor is used for comparisons.
3742 If repairs are desired, the inode's link count is set to the value in the
3747 The proposed patchset is the
3760 The primary advantage of this approach is the simplicity and modularity of the
3764 A secondary advantage of this repair approach is atomicity -- once the kernel
3765 decides a structure is corrupt, no other threads can access the metadata until
3812 This is performed with an empty transaction to avoid changing the
3831 The proposed patchset is the
3852 cannot be staged in memory, even when a paging scheme is available.
3857 Once the repair is complete, the old fork can be reaped as necessary; if the
3863 This dependency is the reason why online repair can only use pageable kernel
3872 There is a downside to the reaping process -- if the system crashes during the
3879 the last reference to the file is lost.
3918 | - Reaping blocks after a repair is not a simple operation, and |
3920 | during log recovery is daunting. |
3926 | Rewriting a single field in block headers is not a huge problem, but |
3935 | of blocks repeatedly, which is not conducive to quick repairs. |
3951 The MMAPLOCK is not needed here, because there must not be page faults from
3959 locking<ilocking>` section, it is recommended that scrub functions use the
3971 must be conveyed to the file being repaired, which is the topic of the next
3984 It is not possible to swap the inumbers of two files, so instead the new
3987 swapping code used by the file defragmenting tool ``xfs_fsr`` is not sufficient
3990 a. When the reverse-mapping btree is enabled, the swap code must keep the
3993 transaction is independent.
3995 b. Reverse-mapping is critical for the operation of online fsck, so the old
3997 operation) is not useful here.
3999 c. Defragmentation is assumed to occur between two files with identical
4002 change in file contents, even if the operation is interrupted.
4018 This new functionality is called the file contents exchange (xfs_exchrange)
4028 The proposed patchset is the
4045 | filesystem, either as part of an unmount or because the system is |
4048 | time that the log cleans itself, it is necessary for upper level code to |
4049 | communicate to the log when it is going to use a log incompatible |
4060 | The superblock update is performed transactionally, so the wrapper to |
4065 | When the transaction is complete, the ``xlog_drop_incompat_feat`` |
4066 | function is called to release the feature. |
4078 Exchanging contents between file forks is a complex task.
4079 The goal is to exchange all file fork mappings between two file fork offset
4086 This is roughly the format of the new deferred exchange-mapping work item:
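A hedged reconstruction follows; only ``xmi_blockcount`` and the startoff/blockcount behavior are named in the surrounding text, so treat the remaining fields as approximate::

    struct xfs_exchmaps_intent {
            /* inodes participating in the exchange */
            struct xfs_inode        *xmi_ip1;
            struct xfs_inode        *xmi_ip2;

            /* file offset ranges, advanced/decremented as work completes */
            xfs_fileoff_t           xmi_startoff1;
            xfs_fileoff_t           xmi_startoff2;
            xfs_filblks_t           xmi_blockcount;

            /* behavior flags */
            uint64_t                xmi_flags;
    };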
4114 incremented and the blockcount field is decremented to reflect the progress
4119 operation if the file data fork is the target of the operation.
4121 When the exchange is initiated, the sequence of operations is as follows:
4128 This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
4132 3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
4137 This is the minimum of the two ``br_blockcount`` values in the mappings.
4162 This quantity is ``(map1.br_startoff + map1.br_blockcount -
4174 to inform it that there is more work to be done.
4184 This is how an atomic file mapping exchange guarantees that an outside observer
4213 operation, but it is very important to maintain correct accounting.
4227 exchange is performed by copying the incore fork contents, logging both
4229 The atomic file mapping exchange mechanism is not necessary, since this can
4232 - If both forks map blocks, then the regular atomic file mapping exchange is
4235 - Otherwise, only one fork is in local format.
4240 The regular atomic mapping exchange is used to exchange the metadata file
4244 format so that the second file will be ready to go as soon as the ILOCK is
4249 Although there is no verification, it is still important to maintain
4256 extent reaping <reaping>` mechanism that is done post-repair.
4261 repair, and is not completely foolproof.
4272 The fork written to must be the same fork that is being repaired.
4290 In the "realtime" section of an XFS filesystem, free space is tracked via a
4292 Each bit in the bitmap represents one realtime extent, which is a multiple of
4301 The summary file itself is a flat file (with no block headers or checksums!)
4324 The temporary file is then reaped.
4326 The proposed patchset is the
4335 Values are limited in size to 64KiB, but there is no limit on the number of
4337 The attribute fork is unpartitioned, which means that the root of the attribute
4338 structure is always in logical block zero, but attribute leaf blocks, dabtree
4344 btree (``dabtree``) is created to map hashes of attribute names to entries
4347 Salvaging extended attributes is done as follows:
4351 When one is found,
4354 When one is found,
4372 The proposed patchset is the
4380 Fixing directories is difficult with currently available filesystem features,
4388 The best that online repair can do at this time is to read directory data
4391 The salvage process is discussed in the case study at the end of this section.
4399 salvaging directories is straightforward:
4402 If the dotdot entry is readable, try to confirm that the alleged
4408 When one is found,
4411 When an entry is found:
4435 In theory it is necessary to scan all dentry cache entries for a directory to
4445 This is the problem case.
4449 There is no known solution.
4451 The proposed patchset is the
4459 A parent pointer is a piece of file metadata that enables a user to locate the
4462 Without them, reconstruction of directory trees is hindered in much the same
4475 each parent pointer is a directory and that it contains a dirent matching
4484 | Each link from a parent directory to a child file is mirrored with an |
4498 | It is not clear how this actually worked properly. |
4505 | that parent pointer attribute creation is likely to fail at some |
4506 | point before the maximum file link count is achieved. |
4556 | If the hash is sufficiently resistant to collisions (e.g. sha256) |
4565 | the parent inumber is now xor'd into the hash index. |
4607 5. When the scan is complete, replay any stashed entries in the xfarray.
4609 6. When the scan is complete, atomically exchange the contents of the temporary
4615 The proposed patchset is the
4655 5. When the scan is complete, replay any stashed entries in the xfarray.
4659 7. When the scan is complete, atomically exchange the mappings of the attribute
4665 The proposed patchset is the
4680 This is already performed as part of the connectivity checks.
4697 referenced in this section is the regular directory entry name hash, not
4704 Having a single ``name_cookie`` for each ``name`` is critical for
4740 a. If the per-AG cursor is at a lower point in the keyspace than the
4746 b. If the per-file cursor is at a lower point in the keyspace than
4758 The proposed patchset is the
4776 blocks, if phase 4 is also capable of zapping directories.
4793 As mentioned earlier, the filesystem directory tree is supposed to be a
4795 However, each node in this graph is a separate ``xfs_inode`` object with its
4805 At any point in the walk, trying to set an already set bit means there is a
4809 However, one of online repair's design goals is to avoid locking the entire
4824 This is not possible since the VFS does not take the IOLOCK of a child
4842 3. If the alleged parent is the subdirectory being scrubbed, the path is
4849 If the bit is already set, then there is a cycle in the directory
4855 If the alleged parent is not a linked directory, abort the scan
4856 because the parent pointer information is inconsistent.
4867 This repeats until the directory tree root is reached or no parents
4891 2. If the subdirectory is either the root directory or has zero link count,
4919 The root of the filesystem is a directory, and each entry in a directory points
4929 If such a file has a positive link count, the file is an orphan.
4942 This process is more involved in the kernel than it is in userspace.
4954 2. If the decision is made to reconnect a file, take the IOLOCK of both the
4965 5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to
5030 Therefore, a metadata dependency graph is a convenient way to schedule checking
5060 it is desirable to scrub inodes in parallel to minimize runtime, particularly
5081 Just like before, the first workqueue is seeded with one workqueue item per AG,
5083 The second workqueue, however, is configured with an upper bound on the number
5086 second workqueue, and it is this second workqueue that queries BULKSTAT,
5089 If the second workqueue is too full, the workqueue add function blocks the
5130 Phase 4 is responsible for scheduling a lot of repair work in as quick a
5131 manner as is practical.
5135 The repair process is as follows:
5174 Complain if the repairs were not successful, since this is the last chance
5220 modern-day Linux systems is that programs work with Unicode character code
5246 If the character "Zero Width Space" U+200B is encountered in a file name, the
5283 verification request is sent to the disk as a directio read of the raw block
5297 It is hoped that the reader of this document has followed the designs laid out
5301 Although the scope of this work is daunting, it is hoped that this guide will
5310 mechanism is a new ioctl call that userspace programs can use to commit updates
5320 files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
5331 This mechanism is identical to steps 2-3 from the procedure above except for
5332 the new tracking items, because the atomic file mapping exchange mechanism is
5360 When the program is ready to commit the changes, it passes the timestamps
5364 A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
5369 Stage all writes to a temporary file, and when that is complete, call the
5392 It is hoped that ``io_uring`` will pick up enough of this functionality that
5407 One serious shortcoming of the online fsck code is that the amount of time that
5408 it can spend in the kernel holding resource locks is basically unbounded.
5409 Userspace is allowed to send a fatal signal to the process which will cause
5417 timeout is no longer useful.
5427 The first piece the ``clearspace`` program needs is the ability to read the
5430 The second piece it needs is a new fallocate mode
5434 The third piece is the ability to force an online repair.
5441 This often results in the metadata being rebuilt somewhere that is not being
5452 Clearspace makes its own copy of the frozen extent in an area that is not being
5460 To clear a piece of physical storage that has a high sharing factor, it is
5478 The trouble is, the kernel can't do anything about open files, since it cannot
5502 That requires an evacuation of the space at end of the filesystem, which is a