filesystems/xfs/xfs-online-fsck-design.rst

21 This document captures the design of the online filesystem check feature for
54 1. What is a Filesystem Check?
57 A Unix filesystem has four main responsibilities:
71 operations internal to the filesystem, such as internal consistency checking
76 The filesystem check (fsck) tool examines all the metadata in a filesystem
84 the filesystem metadata to a consistent state, not to maximize the data
91 More recent filesystem designs contain enough redundancy in their metadata that
124 It walks all metadata in the filesystem looking for inconsistencies in the
134 while it scans the metadata of the entire filesystem.
145 1. **User programs** suddenly **lose access** to the filesystem when unexpected
152 3. **Users** experience a **total loss of service** if the filesystem is taken
161    with corruptions if they **lack the means** to assess filesystem health
162    while the filesystem is online.
164 6. **Fleet monitoring tools** cannot **automate periodic checks** of filesystem
173 filesystem.
177 program to drive fsck activity on a live filesystem.
199 The division of the filesystem into principal objects (allocation groups and
201 repairs on a subset of the filesystem.
204 Even if a piece of filesystem metadata can only be regenerated by scanning the
248 sharing and lock acquisition rules as the regular filesystem.
263 repairing an entire filesystem into seven phases.
268 1. Collect geometry information about the mounted filesystem and computer,
283 3. Check all metadata of every file in the filesystem.
297    made somewhere in the filesystem.
298    Free space in the filesystem is trimmed at the end of phase 4 if the
299    filesystem is clean.
301 5. By the start of this phase, all primary and secondary filesystem metadata
310    file extents in the filesystem.
357 Metadata structures in this category should be most familiar to filesystem
360 Most filesystem objects fall into this class:
376 Scrub obeys the same rules as regular filesystem accesses for resource and lock
380 The principal filesystem object (either an allocation group or an inode) that
395 filesystem.
398 any other part of the filesystem.
404 targeted work on individual shards of the filesystem avoids total loss of
425 but are only needed for online fsck or for reorganization of the filesystem.
449 The next step is to release all locks and start the filesystem scan.
452 While the filesystem scan is in progress, the repair function hooks the
453 filesystem so that it can apply pending filesystem updates to the staging
462 Live filesystem code has to be hooked so that the repair function can observe
466 Finally, the hook, the filesystem scan, and the inode locking model must be
474 operation, which may cause application failure or an unplanned filesystem
532 Check and repair require full filesystem scans, but resource and lock
533 acquisition follow the same paths as regular filesystem accesses.
538 and file link counts) employ the same filesystem scanning and hooking
569 - **Decreased performance**: Adding metadata indices to the filesystem
577   software that result in incorrect repairs being written to the filesystem.
587 - **Inability to repair**: Sometimes, a filesystem is too badly damaged to be
593   render the filesystem unusable, the online repair functions have been
650 In other words, testing should maximize the breadth of filesystem configuration
656 The Linux filesystem community shares a common QA testing suite,
669 ``xfs_repair`` to rebuild the filesystem's metadata indices between tests.
670 This ensures that offline repair does not crash, leave a corrupt filesystem
685 This required the creation of fstests library code that can create a filesystem
687 Next, individual test cases were created to create a test filesystem, identify
698 In other words, for a given fstests filesystem configuration:
700 * For each metadata object existing on the filesystem:
715 of every metadata field of every metadata object in the filesystem.
717 block in the filesystem to simulate the effects of memory corruption and
719 Given that fstests already contains the ability to create a filesystem
720 containing every metadata format known to the filesystem, ``xfs_db`` can be
723 For a given fstests filesystem configuration:
725 * For each metadata object existing on the filesystem...
777 A unique requirement of online fsck is the ability to operate on a filesystem
781 inconsistencies into the filesystem metadata, and regular workloads should
791   filesystem doesn't cause problems.
793   force-repairing the whole filesystem doesn't cause problems.
795   freezing and thawing the filesystem.
797   remounting the filesystem read-only and read-write.
801 any unexpected filesystem shutdowns due to corrupted metadata, kernel hang
822 metadata in a filesystem, ``xfs_scrub`` can be run as a foreground process on
824 The program checks every piece of metadata in the filesystem while the
873 redundancy can be provided elsewhere above the filesystem, or the storage
882 filesystem tree was restricted to the minimum needed to start the program and
883 access the filesystem being scanned.
886 This measure was taken to minimize delays in the rest of the filesystem.
896 XFS caches a summary of each filesystem's health status in memory.
898 inconsistencies are detected in the filesystem metadata during regular
942 the current filesystem, and that the information contained in the block is
945 that doesn't belong to the filesystem, and the fourth component enables the
946 filesystem to detect lost writes.
957 log updates to the filesystem.
960 the filesystem to detect obvious corruption when reading metadata blocks from
971 filesystem design.
974 For performance reasons, filesystem authors were reluctant to add redundancy to
975 the filesystem, even at the cost of data integrity.
983 By adding a new index, the filesystem retains most of its ability to scale
991 defragmentation, better media failure reporting, and filesystem shrinking.
993 defeats device-level deduplication because the filesystem requires real
1003 | copy-writes, which age the filesystem prematurely.                       |
1026 in units of filesystem blocks.
1035 Online filesystem checking judges the consistency of each primary metadata
1067    required locking order is not the same order used by regular filesystem
1069    For example, if the filesystem normally takes a file ILOCK before taking
1078 into the filesystem are covered in subsequent sections.
1094 - Is there so much damage around the filesystem that cross-referencing is not
1111 - Does the block belong to this filesystem?
1125 establish that the filesystem code is reasonably free of gross corruption bugs
1128 failed system calls, and in the extreme case, filesystem shutdowns if the
1143 record verification code built into the filesystem.
1144 These checks are split between the buffer verifiers, the in-filesystem users of
1164 the filesystem.
1177 Various pieces of filesystem metadata are directly controlled by userspace.
1183 - Filesystem labels
1188 - Names present in directory entries, extended attribute keys, and filesystem
1347 Values that are less than 3/4 the size of a filesystem block are also stored
1350 If the leaf information exceeds a single filesystem block, a dabtree (also
1377 The filesystem directory tree is a directed acyclic graph structure, with files
1487 extents) can be found by walking the entire filesystem.
1488 This would make for very slow reporting, so a transactional filesystem can
1490 Cross-referencing these values against the filesystem metadata should be a
1505 of the filesystem and the progress of any repairs.
1524 should be relatively rare as compared to filesystem change operations.
1529   The count should be dropped when the filesystem has locked the AG header
1539 filesystem updates take precedence over background checking activity.
1594 If the system goes down after transaction #1 is written back to the filesystem
1595 but before #2 is committed, a scan of the filesystem metadata would show
1596 inconsistent filesystem metadata because there would not appear to be any owner
1615   correctness of filesystem operations.
1617 * Unmounting the filesystem flushes all pending work to disk, which means that
1626 single filesystem change into a single transaction because a single file
1687    make the filesystem very slow.
1703    filesystem.
1725 For regular filesystem code, the drain works as follows:
1768 Online fsck for XFS separates the regular filesystem from the checking and
1772 what's going on in the rest of the filesystem.
1801 filesystem operations when xfs_scrub is not running, the intended usage
1810   filesystem should call the ``static_branch_unlikely`` predicate to avoid the
1813 - The regular filesystem should export helper functions that call
1846 Some online checking functions work by scanning the filesystem to build a
1855 in a place that doesn't require the correct operation of the filesystem.
1884 | it found them, which failed because filesystem could shut down with a    |
2193 between a live metadata scan of the filesystem and writer threads that are
2196 metadata updates from the filesystem into the data being collected by the scan.
2201 filesystem directly into the scan data, which trades more overhead for a lower
2315 log transactions back into the filesystem, and certainly won't exist during
2449 unused space, the free space, leaving the filesystem unchanged.
2460 To avoid livelocking the filesystem, the EFIs must not pin the tail of the log
2782 leads to service degradations as space leaks out of the filesystem.
2923 free space btrees are constructed in what the ondisk filesystem thinks is
3013 badly damaged that the filesystem cannot load the in-memory representation.
3065 Filesystem summary counters track availability of filesystem resources such
3069 that should reflect the ondisk metadata, at least when the filesystem has been
3081 The only time XFS commits the summary counters is at filesystem unmount.
3089 Although online fsck can read the filesystem metadata to compute the correct
3096 filesystem metadata to get an accurate reading and install it in the percpu
3100 system from initiating new writes to the filesystem, it must disable background
3106 This is very similar to a filesystem freeze, though not all of the pieces are
3110   prevent other threads from thawing the filesystem, or other scrub threads
3115 With this code in place, it is now possible to pause the filesystem for just
3121 | The initial implementation used the actual VFS filesystem freeze         |
3122 | mechanism to quiesce filesystem activity.                                |
3123 | With the filesystem frozen, it is possible to resolve the counter values |
3127 | - Other programs can unfreeze the filesystem without our knowledge.      |
3130 | - Adding an extra lock to prevent others from thawing the filesystem     |
3137 |   This can happen if the filesystem is unmounted while the underlying    |
3138 |   block device has frozen the filesystem.                                |
3151 |   sync_filesystem fails to flush the filesystem and returns an error.    |
3160 Full Filesystem Scans
3164 entire filesystem to record observations and comparing the observations against
3168 However, it is not practical to shut down the entire filesystem to examine
3171 all the files in the filesystem.
3198 the space in the data section filesystem.
3210 concurrent filesystem update needs to be incorporated into the scan data.
3236       filesystem.
3257    the filesystem until the scan releases the incore inode.
3261 Online fsck functions scan all files in the filesystem as follows:
3284 coordinator must release the AGI and push the main filesystem to get the inode
3299 In regular filesystem code, references to allocated XFS incore inodes are
3305 filesystem must ensure the atomicity of the ondisk inode btree index updates
3333 2. Filesystem freeze protection, if repairing (``mnt_want_write_file``).
3372 before the inode reference in the regular filesystem.
3378 Filesystem callers can short-circuit the LRU process by setting a ``DONTCACHE``
3402 In regular filesystem code, the VFS and XFS will acquire multiple IOLOCK locks
3409 Due to the structure of existing filesystem code, IOLOCKs and MMAPLOCKs must be
3427 scrub avoids deadlocking the filesystem or becoming an unresponsive process.
3442 walk of every directory on the filesystem while holding the child locked, and
3444 The coordinated inode scan provides a way to walk the filesystem without the
3460 Filesystem Hooks
3464 filesystem scan is the ability to stay informed about updates being made by
3465 other threads in the filesystem, since comparisons against the past are useless
3468 filesystem operations: filesystem hooks and :ref:`static keys<jump_labels>`.
3470 Filesystem hooks convey information about an ongoing filesystem operation to
3488 The following pieces are necessary to hook a certain point in the filesystem:
3491   a well-known incore filesystem object.
3500 - A callsite in the regular filesystem code must be chosen to call
3503   the filesystem update is committed to the transaction.
3504   In general, when the filesystem calls a hook chain, it should be able to
3512   The scanner function and the regular filesystem code must acquire resources
3528 Static keys are used to reduce the overhead of filesystem hooks to nearly
3537 filesystem code look like this::
3545             filesystem function             │
3566 checking code and the code making an update to the filesystem:
3568 - Prior to invoking the notifier call chain, the filesystem function being
3582   They must not acquire any resources that might conflict with the filesystem
3623 2. Walk every inode in the filesystem.
3633 filesystem objects until the newly collected metadata reflect all filesystem
3708 filesystem, and per-file link count records are stored in a sparse ``xfarray``
3733 Live update hooks are carefully placed in all parts of the filesystem that
3764 Most repair functions follow the same pattern: lock filesystem resources,
3769 do not require hooks in the main filesystem, and are usually the most efficient
3775 For repairs going on within a shard of the filesystem, these advantages
3780 every file in the filesystem, and the filesystem cannot stop.
3795    can receive updates to the rmap btree from the rest of the filesystem during
3858 Because file forks can consume as much space as the entire filesystem, repairs
3861 the XFS filesystem, writes a new structure at the correct offsets into the
3868 **Note**: All space usage and inode indices in the filesystem *must* be
3907 | - Array structures are linearly addressed, and the regular filesystem    |
3950 temporary file inside the filesystem.
4041 | filesystem.                                                              |
4047 | filesystem, either as part of an unmount or because the system is        |
4058 | Filesystem code signals its intention to use a log incompat feature in a |
4185 If the filesystem goes down in the middle of an operation, log recovery will
4198 Like any filesystem operation, extent swapping must determine the maximum
4217 The filesystem must not run completely out of free space, nor can the extent
4257 If the filesystem should go down during the reap part of the repair, the
4290 In the "realtime" section of an XFS filesystem, free space is tracked via a
4293 the filesystem block size between 4KiB and 1GiB in size.
4380 Fixing directories is difficult with currently available filesystem features,
4404    Otherwise, walk the filesystem to find it.
4463 reconstruction of filesystem space metadata.
4552       functions are not allowed to modify filesystem metadata.
4637       hook functions are not allowed to modify filesystem metadata.
4663    walk the surviving directories of each AG in the filesystem.
4670 3. For each AG in the filesystem,
4710 challenging because it currently uses a single-pass scan of the filesystem
4740 The root of the filesystem is a directory, and each entry in a directory points
4749 that isn't pointed to by any directory in the filesystem.
4808 filesystem from its beginnings in 1993.
4811 a. Filesystem summary counts depend on consistency within the inode indices,
4840 i. ``xfs_scrub`` depends on the filesystem being mounted and kernel support
4846 - Phase 1 checks that the provided path maps to an XFS filesystem and detects
4871 An XFS filesystem can easily contain hundreds of millions of inodes.
4886 filesystem contains one AG with a few large sparse files and the rest of the
4938 filesystem object, it became much more memory-efficient to track all eligible
4939 repairs for a given filesystem object with a single repair item.
4956            given filesystem object.
4971            given filesystem object.
4993 filesystem.
5012 If ``xfs_scrub`` succeeds in validating the filesystem metadata by the end of
5014 the filesystem.
5015 These names consist of the filesystem label, names in directory entries, and
5024 - Null bytes are not allowed in the filesystem label.
5037 In the common case, therefore, names found in an XFS filesystem are actually
5066 Most filesystem drivers persist the byte sequence names that are given to them
5089 This scan runs after validation of all filesystem metadata (except for the summary
5091 The scan starts by calling ``FS_IOC_GETFSMAP`` to scan the filesystem space map
5112 rebuilding of its metadata indices, and how filesystem users can interact with
5179   logical sector size matching the filesystem block size to force all writes
5180   to be aligned to the filesystem block size.
5198 filesystem object, a list of scrub types to run against that object, and a
5235 clear a portion of the physical storage underlying a filesystem so that it
5279 most shared data extents in the filesystem, and target them first.
5281 **Future Work Question**: How might the filesystem move inode chunks?
5285 the filesystem updating directory entries.
5286 The operation cannot complete if the filesystem goes down.
5289 filesystem to update directory entries.
5311 Removing the end of the filesystem ought to be a simple matter of evacuating
5312 the data and metadata at the end of the filesystem, and handing the freed space
5314 That requires an evacuation of the space at the end of the filesystem, which is a