.. SPDX-License-Identifier: GPL-2.0

======================
The SGI XFS Filesystem
======================

XFS is a high-performance journaling filesystem which originated
on the SGI IRIX platform.  It is completely multi-threaded and
extent based, supports large files and large filesystems, extended
attributes, and variable block sizes, and makes extensive use of
B-trees (directories, extents, free space) to aid both performance
and scalability.

Refer to the documentation at https://xfs.wiki.kernel.org/
for further details.  This implementation is on-disk compatible
with the IRIX version of XFS.


Mount Options
=============

When mounting an XFS filesystem, the following options are accepted.

  allocsize=size
	Sets the buffered I/O end-of-file preallocation size when
	doing delayed allocation writeout (default size is 64KiB).
	Valid values for this option are page size (typically 4KiB)
	through to 1GiB, inclusive, in power-of-2 increments.

	The default behaviour is dynamic end-of-file preallocation,
	which uses a set of heuristics to optimise the preallocation
	size based on the current allocation patterns within the file
	and the access patterns to the file. Specifying a fixed
	``allocsize`` value turns off the dynamic behaviour.

  attr2 or noattr2
	These options enable/disable an "opportunistic" improvement
	in the way inline extended attributes are stored on-disk.
	When ``attr2`` is selected and the new form is used for the
	first time (either when setting or removing extended
	attributes), the on-disk superblock feature bit field will be
	updated to reflect this format being in use.

	The default behaviour is determined by the on-disk feature
	bit indicating that ``attr2`` behaviour is active. If either
	mount option is set, then that becomes the new default used
	by the filesystem.

	CRC enabled filesystems always use the ``attr2`` format, and so
	will reject the ``noattr2`` mount option if it is set.

  discard or nodiscard (default)
	Enable/disable the issuing of commands to let the block
	device reclaim space freed by the filesystem.  This is
	useful for SSD devices, thinly provisioned LUNs and virtual
	machine images, but may have a performance impact.

	Note: It is currently recommended that you use the ``fstrim``
	application to ``discard`` unused blocks rather than the ``discard``
	mount option because the performance impact of this option
	is quite severe.

  grpid/bsdgroups or nogrpid/sysvgroups (default)
	These options define what group ID a newly created file
	gets.  When ``grpid`` is set, it takes the group ID of the
	directory in which it is created; otherwise it takes the
	``fsgid`` of the current process, unless the directory has the
	``setgid`` bit set, in which case it takes the ``gid`` from the
	parent directory, and also gets the ``setgid`` bit set if it is
	a directory itself.

  filestreams
	Make the data allocator use the filestreams allocation mode
	across the entire filesystem rather than just on directories
	configured to use it.

  ikeep or noikeep (default)
	When ``ikeep`` is specified, XFS does not delete empty inode
	clusters and keeps them around on disk.  When ``noikeep`` is
	specified, empty inode clusters are returned to the free
	space pool.

  inode32 or inode64 (default)
	When ``inode32`` is specified, it indicates that XFS limits
	inode creation to locations which will not result in inode
	numbers with more than 32 bits of significance.

	When ``inode64`` is specified, it indicates that XFS is allowed
	to create inodes at any location in the filesystem,
	including those which will result in inode numbers occupying
	more than 32 bits of significance.

	``inode32`` is provided for backwards compatibility with older
	systems and applications, since 64-bit inode numbers might
	cause problems for some applications that cannot handle
	large inode numbers.  If applications are in use which do
	not handle inode numbers bigger than 32 bits, the ``inode32``
	option should be specified.

  largeio or nolargeio (default)
	If ``nolargeio`` is specified, the optimal I/O reported in
	``st_blksize`` by **stat(2)** will be as small as possible to allow
	user applications to avoid inefficient read/modify/write
	I/O.  This is typically the page size of the machine, as
	this is the granularity of the page cache.

	If ``largeio`` is specified, a filesystem that was created with a
	``swidth`` specified will return the ``swidth`` value (in bytes)
	in ``st_blksize``. If the filesystem does not have a ``swidth``
	specified but does specify an ``allocsize`` then ``allocsize``
	(in bytes) will be returned instead. Otherwise the behaviour
	is the same as if ``nolargeio`` was specified.

  logbufs=value
	Set the number of in-memory log buffers.  Valid numbers
	range from 2-8 inclusive.

	The default value is 8 buffers.

	If the memory cost of 8 log buffers is too high on small
	systems, then it may be reduced at some cost to performance
	on metadata intensive workloads. The ``logbsize`` option below
	controls the size of each buffer and so is also relevant to
	this case.

  lifetime (default) or nolifetime
	Enable data placement based on write lifetime hints provided
	by the user. This turns on co-allocation of data of similar
	lifetimes when statistically favorable to reduce garbage
	collection cost.

	These options are only available for zoned rt file systems.

  logbsize=value
	Set the size of each in-memory log buffer.  The size may be
	specified in bytes, or in kilobytes with a "k" suffix.
	Valid sizes for version 1 and version 2 logs are 16384 (16k)
	and 32768 (32k).  Valid sizes for version 2 logs also
	include 65536 (64k), 131072 (128k) and 262144 (256k). The
	logbsize must be an integer multiple of the log
	stripe unit configured at **mkfs(8)** time.

	The default value for version 1 logs is 32768, while the
	default value for version 2 logs is MAX(32768, log_sunit).

  logdev=device and rtdev=device
	Use an external log (metadata journal) and/or real-time device.
	An XFS filesystem has up to three parts: a data section, a log
	section, and a real-time section.  The real-time section is
	optional, and the log section can be separate from the data
	section or contained within it.

  max_atomic_write=value
	Set the maximum size of an atomic write.  The size may be
	specified in bytes, in kilobytes with a "k" suffix, in megabytes
	with a "m" suffix, or in gigabytes with a "g" suffix.  The size
	cannot be larger than the maximum write size, larger than the
	size of any allocation group, or larger than the size of a
	remapping operation that the log can complete atomically.

	The default value is to set the maximum I/O completion size
	to allow each CPU to handle one at a time.

  max_open_zones=value
	Specify the max number of zones to keep open for writing on a
	zoned rt device. Having many open zones aids file data
	separation but may impact performance on HDDs.

	If ``max_open_zones`` is not specified, the value is determined
	by the capabilities and the size of the zoned rt device.

  noalign
	Data allocations will not be aligned at stripe unit
	boundaries. This is only relevant to filesystems created
	with non-zero data alignment parameters (``sunit``, ``swidth``) by
	**mkfs(8)**.

  norecovery
	The filesystem will be mounted without running log recovery.
	If the filesystem was not cleanly unmounted, it is likely to
	be inconsistent when mounted in ``norecovery`` mode.
	Some files or directories may not be accessible because of this.
	Filesystems mounted ``norecovery`` must be mounted read-only or
	the mount will fail.

  nouuid
	Don't check for double mounted file systems using the file
	system ``uuid``.  This is useful to mount LVM snapshot volumes,
	and often used in combination with ``norecovery`` for mounting
	read-only snapshots.

  noquota
	Forcibly turns off all quota accounting and enforcement
	within the filesystem.

  uquota/usrquota/uqnoenforce/quota
	User disk quota accounting enabled, and limits (optionally)
	enforced.  Refer to **xfs_quota(8)** for further details.

  gquota/grpquota/gqnoenforce
	Group disk quota accounting enabled and limits (optionally)
	enforced.  Refer to **xfs_quota(8)** for further details.

  pquota/prjquota/pqnoenforce
	Project disk quota accounting enabled and limits (optionally)
	enforced.  Refer to **xfs_quota(8)** for further details.

  sunit=value and swidth=value
	Used to specify the stripe unit and width for a RAID device
	or a stripe volume.  "value" must be specified in 512-byte
	block units. These options are only relevant to filesystems
	that were created with non-zero data alignment parameters.

	The ``sunit`` and ``swidth`` parameters specified must be compatible
	with the existing filesystem alignment characteristics.  In
	general, that means the only valid changes to ``sunit`` are
	increasing it by a power-of-2 multiple. Valid ``swidth`` values
	are any integer multiple of a valid ``sunit`` value.

	Typically the only time these mount options are necessary is
	after an underlying RAID device has had its geometry
	modified, such as adding a new disk to a RAID5 LUN and
	reshaping it.

  swalloc
	Data allocations will be rounded up to stripe width boundaries
	when the current end of file is being extended and the file
	size is larger than the stripe width size.

  wsync
	When specified, all filesystem namespace operations are
	executed synchronously. This ensures that when the namespace
	operation (create, unlink, etc) completes, the change to the
	namespace is on stable storage. This is useful in HA setups
	where failover must not result in clients seeing
	inconsistent namespace presentation during or after a
	failover event.
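
Several of the options above take sizes with documented value ranges. The
following Python sketch restates those rules as small checks. It is only an
illustration of the text above and not how the kernel parses mount options:
every function name is hypothetical, the suffix parser follows the
``max_atomic_write`` description, and computing ``swidth`` as ``sunit`` times
the number of data disks is an assumption about a typical striped layout.

```python
import re

KIB = 1024
GIB = 1024 ** 3

# Multipliers for the "k", "m" and "g" suffixes (no suffix means bytes).
_SUFFIXES = {"": 1, "k": KIB, "m": KIB ** 2, "g": KIB ** 3}


def parse_size(text):
    """Parse a size such as '65536', '64k', '16m' or '1g' into bytes."""
    match = re.fullmatch(r"(\d+)([kmg]?)", text.strip().lower())
    if match is None:
        raise ValueError(f"unrecognised size: {text!r}")
    return int(match.group(1)) * _SUFFIXES[match.group(2)]


def is_valid_allocsize(size, page_size=4096):
    """allocsize: a power of two from the page size through 1GiB inclusive."""
    is_power_of_two = size > 0 and size & (size - 1) == 0
    return is_power_of_two and page_size <= size <= GIB


def is_valid_logbsize(size, log_version=2, log_sunit=0):
    """logbsize: a listed size that is a multiple of the log stripe unit."""
    allowed = {16384, 32768}
    if log_version == 2:
        allowed |= {65536, 131072, 262144}
    if size not in allowed:
        return False
    # A log stripe unit of 0 is used here to mean "none configured".
    return log_sunit == 0 or size % log_sunit == 0


def stripe_options(stripe_unit_bytes, data_disks):
    """Convert a stripe geometry into sunit/swidth in 512-byte units."""
    if stripe_unit_bytes % 512 != 0:
        raise ValueError("stripe unit must be a multiple of 512 bytes")
    sunit = stripe_unit_bytes // 512
    return sunit, sunit * data_disks
```

For example, a 64KiB stripe unit across four data disks corresponds to
``sunit=128`` and ``swidth=512`` under these assumptions.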

Deprecation of V4 Format
========================

The V4 filesystem format lacks certain features that are supported by
the V5 format, such as metadata checksumming, strengthened metadata
verification, and the ability to store timestamps past the year 2038.
Because of this, the V4 format is deprecated.  All users should upgrade
by backing up their files, reformatting, and restoring from the backup.

Administrators and users can detect a V4 filesystem by running xfs_info
against a filesystem mountpoint and checking for a string containing
"crc=".  A value of "crc=0" indicates a V4 filesystem.  If no such
string is found, please upgrade xfsprogs to the latest version and try
again.
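
The check described above can be scripted. The Python sketch below classifies
text captured from xfs_info by looking for the "crc=" string. It assumes that
"crc=1" marks a V5 filesystem and "crc=0" a V4 one; the function name and the
abbreviated sample output are illustrative only and were not taken from a real
system.

```python
import re


def xfs_format_from_info(info_text):
    """Return 'V5', 'V4', or None when no 'crc=' string is present."""
    match = re.search(r"\bcrc=(\d+)", info_text)
    if match is None:
        # No "crc=" string at all: the installed xfsprogs does not report it.
        return None
    return "V5" if match.group(1) == "1" else "V4"


# Abbreviated, made-up sample in the general shape of xfs_info output.
SAMPLE = """\
meta-data=/dev/sdb1   isize=512    agcount=4, agsize=655360 blks
         =            sectsz=512   attr=2, projid32bit=1
         =            crc=1        finobt=1, sparse=1, rmapbt=0
"""

print(xfs_format_from_info(SAMPLE))
```

A result of ``None`` corresponds to the "upgrade xfsprogs and try again" case.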

The deprecation will take place in two parts.  Support for mounting V4
filesystems can now be disabled at kernel build time via a Kconfig option.
The option will default to yes until September 2025, at which time it
will be changed to default to no.  In September 2030, support will be
removed from the codebase entirely.

Note: Distributors may choose to withdraw V4 format support earlier than
the dates listed above.

Deprecated Mount Options
========================

============================    ================
  Name				Removal Schedule
============================    ================
Mounting with V4 filesystem     September 2030
Mounting ascii-ci filesystem    September 2030
ikeep/noikeep			September 2025
attr2/noattr2			September 2025
============================    ================


Removed Mount Options
=====================

===========================     =======
  Name				Removed
===========================	=======
  delaylog/nodelaylog		v4.0
  ihashsize			v4.0
  irixsgid			v4.0
  osyncisdsync/osyncisosync	v4.0
  barrier			v4.19
  nobarrier			v4.19
===========================     =======

sysctls
=======

The following sysctls are available for the XFS filesystem:

  fs.xfs.stats_clear		(Min: 0  Default: 0  Max: 1)
	Setting this to "1" clears accumulated XFS statistics
	in /proc/fs/xfs/stat.  It then immediately resets to "0".

  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
	The interval at which the filesystem flushes metadata
	out to disk and runs internal cache cleanup routines.

  fs.xfs.filestream_centisecs	(Min: 1  Default: 3000  Max: 360000)
	The interval at which the filesystem ages filestreams cache
	references and returns timed-out AGs back to the free stream
	pool.

  fs.xfs.speculative_prealloc_lifetime
	(Units: seconds   Min: 1  Default: 300  Max: 86400)
	The interval at which the background scanning for inodes
	with unused speculative preallocation runs. The scan
	removes unused preallocation from clean inodes and releases
	the unused space back to the free pool.

  fs.xfs.speculative_cow_prealloc_lifetime
	This is an alias for speculative_prealloc_lifetime.

  fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
	A volume knob for error reporting when internal errors occur.
	This will generate detailed messages & backtraces for filesystem
	shutdowns, for example.  Current threshold values are:

		XFS_ERRLEVEL_OFF:       0
		XFS_ERRLEVEL_LOW:       1
		XFS_ERRLEVEL_HIGH:      5

  fs.xfs.panic_mask		(Min: 0  Default: 0  Max: 511)
	Causes certain error conditions to call BUG(). Value is a bitmask;
	OR together the tags which represent errors which should cause panics:

		XFS_NO_PTAG                     0
		XFS_PTAG_IFLUSH                 0x00000001
		XFS_PTAG_LOGRES                 0x00000002
		XFS_PTAG_AILDELETE              0x00000004
		XFS_PTAG_ERROR_REPORT           0x00000008
		XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
		XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
		XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
		XFS_PTAG_FSBLOCK_ZERO           0x00000080
		XFS_PTAG_VERIFIER_ERROR         0x00000100

	This option is intended for debugging only.

  fs.xfs.irix_symlink_mode	(Min: 0  Default: 0  Max: 1)
	Controls whether symlinks are created with mode 0777 (default)
	or whether their mode is affected by the umask (irix mode).

  fs.xfs.irix_sgid_inherit	(Min: 0  Default: 0  Max: 1)
	Controls files created in SGID directories.
	If the group ID of the new file does not match the effective group
	ID or one of the supplementary group IDs of the parent dir, the
	ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
	is set.

  fs.xfs.inherit_sync		(Min: 0  Default: 1  Max: 1)
	Setting this to "1" will cause the "sync" flag set
	by the **xfs_io(8)** chattr command on a directory to be
	inherited by files in that directory.

  fs.xfs.inherit_nodump		(Min: 0  Default: 1  Max: 1)
	Setting this to "1" will cause the "nodump" flag set
	by the **xfs_io(8)** chattr command on a directory to be
	inherited by files in that directory.

  fs.xfs.inherit_noatime	(Min: 0  Default: 1  Max: 1)
	Setting this to "1" will cause the "noatime" flag set
	by the **xfs_io(8)** chattr command on a directory to be
	inherited by files in that directory.

  fs.xfs.inherit_nosymlinks	(Min: 0  Default: 1  Max: 1)
	Setting this to "1" will cause the "nosymlinks" flag set
	by the **xfs_io(8)** chattr command on a directory to be
	inherited by files in that directory.

  fs.xfs.inherit_nodefrag	(Min: 0  Default: 1  Max: 1)
	Setting this to "1" will cause the "nodefrag" flag set
	by the **xfs_io(8)** chattr command on a directory to be
	inherited by files in that directory.

  fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
	In "inode32" allocation mode, this option determines how many
	files the allocator attempts to allocate in the same allocation
	group before moving to the next allocation group.  The intent
	is to control the rate at which the allocator moves between
	allocation groups when allocating extents for new files.
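
Because ``fs.xfs.panic_mask`` is a bitmask, the value to write is the bitwise
OR of the chosen tags. The Python sketch below shows that arithmetic using the
tag values listed for ``fs.xfs.panic_mask`` above. The ``PanicTag`` class and
the helper names are hypothetical conveniences and are not part of any XFS
tool.

```python
from enum import IntFlag


class PanicTag(IntFlag):
    """Tag values copied from the fs.xfs.panic_mask description."""
    IFLUSH = 0x001
    LOGRES = 0x002
    AILDELETE = 0x004
    ERROR_REPORT = 0x008
    SHUTDOWN_CORRUPT = 0x010
    SHUTDOWN_IOERROR = 0x020
    SHUTDOWN_LOGERROR = 0x040
    FSBLOCK_ZERO = 0x080
    VERIFIER_ERROR = 0x100


def panic_mask(*tags):
    """OR the given tags together into the integer to write to the sysctl."""
    mask = 0
    for tag in tags:
        mask |= int(tag)
    return mask


def decode_panic_mask(mask):
    """List the tag names that are set in an existing mask value."""
    return [tag.name for tag in PanicTag if mask & tag]


print(panic_mask(PanicTag.SHUTDOWN_CORRUPT, PanicTag.VERIFIER_ERROR))
```

ORing all nine tags together gives 511, which matches the documented maximum.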

Deprecated Sysctls
==================

===========================================     ================
  Name                                          Removal Schedule
===========================================     ================
fs.xfs.irix_sgid_inherit                        September 2025
fs.xfs.irix_symlink_mode                        September 2025
fs.xfs.speculative_cow_prealloc_lifetime        September 2025
===========================================     ================


Removed Sysctls
===============

=============================	=======
  Name				Removed
=============================	=======
  fs.xfs.xfsbufd_centisec	v4.0
  fs.xfs.age_buffer_centisecs	v4.0
=============================	=======

Error handling
==============

XFS can act differently according to the type of error found during its
operation. The implementation introduces the following concepts to the error
handler:

 -failure speed:
	Defines how fast XFS should propagate an error upwards when a specific
	error is found during the filesystem operation. It can propagate
	immediately, after a defined number of retries, after a set time period,
	or simply retry forever.

 -error classes:
	Specifies the subsystem the error configuration will apply to, such as
	metadata IO or memory allocation. Different subsystems will have
	different error handlers for which behaviour can be configured.

 -error handlers:
	Defines the behavior for a specific error.

The filesystem behavior during an error can be set via ``sysfs`` files. Each
error handler works independently - the first condition met by an error handler
for a specific class will cause the error to be propagated rather than reset and
retried.

The action taken by the filesystem when the error is propagated is context
dependent - it may cause a shut down in the case of an unrecoverable error,
it may be reported back to userspace, or it may even be ignored because
there's nothing useful we can do with the error or anyone we can report it to
(e.g. during unmount).

The configuration files are organized into the following hierarchy for each
mounted filesystem:

  /sys/fs/xfs/<dev>/error/<class>/<error>/

Where:
  <dev>
	The short device name of the mounted filesystem. This is the same device
	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."

  <class>
	The subsystem the error configuration belongs to. As of 4.9, the defined
	classes are:

		- "metadata": applies to metadata buffer write IO

  <error>
	The individual error handler configurations.
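
As a small illustration of the hierarchy above, the Python sketch below
assembles the configuration directory for a given device, class and error. The
function name is hypothetical, and the device name ``sda1`` and the ``ENODEV``
handler in the usage line are example values only.

```python
from pathlib import PurePosixPath


def error_config_dir(dev, error_class, error):
    """Build /sys/fs/xfs/<dev>/error/<class>/<error> as a path object."""
    return PurePosixPath("/sys/fs/xfs") / dev / "error" / error_class / error


# Example values only: "sda1" stands in for the short device name.
print(error_config_dir("sda1", "metadata", "ENODEV"))
```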


Each filesystem has "global" error configuration options defined in their top
level directory:

  /sys/fs/xfs/<dev>/error/

  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
	Defines the filesystem error behavior at unmount time.

	If set to a value of 1, XFS will override all other error configurations
	during unmount and replace them with "immediate fail" characteristics.
	i.e. no retries, no retry timeout. This will always allow unmount to
	succeed when there are persistent errors present.

	If set to 0, the configured retry behaviour will continue until all
	retries and/or timeouts have been exhausted. This will delay unmount
	completion when there are persistent errors, and it may prevent the
	filesystem from ever unmounting fully in the case of "retry forever"
	handler configurations.

	Note: there is no guarantee that fail_at_unmount can be set while an
	unmount is in progress. It is possible that the ``sysfs`` entries are
	removed by the unmounting filesystem before a "retry forever" error
	handler configuration causes unmount to hang, and hence the filesystem
	must be configured appropriately before unmount begins to prevent
	unmount hangs.

Each filesystem has specific error class handlers that define the error
propagation behaviour for specific errors. There is also a "default" error
handler defined, which defines the behaviour for all errors that don't have
specific handlers defined. Where multiple retry constraints are configured for
a single error, the first retry configuration that expires will cause the error
to be propagated. The handler configurations are found in the directory:

  /sys/fs/xfs/<dev>/error/<class>/<error>/

  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
	Defines the allowed number of retries of a specific error before
	the filesystem will propagate the error. The retry count for a given
	error context (e.g. a specific metadata buffer) is reset every time
	there is a successful completion of the operation.

	Setting the value to "-1" will cause XFS to retry forever for this
	specific error.

	Setting the value to "0" will cause XFS to fail immediately when the
	specific error is reported.

	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
	operation "N" times before propagating the error.

  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
	Define the amount of time (in seconds) that the filesystem is
	allowed to retry its operations when the specific error is
	found.

	Setting the value to "-1" will allow XFS to retry forever for this
	specific error.

	Setting the value to "0" will cause XFS to fail immediately when the
	specific error is reported.

	Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
	operation for up to "N" seconds before propagating the error.

**Note:** The default behaviour for a specific error handler is dependent on both
the class and error context. For example, the default values for
"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
to "fail immediately" behaviour. This is done because ENODEV is a fatal,
unrecoverable error no matter how many times the metadata IO is retried.
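
The interaction between ``max_retries``, ``retry_timeout_seconds`` and
``fail_at_unmount`` can be summarised with a small model. The Python sketch
below is a simplified reading of the rules documented above, written for
illustration. It is not the kernel's implementation, and the class and method
names are hypothetical.

```python
from dataclasses import dataclass

RETRY_FOREVER = -1  # the documented "-1" setting


@dataclass
class ErrorHandler:
    """Simplified model of one <class>/<error> handler configuration."""
    max_retries: int
    retry_timeout_seconds: int

    def should_propagate(self, retries_so_far, seconds_elapsed,
                         unmounting=False, fail_at_unmount=True):
        """Return True once the error should be propagated, not retried."""
        # fail_at_unmount=1 overrides every other setting during unmount.
        if unmounting and fail_at_unmount:
            return True
        retries_exhausted = (self.max_retries != RETRY_FOREVER
                             and retries_so_far >= self.max_retries)
        timed_out = (self.retry_timeout_seconds != RETRY_FOREVER
                     and seconds_elapsed >= self.retry_timeout_seconds)
        # Whichever constraint expires first causes propagation.
        return retries_exhausted or timed_out


# "metadata/ENODEV" defaults to 0 for both values: fail immediately.
enodev = ErrorHandler(max_retries=0, retry_timeout_seconds=0)
print(enodev.should_propagate(retries_so_far=0, seconds_elapsed=0))
```

In this model a handler with both values set to "-1" never propagates the
error on its own, which is the "retry forever" case that can stall an unmount
when ``fail_at_unmount`` is 0.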

Workqueue Concurrency
=====================

XFS uses kernel workqueues to parallelize metadata update processes.  This
enables it to take advantage of storage hardware that can service many IO
operations simultaneously.  This interface exposes internal implementation
details of XFS, and as such is explicitly not part of any userspace API/ABI
guarantee the kernel may give userspace.  These are undocumented features of
the generic workqueue implementation XFS uses for concurrency, and they are
provided here purely for diagnostic and tuning purposes and may change at any
time in the future.

The control knobs for a filesystem's workqueues are organized by task at hand
and the short name of the data device.  They all can be found in:

  /sys/bus/workqueue/devices/${task}!${device}

================  ===========
  Task            Description
================  ===========
  xfs_iwalk-$pid  Inode scans of the entire filesystem. Currently limited to
                  mount time quotacheck.
  xfs-gc          Background garbage collection of disk space that has been
                  speculatively allocated beyond EOF or for staging copy on
                  write operations.
================  ===========

For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
found in /sys/bus/workqueue/devices/xfs_iwalk-1111!nvme0n1/.

The interesting knobs for XFS workqueues are as follows:

============     ===========
  Knob           Description
============     ===========
  max_active     Maximum number of background threads that can be started to
                 run the work.
  cpumask        CPUs upon which the threads are allowed to run.
  nice           Relative priority of scheduling the threads.  These are the
                 same nice levels that can be applied to userspace processes.
============     ===========
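
The naming scheme above can be expressed as a pair of helpers. The Python
sketch below builds the path of a knob from the task and device names and
splits a workqueue directory name back into its parts. Both function names are
hypothetical, and the sketch assumes each knob is a file inside the workqueue
directory.

```python
WORKQUEUE_ROOT = "/sys/bus/workqueue/devices"


def workqueue_knob_path(task, device, knob):
    """Return the path of one knob for the ${task}!${device} workqueue."""
    return f"{WORKQUEUE_ROOT}/{task}!{device}/{knob}"


def split_workqueue_name(name):
    """Split a directory name such as 'xfs_iwalk-1111!nvme0n1'."""
    task, separator, device = name.partition("!")
    if not separator:
        raise ValueError(f"not a task!device name: {name!r}")
    return task, device


print(workqueue_knob_path("xfs_iwalk-1111", "nvme0n1", "max_active"))
```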

Zoned Filesystems
=================

For zoned file systems, the following attributes are exposed in:

  /sys/fs/xfs/<dev>/zoned/

  max_open_zones		(Min:  1  Default:  Varies  Max:  UINTMAX)
	This read-only attribute exposes the maximum number of open zones
	available for data placement. The value is determined at mount time and
	is limited by the capabilities of the backing zoned device, file system
	size and the max_open_zones mount option.

  zonegc_low_space		(Min:  0  Default:  0  Max:  100)
	Defines the percentage of unused space that GC should keep available
	for writing. A high value will reclaim more of the space
	occupied by unused blocks, creating a larger buffer against write
	bursts at the cost of increased write amplification.  Regardless
	of this value, garbage collection will always aim to free a minimum
	amount of blocks to keep max_open_zones open for data placement purposes.
