1.. SPDX-License-Identifier: GPL-2.0
2
3======================
4The SGI XFS Filesystem
5======================
6
7XFS is a high performance journaling filesystem which originated
8on the SGI IRIX platform.  It is completely multi-threaded, can
9support large files and large filesystems, extended attributes,
10variable block sizes, is extent based, and makes extensive use of
11Btrees (directories, extents, free space) to aid both performance
12and scalability.
13
14Refer to the documentation at https://xfs.wiki.kernel.org/
15for further details.  This implementation is on-disk compatible
16with the IRIX version of XFS.
17
18
19Mount Options
20=============
21
22When mounting an XFS filesystem, the following options are accepted.
23
24  allocsize=size
25	Sets the buffered I/O end-of-file preallocation size when
26	doing delayed allocation writeout (default size is 64KiB).
27	Valid values for this option are page size (typically 4KiB)
28	through to 1GiB, inclusive, in power-of-2 increments.
29
30	The default behaviour is for dynamic end-of-file
31	preallocation size, which uses a set of heuristics to
32	optimise the preallocation size based on the current
33	allocation patterns within the file and the access patterns
34	to the file. Specifying a fixed ``allocsize`` value turns off
35	the dynamic behaviour.
36
37  attr2 or noattr2
	These options enable/disable an "opportunistic" improvement in
	the way inline extended attributes are stored on-disk.  When
	``attr2`` is selected and the new form is used for the first
	time (either when setting or removing extended attributes),
	the on-disk superblock feature bit field is updated to reflect
	this format being in use.
44
45	The default behaviour is determined by the on-disk feature
46	bit indicating that ``attr2`` behaviour is active. If either
47	mount option is set, then that becomes the new default used
48	by the filesystem.
49
50	CRC enabled filesystems always use the ``attr2`` format, and so
51	will reject the ``noattr2`` mount option if it is set.
52
53  discard or nodiscard (default)
54	Enable/disable the issuing of commands to let the block
55	device reclaim space freed by the filesystem.  This is
56	useful for SSD devices, thinly provisioned LUNs and virtual
57	machine images, but may have a performance impact.
58
59	Note: It is currently recommended that you use the ``fstrim``
60	application to ``discard`` unused blocks rather than the ``discard``
61	mount option because the performance impact of this option
62	is quite severe.
63
64  grpid/bsdgroups or nogrpid/sysvgroups (default)
65	These options define what group ID a newly created file
66	gets.  When ``grpid`` is set, it takes the group ID of the
67	directory in which it is created; otherwise it takes the
68	``fsgid`` of the current process, unless the directory has the
69	``setgid`` bit set, in which case it takes the ``gid`` from the
70	parent directory, and also gets the ``setgid`` bit set if it is
71	a directory itself.
72
73  filestreams
74	Make the data allocator use the filestreams allocation mode
75	across the entire filesystem rather than just on directories
76	configured to use it.
77
78  ikeep or noikeep (default)
79	When ``ikeep`` is specified, XFS does not delete empty inode
80	clusters and keeps them around on disk.  When ``noikeep`` is
81	specified, empty inode clusters are returned to the free
82	space pool.
83
84  inode32 or inode64 (default)
85	When ``inode32`` is specified, it indicates that XFS limits
86	inode creation to locations which will not result in inode
87	numbers with more than 32 bits of significance.
88
89	When ``inode64`` is specified, it indicates that XFS is allowed
90	to create inodes at any location in the filesystem,
91	including those which will result in inode numbers occupying
92	more than 32 bits of significance.
93
94	``inode32`` is provided for backwards compatibility with older
	systems and applications, since 64-bit inode numbers might
96	cause problems for some applications that cannot handle
97	large inode numbers.  If applications are in use which do
98	not handle inode numbers bigger than 32 bits, the ``inode32``
99	option should be specified.
100
101  largeio or nolargeio (default)
102	If ``nolargeio`` is specified, the optimal I/O reported in
103	``st_blksize`` by **stat(2)** will be as small as possible to allow
104	user applications to avoid inefficient read/modify/write
105	I/O.  This is typically the page size of the machine, as
106	this is the granularity of the page cache.
107
108	If ``largeio`` is specified, a filesystem that was created with a
109	``swidth`` specified will return the ``swidth`` value (in bytes)
110	in ``st_blksize``. If the filesystem does not have a ``swidth``
111	specified but does specify an ``allocsize`` then ``allocsize``
112	(in bytes) will be returned instead. Otherwise the behaviour
113	is the same as if ``nolargeio`` was specified.
114
115  logbufs=value
116	Set the number of in-memory log buffers.  Valid numbers
117	range from 2-8 inclusive.
118
119	The default value is 8 buffers.
120
121	If the memory cost of 8 log buffers is too high on small
122	systems, then it may be reduced at some cost to performance
123	on metadata intensive workloads. The ``logbsize`` option below
124	controls the size of each buffer and so is also relevant to
125	this case.
126
127  lifetime (default) or nolifetime
128	Enable data placement based on write life time hints provided
129	by the user. This turns on co-allocation of data of similar
130	life times when statistically favorable to reduce garbage
131	collection cost.
132
133	These options are only available for zoned rt file systems.
134
135  logbsize=value
136	Set the size of each in-memory log buffer.  The size may be
137	specified in bytes, or in kilobytes with a "k" suffix.
138	Valid sizes for version 1 and version 2 logs are 16384 (16k)
139	and 32768 (32k).  Valid sizes for version 2 logs also
140	include 65536 (64k), 131072 (128k) and 262144 (256k). The
141	logbsize must be an integer multiple of the log
142	stripe unit configured at **mkfs(8)** time.
143
144	The default value for version 1 logs is 32768, while the
145	default value for version 2 logs is MAX(32768, log_sunit).
146
147  logdev=device and rtdev=device
148	Use an external log (metadata journal) and/or real-time device.
149	An XFS filesystem has up to three parts: a data section, a log
150	section, and a real-time section.  The real-time section is
151	optional, and the log section can be separate from the data
152	section or contained within it.
153
154  max_open_zones=value
	Specify the maximum number of zones to keep open for writing on a
	zoned rt device. Keeping many zones open aids file data separation
	but may impact performance on HDDs.
158
159	If ``max_open_zones`` is not specified, the value is determined
160	by the capabilities and the size of the zoned rt device.
161
162  noalign
163	Data allocations will not be aligned at stripe unit
164	boundaries. This is only relevant to filesystems created
165	with non-zero data alignment parameters (``sunit``, ``swidth``) by
166	**mkfs(8)**.
167
168  norecovery
169	The filesystem will be mounted without running log recovery.
170	If the filesystem was not cleanly unmounted, it is likely to
171	be inconsistent when mounted in ``norecovery`` mode.
172	Some files or directories may not be accessible because of this.
173	Filesystems mounted ``norecovery`` must be mounted read-only or
174	the mount will fail.
175
176  nouuid
177	Don't check for double mounted file systems using the file
178	system ``uuid``.  This is useful to mount LVM snapshot volumes,
179	and often used in combination with ``norecovery`` for mounting
180	read-only snapshots.
181
182  noquota
183	Forcibly turns off all quota accounting and enforcement
184	within the filesystem.
185
186  uquota/usrquota/uqnoenforce/quota
187	User disk quota accounting enabled, and limits (optionally)
188	enforced.  Refer to **xfs_quota(8)** for further details.
189
190  gquota/grpquota/gqnoenforce
191	Group disk quota accounting enabled and limits (optionally)
192	enforced.  Refer to **xfs_quota(8)** for further details.
193
194  pquota/prjquota/pqnoenforce
195	Project disk quota accounting enabled and limits (optionally)
196	enforced.  Refer to **xfs_quota(8)** for further details.
197
198  sunit=value and swidth=value
199	Used to specify the stripe unit and width for a RAID device
200	or a stripe volume.  "value" must be specified in 512-byte
201	block units. These options are only relevant to filesystems
202	that were created with non-zero data alignment parameters.
203
204	The ``sunit`` and ``swidth`` parameters specified must be compatible
205	with the existing filesystem alignment characteristics.  In
206	general, that means the only valid changes to ``sunit`` are
207	increasing it by a power-of-2 multiple. Valid ``swidth`` values
208	are any integer multiple of a valid ``sunit`` value.
209
	Typically the only time these mount options are necessary is
	after an underlying RAID device has had its geometry
	modified, such as adding a new disk to a RAID5 LUN and
	reshaping it.
214
215  swalloc
216	Data allocations will be rounded up to stripe width boundaries
217	when the current end of file is being extended and the file
218	size is larger than the stripe width size.
219
220  wsync
221	When specified, all filesystem namespace operations are
222	executed synchronously. This ensures that when the namespace
223	operation (create, unlink, etc) completes, the change to the
224	namespace is on stable storage. This is useful in HA setups
225	where failover must not result in clients seeing
226	inconsistent namespace presentation during or after a
227	failover event.
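
As a rough illustration of the value constraints described above for
``allocsize``, ``logbufs`` and ``logbsize``, the following Python sketch checks
candidate values before they are placed on a mount command line.  The helper
names, the 4KiB page size and the byte-based ``log_sunit_bytes`` parameter are
assumptions made for this example; they are not part of XFS or of any mount
utility, and the kernel remains the authority on what it accepts.

```python
# Hypothetical helpers that mirror the documented mount option limits.
PAGE_SIZE = 4096        # assumption: typical page size, check your platform
ONE_GIB = 1 << 30

V1_LOGBSIZES = (16384, 32768)
V2_LOGBSIZES = V1_LOGBSIZES + (65536, 131072, 262144)


def is_power_of_two(n):
    return n > 0 and (n & (n - 1)) == 0


def valid_allocsize(size_bytes, page_size=PAGE_SIZE):
    """Power of two from the page size through 1GiB, inclusive."""
    return is_power_of_two(size_bytes) and page_size <= size_bytes <= ONE_GIB


def valid_logbufs(count):
    """Between 2 and 8 in-memory log buffers, inclusive."""
    return 2 <= count <= 8


def valid_logbsize(size_bytes, log_version=2, log_sunit_bytes=0):
    """One of the listed sizes and a multiple of the log stripe unit."""
    allowed = V2_LOGBSIZES if log_version == 2 else V1_LOGBSIZES
    if size_bytes not in allowed:
        return False
    # A log stripe unit of 0 is used here to mean "none configured".
    return log_sunit_bytes == 0 or size_bytes % log_sunit_bytes == 0


def default_logbsize(log_version=2, log_sunit_bytes=0):
    """32768 for version 1 logs, MAX(32768, log_sunit) for version 2."""
    return 32768 if log_version == 1 else max(32768, log_sunit_bytes)
```

For example, ``valid_allocsize(65536)`` holds for the 64KiB default, while
``valid_logbsize(65536, log_version=1)`` does not, because version 1 logs only
accept 16k and 32k buffers.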
228
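The ``sunit`` and ``swidth`` rules above can be sketched in the same way.  This
example assumes the common case where the stripe width is the stripe unit
multiplied by the number of data disks; that relationship, the function names
and the sample geometry are illustrative assumptions and are not taken from
this document.

```python
SECTOR_BYTES = 512  # sunit/swidth are given in 512-byte block units


def stripe_mount_options(chunk_bytes, data_disks):
    """Build a "sunit=...,swidth=..." string from a RAID geometry."""
    if chunk_bytes <= 0 or chunk_bytes % SECTOR_BYTES:
        raise ValueError("chunk size must be a positive multiple of 512")
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    sunit = chunk_bytes // SECTOR_BYTES
    swidth = sunit * data_disks  # assumption: width = unit * data disks
    return f"sunit={sunit},swidth={swidth}"


def valid_sunit_change(old_sunit, new_sunit):
    """True if new_sunit is old_sunit times a power of two (1, 2, 4, ...)."""
    if old_sunit <= 0 or new_sunit < old_sunit or new_sunit % old_sunit:
        return False
    factor = new_sunit // old_sunit
    return factor & (factor - 1) == 0


def valid_swidth(sunit, swidth):
    """True if swidth is a positive integer multiple of sunit."""
    return sunit > 0 and swidth > 0 and swidth % sunit == 0
```

With a 64KiB chunk and four data disks, ``stripe_mount_options(65536, 4)``
yields ``sunit=128,swidth=512``.
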
229Deprecation of V4 Format
230========================
231
232The V4 filesystem format lacks certain features that are supported by
233the V5 format, such as metadata checksumming, strengthened metadata
234verification, and the ability to store timestamps past the year 2038.
235Because of this, the V4 format is deprecated.  All users should upgrade
236by backing up their files, reformatting, and restoring from the backup.
237
Administrators and users can detect a V4 filesystem by running xfs_info
against a filesystem mountpoint and checking for a string containing
"crc=".  A value of "crc=0" indicates a V4 filesystem.  If no such string
is found, please upgrade xfsprogs to the latest version and try again.
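
The check described above can be automated.  The sketch below scans text
produced by xfs_info for the "crc=" field; it assumes that "crc=1" marks a
CRC-enabled (V5) filesystem and "crc=0" a V4 filesystem.  The function name
and the sample output are illustrative only, and the exact column layout of
xfs_info output may differ between xfsprogs versions.

```python
import re


def xfs_format_version(xfs_info_output):
    """Return "V5", "V4", or None when no crc= field is present."""
    match = re.search(r"\bcrc=(\d+)", xfs_info_output)
    if match is None:
        return None  # old xfsprogs: upgrade and try again
    return "V5" if match.group(1) == "1" else "V4"


# Illustrative sample only, not captured from a real system.
SAMPLE = """\
meta-data=/dev/sdb1   isize=512    agcount=4, agsize=65536 blks
         =            sectsz=512   attr=2, projid32bit=1
         =            crc=1        finobt=1, sparse=1, rmapbt=0
"""

print(xfs_format_version(SAMPLE))
```

On a live system the text could be obtained with
``subprocess.run(["xfs_info", mountpoint], capture_output=True, text=True).stdout``.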
242
243The deprecation will take place in two parts.  Support for mounting V4
244filesystems can now be disabled at kernel build time via Kconfig option.
245The option will default to yes until September 2025, at which time it
246will be changed to default to no.  In September 2030, support will be
247removed from the codebase entirely.
248
249Note: Distributors may choose to withdraw V4 format support earlier than
250the dates listed above.
251
252Deprecated Mount Options
253========================
254
255============================    ================
256  Name				Removal Schedule
257============================    ================
258Mounting with V4 filesystem     September 2030
259Mounting ascii-ci filesystem    September 2030
260ikeep/noikeep			September 2025
261attr2/noattr2			September 2025
262============================    ================
263
264
265Removed Mount Options
266=====================
267
268===========================     =======
269  Name				Removed
270===========================	=======
271  delaylog/nodelaylog		v4.0
272  ihashsize			v4.0
273  irixsgid			v4.0
274  osyncisdsync/osyncisosync	v4.0
275  barrier			v4.19
276  nobarrier			v4.19
277===========================     =======
278
279sysctls
280=======
281
282The following sysctls are available for the XFS filesystem:
283
284  fs.xfs.stats_clear		(Min: 0  Default: 0  Max: 1)
285	Setting this to "1" clears accumulated XFS statistics
286	in /proc/fs/xfs/stat.  It then immediately resets to "0".
287
288  fs.xfs.xfssyncd_centisecs	(Min: 100  Default: 3000  Max: 720000)
289	The interval at which the filesystem flushes metadata
290	out to disk and runs internal cache cleanup routines.
291
292  fs.xfs.filestream_centisecs	(Min: 1  Default: 3000  Max: 360000)
293	The interval at which the filesystem ages filestreams cache
294	references and returns timed-out AGs back to the free stream
295	pool.
296
297  fs.xfs.speculative_prealloc_lifetime
298	(Units: seconds   Min: 1  Default: 300  Max: 86400)
299	The interval at which the background scanning for inodes
300	with unused speculative preallocation runs. The scan
301	removes unused preallocation from clean inodes and releases
302	the unused space back to the free pool.
303
304  fs.xfs.speculative_cow_prealloc_lifetime
305	This is an alias for speculative_prealloc_lifetime.
306
307  fs.xfs.error_level		(Min: 0  Default: 3  Max: 11)
308	A volume knob for error reporting when internal errors occur.
309	This will generate detailed messages & backtraces for filesystem
310	shutdowns, for example.  Current threshold values are:
311
312		XFS_ERRLEVEL_OFF:       0
313		XFS_ERRLEVEL_LOW:       1
314		XFS_ERRLEVEL_HIGH:      5
315
316  fs.xfs.panic_mask		(Min: 0  Default: 0  Max: 511)
317	Causes certain error conditions to call BUG(). Value is a bitmask;
318	OR together the tags which represent errors which should cause panics:
319
320		XFS_NO_PTAG                     0
321		XFS_PTAG_IFLUSH                 0x00000001
322		XFS_PTAG_LOGRES                 0x00000002
323		XFS_PTAG_AILDELETE              0x00000004
324		XFS_PTAG_ERROR_REPORT           0x00000008
325		XFS_PTAG_SHUTDOWN_CORRUPT       0x00000010
326		XFS_PTAG_SHUTDOWN_IOERROR       0x00000020
327		XFS_PTAG_SHUTDOWN_LOGERROR      0x00000040
328		XFS_PTAG_FSBLOCK_ZERO           0x00000080
329		XFS_PTAG_VERIFIER_ERROR         0x00000100
330
331	This option is intended for debugging only.
332
333  fs.xfs.irix_symlink_mode	(Min: 0  Default: 0  Max: 1)
334	Controls whether symlinks are created with mode 0777 (default)
335	or whether their mode is affected by the umask (irix mode).
336
337  fs.xfs.irix_sgid_inherit	(Min: 0  Default: 0  Max: 1)
338	Controls files created in SGID directories.
339	If the group ID of the new file does not match the effective group
340	ID or one of the supplementary group IDs of the parent dir, the
341	ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl
342	is set.
343
344  fs.xfs.inherit_sync		(Min: 0  Default: 1  Max: 1)
345	Setting this to "1" will cause the "sync" flag set
346	by the **xfs_io(8)** chattr command on a directory to be
347	inherited by files in that directory.
348
349  fs.xfs.inherit_nodump		(Min: 0  Default: 1  Max: 1)
350	Setting this to "1" will cause the "nodump" flag set
351	by the **xfs_io(8)** chattr command on a directory to be
352	inherited by files in that directory.
353
354  fs.xfs.inherit_noatime	(Min: 0  Default: 1  Max: 1)
355	Setting this to "1" will cause the "noatime" flag set
356	by the **xfs_io(8)** chattr command on a directory to be
357	inherited by files in that directory.
358
359  fs.xfs.inherit_nosymlinks	(Min: 0  Default: 1  Max: 1)
360	Setting this to "1" will cause the "nosymlinks" flag set
361	by the **xfs_io(8)** chattr command on a directory to be
362	inherited by files in that directory.
363
364  fs.xfs.inherit_nodefrag	(Min: 0  Default: 1  Max: 1)
365	Setting this to "1" will cause the "nodefrag" flag set
366	by the **xfs_io(8)** chattr command on a directory to be
367	inherited by files in that directory.
368
369  fs.xfs.rotorstep		(Min: 1  Default: 1  Max: 256)
370	In "inode32" allocation mode, this option determines how many
371	files the allocator attempts to allocate in the same allocation
372	group before moving to the next allocation group.  The intent
373	is to control the rate at which the allocator moves between
374	allocation groups when allocating extents for new files.
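
Because ``fs.xfs.panic_mask`` is a bitmask, composing and decoding values by
hand is error prone.  The following sketch ORs together the tags listed above
and maps a numeric mask back to tag names.  The helper functions are
hypothetical and exist only for this example; the tag names and values are
copied from the list above.

```python
PANIC_TAGS = {
    "XFS_PTAG_IFLUSH": 0x001,
    "XFS_PTAG_LOGRES": 0x002,
    "XFS_PTAG_AILDELETE": 0x004,
    "XFS_PTAG_ERROR_REPORT": 0x008,
    "XFS_PTAG_SHUTDOWN_CORRUPT": 0x010,
    "XFS_PTAG_SHUTDOWN_IOERROR": 0x020,
    "XFS_PTAG_SHUTDOWN_LOGERROR": 0x040,
    "XFS_PTAG_FSBLOCK_ZERO": 0x080,
    "XFS_PTAG_VERIFIER_ERROR": 0x100,
}


def panic_mask(*tag_names):
    """OR together the named tags; no tags gives 0 (XFS_NO_PTAG)."""
    mask = 0
    for name in tag_names:
        mask |= PANIC_TAGS[name]
    return mask


def decode_panic_mask(mask):
    """List the tag names that are set in mask."""
    if not 0 <= mask <= 511:
        raise ValueError("panic_mask must be between 0 and 511")
    return [name for name, bit in PANIC_TAGS.items() if mask & bit]


print(panic_mask("XFS_PTAG_SHUTDOWN_CORRUPT", "XFS_PTAG_VERIFIER_ERROR"))
```

The printed value is 272 (0x110), which is what would be written to the
sysctl to panic on corruption shutdowns and verifier errors.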
375
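The ranges listed above lend themselves to a simple pre-flight check before a
value is written.  This sketch records a few of the documented (minimum,
default, maximum) triples and relies on the usual mapping of a dotted sysctl
name to a file under /proc/sys.  The table is deliberately partial and the
helper names are assumptions made for illustration.

```python
# (min, default, max) as listed in this document; not exhaustive.
SYSCTL_RANGES = {
    "fs.xfs.stats_clear": (0, 0, 1),
    "fs.xfs.xfssyncd_centisecs": (100, 3000, 720000),
    "fs.xfs.filestream_centisecs": (1, 3000, 360000),
    "fs.xfs.speculative_prealloc_lifetime": (1, 300, 86400),
    "fs.xfs.error_level": (0, 3, 11),
    "fs.xfs.panic_mask": (0, 0, 511),
    "fs.xfs.rotorstep": (1, 1, 256),
}


def sysctl_path(name):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return "/proc/sys/" + name.replace(".", "/")


def in_range(name, value):
    """True if value lies within the documented bounds for name."""
    low, _default, high = SYSCTL_RANGES[name]
    return low <= value <= high


def centisecs_to_seconds(centisecs):
    """Convert a *_centisecs value to seconds."""
    return centisecs / 100


print(sysctl_path("fs.xfs.xfssyncd_centisecs"))
```

For instance, the default ``xfssyncd_centisecs`` value of 3000 corresponds to
30 seconds, and a value of 50 would be rejected as below the minimum of 100.
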
376Deprecated Sysctls
377==================
378
379===========================================     ================
380  Name                                          Removal Schedule
381===========================================     ================
382fs.xfs.irix_sgid_inherit                        September 2025
383fs.xfs.irix_symlink_mode                        September 2025
384fs.xfs.speculative_cow_prealloc_lifetime        September 2025
385===========================================     ================
386
387
388Removed Sysctls
389===============
390
391=============================	=======
392  Name				Removed
393=============================	=======
394  fs.xfs.xfsbufd_centisec	v4.0
395  fs.xfs.age_buffer_centisecs	v4.0
396=============================	=======
397
398Error handling
399==============
400
401XFS can act differently according to the type of error found during its
402operation. The implementation introduces the following concepts to the error
403handler:
404
405 -failure speed:
406	Defines how fast XFS should propagate an error upwards when a specific
407	error is found during the filesystem operation. It can propagate
408	immediately, after a defined number of retries, after a set time period,
409	or simply retry forever.
410
411 -error classes:
412	Specifies the subsystem the error configuration will apply to, such as
413	metadata IO or memory allocation. Different subsystems will have
414	different error handlers for which behaviour can be configured.
415
416 -error handlers:
417	Defines the behavior for a specific error.
418
419The filesystem behavior during an error can be set via ``sysfs`` files. Each
420error handler works independently - the first condition met by an error handler
421for a specific class will cause the error to be propagated rather than reset and
422retried.
423
The action taken by the filesystem when the error is propagated is context
dependent - it may cause a shut down in the case of an unrecoverable error,
it may be reported back to userspace, or it may even be ignored because
there's nothing useful we can do with the error or anyone we can report it to
(e.g. during unmount).
429
430The configuration files are organized into the following hierarchy for each
431mounted filesystem:
432
433  /sys/fs/xfs/<dev>/error/<class>/<error>/
434
435Where:
436  <dev>
437	The short device name of the mounted filesystem. This is the same device
438	name that shows up in XFS kernel error messages as "XFS(<dev>): ..."
439
440  <class>
441	The subsystem the error configuration belongs to. As of 4.9, the defined
442	classes are:
443
		- "metadata": applies to metadata buffer write IO
445
446  <error>
447	The individual error handler configurations.
448
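As a small illustration of the hierarchy above, the following sketch assembles
the configuration directory for one error handler.  The function name and the
device name "sda1" are hypothetical; "metadata" and "ENODEV" are used because
they are the class and error named in this document.

```python
from pathlib import PurePosixPath

XFS_SYSFS_ROOT = PurePosixPath("/sys/fs/xfs")


def error_config_dir(dev, error_class, error):
    """Return /sys/fs/xfs/<dev>/error/<class>/<error> as a path object."""
    for part in (dev, error_class, error):
        if not part or "/" in part:
            raise ValueError(f"invalid path component: {part!r}")
    return XFS_SYSFS_ROOT / dev / "error" / error_class / error


print(error_config_dir("sda1", "metadata", "ENODEV"))
```

This prints ``/sys/fs/xfs/sda1/error/metadata/ENODEV``; the files inside that
directory are only present while the filesystem is mounted.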
449
450Each filesystem has "global" error configuration options defined in their top
451level directory:
452
453  /sys/fs/xfs/<dev>/error/
454
455  fail_at_unmount		(Min:  0  Default:  1  Max: 1)
456	Defines the filesystem error behavior at unmount time.
457
458	If set to a value of 1, XFS will override all other error configurations
459	during unmount and replace them with "immediate fail" characteristics.
460	i.e. no retries, no retry timeout. This will always allow unmount to
461	succeed when there are persistent errors present.
462
463	If set to 0, the configured retry behaviour will continue until all
464	retries and/or timeouts have been exhausted. This will delay unmount
465	completion when there are persistent errors, and it may prevent the
466	filesystem from ever unmounting fully in the case of "retry forever"
467	handler configurations.
468
469	Note: there is no guarantee that fail_at_unmount can be set while an
470	unmount is in progress. It is possible that the ``sysfs`` entries are
471	removed by the unmounting filesystem before a "retry forever" error
472	handler configuration causes unmount to hang, and hence the filesystem
473	must be configured appropriately before unmount begins to prevent
474	unmount hangs.
475
476Each filesystem has specific error class handlers that define the error
477propagation behaviour for specific errors. There is also a "default" error
478handler defined, which defines the behaviour for all errors that don't have
479specific handlers defined. Where multiple retry constraints are configured for
480a single error, the first retry configuration that expires will cause the error
481to be propagated. The handler configurations are found in the directory:
482
483  /sys/fs/xfs/<dev>/error/<class>/<error>/
484
485  max_retries			(Min: -1  Default: Varies  Max: INTMAX)
486	Defines the allowed number of retries of a specific error before
487	the filesystem will propagate the error. The retry count for a given
488	error context (e.g. a specific metadata buffer) is reset every time
489	there is a successful completion of the operation.
490
491	Setting the value to "-1" will cause XFS to retry forever for this
492	specific error.
493
494	Setting the value to "0" will cause XFS to fail immediately when the
495	specific error is reported.
496
497	Setting the value to "N" (where 0 < N < Max) will make XFS retry the
498	operation "N" times before propagating the error.
499
500  retry_timeout_seconds		(Min:  -1  Default:  Varies  Max: 1 day)
501	Define the amount of time (in seconds) that the filesystem is
502	allowed to retry its operations when the specific error is
503	found.
504
505	Setting the value to "-1" will allow XFS to retry forever for this
506	specific error.
507
508	Setting the value to "0" will cause XFS to fail immediately when the
509	specific error is reported.
510
511	Setting the value to "N" (where 0 < N < Max) will allow XFS to retry the
512	operation for up to "N" seconds before propagating the error.
513
514**Note:** The default behaviour for a specific error handler is dependent on both
515the class and error context. For example, the default values for
516"metadata/ENODEV" are "0" rather than "-1" so that this error handler defaults
517to "fail immediately" behaviour. This is done because ENODEV is a fatal,
518unrecoverable error no matter how many times the metadata IO is retried.
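
The interaction of ``max_retries`` and ``retry_timeout_seconds`` can be
modelled in a few lines.  The sketch below is a simplified model of the rules
as documented here, where whichever limit expires first causes propagation;
it is not the kernel implementation, and the function and parameter names are
assumptions for this example.

```python
RETRY_FOREVER = -1


def should_propagate(failures, elapsed_seconds, max_retries,
                     retry_timeout_seconds):
    """Decide whether an error is propagated or retried.

    failures: how many times the operation has failed so far for this
              error context, including the failure being handled now.
    elapsed_seconds: time since the first failure in this context.
    """
    retries_exhausted = (max_retries != RETRY_FOREVER
                         and failures > max_retries)
    timed_out = (retry_timeout_seconds != RETRY_FOREVER
                 and elapsed_seconds >= retry_timeout_seconds)
    # The first constraint to expire wins.
    return retries_exhausted or timed_out


# max_retries=0 fails immediately, matching the metadata/ENODEV default.
print(should_propagate(1, 0, max_retries=0, retry_timeout_seconds=-1))
```

With ``max_retries=3`` the first three failures are retried and the fourth is
propagated; with both values set to "-1" the model never propagates, which is
the "retry forever" configuration.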
519
520Workqueue Concurrency
521=====================
522
523XFS uses kernel workqueues to parallelize metadata update processes.  This
524enables it to take advantage of storage hardware that can service many IO
525operations simultaneously.  This interface exposes internal implementation
526details of XFS, and as such is explicitly not part of any userspace API/ABI
527guarantee the kernel may give userspace.  These are undocumented features of
528the generic workqueue implementation XFS uses for concurrency, and they are
529provided here purely for diagnostic and tuning purposes and may change at any
530time in the future.
531
532The control knobs for a filesystem's workqueues are organized by task at hand
533and the short name of the data device.  They all can be found in:
534
535  /sys/bus/workqueue/devices/${task}!${device}
536
537================  ===========
538  Task            Description
539================  ===========
540  xfs_iwalk-$pid  Inode scans of the entire filesystem. Currently limited to
541                  mount time quotacheck.
  xfs-gc          Background garbage collection of disk space that has been
                  speculatively allocated beyond EOF or for staging copy on
                  write operations.
545================  ===========
546
547For example, the knobs for the quotacheck workqueue for /dev/nvme0n1 would be
548found in /sys/bus/workqueue/devices/xfs_iwalk-1111!nvme0n1/.
549
550The interesting knobs for XFS workqueues are as follows:
551
552============     ===========
553  Knob           Description
554============     ===========
555  max_active     Maximum number of background threads that can be started to
556                 run the work.
557  cpumask        CPUs upon which the threads are allowed to run.
558  nice           Relative priority of scheduling the threads.  These are the
559                 same nice levels that can be applied to userspace processes.
560============     ===========
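
To make the naming scheme concrete, this sketch builds the path to a single
knob from the task name, the short device name and the knob name.  The helper
is hypothetical; the task and device in the example are the ones used in the
quotacheck example above.

```python
WORKQUEUE_ROOT = "/sys/bus/workqueue/devices"
KNOBS = ("max_active", "cpumask", "nice")


def workqueue_knob_path(task, device, knob):
    """Return the sysfs path of one knob for ${task}!${device}."""
    if knob not in KNOBS:
        raise ValueError(f"unknown knob: {knob!r}")
    return f"{WORKQUEUE_ROOT}/{task}!{device}/{knob}"


print(workqueue_knob_path("xfs_iwalk-1111", "nvme0n1", "max_active"))
```

Note that the "!" in the directory name is special to many shells and usually
needs quoting when the path is typed on a command line.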
561
562Zoned Filesystems
563=================
564
565For zoned file systems, the following attributes are exposed in:
566
567  /sys/fs/xfs/<dev>/zoned/
568
569  max_open_zones		(Min:  1  Default:  Varies  Max:  UINTMAX)
570	This read-only attribute exposes the maximum number of open zones
571	available for data placement. The value is determined at mount time and
572	is limited by the capabilities of the backing zoned device, file system
573	size and the max_open_zones mount option.
574
575  zonegc_low_space		(Min:  0  Default:  0  Max:  100)
	Define the percentage of unused space that GC should keep
	available for writing. A high value will reclaim more of the space
	occupied by unused blocks, creating a larger buffer against write
	bursts at the cost of increased write amplification.  Regardless
	of this value, garbage collection will always aim to free a minimum
	number of blocks to keep max_open_zones open for data placement purposes.
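
A minimal sketch of how a tool might locate these attributes and validate a
``zonegc_low_space`` value before writing it.  The helper names and the device
name "sdc" are hypothetical; only the directory layout and the 0 to 100 range
come from the text above.

```python
ZONED_ATTRS = ("max_open_zones", "zonegc_low_space")


def zoned_attr_path(dev, attr):
    """Return /sys/fs/xfs/<dev>/zoned/<attr> for a known attribute."""
    if attr not in ZONED_ATTRS:
        raise ValueError(f"unknown zoned attribute: {attr!r}")
    return f"/sys/fs/xfs/{dev}/zoned/{attr}"


def valid_zonegc_low_space(percent):
    """zonegc_low_space is a percentage between 0 and 100, inclusive."""
    return 0 <= percent <= 100


print(zoned_attr_path("sdc", "zonegc_low_space"))
```

Since ``max_open_zones`` is read-only here, only ``zonegc_low_space`` is a
candidate for writing.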
582