Lines Matching +refs:is +refs:direct +refs:push
10 This is the authoritative documentation on the design, interface and
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
20 1-2. What is cgroup?
101 "cgroup" stands for "control group" and is never capitalized. The
102 singular form is used to designate the whole feature and also as a
104 multiple individual control groups, the plural form "cgroups" is used.
107 What is cgroup?
110 cgroup is a mechanism to organize processes hierarchically and
114 cgroup is largely composed of two parts - the core and controllers.
115 cgroup core is primarily responsible for hierarchically organizing
116 processes. A cgroup controller is usually responsible for
130 hierarchical - if a controller is enabled on a cgroup, it affects all
132 sub-hierarchy of the cgroup. When a controller is enabled on a nested
157 is no longer referenced in its current hierarchy. Because per-cgroup
168 controllers dynamically between the v2 and other hierarchies is
169 strongly discouraged for production use. It is recommended to decide
175 during boot, before manual intervention is possible. To make testing
183 option is system wide and can only be set on mount or modified
184 through remount from the init namespace. The mount option is
193 controllers, and then seeding it with CLONE_INTO_CGROUP is
198 and not any subtrees. This is legacy behaviour; the default
199 behaviour without this option is to include subtree counts.
200 This option is system wide and can only be set on mount or
202 option is ignored on non-init namespace mounts.
210 behavior but is a mount-option to avoid regressing setups
217 statistics reporting and memory protection). This is a new
223 * There is no HugeTLB pool management involved in the memory
225 Specifically, when a new HugeTLB folio is allocated to
226 the pool, it is not accounted for from the perspective of the
227 memory controller. It is only charged to a cgroup when it is
234 still has pages available (but the cgroup limit is hit and
239 * HugeTLB pages utilized while this option is not selected
241 v2 is remounted later on).
244 The option restores v1-like behavior of pids.events:max, that is only
271 on a single write(2) call. If a process is composed of multiple
275 When a process forks a child process, the new process is born into the
284 have any children and is associated only with zombie processes is
290 cgroup is in use in the system, this file may contain multiple lines,
291 one for each hierarchy. The entry for cgroup v2 is always in the
299 is removed subsequently, " (deleted)" is appended to the path::
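  0::/test-cgroup/test-cgroup-nested (deleted)

(The path shown is illustrative; the actual output depends on the cgroup the
process belonged to before its cgroup was removed.)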
322 cgroup whose resource domain is further up in the hierarchy. The root
323 of a threaded subtree, that is, the nearest ancestor which is not
324 threaded, is called threaded domain or thread root interchangeably and
333 consumptions of the subtree, it is considered to have internal
336 root cgroup is not subject to the no internal process constraint, it can
339 The current operation mode or type of the cgroup is shown in the
340 "cgroup.type" file which indicates whether the cgroup is a normal
341 domain, a domain which is serving as the domain of a threaded subtree,
344 On creation, a cgroup is always a domain cgroup and can be made
346 operation is single direction::
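  # echo threaded > cgroup.type

(A sketch of the one-way conversion: once a cgroup is made threaded, it
cannot be written back to a domain cgroup.)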
356 - When the parent is an unthreaded domain, it must not have any domain
357 controllers enabled or populated domain children. The root is
365 C is created as a domain but isn't connected to a parent which can
366 host child domains. C can't be used until it is turned into a
371 A domain cgroup is turned into a threaded domain when one of its child
389 processes in the subtree and is not readable in the subtree proper.
394 a threaded controller is enabled inside a threaded subtree, it only
399 Because a threaded subtree is exempt from the no internal process
417 live processes in it. Its value is 0 if there is no live process in
447 No controller is enabled by default. Controllers can be enabled and
455 are specified, the last one is effective.
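For example, the following hypothetical write enables the cpu and io
controllers and disables memory in a single operation::

  # echo "+cpu +io -memory" > cgroup.subtree_control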
500 This guarantees that, when a domain controller is looking at the part
505 The root cgroup is exempt from this restriction. Root contains
508 controllers. How resource consumption in the root cgroup is governed
509 is up to each controller (for more information on this topic please
513 Note that the restriction doesn't get in the way if there is no
514 enabled controller in the cgroup's "cgroup.subtree_control". This is
531 Second, if the "nsdelegate" mount option is set, automatically to a
536 shouldn't be allowed to write to them. For the first method, this is
560 A delegated sub-hierarchy is contained in the sense that processes
563 For delegations to a less privileged user, this is achieved by
575 in from or push out to outside the sub-hierarchy.
586 Let's also say U0 wants to write the PID of a process which is
589 destination cgroup C00 is above the points of delegation and U0 would
593 For delegations to namespaces, containment is achieved by requiring
595 namespace of the process which is attempting the migration. If either
596 is not reachable, the migration is rejected with -ENOENT.
605 Migrating a process across cgroups is a relatively expensive operation
607 process. This is an explicit design decision as there often exist
612 apply different resource restrictions is discouraged. A workload
623 directory and it is possible to create child cgroups which collide
628 a dot. A controller's name is composed of lowercase letters and
649 A parent's resource is distributed by adding up the weights of all
652 resource at the moment participate in the distribution, this is
653 work-conserving. Due to the dynamic nature, this model is usually
660 As long as the weight is in range, all configuration combinations are
661 valid and there is no reason to reject configuration changes or
665 and is an example of this type.
677 Limits are in the range [0, max] and default to "max", which is a noop.
680 valid and there is no reason to reject configuration changes or
684 on an IO device and is an example of this type.
691 A cgroup is protected up to the configured amount of the resource
695 only up to the amount available to the parent is protected among
698 Protections are in the range [0, max] and default to 0, which is
702 are valid and there is no reason to reject configuration changes or
705 "memory.low" implements best-effort memory protection and is an
712 A cgroup is exclusively allocated a certain amount of a finite
717 Allocations are in the range [0, max] and default to 0, which is no
722 resource is mandatory for execution of processes, process migrations
725 "cpu.rt.max" hard-allocates realtime slices and is an example of this
779 - The default time unit is microseconds. If a different unit is ever
789 intuitive (the default is 100%).
811 For example, a setting which is keyed by major:minor device numbers
857 - "domain threaded" : A threaded domain cgroup which is
860 - "domain invalid" : A cgroup which is in an invalid state.
864 - "threaded" : A threaded cgroup which is a member of a
893 as all the processes belong to the thread root. Writing is
912 - The cgroup that the thread is currently in must be in the
940 the last one is effective. When multiple enable and disable
953 1 if the cgroup is frozen; otherwise, 0.
956 A read-write single value file. The default is "max".
959 If the actual number of descendants is equal to or larger,
963 A read-write single value file. The default is "max".
966 If the actual descent depth is equal to or larger,
997 Allowed values are "0" and "1". The default is "0".
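As a sketch, a cgroup can be frozen and later thawed with plain writes
(the cgroup path is illustrative)::

  # echo 1 > /sys/fs/cgroup/test/cgroup.freeze
  # echo 0 > /sys/fs/cgroup/test/cgroup.freeze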
1003 is completed, the "frozen" value in the cgroup.events control file
1008 of any ancestor cgroups. If any of ancestor cgroups is frozen, the
1014 If a process is moved to a frozen cgroup, it stops. If a process is
1023 The only allowed value is "1".
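Writing that value kills all processes in the cgroup; for example
(the cgroup path is illustrative)::

  # echo 1 > /sys/fs/cgroup/test/cgroup.kill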
1030 is protected against migrations.
1033 killing cgroups is a process directed operation, i.e. it affects
1038 The default is "1".
1043 This control attribute is not hierarchical, so disable or enable PSI
1047 The reason this control attribute exists is that PSI accounts stalls for
1072 In all the above models, cycles distribution is defined only on a temporal
1090 the following section for details. Only the cpu controller is affected by
1102 This file exists whether the controller is enabled or not.
1110 and the following five when the controller is enabled:
1120 cgroups. The default is "100".
1122 For non idle groups (cpu.idle = 0), the weight is in the
1130 cgroups. The default is "0".
1132 The nice value is in the range [-20, 19].
1134 This interface file is an alternative interface for
1136 same values used by nice(2). Because the range is smaller and
1137 granularity is coarser for the nice values, the read value is
1142 The default is "max 100000".
1150 one number is written, $MAX is updated.
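For instance, the following sketch limits a cgroup to half a CPU by
allowing 50ms of runtime in every 100ms period (values in microseconds)::

  # echo "50000 100000" > cpu.max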
1154 cgroups. The default is "0".
1166 The default is "0", i.e. no utilization boosting.
1173 value is used to clamp the task specific minimum utilization clamp.
1175 The requested minimum utilization (protection) is always capped by
1181 The default is "max", i.e. no utilization capping
1188 value is used to clamp the task specific maximum utilization clamp.
1192 The default is 0.
1194 This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1205 The "memory" controller regulates distribution of memory. Memory is
1208 stateful nature of memory, the distribution model is relatively
1228 All memory amounts are in bytes. If a value which is not aligned to
1229 PAGE_SIZE is written, the value may be rounded up to the closest
1241 cgroups. The default is "0".
1244 is within its effective min boundary, the cgroup's memory
1245 won't be reclaimed under any conditions. If there is no
1247 is invoked. Above the effective min boundary (or
1248 effective low boundary if it is higher), pages are reclaimed
1252 Effective min boundary is limited by memory.min values of
1253 all ancestor cgroups. If there is memory.min overcommitment
1260 protection is discouraged and may lead to constant OOMs.
1262 If a memory cgroup is not populated with processes,
1263 its memory.min is ignored.
1267 cgroups. The default is "0".
1270 cgroup is within its effective low boundary, the cgroup's
1271 memory won't be reclaimed unless there is no reclaimable
1274 effective min boundary if it is higher), pages are reclaimed
1278 Effective low boundary is limited by memory.low values of
1279 all ancestor cgroups. If there is memory.low overcommitment
1286 protection is discouraged.
1290 cgroups. The default is "max".
1304 cgroups. The default is "max".
1306 Memory usage hard limit. This is the main mechanism to limit
1308 this limit and can't be reduced, the OOM killer is invoked in
1322 This is a simple interface to trigger memory reclaim in the
1331 specified amount, -EAGAIN is returned.
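As an illustrative use, writing an amount to the file triggers proactive
reclaim of roughly that many bytes from the cgroup::

  # echo "1G" > memory.reclaim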
1334 interface) is not meant to indicate memory pressure on the
1336 the memory reclaim normally is not exercised in this case.
1363 cgroups. The default value is "0".
1368 (if the memory cgroup is not a leaf cgroup) are killed
1375 If the OOM killer is invoked in a cgroup, it's not going
1391 The number of times the cgroup is reclaimed due to
1392 high memory pressure even though its usage is under
1394 boundary is over-committed.
1398 throttled and routed to perform direct memory reclaim
1400 cgroup whose memory usage is capped by the high limit
1406 about to go over the max boundary. If direct reclaim
1413 This event is not raised if the OOM killer is not
1451 memory of such an allocation is mapped anymore.
1484 Amount of cached filesystem data that is swap-backed,
1497 not all the memory of such an allocation is mapped.
1505 is currently being written back to disk
1508 Amount of swap cached in memory. The swapcache is accounted
1530 the value for the foo counter, since the foo counter is type-based, not
1646 a page fault. This counter is not present when CONFIG_TRANSPARENT_HUGEPAGE
1647 is not set.
1651 collapsing an existing range of pages. This counter is not
1652 present when CONFIG_TRANSPARENT_HUGEPAGE is not set.
1687 up if hugetlb usage is accounted for in memory.current (i.e.
1688 cgroup is mounted with the memory_hugetlb_accounting option).
1697 This is useful for providing visibility into the NUMA locality
1699 allocated from any physical node. One of the use cases is evaluating
1705 The output format of memory.numa_stat is::
1724 cgroups. The default is "max".
1730 This limit marks a point of no return for the cgroup. It is NOT
1750 cgroups. The default is "max".
1789 cgroups. The default is "max".
1796 A read-write single value file. The default value is "1".
1797 Note that this setting is hierarchical, i.e. the writeback would be
1801 When this is set to 0, all swapping attempts to swapping devices
1808 Note that this is subtly different from setting memory.swap.max to
1810 This setting has no effect if zswap is disabled, and swapping
1811 is allowed unless memory.swap.max is set to 0.
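For example, zswap writeback can be disabled for a cgroup with a plain
write (the cgroup path is illustrative)::

  # echo 0 > /sys/fs/cgroup/test/memory.zswap.writeback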
1823 "memory.high" is the main mechanism to control memory usage.
1826 usage is a viable strategy.
1833 Determining whether a cgroup has enough memory is not trivial as
1838 pressure - how much the workload is being impacted due to lack of
1839 memory - is necessary to determine whether a workload needs more
1847 A memory area is charged to the cgroup which instantiated it and stays
1848 charged to the cgroup until the area is released. Migrating a process
1853 To which cgroup the area will be charged is non-deterministic; however,
1854 over time, the memory area is likely to end up in a cgroup which has
1857 If a cgroup sweeps a considerable amount of memory which is expected
1868 limit distribution; however, weight based distribution is available
1869 only if cfq-iosched is in use and neither scheme is available for
1904 line for a given device is populated on the first write for
1919 The controller is disabled by default and can be enabled by
1924 When a better control quality is needed, latency QoS
1929 shows that on sdb, the controller is enabled, will consider
1931 latencies is above 75ms or write 150ms, and adjust the overall
1945 When "ctrl" is "auto", the parameters are controlled by the
1959 given device is populated on the first write for the device on
1968 When "ctrl" is "auto", the kernel may change all parameters
1969 dynamically. When "ctrl" is set to "user" or any other
1973 When "model" is "linear", the following model parameters are
1988 sense and is scaled to the device behavior dynamically.
1995 The default is "default 100".
1997 The first line is the default weight applied to devices
2030 to remove a specific limit. If the same key is specified
2031 multiple times, the outcome is undefined.
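As a sketch, the following caps device 8:16 to 2MB/s of reads and 120
write IOs per second (the device numbers are illustrative)::

  # echo "8:16 rbps=2097152 wiops=120" > io.max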
2034 delayed if the limit is reached. Temporary bursts are allowed.
2062 Page cache is dirtied through buffered writes and shared mmaps and
2070 defines the memory domain that dirty memory ratio is calculated and
2074 of the two is enforced.
2077 filesystem. Currently, cgroup writeback is implemented on ext2, ext4,
2082 which affects how cgroup ownership is tracked. Memory is tracked per
2084 inode is assigned to a cgroup and all IO requests to write dirty pages
2087 As cgroup ownership for memory is tracked per page, there can be pages
2088 which are associated with different cgroups than the one the inode is
2094 While this model is enough for most use cases where a given inode is
2100 doesn't update it until the page is released, even if writeback
2114 For cgroup writeback, this is calculated into ratio against
2122 This is a cgroup v2 controller for IO workload protection. You provide a group
2138 So the ideal way to configure this is to set io.latency in groups A, B, and C.
2149 io.latency is work conserving; so as long as everybody is meeting their latency
2154 - Queue depth throttling. This is the number of outstanding IOs a group is
2162 originating group is being throttled you will see the use_delay and delay
2163 fields in io.stat increase. The delay value is how many microseconds that are
2165 grow quite large if there is a lot of swapping or metadata IO occurring we
2181 If the controller is enabled you will see extra stats in io.stat in
2185 This is the current queue depth for the group.
2188 This is an exponential moving average with a decay rate of 1/exp
2194 The sampling window size in milliseconds. This is the minimum
2238 The numerical value that corresponds to each I/O priority class is as follows:
2250 The algorithm to set the I/O priority class for a request is as follows:
2252 - If I/O priority class policy is promote-to-rt, change the request I/O
2255 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2263 The process number controller is used to allow a cgroup to stop any
2264 new tasks from being fork()'d or clone()'d after a specified limit is
2269 example, a fork bomb is likely to exhaust the number of tasks before
2281 cgroups. The default is "max".
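A sketch of limiting a cgroup to a fixed number of tasks (the value is
illustrative)::

  # echo 32 > pids.max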
2311 Organisational operations are not blocked by cgroup policies, so it is
2314 processes to the cgroup such that pids.current is larger than
2315 pids.max. However, it is not possible to violate a cgroup PID policy
2326 This is especially valuable on large NUMA systems where placing jobs
2331 The "cpuset" controller is hierarchical. That means the controller
2343 cgroup. The actual list of CPUs to be granted, however, is
2353 An empty value indicates that the cgroup is using the same
2355 "cpuset.cpus" or all the available CPUs if none is found.
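The CPU list accepts comma-separated ranges; for example (the values are
illustrative)::

  # echo "0-4,6,8-10" > cpuset.cpus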
2368 If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
2383 is subjected to constraints imposed by its parent and can differ
2392 An empty value indicates that the cgroup is using the same
2395 is found.
2404 There is a cost for this memory migration. The migration
2406 So it is recommended that "cpuset.mems" should be set properly
2407 before spawning new tasks into the cpuset. Even if there is
2419 If "cpuset.mems" is empty, it shows all the memory nodes from the
2432 to create a new cpuset partition. Its value is not used
2435 a cpuset partition is.
2442 is always a subset of it.
2444 Users can manually set it to a value that is different from
2445 "cpuset.cpus". One constraint in setting it is that the list of
2454 exclusive CPU appearing in two or more of its child cgroups is
2458 The root cgroup is a partition root and all its available CPUs
2468 "cpuset.cpus.exclusive.effective" if its parent is not the root
2470 if it is set. If "cpuset.cpus.exclusive" is not set, it is
2479 is created.
2483 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2484 and is not delegatable.
2494 A cpuset partition is a collection of cpuset-enabled cgroups with
2502 partition is one whose parent cgroup is also a valid partition
2503 root. A remote partition is one whose parent cgroup is not a
2505 is optional for the creation of a local partition as its
2507 is the same as "cpuset.cpus" if it is not set. Writing the
2509 before the target partition root is mandatory for the creation
2516 The root cgroup is always a partition root and its state cannot
2519 When set to "root", the current cgroup is the root of a new
2520 partition or scheduling domain. The set of exclusive CPUs is
2531 root is in a degraded state where some state information may
2549 why the partition is invalid is included within parentheses.
2554 1) The parent cgroup is a valid partition root.
2557 3) The "cpuset.cpus.effective" cannot be empty unless there is
2569 to its child local partitions when there is no task associated
2576 their parent is switched back to a partition root with a proper
2600 Cgroup v2 device controller has no interface files and is implemented
2714 The default value is "max". It exists for all cgroups except the root.
2737 cgroup resources. The controller is enabled by the CONFIG_CGROUP_MISC config
2745 Once a capacity is set then the resource usage can be updated using charge and
2817 A miscellaneous scalar resource is charged to the cgroup in which it is used
2818 first, and stays charged to that cgroup until that resource is freed. Migrating
2828 perf_event controller, if not mounted on a legacy hierarchy, is
2831 moved to a legacy hierarchy after v2 hierarchy is populated.
2838 the stable kernel API and so is subject to change.
2845 cgroup is treated as if it were hosted in a separate child cgroup of the
2846 root cgroup. This child cgroup weight is dependent on its thread nice
2851 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2858 When distributing IO resources this implicit child node is taken into
2874 cgroupns root is the cgroup of the process at the time of creation of
2905 the threads). This is natural for the v2 hierarchy; however, for the
2908 A cgroup namespace is alive as long as there are processes inside or
2910 namespace is destroyed. The cgroupns root and the actual cgroups
2917 The 'cgroupns root' for a cgroup namespace is the cgroup in which the
2918 process calling unshare(2) is running. For example, if a process in
2921 init_cgroup_ns, this is the real root ('/') cgroup.
2952 From a sibling cgroup namespace (that is, a namespace rooted at a
2955 namespace root is at '/batchjobs/container_id2', then it will see::
2970 /batchjobs/container_id1, and assuming that the global hierarchy is
2979 Note that this kind of setup is not encouraged. A task inside cgroup
2982 setns(2) to another cgroup namespace is allowed when:
2989 namespace. It is expected that someone moves the attaching
3014 where interacting with cgroup is necessary. cgroup core and
3040 selective disabling of cgroup writeback support which is helpful when
3046 the writeback session is holding shared resources, e.g. a journal
3047 entry, may lead to priority inversion. There is no one easy solution
3060 - The "tasks" file is removed and "cgroup.procs" is not sorted.
3062 - "cgroup.clone_children" is removed.
3064 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
3078 For example, as there is only one instance of each controller, utility
3080 hierarchies could only be used in one. The issue is exacerbated by
3115 completely orthogonal to each other isn't necessary. What usually is
3116 called for is the ability to have differing levels of granularity
3120 how memory is distributed beyond a certain level while still wanting
3133 Generally, in-process knowledge is available only to the process
3149 and then read and/or write to it. This is not only extremely clunky
3150 and unusual but also inherently racy. There is no conventional way to
3205 This clearly is a problem which needs to be addressed from cgroup core
3220 Controller interfaces were problematic too. An extreme example is
3244 The original lower boundary, the soft limit, is defined as a limit
3245 that is per default unset. As a result, the set of cgroups that
3246 global reclaim prefers is opt-in, rather than opt-out. The costs for
3253 the soft limit reclaim pass is so aggressive that it not just
3258 The memory.low boundary on the other hand is a top-down allocated
3264 The original high boundary, the hard limit, is defined as a strict
3271 estimation is hard and error prone, and getting it wrong results in
3277 into direct reclaim to work off the excess, but it never invokes the
3278 OOM killer. As a result, a high boundary that is chosen too
3282 gives acceptable performance is found.
3288 system than killing the group. Otherwise, memory.max is there to
3296 new limit is met - or the task writing to memory.max is killed.
3298 The combined memory+swap accounting and limiting is replaced by real
3309 For trusted jobs, on the other hand, a combined counter is not an
3312 resources. Swap space is a resource like all others in the system,