1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
13 future changes must be reflected in this document. Documentation for
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19 1-1. Terminology
20 1-2. What is cgroup?
22 2-1. Mounting
23 2-2. Organizing Processes and Threads
24 2-2-1. Processes
25 2-2-2. Threads
26 2-3. [Un]populated Notification
27 2-4. Controlling Controllers
28 2-4-1. Enabling and Disabling
29 2-4-2. Top-down Constraint
30 2-4-3. No Internal Process Constraint
31 2-5. Delegation
32 2-5-1. Model of Delegation
33 2-5-2. Delegation Containment
34 2-6. Guidelines
35 2-6-1. Organize Once and Control
36 2-6-2. Avoid Name Collisions
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
43 4-1. Format
44 4-2. Conventions
45 4-3. Core Interface Files
47 5-1. CPU
48 5-1-1. CPU Interface Files
49 5-2. Memory
50 5-2-1. Memory Interface Files
51 5-2-2. Usage Guidelines
52 5-2-3. Memory Ownership
53 5-3. IO
54 5-3-1. IO Interface Files
55 5-3-2. Writeback
56 5-3-3. IO Latency
57 5-3-3-1. How IO Latency Throttling Works
58 5-3-3-2. IO Latency Interface Files
59 5-3-4. IO Priority
60 5-4. PID
61 5-4-1. PID Interface Files
62 5-5. Cpuset
63 5-5-1. Cpuset Interface Files
64 5-6. Device
65 5-7. RDMA
66 5-7-1. RDMA Interface Files
67 5-8. HugeTLB
68 5-8-1. HugeTLB Interface Files
69 5-9. Misc
70 5-9-1. Miscellaneous cgroup Interface Files
71 5-9-2. Migration and Ownership
72 5-10. Others
73 5-10-1. perf_event
74 5-N. Non-normative information
75 5-N-1. CPU controller root cgroup process behaviour
76 5-N-2. IO controller root cgroup process behaviour
78 6-1. Basics
79 6-2. The Root and Views
80 6-3. Migration and setns(2)
81 6-4. Interaction with Other Namespaces
83 P-1. Filesystem Support for Writeback
86 R-1. Multiple Hierarchies
87 R-2. Thread Granularity
88 R-3. Competition Between Inner Nodes and Threads
89 R-4. Other Interface Issues
90 R-5. Controller Issues and Remedies
91 R-5-1. Memory
98 -----------
102 qualifier as in "cgroup controllers". When explicitly referring to
107 ---------------
110 distribute system resources along the hierarchy in a controlled and
113 cgroup is largely composed of two parts - the core and controllers.
120 cgroups form a tree structure and every process in the system belongs
122 same cgroup. On creation, all processes are put in the cgroup that
129 hierarchical - if a controller is enabled on a cgroup, it affects all
131 sub-hierarchy of the cgroup. When a controller is enabled on a nested
133 restrictions set closer to the root in the hierarchy can not be
141 --------
146 # mount -t cgroup2 none $MOUNT_POINT
151 Controllers which are not in active use in the v2 hierarchy can be
153 legacy v1 multiple hierarchies in a fully backward compatible way.
156 is no longer referenced in its current hierarchy. Because per-cgroup
163 to inter-controller dependencies, other controllers may need to be
176 disabling controllers in v1 and make them always available in v2.
184 ignored on non-init namespace mounts. Please refer to the
196 Only populate memory.events with data for the current cgroup,
201 option is ignored on non-init namespace mounts.
204 Recursively apply memory.min and memory.low protection to
209 behavior but is a mount-option to avoid regressing setups
214 Count HugeTLB memory usage towards the cgroup's overall
215 memory usage for the memory controller (for the purpose of
216 statistics reporting and memory protection). This is a new
218 explicitly opted in with this mount option.
220 A few caveats to keep in mind:
222 * There is no HugeTLB pool management involved in the memory
223 controller. The pre-allocated pool does not belong to anyone.
226 memory controller. It is only charged to a cgroup when it is
227 actually used (e.g. at page fault time). Host memory
229 hard limits. In general, HugeTLB pool management should be
231 * Failure to charge a HugeTLB folio to the memory controller
232 results in SIGBUS. This could happen even if the HugeTLB pool
235 * Charging HugeTLB memory towards the memory controller affects
236 memory protection and reclaim dynamics. Any userspace tuning
239 will not be tracked by the memory controller (even if cgroup
244 --------------------------------
250 A child cgroup can be created by creating a sub-directory::
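
  # mkdir $CGROUP_NAME
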
255 structure. Each cgroup has a read-writable interface file
257 belong to the cgroup one-per-line. The PIDs are not ordered and the
271 zombie process does not appear in "cgroup.procs" and thus can't be
282 cgroup is in use in the system, this file may contain multiple lines,
283 one for each hierarchy. The entry for cgroup v2 is always in the
288 0::/test-cgroup/test-cgroup-nested
295 0::/test-cgroup/test-cgroup-nested (deleted)
314 cgroup whose resource domain is further up in the hierarchy. The root
319 Inside a threaded subtree, threads of a process can be put in
321 constraint - threaded controllers can be enabled on non-leaf cgroups
322 whether they have threads in them or not.
326 resource consumptions whether there are processes in it or not and
331 The current operation mode or type of the cgroup is shown in the
345 - As the cgroup will join the parent's resource domain. The parent
348 - When the parent is an unthreaded domain, it must not have any domain
352 Topology-wise, a cgroup can be in an invalid state. Please consider
355 A (threaded domain) - B (threaded) - C (domain, just created)
359 threaded cgroup. The "cgroup.type" file will report "domain invalid" in
364 cgroup becomes threaded or threaded controllers are enabled in the
365 "cgroup.subtree_control" file while there are processes in the cgroup.
370 threads in the cgroup. Except that the operations are per-thread
371 instead of per-process, "cgroup.threads" has the same format and
373 written to in any cgroup, as it can only move threads inside the same
379 all the processes are considered to be in the threaded domain cgroup.
380 "cgroup.procs" in a threaded domain cgroup contains the PIDs of all
381 processes in the subtree and is not readable in the subtree proper.
382 However, "cgroup.procs" can be written to from anywhere in the subtree
385 Only threaded controllers can be enabled in a threaded subtree. When
388 threads in the cgroup and its descendants. All consumptions which
393 between threads in a non-leaf cgroup and its child cgroups. Each
397 in a threaded cgroup::
399 - cpu
400 - cpuset
401 - perf_event
402 - pids
405 --------------------------
407 Each non-root cgroup has a "cgroup.events" file which contains
408 "populated" field indicating whether the cgroup's sub-hierarchy has
409 live processes in it. Its value is 0 if there is no live process in
412 example, to start a clean-up operation after all processes of a given
413 sub-hierarchy have exited. The populated state updates and
414 notifications are recursive. Consider the following sub-hierarchy
415 where the numbers in the parentheses represent the numbers of processes
416 in each cgroup::
418 A(4) - B(0) - C(1)
422 process in C exits, B and C's "populated" fields would flip to "0" and
428 -----------------------
437 cpu io memory
442 # echo "+cpu +memory -io" > cgroup.subtree_control
444 Only controllers which are listed in "cgroup.controllers" can be
449 Enabling a controller in a cgroup indicates that the distribution of
451 Consider the following sub-hierarchy. The enabled controllers are
452 listed in parentheses::
454 A(cpu,memory) - B(memory) - C()
457 As A has "cpu" and "memory" enabled, A will control the distribution
458 of CPU cycles and memory to its children, in this case, B. As B has
459 "memory" enabled but not "CPU", C and D will compete freely on CPU
460 cycles but their division of memory available to B will be controlled.
464 files in the child cgroups. In the above example, enabling "cpu" on B
465 would create the "cpu." prefixed controller interface files in C and
466 D. Likewise, disabling "memory" from B would remove the "memory."
468 controller interface files - anything which doesn't start with
472 Top-down Constraint
475 Resources are distributed top-down and a cgroup can further distribute
477 parent. This means that all non-root "cgroup.subtree_control" files
478 can only contain controllers which are enabled in the parent's
487 Non-root cgroups can distribute domain resources to their children
488 only when they don't have any processes of their own. In other words,
490 controllers enabled in their "cgroup.subtree_control" files.
500 controllers. How resource consumption in the root cgroup is governed
502 refer to the Non-normative information section in the Controllers
505 Note that the restriction doesn't get in the way if there is no
506 enabled controller in the cgroup's "cgroup.subtree_control". This is
510 children before enabling controllers in its "cgroup.subtree_control"
515 ----------
520 A cgroup can be delegated in two ways. First, to a less privileged
526 Because the resource control interface files in a given directory
535 delegated, the user can build sub-hierarchy under the directory,
539 happens in the delegated sub-hierarchy, nothing can escape the
543 cgroups in or nesting depth of a delegated sub-hierarchy; however,
544 this may be limited explicitly in the future.
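
For example, delegating a sub-hierarchy to an unprivileged user can be
done by granting it write access to the directory and its "cgroup.procs",
"cgroup.threads" and "cgroup.subtree_control" files (a sketch; the path
and user name are hypothetical)::

  # mkdir /sys/fs/cgroup/delegated
  # chown app:app /sys/fs/cgroup/delegated
  # chown app:app /sys/fs/cgroup/delegated/cgroup.procs
  # chown app:app /sys/fs/cgroup/delegated/cgroup.threads
  # chown app:app /sys/fs/cgroup/delegated/cgroup.subtree_control
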
550 A delegated sub-hierarchy is contained in the sense that processes
551 can't be moved into or out of the sub-hierarchy by the delegatee.
554 requiring the following conditions for a process with a non-root euid
558 - The writer must have write access to the "cgroup.procs" file.
560 - The writer must have write access to the "cgroup.procs" file of the
564 processes around freely in the delegated sub-hierarchy it can't pull
565 in from or push out to outside the sub-hierarchy.
571 ~~~~~~~~~~~~~ - C0 - C00
574 ~~~~~~~~~~~~~ - C1 - C10
577 currently in C10 into "C00/cgroup.procs". U0 has write access to the
581 will be denied with -EACCES.
586 is not reachable, the migration is rejected with -ENOENT.
590 ----------
596 and stateful resources such as memory are not moved together with the
598 inherent trade-offs between migration and various hot paths in terms
604 resource structure once on start-up. Dynamic adjustments to resource
621 start or end with terms which are often used in categorizing workloads
633 describes major schemes in use along with their expected behaviors.
637 -------
642 resource at the moment participate in the distribution, this is
643 work-conserving. Due to the dynamic nature, this model is usually
646 All weights are in the range [1, 10000] with the default at 100. This
647 allows symmetric multiplicative biases in both directions at fine
648 enough granularity while staying in the intuitive range.
650 As long as the weight is in range, all configuration combinations are
658 .. _cgroupv2-limits-distributor:
661 ------
664 Limits can be over-committed - the sum of the limits of children can
667 Limits are in the range [0, max] and default to "max", which is noop.
669 As limits can be over-committed, all configuration combinations are
676 .. _cgroupv2-protections-distributor:
679 -----------
684 soft boundaries. Protections can also be over-committed in which case
688 Protections are in the range [0, max] and default to 0, which is
691 As protections can be over-committed, all configuration combinations
695 "memory.low" implements best-effort memory protection and is an
700 -----------
703 resource. Allocations can't be over-committed - the sum of the
707 Allocations are in the range [0, max] and default to 0, which is no
710 As allocations can't be over-committed, some configuration
715 "cpu.rt.max" hard-allocates realtime slices and is an example of this
723 ------
725 All interface files should be in one of the following formats whenever
728 New-line separated values
736 (when read-only or multiple values can be written at once)
758 may be specified in any order and not all pairs have to be specified.
762 -----------
764 - Settings for a single feature should be contained in a single file.
766 - The root cgroup should be exempt from resource control and thus
769 - The default time unit is microseconds. If a different unit is ever
772 - A parts-per quantity should use a percentage decimal with at least
773 a two-digit fractional part - e.g. 13.40.
775 - If a controller implements weight based resource distribution, its
778 enough and symmetric bias in both directions while keeping it
781 - If a controller implements an absolute resource guarantee and/or
787 In the above four control files, the special token "max" should be
790 - If a setting has a configurable default value and keyed specific
792 appear as the first entry in the file.
804 # cat cgroup-example-interface-file
810 # echo 125 > cgroup-example-interface-file
814 # echo "default 125" > cgroup-example-interface-file
818 # echo "8:16 170" > cgroup-example-interface-file
822 # echo "8:0 default" > cgroup-example-interface-file
823 # cat cgroup-example-interface-file
827 - For events which are not very high frequency, an interface file
834 --------------------
839 A read-write single value file which exists on non-root
845 - "domain" : A normal valid domain cgroup.
847 - "domain threaded" : A threaded domain cgroup which is
850 - "domain invalid" : A cgroup which is in an invalid state.
854 - "threaded" : A threaded cgroup which is a member of a
861 A read-write new-line separated values file which exists on
865 the cgroup one-per-line. The PIDs are not ordered and the
874 - It must have write access to the "cgroup.procs" file.
876 - It must have write access to the "cgroup.procs" file of the
879 When delegating a sub-hierarchy, write access to this file
882 In a threaded cgroup, reading this file fails with EOPNOTSUPP
887 A read-write new-line separated values file which exists on
891 the cgroup one-per-line. The TIDs are not ordered and the
900 - It must have write access to the "cgroup.threads" file.
902 - The cgroup that the thread is currently in must be in the
905 - It must have write access to the "cgroup.procs" file of the
908 When delegating a sub-hierarchy, write access to this file
912 A read-only space separated values file which exists on all
919 A read-write space separated values file which exists on all
926 Space separated list of controllers prefixed with '+' or '-'
928 name prefixed with '+' enables the controller and '-'
934 A read-only flat-keyed file which exists on non-root cgroups.
936 otherwise, a value change in this file generates a file
946 A read-write single value file. The default is "max".
950 an attempt to create a new cgroup in the hierarchy will fail.
953 A read-write single value file. The default is "max".
960 A read-only flat-keyed file with the following entries:
968 in dying state for some undefined time (which can depend
978 A read-write single value file which exists on non-root cgroups.
985 is completed, the "frozen" value in the cgroup.events control file
993 Processes in the frozen cgroup can be killed by a fatal signal.
1001 create new sub-cgroups.
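
For example, a cgroup can be frozen and later thawed as follows (a
sketch; the events output assumes the cgroup has live processes)::

  # echo 1 > cgroup.freeze
  # cat cgroup.events
  populated 1
  frozen 1
  # echo 0 > cgroup.freeze
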
1004 A write-only single value file which exists on non-root cgroups.
1008 be killed. This means that all processes located in the affected cgroup
1014 In a threaded cgroup, writing this file fails with EOPNOTSUPP as
1016 the whole thread-group.
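
For example, an unwanted sub-hierarchy can be torn down with (a sketch)::

  # echo 1 > cgroup.kill

Once its "populated" field drops to 0, the cgroup directory can be
removed with rmdir.
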
1019 A read-write single value file. The allowed values are "0" and "1".
1023 Writing "1" to the file will re-enable the cgroup PSI accounting.
1026 accounting in a cgroup does not affect PSI accounting in descendants
1031 This may cause non-negligible overhead for some workloads at deep
1032 levels of the hierarchy, in which case this control attribute can
1033 be used to disable PSI accounting in the non-leaf cgroups.
1036 A read-write nested-keyed file.
1044 .. _cgroup-v2-cpu:
1047 ---
1054 In all the above models, cycles distribution is defined only on a temporal
1062 the cpu controller can only be enabled when all RT processes are in
1072 All time durations are in microseconds.
1075 A read-only flat-keyed file.
1080 - usage_usec
1081 - user_usec
1082 - system_usec
1086 - nr_periods
1087 - nr_throttled
1088 - throttled_usec
1089 - nr_bursts
1090 - burst_usec
1093 A read-write single value file which exists on non-root
1096 For non-idle groups (cpu.idle = 0), the weight is in the
1103 A read-write single value file which exists on non-root
1106 The nice value is in the range [-20, 19].
1115 A read-write two value file which exists on non-root cgroups.
1118 The maximum bandwidth limit. It's in the following format::
1122 which indicates that the group may consume up to $MAX in each
1127 A read-write single value file which exists on non-root
1130 The maximum burst, in the range [0, $MAX].
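
For example, to allow a group at most 10ms of CPU time per 50ms period
(roughly 20% of one CPU) with an optional 5ms burst (a sketch; values in
microseconds)::

  # echo "10000 50000" > cpu.max
  # echo 5000 > cpu.max.burst
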
1133 A read-write nested-keyed file.
1139 A read-write single value file which exists on non-root cgroups.
1154 A read-write single value file which exists on non-root cgroups.
1165 A read-write single value file which exists on non-root cgroups.
1168 This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1176 Memory
1177 ------
1179 The "memory" controller regulates distribution of memory. Memory is
1181 intertwining between memory usage and reclaim pressure and the
1182 stateful nature of memory, the distribution model is relatively
1185 While not completely water-tight, all major memory usages by a given
1186 cgroup are tracked so that the total memory consumption can be
1188 following types of memory usages are tracked.
1190 - Userland memory - page cache and anonymous memory.
1192 - Kernel data structures such as dentries and inodes.
1194 - TCP socket buffers.
1196 The above list may expand in the future for better coverage.
1199 Memory Interface Files
1202 All memory amounts are in bytes. If a value which is not aligned to
1206 memory.current
1207 A read-only single value file which exists on non-root
1210 The total amount of memory currently being used by the cgroup
1213 memory.min
1214 A read-write single value file which exists on non-root
1217 Hard memory protection. If the memory usage of a cgroup
1218 is within its effective min boundary, the cgroup's memory
1220 unprotected reclaimable memory available, OOM killer
1226 Effective min boundary is limited by memory.min values of
1227 all ancestor cgroups. If there is memory.min overcommitment
1228 (child cgroup or cgroups are requiring more protected memory
1231 actual memory usage below memory.min.
1233 Putting more memory than generally available under this
1236 If a memory cgroup is not populated with processes,
1237 its memory.min is ignored.
1239 memory.low
1240 A read-write single value file which exists on non-root
1243 Best-effort memory protection. If the memory usage of a
1245 memory won't be reclaimed unless there is no reclaimable
1246 memory available in unprotected cgroups.
1252 Effective low boundary is limited by memory.low values of
1253 all ancestor cgroups. If there is memory.low overcommitment
1254 (child cgroup or cgroups are requiring more protected memory
1257 actual memory usage below memory.low.
1259 Putting more memory than generally available under this
1262 memory.high
1263 A read-write single value file which exists on non-root
1266 Memory usage throttle limit. If a cgroup's usage goes
1272 limit should be used in scenarios where an external process
1276 memory.max
1277 A read-write single value file which exists on non-root
1280 Memory usage hard limit. This is the main mechanism to limit
1281 memory usage of a cgroup. If a cgroup's memory usage reaches
1282 this limit and can't be reduced, the OOM killer is invoked in
1286 In the default configuration, regular 0-order allocations always
1291 as -ENOMEM or silently ignore in cases like disk readahead.
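
A common pattern is to set "memory.high" somewhat below "memory.max" so
that the workload is throttled and reclaimed before the OOM killer
becomes the last resort (a sketch; the values are arbitrary)::

  # echo 4G > memory.high
  # echo 5G > memory.max
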
1293 memory.reclaim
1294 A write-only nested-keyed file which exists for all cgroups.
1296 This is a simple interface to trigger memory reclaim in the
1304 echo "1G" > memory.reclaim
1308 type of memory to reclaim from (anon, file, ..).
1312 specified amount, -EAGAIN is returned.
1315 interface) is not meant to indicate memory pressure on the
1316 memory cgroup. Therefore socket memory balancing triggered by
1317 the memory reclaim normally is not exercised in this case.
1319 reclaim induced by memory.reclaim.
1321 memory.peak
1322 A read-only single value file which exists on non-root
1325 The max memory usage recorded for the cgroup and its
1328 memory.oom.group
1329 A read-write single value file which exists on non-root
1335 (if the memory cgroup is not a leaf cgroup) are killed
1339 Tasks with the OOM protection (oom_score_adj set to -1000)
1342 If the OOM killer is invoked in a cgroup, it's not going
1344 memory.oom.group values of ancestor cgroups.
1346 memory.events
1347 A read-only flat-keyed file which exists on non-root cgroups.
1349 otherwise, a value change in this file generates a file
1352 Note that all fields in this file are hierarchical and the
1355 memory.events.local.
1359 high memory pressure even though its usage is under
1361 boundary is over-committed.
1365 throttled and routed to perform direct memory reclaim
1366 because the high memory boundary was exceeded. For a
1367 cgroup whose memory usage is capped by the high limit
1368 rather than global memory pressure, this event's
1372 The number of times the cgroup's memory usage was
1377 The number of times the cgroup's memory usage was
1381 considered as an option, e.g. for failed high-order
1391 memory.events.local
1392 Similar to memory.events but the fields in the file are local
1396 memory.stat
1397 A read-only flat-keyed file which exists on non-root cgroups.
1399 This breaks down the cgroup's memory footprint into different
1400 types of memory, type-specific details, and other information
1401 on the state and past events of the memory management system.
1403 All memory amounts are in bytes.
1406 can show up in the middle. Don't rely on items remaining in a
1409 If the entry has no per-node counter (and does not show in
1410 memory.numa_stat), we use 'npn' (non-per-node) as the tag
1411 to indicate that it will not show in the memory.numa_stat.
1414 Amount of memory used in anonymous mappings such as
1418 Amount of memory used to cache filesystem data,
1419 including tmpfs and shared memory.
1422 Amount of total kernel memory, including
1423 (kernel_stack, pagetables, percpu, vmalloc, slab) in
1424 addition to other kernel memory use cases.
1427 Amount of memory allocated to kernel stacks.
1430 Amount of memory allocated for page tables.
1433 Amount of memory allocated for secondary page tables,
1438 Amount of memory used for storing per-cpu kernel
1442 Amount of memory used in network transmission buffers
1445 Amount of memory used for vmap backed memory.
1448 Amount of cached filesystem data that is swap-backed,
1452 Amount of memory consumed by the zswap compression backend.
1455 Amount of application memory swapped out to zswap.
1469 Amount of swap cached in memory. The swapcache is accounted
1470 against both memory and swap usage.
1473 Amount of memory used in anonymous mappings backed by
1485 Amount of memory, swap-backed and filesystem-backed,
1486 on the internal memory management lists used by the
1490 memory management lists), inactive_foo + active_foo may not be equal to
1491 the value for the foo counter, since the foo counter is type-based, not
1492 list-based.
1499 Part of "slab" that cannot be reclaimed on memory
1503 Amount of memory used for storing in-kernel data
1531 Amount of scanned pages (in an inactive LRU list)
1537 Amount of scanned pages by kswapd (in an inactive LRU list)
1540 Amount of scanned pages directly (in an inactive LRU list)
1543 Amount of scanned pages by khugepaged (in an inactive LRU list)
1561 Amount of scanned pages (in an active LRU list)
1570 Amount of pages postponed to be freed under memory pressure
1586 Number of transparent hugepages which are swapped out in one piece
1594 memory.numa_stat
1595 A read-only nested-keyed file which exists on non-root cgroups.
1597 This breaks down the cgroup's memory footprint into different
1598 types of memory, type-specific details, and other information
1599 per node on the state of the memory management system.
1607 All memory amounts are in bytes.
1609 The output format of memory.numa_stat is::
1611 type N0=<bytes in node 0> N1=<bytes in node 1> ...
1614 can show up in the middle. Don't rely on items remaining in a
1617 For the meaning of each entry, refer to memory.stat.
1619 memory.swap.current
1620 A read-only single value file which exists on non-root
1626 memory.swap.high
1627 A read-write single value file which exists on non-root
1632 allow userspace to implement custom out-of-memory procedures.
1636 during regular operation. Compare to memory.swap.max, which
1638 continue unimpeded as long as other memory can be reclaimed.
1642 memory.swap.peak
1643 A read-only single value file which exists on non-root
1649 memory.swap.max
1650 A read-write single value file which exists on non-root
1654 limit, anonymous memory of the cgroup will not be swapped out.
1656 memory.swap.events
1657 A read-only flat-keyed file which exists on non-root cgroups.
1659 otherwise, a value change in this file generates a file
1673 because of running out of swap system-wide or max
1679 reduces the impact on the workload and memory management.
1681 memory.zswap.current
1682 A read-only single value file which exists on non-root
1685 The total amount of memory consumed by the zswap compression
1688 memory.zswap.max
1689 A read-write single value file which exists on non-root
1694 entries fault back in or are written out to disk.
1696 memory.zswap.writeback
1697 A read-write single value file. The default value is "1". The
1708 Note that this is subtly different from setting memory.swap.max to
1711 memory.pressure
1712 A read-only nested-keyed file.
1714 Shows pressure stall information for memory. See
1721 "memory.high" is the main mechanism to control memory usage.
1722 Over-committing on high limit (sum of high limits > available memory)
1723 and letting global memory pressure distribute memory according to
1729 more memory or terminating the workload.
1731 Determining whether a cgroup has enough memory is not trivial as
1732 memory usage doesn't indicate whether the workload can benefit from
1733 more memory. For example, a workload which writes data received from
1734 the network to a file can use all available memory but can also perform
1735 just as well with a small amount of memory. A measure of memory
1736 pressure - how much the workload is being impacted due to lack of
1737 memory - is necessary to determine whether a workload needs more
1738 memory; unfortunately, a memory pressure monitoring mechanism isn't
1742 Memory Ownership
1745 A memory area is charged to the cgroup which instantiated it and stays
1747 to a different cgroup doesn't move the memory usages that it
1748 instantiated while in the previous cgroup to the new cgroup.
1750 A memory area may be used by processes belonging to different cgroups.
1751 To which cgroup the area will be charged is indeterminate; however,
1752 over time, the memory area is likely to end up in a cgroup which has
1753 enough memory allowance to avoid high reclaim pressure.
1755 If a cgroup sweeps a considerable amount of memory which is expected
1757 POSIX_FADV_DONTNEED to relinquish the ownership of memory areas
1758 belonging to the affected files to ensure correct memory ownership.
1762 --
1767 only if cfq-iosched is in use and neither scheme is available for
1768 blk-mq devices.
1775 A read-only nested-keyed file.
1795 A read-write nested-keyed file which exists only on the root
1807 enable Weight-based control enable
1839 devices which show wide temporary behavior changes - e.g. a
1850 A read-write nested-keyed file which exists only on the root
1863 model The cost model in use - "linear"
1885 The IO cost model isn't expected to be accurate in absolute
1889 generate device-specific coefficients.
1892 A read-write flat-keyed file which exists on non-root cgroups.
1897 $MAJ:$MIN device numbers and not ordered. The weights are in
1899 the cgroup can use in relation to its siblings.
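
For example, the default weight and a per-device override can be set as
follows (a sketch; the device numbers are illustrative)::

  # echo "default 200" > io.weight
  # echo "8:16 50" > io.weight
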
1912 A read-write nested-keyed file which exists on non-root
1926 When writing, any number of nested key-value pairs can be
1927 specified in any order. "max" can be specified as the value
1931 BPS and IOPS are measured in each IO direction and IOs are
1951 A read-only nested-keyed file.
1962 mechanism. Writeback sits between the memory and IO domains and
1963 regulates the proportion of dirty memory by balancing dirtying and
1966 The io controller, in conjunction with the memory controller,
1967 implements control of page cache writeback IOs. The memory controller
1968 defines the memory domain that dirty memory ratio is calculated and
1970 writes out dirty pages for the memory domain. Both system-wide and
1971 per-cgroup dirty memory states are examined and the more restrictive
1979 There are inherent differences in memory and writeback management
1980 which affect how cgroup ownership is tracked. Memory is tracked per
1985 As cgroup ownership for memory is tracked per page, there can be pages
1995 inode simultaneously are not supported well. In such circumstances, a
1997 As the memory controller assigns page ownership on the first use and
2008 amount of available memory capped by limits imposed by the
2009 memory controller and system-wide clean memory.
2013 total available memory and applied the same way as
2025 The limits are only applied at the peer level in the hierarchy. This means that
2026 in the diagram below, only groups A, B, and C will influence each other, and
2036 So the ideal way to configure this is to set io.latency in groups A, B, and C.
2040 avg_lat value in io.stat for your workload group to get an idea of the
2042 your real setting, setting it 10-15% higher than the value in io.stat.
2052 - Queue depth throttling. This is the number of outstanding IOs a group is
2056 - Artificial delay induction. There are certain types of IO that cannot be
2061 fields in io.stat increase. The delay value is how many microseconds are
2062 being added to any process that runs in this group.
2076 "MAJOR:MINOR target=<target time in microseconds>"
2079 If the controller is enabled you will see extra stats in io.stat in
2088 calculated by multiplying the win value in io.stat by the
2092 The sampling window size in milliseconds. This is the minimum
2103 no-change
2106 promote-to-rt
2107 For requests that have a non-RT I/O priority class, change it into RT.
2111 restrict-to-be
2121 none-to-rt
2122 Deprecated. Just an alias for promote-to-rt.
2126 +----------------+---+
2127 | no-change | 0 |
2128 +----------------+---+
2129 | promote-to-rt | 1 |
2130 +----------------+---+
2131 | restrict-to-be | 2 |
2132 +----------------+---+
2134 +----------------+---+
2138 +-------------------------------+---+
2140 +-------------------------------+---+
2141 | IOPRIO_CLASS_RT (real-time) | 1 |
2142 +-------------------------------+---+
2144 +-------------------------------+---+
2146 +-------------------------------+---+
2150 - If I/O priority class policy is promote-to-rt, change the request I/O
2153 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2159 ---
2165 The number of tasks in a cgroup can be exhausted in ways which other
2168 hitting memory restrictions.
2170 Note that PIDs used in this controller refer to TIDs, process IDs as
2178 A read-write single value file which exists on non-root
2184 A read-only single value file which exists on all cgroups.
2186 The number of processes currently in the cgroup and its
2194 through fork() or clone(). These will return -EAGAIN if the creation
2199 ------
2202 the CPU and memory node placement of tasks to only the resources
2203 specified in the cpuset interface files in a task's current cgroup.
2206 memory placement to reduce cross-node memory access and contention
2210 cannot use CPUs or memory nodes not allowed in its parent.
2217 A read-write multiple values file which exists on non-root
2218 cpuset-enabled cgroups.
2225 The CPU numbers are comma-separated numbers or ranges.
2229 0-4,6,8-10
2232 setting as the nearest cgroup ancestor with a non-empty
2239 A read-only multiple values file which exists on all
2240 cpuset-enabled cgroups.
2249 "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
2250 can be granted. In this case, it will be treated just like an
2256 A read-write multiple values file which exists on non-root
2257 cpuset-enabled cgroups.
2259 It lists the requested memory nodes to be used by tasks within
2260 this cgroup. The actual list of memory nodes granted, however,
2262 from the requested memory nodes.
2264 The memory node numbers are comma-separated numbers or ranges.
2268 0-1,3
2271 setting as the nearest cgroup ancestor with a non-empty
2272 "cpuset.mems" or all the available memory nodes if none
2276 and won't be affected by any memory node hotplug events.
2278 Setting a non-empty value to "cpuset.mems" causes memory of
2280 they are currently using memory outside of the designated nodes.
2282 There is a cost for this memory migration. The migration
2283 may not be complete and some memory pages may be left behind.
2290 A read-only multiple values file which exists on all
2291 cpuset-enabled cgroups.
2293 It lists the onlined memory nodes that are actually granted to
2294 this cgroup by its parent. These memory nodes are allowed to
2297 If "cpuset.mems" is empty, it shows all the memory nodes from the
2300 the memory nodes listed in "cpuset.mems" can be granted. In this
2303 Its value will be affected by memory node hotplug events.
2306 A read-write multiple values file which exists on non-root
2307 cpuset-enabled cgroups.
2316 CPUs that are allocated to that partition are listed in
2323 "cpuset.cpus". The only constraint in setting it is that the
2328 exclusive CPU appearing in two or more of its child cgroups is
2333 are in its exclusive CPU set.
2336 A read-only multiple values file which exists on all non-root
2337 cpuset-enabled cgroups.
2345 treated to have an implicit value of "cpuset.cpus" in the
2349 A read-only multiple values file which exists only on the root cgroup.
2351 This file shows the set of all isolated CPUs used in existing
2356 A read-write single value file which exists on non-root
2357 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2363 "member" Non-root member of a partition
2368 A cpuset partition is a collection of cpuset-enabled cgroups with
2373 of that partition cannot use any CPUs in that set.
2375 There are two types of partitions - local and remote. A local
2391 be changed. All other non-root cgroups start out as "member".
2397 When set to "isolated", the CPUs in that partition will be in
2399 and excluded from the unbound workqueues. Tasks placed in such
2403 A partition root ("root" or "isolated") can be in one of the
2404 two possible states - valid or invalid. An invalid partition
2405 root is in a degraded state where some state information may
2415 "member" Non-root member of a partition
2422 In the case of an invalid partition root, a descriptive string on
2442 A valid non-root parent partition may distribute out all its CPUs
2448 invalid causing disruption to tasks running in those child
2451 value in "cpuset.cpus" or "cpuset.cpus.exclusive".
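
Putting it together, a child cgroup might be turned into an isolated
partition roughly as follows (a sketch; assumes the listed CPUs can be
granted exclusively by the parent)::

  # echo "2-3" > child/cpuset.cpus
  # echo isolated > child/cpuset.cpus.partition
  # cat child/cpuset.cpus.partition
  isolated

Reading the file back confirms whether the partition is valid.
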
2461 A user can pre-configure certain CPUs to an isolated state
2464 into a partition, they have to be used in an isolated partition.
2468 -----------------
2479 on the return value the attempt will succeed or fail with -EPERM.
2484 If the program returns 0, the attempt fails with -EPERM, otherwise it
2487 An example of BPF_PROG_TYPE_CGROUP_DEVICE program may be found in
2488 tools/testing/selftests/bpf/progs/dev_cgroup.c in the kernel source tree.
2492 ----
2501 A read-write nested-keyed file that exists for all the cgroups
2522 A read-only file that describes current resource usage.
2531 -------
2548 A read-only flat-keyed file which exists on non-root cgroups.
2554 Similar to hugetlb.<hugepagesize>.events but the fields in the file
2559 Similar to memory.numa_stat, it shows the numa information of the
2560 hugetlb pages of <hugepagesize> in this cgroup. Only actively
2561 in-use hugetlb pages are included. The per-node values are in bytes.
2564 ----
2571 A resource can be added to the controller via enum misc_res_type{} in the
2573 in the kernel/cgroup/misc.c file. Provider of the resource must set its
2577 uncharge APIs. All of the APIs to interact with misc controller are in
2586 A read-only flat-keyed file shown only in the root cgroup. It shows
2595 A read-only flat-keyed file shown in all cgroups. It shows
2596 the current usage of the resources in the cgroup and its children.::
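
  res_a 3
  res_b 0

(The resource names above are illustrative; the actual entries depend on
which resources have been registered with the controller.)
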
2603 A read-write flat-keyed file shown in the non-root cgroups. Allowed
2604 maximum usage of the resources in the cgroup and its children.::
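
  res_a 1
  res_b max

A limit can be set or reset to the default by writing the resource name
followed by a value or "max" (a sketch; the resource names are
illustrative)::

  # echo res_a 1 > misc.max
  # echo res_a max > misc.max
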
2618 Limits can be set higher than the capacity value in the misc.capacity
2622 A read-only flat-keyed file which exists on non-root cgroups. The
2624 change in this file generates a file modified event. All fields in
2634 A miscellaneous scalar resource is charged to the cgroup in which it is used
2640 ------
2651 Non-normative information
2652 -------------------------
2661 When distributing CPU cycles in the root cgroup each thread in this
2662 cgroup is treated as if it was hosted in a separate child cgroup of the
2666 For details of this mapping see sched_prio_to_weight array in
2668 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2674 Root cgroup processes are hosted in an implicit leaf child node.
2684 ------
2695 complete path of the cgroup of a process. In a container setup where
2703 The path '/batchjobs/container_id1' can be considered as system-data
2708 # ls -l /proc/self/ns/cgroup
2709 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2715 # ls -l /proc/self/ns/cgroup
2716 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2720 When some thread from a multi-threaded process unshares its cgroup
2732 ------------------
2734 The 'cgroupns root' for a cgroup namespace is the cgroup in which the
2735 process calling unshare(2) is running. For example, if a process in
2743 # ~/unshare -c # unshare cgroupns in some cgroup
2751 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2754 cgroup paths (in /proc/self/cgroup) only inside their root cgroup.
2782 ----------------------
2811 ---------------------------------
2814 running inside a non-init cgroup namespace::
2816 # mount -t cgroup2 none $MOUNT_POINT
2823 the view of cgroup hierarchy by namespace-private cgroupfs mount
2830 This section contains kernel programming information in the areas
2836 --------------------------------
2839 address_space_operations->writepage[s]() to annotate bio's using the
2856 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
2873 - Multiple hierarchies including named ones are not supported.
2875 - None of the v1 mount options are supported.
2877 - The "tasks" file is removed and "cgroup.procs" is not sorted.
2879 - "cgroup.clone_children" is removed.
2881 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" file
2889 --------------------
2893 provide a high level of flexibility, it wasn't useful in practice.
2896 type controllers such as freezer which can be useful in all
2897 hierarchies could only be used in one. The issue is exacerbated by
2904 In practice, these issues heavily limited which controllers could be
2915 used in general and what controllers were able to do.
2918 that a thread's cgroup membership couldn't be described in finite
2920 in length, which made it highly awkward to manipulate and led to
2922 which in turn exacerbated the original problem of proliferating number
2931 In most use cases, putting controllers on hierarchies which are
2934 depending on the specific controller. In other words, hierarchy may
2937 how memory is distributed beyond a certain level while still wanting
2942 ------------------
2950 Generally, in-process knowledge is available only to the process
2951 itself; thus, unlike service-level organization of processes,
2956 in combination with thread granularity. cgroups were delegated to
2958 sub-hierarchies and control resource distributions along them. This
2959 effectively raised cgroup to the status of a syscall-like API exposed
2969 that the process would actually be operating on its own sub-hierarchy.
2973 system-management pseudo filesystem. cgroup ended up with interface
2976 individual applications through the ill-defined delegation mechanism
2986 -------------------------------------------
2988 cgroup v1 allowed threads to be in any cgroups which created an
2997 cycles and the number of internal threads fluctuated - the ratios
3011 The memory controller didn't have a way to control what happened
3013 clearly defined. There were attempts to add ad-hoc behaviors and
3015 led to problems extremely difficult to resolve in the long term.
3023 in a uniform way.
3027 ----------------------
3031 was how an empty cgroup was notified - a userland helper binary was
3034 to in-kernel event delivery filtering mechanism further complicating
3049 formats and units even in the same controller.
3056 ------------------------------
3058 Memory
3063 global reclaim prefers is opt-in, rather than opt-out. The costs for
3067 hierarchical meaning. All configured groups are organized in a global
3069 in the hierarchy. This makes subtree delegation impossible. Second,
3073 becomes self-defeating.
3075 The memory.low boundary on the other hand is a top-down allocated
3084 available memory. The memory consumption of workloads varies during
3088 estimation is hard and error prone, and getting it wrong results in
3092 The memory.high boundary on the other hand can be set much more
3098 and make corrections until the minimal memory footprint that still
3101 In extreme cases, with many concurrent allocations and a complete
3104 allocation from the slack available in other groups or the rest of the
3105 system than killing the group. Otherwise, memory.max is there to
3109 Setting the original memory.limit_in_bytes below the current usage was
3111 limit setting to fail. memory.max on the other hand will first set the
3113 new limit is met - or the task writing to memory.max is killed.
3115 The combined memory+swap accounting and limiting is replaced by real
3118 The main argument for a combined memory+swap facility in the original
3120 able to swap all anonymous memory of a child group, regardless of the
3122 groups can sabotage swapping by other means - such as referencing its
3123 anonymous memory in a tight loop - and an admin can not assume full
3127 intuitive userspace interface, and it flies in the face of the idea
3129 resources. Swap space is a resource like all others in the system,