Lines Matching +full:charge +full:- +full:ctrl +full:- +full:value

1 .. _cgroup-v2:
11 conventions of cgroup v2. It describes all userland-visible aspects
14 v1 is available under :ref:`Documentation/admin-guide/cgroup-v1/index.rst <cgroup-v1>`.
19 1-1. Terminology
20 1-2. What is cgroup?
22 2-1. Mounting
23 2-2. Organizing Processes and Threads
24 2-2-1. Processes
25 2-2-2. Threads
26 2-3. [Un]populated Notification
27 2-4. Controlling Controllers
28 2-4-1. Enabling and Disabling
29 2-4-2. Top-down Constraint
30 2-4-3. No Internal Process Constraint
31 2-5. Delegation
32 2-5-1. Model of Delegation
33 2-5-2. Delegation Containment
34 2-6. Guidelines
35 2-6-1. Organize Once and Control
36 2-6-2. Avoid Name Collisions
38 3-1. Weights
39 3-2. Limits
40 3-3. Protections
41 3-4. Allocations
43 4-1. Format
44 4-2. Conventions
45 4-3. Core Interface Files
47 5-1. CPU
48 5-1-1. CPU Interface Files
49 5-2. Memory
50 5-2-1. Memory Interface Files
51 5-2-2. Usage Guidelines
52 5-2-3. Memory Ownership
53 5-3. IO
54 5-3-1. IO Interface Files
55 5-3-2. Writeback
56 5-3-3. IO Latency
57 5-3-3-1. How IO Latency Throttling Works
58 5-3-3-2. IO Latency Interface Files
59 5-3-4. IO Priority
60 5-4. PID
61 5-4-1. PID Interface Files
62 5-5. Cpuset
63 5-5-1. Cpuset Interface Files
64 5-6. Device
65 5-7. RDMA
66 5-7-1. RDMA Interface Files
67 5-8. DMEM
68 5-9. HugeTLB
69 5-9-1. HugeTLB Interface Files
70 5-10. Misc
71 5-10-1. Miscellaneous cgroup Interface Files
72 5-10-2. Migration and Ownership
73 5-11. Others
74 5-11-1. perf_event
75 5-N. Non-normative information
76 5-N-1. CPU controller root cgroup process behaviour
77 5-N-2. IO controller root cgroup process behaviour
79 6-1. Basics
80 6-2. The Root and Views
81 6-3. Migration and setns(2)
82 6-4. Interaction with Other Namespaces
84 P-1. Filesystem Support for Writeback
87 R-1. Multiple Hierarchies
88 R-2. Thread Granularity
89 R-3. Competition Between Inner Nodes and Threads
90 R-4. Other Interface Issues
91 R-5. Controller Issues and Remedies
92 R-5-1. Memory
99 -----------
108 ---------------
114 cgroup is largely composed of two parts - the core and controllers.
130 hierarchical - if a controller is enabled on a cgroup, it affects all
132 sub-hierarchy of the cgroup. When a controller is enabled on a nested
142 --------
147 # mount -t cgroup2 none $MOUNT_POINT
157 is no longer referenced in its current hierarchy. Because per-cgroup
164 to inter-controller dependencies, other controllers may need to be
185 ignored on non-init namespace mounts. Please refer to the
202 option is ignored on non-init namespace mounts.
210 behavior but is a mount-option to avoid regressing setups
224 controller. The pre-allocated pool does not belong to anyone.
232 * Failure to charge a HugeTLB folio to the memory controller
244 The option restores v1-like behavior of pids.events:max, that is only
252 --------------------------------
258 A child cgroup can be created by creating a sub-directory::
263 structure. Each cgroup has a read-writable interface file
265 belong to the cgroup one-per-line. The PIDs are not ordered and the
296 0::/test-cgroup/test-cgroup-nested
303 0::/test-cgroup/test-cgroup-nested (deleted)
329 constraint - threaded controllers can be enabled on non-leaf cgroups
353 - As the cgroup will join the parent's resource domain. The parent
356 - When the parent is an unthreaded domain, it must not have any domain
360 Topology-wise, a cgroup can be in an invalid state. Please consider
363 A (threaded domain) - B (threaded) - C (domain, just created)
378 threads in the cgroup. Except that the operations are per-thread
379 instead of per-process, "cgroup.threads" has the same format and
401 between threads in a non-leaf cgroup and its child cgroups. Each
407 - cpu
408 - cpuset
409 - perf_event
410 - pids
413 --------------------------
415 Each non-root cgroup has a "cgroup.events" file which contains
416 "populated" field indicating whether the cgroup's sub-hierarchy has
417 live processes in it. Its value is 0 if there is no live process in
419 events are triggered when the value changes. This can be used, for
420 example, to start a clean-up operation after all processes of a given
421 sub-hierarchy have exited. The populated state updates and
422 notifications are recursive. Consider the following sub-hierarchy
426 A(4) - B(0) - C(1)
436 -----------------------
450 # echo "+cpu +memory -io" > cgroup.subtree_control
459 Consider the following sub-hierarchy. The enabled controllers are
462 A(cpu,memory) - B(memory) - C()
476 controller interface files - anything which doesn't start with
480 Top-down Constraint
483 Resources are distributed top-down and a cgroup can further distribute
485 parent. This means that all non-root "cgroup.subtree_control" files
495 Non-root cgroups can distribute domain resources to their children
510 refer to the Non-normative information section in the Controllers
523 ----------
545 delegated, the user can build sub-hierarchy under the directory,
549 happens in the delegated sub-hierarchy, nothing can escape the
553 cgroups in or nesting depth of a delegated sub-hierarchy; however,
560 A delegated sub-hierarchy is contained in the sense that processes
561 can't be moved into or out of the sub-hierarchy by the delegatee.
564 requiring the following conditions for a process with a non-root euid
568 - The writer must have write access to the "cgroup.procs" file.
570 - The writer must have write access to the "cgroup.procs" file of the
574 processes around freely in the delegated sub-hierarchy it can't pull
575 in from or push out to outside the sub-hierarchy.
581 ~~~~~~~~~~~~~ - C0 - C00
584 ~~~~~~~~~~~~~ - C1 - C10
591 will be denied with -EACCES.
596 is not reachable, the migration is rejected with -ENOENT.
600 ----------
608 inherent trade-offs between migration and various hot paths in terms
614 resource structure once on start-up. Dynamic adjustments to resource
647 -------
653 work-conserving. Due to the dynamic nature, this model is usually
668 .. _cgroupv2-limits-distributor:
671 ------
674 Limits can be over-committed - the sum of the limits of children can
679 As limits can be over-committed, all configuration combinations are
686 .. _cgroupv2-protections-distributor:
689 -----------
694 soft boundaries. Protections can also be over-committed in which case
701 As protections can be over-committed, all configuration combinations
705 "memory.low" implements best-effort memory protection and is an
710 -----------
713 resource. Allocations can't be over-committed - the sum of the
720 As allocations can't be over-committed, some configuration
725 "cpu.rt.max" hard-allocates realtime slices and is an example of this
733 ------
738 New-line separated values
739 (when only one value can be written at once)
746 (when read-only or multiple values can be written at once)
772 -----------
774 - Settings for a single feature should be contained in a single file.
776 - The root cgroup should be exempt from resource control and thus
779 - The default time unit is microseconds. If a different unit is ever
782 - A parts-per quantity should use a percentage decimal with at least
783 two digit fractional part - e.g. 13.40.
785 - If a controller implements weight based resource distribution, its
791 - If a controller implements an absolute resource guarantee and/or
800 - If a setting has a configurable default value and keyed specific
804 The default value can be updated by writing either "default $VAL" or
808 the value to indicate removal of the override. Override entries
809 with "default" as the value must not appear when read.
814 # cat cgroup-example-interface-file
818 The default value can be updated by::
820 # echo 125 > cgroup-example-interface-file
824 # echo "default 125" > cgroup-example-interface-file
828 # echo "8:16 170" > cgroup-example-interface-file
832 # echo "8:0 default" > cgroup-example-interface-file
833 # cat cgroup-example-interface-file
837 - For events which are not very high frequency, an interface file
838 "events" should be created which lists event key value pairs.
844 --------------------
849 A read-write single value file which exists on non-root
855 - "domain" : A normal valid domain cgroup.
857 - "domain threaded" : A threaded domain cgroup which is
860 - "domain invalid" : A cgroup which is in an invalid state.
864 - "threaded" : A threaded cgroup which is a member of a
871 A read-write new-line separated values file which exists on
875 the cgroup one-per-line. The PIDs are not ordered and the
884 - It must have write access to the "cgroup.procs" file.
886 - It must have write access to the "cgroup.procs" file of the
889 When delegating a sub-hierarchy, write access to this file
897 A read-write new-line separated values file which exists on
901 the cgroup one-per-line. The TIDs are not ordered and the
910 - It must have write access to the "cgroup.threads" file.
912 - The cgroup that the thread is currently in must be in the
915 - It must have write access to the "cgroup.procs" file of the
918 When delegating a sub-hierarchy, write access to this file
922 A read-only space separated values file which exists on all
929 A read-write space separated values file which exists on all
936 Space separated list of controllers prefixed with '+' or '-'
938 name prefixed with '+' enables the controller and '-'
944 A read-only flat-keyed file which exists on non-root cgroups.
946 otherwise, a value change in this file generates a file
956 A read-write single value file. The default is "max".
963 A read-write single value file. The default is "max".
970 A read-only flat-keyed file with the following entries:
996 A read-write single value file which exists on non-root cgroups.
1003 is completed, the "frozen" value in the cgroup.events control file
1019 create new sub-cgroups.
1022 A write-only single value file which exists in non-root cgroups.
1023 The only allowed value is "1".
1034 the whole thread-group.
1037 A read-write single value file whose allowed values are "0" and "1".
1041 Writing "1" to the file will re-enable the cgroup PSI accounting.
1049 This may cause non-negligible overhead for some workloads when under
1051 be used to disable PSI accounting in the non-leaf cgroups.
1054 A read-write nested-keyed file.
1062 .. _cgroup-v2-cpu:
1065 ---
1083 management software may already have placed RT processes into non-root cgroups
1101 A read-only flat-keyed file.
1106 - usage_usec
1107 - user_usec
1108 - system_usec
1112 - nr_periods
1113 - nr_throttled
1114 - throttled_usec
1115 - nr_bursts
1116 - burst_usec
1119 A read-write single value file which exists on non-root
1129 A read-write single value file which exists on non-root
1132 The nice value is in the range [-20, 19].
1137 granularity is coarser for the nice values, the read value is
1141 A read-write two value file which exists on non-root cgroups.
1153 A read-write single value file which exists on non-root
1159 A read-write nested-keyed file.
1165 A read-write single value file which exists on non-root cgroups.
1173 value is used to clamp the task specific minimum utilization clamp.
1176 the current value for the maximum utilization (limit), i.e.
1180 A read-write single value file which exists on non-root cgroups.
1188 value is used to clamp the task specific maximum utilization clamp.
1191 A read-write single value file which exists on non-root cgroups.
1194 This is the cgroup analog of the per-task SCHED_IDLE sched policy.
1195 Setting this value to 1 will make the scheduling policy of the
1203 ------
1211 While not completely water-tight, all major memory usages by a given
1216 - Userland memory - page cache and anonymous memory.
1218 - Kernel data structures such as dentries and inodes.
1220 - TCP socket buffers.
1228 All memory amounts are in bytes. If a value which is not aligned to
1229 PAGE_SIZE is written, the value may be rounded up to the closest
1233 A read-only single value file which exists on non-root
1240 A read-write single value file which exists on non-root
1266 A read-write single value file which exists on non-root
1269 Best-effort memory protection. If the memory usage of a
1289 A read-write single value file which exists on non-root
1303 A read-write single value file which exists on non-root
1312 In default configuration regular 0-order allocations always
1317 as -ENOMEM or silently ignore in cases like disk readahead.
1320 A write-only nested-keyed file which exists for all cgroups.
1331 specified amount, -EAGAIN is returned.
1343 swappiness Swappiness value to reclaim with
1346 Specifying a swappiness value instructs the kernel to perform
1347 the reclaim with that swappiness value. Note that this has the
1352 A read-write single value file which exists on non-root cgroups.
1357 A write of any non-empty string to this file resets it to the
1362 A read-write single value file which exists on non-root
1363 cgroups. The default value is "0".
1372 Tasks with the OOM protection (oom_score_adj set to -1000)
1380 A read-only flat-keyed file which exists on non-root cgroups.
1382 otherwise, a value change in this file generates a file
1394 boundary is over-committed.
1414 considered as an option, e.g. for failed high-order
1430 A read-only flat-keyed file which exists on non-root cgroups.
1433 types of memory, type-specific details, and other information
1442 If the entry has no per-node counter (or is not shown in the
1443 memory.numa_stat). We use 'npn' (non-per-node) as the tag
1474 Amount of memory used for storing per-cpu kernel
1484 Amount of cached filesystem data that is swap-backed,
1524 Amount of memory, swap-backed and filesystem-backed,
1530 the value for the foo counter, since the foo counter is type-based, not
1531 list-based.
1542 Amount of memory used for storing in-kernel data
1632 Number of zero-filled pages swapped out with I/O skipped due to the
1691 A read-only nested-keyed file which exists on non-root cgroups.
1694 types of memory, type-specific details, and other information
1716 A read-only single value file which exists on non-root
1723 A read-write single value file which exists on non-root
1728 allow userspace to implement custom out-of-memory procedures.
1739 A read-write single value file which exists on non-root cgroups.
1744 A write of any non-empty string to this file resets it to the
1749 A read-write single value file which exists on non-root
1756 A read-only flat-keyed file which exists on non-root cgroups.
1758 otherwise, a value change in this file generates a file
1772 because of running out of swap system-wide or max
1781 A read-only single value file which exists on non-root
1788 A read-write single value file which exists on non-root
1796 A read-write single value file. The default value is "1".
1814 A read-only nested-keyed file.
1824 Over-committing on high limit (sum of high limits > available memory)
1838 pressure - how much the workload is being impacted due to lack of
1839 memory - is necessary to determine whether a workload needs more
1853 To which cgroup the area will be charged is non-deterministic; however,
1864 --
1869 only if cfq-iosched is in use and neither scheme is available for
1870 blk-mq devices.
1877 A read-only nested-keyed file.
1897 A read-write nested-keyed file which exists only on the root
1909 enable Weight-based control enable
1910 ctrl "auto" or "user"
1927 8:16 enable=1 ctrl=auto rpct=95.00 rlat=75000 wpct=95.00 wlat=150000 min=50.00 max=150.00
1941 devices which show wide temporary behavior changes - e.g. a
1945 When "ctrl" is "auto", the parameters are controlled by the
1946 kernel and may change automatically. Setting "ctrl" to "user"
1949 automatic mode can be restored by setting "ctrl" to "auto".
1952 A read-write nested-keyed file which exists only on the root
1964 ctrl "auto" or "user"
1965 model The cost model in use - "linear"
1968 When "ctrl" is "auto", the kernel may change all parameters
1969 dynamically. When "ctrl" is set to "user" or any other
1970 parameters are written to, "ctrl" becomes "user" and the
1991 generate device-specific coefficients.
1994 A read-write flat-keyed file which exists on non-root cgroups.
2014 A read-write nested-keyed file which exists on non-root
2028 When writing, any number of nested key-value pairs can be
2029 specified in any order. "max" can be specified as the value
2053 A read-only nested-keyed file.
2072 writes out dirty pages for the memory domain. Both system-wide and
2073 per-cgroup dirty memory states are examined and the more restrictive
2111 memory controller and system-wide clean memory.
2139 Generally you do not want to set a value lower than the latency your device
2140 supports. Experiment to find the value that works best for your workload.
2142 avg_lat value in io.stat for your workload group to get an idea of the
2143 latency you see during normal operation. Use the avg_lat value as a basis for
2144 your real setting, setting at 10-15% higher than the value in io.stat.
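The 10-15% margin described above is simple integer arithmetic. The sketch below is illustrative only: `latency_target` is a hypothetical helper, not part of any cgroup tooling, and it assumes the observed avg_lat value and the resulting target are both expressed in microseconds.

```shell
# latency_target: hypothetical helper (an assumption for illustration).
# $1 = observed avg_lat, assumed to be in microseconds
# $2 = safety margin in percent (the text suggests 10 to 15)
# Prints the observed value padded by the margin, using integer math.
latency_target() (
    avg_lat_usec=$1
    margin_pct=$2
    echo $(( avg_lat_usec + avg_lat_usec * margin_pct / 100 ))
)

latency_target 2000 10   # prints 2200
latency_target 2000 15   # prints 2300
```

The printed number is only a starting point; the surrounding text still recommends experimenting to find the value that suits the workload.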
2154 - Queue depth throttling. This is the number of outstanding IO's a group is
2158 - Artificial delay induction. There are certain types of IO that cannot be
2163 fields in io.stat increase. The delay value is the number of microseconds that are
2190 calculated by multiplying the win value in io.stat by the
2191 corresponding number of samples based on the win value.
2205 no-change
2208 promote-to-rt
2209 For requests that have a non-RT I/O priority class, change it into RT.
2213 restrict-to-be
2223 none-to-rt
2224 Deprecated. Just an alias for promote-to-rt.
2228 +----------------+---+
2229 | no-change | 0 |
2230 +----------------+---+
2231 | promote-to-rt | 1 |
2232 +----------------+---+
2233 | restrict-to-be | 2 |
2234 +----------------+---+
2236 +----------------+---+
2238 The numerical value that corresponds to each I/O priority class is as follows:
2240 +-------------------------------+---+
2242 +-------------------------------+---+
2243 | IOPRIO_CLASS_RT (real-time) | 1 |
2244 +-------------------------------+---+
2246 +-------------------------------+---+
2248 +-------------------------------+---+
2252 - If I/O priority class policy is promote-to-rt, change the request I/O
2255 - If I/O priority class policy is not promote-to-rt, translate the I/O priority
2261 ---
2280 A read-write single value file which exists on non-root
2286 A read-only single value file which exists on non-root cgroups.
2292 A read-only single value file which exists on non-root cgroups.
2294 The maximum value that the number of processes in the cgroup and its
2298 A read-only flat-keyed file which exists on non-root cgroups. Unless
2299 specified otherwise, a value change in this file generates a file
2316 through fork() or clone(). These will return -EAGAIN if the creation
2321 ------
2328 memory placement to reduce cross-node memory access and contention
2339 A read-write multiple values file which exists on non-root
2340 cpuset-enabled cgroups.
2347 The CPU numbers are comma-separated numbers or ranges.
2351 0-4,6,8-10
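As a rough illustration of the list format above (comma-separated numbers or ranges), the following sketch expands such a list into individual CPU numbers. `expand_cpu_list` is a hypothetical helper written for this example, not a kernel or cgroup utility, and it assumes well-formed input with ascending ranges.

```shell
# expand_cpu_list: hypothetical helper (an assumption for illustration).
# Expands a list such as "0-4,6,8-10" into space-separated numbers.
# The function body runs in a subshell so IFS and the temporary
# variables do not leak into the caller.
expand_cpu_list() (
    IFS=,
    set -- $1            # split the argument on commas
    unset IFS
    out=""
    for part in "$@"; do
        case $part in
            *-*) lo=${part%-*}; hi=${part#*-} ;;   # range "lo-hi"
            *)   lo=$part; hi=$part ;;             # single number
        esac
        i=$lo
        while [ "$i" -le "$hi" ]; do
            out="${out:+$out }$i"
            i=$((i + 1))
        done
    done
    printf '%s\n' "$out"
)

expand_cpu_list "0-4,6,8-10"   # prints 0 1 2 3 4 6 8 9 10
```

The same expansion applies to memory node lists, which use the identical number-or-range syntax.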
2353 An empty value indicates that the cgroup is using the same
2354 setting as the nearest cgroup ancestor with a non-empty
2357 The value of "cpuset.cpus" stays constant until the next update
2361 A read-only multiple values file which exists on all
2362 cpuset-enabled cgroups.
2375 Its value will be affected by CPU hotplug events.
2378 A read-write multiple values file which exists on non-root
2379 cpuset-enabled cgroups.
2386 The memory node numbers are comma-separated numbers or ranges.
2390 0-1,3
2392 An empty value indicates that the cgroup is using the same
2393 setting as the nearest cgroup ancestor with a non-empty
2397 The value of "cpuset.mems" stays constant until the next update
2400 Setting a non-empty value to "cpuset.mems" causes memory of
2412 A read-only multiple values file which exists on all
2413 cpuset-enabled cgroups.
2425 Its value will be affected by memory node hotplug events.
2428 A read-write multiple values file which exists on non-root
2429 cpuset-enabled cgroups.
2432 to create a new cpuset partition. Its value is not used
2444 Users can manually set it to a value that is different from
2448 isn't set, its "cpuset.cpus" value, if set, cannot be a subset
2455 not allowed (the exclusivity rule). A value that violates the
2462 A read-only multiple values file which exists on all non-root
2463 cpuset-enabled cgroups.
2471 treated to have an implicit value of "cpuset.cpus" in the
2475 A read-only and root cgroup only multiple values file.
2482 A read-write single value file which exists on non-root
2483 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2489 "member" Non-root member of a partition
2494 A cpuset partition is a collection of cpuset-enabled cgroups with
2501 There are two types of partitions - local and remote. A local
2506 "cpuset.cpus.exclusive" file will assume an implicit value that
2517 be changed. All other non-root cgroups start out as "member".
2521 determined by the value of its "cpuset.cpus.exclusive.effective".
2530 two possible states - valid or invalid. An invalid partition
2541 "member" Non-root member of a partition
2568 A valid non-root parent partition may distribute out all its CPUs
2577 value in "cpuset.cpus" or "cpuset.cpus.exclusive".
2587 A user can pre-configure certain CPUs to an isolated state
2594 -----------------
2605 on the return value the attempt will succeed or fail with -EPERM.
2610 If the program returns 0, the attempt fails with -EPERM, otherwise it
2618 ----
2627 A read-write nested-keyed file that exists for all the cgroups
2648 A read-only file that describes current resource usage.
2657 ----
2667 A read-write nested-keyed file that exists for all the cgroups
2680 A read-only file that describes maximum region capacity.
2691 A read-only file that describes current resource usage.
2700 -------
2714 The default value is "max". It exists for all cgroups except root.
2717 A read-only flat-keyed file which exists on non-root cgroups.
2730 use hugetlb pages are included. The per-node values are in bytes.
2733 ----
2745 Once a capacity is set then the resource usage can be updated using charge and
2755 A read-only flat-keyed file shown only in the root cgroup. It shows
2764 A read-only flat-keyed file shown in all cgroups. It shows
2772 A read-only flat-keyed file shown in all cgroups. It shows the
2781 A read-write flat-keyed file shown in non-root cgroups. Allowed
2796 Limits can be set higher than the capacity value in the misc.capacity
2800 A read-only flat-keyed file which exists on non-root cgroups. The
2801 following entries are defined. Unless specified otherwise, a value
2819 a process to a different cgroup does not move the charge to the destination
2823 ------
2834 Non-normative information
2835 -------------------------
2851 appropriately so the neutral - nice 0 - value is 100 instead of 1024).
2860 weight value of 200.
2867 ------
2886 The path '/batchjobs/container_id1' can be considered as system-data
2891 # ls -l /proc/self/ns/cgroup
2892 lrwxrwxrwx 1 root root 0 2014-07-15 10:37 /proc/self/ns/cgroup -> cgroup:[4026531835]
2898 # ls -l /proc/self/ns/cgroup
2899 lrwxrwxrwx 1 root root 0 2014-07-15 10:35 /proc/self/ns/cgroup -> cgroup:[4026532183]
2903 When some thread from a multi-threaded process unshares its cgroup
2915 ------------------
2926 # ~/unshare -c # unshare cgroupns in some cgroup
2934 Each process gets its namespace-specific view of "/proc/$PID/cgroup"
2965 ----------------------
2994 ---------------------------------
2997 running inside a non-init cgroup namespace::
2999 # mount -t cgroup2 none $MOUNT_POINT
3006 the view of cgroup hierarchy by namespace-private cgroupfs mount
3019 --------------------------------
3022 address_space_operations->writepage[s]() to annotate bio's using the
3039 super_block by setting SB_I_CGROUPWB in ->s_iflags. This allows for
3056 - Multiple hierarchies including named ones are not supported.
3058 - None of the v1 mount options are supported.
3060 - The "tasks" file is removed and "cgroup.procs" is not sorted.
3062 - "cgroup.clone_children" is removed.
3064 - /proc/cgroups is meaningless for v2. Use "cgroup.controllers" or
3072 --------------------
3125 ------------------
3133 Generally, in-process knowledge is available only to the process
3134 itself; thus, unlike service-level organization of processes,
3141 sub-hierarchies and control resource distributions along them. This
3142 effectively raised cgroup to the status of a syscall-like API exposed
3152 that the process would actually be operating on its own sub-hierarchy.
3156 system-management pseudo filesystem. cgroup ended up with interface
3159 individual applications through the ill-defined delegation mechanism
3169 -------------------------------------------
3180 cycles and the number of internal threads fluctuated - the ratios
3196 clearly defined. There were attempts to add ad-hoc behaviors and
3210 ----------------------
3214 was how an empty cgroup was notified - a userland helper binary was
3217 to in-kernel event delivery filtering mechanism further complicating
3239 ------------------------------
3246 global reclaim prefers is opt-in, rather than opt-out. The costs for
3256 becomes self-defeating.
3258 The memory.low boundary on the other hand is a top-down allocated
3296 new limit is met - or the task writing to memory.max is killed.
3305 groups can sabotage swapping by other means - such as referencing their
3306 anonymous memory in a tight loop - and an admin cannot assume full