==========================
Memory Resource Controller
==========================
The Memory Resource Controller has generically been referred to as the
memory controller in this document. Do not confuse the memory controller
described here with the memory controller that is used in hardware.
When we mention a cgroup (cgroupfs's directory) with the memory controller,
we call it "memory cgroup". In git logs and source code you will often see
the abbreviated form "memcg".
Benefits and Purpose of the memory controller
The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12]_ mentions some probable
uses of the memory controller. The memory controller can be used to
Memory-hungry applications can be isolated and limited to a smaller
amount of memory.
rest of the system to ensure that burning does not fail due to lack
of available memory.
e. There are several other use cases; find one or use the controller just
   for fun (to learn and hack on the VM subsystem).
Current Status: linux-2.6.34-mmotm (development version of April 2010)
- accounting of anonymous pages, file caches, and swap caches, and limiting them.
- pages are linked to a per-memcg LRU exclusively; there is no global LRU.
- optionally, memory+swap usage can be accounted and limited.
- hierarchical accounting
- soft limit
- moving (recharging) charges when a task moves is selectable.
- usage threshold notifier
- memory pressure notifier
- oom-killer disable knob and oom-notifier
- Root cgroup has no limit controls.
<cgroup-v1-memory-kernel-extension>`)
The memory controller has a long history. A request for comments for the memory
controller was posted by Balbir Singh [1]_. At the time the RFC was posted,
there were several implementations for memory control. The first RSS controller
was posted by Balbir Singh [2]_; Pavel Emelianov later posted several versions
of the RSS controller. At OLS, at the resource management BoF, everyone
suggested that we handle both page cache and RSS together. Another request was
raised to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11]_.
The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.
2.1. Design
-----------
processes associated with the controller. Each cgroup has a memory controller
2.2. Accounting
---------------
.. code-block::

    +--------------------+
    |  mem_cgroup        |
    |  (page_counter)    |
    +--------------------+
     /            ^      \
    /             |       \
    +---------------+  |  +---------------+
    | mm_struct     |  |->| mm_struct     |
    |               |  |  |               |
    +---------------+  |  +---------------+
                       |
                      + --------------+
                                      |
    +---------------+          +------+--------+
    | page          +---------->  page_cgroup  |
    |               |          |               |
    +---------------+          +---------------+
Figure 1 shows the important aspects of the controller
If everything goes well, a page meta-data-structure called page_cgroup is
updated. page_cgroup has its own LRU on cgroup.

(*) page_cgroup structure is allocated at boot/memory-hotplug time.
2.2.1 Accounting details
------------------------
All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
for earlier. A file page will be accounted for as Page Cache when it's
inserted into the inode (xarray).
unmapped (by kswapd), they may exist as SwapCache in the system until they
are really freed. A swapped-in page is accounted for once it is added to the
swapcache.
Note: The kernel does swapin-readahead and reads multiple swaps at once.
Note: we only account pages on the LRU because our purpose is to control the
amount of used pages; pages not on the LRU tend to be out of control from the
VM's point of view.
2.3 Shared Page Accounting
--------------------------
the cgroup that brought it in -- this will happen on memory pressure).
2.4 Swap Extension (CONFIG_MEMCG_SWAP)
--------------------------------------
- memory.memsw.usage_in_bytes.
- memory.memsw.limit_in_bytes.
Example: Assume a system with 4G of swap. A task which allocates 6G of memory
(by mistake) under a 2G memory limit will use up all of the swap. In this
case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.

By using the memsw limit, you can avoid system OOM which can be caused by swap
shortage.
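To make the arithmetic concrete, here is a tiny illustrative model (plain
Python with an invented helper name, not kernel code): because RAM and swap
charge the same memsw counter, the swap a cgroup can consume is bounded by the
gap between the two limits.

```python
GiB = 1024 ** 3

def max_swap_usable(mem_limit, memsw_limit):
    """Hypothetical helper: with memory+swap (memsw) accounting, swap used
    by a cgroup can never exceed memsw_limit - mem_limit, since RAM usage
    alone may reach mem_limit and both charge the same memsw counter."""
    return memsw_limit - mem_limit

# The example above: a 2G memory limit and a 3G memsw limit mean even a
# runaway 6G allocation can push at most 1G into swap.
print(max_swap_usable(2 * GiB, 3 * GiB) // GiB)  # -> 1
```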
The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do swap-out
in this cgroup. Then, swap-out will not be done by the cgroup routine and file
caches are dropped instead. But as mentioned above, the global LRU can swap out
memory from it for the sanity of the system's memory management state. You
can't forbid it by cgroup.
2.5 Reclaim
-----------
cgroup. (See :ref:`10. OOM Control <cgroup-v1-memory-oom-control>` below.)

pages that are selected for reclaiming come from the per-cgroup LRU

When panic_on_oom is set to "2", the whole system will panic.

(See :ref:`oom_control <cgroup-v1-memory-oom-control>` section)
2.6 Locking
-----------
  folio_lock
    mm->page_table_lock or split pte_lock
      folio_memcg_lock (memcg->move_lock)
        mapping->i_pages lock
          lruvec->lru_lock.

Per-node-per-memcgroup LRU (cgroup's private LRU) is guarded by
lruvec->lru_lock; the folio LRU flag is cleared before
isolating a page from its LRU under lruvec->lru_lock.
.. _cgroup-v1-memory-kernel-extension:
2.7 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------
With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is fundamentally
different from user memory, since it can't be swapped out, which makes it
possible to DoS the system by consuming too much of this precious resource.

it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
2.7.1 Current Kernel Memory resources accounted
-----------------------------------------------
of each kmem_cache is created every time the cache is touched for the first
time from inside the memcg. The creation is done lazily, so some objects can
still be skipped while the cache is being created. All objects in a slab page
should belong to the same memcg. This only fails to hold when a task is
migrated to a different memcg during the page allocation by the cache.
thresholds. The Memory Controller allows them to be controlled individually
2.7.2 Common use cases
----------------------
deployments where the total amount of memory per-cgroup is overcommitted.

box can still run out of non-reclaimable memory.
<cgroups-why-needed>` for the background information)::
    # mount -t tmpfs none /sys/fs/cgroup
    # mkdir /sys/fs/cgroup/memory
    # mount -t cgroup none /sys/fs/cgroup/memory -o memory
We can write "-1" to reset ``*.limit_in_bytes`` (unlimited).
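The kernel parses size suffixes (k, K, m, M, g, G) itself when a value is
written to these files; the sketch below is a hypothetical helper (not kernel
code) that just illustrates the byte arithmetic behind values such as "4M" and
the special "-1".

```python
# Hypothetical helper illustrating how limit strings map to byte counts;
# the kernel accepts the suffixed strings directly when written to
# memory.limit_in_bytes and friends.
_SUFFIX = {"k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

def limit_to_bytes(value: str) -> int:
    value = value.strip()
    if value == "-1":                      # reset: unlimited
        return -1
    if value[-1].lower() in _SUFFIX:
        return int(value[:-1]) * _SUFFIX[value[-1].lower()]
    return int(value)

print(limit_to_bytes("4M"))   # -> 4194304
print(limit_to_bytes("-1"))   # -> -1
```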
availability of memory on the system. The user is required to re-read
Performance testing is also important. To see the memory controller's pure
overhead,

Page-fault scalability is also important. When measuring a parallel
page-fault test, a multi-process test may be better than a multi-thread
test because it has noise of shared objects/status.

Trying your usual tests under the memory controller is always helpful.
.. _cgroup-v1-memory-test-troubleshoot:
4.1 Troubleshooting
-------------------
some of the pages cached in the cgroup (page cache pages).

<cgroup-v1-memory-oom-control>` (below) and seeing what happens will be
.. _cgroup-v1-memory-test-task-migration:
4.2 Task migration
------------------
See :ref:`8. "Move charges at task migration" <cgroup-v1-memory-move-charges>`
4.3 Removing a cgroup
---------------------
<cgroup-v1-memory-test-troubleshoot>` and :ref:`4.2
<cgroup-v1-memory-test-task-migration>`, a cgroup might have some charge
5.1 force_empty
---------------
charged file caches. Some out-of-use page caches may remain charged until
5.2 stat file
-------------
* per-memory cgroup local status
cache           # of bytes of page cache memory.
rss             # of bytes of anonymous and swap cache memory (includes
                transparent hugepages).
                anon page (RSS) or cache page (Page Cache) to the cgroup.
writeback       # of bytes of file/anon cache that are queued for syncing to
                disk.
inactive_anon   # of bytes of anonymous and swap cache memory on inactive
                LRU list.
active_anon     # of bytes of anonymous and swap cache memory on active
                LRU list.
inactive_file   # of bytes of file-backed memory and MADV_FREE anonymous
                memory (LazyFree pages) on inactive LRU list.
active_file     # of bytes of file-backed memory on active LRU list.
Only anonymous and swap cache memory is listed as part of 'rss' stat.
cache.)
5.3 swappiness
--------------
5.4 failcnt
-----------
5.5 usage_in_bytes
------------------
If you want to know more exact memory usage, you should use the RSS+CACHE(+SWAP)
value in memory.stat (see 5.2).
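For instance, the recommended RSS+CACHE(+SWAP) figure can be computed from a
memory.stat snapshot like this (the sample values below are made up):

```python
# Parse a memory.stat-style snapshot (sample values invented) and compute
# the RSS+CACHE(+SWAP) figure recommended above; usage_in_bytes is only a
# fuzz value for efficient access, not an exact number.
sample = """\
cache 212992
rss 1048576
swap 65536
pgpgin 309
"""

stat = {}
for line in sample.splitlines():
    key, value = line.split()
    stat[key] = int(value)

exact = stat["rss"] + stat["cache"] + stat.get("swap", 0)
print(exact)  # -> 1327104
```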
5.6 numa_stat
-------------
This is similar to numa_maps but operates on a per-memcg basis. This is

per-node page counts including "hierarchical_<counter>" which sums up all
The memory controller supports a deep hierarchy and hierarchical accounting.
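As a rough illustrative model (plain Python with invented names, not the
kernel's page_counter code), hierarchical accounting means a charge made in a
child is also charged to every ancestor, and must fit under each ancestor's
limit:

```python
class Memcg:
    """Toy model of hierarchical charging (hypothetical, illustrative only)."""
    def __init__(self, name, limit=float("inf"), parent=None):
        self.name, self.limit, self.parent = name, limit, parent
        self.usage = 0

    def try_charge(self, nr_bytes):
        # A charge must fit in this group AND every ancestor up to the root.
        node = self
        while node:
            if node.usage + nr_bytes > node.limit:
                return False          # over limit somewhere up the tree
            node = node.parent
        node = self
        while node:                   # commit the charge hierarchically
            node.usage += nr_bytes
            node = node.parent
        return True

root = Memcg("root")                  # root cgroup has no limit controls
parent = Memcg("parent", limit=300, parent=root)
child = Memcg("child", limit=200, parent=parent)

print(child.try_charge(150))  # True: fits child (200) and parent (300)
print(child.try_charge(100))  # False: child would reach 250 > 200
print(parent.usage)           # 150, charged hierarchically
```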
6.1 Hierarchical accounting and reclaim
---------------------------------------
When the system detects memory contention or low memory, control groups

Please note that soft limits are a best-effort feature; they come with
7.1 Interface
-------------
.. _cgroup-v1-memory-move-charges:
to it will always return -EINVAL.
- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>" to
  cgroup.event_control.
It's applicable to both root and non-root cgroups.
.. _cgroup-v1-memory-oom-control:
- create an eventfd using eventfd(2)
- open the memory.oom_control file
- write string like "<event_fd> <fd of memory.oom_control>" to
  cgroup.event_control
You can disable the OOM-killer by writing "1" to the memory.oom_control file, as:
If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.
- oom_kill_disable 0 or 1
  (if 1, the oom-killer is disabled)
- under_oom 0 or 1
  (if 1, the memory cgroup is under OOM, tasks may be stopped.)
- oom_kill integer counter
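Reading memory.oom_control therefore yields three "name value" lines; a
minimal parse might look like this (the sample text and its values are
invented):

```python
# Parse the three fields of a memory.oom_control read; the sample mirrors
# the field list above with made-up values.
sample = """\
oom_kill_disable 1
under_oom 0
oom_kill 3
"""

oom = {k: int(v) for k, v in (line.split() for line in sample.splitlines())}
print(oom["oom_kill_disable"], oom["under_oom"], oom["oom_kill"])  # -> 1 0 3
```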
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining cache level. Upon notification, the program (typically
an "Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shut down unimportant services).
The "medium" level means that the system is experiencing medium memory
pressure; the system might be making swap, paging out active file caches,
etc. Upon this event applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.
The "critical" level means that the system is actively thrashing, it is
about to run out of memory (OOM) or even the in-kernel OOM killer is on its
way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
events are not pass-through. For example, you have three cgroups: A->B->C. Now

excessive "broadcasting" of messages, which disturbs the system and which is
- "default": this is the default behavior specified above. This mode is the

- "hierarchy": events always propagate up to the root, similar to the default

- "local": events are pass-through, i.e. they only receive notifications when
specified by a comma-delimited string, i.e. "low,hierarchy" specifies
hierarchical, pass-through notification for all ancestor memcgs. Notification
that is the default, non-pass-through behavior does not specify a mode.
"medium,local" specifies pass-through notification for the medium level.
- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.
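A small sketch of building that registration string, with the levels and
optional modes described above (the helper name and its validation are my own,
not part of the kernel interface):

```python
# Hypothetical helper that assembles the "<event_fd> <fd> <level[,mode]>"
# string for memory.pressure_level from the documented levels and modes.
LEVELS = {"low", "medium", "critical"}
MODES = {"default", "hierarchy", "local"}

def pressure_spec(event_fd, level_fd, level, mode=None):
    if level not in LEVELS:
        raise ValueError(f"unknown level: {level}")
    if mode is not None and mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    level_spec = level if mode is None else f"{level},{mode}"
    return f"{event_fd} {level_fd} {level_spec}"

print(pressure_spec(3, 4, "low", "hierarchy"))  # -> 3 4 low,hierarchy
print(pressure_spec(3, 4, "medium", "local"))   # -> 3 4 medium,local
```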
(Expect a bunch of notifications, and eventually, the oom-killer will
1. Make the per-cgroup scanner reclaim not-shared pages first
2. Teach the controller to account for shared pages
Overall, the memory controller has been a stable controller and has been
.. [1] Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
.. [2] Singh, Balbir. Memory Controller (RSS Control),
.. [4] Emelianov, Pavel. RSS controller based on process cgroups (v2)
.. [5] Emelianov, Pavel. RSS controller based on process cgroups (v3)

8. Singh, Balbir. RSS controller v2 test results (lmbench),
9. Singh, Balbir. RSS controller v2 AIM9 results
10. Singh, Balbir. Memory controller v6 test results,
    https://lore.kernel.org/r/20070819094658.654.84837.sendpatchset@balbir-laptop

.. [11] Singh, Balbir. Memory controller introduction (v6),
   https://lore.kernel.org/r/20070817084228.26003.12568.sendpatchset@balbir-laptop