==========================
Memory Resource Controller
==========================
The Memory Resource Controller has generically been referred to as the
memory controller in this document. Do not confuse the memory controller
used here with the hardware memory controller.
When we mention a cgroup (cgroupfs's directory) with the memory
controller, we call it a "memory cgroup". In git logs and source code you
will see that patch titles and function names tend to use "memcg".
Benefits and Purpose of the memory controller
The memory controller isolates the memory behaviour of a group of tasks
from the rest of the system. The article on LWN [12] mentions some probable
uses of the memory controller. The memory controller can be used to

a. Isolate an application or a group of applications.
   Memory-hungry applications can be isolated and limited to a smaller
   amount of memory.
b. Create a cgroup with a limited amount of memory; this can be used
   as a good alternative to booting with mem=XXXX.
c. Virtualization solutions can control the amount of memory they want
   to assign to a virtual machine instance.
d. A CD/DVD burner could control the amount of memory used by the
   rest of the system to ensure that burning does not fail due to lack
   of available memory.
e. There are several other use cases; find one or use the controller just
   for testing and development.
Current Status: linux-2.6.34-mmotm (development version of April 2010)
Features:

 - accounting of anonymous pages, file caches, and swap caches, and limiting them
 - pages are linked to per-memcg LRUs exclusively; there is no global LRU
 - optionally, memory+swap usage can be accounted and limited
 - hierarchical accounting
 - soft limits
 - moving (recharging) charges when a task migrates is selectable
 - usage threshold notifier
 - memory pressure notifier
 - oom-killer disable knob and oom-notifier
 - the root cgroup has no limit controls
The memory controller has a long history. A request for comments for the
memory controller was posted by Balbir Singh [1]. At the time the RFC was
posted there were several implementations for memory control. The goal of
the RFC was to build consensus and agreement on the minimal features required
for memory control. The first RSS controller was posted by Balbir Singh [2]
in Feb 2007. Pavel Emelianov [3][4][5] has since posted three versions of the
RSS controller. At OLS, at the resource management BoF, everyone suggested
that we handle both page cache and RSS together. Another request was raised
to allow user space handling of OOM. The current memory controller is
at version 6; it combines both mapped (RSS) and unmapped Page
Cache Control [11].
The memory controller implementation has been divided into phases. These
are:

1. Memory controller
2. mlock(2) controller
3. Kernel user memory accounting and slab control
4. user mappings length controller

The memory controller is the first controller developed.
2.1. Design
-----------

The core of the design is a counter called the page_counter. The
page_counter tracks the current memory usage and limit of the group of
processes associated with the controller. Each cgroup has a memory
controller specific data structure (mem_cgroup) associated with it.
2.2. Structure
--------------

::

            +--------------------+
            |    mem_cgroup      |
            |  (page_counter)    |
            +--------------------+
              /       ^        \
    +---------------+ |  +---------------+
    | mm_struct     | |..| mm_struct     |
    +---------------+ |  +---------------+
                      |
                      +---------------+
                                      |
    +---------------+         +-------+-------+
    |  page         +-------->|  page_cgroup  |
    +---------------+         +---------------+
Figure 1 shows the important aspects of the controller

1. Accounting happens per cgroup
2. Each mm_struct knows about which cgroup it belongs to
3. Each page has a pointer to the page_cgroup, which in turn knows the
   cgroup it belongs to
If everything goes well, a page metadata structure called page_cgroup is
updated. page_cgroup has its own LRU on the cgroup.

(*) The page_cgroup structure is allocated at boot/memory-hotplug time.
2.3 Accounting details
----------------------

All mapped anon pages (RSS) and cache pages (Page Cache) are accounted.
Some pages which are never reclaimable and will not be on the LRU are
not accounted; we just account pages under usual VM management.

RSS pages are accounted at page fault unless they've been accounted
for earlier. A file page will be accounted for as Page Cache when it's
inserted into the inode (radix-tree). While it's mapped into the page
tables of processes, duplicate accounting is carefully avoided.

An RSS page is unaccounted when it's fully unmapped. A PageCache page is
unaccounted when it's removed from the radix-tree. Even if RSS pages are
fully unmapped (by kswapd), they may exist as SwapCache in the system
until they are really freed. Such SwapCaches are also accounted.
A swapped-in page is accounted after it is added into the swapcache.

Note: The kernel does swapin-readahead and reads multiple swaps at once.

Note: we only account pages on the LRU because our purpose is to control
the amount of used pages; pages not on the LRU tend to be out of control
from the VM's point of view.
2.4 Shared Page Accounting
--------------------------

Shared pages are accounted on the basis of the first touch approach. The
cgroup that first touches a page is accounted for the page. The principle
behind this approach is that a cgroup that aggressively uses a shared
page will eventually get charged for it (once it is uncharged from
the cgroup that brought it in -- this will happen on memory pressure).
2.5 Swap Extension (CONFIG_MEMCG_SWAP)
--------------------------------------

Swap usage is always recorded for each swap cache. The Swap Extension
allows you to record and limit it. When this extension is enabled, two
new files are added:

- memory.memsw.usage_in_bytes
- memory.memsw.limit_in_bytes

memsw means memory+swap. Usage of memory+swap is limited by
memsw.limit_in_bytes.

Example: Assume a system with 4G of swap. A task which allocates 6G of
memory (by mistake) under a 2G memory limitation will use all swap. In
this case, setting memsw.limit_in_bytes=3G will prevent bad use of swap.
By using the memsw limit, you can avoid system OOM which can be caused by
swap shortage.
The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
moving the account from memory to swap; there is no change in the usage
of memory+swap.

When a cgroup hits memory.memsw.limit_in_bytes, it's useless to do
swap-out in this cgroup. Then, swap-out will not be done by the cgroup
routine and file caches are dropped. But as mentioned above, the global
LRU can swap out memory from it for the sanity of the system's memory
management state. You can't forbid it by cgroup.
2.6 Reclaim
-----------

When a cgroup goes over its limit, we first try to reclaim memory from
the cgroup so as to make space for the new pages that the cgroup has
touched. The pages that are selected for reclaiming come from the
per-cgroup LRU list.

Note: When panic_on_oom is set to "2", the whole system will panic.
2.7 Locking
-----------

Lock order is as follows::

  mm->page_table_lock
    pgdat->lru_lock

The per-node-per-cgroup LRU (cgroup's private LRU) is guarded only by
pgdat->lru_lock; it has no lock of its own.
2.8 Kernel Memory Extension (CONFIG_MEMCG_KMEM)
-----------------------------------------------

With the Kernel memory extension, the Memory Controller is able to limit
the amount of kernel memory used by the system. Kernel memory is
fundamentally different from user memory, since it can't be swapped out,
which makes it possible to DoS the system by consuming too much of this
precious resource.

Kernel memory accounting is enabled for all memory cgroups by default, but
it can be disabled system-wide by passing cgroup.memory=nokmem to the kernel
at boot time.
2.8.1 Current Kernel Memory resources accounted
-----------------------------------------------

* slab pages: pages allocated by the SLAB or SLUB allocator. A copy
  of each kmem_cache is created every time the cache is touched for the
  first time from inside the memcg. The creation is done lazily, so some
  objects can still be skipped while the cache is being created. All
  objects in a slab page should belong to the same memcg. This only fails
  to hold when a task is migrated to a different memcg during the page
  allocation by the cache.

* sockets memory pressure: some socket protocols have memory pressure
  thresholds. The Memory Controller allows them to be controlled
  individually per cgroup, instead of globally.
2.8.2 Common use cases
----------------------

U != 0, K < U:
    Kernel memory is a subset of the user memory. This setup is useful in
    deployments where the total amount of memory per-cgroup is
    overcommitted. Overcommitting kernel memory limits is definitely not
    recommended, since the box can still run out of non-reclaimable
    memory.
3.0. Configuration
------------------

a. Enable CONFIG_CGROUPS
b. Enable CONFIG_MEMCG

3.1. Prepare the cgroups (see cgroups.txt, Why are cgroups needed?)
-------------------------------------------------------------------

::

	# mount -t tmpfs none /sys/fs/cgroup
	# mkdir /sys/fs/cgroup/memory
	# mount -t cgroup none /sys/fs/cgroup/memory -o memory
We can write "-1" to ``*.limit_in_bytes`` to reset it (unlimited).
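The limit files accept plain byte values, and the kernel also accepts a
k/K, m/M or g/G suffix when a limit is written. As an illustrative sketch
(the helper names and the cgroup path are our assumptions, not part of
the interface), a tool could convert a human-readable value before
writing it:

```c
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

/* Illustrative helper: convert a "4M"-style string into bytes, the way
 * the kernel interprets a K/M/G suffix written to *.limit_in_bytes.
 * "-1" passes through unchanged and means "unlimited". */
long long limit_to_bytes(const char *s)
{
    char *end;
    long long v = strtoll(s, &end, 10);

    switch (toupper((unsigned char)*end)) {
    case 'K': v <<= 10; break;
    case 'M': v <<= 20; break;
    case 'G': v <<= 30; break;
    }
    return v;
}

/* Write a limit to a memcg control file. "path" is an example such as
 * "/sys/fs/cgroup/memory/0/memory.limit_in_bytes"; the write only
 * succeeds on a real cgroup directory, with sufficient privileges. */
int write_limit(const char *path, const char *human)
{
    FILE *f = fopen(path, "w");

    if (!f)
        return -1;
    fprintf(f, "%lld\n", limit_to_bytes(human));
    return fclose(f);
}
```

Remember to re-read the file after a successful write: the kernel may
round the value it actually commits.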
The committed value may be rounded depending on the availability of
memory on the system. The user is required to re-read this file after a
write to guarantee the value committed by the kernel.
Performance testing is also important. To see the pure overhead of the
memory controller, testing on tmpfs is useful.

Page-fault scalability also matters. When measuring parallel page-fault
performance, a multi-process test may be better than a multi-thread test
because the latter adds noise from shared objects/status.

Running the usual tests under the memory controller is always helpful.
4.1 Troubleshooting
-------------------

A sync followed by ``echo 1 > /proc/sys/vm/drop_caches`` will help get
rid of some of the pages cached in the cgroup (page cache pages).
4.2 Task migration
------------------
4.3 Removing a cgroup
---------------------
5.1 force_empty
---------------

The typical use case for this interface is before calling rmdir(). Though
rmdir() offlines the memcg, the memcg may still stay there due to charged
file caches. Some out-of-use page caches may keep charged until memory
pressure happens. If you want to avoid that, force_empty will be useful.
5.2 stat file
-------------

per-memory cgroup local status:

cache		# of bytes of page cache memory.
rss		# of bytes of anonymous and swap cache memory (includes
		transparent hugepages).
pgpgin		# of charging events to the memory cgroup. The charging
		event happens each time a page is accounted as either a mapped
		anon page (RSS) or a cache page (Page Cache) to the cgroup.
writeback	# of bytes of file/anon cache that are queued for syncing to
		disk.
inactive_anon	# of bytes of anonymous and swap cache memory on inactive
		LRU list.
active_anon	# of bytes of anonymous and swap cache memory on active
		LRU list.
inactive_file	# of bytes of file-backed memory on inactive LRU list.
active_file	# of bytes of file-backed memory on active LRU list.
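Since memory.stat is a flat list of "name value" lines, a monitoring tool
can pick out individual counters with a tiny parser. This is a sketch
(the function name is ours) that looks up one counter in a buffer already
read from the file:

```c
#include <stdio.h>
#include <string.h>

/* Look up one counter in a memory.stat-style buffer of "name value"
 * lines; returns -1 if the name is absent. */
long long stat_lookup(const char *buf, const char *name)
{
    size_t n = strlen(name);
    const char *p = buf;

    while (p && *p) {
        /* A match must be at the start of a line and be the whole name. */
        if (!strncmp(p, name, n) && p[n] == ' ') {
            long long v;

            if (sscanf(p + n + 1, "%lld", &v) == 1)
                return v;
        }
        p = strchr(p, '\n');
        if (p)
            p++;
    }
    return -1;
}
```

For example, `stat_lookup(buf, "rss")` on a buffer read from
memory.stat returns the rss counter in bytes.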
Only anonymous and swap cache memory is listed as part of the 'rss' stat.
This should not be confused with the true 'resident set size' or the
amount of physical memory used by the cgroup.

(Note: file and shmem may be shared among other cgroups. In that case,
mapped_file is accounted only when the memory cgroup is owner of the page
cache.)
5.3 swappiness
--------------
5.4 failcnt
-----------
5.5 usage_in_bytes
------------------

For efficiency, as with other kernel components, the memory cgroup uses
some optimization to avoid unnecessary cacheline false sharing.
usage_in_bytes is affected by this method and doesn't show the 'exact'
value of memory (and swap) usage; it's a fuzz value for efficient access.
If you want to know the more exact memory usage, you should use the
RSS+CACHE(+SWAP) value in memory.stat (see 5.2).
5.6 numa_stat
-------------

This is similar to numa_maps but operates on a per-memcg basis. It is
useful for providing visibility into the NUMA locality of a memcg, since
the pages are allowed to be allocated from any physical node. Each
memcg's numa_stat file includes per-node page counts, including
"hierarchical_<counter>" entries which sum up all hierarchical children's
values in addition to the memcg's own value.
The memory controller supports a deep hierarchy and hierarchical accounting.
6.1 Enabling hierarchical accounting and reclaim
------------------------------------------------

Note: When panic_on_oom is set to "2", the whole system will panic in
case of an OOM event in any cgroup.
When the system detects memory contention or low memory, control groups
are pushed back to their soft limits. If the soft limit of each control
group is very high, they are pushed back as much as possible to make sure
that one control group does not starve the others of memory.

Please note that soft limits are a best-effort feature; they come with
no guarantees, but they do their best to make sure that when memory is
heavily contended for, memory is allocated based on the soft limit
hints/setup.

7.1 Interface
-------------
8.1 Interface
-------------

Charges are moved only when you move mm->owner, in other words, a leader
of a thread group.
8.2 Type of charges which can be moved
--------------------------------------

bit 0:
    A charge of an anonymous page (or swap of it) used by the target
    task.
bit 1:
    A charge of file pages (normal file, tmpfs file, and swaps of tmpfs
    file) mmapped by the target task.
8.3 TODO
--------

- All moving-charge operations are done under cgroup_mutex. It's not
  good behavior to hold the mutex too long, so we may need some trick.
To register a threshold, an application must:

- create an eventfd using eventfd(2);
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write a string like "<event_fd> <fd of memory.usage_in_bytes> <threshold>"
  to cgroup.event_control.

The application will be notified through the eventfd when memory usage
crosses the threshold in either direction.

This is applicable to both root and non-root cgroups.
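The registration steps above can be sketched in C. The cgroup path and
the helper names are examples of ours; writing the control string only
succeeds inside a real memory cgroup directory:

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Build the "<event_fd> <fd> <threshold>" string that is written to
 * cgroup.event_control. */
int make_threshold_cmd(char *buf, size_t len, int efd, int ufd,
                       unsigned long long threshold)
{
    return snprintf(buf, len, "%d %d %llu", efd, ufd, threshold);
}

/* Register a usage threshold on one memcg and block until it is
 * crossed. "dir" is an example path such as "/sys/fs/cgroup/memory/0". */
int wait_for_threshold(const char *dir, unsigned long long threshold)
{
    char path[256], cmd[64];
    unsigned long long ticks;
    int efd, ufd, cfd, ret = -1;

    efd = eventfd(0, 0);                          /* step 1 */
    snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", dir);
    ufd = open(path, O_RDONLY);                   /* step 2 */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", dir);
    cfd = open(path, O_WRONLY);
    if (efd < 0 || ufd < 0 || cfd < 0)
        goto out;

    make_threshold_cmd(cmd, sizeof(cmd), efd, ufd, threshold);
    if (write(cfd, cmd, strlen(cmd)) < 0)         /* step 3 */
        goto out;

    /* read() blocks until usage crosses the threshold in either
     * direction; the eventfd counter reports how many times it fired. */
    ret = read(efd, &ticks, sizeof(ticks)) == sizeof(ticks) ? 0 : -1;
out:
    if (efd >= 0) close(efd);
    if (ufd >= 0) close(ufd);
    if (cfd >= 0) close(cfd);
    return ret;
}
```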
To register a notifier, an application must:

- create an eventfd using eventfd(2)
- open the memory.oom_control file
- write a string like "<event_fd> <fd of memory.oom_control>" to
  cgroup.event_control
You can disable the OOM-killer by writing "1" to the memory.oom_control
file, as::

	# echo 1 > memory.oom_control
If the OOM-killer is disabled, tasks under the cgroup will hang/sleep
in the memory cgroup's OOM-waitqueue when they request accountable memory.
- oom_kill_disable 0 or 1
  (if 1, the oom-killer is disabled)
- under_oom 0 or 1
  (if 1, the memory cgroup is under OOM; tasks may be stopped)
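A watchdog that reads memory.oom_control can extract the two fields above
with a small parser. This sketch (the function name is ours) ignores any
additional fields that newer kernels may append:

```c
#include <stdio.h>
#include <string.h>

/* Parse the oom_kill_disable and under_oom fields out of a buffer read
 * from memory.oom_control. Returns 0 on success, -1 on parse failure. */
int parse_oom_control(const char *buf, int *kill_disable, int *under_oom)
{
    const char *p;

    p = strstr(buf, "oom_kill_disable ");
    if (!p || sscanf(p, "oom_kill_disable %d", kill_disable) != 1)
        return -1;
    p = strstr(buf, "under_oom ");
    if (!p || sscanf(p, "under_oom %d", under_oom) != 1)
        return -1;
    return 0;
}
```

A monitoring loop would typically re-read the file after each OOM
notification and check whether under_oom is still set.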
The "low" level means that the system is reclaiming memory for new
allocations. Monitoring this reclaiming activity might be useful for
maintaining the cache level. Upon notification, the program (typically
"Activity Manager") might analyze vmstat and act in advance (i.e.
prematurely shut down unimportant services).

The "medium" level means that the system is experiencing medium memory
pressure; the system might be making swap, paging out active file caches,
etc. Upon this event, applications may decide to further analyze
vmstat/zoneinfo/memcg or internal memory usage statistics and free any
resources that can be easily reconstructed or re-read from a disk.

The "critical" level means that the system is actively thrashing; it is
about to run out of memory (OOM) or even the in-kernel OOM killer is on
its way to trigger. Applications should do whatever they can to help the
system. It might be too late to consult with vmstat or any other
statistics, so it's advisable to take an immediate action.
By default, events are propagated upward until the event is handled, i.e.
events are not pass-through. For example, you have three cgroups:
A->B->C. Now you set up event listeners on cgroups A, B and C, and
suppose group C experiences some pressure. In this situation, only group
C will receive the notification, i.e. groups A and B will not receive it.
This is done to avoid excessive "broadcasting" of messages, which
disturbs the system and which is especially bad if we are low on memory
or thrashing.
- "default": this is the default behavior specified above. This mode is
  the same as omitting the optional mode parameter.

- "hierarchy": events always propagate up to the root, similar to the
  default behavior, but propagation continues regardless of whether there
  are event listeners at each level. In the above example, groups A, B,
  and C will receive notification of memory pressure.

- "local": events are pass-through, i.e. listeners only receive
  notifications when memory pressure is experienced in the memcg for
  which the notification is registered. In the above example, group C
  will receive notification if registered for "local" notification and
  the group experiences memory pressure.

The level and event notification mode ("hierarchy" or "local") are
specified by a comma-delimited string, i.e. "low,hierarchy" specifies
hierarchical, pass-through notification for all ancestor memcgs.
Notification that is the default, non pass-through behavior, does not
specify a mode. "medium,local" specifies pass-through notification for
the medium level.
To register a notifier, an application must:

- create an eventfd using eventfd(2);
- open memory.pressure_level;
- write a string as "<event_fd> <fd of memory.pressure_level> <level[,mode]>"
  to cgroup.event_control.
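The "<event_fd> <fd> <level[,mode]>" string can be built and validated
before it is written to cgroup.event_control. This is a sketch with
helper names of our own; the accepted level and mode strings are the ones
documented above:

```c
#include <stdio.h>
#include <string.h>

/* Pressure levels accepted by memory.pressure_level registration. */
static int valid_level(const char *s)
{
    return !strcmp(s, "low") || !strcmp(s, "medium") ||
           !strcmp(s, "critical");
}

/* Optional notification modes. */
static int valid_mode(const char *s)
{
    return !strcmp(s, "default") || !strcmp(s, "hierarchy") ||
           !strcmp(s, "local");
}

/* Build the registration string; pass mode as NULL for the default
 * behavior. Returns the string length, or -1 on invalid input. */
int make_pressure_cmd(char *buf, size_t len, int efd, int pfd,
                      const char *level, const char *mode)
{
    if (!valid_level(level) || (mode && !valid_mode(mode)))
        return -1;
    if (mode)
        return snprintf(buf, len, "%d %d %s,%s", efd, pfd, level, mode);
    return snprintf(buf, len, "%d %d %s", efd, pfd, level);
}
```

For example, `make_pressure_cmd(buf, sizeof(buf), efd, pfd, "low",
"hierarchy")` produces the "low,hierarchy" registration described above.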
(Expect a bunch of notifications, and eventually, the oom-killer will
trigger.)
1. Make the per-cgroup scanner reclaim not-shared pages first
2. Teach the controller to account for shared pages
Overall, the memory controller has been a stable controller and has been
commented on and discussed quite extensively in the community.
1. Singh, Balbir. RFC: Memory Controller, http://lwn.net/Articles/206697/
2. Singh, Balbir. Memory Controller (RSS Control),
4. Emelianov, Pavel. RSS controller based on process cgroups (v2)
5. Emelianov, Pavel. RSS controller based on process cgroups (v3)
8. Singh, Balbir. RSS controller v2 test results (lmbench),
9. Singh, Balbir. RSS controller v2 AIM9 results
10. Singh, Balbir. Memory controller v6 test results,
11. Singh, Balbir. Memory controller introduction (v6),