.. _mm_concepts:

=================
Concepts overview
=================

The memory management in Linux is a complex system that evolved over the
years and included more and more functionality to support a variety of
systems from MMU-less microcontrollers to supercomputers. The memory
management for systems without an MMU is called ``nommu`` and it
definitely deserves a dedicated document, which will hopefully be
written eventually. Yet, although some of the concepts are the same,
here we assume that an MMU is available and a CPU can translate a virtual
address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different views
of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex and
to avoid this complexity a concept of virtual memory was developed.

The virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in the
physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or
writes) from (or to) the system memory, it translates the `virtual`
address encoded in that instruction to a `physical` address that the
memory controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at the kernel build time by setting an
appropriate kernel configuration option.

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy with the next bits of the virtual address as the index to
that level page table. The lowest bits in the virtual address define
the offset inside the actual page.
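
As a rough illustration of the walk described above, the following sketch
splits a virtual address into one index per page table level plus the
offset inside the page. The layout used here (four levels, 9 index bits per
level, 4 KiB pages) is only an assumption that resembles a common x86-64
configuration; other architectures and configurations differ, and the
function name ``split_virtual_address`` is made up for this example.

.. code-block:: python

    PAGE_SHIFT = 12       # assumed: 4 KiB pages, so 12 offset bits
    BITS_PER_LEVEL = 9    # assumed: 512 entries in every table
    LEVELS = 4            # assumed: four-level hierarchy


    def split_virtual_address(vaddr):
        """Return ([top level index, ..., lowest level index], page offset)."""
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        indices = []
        # The highest bits select the entry in the top level table.
        for level in range(LEVELS - 1, -1, -1):
            shift = PAGE_SHIFT + level * BITS_PER_LEVEL
            indices.append((vaddr >> shift) & ((1 << BITS_PER_LEVEL) - 1))
        return indices, offset


    # An address built so that every field is easy to recognise.
    vaddr = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
    print(split_virtual_address(vaddr))   # ([1, 2, 3, 4], 5)

The hardware performs the same bit slicing, but uses every index to fetch
the physical address of the next table rather than returning the indices.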

Huge Pages
==========

The address translation requires several memory accesses and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called the Translation Lookaside Buffer (or
TLB). Usually the TLB is a pretty scarce resource and applications with
a large memory working set will experience a performance hit because of
TLB misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit-rate and thus improves overall system performance.
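
The effect on TLB pressure can be estimated with simple arithmetic: the
amount of memory covered by a full TLB is the number of entries multiplied
by the page size. The sketch below does this for the page sizes mentioned
above. The entry count is a hypothetical value chosen for illustration, and
real CPUs often have separate TLBs with different capacities for each page
size, so the numbers only show the order of magnitude.

.. code-block:: python

    KIB = 1024
    MIB = 1024 * KIB
    GIB = 1024 * MIB


    def tlb_reach(entries, page_size):
        """Bytes of memory that can be translated without a TLB miss."""
        return entries * page_size


    ENTRIES = 1024   # hypothetical TLB size, not taken from any real CPU
    for name, size in (("4K", 4 * KIB), ("2M", 2 * MIB), ("1G", GIB)):
        print(name, tlb_reach(ENTRIES, size) // MIB, "MiB")

With these assumed values a TLB full of 4K entries covers 4 MiB, while the
same number of 2M entries covers 2 GiB.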

There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is the `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
memory and is mapped using huge pages. The hugetlbfs is described at
:ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike the hugetlbfs that
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.

Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL will
contain normally addressed pages.

The actual layout of the memory zones is hardware dependent as not all
architectures define all zones, and requirements for DMA are different
for different platforms.
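
To make the grouping more concrete, the sketch below assigns a physical
address to one of the three zones named above. The boundaries (16 MiB for
ZONE_DMA and 896 MiB for ZONE_NORMAL) are assumptions modelled on the
classic 32-bit x86 layout; as noted above, other platforms use different
zones and limits, and the helper ``zone_of`` exists only in this example.

.. code-block:: python

    MIB = 1 << 20

    # Assumed upper limits, modelled on classic 32-bit x86.
    ZONE_LIMITS = (
        ("ZONE_DMA", 16 * MIB),
        ("ZONE_NORMAL", 896 * MIB),
    )


    def zone_of(phys_addr):
        """Return the name of the zone that contains phys_addr."""
        for name, limit in ZONE_LIMITS:
            if phys_addr < limit:
                return name
        # Everything above the last limit is not permanently mapped.
        return "ZONE_HIGHMEM"


    for addr in (0x1000, 64 * MIB, 2048 * MIB):
        print(hex(addr), zone_of(addr))

The kernel derives the real boundaries at boot from the architecture code
and the memory map reported by the firmware.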

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node Linux
constructs an independent memory management subsystem. A node has its
own set of zones, lists of free and used pages and various statistics
counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.
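
The notion of "distance" can be pictured with a toy model in which an
allocation prefers the node closest to the requesting CPU and falls back to
a more distant node only when the closer one has no free pages. The
distance table, the two-node machine and the function ``pick_node`` are all
hypothetical; the policies actually implemented by the kernel are described
in the documents referenced above.

.. code-block:: python

    # Hypothetical two-node machine: DISTANCE[a][b] is the relative cost
    # for a CPU on node a to access memory on node b (10 means local).
    DISTANCE = [[10, 20],
                [20, 10]]


    def pick_node(cpu_node, free_pages):
        """Return the nearest node that still has free pages, or None."""
        candidates = [n for n, free in enumerate(free_pages) if free > 0]
        if not candidates:
            return None
        return min(candidates, key=lambda n: DISTANCE[cpu_node][n])


    print(pick_node(0, [128, 128]))   # local node has memory
    print(pick_node(0, [0, 128]))     # local node is exhausted

In this model the first call selects node 0 and the second falls back to
node 1.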

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.
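
The behaviour described above can be sketched with a small write-back
cache. This is a toy model, not the kernel's implementation: a dictionary
stands in for the storage device and the class ``ToyPageCache`` with all of
its methods is invented for the example.

.. code-block:: python

    class ToyPageCache:
        """Caches pages in memory and writes dirty ones back on eviction."""

        def __init__(self, storage):
            self.storage = storage   # dict: page number -> bytes ("disk")
            self.pages = {}          # pages currently held in memory
            self.dirty = set()       # pages modified since last write-back
            self.disk_reads = 0

        def read(self, pageno):
            if pageno not in self.pages:      # miss: go to the device
                self.disk_reads += 1
                self.pages[pageno] = self.storage[pageno]
            return self.pages[pageno]

        def write(self, pageno, data):
            self.pages[pageno] = data         # only memory is updated
            self.dirty.add(pageno)

        def evict(self, pageno):
            if pageno in self.dirty:          # synchronize before reuse
                self.storage[pageno] = self.pages[pageno]
                self.dirty.discard(pageno)
            self.pages.pop(pageno, None)


    disk = {0: b"old"}
    cache = ToyPageCache(disk)
    cache.read(0)
    cache.read(0)                 # served from memory, no second disk read
    cache.write(0, b"new")        # disk still holds b"old" at this point
    cache.evict(0)                # dirty page is written back first
    print(cache.disk_reads, disk[0])

After the run the device was read once and holds the updated data.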

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for a program's stack and heap or by explicit calls to the mmap(2) system
call. Usually, the anonymous mappings only define virtual memory areas
that the program is allowed to access. The read accesses will result
in creation of a page table entry that references a special physical
page filled with zeroes. When the program performs a write, a regular
physical page will be allocated to hold the written data. The page
will be marked dirty and if the kernel decides to repurpose it,
the dirty page will be swapped out.
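
A short user space sketch of an explicit anonymous mapping is shown below.
It uses Python's standard ``mmap`` module, where a file descriptor of -1
requests a mapping that is not backed by a file. The script can only
observe the visible behaviour (untouched memory reads back as zeroes and
written data is retained); the page table updates described above happen
inside the kernel and are not visible from here.

.. code-block:: python

    import mmap

    PAGE = mmap.PAGESIZE

    # fileno -1 requests an anonymous mapping of four pages.
    with mmap.mmap(-1, 4 * PAGE) as region:
        first_read = region[:16]            # never written: reads as zeroes
        region[PAGE:PAGE + 5] = b"hello"    # first write to the second page
        second_page = region[PAGE:PAGE + 5]

    print(first_read == bytes(16), second_page)

The mapping and its contents disappear as soon as the ``with`` block ends,
since nothing on disk backs them.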

Reclaim
=======

Throughout the system lifetime, a physical page can be used for storing
different types of data. It can be kernel internal data structures,
DMA'able buffers for use by device drivers, data read from a filesystem,
memory allocated by user space processes, etc.

Depending on its usage, a page is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache the data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is free
and allocation requests will be satisfied immediately from the free
pages supply. As the load increases, the amount of the free pages goes
down and when it reaches a certain threshold (low watermark), an
allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing storage
device (remember those dirty pages?). As memory usage increases even
more and reaches another threshold - min watermark - an allocation
will trigger `direct reclaim`. In this case allocation is stalled
until enough memory pages are reclaimed to satisfy the request.
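
The two thresholds can be summarised as a small decision function. This is
a simplified model of the behaviour described above, not kernel code: the
function ``allocation_path``, its parameter names and the page counts are
made up for illustration.

.. code-block:: python

    def allocation_path(free_pages, wake_threshold, min_watermark):
        """Classify how this toy model handles an allocation request."""
        if free_pages > wake_threshold:
            return "allocate from free pages"
        if free_pages > min_watermark:
            # Background reclaim starts; the caller does not wait for it.
            return "allocate and wake kswapd"
        # The caller is stalled until enough pages have been reclaimed.
        return "direct reclaim"


    for free in (1000, 150, 40):
        print(free, allocation_path(free, wake_threshold=200, min_watermark=50))

With these assumed thresholds the three requests take the fast path, the
asynchronous path and the synchronous path respectively.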

Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it is
necessary to allocate large physically contiguous memory areas. Such
need may arise, for instance, when a device driver requires a large
buffer for DMA, or when THP allocates a huge page. Memory `compaction`
addresses the fragmentation issue. This mechanism moves occupied pages
from the lower part of a memory zone to free pages in the upper part
of the zone. When a compaction scan is finished, free pages are grouped
together at the beginning of the zone and allocations of large
physically contiguous areas become possible.

Like reclaim, the compaction may happen asynchronously in the ``kcompactd``
daemon or synchronously as a result of a memory allocation request.
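
The movement of pages during a compaction scan can be sketched with two
scanners that walk a zone from opposite ends. In this toy model a zone is a
list of booleans where ``True`` marks an occupied page frame, and every
occupied page is assumed to be movable. The real mechanism has to deal with
pages that cannot be moved and with many other details, so the function
``compact`` below only illustrates the idea.

.. code-block:: python

    def compact(zone):
        """Compact a toy zone in place; True marks an occupied page frame."""
        migrate = 0              # scans upward, looking for occupied pages
        free = len(zone) - 1     # scans downward, looking for free pages
        while True:
            while migrate < free and not zone[migrate]:
                migrate += 1
            while migrate < free and zone[free]:
                free -= 1
            if migrate >= free:  # the scanners met: the scan is finished
                return zone
            # "Move" the page from the lower part to the upper part.
            zone[migrate], zone[free] = False, True


    zone = [True, False, True, False, False, True, False, False]
    print(compact(zone))

After the scan the five free frames form one contiguous range at the
beginning of the zone and the three occupied frames sit at the end.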

OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and the
kernel will be unable to reclaim enough memory to continue to operate. In
order to save the rest of the system, it invokes the `OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the overall
system health. The selected task is killed in the hope that after it exits
enough memory will be freed to continue normal operation.
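
A deliberately simplified picture of the selection is to choose, among the
tasks that may be killed, the one whose exit would free the most memory.
The sketch below does exactly that. It is an assumption made for
illustration and not the kernel's actual heuristic, which takes more inputs
into account, for example the per-task ``oom_score_adj`` tunable. The
function ``pick_victim`` and the task list are hypothetical.

.. code-block:: python

    def pick_victim(tasks):
        """Return the name of the killable task that uses the most pages.

        ``tasks`` is a list of (name, pages_used, killable) tuples.
        """
        candidates = [task for task in tasks if task[2]]
        if not candidates:
            return None
        return max(candidates, key=lambda task: task[1])[0]


    tasks = [
        ("init", 50, False),      # protected in this model
        ("database", 900, True),
        ("shell", 10, True),
    ]
    print(pick_victim(tasks))

In this model the ``database`` task is selected because killing it frees
the most memory.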