.. _mm_concepts:

=================
Concepts overview
=================

The memory management in Linux is a complex system that evolved over
the years and included more and more functionality to support a
variety of systems from MMU-less microcontrollers to supercomputers.
The memory management for systems without an MMU is called ``nommu``
and it definitely deserves a dedicated document, which hopefully will
be eventually written. Yet, although some of the concepts are the
same, here we assume that an MMU is available and a CPU can translate
a virtual address to a physical address.

.. contents:: :local:

Virtual Memory Primer
=====================

The physical memory in a computer system is a limited resource and
even for systems that support memory hotplug there is a hard limit on
the amount of memory that can be installed. The physical memory is not
necessarily contiguous; it might be accessible as a set of distinct
address ranges. Besides, different CPU architectures, and even
different implementations of the same architecture, have different
views of how these address ranges are defined.

All this makes dealing directly with physical memory quite complex
and, to avoid this complexity, the concept of virtual memory was
developed.

Virtual memory abstracts the details of physical memory from the
application software, allows keeping only the needed information in
the physical memory (demand paging) and provides a mechanism for the
protection and controlled sharing of data between processes.

With virtual memory, each and every memory access uses a virtual
address. When the CPU decodes an instruction that reads (or writes)
from (or to) the system memory, it translates the `virtual` address
encoded in that instruction to a `physical` address that the memory
controller can understand.

The physical system memory is divided into page frames, or pages. The
size of each page is architecture specific. Some architectures allow
selection of the page size from several supported values; this
selection is performed at kernel build time by setting an appropriate
kernel configuration option.

Each physical memory page can be mapped as one or more virtual
pages. These mappings are described by page tables that allow
translation from a virtual address used by programs to the physical
memory address. The page tables are organized hierarchically.

The tables at the lowest level of the hierarchy contain physical
addresses of actual pages used by the software. The tables at higher
levels contain physical addresses of the pages belonging to the lower
levels. The pointer to the top level page table resides in a
register. When the CPU performs the address translation, it uses this
register to access the top level page table. The high bits of the
virtual address are used to index an entry in the top level page
table. That entry is then used to access the next level in the
hierarchy, with the next bits of the virtual address as the index into
that level's page table. The lowest bits in the virtual address define
the offset inside the actual page.
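
For example, with the four-level paging layout that x86-64 uses for
4 KiB pages, each level of the hierarchy is indexed by nine bits of
the virtual address and the offset inside the page occupies the lowest
twelve bits. A minimal user space sketch of how such an address is
split (the address itself is an arbitrary illustration)::

	/* Split a virtual address into x86-64 four-level paging indices:
	 * 9 bits per table level, 12 offset bits inside a 4 KiB page. */
	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		uint64_t vaddr = 0x00007f1234567abcULL;

		printf("PGD index: %llu\n", (unsigned long long)((vaddr >> 39) & 0x1ff));
		printf("PUD index: %llu\n", (unsigned long long)((vaddr >> 30) & 0x1ff));
		printf("PMD index: %llu\n", (unsigned long long)((vaddr >> 21) & 0x1ff));
		printf("PTE index: %llu\n", (unsigned long long)((vaddr >> 12) & 0x1ff));
		printf("offset:    0x%llx\n", (unsigned long long)(vaddr & 0xfff));
		return 0;
	}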

Huge Pages
==========

The address translation requires several memory accesses, and memory
accesses are slow relative to CPU speed. To avoid spending precious
processor cycles on the address translation, CPUs maintain a cache of
such translations called Translation Lookaside Buffer (or TLB).
Usually the TLB is a scarce resource, and applications with a large
memory working set will experience a performance hit because of TLB
misses.

Many modern CPU architectures allow mapping of the memory pages
directly by the higher levels in the page table. For instance, on x86,
it is possible to map 2M and even 1G pages using entries in the second
and the third level page tables. In Linux such pages are called
`huge`. Usage of huge pages significantly reduces pressure on the TLB,
improves the TLB hit-rate and thus improves overall system
performance.

There are two mechanisms in Linux that enable mapping of the physical
memory with the huge pages. The first one is `HugeTLB filesystem`, or
hugetlbfs. It is a pseudo filesystem that uses RAM as its backing
store. For the files created in this filesystem the data resides in
the memory and is mapped using huge pages. The hugetlbfs is described
at :ref:`Documentation/admin-guide/mm/hugetlbpage.rst <hugetlbpage>`.

Another, more recent, mechanism that enables use of the huge pages is
called `Transparent HugePages`, or THP. Unlike hugetlbfs, which
requires users and/or system administrators to configure what parts of
the system memory should and can be mapped by the huge pages, THP
manages such mappings transparently to the user and hence the
name. See
:ref:`Documentation/admin-guide/mm/transhuge.rst <admin_guide_transhuge>`
for more details about THP.
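
How an application reaches these two mechanisms can be illustrated
with a minimal sketch, assuming a 2M huge page size; both requests may
legitimately fail or be ignored, for example when the administrator
has not reserved any huge pages for the hugetlbfs pool::

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2 * 1024 * 1024;	/* one 2M huge page */

		/* Explicit request backed by the hugetlbfs pool. */
		void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
		if (p == MAP_FAILED)
			perror("mmap(MAP_HUGETLB)");	/* e.g. no pages reserved */
		else
			munmap(p, len);

		/* Ordinary mapping with a hint that THP may back it. */
		void *q = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (q != MAP_FAILED) {
			madvise(q, len, MADV_HUGEPAGE);	/* advisory only */
			memset(q, 1, len);	/* touch the range to allocate it */
			munmap(q, len);
		}
		return 0;
	}

Whether the second mapping actually ends up backed by huge pages
depends on the THP configuration and on the availability of physically
contiguous memory.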

Zones
=====

Often hardware poses restrictions on how different physical memory
ranges can be accessed. In some cases, devices cannot perform DMA to
all the addressable memory. In other cases, the size of the physical
memory exceeds the maximal addressable size of virtual memory and
special actions are required to access portions of the memory. Linux
groups memory pages into `zones` according to their possible
usage. For example, ZONE_DMA will contain memory that can be used by
devices for DMA, ZONE_HIGHMEM will contain memory that is not
permanently mapped into the kernel's address space and ZONE_NORMAL
will contain normally addressed pages.

The actual layout of the memory zones is hardware dependent, as not
all architectures define all zones, and requirements for DMA are
different for different platforms.

Nodes
=====

Many multi-processor machines are NUMA - Non-Uniform Memory Access -
systems. In such systems the memory is arranged into banks that have
different access latency depending on the "distance" from the
processor. Each bank is referred to as a `node` and for each node
Linux constructs an independent memory management subsystem. A node
has its own set of zones, lists of free and used pages and various
statistics counters. You can find more details about NUMA in
:ref:`Documentation/vm/numa.rst <numa>` and in
:ref:`Documentation/admin-guide/mm/numa_memory_policy.rst <numa_memory_policy>`.

Page cache
==========

The physical memory is volatile and the common case for getting data
into the memory is to read it from files. Whenever a file is read, the
data is put into the `page cache` to avoid expensive disk access on
the subsequent reads. Similarly, when one writes to a file, the data
is placed in the page cache and eventually gets into the backing
storage device. The written pages are marked as `dirty` and when Linux
decides to reuse them for other purposes, it makes sure to synchronize
the file contents on the device with the updated data.
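
From user space this flow can be sketched with plain file I/O:
write(2) below only puts the data into the page cache and marks the
pages dirty, while fsync(2) blocks until the dirty pages have been
synchronized with the storage device (the file path is just an
illustration)::

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/tmp/pagecache-demo", O_CREAT | O_TRUNC | O_WRONLY, 0600);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		const char buf[] = "hello, page cache\n";

		/* The data lands in the page cache; the pages become dirty. */
		if (write(fd, buf, sizeof(buf) - 1) < 0)
			perror("write");

		/* Wait until the dirty pages reach the backing device. */
		if (fsync(fd) < 0)
			perror("fsync");

		close(fd);
		return 0;
	}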

Anonymous Memory
================

The `anonymous memory` or `anonymous mappings` represent memory that
is not backed by a filesystem. Such mappings are implicitly created
for the program's stack and heap, or by explicit calls to the mmap(2)
system call. Usually, the anonymous mappings only define virtual
memory areas that the program is allowed to access. The read accesses
will result in creation of a page table entry that references a
special physical page filled with zeroes. When the program performs a
write, a regular physical page will be allocated to hold the written
data. The page will be marked dirty and if the kernel decides to
repurpose it, the dirty page will be swapped out.
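
A minimal sketch of that behaviour with an explicit anonymous mapping,
assuming a 4 KiB page size (the assertion merely illustrates that a
fresh mapping reads as zeroes)::

	#include <assert.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 4096;	/* assume one page for simplicity */

		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Read access: served from the shared zero page. */
		assert(p[0] == 0);

		/* First write: a regular physical page is allocated. */
		p[0] = 'x';

		munmap(p, len);
		return 0;
	}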

Reclaim
=======

Throughout the system lifetime, a physical page can be used for
storing different types of data. It can be kernel internal data
structures, DMA'able buffers for device drivers use, data read from a
filesystem, memory allocated by user space processes etc.

Depending on the page usage it is treated differently by the Linux
memory management. The pages that can be freed at any time, either
because they cache the data available elsewhere, for instance, on a
hard disk, or because they can be swapped out, again, to the hard
disk, are called `reclaimable`. The most notable categories of the
reclaimable pages are page cache and anonymous memory.

In most cases, the pages holding internal kernel data and used as DMA
buffers cannot be repurposed, and they remain pinned until freed by
their user. Such pages are called `unreclaimable`. However, in certain
circumstances, even pages occupied with kernel data structures can be
reclaimed. For instance, in-memory caches of filesystem metadata can
be re-read from the storage device and therefore it is possible to
discard them from the main memory when the system is under memory
pressure.

The process of freeing the reclaimable physical memory pages and
repurposing them is called (surprise!) `reclaim`. Linux can reclaim
pages either asynchronously or synchronously, depending on the state
of the system. When the system is not loaded, most of the memory is
free and allocation requests will be satisfied immediately from the
free pages supply. As the load increases, the amount of the free pages
goes down and when it falls below a certain threshold (low watermark),
an allocation request will awaken the ``kswapd`` daemon. It will
asynchronously scan memory pages and either just free them if the data
they contain is available elsewhere, or evict them to the backing
storage device (remember those dirty pages?). As memory usage
increases even more and reaches another threshold - min watermark - an
allocation will trigger `direct reclaim`. In this case allocation is
stalled until enough memory pages are reclaimed to satisfy the
request.

Compaction
==========

As the system runs, tasks allocate and free the memory and it becomes
fragmented. Although with virtual memory it is possible to present
scattered physical pages as a virtually contiguous range, sometimes it
is necessary to allocate large physically contiguous memory areas.
Such a need may arise, for instance, when a device driver requires a
large buffer for DMA, or when THP allocates a huge page. Memory
`compaction` addresses the fragmentation issue. This mechanism moves
occupied pages from the lower part of a memory zone to free pages in
the upper part of the zone. When a compaction scan is finished, free
pages are grouped together at the beginning of the zone and
allocations of large physically contiguous areas become possible.

Like reclaim, the compaction may happen asynchronously in the
``kcompactd`` daemon or synchronously as a result of a memory
allocation request.

OOM killer
==========

It is possible that on a loaded machine memory will be exhausted and
the kernel will be unable to reclaim enough memory to continue to
operate. In order to save the rest of the system, it invokes the
`OOM killer`.

The `OOM killer` selects a task to sacrifice for the sake of the
overall system health. The selected task is killed in the hope that
after it exits enough memory will be freed to continue normal
operation.
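
The victim selection is driven by a per-process badness score that can
be tuned from user space through ``/proc/<pid>/oom_score_adj``, in the
documented range of -1000 to 1000. A minimal sketch of a process
volunteering itself as the preferred victim::

	#include <stdio.h>

	int main(void)
	{
		/* 1000 makes this process the first candidate for the
		 * OOM killer; -1000 would exempt it completely. */
		FILE *f = fopen("/proc/self/oom_score_adj", "w");

		if (!f) {
			perror("fopen");
			return 1;
		}
		fprintf(f, "1000\n");
		fclose(f);
		return 0;
	}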