Documentation/mm/page_tables.rst

1 .. SPDX-License-Identifier: GPL-2.0
7 Paged virtual memory was invented along with virtual memory as a concept in
9 virtual memory. The feature migrated to newer computers and became a de facto
10 feature of all Unix-like systems as time went by. In 1985 the feature was
14 as seen on the external memory bus.
18 map this to the restrictions of the hardware.
20 The physical address corresponding to the virtual address is often referenced
22 is the physical address of the page (as seen on the external memory bus)
25 Physical memory address 0 will be *pfn 0* and the highest pfn will be
26 the last page of physical memory the external address bus of the CPU can
32 at 0x00004000, 0x00008000 ... 0xffffc000 and pfn goes from 0 to 0x3ffff.
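The pfn arithmetic above can be sketched in a few lines of user-space C. This is only an illustration: `addr_to_pfn()` and `pfn_to_addr()` are hypothetical helpers, not kernel interfaces, and the page shift is passed as a parameter here whereas the kernel fixes it at build time.

```c
#include <stdint.h>

/*
 * Hypothetical helpers, not kernel APIs. page_shift is log2 of the page
 * size: 12 for 4KB pages, 14 for 16KB pages.
 */
static inline uint64_t addr_to_pfn(uint64_t phys_addr, unsigned int page_shift)
{
	/* Dividing by the page size is a right shift by the page shift. */
	return phys_addr >> page_shift;
}

static inline uint64_t pfn_to_addr(uint64_t pfn, unsigned int page_shift)
{
	/* The base address of a page frame is its pfn times the page size. */
	return pfn << page_shift;
}
```

With a 32-bit address range, the last 4KB page at 0xfffff000 maps to pfn 0xfffff and the last 16KB page at 0xffffc000 maps to pfn 0x3ffff.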
34 As you can see, with 4KB pages the page base address uses bits 12-31 of the
38 Over time a deeper hierarchy has been developed in response to increasing memory
41 the fact that Torvalds's first computer had 4MB of physical memory. Entries in
42 this single table were referred to as *PTEs* - page table entries.
45 become hierarchical and that in turn is done to save page table memory and
49 of entries, breaking down the whole memory into single pages. Such a page table
50 would be very sparse, because large portions of the virtual memory usually
52 address space does not waste valuable page table memory, because it will suffice
53 to mark large areas as unmapped at a higher level in the page table hierarchy.
56 to a physical memory range, which allows mapping a contiguous range of several
57 megabytes or even gigabytes in a single high-level page table entry, taking
58 shortcuts in mapping virtual memory to physical memory: there is no need to
63   +-----+
65   +-----+
67      |   +-----+
68      +-->| P4D |
69          +-----+
71             |   +-----+
72             +-->| PUD |
73                 +-----+
75                    |   +-----+
76                    +-->| PMD |
77                        +-----+
79                           |   +-----+
80                           +-->| PTE |
81                               +-----+
87 - **pte**, `pte_t`, `pteval_t` = **Page Table Entry** - mentioned earlier.
89   mapping a single page of virtual memory to a single page of physical memory.
92   A typical example is that the `pteval_t` is a 32- or 64-bit value with the
94   architecture-specific bits such as memory protection.
97   this did refer to a single page table entry in the single top level page
98   table, it was retrofitted to be an array of mapping elements when two-level
102 - **pmd**, `pmd_t`, `pmdval_t` = **Page Middle Directory**, the hierarchy right
103   above the *pte*, with `PTRS_PER_PMD` references to the *ptes*.
105 - **pud**, `pud_t`, `pudval_t` = **Page Upper Directory** was introduced after
106   the other levels to handle 4-level page tables. It is potentially unused,
109 - **p4d**, `p4d_t`, `p4dval_t` = **Page Level 4 Directory** was introduced to
110   handle 5-level page tables after the *pud* was introduced. Now it was clear
111   that we needed to replace *pgd*, *pmd*, *pud* etc with a figure indicating the
116 - **pgd**, `pgd_t`, `pgdval_t` = **Page Global Directory** - the Linux kernel
117   main page table handling the PGD for the kernel memory is still found in
119   memory context and thus its own *pgd*, found in `struct mm_struct` which
120   in turn is referenced in each `struct task_struct`. So tasks have memory
122   `pgd_t *pgd` pointer to the corresponding page global directory.
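As an illustration of the *pte* value described in the list above, the following user-space sketch packs a pfn into the upper bits of a 64-bit value and a few flag bits into the lower bits. Everything here is an assumption made for the example: `demo_pteval_t`, the `DEMO_*` bit positions and the helper names are invented, and real architectures define their own layouts.

```c
#include <stdint.h>

typedef uint64_t demo_pteval_t;          /* hypothetical stand-in for pteval_t */

#define DEMO_PAGE_SHIFT 12               /* assumes 4KB pages */
#define DEMO_FLAG_MASK  ((1ULL << DEMO_PAGE_SHIFT) - 1)

/* Made-up flag positions; real architectures differ. */
#define DEMO_PRESENT    (1ULL << 0)
#define DEMO_WRITE      (1ULL << 1)
#define DEMO_DIRTY      (1ULL << 2)

/* Combine a pfn (upper bits) with flag bits (lower bits). */
static demo_pteval_t demo_mk_pte(uint64_t pfn, uint64_t flags)
{
	return (pfn << DEMO_PAGE_SHIFT) | (flags & DEMO_FLAG_MASK);
}

/* Recover the pfn by dropping the flag bits. */
static uint64_t demo_pte_pfn(demo_pteval_t val)
{
	return val >> DEMO_PAGE_SHIFT;
}

/* Recover the flag bits by masking off the pfn. */
static uint64_t demo_pte_flags(demo_pteval_t val)
{
	return val & DEMO_FLAG_MASK;
}
```

Because the page base address is always page-aligned, its low bits are free to carry protection and status information, which is why a single word can hold both.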
124 To repeat: each level in the page table hierarchy is an *array of pointers*, so
125 the **pgd** contains `PTRS_PER_PGD` pointers to the next level below, **p4d**
126 contains `PTRS_PER_P4D` pointers to **pud** items and so on. The number of
127 pointers on each level is architecture-defined.::
130   --> +-----+           PTE
131       | ptr |-------> +-----+
132       | ptr |-        | ptr |-------> PAGE
137       +-----+     +----> +-----+
138                          | ptr |-------> PAGE
148 compile-time augmented to just skip a level when accessing the next lower
151 Page table handling code that wishes to be architecture-neutral, such as the
152 virtual memory manager, will need to be written so that it traverses all of the
154 architecture-specific code, so as to be robust to future changes.
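The idea of traversing every level can be sketched with a user-space simulation. This is not kernel code: the `sim_*` names are invented, and the geometry (4KB pages, 512 pointers per table, five levels) is an assumption that loosely resembles one common 64-bit configuration.

```c
#include <stdint.h>
#include <stdlib.h>

#define SIM_PAGE_SHIFT   12                     /* 4KB pages (assumption) */
#define SIM_BITS_PER_LVL 9                      /* 512 pointers per table */
#define SIM_PTRS_PER_LVL (1u << SIM_BITS_PER_LVL)
#define SIM_LEVELS       5                      /* pgd, p4d, pud, pmd, pte */

/* Every level is simply an array of pointers to the next level. */
struct sim_table {
	void *ptr[SIM_PTRS_PER_LVL];
};

/* Index into the table at 'level' (0 = pgd ... 4 = pte) for address 'va'. */
static unsigned int sim_index(uint64_t va, int level)
{
	int shift = SIM_PAGE_SHIFT + SIM_BITS_PER_LVL * (SIM_LEVELS - 1 - level);

	return (unsigned int)((va >> shift) & (SIM_PTRS_PER_LVL - 1));
}

/*
 * Install 'page' for 'va', allocating missing intermediate tables.
 * Returns 0 on success and -1 if an allocation fails.
 */
static int sim_map(struct sim_table *pgd, uint64_t va, void *page)
{
	struct sim_table *t = pgd;

	for (int level = 0; level < SIM_LEVELS - 1; level++) {
		unsigned int i = sim_index(va, level);

		if (!t->ptr[i]) {
			t->ptr[i] = calloc(1, sizeof(struct sim_table));
			if (!t->ptr[i])
				return -1;
		}
		t = t->ptr[i];
	}
	t->ptr[sim_index(va, SIM_LEVELS - 1)] = page;
	return 0;
}

/* Walk all levels; return the mapped page or NULL if 'va' is unmapped. */
static void *sim_walk(const struct sim_table *pgd, uint64_t va)
{
	const struct sim_table *t = pgd;

	for (int level = 0; level < SIM_LEVELS - 1; level++) {
		t = t->ptr[sim_index(va, level)];
		if (!t)
			return NULL;    /* hole at a higher level: nothing below it */
	}
	return t->ptr[sim_index(va, SIM_LEVELS - 1)];
}
```

A NULL pointer at an upper level marks the whole range below it as unmapped without allocating any lower-level table, which is how a sparse address space avoids wasting page table memory.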
160 The `Memory Management Unit (MMU)` is a hardware component that handles virtual
161 to physical address translations. It may use relatively small caches in hardware
162 called `Translation Lookaside Buffers (TLBs)` and `Page Walk Caches` to speed up
165 When the CPU accesses a memory location, it provides a virtual address to the MMU,
168 the MMU uses page walks to determine the physical address and create the map.
170 Each page of memory has associated permission and dirty bits. The dirty bit
171 for a page is set (i.e., turned on) when the page is written to, which
172 indicates that the page has been modified since it was loaded into memory.
174 If nothing prevents it, eventually the physical memory can be accessed and the
178 happen because the CPU is trying to access memory that the current task is not
179 permitted to, or because the data is not present in physical memory.
182 exceptions that signal the CPU to pause the current execution and run a special
183 function to handle the mentioned exceptions.
187 "Copy-on-Write". Page faults may also happen when frames have been swapped out
188 to persistent storage (swap partition or file) and evicted from their physical
191 These techniques improve memory efficiency, reduce latency, and minimize space
193 and "Copy-on-Write" because these subjects are out of scope as they belong to
197 undesirable since it's performed as a means to reduce memory under heavy
200 Swapping can't work for memory mapped by kernel logical addresses. These are a
202 physical memory. Given any logical address, its physical address is determined
203 with simple arithmetic on an offset. Accesses to logical addresses are fast
207 If the kernel fails to make room for the data that must be present in the
208 physical frames, the kernel invokes the out-of-memory (OOM) killer to make room
213 crafted addresses that the CPU is instructed to access. A thread of a process
214 could use instructions to address (non-shared) memory which does not belong to
215 its own address space, or could try to execute an instruction that wants to write
216 to a read-only location.
218 If the above-mentioned conditions happen in user-space, the kernel sends a
219 `Segmentation Fault` (SIGSEGV) signal to the current thread. That signal usually
220 causes the termination of the thread and of the process it belongs to.
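The user-space consequence described above can be observed with a small experiment. The sketch below assumes a Linux system with `fork()` and anonymous `mmap()` available; `signal_from_readonly_write()` is a hypothetical helper written for this example. A child process maps one read-only page and writes to it, and the parent reports which signal, if any, terminated the child.

```c
#define _DEFAULT_SOURCE                 /* for MAP_ANONYMOUS on glibc */
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns the number of the signal that killed the
 * child, 0 if the child exited normally, or -1 on fork/waitpid failure.
 */
static int signal_from_readonly_write(void)
{
	int status;
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid == 0) {
		/* Child: map one page that may be read but not written. */
		volatile char *p = mmap(NULL, 4096, PROT_READ,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			_exit(2);
		p[0] = 1;       /* write to a read-only location: faults */
		_exit(0);       /* not reached if the fault is fatal */
	}

	/* Parent: collect the child's termination status. */
	if (waitpid(pid, &status, 0) < 0)
		return -1;
	return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux the returned value is expected to be `SIGSEGV`. Running the faulting access in a child keeps the calling process alive so that it can inspect the result.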
222 This document is going to simplify and show a high-altitude view of how the
224 check if memory is present and, if not, requests to load data from persistent
227 The first steps are architecture dependent. Most architectures jump to
231 Whatever the routes, all architectures end up invoking
233 `__handle_mm_fault()` to carry out the actual work of allocating the page
236 The unfortunate case of not being able to call `__handle_mm_fault()` means
237 that the virtual address is pointing to areas of physical memory which are not
238 permitted to be accessed (at least from the current context). This
239 condition resolves to the kernel sending the above-mentioned SIGSEGV signal
240 to the process and leads to the consequences already explained.
242 `__handle_mm_fault()` carries out its work by calling several functions to
247 "*" is for pgd, p4d, pud, pmd, pte; instead the functions to allocate the
249 above-mentioned convention to name them after the corresponding types of tables
256 directly map them, with no need to use lower level page entries (PTE). Huge
257 pages contain large contiguous physical regions that usually span from 2MB to
261 reduced page table overhead, memory allocation efficiency, and performance
263 trade-offs, like wasted memory and allocation challenges.
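The trade-off can be made concrete with some arithmetic. The helpers below are hypothetical and only count how many mappings of a given size cover a region, and how much memory is left unused when the region is not a multiple of that size. The 2MB size comes from the text above; the 1GB size in the usage note is an assumption based on what x86-64 commonly offers.

```c
#include <stdint.h>

/*
 * Hypothetical helper: number of mappings of 'map_size' bytes needed to
 * cover 'region' bytes, rounding up to whole mappings.
 */
static uint64_t mappings_needed(uint64_t region, uint64_t map_size)
{
	return (region + map_size - 1) / map_size;
}

/* Bytes mapped beyond the end of the region because of rounding up. */
static uint64_t wasted_bytes(uint64_t region, uint64_t map_size)
{
	return mappings_needed(region, map_size) * map_size - region;
}
```

Covering 1GB takes 262144 entries with 4KB pages, 512 entries with 2MB pages, and a single entry with a 1GB page. On the other hand, a 3MB region backed by 2MB pages needs two of them and leaves 1MB unused.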
272 Linux to handle page faults in a way that is tailored to the specific
276 To conclude this high-altitude view of how Linux handles page faults, let's
280 Several code paths make use of the latter two functions because they need to
281 disable traps into the page fault handler, mostly to prevent deadlocks.