.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes. Userland access outside of VMAs is invalid except in the case where
an adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that
is, threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity which can be acquired via
  :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for
  file-backed memory. We refer to these locks as the reverse mapping locks, or
  'rmap locks' for brevity (their use is sketched at the end of this section).

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieves is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.
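
As a concrete illustration of the rmap locks described above, the following is a
minimal sketch of stabilising file-backed VMAs by walking the
:c:member:`!i_mmap` interval tree under the i_mmap read lock. The helper name
:c:func:`!for_each_mapping_vma` is illustrative only and not an existing kernel
function:

.. code-block:: c

   #include <linux/fs.h>
   #include <linux/mm.h>

   /* Walk every VMA mapping the file range [first, last] (in pages). */
   static void for_each_mapping_vma(struct address_space *mapping,
                                    pgoff_t first, pgoff_t last)
   {
           struct vm_area_struct *vma;

           i_mmap_lock_read(mapping);
           vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
                   /*
                    * The i_mmap rmap lock stabilises each VMA here - it
                    * cannot be freed, nor (for most fields) modified.
                    */
           }
           i_mmap_unlock_read(mapping);
   }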

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following (both the read and write patterns are sketched at
the end of this section):

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock`
  (or a suitable variant), unlocking it with a matching
  :c:func:`!mmap_read_unlock` when you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`,
  *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.

If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock`
  (or a suitable variant), unlocking it with a matching
  :c:func:`!mmap_write_unlock` when you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish
  to modify, which will be released automatically when
  :c:func:`!mmap_write_unlock` is called.
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock, however, can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to look up the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read
             lock, attempting to do the reverse is invalid as it can result in
             deadlock - if another task already holds an mmap write lock and
             attempts to acquire a VMA write lock, that attempt will deadlock on
             the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with
          other readers and write locks exclusive against all others holding
          the semaphore.
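
The following is a minimal sketch of both patterns described above, assuming
:c:macro:`!CONFIG_PER_VMA_LOCK`. The helper names are illustrative only, not
existing kernel functions:

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/printk.h>

   /* Read path: try the per-VMA lock first, fall back to the mmap lock. */
   static void inspect_vma(struct mm_struct *mm, unsigned long addr)
   {
           struct vm_area_struct *vma;

           vma = lock_vma_under_rcu(mm, addr);
           if (vma) {
                   /* VMA read lock held - metadata is stable. */
                   pr_info("vma spans %lx-%lx\n", vma->vm_start, vma->vm_end);
                   vma_end_read(vma);
                   return;
           }

           /* Fall back to the coarser mmap read lock. */
           mmap_read_lock(mm);
           vma = vma_lookup(mm, addr);
           if (vma)
                   pr_info("vma spans %lx-%lx\n", vma->vm_start, vma->vm_end);
           mmap_read_unlock(mm);
   }

   /* Write path: mmap write lock first, then a VMA write lock per VMA. */
   static void modify_vma_metadata(struct mm_struct *mm, unsigned long addr)
   {
           struct vm_area_struct *vma;

           mmap_write_lock(mm);
           vma = vma_lookup(mm, addr);
           if (vma) {
                   vma_start_write(vma);
                   /* ... modify 'most' fields here ... */
           }
           mmap_write_unlock(mm); /* also releases all VMA write locks */
   }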

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose,
which makes it easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its
attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW'd     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma`
objects and the :c:struct:`!struct anon_vma` in which folios mapped exclusively
to this VMA should reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.
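
As an illustration of the locking noted for :c:member:`!__vm_flags` in the core
fields table above, the following hedged sketch updates VMA flags via
:c:func:`!vm_flags_set` and :c:func:`!vm_flags_clear` under the mmap write lock
(at the time of writing, these helpers themselves write-lock the VMA via
:c:func:`!vma_start_write`). The helper name is illustrative only:

.. code-block:: c

   #include <linux/mm.h>

   /* Caller must hold the mmap write lock for vma->vm_mm. */
   static void toggle_vma_locked_flag(struct vm_area_struct *vma, bool lock)
   {
           mmap_assert_write_locked(vma->vm_mm);

           /* vm_flags_set()/vm_flags_clear() write-lock the VMA for us. */
           if (lock)
                   vm_flags_set(vma, VM_LOCKED);
           else
                   vm_flags_clear(vma, VM_LOCKED);
   }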

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page tables than
          five the kernel cleverly 'folds' page table levels, that is stubbing
          out functions related to the skipped levels. This allows us to
          conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
   also a special case of page table traversal for non-VMA regions which we
   consider separately below.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation.
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).

.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).
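
For example, a page table **traversal** can be performed with only the mmap read
lock held to keep VMAs stable, using the page walk API declared in
:c:macro:`!include/linux/pagewalk.h`. The helper below is an illustrative
sketch, not an existing kernel function:

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/pagewalk.h>
   #include <linux/pgtable.h>

   static int count_present_pte(pte_t *pte, unsigned long addr,
                                unsigned long next, struct mm_walk *walk)
   {
           unsigned long *count = walk->private;

           /* The page walker maps and locks the PTE table for us. */
           if (pte_present(ptep_get(pte)))
                   (*count)++;
           return 0;
   }

   static const struct mm_walk_ops count_ops = {
           .pte_entry = count_present_pte,
   };

   /* Count present leaf entries in [start, end). */
   static unsigned long count_present(struct mm_struct *mm,
                                      unsigned long start, unsigned long end)
   {
           unsigned long count = 0;

           mmap_read_lock(mm);     /* keeps all VMAs in the range stable */
           walk_page_range(mm, start, end, &count_ops, &count);
           mmap_read_unlock(mm);

           return count;
   }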

.. note:: We free empty PTE tables on zap under the RCU lock - this does not
          change the aforementioned locking requirements around zapping.

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Traversing non-VMA page tables
------------------------------

We've focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them, and the aforementioned locking rules do not apply -
for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been acquired.

Since, aside from vmalloc and memory hot-plug, kernel page tables are not torn
down all that often, this usually suffices; however, any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.

We also permit a truly unusual case: the traversal of non-VMA ranges in
**userland**, as provided for by :c:func:`!walk_page_range_debug`.

This has only one user - the general page table dumping logic (implemented in
:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
even if they are highly unusual (possibly architecture-specific) and are not
backed by a VMA.

We must take great care in this case, as the :c:func:`!munmap` implementation
detaches VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such a traversal could race with :c:func:`!munmap`, and thus an mmap
**write** lock is required.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
          but in doing so inadvertently cause a mutual deadlock.

          For example, consider thread 1 which holds lock A and tries to acquire
          lock B, while thread 2 holds lock B and tries to acquire lock A.

          Both threads are now deadlocked on each other. However, had they
          attempted to acquire locks in the same order, one would have waited
          for the other to complete its work and no deadlock would have
          occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                        i_pages lock (widely used)
                          lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                      in arch-dependent flush_dcache_mmap_lock,
                                      within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem                  (truncate_pagecache)
    ->private_lock                (__free_pte->block_dirty_folio)
      ->swap_lock                 (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock             (acquired by fs in truncate path)
      ->i_mmap_rwsem              (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock (various, mainly in memory.c)
        ->i_pages lock            (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock             (filemap_fault)
      ->lock_page                 (filemap_fault, access_process_vm)

  ->i_rwsem                       (generic_perform_write)
    ->mmap_lock                   (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                       (fs/fs-writeback.c)
    ->i_pages lock                (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock               (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock                   (try_to_unmap_one)
    ->private_lock                (try_to_unmap_one)
    ->i_pages lock                (try_to_unmap_one)
    ->lruvec->lru_lock            (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock            (check_pte_range->folio_isolate_lru)
    ->private_lock                (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock                (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock             (folio_remove_rmap_pte->set_page_dirty)
      ->inode->i_lock             (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock             (zap_pte_range->set_page_dirty)
      ->inode->i_lock             (zap_pte_range->set_page_dirty)
    ->private_lock                (zap_pte_range->block_dirty_folio)

Please check the current state of these comments, which may have changed since
the time of writing of this document.
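
As a brief illustration of the mm-side ordering above, a caller that needs
several of these locks at once takes them in the documented order and releases
them in reverse. The following is a hedged sketch with an illustrative helper
name, not a definitive recipe:

.. code-block:: c

   #include <linux/fs.h>
   #include <linux/mm.h>

   static void lock_vma_and_mapping(struct mm_struct *mm,
                                    struct vm_area_struct *vma,
                                    struct address_space *mapping)
   {
           mmap_write_lock(mm);            /* mm->mmap_lock         */
           vma_start_write(vma);           /* VMA write lock        */
           i_mmap_lock_write(mapping);     /* mapping->i_mmap_rwsem */

           /* ... operate on the VMA and its reverse mapping ... */

           i_mmap_unlock_write(mapping);
           mmap_write_unlock(mm);          /* also drops the VMA write lock */
   }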

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

.. note:: This section explores page table locking requirements for page tables
          encompassed by a VMA. See the above section on non-VMA page table
          traversal for details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into high memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
kernel memory if it resides in high memory - see below.

Whether care is taken on reading the page table entries depends on the
architecture; see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write), doing so with only rmap locks would be dangerous (see
  the warning below).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).

* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read lock;
  but that only suffices for readers that can tolerate racing with concurrent
  page table updates such that an empty PTE is observed (in a page table that
  has actually already been detached and marked for RCU freeing) while another
  new page table has been installed in the same location and filled with
  entries. Writers normally need to take the PTE lock and revalidate that the
  PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of the PMD entry still
  meet the requirements. In particular, this also happens in
  :c:func:`!retract_page_tables` when handling :c:macro:`!MADV_COLLAPSE`.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.

Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal may be performed in parallel (though holding the VMA stable),
and functionality like GUP-fast locklessly traverses (that is, reads) page
tables without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.
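
To illustrate the read and read-modify-write cases described above, the
following is a hedged sketch; the helper name is illustrative only, and the
read-modify-write case assumes the PTE lock is already held:

.. code-block:: c

   #include <linux/mm.h>
   #include <linux/pgtable.h>

   static void pte_read_vs_modify(struct mm_struct *mm, unsigned long addr,
                                  pte_t *ptep)
   {
           pte_t entry;

           /*
            * Plain read: ptep_get() uses READ_ONCE() so the entry is loaded
            * exactly once. Act only on this snapshot, never re-read *ptep and
            * assume it is unchanged.
            */
           entry = ptep_get(ptep);
           if (!pte_present(entry))
                   return;

           /*
            * Read-modify-write (PTE lock held): the clear must be a
            * hardware-atomic exchange so concurrent hardware A/D bit updates
            * are not lost.
            */
           entry = ptep_get_and_clear(mm, addr, ptep);
           /* ... act on the old entry, e.g. propagate its dirty bit ... */
   }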

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally, functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
          :c:func:`!pud_lockptr` in turn, however at the time of writing it
          ultimately references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.

.. note:: There are some variants on this, such as
          :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE
          stable but for brevity we do not explore this. See the comment for
          :c:func:`!__pte_offset_map_lock` for more details.
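
A minimal sketch of this pattern (the helper name is illustrative only) is:

.. code-block:: c

   #include <linux/errno.h>
   #include <linux/mm.h>

   static int with_locked_pte(struct mm_struct *mm, pmd_t *pmd,
                              unsigned long addr)
   {
           spinlock_t *ptl;
           pte_t *pte;

           pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
           if (!pte)
                   return -EAGAIN; /* PTE table changed under us - retry */

           /*
            * The PTE table is mapped and its lock is held - entries may be
            * examined and modified safely here.
            */

           pte_unmap_unlock(pte, ptl);
           return 0;
   }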

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty; if so, only then acquiring the page table lock and checking
again to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.

At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse, for instance, might
have eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the VMA from all reverse mappings; however, it is important
that no new VMAs overlap these, and that no route remains to permit access to
addresses within the range whose page tables are being torn down.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independently of
          the page tables above them, as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock, PMD, and PTE page table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a
competing write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

In cases where the user already holds the mmap read lock,
:c:func:`!vma_start_read_locked` and :c:func:`!vma_start_read_locked_nested` can
be used. These functions do not fail due to lock contention, but the caller
should still check their return values in case they fail for other reasons.

VMA read locks increment the :c:member:`!vma.vm_refcnt` reference counter for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
:c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances
where a VMA is about to be modified; unlike :c:func:`!vma_start_read`, the lock
is always acquired. An mmap write lock **must** be held for the duration of the
VMA write lock; releasing or downgrading the mmap write lock also releases the
VMA write lock, so there is no :c:func:`!vma_end_write` function.

Note that when write-locking a VMA, the :c:member:`!vma.vm_refcnt` is
temporarily modified so that readers can detect the presence of a writer. The
reference counter is restored once the VMA sequence number used for
serialisation is updated.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.

Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`; however, the write lock is released by the
termination or downgrade of the mmap write lock, so no :c:func:`!vma_end_write`
is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq`, is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.

Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked, which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we increment the
:c:member:`!vma.vm_refcnt` reference counter and check that the sequence count
of the VMA does not match that of the mm.

If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
If it does not, we keep the reference counter raised, excluding writers, but
permitting other readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
modified by readers and wait for all readers to drop their reference count.
Once there are no readers, the VMA's sequence number is set to match that of
the mm. During this entire operation the mmap write lock is held.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
indicating a writer is cleared. From this point on, the VMA's sequence number
will indicate its write-locked state until the mmap write lock is dropped or
downgraded.

This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading
---------------------------

When an mmap write lock is held, one has exclusive access to resources within
the mmap (with the usual caveats about requiring VMA write locks to avoid races
with tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and an N indicates that they are not.

Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults; as a result we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.