.. SPDX-License-Identifier: GPL-2.0

=================
Process Addresses
=================

.. toctree::
   :maxdepth: 3


Userland memory ranges are tracked by the kernel via Virtual Memory Areas or
'VMA's of type :c:struct:`!struct vm_area_struct`.

Each VMA describes a virtually contiguous memory range with identical
attributes, represented by a :c:struct:`!struct vm_area_struct` object.
Userland access outside of VMAs is invalid except in the case where an
adjacent stack VMA could be extended to contain the accessed address.

All VMAs are contained within one and only one virtual address space, described
by a :c:struct:`!struct mm_struct` object which is referenced by all tasks (that is,
threads) which share the virtual address space. We refer to this as the
:c:struct:`!mm`.

Each mm object contains a maple tree data structure which describes all VMAs
within the virtual address space.

.. note:: An exception to this is the 'gate' VMA which is provided by
          architectures which use :c:struct:`!vsyscall` and is a global static
          object which does not belong to any specific mm.

-------
Locking
-------

The kernel is designed to be highly scalable against concurrent read operations
on VMA **metadata**, so a complicated set of locks is required to ensure memory
corruption does not occur.

.. note:: Locking VMAs for their metadata does not have any impact on the memory
          they describe nor the page tables that map them.

Terminology
-----------

* **mmap locks** - Each MM has a read/write semaphore :c:member:`!mmap_lock`
  which locks at a process address space granularity and which can be acquired
  via :c:func:`!mmap_read_lock`, :c:func:`!mmap_write_lock` and variants.
* **VMA locks** - The VMA lock is at VMA granularity (of course) which behaves
  as a read/write semaphore in practice. A VMA read lock is obtained via
  :c:func:`!lock_vma_under_rcu` (and unlocked via :c:func:`!vma_end_read`) and a
  write lock via :c:func:`!vma_start_write` (all VMA write locks are unlocked
  automatically when the mmap write lock is released). To take a VMA write lock
  you **must** have already acquired an :c:func:`!mmap_write_lock`.
* **rmap locks** - When trying to access VMAs through the reverse mapping via a
  :c:struct:`!struct address_space` or :c:struct:`!struct anon_vma` object
  (reachable from a folio via :c:member:`!folio->mapping`), VMAs must be
  stabilised via :c:func:`!anon_vma_[try]lock_read` or
  :c:func:`!anon_vma_[try]lock_write` for anonymous memory and
  :c:func:`!i_mmap_[try]lock_read` or :c:func:`!i_mmap_[try]lock_write` for
  file-backed memory. We refer to these locks as the reverse mapping locks, or
  'rmap locks' for brevity.
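
For instance, to examine every VMA currently mapping a given range of a
file-backed mapping, one can walk the :c:member:`!i_mmap` interval tree under
the rmap read lock. The helper below is only a sketch - the function name is
hypothetical - but :c:func:`!i_mmap_lock_read` and
:c:func:`!vma_interval_tree_foreach` are the interfaces the kernel provides
for this:

.. code-block:: c

  #include <linux/fs.h>
  #include <linux/mm.h>

  /*
   * Sketch: visit each VMA mapping page offsets [first, last] of a file,
   * keeping every visited VMA stable via the i_mmap rmap lock.
   */
  static void for_each_mapping_vma(struct address_space *mapping,
                                   pgoff_t first, pgoff_t last)
  {
          struct vm_area_struct *vma;

          i_mmap_lock_read(mapping);
          vma_interval_tree_foreach(vma, &mapping->i_mmap, first, last) {
                  /*
                   * The rmap read lock stabilises the VMA - it cannot be
                   * freed, nor can vm_start, vm_end or vm_pgoff change here.
                   */
          }
          i_mmap_unlock_read(mapping);
  }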

We discuss page table locks separately in the dedicated section below.

The first thing **any** of these locks achieve is to **stabilise** the VMA
within the MM tree. That is, guaranteeing that the VMA object will not be
deleted from under you nor modified (except for some specific fields
described below).

Stabilising a VMA also keeps the address space described by it around.

Lock usage
----------

If you want to **read** VMA metadata fields or just keep the VMA stable, you
must do one of the following:

* Obtain an mmap read lock at the MM granularity via :c:func:`!mmap_read_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_read_unlock` when
  you're done with the VMA, *or*
* Try to obtain a VMA read lock via :c:func:`!lock_vma_under_rcu`. This tries to
  acquire the lock atomically so might fail, in which case fall-back logic is
  required to instead obtain an mmap read lock if this returns :c:macro:`!NULL`
  (see the sketch following this list), *or*
* Acquire an rmap lock before traversing the locked interval tree (whether
  anonymous or file-backed) to obtain the required VMA.
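
A reader that prefers the per-VMA lock but falls back to the mmap lock might
look roughly like the sketch below. The function name is hypothetical, and the
sketch assumes a kernel with :c:macro:`!CONFIG_PER_VMA_LOCK` (without it,
:c:func:`!lock_vma_under_rcu` simply returns :c:macro:`!NULL` and the fallback
path is always taken):

.. code-block:: c

  #include <linux/mm.h>
  #include <linux/mmap_lock.h>
  #include <linux/printk.h>

  /* Sketch: read a VMA field for the VMA covering @addr. */
  static void report_vma_flags(struct mm_struct *mm, unsigned long addr)
  {
          struct vm_area_struct *vma;

          /* Optimistically try the per-VMA read lock first... */
          vma = lock_vma_under_rcu(mm, addr);
          if (vma) {
                  pr_info("vm_flags=%lx\n", vma->vm_flags);
                  vma_end_read(vma);
                  return;
          }

          /* ...and fall back to the coarser mmap read lock if that fails. */
          mmap_read_lock(mm);
          vma = vma_lookup(mm, addr);
          if (vma)
                  pr_info("vm_flags=%lx\n", vma->vm_flags);
          mmap_read_unlock(mm);
  }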

If you want to **write** VMA metadata fields, then things vary depending on the
field (we explore each VMA field in detail below). For the majority you must:

* Obtain an mmap write lock at the MM granularity via :c:func:`!mmap_write_lock` (or a
  suitable variant), unlocking it with a matching :c:func:`!mmap_write_unlock` when
  you're done with the VMA, *and*
* Obtain a VMA write lock via :c:func:`!vma_start_write` for each VMA you wish to
  modify, which will be released automatically when :c:func:`!mmap_write_unlock` is
  called (see the sketch following this list).
* If you want to be able to write to **any** field, you must also hide the VMA
  from the reverse mapping by obtaining an **rmap write lock**.
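
For the majority case, a writer therefore looks roughly like the sketch below
(the function name is hypothetical). Note that :c:func:`!vm_flags_set` itself
also write-locks the VMA, so the explicit :c:func:`!vma_start_write` is shown
purely for clarity:

.. code-block:: c

  #include <linux/mm.h>
  #include <linux/mmap_lock.h>

  /* Sketch: set flags on the VMA covering @addr. */
  static void add_vma_flags(struct mm_struct *mm, unsigned long addr,
                            vm_flags_t flags)
  {
          struct vm_area_struct *vma;

          mmap_write_lock(mm);
          vma = vma_lookup(mm, addr);
          if (vma) {
                  /* Exclude racing VMA read lock holders (e.g. page faults). */
                  vma_start_write(vma);
                  /* __vm_flags may only be updated via vm_flags_*() helpers. */
                  vm_flags_set(vma, flags);
          }
          /* Releasing the mmap write lock also releases the VMA write lock. */
          mmap_write_unlock(mm);
  }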

VMA locks are special in that you must obtain an mmap **write** lock **first**
in order to obtain a VMA **write** lock. A VMA **read** lock however can be
obtained without any other lock (:c:func:`!lock_vma_under_rcu` will acquire then
release an RCU lock to lookup the VMA for you).

This constrains the impact of writers on readers, as a writer can interact with
one VMA while a reader interacts with another simultaneously.

.. note:: The primary users of VMA read locks are page fault handlers, which
          means that without a VMA write lock, page faults will run concurrently
          with whatever you are doing.

Examining all valid lock states:

.. table::

   ========= ======== ========= ======= ===== =========== ==========
   mmap lock VMA lock rmap lock Stable? Read? Write most? Write all?
   ========= ======== ========= ======= ===== =========== ==========
   \-        \-       \-        N       N     N           N
   \-        R        \-        Y       Y     N           N
   \-        \-       R/W       Y       Y     N           N
   R/W       \-/R     \-/R/W    Y       Y     N           N
   W         W        \-/R      Y       Y     Y           N
   W         W        W         Y       Y     Y           Y
   ========= ======== ========= ======= ===== =========== ==========

.. warning:: While it's possible to obtain a VMA lock while holding an mmap read lock,
             attempting to do the reverse is invalid as it can result in deadlock - if
             another task already holds an mmap write lock and attempts to acquire a VMA
             write lock that will deadlock on the VMA read lock.

All of these locks behave as read/write semaphores in practice, so you can
obtain either a read or a write lock for each of these.

.. note:: Generally speaking, a read/write semaphore is a class of lock which
          permits concurrent readers. However a write lock can only be obtained
          once all readers have left the critical region (and pending readers
          made to wait).

          This renders read locks on a read/write semaphore concurrent with other
          readers and write locks exclusive against all others holding the semaphore.

VMA fields
^^^^^^^^^^

We can subdivide :c:struct:`!struct vm_area_struct` fields by their purpose, which makes it
easier to explore their locking characteristics:

.. note:: We exclude VMA lock-specific fields here to avoid confusion, as these
          are in effect an internal implementation detail.

.. table:: Virtual layout fields

   ===================== ======================================== ===========
   Field                 Description                              Write lock
   ===================== ======================================== ===========
   :c:member:`!vm_start` Inclusive start virtual address of range mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_end`   Exclusive end virtual address of range   mmap write,
                         VMA describes.                           VMA write,
                                                                  rmap write.
   :c:member:`!vm_pgoff` Describes the page offset into the file, mmap write,
                         the original page offset within the      VMA write,
                         virtual address space (prior to any      rmap write.
                         :c:func:`!mremap`), or PFN if a PFN map
                         and the architecture does not support
                         :c:macro:`!CONFIG_ARCH_HAS_PTE_SPECIAL`.
   ===================== ======================================== ===========

These fields describe the size, start and end of the VMA, and as such cannot be
modified without first being hidden from the reverse mapping since these fields
are used to locate VMAs within the reverse mapping interval trees.

.. table:: Core fields

   ============================ ======================================== =========================
   Field                        Description                              Write lock
   ============================ ======================================== =========================
   :c:member:`!vm_mm`           Containing mm_struct.                    None - written once on
                                                                         initial map.
   :c:member:`!vm_page_prot`    Architecture-specific page table         mmap write, VMA write.
                                protection bits determined from VMA
                                flags.
   :c:member:`!vm_flags`        Read-only access to VMA flags describing N/A
                                attributes of the VMA, in union with
                                private writable
                                :c:member:`!__vm_flags`.
   :c:member:`!__vm_flags`      Private, writable access to VMA flags    mmap write, VMA write.
                                field, updated by
                                :c:func:`!vm_flags_*` functions.
   :c:member:`!vm_file`         If the VMA is file-backed, points to a   None - written once on
                                struct file object describing the        initial map.
                                underlying file, if anonymous then
                                :c:macro:`!NULL`.
   :c:member:`!vm_ops`          If the VMA is file-backed, then either   None - Written once on
                                the driver or file-system provides a     initial map by
                                :c:struct:`!struct vm_operations_struct` :c:func:`!f_ops->mmap()`.
                                object describing callbacks to be
                                invoked on VMA lifetime events.
   :c:member:`!vm_private_data` A :c:member:`!void *` field for          Handled by driver.
                                driver-specific metadata.
   ============================ ======================================== =========================

These are the core fields which describe the MM the VMA belongs to and its attributes.

.. table:: Config-specific fields

   ================================= ===================== ======================================== ===============
   Field                             Configuration option  Description                              Write lock
   ================================= ===================== ======================================== ===============
   :c:member:`!anon_name`            CONFIG_ANON_VMA_NAME  A field for storing a                    mmap write,
                                                           :c:struct:`!struct anon_vma_name`        VMA write.
                                                           object providing a name for anonymous
                                                           mappings, or :c:macro:`!NULL` if none
                                                           is set or the VMA is file-backed. The
                                                           underlying object is reference counted
                                                           and can be shared across multiple VMAs
                                                           for scalability.
   :c:member:`!swap_readahead_info`  CONFIG_SWAP           Metadata used by the swap mechanism      mmap read,
                                                           to perform readahead. This field is      swap-specific
                                                           accessed atomically.                     lock.
   :c:member:`!vm_policy`            CONFIG_NUMA           :c:type:`!mempolicy` object which        mmap write,
                                                           describes the NUMA behaviour of the      VMA write.
                                                           VMA. The underlying object is reference
                                                           counted.
   :c:member:`!numab_state`          CONFIG_NUMA_BALANCING :c:type:`!vma_numab_state` object which  mmap read,
                                                           describes the current state of           numab-specific
                                                           NUMA balancing in relation to this VMA.  lock.
                                                           Updated under mmap read lock by
                                                           :c:func:`!task_numa_work`.
   :c:member:`!vm_userfaultfd_ctx`   CONFIG_USERFAULTFD    Userfaultfd context wrapper object of    mmap write,
                                                           type :c:type:`!vm_userfaultfd_ctx`,      VMA write.
                                                           either of zero size if userfaultfd is
                                                           disabled, or containing a pointer
                                                           to an underlying
                                                           :c:type:`!userfaultfd_ctx` object which
                                                           describes userfaultfd metadata.
   ================================= ===================== ======================================== ===============

These fields are present or not depending on whether the relevant kernel
configuration option is set.

.. table:: Reverse mapping fields

   =================================== ========================================= ============================
   Field                               Description                               Write lock
   =================================== ========================================= ============================
   :c:member:`!shared.rb`              A red/black tree node used, if the        mmap write, VMA write,
                                       mapping is file-backed, to place the VMA  i_mmap write.
                                       in the
                                       :c:member:`!struct address_space->i_mmap`
                                       red/black interval tree.
   :c:member:`!shared.rb_subtree_last` Metadata used for management of the       mmap write, VMA write,
                                       interval tree if the VMA is file-backed.  i_mmap write.
   :c:member:`!anon_vma_chain`         List of pointers to both forked/CoW’d     mmap read, anon_vma write.
                                       :c:type:`!anon_vma` objects and
                                       :c:member:`!vma->anon_vma` if it is
                                       non-:c:macro:`!NULL`.
   :c:member:`!anon_vma`               :c:type:`!anon_vma` object used by        When :c:macro:`NULL` and
                                       anonymous folios mapped exclusively to    setting non-:c:macro:`NULL`:
                                       this VMA. Initially set by                mmap read, page_table_lock.
                                       :c:func:`!anon_vma_prepare` serialised
                                       by the :c:macro:`!page_table_lock`. This  When non-:c:macro:`NULL` and
                                       is set as soon as any page is faulted in. setting :c:macro:`NULL`:
                                                                                 mmap write, VMA write,
                                                                                 anon_vma write.
   =================================== ========================================= ============================

These fields are used to both place the VMA within the reverse mapping, and for
anonymous mappings, to be able to access both related :c:struct:`!struct anon_vma` objects
and the :c:struct:`!struct anon_vma` in which folios mapped exclusively to this VMA should
reside.

.. note:: If a file-backed mapping is mapped with :c:macro:`!MAP_PRIVATE` set
          then it can be in both the :c:type:`!anon_vma` and :c:type:`!i_mmap`
          trees at the same time, so all of these fields might be utilised at
          once.

Page tables
-----------

We won't speak exhaustively on the subject but broadly speaking, page tables map
virtual addresses to physical ones through a series of page tables, each of
which contains entries with physical addresses for the next page table level
(along with flags), and at the leaf level the physical addresses of the
underlying physical data pages or a special entry such as a swap entry,
migration entry or other special marker. Offsets into these pages are provided
by the virtual address itself.

In Linux these are divided into five levels - PGD, P4D, PUD, PMD and PTE. Huge
pages might eliminate one or two of these levels, but when this is the case we
typically refer to the leaf level as the PTE level regardless.

.. note:: In instances where the architecture supports fewer page tables than
          five the kernel cleverly 'folds' page table levels, that is stubbing
          out functions related to the skipped levels. This allows us to
          conceptually act as if there were always five levels, even if the
          compiler might, in practice, eliminate any code relating to missing
          ones.

There are four key operations typically performed on page tables:

1. **Traversing** page tables - Simply reading page tables in order to traverse
   them. This only requires that the VMA is kept stable, so a lock which
   establishes this suffices for traversal (there are also lockless variants
   which eliminate even this requirement, such as :c:func:`!gup_fast`). There is
   also a special case of page table traversal for non-VMA regions which we
   consider separately below.
2. **Installing** page table mappings - Whether creating a new mapping or
   modifying an existing one in such a way as to change its identity. This
   requires that the VMA is kept stable via an mmap or VMA lock (explicitly not
   rmap locks).
3. **Zapping/unmapping** page table entries - This is what the kernel calls
   clearing page table mappings at the leaf level only, whilst leaving all page
   tables in place. This is a very common operation in the kernel performed on
   file truncation, the :c:macro:`!MADV_DONTNEED` operation via
   :c:func:`!madvise`, and others. This is performed by a number of functions
   including :c:func:`!unmap_mapping_range` and :c:func:`!unmap_mapping_pages`.
   The VMA need only be kept stable for this operation (see the sketch
   following this list).
4. **Freeing** page tables - When finally the kernel removes page tables from a
   userland process (typically via :c:func:`!free_pgtables`) extreme care must
   be taken to ensure this is done safely, as this logic finally frees all page
   tables in the specified range, ignoring existing leaf entries (it assumes the
   caller has both zapped the range and prevented any further faults or
   modifications within it).
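
As an example of zapping, a filesystem punching a hole in a file might first
remove all leaf entries mapping the affected range. The sketch below assumes a
hypothetical helper; :c:func:`!unmap_mapping_range` itself keeps the affected
VMAs stable internally (via the i_mmap rmap lock):

.. code-block:: c

  #include <linux/fs.h>
  #include <linux/mm.h>

  /*
   * Sketch: zap all leaf page table entries mapping bytes
   * [start, start + len) of a file, e.g. ahead of a hole punch.
   */
  static void zap_file_range(struct inode *inode, loff_t start, loff_t len)
  {
          /* even_cows == 1: also zap private CoW copies of affected pages. */
          unmap_mapping_range(inode->i_mapping, start, len, 1);
  }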

.. note:: Modifying mappings for reclaim or migration is performed under rmap
          lock as it, like zapping, does not fundamentally modify the identity
          of what is being mapped.

**Traversing** and **zapping** ranges can be performed holding any one of the
locks described in the terminology section above - that is the mmap lock, the
VMA lock or either of the reverse mapping locks.

That is - as long as you keep the relevant VMA **stable** - you are good to go
ahead and perform these operations on page tables (though internally, kernel
operations that perform writes also acquire internal page table locks to
serialise - see the page table implementation detail section for more details).
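
As an illustration of traversal, one can walk a range of a process's page
tables with the page walk API while holding only the mmap read lock to keep the
VMAs stable. This is a sketch; the function names other than
:c:func:`!walk_page_range` and the :c:struct:`!struct mm_walk_ops` callback are
hypothetical:

.. code-block:: c

  #include <linux/mm.h>
  #include <linux/pagewalk.h>

  /* Count how many PTEs in the walked range are currently present. */
  static int count_present_pte(pte_t *pte, unsigned long addr,
                               unsigned long next, struct mm_walk *walk)
  {
          unsigned long *count = walk->private;

          /* The walk core maps the PTE table and holds its PTE lock here. */
          if (pte_present(ptep_get(pte)))
                  (*count)++;
          return 0;
  }

  static const struct mm_walk_ops count_ops = {
          .pte_entry = count_present_pte,
  };

  static unsigned long count_present(struct mm_struct *mm,
                                     unsigned long start, unsigned long end)
  {
          unsigned long count = 0;

          /* Any lock that keeps the VMAs stable suffices; here, mmap read. */
          mmap_read_lock(mm);
          walk_page_range(mm, start, end, &count_ops, &count);
          mmap_read_unlock(mm);

          return count;
  }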

.. note:: We free empty PTE tables on zap under the RCU lock - this does not
          change the aforementioned locking requirements around zapping.

When **installing** page table entries, the mmap or VMA lock must be held to
keep the VMA stable. We explore why this is in the page table locking details
section below.

**Freeing** page tables is an entirely internal memory management operation and
has special requirements (see the page freeing section below for more details).

.. warning:: When **freeing** page tables, it must not be possible for VMAs
             containing the ranges those page tables map to be accessible via
             the reverse mapping.

             The :c:func:`!free_pgtables` function removes the relevant VMAs
             from the reverse mappings, but no other VMAs can be permitted to be
             accessible and span the specified range.

Traversing non-VMA page tables
------------------------------

We've focused above on traversal of page tables belonging to VMAs. It is also
possible to traverse page tables which are not represented by VMAs.

Kernel page table mappings themselves are generally managed by whatever part of
the kernel established them and the aforementioned locking rules do not apply -
for instance vmalloc has its own set of locks which are utilised for
establishing and tearing down its page tables.

However, for convenience we provide the :c:func:`!walk_kernel_page_table_range`
function which is synchronised via the mmap lock on the :c:macro:`!init_mm`
kernel instantiation of the :c:struct:`!struct mm_struct` metadata object.

If an operation requires exclusive access, a write lock is used, but if not, a
read lock suffices - we assert only that at least a read lock has been acquired.

Since, aside from vmalloc and memory hot plug, kernel page tables are not torn
down all that often, this usually suffices; however, any caller of this
functionality must ensure that any additionally required locks are acquired in
advance.

We also permit a truly unusual case - the traversal of non-VMA ranges within
**userland** address space, as provided for by :c:func:`!walk_page_range_debug`.

This has only one user - the general page table dumping logic (implemented in
:c:macro:`!mm/ptdump.c`) - which seeks to expose all mappings for debug purposes
even if they are highly unusual (possibly architecture-specific) and are not
backed by a VMA.

We must take great care in this case, as the :c:func:`!munmap` implementation
detaches VMAs under an mmap write lock before tearing down page tables under a
downgraded mmap read lock.

This means such a traversal could race with :c:func:`!munmap`, and thus an mmap
**write** lock is required.

Lock ordering
-------------

As we have multiple locks across the kernel which may or may not be taken at the
same time as explicit mm or VMA locks, we have to be wary of lock inversion, and
the **order** in which locks are acquired and released becomes very important.

.. note:: Lock inversion occurs when two threads need to acquire multiple locks,
   but in doing so inadvertently cause a mutual deadlock.

   For example, consider thread 1 which holds lock A and tries to acquire lock B,
   while thread 2 holds lock B and tries to acquire lock A.

   Both threads are now deadlocked on each other. However, had they attempted to
   acquire locks in the same order, one would have waited for the other to
   complete its work and no deadlock would have occurred.

The opening comment in :c:macro:`!mm/rmap.c` describes in detail the required
ordering of locks within memory management code:

.. code-block::

  inode->i_rwsem        (while writing or truncating, not reading or faulting)
    mm->mmap_lock
      mapping->invalidate_lock (in filemap_fault)
        folio_lock
          hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below)
            vma_start_write
              mapping->i_mmap_rwsem
                anon_vma->rwsem
                  mm->page_table_lock or pte_lock
                    swap_lock (in swap_duplicate, swap_info_get)
                      mmlist_lock (in mmput, drain_mmlist and others)
                      mapping->private_lock (in block_dirty_folio)
                          i_pages lock (widely used)
                            lruvec->lru_lock (in folio_lruvec_lock_irq)
                      inode->i_lock (in set_page_dirty's __mark_inode_dirty)
                      bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty)
                        sb_lock (within inode_lock in fs/fs-writeback.c)
                        i_pages lock (widely used, in set_page_dirty,
                                  in arch-dependent flush_dcache_mmap_lock,
                                  within bdi.wb->list_lock in __sync_single_inode)

There is also a file-system specific lock ordering comment located at the top of
:c:macro:`!mm/filemap.c`:

.. code-block::

  ->i_mmap_rwsem                        (truncate_pagecache)
    ->private_lock                      (__free_pte->block_dirty_folio)
      ->swap_lock                       (exclusive_swap_page, others)
        ->i_pages lock

  ->i_rwsem
    ->invalidate_lock                   (acquired by fs in truncate path)
      ->i_mmap_rwsem                    (truncate->unmap_mapping_range)

  ->mmap_lock
    ->i_mmap_rwsem
      ->page_table_lock or pte_lock     (various, mainly in memory.c)
        ->i_pages lock                  (arch-dependent flush_dcache_mmap_lock)

  ->mmap_lock
    ->invalidate_lock                   (filemap_fault)
      ->lock_page                       (filemap_fault, access_process_vm)

  ->i_rwsem                             (generic_perform_write)
    ->mmap_lock                         (fault_in_readable->do_page_fault)

  bdi->wb.list_lock
    sb_lock                             (fs/fs-writeback.c)
    ->i_pages lock                      (__sync_single_inode)

  ->i_mmap_rwsem
    ->anon_vma.lock                     (vma_merge)

  ->anon_vma.lock
    ->page_table_lock or pte_lock       (anon_vma_prepare and various)

  ->page_table_lock or pte_lock
    ->swap_lock                         (try_to_unmap_one)
    ->private_lock                      (try_to_unmap_one)
    ->i_pages lock                      (try_to_unmap_one)
    ->lruvec->lru_lock                  (follow_page_mask->mark_page_accessed)
    ->lruvec->lru_lock                  (check_pte_range->folio_isolate_lru)
    ->private_lock                      (folio_remove_rmap_pte->set_page_dirty)
    ->i_pages lock                      (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock                   (folio_remove_rmap_pte->set_page_dirty)
    ->inode->i_lock                     (folio_remove_rmap_pte->set_page_dirty)
    bdi.wb->list_lock                   (zap_pte_range->set_page_dirty)
    ->inode->i_lock                     (zap_pte_range->set_page_dirty)
    ->private_lock                      (zap_pte_range->block_dirty_folio)

Please check the current state of these comments which may have changed since
the time of writing of this document.

------------------------------
Locking Implementation Details
------------------------------

.. warning:: Locking rules for PTE-level page tables are very different from
             locking rules for page tables at other levels.

Page table locking details
--------------------------

.. note:: This section explores page table locking requirements for page tables
          encompassed by a VMA. See the above section on non-VMA page table
          traversal for details on how we handle that case.

In addition to the locks described in the terminology section above, we have
additional locks dedicated to page tables:

* **Higher level page table locks** - Higher level page tables, that is PGD, P4D
  and PUD each make use of the process address space granularity
  :c:member:`!mm->page_table_lock` lock when modified.

* **Fine-grained page table locks** - PMDs and PTEs each have fine-grained locks
  either kept within the folios describing the page tables or allocated
  separately and pointed at by the folios if :c:macro:`!ALLOC_SPLIT_PTLOCKS` is
  set. The PMD spin lock is obtained via :c:func:`!pmd_lock`, however PTEs are
  mapped into higher memory (if a 32-bit system) and carefully locked via
  :c:func:`!pte_offset_map_lock`.

These locks represent the minimum required to interact with each page table
level, but there are further requirements.

Importantly, note that on a **traversal** of page tables, sometimes no such
locks are taken. However, at the PTE level, at least concurrent page table
deletion must be prevented (using RCU) and the page table must be mapped into
high memory, see below.

Whether care is taken on reading the page table entries depends on the
architecture, see the section on atomicity below.

Locking rules
^^^^^^^^^^^^^

We establish basic locking rules when interacting with page tables:

* When changing a page table entry the page table lock for that page table
  **must** be held, except if you can safely assume nobody can access the page
  tables concurrently (such as on invocation of :c:func:`!free_pgtables`).
* Reads from and writes to page table entries must be *appropriately*
  atomic. See the section on atomicity below for details.
* Populating previously empty entries requires that the mmap or VMA locks are
  held (read or write); doing so with only rmap locks would be dangerous (see
  the warning below and the sketch that follows it).
* As mentioned previously, zapping can be performed while simply keeping the VMA
  stable, that is holding any one of the mmap, VMA or rmap locks.

.. warning:: Populating previously empty entries is dangerous as, when unmapping
             VMAs, :c:func:`!vms_clear_ptes` has a window of time between
             zapping (via :c:func:`!unmap_vmas`) and freeing page tables (via
             :c:func:`!free_pgtables`), where the VMA is still visible in the
             rmap tree. :c:func:`!free_pgtables` assumes that the zap has
             already been performed and removes PTEs unconditionally (along with
             all other page tables in the freed range), so installing new PTE
             entries could leak memory and also cause other unexpected and
             dangerous behaviour.
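
Putting these rules together, installing a mapping into a previously empty PTE
slot might look roughly like the sketch below. The function name is
hypothetical; the caller is assumed to hold an mmap or VMA lock keeping the VMA
stable, the PMD is assumed to already point to a PTE table, and reference
counting and rmap bookkeeping are deliberately omitted:

.. code-block:: c

  #include <linux/mm.h>

  /* Sketch: install a mapping of @page at @addr if the slot is still empty. */
  static int install_page(struct vm_area_struct *vma, pmd_t *pmd,
                          unsigned long addr, struct page *page)
  {
          struct mm_struct *mm = vma->vm_mm;
          spinlock_t *ptl;
          pte_t *ptep;
          int ret = 0;

          ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
          if (!ptep)
                  return -EAGAIN;         /* PTE table vanished; retry walk. */

          if (pte_none(ptep_get(ptep)))
                  set_pte_at(mm, addr, ptep, mk_pte(page, vma->vm_page_prot));
          else
                  ret = -EEXIST;          /* Someone else populated it first. */

          pte_unmap_unlock(ptep, ptl);
          return ret;
  }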

There are additional rules applicable when moving page tables, which we discuss
in the section on this topic below.

PTE-level page tables are different from page tables at other levels, and there
are extra requirements for accessing them:

* On 32-bit architectures, they may be in high memory (meaning they need to be
  mapped into kernel memory to be accessible).
* When empty, they can be unlinked and RCU-freed while holding an mmap lock or
  rmap lock for reading in combination with the PTE and PMD page table locks.
  In particular, this happens in :c:func:`!retract_page_tables` when handling
  :c:macro:`!MADV_COLLAPSE`.
  So accessing PTE-level page tables requires at least holding an RCU read lock;
  but that only suffices for readers that can tolerate racing with concurrent
  page table updates such that an empty PTE is observed (in a page table that
  has actually already been detached and marked for RCU freeing) while another
  new page table has been installed in the same location and filled with
  entries. Writers normally need to take the PTE lock and revalidate that the
  PMD entry still refers to the same PTE-level page table.
  If the writer does not care whether it is the same PTE-level page table, it
  can take the PMD lock and revalidate that the contents of the PMD entry still
  meet the requirements. In particular, this also happens in
  :c:func:`!retract_page_tables` when handling :c:macro:`!MADV_COLLAPSE`.

To access PTE-level page tables, a helper like :c:func:`!pte_offset_map_lock` or
:c:func:`!pte_offset_map` can be used depending on stability requirements.
These map the page table into kernel memory if required, take the RCU lock, and
depending on variant, may also look up or acquire the PTE lock.
See the comment on :c:func:`!__pte_offset_map_lock`.
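
For example, a reader that needs a stable view of a single PTE might do the
following (a sketch; the function name is hypothetical and the caller is
assumed to have already stabilised the VMA):

.. code-block:: c

  #include <linux/mm.h>

  /* Sketch: report whether the PTE mapping @addr is currently present. */
  static bool pte_is_present(struct mm_struct *mm, pmd_t *pmd,
                             unsigned long addr)
  {
          spinlock_t *ptl;
          pte_t *ptep;
          bool present;

          ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
          if (!ptep) {
                  /*
                   * The PTE table disappeared under us (e.g. due to
                   * MADV_COLLAPSE); callers typically re-walk from the PMD.
                   */
                  return false;
          }

          present = pte_present(ptep_get(ptep));

          pte_unmap_unlock(ptep, ptl);
          return present;
  }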

Atomicity
^^^^^^^^^

Regardless of page table locks, the MMU hardware concurrently updates accessed
and dirty bits (perhaps more, depending on architecture). Additionally, page
table traversal operations proceed in parallel (though holding the VMA stable),
and functionality like GUP-fast locklessly traverses (that is, reads) page
tables without even keeping the VMA stable at all.

When performing a page table traversal and keeping the VMA stable, whether a
read must be performed once and only once or not depends on the architecture
(for instance x86-64 does not require any special precautions).

If a write is being performed, or if a read informs whether a write takes place
(on an installation of a page table entry say, for instance in
:c:func:`!__pud_install`), special care must always be taken. In these cases we
can never assume that page table locks give us entirely exclusive access, and
must retrieve page table entries once and only once.

If we are reading page table entries, then we need only ensure that the compiler
does not rearrange our loads. This is achieved via :c:func:`!pXXp_get`
functions - :c:func:`!pgdp_get`, :c:func:`!p4dp_get`, :c:func:`!pudp_get`,
:c:func:`!pmdp_get`, and :c:func:`!ptep_get`.

Each of these uses :c:func:`!READ_ONCE` to guarantee that the compiler reads
the page table entry only once.

However, if we wish to manipulate an existing page table entry and care about
the previously stored data, we must go further and use a hardware atomic
operation as, for example, in :c:func:`!ptep_get_and_clear`.

Equally, operations that do not rely on the VMA being held stable, such as
GUP-fast (see :c:func:`!gup_fast` and its various page table level handlers like
:c:func:`!gup_fast_pte_range`), must very carefully interact with page table
entries, using functions such as :c:func:`!ptep_get_lockless` and equivalent for
higher level page table levels.

Writes to page table entries must also be appropriately atomic, as established
by :c:func:`!set_pXX` functions - :c:func:`!set_pgd`, :c:func:`!set_p4d`,
:c:func:`!set_pud`, :c:func:`!set_pmd`, and :c:func:`!set_pte`.

Equally functions which clear page table entries must be appropriately atomic,
as in :c:func:`!pXX_clear` functions - :c:func:`!pgd_clear`,
:c:func:`!p4d_clear`, :c:func:`!pud_clear`, :c:func:`!pmd_clear`, and
:c:func:`!pte_clear`.
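
The following sketch contrasts the two cases for a PTE (it assumes the caller
holds the PTE lock for :c:member:`!ptep`):

.. code-block:: c

  #include <linux/mm.h>

  static void pte_atomicity_example(struct mm_struct *mm, unsigned long addr,
                                    pte_t *ptep)
  {
          /* Plain read: ptep_get() uses READ_ONCE() - never *ptep directly. */
          pte_t pte = ptep_get(ptep);

          if (pte_none(pte))
                  return;

          /*
           * If the old value matters (e.g. hardware may have set the accessed
           * or dirty bits), the read and the clear must be a single atomic
           * operation.
           */
          pte = ptep_get_and_clear(mm, addr, ptep);
  }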

Page table installation
^^^^^^^^^^^^^^^^^^^^^^^

Page table installation is performed with the VMA held stable explicitly by an
mmap or VMA lock in read or write mode (see the warning in the locking rules
section for details as to why).

When allocating a P4D, PUD or PMD and setting the relevant entry in the above
PGD, P4D or PUD, the :c:member:`!mm->page_table_lock` must be held. This is
acquired in :c:func:`!__p4d_alloc`, :c:func:`!__pud_alloc` and
:c:func:`!__pmd_alloc` respectively.

.. note:: :c:func:`!__pmd_alloc` actually invokes :c:func:`!pud_lock` and
   :c:func:`!pud_lockptr` in turn, however at the time of writing it ultimately
   references the :c:member:`!mm->page_table_lock`.

Allocating a PTE will either use the :c:member:`!mm->page_table_lock` or, if
:c:macro:`!USE_SPLIT_PMD_PTLOCKS` is defined, a lock embedded in the PMD
physical page metadata in the form of a :c:struct:`!struct ptdesc`, acquired by
:c:func:`!pmd_ptdesc` called from :c:func:`!pmd_lock` and ultimately
:c:func:`!__pte_alloc`.

Finally, modifying the contents of the PTE requires special treatment, as the
PTE page table lock must be acquired whenever we want stable and exclusive
access to entries contained within a PTE, especially when we wish to modify
them.

This is performed via :c:func:`!pte_offset_map_lock` which carefully checks to
ensure that the PTE hasn't changed from under us, ultimately invoking
:c:func:`!pte_lockptr` to obtain a spin lock at PTE granularity contained within
the :c:struct:`!struct ptdesc` associated with the physical PTE page. The lock
must be released via :c:func:`!pte_unmap_unlock`.

.. note:: There are some variants on this, such as
   :c:func:`!pte_offset_map_rw_nolock` when we know we hold the PTE stable but
   for brevity we do not explore this.  See the comment for
   :c:func:`!__pte_offset_map_lock` for more details.

When modifying data in ranges we typically only wish to allocate higher page
tables as necessary, using these locks to avoid races or overwriting anything,
and set/clear data at the PTE level as required (for instance when page faulting
or zapping).

A typical pattern taken when traversing page table entries to install a new
mapping is to optimistically determine whether the page table entry in the table
above is empty and, if so, only then acquire the page table lock, checking again
to see if it was allocated underneath us.

This allows for a traversal with page table locks only being taken when
required. An example of this is :c:func:`!__pud_alloc`.
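
A simplified sketch of this pattern, loosely modelled on :c:func:`!__pud_alloc`
(the accounting and memory barrier present in the real function are omitted,
and the function name is hypothetical):

.. code-block:: c

  #include <linux/mm.h>
  #include <asm/pgalloc.h>

  /*
   * Sketch: make sure a PUD table exists below @p4d for @address. Typically
   * only called after an unlocked check found the P4D entry empty.
   */
  static int ensure_pud_table(struct mm_struct *mm, p4d_t *p4d,
                              unsigned long address)
  {
          /* Optimistic, unlocked allocation of a new PUD table. */
          pud_t *new = pud_alloc_one(mm, address);

          if (!new)
                  return -ENOMEM;

          spin_lock(&mm->page_table_lock);
          if (!p4d_present(*p4d))
                  p4d_populate(mm, p4d, new);     /* Still empty - install it. */
          else
                  pud_free(mm, new);              /* Lost the race - discard.  */
          spin_unlock(&mm->page_table_lock);

          return 0;
  }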

At the leaf page table, that is the PTE, we can't entirely rely on this pattern
as we have separate PMD and PTE locks and a THP collapse for instance might have
eliminated the PMD entry as well as the PTE from under us.

This is why :c:func:`!__pte_offset_map_lock` locklessly retrieves the PMD entry
for the PTE, carefully checking it is as expected, before acquiring the
PTE-specific lock, and then *again* checking that the PMD entry is as expected.

If a THP collapse (or similar) were to occur then the lock on both pages would
be acquired, so we can ensure this is prevented while the PTE lock is held.

Installing entries this way ensures mutual exclusion on write.

Page table freeing
^^^^^^^^^^^^^^^^^^

Tearing down page tables themselves is something that requires significant
care. There must be no way that page tables designated for removal can be
traversed or referenced by concurrent tasks.

It is insufficient to simply hold an mmap write lock and VMA lock (which will
prevent racing faults, and rmap operations), as a file-backed mapping can be
truncated under the :c:struct:`!struct address_space->i_mmap_rwsem` alone.

As a result, no VMA which can be accessed via the reverse mapping (either
through the :c:struct:`!struct anon_vma->rb_root` or the :c:member:`!struct
address_space->i_mmap` interval trees) can have its page tables torn down.

The operation is typically performed via :c:func:`!free_pgtables`, which assumes
either the mmap write lock has been taken (as specified by its
:c:member:`!mm_wr_locked` parameter), or that the VMA is already unreachable.

It carefully removes the relevant VMAs from all reverse mappings; however, it is
important that no new VMAs overlapping the range become accessible, and that no
other route remains by which addresses within the range whose page tables are
being torn down might be reached.

Additionally, it assumes that a zap has already been performed and steps have
been taken to ensure that no further page table entries can be installed between
the zap and the invocation of :c:func:`!free_pgtables`.

Since it is assumed that all such steps have been taken, page table entries are
cleared without page table locks (in the :c:func:`!pgd_clear`, :c:func:`!p4d_clear`,
:c:func:`!pud_clear`, and :c:func:`!pmd_clear` functions).

.. note:: It is possible for leaf page tables to be torn down independent of
          the page tables above it as is done by
          :c:func:`!retract_page_tables`, which is performed under the i_mmap
          read lock, PMD, and PTE page table locks, without this level of care.

Page table moving
^^^^^^^^^^^^^^^^^

Some functions manipulate page table levels above PMD (that is PUD, P4D and PGD
page tables). Most notable of these is :c:func:`!mremap`, which is capable of
moving higher level page tables.

In these instances, it is required that **all** locks are taken, that is
the mmap lock, the VMA lock and the relevant rmap locks.

You can observe this in the :c:func:`!mremap` implementation in the functions
:c:func:`!take_rmap_locks` and :c:func:`!drop_rmap_locks` which perform the rmap
side of lock acquisition, invoked ultimately by :c:func:`!move_page_tables`.

VMA lock internals
------------------

Overview
^^^^^^^^

VMA read locking is entirely optimistic - if the lock is contended or a competing
write has started, then we do not obtain a read lock.

A VMA **read** lock is obtained by :c:func:`!lock_vma_under_rcu`, which first
calls :c:func:`!rcu_read_lock` to ensure that the VMA is looked up in an RCU
critical section, then attempts to VMA lock it via :c:func:`!vma_start_read`,
before releasing the RCU lock via :c:func:`!rcu_read_unlock`.

In cases when the user already holds the mmap read lock, :c:func:`!vma_start_read_locked`
and :c:func:`!vma_start_read_locked_nested` can be used. These functions do not
fail due to lock contention but the caller should still check their return values
in case they fail for other reasons.

VMA read locks increment the :c:member:`!vma.vm_refcnt` reference counter for
their duration and the caller of :c:func:`!lock_vma_under_rcu` must drop it via
:c:func:`!vma_end_read`.

VMA **write** locks are acquired via :c:func:`!vma_start_write` in instances where a
VMA is about to be modified, unlike :c:func:`!vma_start_read` the lock is always
acquired. An mmap write lock **must** be held for the duration of the VMA write
lock, releasing or downgrading the mmap write lock also releases the VMA write
lock so there is no :c:func:`!vma_end_write` function.

Note that when write-locking a VMA lock, the :c:member:`!vma.vm_refcnt` is temporarily
modified so that readers can detect the presence of a writer. The reference counter is
restored once the vma sequence number used for serialisation is updated.

This ensures the semantics we require - VMA write locks provide exclusive write
access to the VMA.

Implementation details
^^^^^^^^^^^^^^^^^^^^^^

The VMA lock mechanism is designed to be a lightweight means of avoiding the use
of the heavily contended mmap lock. It is implemented using a combination of a
reference counter and sequence numbers belonging to the containing
:c:struct:`!struct mm_struct` and the VMA.

Read locks are acquired via :c:func:`!vma_start_read`, which is an optimistic
operation, i.e. it tries to acquire a read lock but returns false if it is
unable to do so. At the end of the read operation, :c:func:`!vma_end_read` is
called to release the VMA read lock.

Invoking :c:func:`!vma_start_read` requires that :c:func:`!rcu_read_lock` has
been called first, establishing that we are in an RCU critical section upon VMA
read lock acquisition. Once acquired, the RCU lock can be released as it is only
required for lookup. This is abstracted by :c:func:`!lock_vma_under_rcu` which
is the interface a user should use.

Writing requires the mmap to be write-locked and the VMA lock to be acquired via
:c:func:`!vma_start_write`, however the write lock is released by the termination or
downgrade of the mmap write lock so no :c:func:`!vma_end_write` is required.

All this is achieved by the use of per-mm and per-VMA sequence counts, which are
used in order to reduce complexity, especially for operations which write-lock
multiple VMAs at once.

If the mm sequence count, :c:member:`!mm->mm_lock_seq` is equal to the VMA
sequence count :c:member:`!vma->vm_lock_seq` then the VMA is write-locked. If
they differ, then it is not.

Each time the mmap write lock is released in :c:func:`!mmap_write_unlock` or
:c:func:`!mmap_write_downgrade`, :c:func:`!vma_end_write_all` is invoked which
also increments :c:member:`!mm->mm_lock_seq` via
:c:func:`!mm_lock_seqcount_end`.

This way, we ensure that, regardless of the VMA's sequence number, a write lock
is never incorrectly indicated and that when we release an mmap write lock we
efficiently release **all** VMA write locks contained within the mmap at the
same time.

Since the mmap write lock is exclusive against others who hold it, the automatic
release of any VMA locks on its release makes sense, as you would never want to
keep VMAs locked across entirely separate write operations. It also maintains
correct lock ordering.

Each time a VMA read lock is acquired, we increment :c:member:`!vma.vm_refcnt`
reference counter and check that the sequence count of the VMA does not match
that of the mm.

If it does, the read lock fails and :c:member:`!vma.vm_refcnt` is dropped.
If it does not, we keep the reference counter raised, excluding writers, but
permitting other readers, who can also obtain this lock under RCU.

Importantly, maple tree operations performed in :c:func:`!lock_vma_under_rcu`
are also RCU safe, so the whole read lock operation is guaranteed to function
correctly.

On the write side, we set a bit in :c:member:`!vma.vm_refcnt` which can't be
modified by readers and wait for all readers to drop their reference count.
Once there are no readers, the VMA's sequence number is set to match that of
the mm. During this entire operation the mmap write lock is held.

This way, if any read locks are in effect, :c:func:`!vma_start_write` will sleep
until these are finished and mutual exclusion is achieved.

After setting the VMA's sequence number, the bit in :c:member:`!vma.vm_refcnt`
indicating a writer is cleared. From this point on, the VMA's sequence number
will indicate the VMA's write-locked state until the mmap write lock is dropped
or downgraded.

This clever combination of a reference counter and sequence count allows for
fast RCU-based per-VMA lock acquisition (especially on page fault, though
utilised elsewhere) with minimal complexity around lock ordering.

mmap write lock downgrading
---------------------------

When an mmap write lock is held one has exclusive access to resources within the
mmap (with the usual caveats about requiring VMA write locks to avoid races with
tasks holding VMA read locks).

It is then possible to **downgrade** from a write lock to a read lock via
:c:func:`!mmap_write_downgrade` which, similar to :c:func:`!mmap_write_unlock`,
implicitly terminates all VMA write locks via :c:func:`!vma_end_write_all`, but
importantly does not relinquish the mmap lock while downgrading, therefore
keeping the locked virtual address space stable.
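
A caller might use this, for instance, to perform destructive updates under the
write lock and then continue with a longer read-only phase without ever letting
another writer in. A sketch (the function name is hypothetical and the actual
work is elided to comments):

.. code-block:: c

  #include <linux/mm.h>
  #include <linux/mmap_lock.h>

  static void adjust_then_scan(struct mm_struct *mm)
  {
          mmap_write_lock(mm);
          /* ... modify VMAs: requires the mmap write lock ... */

          /*
           * Downgrade: releases all VMA write locks and lets readers in,
           * but no new writer can take the mmap lock until we unlock.
           */
          mmap_write_downgrade(mm);

          /* ... longer read-only work under the read lock ... */
          mmap_read_unlock(mm);
  }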

An interesting consequence of this is that downgraded locks are exclusive
against any other task possessing a downgraded lock (since a racing task would
have to acquire a write lock first to downgrade it, and the downgraded lock
prevents a new write lock from being obtained until the original lock is
released).

For clarity, we map read (R)/downgraded write (D)/write (W) locks against one
another showing which locks exclude the others:

.. list-table:: Lock exclusivity
   :widths: 5 5 5 5
   :header-rows: 1
   :stub-columns: 1

   * -
     - R
     - D
     - W
   * - R
     - N
     - N
     - Y
   * - D
     - N
     - Y
     - Y
   * - W
     - Y
     - Y
     - Y

Here a Y indicates the locks in the matching row/column are mutually exclusive,
and N indicates that they are not.

Stack expansion
---------------

Stack expansion throws up additional complexities in that we cannot permit there
to be racing page faults; as a result, we invoke :c:func:`!vma_start_write` to
prevent this in :c:func:`!expand_downwards` or :c:func:`!expand_upwards`.
