When do you need to notify inside page table lock?
===================================================

When clearing a pte/pmd we are given a choice to notify the event under the
page table lock (the notify versions of \*_clear_flush call
mmu_notifier_invalidate_range). But that notification is not necessary in all
cases.

For a secondary TLB (a non-CPU TLB) such as an IOMMU TLB or a device TLB (when
the device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only two cases when
you need to notify those secondary TLBs while holding the page table lock when
clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to
a page that might now be used by some completely different task.

Case B is more subtle. For correctness it requires the following sequence to
happen:

  - take page table lock
  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model like C11 or C++11 for
the device.
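
The three steps above can be illustrated with a small user-space model. This is
only a sketch, not kernel code: ``struct toy_mm``, ``toy_dev_translate()``,
``toy_invalidate_range()`` and ``toy_replace_page()`` are hypothetical names
invented for the illustration, the page table is a plain array, and the page
table lock is reduced to a flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NPAGES  4      /* hypothetical: tiny address space, one pte per page */
#define TOY_NO_PAGE (-1)   /* marks a cleared pte or an empty TLB slot */

/* Toy model: the "CPU page table" maps a virtual page to a physical page id,
 * and the "device TLB" caches translations the device has already used. */
struct toy_mm {
	int pte[TOY_NPAGES];      /* CPU page table entries */
	int dev_tlb[TOY_NPAGES];  /* secondary TLB, TOY_NO_PAGE when empty */
	bool ptl_held;            /* stands in for the page table lock */
};

static void toy_mm_init(struct toy_mm *mm)
{
	for (size_t i = 0; i < TOY_NPAGES; i++) {
		mm->pte[i] = (int)i;          /* identity map to start with */
		mm->dev_tlb[i] = TOY_NO_PAGE;
	}
	mm->ptl_held = false;
}

/* Device access: use the cached translation if there is one, otherwise walk
 * the CPU page table and cache the result (loosely modelled on ATS/PASID). */
static int toy_dev_translate(struct toy_mm *mm, size_t vpage)
{
	assert(vpage < TOY_NPAGES);
	if (mm->dev_tlb[vpage] == TOY_NO_PAGE)
		mm->dev_tlb[vpage] = mm->pte[vpage];
	return mm->dev_tlb[vpage];
}

/* Stand-in for the notification issued by the notify variants of the
 * clear_flush helpers: drop the cached secondary translation. */
static void toy_invalidate_range(struct toy_mm *mm, size_t vpage)
{
	mm->dev_tlb[vpage] = TOY_NO_PAGE;
}

/* Case B update: lock, clear and notify, then install the new page. */
static void toy_replace_page(struct toy_mm *mm, size_t vpage, int new_page)
{
	assert(vpage < TOY_NPAGES);
	mm->ptl_held = true;               /* take page table lock */
	mm->pte[vpage] = TOY_NO_PAGE;      /* clear page table entry ... */
	toy_invalidate_range(mm, vpage);   /* ... and notify while locked */
	mm->pte[vpage] = new_page;         /* point the entry at the new page */
	mm->ptl_held = false;              /* release page table lock */
}
```

In this model, a device translation taken before ``toy_replace_page()`` returns
the old page and one taken after it returns the new page. Dropping the
``toy_invalidate_range()`` call would leave the old translation cached.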

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE; we
assume they are write protected for COW (the other cases of B apply too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+6] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees the
new value for addrB before seeing the new value for addrA. This breaks total
memory ordering for the device.
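
The scenario can be replayed sequentially in a small user-space model to make
the ordering violation concrete. This is a sketch under simplifying
assumptions, not kernel code: every name in it (``struct toy_state``,
``toy_dev_read()``, ``toy_cow()``, ``toy_replay()``) is hypothetical, each page
holds a single integer, threads are replaced by a fixed order of events, and
the end-of-range notification is modelled as dropping the device TLB entry for
addrB only.

```c
#include <stdbool.h>

enum { ADDR_A, ADDR_B, TOY_NADDR };               /* one page per address */
enum { OLD_A, OLD_B, NEW_A, NEW_B, TOY_NPAGES };  /* physical page ids */
#define TOY_NO_PAGE (-1)

struct toy_state {
	int mem[TOY_NPAGES];     /* page contents, one int per page */
	int pte[TOY_NADDR];      /* CPU page table */
	int dev_tlb[TOY_NADDR];  /* device (secondary) TLB */
};

struct toy_result {
	int dev_a;  /* value the device finally reads at addrA */
	int dev_b;  /* value the device finally reads at addrB */
};

/* Device read: go through the device TLB, filling it from the CPU page
 * table on a miss. */
static int toy_dev_read(struct toy_state *s, int addr)
{
	if (s->dev_tlb[addr] == TOY_NO_PAGE)
		s->dev_tlb[addr] = s->pte[addr];
	return s->mem[s->dev_tlb[addr]];
}

/* COW update of one entry; 'notify' selects whether the secondary TLB is
 * invalidated between clearing the entry and installing the new page. */
static void toy_cow(struct toy_state *s, int addr, int new_page, bool notify)
{
	s->mem[new_page] = s->mem[s->pte[addr]];  /* copy old content */
	s->pte[addr] = TOY_NO_PAGE;               /* clear the entry */
	if (notify)
		s->dev_tlb[addr] = TOY_NO_PAGE;   /* notify under the lock */
	s->pte[addr] = new_page;                  /* point at the new page */
}

static struct toy_result toy_replay(bool notify_under_lock)
{
	struct toy_state s = {
		.mem = { 0, 0, 0, 0 },
		.pte = { OLD_A, OLD_B },
		.dev_tlb = { TOY_NO_PAGE, TOY_NO_PAGE },
	};
	struct toy_result r;

	/* Time N: the device reads both addresses and populates its TLB. */
	(void)toy_dev_read(&s, ADDR_A);
	(void)toy_dev_read(&s, ADDR_B);

	/* Time N+2: both entries are updated to point to new pages. */
	toy_cow(&s, ADDR_A, NEW_A, notify_under_lock);
	toy_cow(&s, ADDR_B, NEW_B, notify_under_lock);

	/* Other CPU threads write addrA first, then addrB (new pages). */
	s.mem[s.pte[ADDR_A]] = 1;
	s.mem[s.pte[ADDR_B]] = 1;

	/* Only the addrB updater reaches the end of its range; the addrA
	 * updater is still preempted. */
	s.dev_tlb[ADDR_B] = TOY_NO_PAGE;

	/* Final device reads. */
	r.dev_a = toy_dev_read(&s, ADDR_A);
	r.dev_b = toy_dev_read(&s, ADDR_B);
	return r;
}
```

With ``toy_replay(false)`` the device still reads the old value at addrA while
already reading the new value at addrB, even though addrA was written first.
With ``toy_replay(true)`` both reads return the new values.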

When changing a pte to write protect, or to point to a new write protected page
with the same content (KSM), it is fine to delay the
mmu_notifier_invalidate_range call to mmu_notifier_invalidate_range_end()
outside the page table lock. This is true even if the thread doing the page
table update is preempted right after releasing the page table lock but before
calling mmu_notifier_invalidate_range_end().