When do you need to notify inside page table lock?
==================================================

When clearing a pte/pmd we can choose to notify the event under the page
table lock (the notify versions of \*_clear_flush call
mmu_notifier_invalidate_range). That notification is not necessary in all
cases, however.

For secondary TLBs (non-CPU TLBs) such as an IOMMU TLB or a device TLB (when a
device uses something like ATS/PASID to get the IOMMU to walk the CPU page
table to access a process virtual address space), there are only two cases in
which you need to notify those secondary TLBs while holding the page table
lock when clearing a pte/pmd:

  A) the page backing the address is freed before
     mmu_notifier_invalidate_range_end()
  B) a page table entry is updated to point to a new page (COW, write fault
     on zero page, __replace_page(), ...)

Case A is obvious: you do not want to take the risk of the device writing to
a page that might now be used by some completely different task.

Case B is more subtle. For correctness it requires the following sequence to
happen:

  - take page table lock
  - clear page table entry and notify ([pmd/pte]p_huge_clear_flush_notify())
  - set page table entry to point to new page

If clearing the page table entry is not followed by a notify before setting
the new pte/pmd value, then you can break a memory model such as C11 or C++11
for the device.

Consider the following scenario (the device uses a feature similar to
ATS/PASID):

Take two addresses addrA and addrB such that \|addrA - addrB\| >= PAGE_SIZE,
and assume they are write protected for COW (the other cases of B apply too).

::

 [Time N] --------------------------------------------------------------------
 CPU-thread-0  {try to write to addrA}
 CPU-thread-1  {try to write to addrB}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA and populate device TLB}
 DEV-thread-2  {read addrB and populate device TLB}
 [Time N+1] ------------------------------------------------------------------
 CPU-thread-0  {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
 CPU-thread-1  {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+2] ------------------------------------------------------------------
 CPU-thread-0  {COW_step1: {update page table to point to new page for addrA}}
 CPU-thread-1  {COW_step1: {update page table to point to new page for addrB}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+3] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {write to addrA which is a write to new page}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+4] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {preempted}
 CPU-thread-2  {}
 CPU-thread-3  {write to addrB which is a write to new page}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+5] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {}
 DEV-thread-2  {}
 [Time N+6] ------------------------------------------------------------------
 CPU-thread-0  {preempted}
 CPU-thread-1  {}
 CPU-thread-2  {}
 CPU-thread-3  {}
 DEV-thread-0  {read addrA from old page}
 DEV-thread-2  {read addrB from new page}

So here, because at time N+2 the clearing of the page table entry was not
paired with a notification to invalidate the secondary TLB, the device sees
the new value for addrB before seeing the new value for addrA. This breaks
total memory ordering for the device.

When changing a pte to write protect it, or to point it to a new write
protected page with the same content (KSM), it is fine to delay the
mmu_notifier_invalidate_range call to mmu_notifier_invalidate_range_end()
outside the page table lock. This is true even if the thread doing the page
table update is preempted right after releasing the page table lock but before
calling mmu_notifier_invalidate_range_end().
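
As an illustration only, the Case B scenario above can be replayed in a toy
user-space C model. Everything in it (struct toy_mm, dev_read(),
dev_tlb_invalidate(), cow_update(), cpu_write(), device_sees_reordering()) is
hypothetical and is not kernel API. The sketch assumes the device TLB is a
one-entry cache per address, and it reduces "notify under the page table lock"
to dropping that cached entry before the new page table entry is installed.

```c
#include <assert.h>
#include <stdbool.h>

enum { ADDR_A, ADDR_B, NR_ADDRS };
enum { OLD_PAGE, NEW_PAGE, NR_PAGES };

/* Hypothetical toy state: not a kernel structure. */
struct toy_mm {
	int page_data[NR_ADDRS][NR_PAGES]; /* contents of old/new page */
	int pte[NR_ADDRS];                 /* CPU page table: backing page */
	bool tlb_valid[NR_ADDRS];          /* device TLB: entry cached? */
	int tlb_page[NR_ADDRS];            /* device TLB: cached backing page */
};

/* Device read: uses the cached translation, walks the page table on a miss. */
static int dev_read(struct toy_mm *mm, int addr)
{
	assert(addr >= 0 && addr < NR_ADDRS);
	if (!mm->tlb_valid[addr]) {
		mm->tlb_page[addr] = mm->pte[addr];
		mm->tlb_valid[addr] = true;
	}
	return mm->page_data[addr][mm->tlb_page[addr]];
}

/* Stand-in for a secondary TLB invalidation of one address. */
static void dev_tlb_invalidate(struct toy_mm *mm, int addr)
{
	mm->tlb_valid[addr] = false;
}

/* COW update; the page table lock would be held for the whole function. */
static void cow_update(struct toy_mm *mm, int addr, bool notify_under_lock)
{
	mm->page_data[addr][NEW_PAGE] = mm->page_data[addr][OLD_PAGE];
	if (notify_under_lock)
		dev_tlb_invalidate(mm, addr); /* clear entry and notify */
	mm->pte[addr] = NEW_PAGE;             /* point entry to new page */
}

/* CPU write: always goes through the current page table entry. */
static void cpu_write(struct toy_mm *mm, int addr, int val)
{
	mm->page_data[addr][mm->pte[addr]] = val;
}

/* Returns true if the device sees the new addrB value but the old addrA. */
bool device_sees_reordering(bool notify_under_lock)
{
	struct toy_mm mm = {0};
	int a, b;

	dev_read(&mm, ADDR_A);               /* device populates its TLB */
	dev_read(&mm, ADDR_B);
	cow_update(&mm, ADDR_A, notify_under_lock);
	cow_update(&mm, ADDR_B, notify_under_lock);
	cpu_write(&mm, ADDR_A, 1);           /* write lands in the new page */
	cpu_write(&mm, ADDR_B, 1);
	dev_tlb_invalidate(&mm, ADDR_B);     /* only addrB reaches range_end */
	a = dev_read(&mm, ADDR_A);
	b = dev_read(&mm, ADDR_B);
	return a == 0 && b == 1;
}
```

Under these assumptions, device_sees_reordering(false) reproduces the outcome
of the scenario (new value for addrB, stale value for addrA), while
device_sees_reordering(true) does not, because the stale device TLB entry for
addrA is dropped before the new page becomes visible.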