.. SPDX-License-Identifier: GPL-2.0

=================
KVM Lock Overview
=================

1. Acquisition Orders
---------------------

The acquisition orders for mutexes are as follows:

- cpus_read_lock() is taken outside kvm_lock

- kvm_usage_lock is taken outside cpus_read_lock()

- kvm->lock is taken outside vcpu->mutex

- kvm->lock is taken outside kvm->slots_lock and kvm->irq_lock

- vcpu->mutex is taken outside kvm->slots_lock and kvm->slots_arch_lock

- kvm->slots_lock is taken outside kvm->irq_lock, though acquiring
  them together is quite rare.

- kvm->mn_active_invalidate_count ensures that pairs of
  invalidate_range_start() and invalidate_range_end() callbacks
  use the same memslots array.  kvm->slots_lock and kvm->slots_arch_lock
  are taken on the waiting side when modifying memslots, so MMU notifiers
  must not take either kvm->slots_lock or kvm->slots_arch_lock.

cpus_read_lock() vs kvm_lock:

- Taking cpus_read_lock() outside of kvm_lock is problematic, despite that
  being the official ordering, as it is quite easy to unknowingly trigger
  cpus_read_lock() while holding kvm_lock.  Use caution when walking vm_list,
  e.g. avoid complex operations when possible.

For SRCU:

- ``synchronize_srcu(&kvm->srcu)`` is called inside critical sections
  for kvm->lock, vcpu->mutex and kvm->slots_lock.  These locks _cannot_
  be taken inside a kvm->srcu read-side critical section; that is, the
  following is broken::

      srcu_read_lock(&kvm->srcu);
      mutex_lock(&kvm->slots_lock);

- kvm->slots_arch_lock instead is released before the call to
  ``synchronize_srcu()``.  It _can_ therefore be taken inside a
  kvm->srcu read-side critical section, for example while processing
  a vmexit.
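
The mutex ordering rules at the start of this section can be read as a
directed graph of "outer -> inner" edges. The following is a minimal
user-space sketch of that reading, not kernel code: the ``enum kvm_lock_id``
names and the ``may_nest()`` helper are hypothetical and exist only for
illustration.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical identifiers for the locks named in the ordering list. */
enum kvm_lock_id {
    L_KVM_USAGE_LOCK,
    L_CPUS_READ_LOCK,
    L_KVM_LOCK,         /* the global kvm_lock */
    L_VM_LOCK,          /* kvm->lock */
    L_VCPU_MUTEX,
    L_SLOTS_LOCK,
    L_SLOTS_ARCH_LOCK,
    L_IRQ_LOCK,
};

/* Each entry reads "outer is taken outside inner", as in the list. */
static const struct {
    enum kvm_lock_id outer, inner;
} lock_order[] = {
    { L_KVM_USAGE_LOCK, L_CPUS_READ_LOCK },
    { L_CPUS_READ_LOCK, L_KVM_LOCK },
    { L_VM_LOCK, L_VCPU_MUTEX },
    { L_VM_LOCK, L_SLOTS_LOCK },
    { L_VM_LOCK, L_IRQ_LOCK },
    { L_VCPU_MUTEX, L_SLOTS_LOCK },
    { L_VCPU_MUTEX, L_SLOTS_ARCH_LOCK },
    { L_SLOTS_LOCK, L_IRQ_LOCK },
};

/* Return true if 'inner' may be acquired while 'outer' is held. */
static bool may_nest(enum kvm_lock_id outer, enum kvm_lock_id inner)
{
    size_t i;

    for (i = 0; i < sizeof(lock_order) / sizeof(lock_order[0]); i++) {
        if (lock_order[i].outer != outer)
            continue;
        /* Direct edge, or reachable through an intermediate lock. */
        if (lock_order[i].inner == inner ||
            may_nest(lock_order[i].inner, inner))
            return true;
    }
    return false;
}
```

In this model, ``may_nest(L_VM_LOCK, L_IRQ_LOCK)`` holds, while
``may_nest(L_SLOTS_LOCK, L_VCPU_MUTEX)`` does not, because no chain of
edges leads from kvm->slots_lock to vcpu->mutex. The recursion terminates
only because the edge list is acyclic.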

On x86:

- vcpu->mutex is taken outside kvm->arch.hyperv.hv_lock and kvm->arch.xen.xen_lock

- kvm->arch.mmu_lock is an rwlock; critical sections for
  kvm->arch.tdp_mmu_pages_lock and kvm->arch.mmu_unsync_pages_lock must
  also take kvm->arch.mmu_lock

Everything else is a leaf: no other lock is taken inside the critical
sections.

2. Exception
------------

Fast page fault:

Fast page fault is the fast path which fixes the guest page fault out of
the mmu-lock on x86. Currently, the page fault can be fast in one of the
following two cases:

1. Access Tracking: The SPTE is not present, but it is marked for access
   tracking. That means we need to restore the saved R/X bits. This is
   described in more detail below.

2. Write-Protection: The SPTE is present and the fault is caused by
   write-protect. That means we just need to change the W bit of the spte.

What we use to avoid all the races is the Host-writable bit and MMU-writable bit
on the spte:

- Host-writable means the gfn is writable in the host kernel page tables and in
  its KVM memslot.
- MMU-writable means the gfn is writable in the guest's mmu and it is not
  write-protected by shadow page write-protection.

On the fast page fault path, we use cmpxchg to atomically set the spte W
bit if spte.HOST_WRITEABLE = 1 and spte.WRITE_PROTECT = 1, to restore the saved
R/X bits for an access-tracked spte, or both. This is safe because any change
to these bits can be detected by cmpxchg.

But we need to carefully check these cases:

1) The mapping from gfn to pfn

The mapping from gfn to pfn may be changed since we can only ensure the pfn
is not changed during cmpxchg.
This is an ABA problem; for example, the following case can happen:

+------------------------------------------------------------------------+
| At the beginning::                                                     |
|                                                                        |
|     gpte = gfn1                                                        |
|     gfn1 is mapped to pfn1 on host                                     |
|     spte is the shadow page table entry corresponding with gpte and    |
|     spte = pfn1                                                        |
+------------------------------------------------------------------------+
| On fast page fault path:                                               |
+------------------------------------+-----------------------------------+
| CPU 0:                             | CPU 1:                            |
+------------------------------------+-----------------------------------+
| ::                                 |                                   |
|                                    |                                   |
|   old_spte = *spte;                |                                   |
+------------------------------------+-----------------------------------+
|                                    | pfn1 is swapped out::             |
|                                    |                                   |
|                                    |    spte = 0;                      |
|                                    |                                   |
|                                    | pfn1 is reallocated for gfn2.     |
|                                    |                                   |
|                                    | gpte is changed to point to       |
|                                    | gfn2 by the guest::               |
|                                    |                                   |
|                                    |    spte = pfn1;                   |
+------------------------------------+-----------------------------------+
| ::                                                                     |
|                                                                        |
|   if (cmpxchg(spte, old_spte, old_spte+W))                             |
|       mark_page_dirty(vcpu->kvm, gfn1)                                 |
|            OOPS!!!                                                     |
+------------------------------------------------------------------------+

We dirty-log gfn1, which means the write to gfn2 is lost in the dirty bitmap.

For direct sp, we can easily avoid it since the spte of direct sp is fixed
to gfn.  For indirect sp, we disabled fast page fault for simplicity.

A solution for indirect sp could be to pin the gfn before the cmpxchg.  After
the pinning:

- We have held the refcount of pfn; that means the pfn can not be freed and
  be reused for another gfn.
- The pfn is writable and therefore it cannot be shared between different gfns
  by KSM.

Then we can ensure the dirty bitmap is correctly set for a gfn.
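
The gfn-to-pfn ABA case above can be replayed on a single thread with C11
atomics. This is only an illustrative sketch under stated assumptions:
``SPTE_W``, ``PFN1_SPTE`` and ``cmpxchg_misses_remap()`` are hypothetical
names, the bit layout is invented, and ``atomic_compare_exchange_strong()``
stands in for the kernel's cmpxchg.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define SPTE_W     (1ULL << 1)  /* hypothetical writable bit */
#define PFN1_SPTE  0x1000ULL    /* hypothetical spte value mapping pfn1 */

/*
 * Replay the interleaving from the table on one thread.  Returns true if
 * the final compare-and-exchange succeeds even though the spte was zapped
 * and re-installed in between.
 */
static bool cmpxchg_misses_remap(void)
{
    _Atomic uint64_t spte = PFN1_SPTE;
    uint64_t old_spte;

    /* CPU 0: snapshot the spte on the fast page fault path. */
    old_spte = atomic_load(&spte);

    /* CPU 1: pfn1 is swapped out, then reused to back gfn2. */
    atomic_store(&spte, 0);
    atomic_store(&spte, PFN1_SPTE);

    /* CPU 0: the bits match the snapshot again, so this succeeds. */
    return atomic_compare_exchange_strong(&spte, &old_spte,
                                          old_spte | SPTE_W);
}
```

The exchange compares only the bits of the spte, so it cannot tell that
the entry now stands for a different gfn. That is why the text above
relies on direct sps, where the gfn is fixed, and not on cmpxchg alone.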

2) Dirty bit tracking

In the original code, the spte can be fast updated (non-atomically) if the
spte is read-only and the Accessed bit has already been set, since in that
case the Accessed bit and Dirty bit cannot be lost.

But this is no longer true with fast page fault, since the spte can be marked
writable between reading the spte and updating it, as in the following case:

+-------------------------------------------------------------------------+
| At the beginning::                                                      |
|                                                                         |
|     spte.W = 0                                                          |
|     spte.Accessed = 1                                                   |
+-------------------------------------+-----------------------------------+
| CPU 0:                              | CPU 1:                            |
+-------------------------------------+-----------------------------------+
| In mmu_spte_update()::              |                                   |
|                                     |                                   |
|  old_spte = *spte;                  |                                   |
|                                     |                                   |
|                                     |                                   |
|  /* 'if' condition is satisfied. */ |                                   |
|  if (old_spte.Accessed == 1 &&      |                                   |
|       old_spte.W == 0)              |                                   |
|     spte = new_spte;                |                                   |
+-------------------------------------+-----------------------------------+
|                                     | On fast page fault path::         |
|                                     |                                   |
|                                     |    spte.W = 1                     |
|                                     |                                   |
|                                     | Memory write on the spte::        |
|                                     |                                   |
|                                     |    spte.Dirty = 1                 |
+-------------------------------------+-----------------------------------+
| ::                                  |                                   |
|                                     |                                   |
|   else                              |                                   |
|     old_spte = xchg(spte, new_spte);|                                   |
|   if (old_spte.Accessed &&          |                                   |
|       !new_spte.Accessed)           |                                   |
|     flush = true;                   |                                   |
|   if (old_spte.Dirty &&             |                                   |
|       !new_spte.Dirty)              |                                   |
|     flush = true;                   |                                   |
|     OOPS!!!                         |                                   |
+-------------------------------------+-----------------------------------+

The Dirty bit is lost in this case.

In order to avoid this kind of issue, we always treat the spte as "volatile"
if it can be updated out of mmu-lock [see spte_needs_atomic_update()]; it means
the spte is always atomically updated in this case.
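
The effect of always updating a "volatile" spte atomically can be sketched
in user space with C11 atomics. This is not the kernel's mmu_spte_update();
the bit positions and the helpers ``spte_update_atomic()`` and
``dirty_bit_is_noticed()`` are hypothetical and chosen only for this
illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit positions, chosen only for this sketch. */
#define SPTE_W         (1ULL << 1)
#define SPTE_ACCESSED  (1ULL << 5)
#define SPTE_DIRTY     (1ULL << 6)

/*
 * Replace *sptep with new_spte using an atomic exchange and report whether
 * an Accessed or Dirty bit that was set in the old value is being dropped.
 */
static bool spte_update_atomic(_Atomic uint64_t *sptep, uint64_t new_spte)
{
    /* The exchange returns exactly what was in the spte at the swap. */
    uint64_t old_spte = atomic_exchange(sptep, new_spte);
    uint64_t dropped = old_spte & ~new_spte;

    return (dropped & (SPTE_ACCESSED | SPTE_DIRTY)) != 0;
}

/* Replay the race: the fast path sets W and Dirty before the update. */
static bool dirty_bit_is_noticed(void)
{
    _Atomic uint64_t spte = SPTE_ACCESSED;  /* W = 0, Accessed = 1 */

    /* CPU 1: fast page fault makes the spte writable, the guest writes. */
    atomic_fetch_or(&spte, SPTE_W | SPTE_DIRTY);

    /* CPU 0: install a read-only, clean spte. */
    return spte_update_atomic(&spte, SPTE_ACCESSED);
}
```

Because the old value comes from the exchange itself and not from an
earlier plain read, a Dirty bit set by the concurrent fast path is still
visible to the updater.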

3) Flush TLBs due to spte updates

If the spte is updated from writable to read-only, we should flush all TLBs,
otherwise rmap_write_protect will find a read-only spte, even though the
writable spte might be cached on a CPU's TLB.

As mentioned before, the spte can be updated to writable out of mmu-lock on
the fast page fault path. To make this path easy to audit, mmu_spte_update()
checks whether TLBs need to be flushed for this reason, since it is the common
function for updating an spte (present -> present).

Since the spte is "volatile" if it can be updated out of mmu-lock, we always
atomically update the spte and the race caused by fast page fault can be avoided.
See the comments in spte_needs_atomic_update() and mmu_spte_update().

Lockless Access Tracking:

This is used for Intel CPUs that are using EPT but do not support the EPT A/D
bits. In this case, PTEs are tagged as A/D disabled (using ignored bits), and
when the KVM MMU notifier is called to track accesses to a page (via
kvm_mmu_notifier_clear_flush_young), it marks the PTE not-present in hardware
by clearing the RWX bits in the PTE and storing the original R & X bits in more
unused/ignored bits. When the VM tries to access the page later on, a fault is
generated and the fast page fault mechanism described above is used to
atomically restore the PTE to a Present state. The W bit is not saved when the
PTE is marked for access tracking and during restoration to the Present state,
the W bit is set depending on whether or not it was a write access. If it
wasn't, then the W bit will remain clear until a write access happens, at which
time it will be set using the Dirty tracking mechanism described above.

3. Reference
------------

``kvm_lock``
^^^^^^^^^^^^

:Type:     mutex
:Arch:     any
:Protects: - vm_list

``kvm_usage_lock``
^^^^^^^^^^^^^^^^^^

:Type:     mutex
:Arch:     any
:Protects: - kvm_usage_count
           - hardware virtualization enable/disable
:Comment:  Exists to allow taking cpus_read_lock() while kvm_usage_count is
           protected, which simplifies the virtualization enabling logic.

``kvm->mn_invalidate_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     spinlock_t
:Arch:     any
:Protects: mn_active_invalidate_count, mn_memslots_update_rcuwait

``kvm_arch::tsc_write_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     raw_spinlock_t
:Arch:     x86
:Protects: - kvm_arch::{last_tsc_write,last_tsc_nsec,last_tsc_offset}
           - tsc offset in vmcb
:Comment:  'raw' because updating the tsc offsets must not be preempted.

``kvm->mmu_lock``
^^^^^^^^^^^^^^^^^

:Type:     spinlock_t or rwlock_t
:Arch:     any
:Protects: - shadow page/shadow tlb entry
:Comment:  it is a spinlock since it is used in mmu notifier.

``kvm->srcu``
^^^^^^^^^^^^^

:Type:     srcu lock
:Arch:     any
:Protects: - kvm->memslots
           - kvm->buses
:Comment:  The srcu read lock must be held while accessing memslots (e.g.
           when using gfn_to_* functions) and while accessing in-kernel
           MMIO/PIO address->device structure mapping (kvm->buses).
           The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu
           if it is needed by multiple functions.

``kvm->slots_arch_lock``
^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     mutex
:Arch:     any (only needed on x86 though)
:Protects: any arch-specific fields of memslots that have to be modified
           in a ``kvm->srcu`` read-side critical section.
:Comment:  must be held before reading the pointer to the current memslots,
           until after all changes to the memslots are complete.

``wakeup_vcpus_on_cpu_lock``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

:Type:     spinlock_t
:Arch:     x86
:Protects: wakeup_vcpus_on_cpu
:Comment:  This is a per-CPU lock and it is used for VT-d posted-interrupts.
           When VT-d posted-interrupts are supported and the VM has assigned
           devices, we put the blocked vCPU on the list wakeup_vcpus_on_cpu
           protected by wakeup_vcpus_on_cpu_lock. When VT-d hardware issues
           a wakeup notification event because external interrupts from the
           assigned devices arrive, we find the vCPU on the list and wake
           it up.

``vendor_module_lock``
^^^^^^^^^^^^^^^^^^^^^^

:Type:     mutex
:Arch:     x86
:Protects: loading a vendor module (kvm_amd or kvm_intel)
:Comment:  Exists because using kvm_lock leads to deadlock.  kvm_lock is taken
           in notifiers, e.g. __kvmclock_cpufreq_notifier(), that may be
           invoked while cpu_hotplug_lock is held, e.g. from
           cpufreq_boost_trigger_state(), and many operations need to take
           cpu_hotplug_lock when loading a vendor module, e.g. updating
           static calls.