..
   Copyright (c) 2015-2020 Linaro Ltd.

   This work is licensed under the terms of the GNU GPL, version 2 or
   later. See the COPYING file in the top-level directory.

.. _mttcg:

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a.
MTTCG) system-mode emulation. User-mode emulation has always mirrored
the thread structure of the translated executable, although some of
the changes done for MTTCG system emulation have also improved the
stability of linux-user emulation.

The original system-mode TCG implementation was single-threaded and
dealt with multiple CPUs by simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as the
systems being emulated gained additional cores and per-core
performance gains for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all frontend/backend
(FE/BE) combinations where the host memory model is able to
accommodate the guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO
is zero) and the guest has had the required work done to support this
safely (TARGET_SUPPORTS_MTTCG).

System emulation will fall back to the original round-robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64-bit guests on 32-bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation is only required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when no code is being generated there are a number of structures
associated with the hot-path through the main run-loop, namely those
involved in looking up the next translation block (TB) to execute.
These include:

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs between blocks in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention to do it.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
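To make the lookup order concrete, here is a minimal, self-contained
sketch of the scheme, assuming C11 atomics. All of the names (vCPU,
htable_lookup, generate_code_locked, TB_JMP_CACHE_SIZE) are
illustrative stand-ins rather than QEMU's real API; see
accel/tcg/cpu-exec.c for the real code.

.. code-block:: c

   #include <stdatomic.h>
   #include <stddef.h>
   #include <stdint.h>

   #define TB_JMP_CACHE_SIZE 4096

   typedef struct TranslationBlock {
       uint64_t pc;                 /* guest pc this block starts at */
       /* ... generated host code, jump lists, flags ... */
   } TranslationBlock;

   typedef struct vCPU {
       /* per-vCPU cache of recent blocks, accessed without locks */
       _Atomic(TranslationBlock *) tb_jmp_cache[TB_JMP_CACHE_SIZE];
   } vCPU;

   /* stand-ins for the global hash table and locked code generation */
   TranslationBlock *htable_lookup(uint64_t pc);
   TranslationBlock *generate_code_locked(uint64_t pc);

   TranslationBlock *lookup_next_tb(vCPU *cpu, uint64_t pc)
   {
       size_t slot = pc & (TB_JMP_CACHE_SIZE - 1);

       /* 1. Racy-but-atomic read: a stale or NULL entry only costs
        *    a slower lookup, never a wrong result. */
       TranslationBlock *tb =
           atomic_load_explicit(&cpu->tb_jmp_cache[slot],
                                memory_order_relaxed);
       if (!tb || tb->pc != pc) {
           /* 2. Fall back to the global QHT-style hash table, which
            *    is designed for lock-free lookup. */
           tb = htable_lookup(pc);
           if (!tb) {
               /* 3. Only a true miss takes locks to generate code. */
               tb = generate_code_locked(pc);
           }
           /* publish to the jump cache with an atomic store */
           atomic_store_explicit(&cpu->tb_jmp_cache[slot], tb,
                                 memory_order_relaxed);
       }
       return tb;
   }
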
Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle, including any
post-generation patching of the translated code. This also implies a
shared translation buffer which contains code running on all cores.
Any execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations so everything
starts from scratch again. Some operations also force a full flush of
translations, including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions
  - linux-user spawning its first thread
  - operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.

More granular translation invalidation events are typically due to a
change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
being used when looked up in the hot-path, there are a number of other
book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to a
now invalid block need their jump patches reverting so they will
return to the C code.

There are a number of look-up caches that need to be properly updated,
including:

  - the jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all TranslationBlocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs

  - safely patch/revert direct jumps
  - remove central PageDesc lookup entries
  - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of
tb->jmp_lock, where tb is the destination block of a jump. Each origin
block keeps a pointer to its destinations so that the appropriate lock
can be acquired before iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used to
atomically insert new elements.

The lookup caches are updated atomically and the lookup hash uses QHT,
which is designed for concurrent safe lookup.

Parallel code generation is supported. QHT is used at insertion time
as the synchronisation point across threads, thereby ensuring that we
only keep track of a single TranslationBlock for each guest code block
(see the sketch below).
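The insertion race can be sketched as follows. htable_insert() is a
stand-in loosely modelled on QEMU's qht_insert(), which fails and
yields the existing entry when an equivalent one is already present;
the other helpers are hypothetical.

.. code-block:: c

   #include <stdbool.h>
   #include <stdint.h>

   typedef struct TranslationBlock TranslationBlock;

   /* returns false and fills *existing if an equal entry is present */
   bool htable_insert(TranslationBlock *tb, uint32_t hash,
                      TranslationBlock **existing);

   TranslationBlock *translate_block(uint64_t pc); /* thread-local work */
   uint32_t tb_hash(uint64_t pc);
   void discard_tb(TranslationBlock *tb);

   /*
    * Two vCPU threads may race to translate the same guest block.
    * Insertion into the hash table picks exactly one winner, so only
    * a single TranslationBlock per guest block is ever tracked.
    */
   TranslationBlock *get_tb(uint64_t pc)
   {
       TranslationBlock *tb = translate_block(pc);
       TranslationBlock *existing = NULL;

       if (!htable_insert(tb, tb_hash(pc), &existing)) {
           /* lost the race: drop our copy and use the winner's */
           discard_tb(tb);
           tb = existing;
       }
       return tb;
   }
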
Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is done
with a per-vCPU TLB structure which, once populated, will allow a
series of accesses to the page to occur without exiting the translated
code. It is possible to set flags in the TLB address which will ensure
the slow-path is taken for each access. This can be done to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)

When a vCPU's TLB tables are updated by a thread other than its own we
need to ensure it is done in a safe way so that no inconsistent state
is seen by the owning vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page

    - can be across-vCPUs
    - a cross-vCPU TLB flush may need other vCPUs brought to a halt
    - the change may need to be visible to the calling vCPU immediately

  - TLB Flag Update

    - usually cross-vCPU
    - want the change to be visible as soon as possible

  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)

    - this is a per-vCPU table, so by definition it can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

A new set of TLB flush operations (tlb_flush_*_all_cpus_synced) force
synchronisation by queuing the source vCPU's work as "safe work" and
exiting the cpu run loop. This ensures that by the time execution
restarts all flush operations have completed (see the sketch below).

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation, but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.
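A sketch of the synced-flush pattern follows. async_run_on_cpu(),
async_safe_run_on_cpu(), CPU_FOREACH() and RUN_ON_CPU_NULL are real
QEMU mechanisms (the sketch assumes QEMU-internal headers), but
do_flush() and tlb_flush_one() are hypothetical workers and the real
code in accel/tcg/cputlb.c is considerably more nuanced.

.. code-block:: c

   /* hypothetical worker: flush this vCPU's TLB in its own context */
   static void do_flush(CPUState *cpu, run_on_cpu_data data)
   {
       tlb_flush_one(cpu);   /* stand-in for the per-vCPU flush */
   }

   static void flush_all_synced_sketch(CPUState *src)
   {
       CPUState *cpu;

       CPU_FOREACH(cpu) {
           if (cpu != src) {
               /* remote vCPUs flush asynchronously, each in its
                * own thread context */
               async_run_on_cpu(cpu, do_flush, RUN_ON_CPU_NULL);
           }
       }
       /*
        * "Safe work" only runs once every vCPU is quiescent, so by
        * the time src resumes execution all of the asynchronous
        * flushes queued above have completed.
        */
       async_safe_run_on_cpu(src, do_flush, RUN_ON_CPU_NULL);
       /* the caller then exits the cpu run loop */
   }
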
Emulated hardware state
-----------------------

Currently, thanks to work done for KVM, any access to IO memory is
automatically protected by the BQL (Big QEMU Lock). Any IO region that
doesn't use the BQL is expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model-specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL, as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there is a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to all memory operations, or only to loads or to stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.
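The signal-flag idiom is worth spelling out, as it is exactly the
kind of guest-visible ordering the barrier work has to preserve. A
minimal C11 illustration, with hypothetical writer/reader threads:

.. code-block:: c

   #include <stdatomic.h>
   #include <stdbool.h>

   static int payload;
   static atomic_bool ready;

   /* producer: publish the payload before raising the flag */
   void writer(void)
   {
       payload = 42;
       /* release ordering: payload is visible before ready is */
       atomic_store_explicit(&ready, true, memory_order_release);
   }

   /* consumer: only read the payload after seeing the flag */
   int reader(void)
   {
       /* acquire ordering: pairs with the release above */
       while (!atomic_load_explicit(&ready, memory_order_acquire)) {
           /* spin */
       }
       return payload;   /* guaranteed to observe 42 */
   }
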
DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency, whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers

  - host systems with stronger implied guarantees can skip some barriers
  - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context. The
tcg_optimize() function attempts to merge barriers up to their
strongest form before any load/store operations. The solution was
originally developed and tested for linux-user based systems. All
backends have been converted to emit fences when required. So far the
following front-ends have been updated to emit fences when required:

  - target-i386
  - target-arm
  - target-aarch64
  - target-alpha
  - target-mips

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour these instructions
are often seen when code modification has taken place, to ensure the
changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a single atomic instruction which guarantees
that some sort of test-and-conditional-store will be truly atomic
w.r.t. other cores sharing access to the memory. The classic example
is the x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and the store. An example of this is Arm's ldrex/strex pair,
where the strex instruction will only return a flag indicating a
successful store if no other CPU has accessed the memory region since
the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block, so
they will have completed before another CPU is scheduled. However,
with the ability to have multiple threads running to emulate multiple
CPUs, we need to explicitly expose these semantics.

DESIGN REQUIREMENTS:

  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures

CURRENT OPEN QUESTIONS:

  - How problematic is the ABA problem in general?

(Current solution)

TCG provides a number of atomic helpers (tcg_gen_atomic_*) which can
be used directly or combined to emulate other instructions like Arm's
ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where this
may be a problem - typically presenting a locking ABI which assumes
cmpxchg-like semantics.

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.

While the atomic helpers look good enough for now, there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.
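As a closing illustration, the ABA caveat above can be made concrete
with a self-contained sketch of emulating an exclusive pair on top of
a compare-and-swap, in the style the atomic helpers allow. All names
here (ExclState, emu_load_exclusive, emu_store_exclusive) are
illustrative, not QEMU's.

.. code-block:: c

   #include <stdatomic.h>
   #include <stdbool.h>
   #include <stdint.h>

   typedef struct {
       _Atomic uint32_t *excl_addr;  /* address marked exclusive */
       uint32_t excl_val;            /* value seen at load-exclusive */
   } ExclState;

   uint32_t emu_load_exclusive(ExclState *env, _Atomic uint32_t *addr)
   {
       env->excl_addr = addr;
       env->excl_val = atomic_load(addr);
       return env->excl_val;
   }

   /*
    * Returns true on success (like strex writing back 0). Note the
    * ABA window: if another CPU changes *addr and then restores the
    * original value, the compare-and-swap still succeeds even though
    * the architectural guarantee says the region was "touched".
    */
   bool emu_store_exclusive(ExclState *env, _Atomic uint32_t *addr,
                            uint32_t newval)
   {
       uint32_t expected = env->excl_val;

       if (env->excl_addr != addr) {
           return false;
       }
       return atomic_compare_exchange_strong(addr, &expected, newval);
   }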