..
  Copyright (c) 2015-2020 Linaro Ltd.

  This work is licensed under the terms of the GNU GPL, version 2 or
  later. See the COPYING file in the top-level directory.

.. _mttcg:

==================
Multi-threaded TCG
==================

This document outlines the design for multi-threaded TCG (a.k.a.
MTTCG) system-mode emulation. User-mode emulation has always mirrored
the thread structure of the translated executable, although some of
the changes done for MTTCG system emulation have also improved the
stability of linux-user emulation.

The original system-mode TCG implementation was single-threaded and
dealt with multiple CPUs using simple round-robin scheduling. This
simplified a lot of things but became increasingly limited as the
systems being emulated gained additional cores and per-core
performance gains for host systems started to level off.

vCPU Scheduling
===============

We introduce a new running mode where each vCPU will run on its own
user-space thread. This is enabled by default for all FE/BE
combinations where the host memory model is able to accommodate the
guest (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO is zero) and the
guest has had the required work done to support this safely
(TARGET_SUPPORTS_MTTCG).
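
As an illustration only, that decision is conceptually a check along
these lines; the macro names come from the paragraph above, but the
helper name and exact code in accel/tcg differ::

    /* Sketch: MTTCG may default to on only when every memory ordering
     * the guest can demand is also provided by the TCG backend, and
     * the target has been audited for MTTCG. */
    static bool mttcg_can_be_default(void)
    {
    #ifdef TARGET_SUPPORTS_MTTCG
        return (TCG_GUEST_DEFAULT_MO & ~TCG_TARGET_DEFAULT_MO) == 0;
    #else
        return false;
    #endif
    }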

System emulation will fall back to the original round-robin approach
if:

* forced by --accel tcg,thread=single
* enabling --icount mode
* 64-bit guests on 32-bit hosts (TCG_OVERSIZED_GUEST)

In the general case of running translated code there should be no
inter-vCPU dependencies and all vCPUs should be able to run at full
speed. Synchronisation will only be required while accessing internal
shared data structures or when the emulated architecture requires a
coherent representation of the emulated machine state.

Shared Data Structures
======================

Main Run Loop
-------------

Even when there is no code being generated, there are a number of
structures associated with the hot-path through the main run-loop.
These are associated with looking up the next translation block to
execute. These include::

    tb_jmp_cache (per-vCPU, cache of recent jumps)
    tb_ctx.htable (global hash table, phys address->tb lookup)

As TB linking only occurs when blocks are in the same page, this code
is critical to performance: looking up the next TB to execute is the
most common reason to exit the generated code.

DESIGN REQUIREMENT: Make access to lookup structures safe with
multiple reader/writer threads. Minimise any lock contention while
doing so.

The hot-path avoids using locks where possible. The tb_jmp_cache is
updated with atomic accesses to ensure consistent results. The
fall-back QHT-based hash table is also designed for lockless lookups.
Locks are only taken when code generation is required or
TranslationBlocks have their block-to-block jumps patched.
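
A minimal sketch of the idea (the function and field names here are
illustrative rather than the exact ones in accel/tcg; qatomic_read()
is QEMU's atomic read helper)::

    /* Fast path: try the per-vCPU jump cache with a plain atomic read;
     * fall back to the lock-free QHT hash on a miss. No lock is taken
     * unless we end up having to translate a new block. */
    TranslationBlock *tb = qatomic_read(&jmp_cache[hash(pc)]);
    if (tb == NULL || tb->pc != pc) {
        tb = qht_lookup(...);   /* lock-free concurrent hash lookup */
        if (tb == NULL) {
            /* slow path: take the lock(s) and generate a new block */
        }
    }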

Global TCG State
----------------

User-mode emulation
~~~~~~~~~~~~~~~~~~~

We need to protect the entire code generation cycle including any
post-generation patching of the translated code. This also implies a
shared translation buffer which contains code running on all cores.
Any execution path that comes to the main run loop will need to hold a
mutex for code generation. This also includes times when we need to
flush code or entries from any shared lookups/caches. Structures held
on a per-vCPU basis won't need locking unless other vCPUs will need to
modify them.

DESIGN REQUIREMENT: Add locking around all code generation and TB
patching.

(Current solution)

Code generation is serialised with mmap_lock().
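
For illustration, a user-mode translation path is bracketed roughly
like this (a simplified sketch; mmap_lock()/mmap_unlock() are the
real helpers, while tb_gen_code()'s argument list is abridged)::

    /* Sketch: serialise code generation in user-mode emulation.
     * Everything between lock and unlock may touch the shared
     * translation buffer and the lookup structures. */
    mmap_lock();
    tb = tb_gen_code(cpu, ...);   /* generate and register the new TB */
    mmap_unlock();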

!User-mode emulation
~~~~~~~~~~~~~~~~~~~~

Each vCPU has its own TCG context and associated TCG region, thereby
requiring no locking during translation.

Translation Blocks
------------------

Currently the whole system shares a single code generation buffer
which, when full, will force a flush of all translations and a restart
from scratch. Some operations also force a full flush of translations
including:

  - debugging operations (breakpoint insertion/removal)
  - some CPU helper functions
  - linux-user spawning its first thread
  - operations related to TCG Plugins

This is done with the async_safe_run_on_cpu() mechanism to ensure all
vCPUs are quiescent when changes are being made to shared global
structures.
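
The flush itself is queued as "safe work", along these lines (a
condensed sketch of the pattern; do_tb_flush here stands in for the
real callback and RUN_ON_CPU_HOST_INT() wraps its argument)::

    /* Sketch: the callback runs only once every vCPU has stopped
     * executing translated code. */
    static void do_tb_flush(CPUState *cpu, run_on_cpu_data data)
    {
        /* all vCPUs are quiescent here, so the shared code buffer and
         * the lookup structures can be reset safely */
    }

    /* from the requesting thread: */
    async_safe_run_on_cpu(cpu, do_tb_flush, RUN_ON_CPU_HOST_INT(0));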

More granular translation invalidation events are typically due
to a change of the state of a physical page:

  - code modification (self-modifying code, patching code)
  - page changes (new page mapping in linux-user mode)

While setting the invalid flag in a TranslationBlock will stop it
from being used when looked up in the hot-path, there are a number of
other book-keeping structures that need to be safely cleared.

Any TranslationBlocks which have been patched to jump directly to the
now invalid blocks need their jump patches reverted so they will
return to the C code.

There are a number of look-up caches that need to be properly updated
including:

  - the jump lookup cache
  - the physical-to-tb lookup hash table
  - the global page table

The global page table (l1_map) provides a multi-level look-up for
PageDesc structures, which contain pointers to the start of a linked
list of all Translation Blocks in that page (see page_next).

Both the jump patching and the page cache involve linked lists that
the invalidated TranslationBlock needs to be removed from.

DESIGN REQUIREMENT: Safely handle invalidation of TBs
                      - safely patch/revert direct jumps
                      - remove central PageDesc lookup entries
                      - ensure lookup caches/hashes are safely updated

(Current solution)

The direct jumps themselves are updated atomically by the TCG
tb_set_jmp_target() code. Modifications to the linked lists that allow
searching for linked pages are done under the protection of tb->jmp_lock,
where tb is the destination block of a jump. Each origin block keeps a
pointer to its destinations so that the appropriate lock can be acquired before
iterating over a jump list.

The global page table is a lockless radix tree; cmpxchg is used
to atomically insert new elements.
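
The insertion pattern looks roughly like this (a sketch only;
qatomic_cmpxchg() is QEMU's compare-and-swap wrapper, while the
surrounding radix-tree walk is simplified away)::

    /* Sketch: publish a newly allocated tree level with cmpxchg. If
     * another thread raced us and installed its own level first, use
     * the existing one and discard ours. */
    void *existing = qatomic_cmpxchg(lp, NULL, new_level);
    if (existing) {
        g_free(new_level);
        new_level = existing;
    }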

The lookup caches are updated atomically and the lookup hash uses QHT,
which is designed for safe concurrent lookups.

Parallel code generation is supported. QHT is used at insertion time
as the synchronization point across threads, thereby ensuring that we only
keep track of a single TranslationBlock for each guest code block.

Memory maps and TLBs
--------------------

The memory handling code is fairly critical to the speed of memory
access in the emulated system. The SoftMMU code is designed so the
hot-path can be handled entirely within translated code. This is
handled with a per-vCPU TLB structure which, once populated, will
allow a series of accesses to the page to occur without exiting the
translated code. It is possible to set flags in the TLB address which
will ensure the slow-path is taken for each access (sketched after the
list below). This can be done to support:

  - Memory regions (dividing up access to PIO, MMIO and RAM)
  - Dirty page tracking (for code gen, SMC detection, migration and display)
  - Virtual TLB (for translating guest address->real address)
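
Conceptually the check made on every access looks like this (a
simplified C rendering of what the generated fast path does; the flag
names come from include/exec/cpu-all.h, but which comparator they
apply to depends on the access type)::

    /* Sketch: if any flag bit is set in the TLB comparator the address
     * comparison fails on purpose and execution drops into the
     * slow-path helper, which can handle MMIO, dirty tracking, etc. */
    if (tlb_addr & (TLB_INVALID_MASK | TLB_MMIO | TLB_NOTDIRTY)) {
        /* take the slow path */
    }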

When the TLB tables are updated by a vCPU thread other than its own
we need to ensure it is done in a safe way so no inconsistent state is
seen by the vCPU thread.

Some operations require updating a number of vCPUs' TLBs at the same
time in a synchronised manner.

DESIGN REQUIREMENTS:

  - TLB Flush All/Page
    - can be across-vCPUs
    - a cross-vCPU TLB flush may need other vCPUs brought to a halt
    - the change may need to be visible to the calling vCPU immediately
  - TLB Flag Update
    - usually cross-vCPU
    - want the change to be visible as soon as possible
  - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
    - this is a per-vCPU table - by definition it can't race
    - updated by its own thread when the slow-path is forced

(Current solution)

A new set of tlb flush operations (tlb_flush_*_all_cpus_synced) forces
synchronisation by queueing the source vCPU's work as "safe work" and
exiting the cpu run loop. This ensures that by the time execution
restarts all flush operations have completed.
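
For example, a helper that must not return to translated code until
every vCPU has dropped a mapping can use the synced variant (call
shown in isolation; the exact prototype lives in the cputlb headers)::

    /* Sketch: flush one page from every vCPU's TLB and only resume the
     * calling vCPU once all the flushes have completed. */
    tlb_flush_page_all_cpus_synced(cpu, addr);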

TLB flag updates are all done atomically and are also protected by the
corresponding page lock.

(Known limitation)

Not really a limitation, but the wait mechanism is overly strict for
some architectures which only need flushes completed by a barrier
instruction. This could be a future optimisation.

Emulated hardware state
-----------------------

Currently, thanks to KVM work, any access to IO memory is automatically
protected by the BQL (Big QEMU Lock). Any IO region that doesn't use
the BQL is expected to do its own locking.

However IO memory isn't the only way emulated hardware state can be
modified. Some architectures have model-specific registers that
trigger hardware emulation features. Generally any translation helper
that needs to update more than a single vCPU's state should take the
BQL.

As the BQL, or global iothread mutex, is shared across the system we
push the use of the lock as far down into the TCG code as possible to
minimise contention.
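
A helper that touches cross-vCPU hardware state therefore brackets
just that work with the lock, roughly as follows (sketch only; the
lock is taken with qemu_mutex_lock_iothread() here, and
helper_update_shared_state is a hypothetical helper)::

    /* Sketch: take the BQL only around state that other vCPUs or the
     * IO thread may also be touching, keeping the critical section as
     * small as possible. */
    void helper_update_shared_state(CPUArchState *env, uint64_t val)
    {
        qemu_mutex_lock_iothread();
        /* ... update interrupt controller / machine-wide state ... */
        qemu_mutex_unlock_iothread();
    }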

(Current solution)

MMIO access automatically serialises hardware emulation by way of the
BQL. Currently Arm targets serialise all ARM_CP_IO register accesses
and also defer the reset/startup of vCPUs to the vCPU context by way
of async_run_on_cpu().

Updates to interrupt state are also protected by the BQL as they can
often be cross-vCPU.

Memory Consistency
==================

Between emulated guests and host systems there is a range of memory
consistency models. Even emulating weakly ordered systems on strongly
ordered hosts needs to ensure things like store-after-load re-ordering
can be prevented when the guest wants to.

Memory Barriers
---------------

Barriers (sometimes known as fences) provide a mechanism for software
to enforce a particular ordering of memory operations from the point
of view of external observers (e.g. another processor core). They can
apply to any memory operations as well as just loads or stores.

The Linux kernel has an excellent `write-up
<https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt>`_
on the various forms of memory barrier and the guarantees they can
provide.

Barriers are often wrapped around synchronisation primitives to
provide explicit memory ordering semantics. However they can be used
by themselves to provide safe lockless access by ensuring, for
example, that a change to a signal flag will only be visible once the
changes to the payload are.
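
The classic message-passing pattern looks like this (a minimal sketch
using QEMU's smp_wmb()/smp_rmb() and qatomic_set()/qatomic_read()
helpers from qemu/atomic.h; the payload and flag variables are purely
illustrative)::

    /* Producer: make the payload globally visible before the flag
     * that announces it. */
    payload = compute_value();
    smp_wmb();
    qatomic_set(&flag, true);

    /* Consumer: only read the payload after seeing the flag, with a
     * read barrier in between so the loads are not reordered. */
    if (qatomic_read(&flag)) {
        smp_rmb();
        use_value(payload);
    }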

DESIGN REQUIREMENT: Add a new tcg_memory_barrier op

This would enforce a strong load/store ordering so all loads/stores
complete at the memory barrier. On single-core non-SMP strongly
ordered backends this could become a NOP.

Aside from explicit standalone memory barrier instructions there are
also implicit memory ordering semantics which come with each guest
memory access instruction. For example all x86 load/stores come with
fairly strong guarantees of sequential consistency whereas Arm has
special variants of load/store instructions that imply acquire/release
semantics.

In the case of a strongly ordered guest architecture being emulated on
a weakly ordered host the scope for a heavy performance impact is
quite high.

DESIGN REQUIREMENTS: Be efficient with use of memory barriers
       - host systems with stronger implied guarantees can skip some barriers
       - merge consecutive barriers to the strongest one

(Current solution)

The system currently has a tcg_gen_mb() which will add memory barrier
operations if code generation is being done in a parallel context
(sketched below). The tcg_optimize() function attempts to merge
barriers up to their strongest form before any load/store operations.
The solution was originally developed and tested for linux-user based
systems. All backends have been converted to emit fences when
required. So far the following front-ends have also been updated to
emit fences when required:

    - target-i386
    - target-arm
    - target-aarch64
    - target-alpha
    - target-mips
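
For illustration, a front-end emits such a barrier with tcg_gen_mb();
the argument combines an ordering mask with a barrier kind (the
constants are from tcg/tcg.h, and choosing the full-barrier form here
is just an example)::

    /* Sketch: emit a full sequentially-consistent fence covering all
     * load/store orderings. tcg_optimize() may later merge it with a
     * neighbouring barrier, and a backend whose memory model already
     * guarantees the ordering can drop it entirely. */
    tcg_gen_mb(TCG_MO_ALL | TCG_BAR_SC);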

Memory Control and Maintenance
------------------------------

This includes a class of instructions for controlling system cache
behaviour. While QEMU doesn't model cache behaviour, these
instructions are often seen when code modification has taken place to
ensure the changes take effect.

Synchronisation Primitives
--------------------------

There are two broad types of synchronisation primitives found in
modern ISAs: atomic instructions and exclusive regions.

The first type offers a simple atomic instruction which will guarantee
that some sort of test-and-conditional-store will be truly atomic
w.r.t. other cores sharing access to the memory. The classic example
is the x86 cmpxchg instruction.

The second type offers a pair of load/store instructions which
guarantee that a region of memory has not been touched between the
load and store instructions. An example of this is Arm's ldrex/strex
pair, where the strex instruction will return a flag indicating a
successful store only if no other CPU has accessed the memory region
since the ldrex.

Traditionally TCG has generated a series of operations that work
because they are within the context of a single translation block so
will have completed before another CPU is scheduled. However with
the ability to have multiple threads running to emulate multiple CPUs
we will need to explicitly expose these semantics.

DESIGN REQUIREMENTS:
  - Support classic atomic instructions
  - Support load/store exclusive (or load link/store conditional) pairs
  - Generic enough infrastructure to support all guest architectures
CURRENT OPEN QUESTIONS:
  - How problematic is the ABA problem in general?

(Current solution)

The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
can be used directly or combined to emulate other instructions like
Arm's ldrex/strex instructions. While they are susceptible to the ABA
problem, so far common guests have not implemented patterns where
this may be a problem - typically they present a locking ABI which
assumes cmpxchg-like semantics.
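
Conceptually the ldrex/strex emulation reduces the exclusive pair to a
compare-and-swap on the recorded value (a high-level sketch of the
approach, not the actual TCG op sequence; the field and variable names
are illustrative)::

    /* ldrex: remember where we loaded from and what we saw */
    env->exclusive_addr = (uintptr_t)p;
    env->exclusive_val  = *p;

    /* strex: only succeed if memory still holds the value the ldrex
     * saw - done with an atomic compare-and-swap. This is also where
     * the ABA limitation comes from: a location changed and changed
     * back between the two instructions looks untouched. */
    uint32_t old = qatomic_cmpxchg(p, env->exclusive_val, new_val);
    bool success = (old == env->exclusive_val);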

The code also includes a fall-back for cases where multi-threaded TCG
ops can't work (e.g. guest atomic width > host atomic width). In this
case an EXCP_ATOMIC exit occurs and the instruction is emulated with
an exclusive lock which ensures all emulation is serialised.
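
The serialised fall-back hangs off the exclusive-section machinery; a
sketch of the control flow in the run loop (the handling around the
cpu_exec_step_atomic() call is condensed)::

    /* Sketch: on an EXCP_ATOMIC exit, stop every other vCPU
     * (start_exclusive()), single-step the offending instruction as
     * the only running thread, then let everyone resume
     * (end_exclusive()). cpu_exec_step_atomic() wraps this sequence. */
    if (ret == EXCP_ATOMIC) {
        cpu_exec_step_atomic(cpu);
    }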

While the atomic helpers look good enough for now, there may be a need
to look at solutions that can more closely model the guest
architecture's semantics.