=========================
Atomic operations in QEMU
=========================

CPUs perform independent memory operations effectively in random order,
but this can be a problem for CPU-CPU interaction (including interactions
between QEMU and the guest).  Multi-threaded programs use various tools
to instruct the compiler and the CPU to restrict the order to something
that is consistent with the expectations of the programmer.

The most basic tool is locking.  Mutexes, condition variables and
semaphores are used in QEMU, and should be the default approach to
synchronization.  Anything else is considerably harder, but it's
also justified more often than one would like.  The two tools that
are provided by ``qemu/atomic.h`` are memory barriers and atomic operations.

Macros defined by ``qemu/atomic.h`` fall in three camps:

- compiler barriers: ``barrier()``;

- weak atomic access and manual memory barriers: ``atomic_read()``,
  ``atomic_set()``, ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``, ``smp_mb_acquire()``,
  ``smp_mb_release()``, ``smp_read_barrier_depends()``;

- sequentially consistent atomic access: everything else.


Compiler memory barrier
=======================

``barrier()`` prevents the compiler from moving the memory accesses on either
side of it to the other side.  The compiler barrier has no direct effect
on the CPU, which may then reorder things however it wishes.

``barrier()`` is mostly used within ``qemu/atomic.h`` itself.  On some
architectures, CPU guarantees are strong enough that blocking compiler
optimizations already ensures the correct order of execution.  In this
case, ``qemu/atomic.h`` will reduce stronger memory barriers to simple
compiler barriers.

Still, ``barrier()`` can be useful when writing code that can be interrupted
by signal handlers.


Sequentially consistent atomic access
=====================================

Most of the operations in the ``qemu/atomic.h`` header ensure *sequential
consistency*, where "the result of any execution is the same as if the
operations of all the processors were executed in some sequential order,
and the operations of each individual processor appear in this sequence
in the order specified by its program".

``qemu/atomic.h`` provides the following set of atomic read-modify-write
operations::

    void atomic_inc(ptr)
    void atomic_dec(ptr)
    void atomic_add(ptr, val)
    void atomic_sub(ptr, val)
    void atomic_and(ptr, val)
    void atomic_or(ptr, val)

    typeof(*ptr) atomic_fetch_inc(ptr)
    typeof(*ptr) atomic_fetch_dec(ptr)
    typeof(*ptr) atomic_fetch_add(ptr, val)
    typeof(*ptr) atomic_fetch_sub(ptr, val)
    typeof(*ptr) atomic_fetch_and(ptr, val)
    typeof(*ptr) atomic_fetch_or(ptr, val)
    typeof(*ptr) atomic_fetch_xor(ptr, val)
    typeof(*ptr) atomic_fetch_inc_nonzero(ptr)
    typeof(*ptr) atomic_xchg(ptr, val)
    typeof(*ptr) atomic_cmpxchg(ptr, old, new)

The ``atomic_fetch_*``, ``atomic_xchg`` and ``atomic_cmpxchg`` operations
return the old value of ``*ptr``.  These operations are polymorphic; they
operate on any type that is as wide as a pointer.
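
For example, ``atomic_cmpxchg`` can be used to build a retry loop that
atomically raises a shared maximum.  The sketch below is purely
illustrative (``update_max`` is a made-up helper, not an existing QEMU
function) and assumes a source file that includes ``qemu/atomic.h``::

    /* Atomically raise *max to at least val. */
    static void update_max(int *max, int val)
    {
        int old = atomic_read(max);
        while (old < val) {
            /* atomic_cmpxchg() returns the value that was actually found
             * in memory: if it equals old, the update succeeded; otherwise
             * another thread changed *max in the meantime and we retry
             * with the value that thread wrote.
             */
            int seen = atomic_cmpxchg(max, old, val);
            if (seen == old) {
                break;
            }
            old = seen;
        }
    }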

Similar operations return the new value of ``*ptr``::

    typeof(*ptr) atomic_inc_fetch(ptr)
    typeof(*ptr) atomic_dec_fetch(ptr)
    typeof(*ptr) atomic_add_fetch(ptr, val)
    typeof(*ptr) atomic_sub_fetch(ptr, val)
    typeof(*ptr) atomic_and_fetch(ptr, val)
    typeof(*ptr) atomic_or_fetch(ptr, val)
    typeof(*ptr) atomic_xor_fetch(ptr, val)

Sequentially consistent loads and stores can be done using::

    atomic_fetch_add(ptr, 0) for loads
    atomic_xchg(ptr, val)    for stores

However, they are quite expensive on some platforms, notably POWER and
Arm.  Therefore, ``qemu/atomic.h`` provides two primitives with slightly
weaker constraints::

    typeof(*ptr) atomic_mb_read(ptr)
    void         atomic_mb_set(ptr, val)

The semantics of these primitives map to Java volatile variables,
and are strongly related to memory barriers as used in the Linux
kernel (see below).

As long as you use ``atomic_mb_read()`` and ``atomic_mb_set()``, accesses
cannot be reordered with each other, and it is also not possible to reorder
"normal" accesses around them.

However, and this is the important difference between
``atomic_mb_read()``/``atomic_mb_set()`` and sequential consistency, it is
important for both threads to access the same volatile variable.  It is not
the case that everything visible to thread A when it writes volatile field f
becomes visible to thread B after it reads volatile field g.  The store
and load have to "match" (i.e., be performed on the same volatile
field) to achieve the right semantics.

These operations operate on any type that is as wide as an int or smaller.


Weak atomic access and manual memory barriers
=============================================

Compared to sequentially consistent atomic access, programming with
weaker consistency models can be considerably more complicated.
If the algorithm you are writing includes both writes and reads on the
same side, it is generally simpler to use sequentially consistent
primitives.

When using this model, variables are accessed with:

- ``atomic_read()`` and ``atomic_set()``; these prevent the compiler from
  optimizing accesses out of existence and creating unsolicited
  accesses, but do not otherwise impose any ordering on loads and
  stores: both the compiler and the processor are free to reorder
  them.

- ``atomic_load_acquire()``, which guarantees the LOAD to appear to
  happen, with respect to the other components of the system,
  before all the LOAD or STORE operations specified afterwards.
  Operations coming before ``atomic_load_acquire()`` can still be
  reordered after it.

- ``atomic_store_release()``, which guarantees the STORE to appear to
  happen, with respect to the other components of the system,
  after all the LOAD or STORE operations specified before.
  Operations coming after ``atomic_store_release()`` can still be
  reordered before it.

Restrictions to the ordering of accesses can also be specified
using the memory barrier macros: ``smp_rmb()``, ``smp_wmb()``, ``smp_mb()``,
``smp_mb_acquire()``, ``smp_mb_release()``, ``smp_read_barrier_depends()``.

Memory barriers control the order of references to shared memory.
They come in six kinds:

- ``smp_rmb()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_rmb()`` puts a partial ordering on loads, but is not
  required to have any effect on stores.

- ``smp_wmb()`` guarantees that all the STORE operations specified before
  the barrier will appear to happen before all the STORE operations
  specified after the barrier with respect to the other components of
  the system.

  In other words, ``smp_wmb()`` puts a partial ordering on stores, but is not
  required to have any effect on loads.

- ``smp_mb_acquire()`` guarantees that all the LOAD operations specified before
  the barrier will appear to happen before all the LOAD or STORE operations
  specified after the barrier with respect to the other components of
  the system.

- ``smp_mb_release()`` guarantees that all the STORE operations specified *after*
  the barrier will appear to happen after all the LOAD or STORE operations
  specified *before* the barrier with respect to the other components of
  the system.

- ``smp_mb()`` guarantees that all the LOAD and STORE operations specified
  before the barrier will appear to happen before all the LOAD and
  STORE operations specified after the barrier with respect to the other
  components of the system.

  ``smp_mb()`` puts a partial ordering on both loads and stores.  It is
  stronger than both a read and a write memory barrier; it implies both
  ``smp_mb_acquire()`` and ``smp_mb_release()``, but it also prevents STOREs
  coming before the barrier from overtaking LOADs coming after the
  barrier and vice versa.

- ``smp_read_barrier_depends()`` is a weaker kind of read barrier.  On
  most processors, whenever two loads are performed such that the
  second depends on the result of the first (e.g., the first load
  retrieves the address to which the second load will be directed),
  the processor will guarantee that the first LOAD will appear to happen
  before the second with respect to the other components of the system.
  However, this is not always true---for example, it was not true on
  Alpha processors.  Whenever this kind of access happens to shared
  memory (that is not protected by a lock), a read barrier is needed,
  and ``smp_read_barrier_depends()`` can be used instead of ``smp_rmb()``.

  Note that the first load really has to have a *data* dependency and not
  a control dependency.  If the address for the second load is dependent
  on the first load, but the dependency is through a conditional rather
  than actually loading the address itself, then it's a *control*
  dependency and a full read barrier or better is required.
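
As an illustration of the acquire/release primitives and barriers described
above, here is a minimal single-producer/single-consumer sketch.  The
variable and function names (``msg``, ``msg_ready``, ``send``, ``recv``) are
made up for the example, which assumes the usual QEMU headers and
``qemu/atomic.h``::

    int msg;
    bool msg_ready;

    /* producer */
    static void send(int val)
    {
        atomic_set(&msg, val);
        /* The release store cannot be reordered before the write to msg,
         * so a consumer that sees the flag also sees the payload.
         */
        atomic_store_release(&msg_ready, true);
    }

    /* consumer */
    static bool recv(int *val)
    {
        /* The acquire load orders the read of msg after the flag check,
         * pairing with the release store in send().
         */
        if (!atomic_load_acquire(&msg_ready)) {
            return false;
        }
        *val = atomic_read(&msg);
        return true;
    }

The same ordering could be obtained with explicit barriers: ``smp_wmb()``
between the two stores in the producer, paired with ``smp_rmb()`` between
the two loads in the consumer.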

This is the set of barriers that is required *between* two ``atomic_read()``
and ``atomic_set()`` operations to achieve sequential consistency:

     +----------------+-------------------------------------------------------+
     |                | 2nd operation                                         |
     |                +------------------+-----------------+------------------+
     | 1st operation  | (after last)     | atomic_read     | atomic_set       |
     +----------------+------------------+-----------------+------------------+
     | (before first) | ..               | none            | smp_mb_release() |
     +----------------+------------------+-----------------+------------------+
     | atomic_read    | smp_mb_acquire() | smp_rmb() [1]_  | [2]_             |
     +----------------+------------------+-----------------+------------------+
     | atomic_set     | none             | smp_mb() [3]_   | smp_wmb()        |
     +----------------+------------------+-----------------+------------------+

     .. [1] Or smp_read_barrier_depends().

     .. [2] This requires a load-store barrier.  This is achieved by
            either smp_mb_acquire() or smp_mb_release().

     .. [3] This requires a store-load barrier.  On most machines, the only
            way to achieve this is a full barrier.

You can see that the two possible definitions of ``atomic_mb_read()``
and ``atomic_mb_set()`` are the following:

  1) | atomic_mb_read(p)   = atomic_read(p); smp_mb_acquire()
     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()

  2) | atomic_mb_read(p)   = smp_mb(); atomic_read(p); smp_mb_acquire()
     | atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);

Usually the former is used, because ``smp_mb()`` is expensive and a program
normally has more reads than writes.  Therefore it makes more sense to
make ``atomic_mb_set()`` the more expensive operation.
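
As a usage example, a "stop" flag that one thread sets and another polls
can rely on these primitives alone.  This is only a sketch with made-up
names (``stop_request``, ``do_work``), assuming ``qemu/atomic.h``::

    bool stop_request;

    /* any thread */
    static void request_stop(void)
    {
        /* Definition 1 above: release barrier, store, full barrier, so
         * everything written before this call is visible to a thread
         * that observes the flag through atomic_mb_read().
         */
        atomic_mb_set(&stop_request, true);
    }

    /* worker thread */
    static void worker(void)
    {
        while (!atomic_mb_read(&stop_request)) {
            do_work();    /* hypothetical unit of work */
        }
    }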

There are two common cases in which ``atomic_mb_read()`` and ``atomic_mb_set()``
generate too many memory barriers, and thus it can be useful to manually
place barriers, or use ``atomic_load_acquire()``/``atomic_store_release()``
instead:

- when a data structure has one thread that is always a writer
  and one thread that is always a reader, manual placement of
  memory barriers makes the write side faster.  Furthermore,
  correctness is easy to check for in this case using the "pairing"
  trick that is explained below:

     +----------------------------------------------------------------------+
     | thread 1                                                             |
     +-----------------------------------+----------------------------------+
     | before                            | after                            |
     +===================================+==================================+
     | ::                                | ::                               |
     |                                   |                                  |
     |   (other writes)                  |                                  |
     |   atomic_mb_set(&a, x)            |   atomic_store_release(&a, x)    |
     |   atomic_mb_set(&b, y)            |   atomic_store_release(&b, y)    |
     +-----------------------------------+----------------------------------+

     +----------------------------------------------------------------------+
     | thread 2                                                             |
     +-----------------------------------+----------------------------------+
     | before                            | after                            |
     +===================================+==================================+
     | ::                                | ::                               |
     |                                   |                                  |
     |   y = atomic_mb_read(&b)          |   y = atomic_load_acquire(&b)    |
     |   x = atomic_mb_read(&a)          |   x = atomic_load_acquire(&a)    |
     |   (other reads)                   |                                  |
     +-----------------------------------+----------------------------------+

  Note that the barrier between the stores in thread 1, and between
  the loads in thread 2, has been optimized here to a write or a
  read memory barrier respectively.  On some architectures, notably
  ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
  smp_mb, but smp_rmb and/or smp_wmb are more efficient.

- sometimes, a thread is accessing many variables that are otherwise
  unrelated to each other (for example because, apart from the current
  thread, exactly one other thread will read or write each of these
  variables).  In this case, it is possible to "hoist" the implicit
  barriers provided by ``atomic_mb_read()`` and ``atomic_mb_set()`` outside
  a loop.  For example, the above definition of ``atomic_mb_read()`` gives
  the following transformation:

     +-----------------------------------+----------------------------------+
     | before                            | after                            |
     +===================================+==================================+
     | ::                                | ::                               |
     |                                   |                                  |
     |   n = 0;                          |   n = 0;                         |
     |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
     |     n += atomic_mb_read(&a[i]);   |     n += atomic_read(&a[i]);     |
     |                                   |   smp_mb_acquire();              |
     +-----------------------------------+----------------------------------+

  Similarly, ``atomic_mb_set()`` can be transformed as follows:

     +-----------------------------------+----------------------------------+
     | before                            | after                            |
     +===================================+==================================+
     | ::                                | ::                               |
     |                                   |                                  |
     |                                   |   smp_mb_release();              |
     |   for (i = 0; i < 10; i++)        |   for (i = 0; i < 10; i++)       |
     |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
     |                                   |   smp_mb();                      |
     +-----------------------------------+----------------------------------+

  The other thread can still use ``atomic_mb_read()``/``atomic_mb_set()``.

The two tricks can be combined.  In this case, splitting a loop in
two lets you hoist the barriers out of the loops *and* eliminate the
expensive ``smp_mb()``:

     +-----------------------------------+----------------------------------+
     | before                            | after                            |
     +===================================+==================================+
     | ::                                | ::                               |
     |                                   |                                  |
     |                                   |   smp_mb_release();              |
     |   for (i = 0; i < 10; i++) {      |   for (i = 0; i < 10; i++)       |
     |     atomic_mb_set(&a[i], false);  |     atomic_set(&a[i], false);    |
     |     atomic_mb_set(&b[i], false);  |   smp_wmb();                     |
     |   }                               |   for (i = 0; i < 10; i++)       |
     |                                   |     atomic_set(&b[i], false);    |
     |                                   |   smp_mb();                      |
     +-----------------------------------+----------------------------------+


Memory barrier pairing
----------------------

A useful rule of thumb is that memory barriers should always, or almost
always, be paired with another barrier.  In the case of QEMU, however,
note that the other barrier may actually be in a driver that runs in
the guest!

For the purposes of pairing, ``smp_read_barrier_depends()`` and ``smp_rmb()``
both count as read barriers.  A read barrier shall pair with a write
barrier or a full barrier; a write barrier shall pair with a read
barrier or a full barrier.  A full barrier can pair with anything.
For example:

     +--------------------+------------------------------+
     | thread 1           | thread 2                     |
     +====================+==============================+
     | ::                 | ::                           |
     |                    |                              |
     |   a = 1;           |                              |
     |   smp_wmb();       |                              |
     |   b = 2;           |   x = b;                     |
     |                    |   smp_rmb();                 |
     |                    |   y = a;                     |
     +--------------------+------------------------------+

Note that the "writing" thread is accessing the variables in the
opposite order as the "reading" thread.  This is expected: stores
before the write barrier will normally match the loads after the
read barrier, and vice versa.
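
The pairing above can also be written as plain C, using ``atomic_read()``
and ``atomic_set()`` for the accesses themselves; ``a`` and ``b`` are
illustrative globals (both initially zero), not existing QEMU variables::

    int a, b;

    /* thread 1 */
    static void writer(void)
    {
        atomic_set(&a, 1);
        smp_wmb();            /* pairs with smp_rmb() in reader() */
        atomic_set(&b, 2);
    }

    /* thread 2 */
    static void reader(void)
    {
        int x, y;

        x = atomic_read(&b);
        smp_rmb();            /* pairs with smp_wmb() in writer() */
        y = atomic_read(&a);
        /* If x is 2 here, then y is guaranteed to be 1. */
        g_assert(x != 2 || y == 1);
    }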

The same is true for more than two accesses and for data dependency
barriers:

     +----------------------+------------------------------+
     | thread 1             | thread 2                     |
     +======================+==============================+
     | ::                   | ::                           |
     |                      |                              |
     |   b[2] = 1;          |                              |
     |   smp_wmb();         |                              |
     |   x->i = 2;          |                              |
     |   smp_wmb();         |                              |
     |   a = x;             |   x = a;                     |
     |                      |   smp_read_barrier_depends();|
     |                      |   y = x->i;                  |
     |                      |   smp_read_barrier_depends();|
     |                      |   z = b[y];                  |
     +----------------------+------------------------------+

``smp_wmb()`` also pairs with ``atomic_mb_read()`` and ``smp_mb_acquire()``,
and ``smp_rmb()`` also pairs with ``atomic_mb_set()`` and ``smp_mb_release()``.


Comparison with Linux kernel memory barriers
============================================

Here is a list of differences between Linux kernel atomic operations
and memory barriers, and the equivalents in QEMU:

- atomic operations in Linux are always on a 32-bit int type and
  use a boxed ``atomic_t`` type; atomic operations in QEMU are polymorphic
  and use normal C types.

- Originally, ``atomic_read`` and ``atomic_set`` in Linux gave no guarantee
  at all.  Linux 4.1 updated them to implement volatile
  semantics via ``ACCESS_ONCE`` (or the more recent ``READ_ONCE``/``WRITE_ONCE``).

  QEMU's ``atomic_read`` and ``atomic_set`` implement C11 atomic relaxed
  semantics if the compiler supports it, and volatile semantics otherwise.
  Both semantics prevent the compiler from doing certain transformations;
  the difference is that atomic accesses are guaranteed to be atomic,
  while volatile accesses aren't.  Thus, in the volatile case we just cross
  our fingers hoping that the compiler will generate atomic accesses,
  since we assume the variables passed are machine-word sized and
  properly aligned.

  No barriers are implied by ``atomic_read`` and ``atomic_set`` in either Linux
  or QEMU.

- atomic read-modify-write operations in Linux are of three kinds:

  ===================== =========================================
  ``atomic_OP``         returns void
  ``atomic_OP_return``  returns new value of the variable
  ``atomic_fetch_OP``   returns the old value of the variable
  ``atomic_cmpxchg``    returns the old value of the variable
  ===================== =========================================

  In QEMU, the second kind is named ``atomic_OP_fetch``.  Currently Linux
  has ``atomic_fetch_or`` only, while QEMU provides ``atomic_fetch_``
  versions of and, or, xor, inc, dec, add and sub.

- different atomic read-modify-write operations in Linux imply
  a different set of memory barriers; in QEMU, all of them enforce
  sequential consistency, which means they imply full memory barriers
  before and after the operation.

- Linux does not have an equivalent of ``atomic_mb_set()``.  In particular,
  note that ``smp_store_mb()`` is a little weaker than ``atomic_mb_set()``.
  ``atomic_mb_read()`` compiles to the same instructions as Linux's
  ``smp_load_acquire()``, but this should be treated as an implementation
  detail.

Sources
=======

- ``Documentation/memory-barriers.txt`` from the Linux kernel