linux-5.10/Documentation/memory-barriers.txt

19 documentation at tools/memory-model/.  Nevertheless, even this memory
37 Note also that it is possible that a barrier may be a no-op for an
48      - Device operations.
49      - Guarantees.
53      - Varieties of memory barrier.
54      - What may not be assumed about memory barriers?
55      - Data dependency barriers (historical).
56      - Control dependencies.
57      - SMP barrier pairing.
58      - Examples of memory barrier sequences.
59      - Read memory barriers vs load speculation.
60      - Multicopy atomicity.
64      - Compiler barrier.
65      - CPU memory barriers.
69      - Lock acquisition functions.
70      - Interrupt disabling functions.
71      - Sleep and wake-up functions.
72      - Miscellaneous functions.
74  (*) Inter-CPU acquiring barrier effects.
76      - Acquires vs memory accesses.
80      - Interprocessor interaction.
81      - Atomic operations.
82      - Accessing devices.
83      - Interrupts.
91      - Cache coherency.
92      - Cache coherency vs DMA.
93      - Cache coherency vs MMIO.
97      - And then there's the Alpha.
98      - Virtual Machine Guests.
102      - Circular buffers.
116 		+-------+   :   +--------+   :   +-------+
119 		| CPU 1 |<----->| Memory |<----->| CPU 2 |
122 		+-------+   :   +--------+   :   +-------+
127 		    |       :   +--------+   :       |
130 		    +---------->| Device |<----------+
133 		            :   +--------+   :
159 	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
160 	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
161 	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
162 	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
163 	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
164 	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
165 	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
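The orderings listed above can be checked by brute force.  What follows is a
minimal sketch, not part of the original text: it assumes initial values of
A == 1 and B == 2, treats the four accesses as freely reorderable, and uses
hypothetical helper names (run(), permute(), count_outcomes()).  It counts
the sequences the memory system might see and the distinct (x, y) results.

```c
enum op { STORE_A, STORE_B, LOAD_A, LOAD_B };

/* One bit per observed (x, y) result: bit index = (x == 4) * 2 + (y == 3). */
static unsigned int outcomes;
static int sequences;

/* Replay one ordering of the four accesses against a simple memory. */
static void run(const enum op *seq)
{
	int A = 1, B = 2;	/* assumed initial values */
	int x = 0, y = 0;

	for (int i = 0; i < 4; i++) {
		switch (seq[i]) {
		case STORE_A: A = 3; break;
		case STORE_B: B = 4; break;
		case LOAD_A:  y = A; break;
		case LOAD_B:  x = B; break;
		}
	}
	outcomes |= 1u << ((x == 4) * 2 + (y == 3));
	sequences++;
}

/* Generate every permutation of seq[k..3] by swapping in place. */
static void permute(enum op *seq, int k)
{
	if (k == 4) {
		run(seq);
		return;
	}
	for (int i = k; i < 4; i++) {
		enum op t = seq[k]; seq[k] = seq[i]; seq[i] = t;
		permute(seq, k + 1);
		t = seq[k]; seq[k] = seq[i]; seq[i] = t;
	}
}

/* Returns the number of distinct (x, y) results over all orderings. */
static int count_outcomes(void)
{
	enum op seq[4] = { STORE_A, STORE_B, LOAD_A, LOAD_B };
	int n = 0;

	outcomes = 0;
	sequences = 0;
	permute(seq, 0);
	for (unsigned int m = outcomes; m; m >>= 1)
		n += m & 1;
	return n;
}
```

Under these assumptions the sketch visits 24 sequences and finds four
distinct combinations of x and y.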
203 -----------------
225 ----------
239      emits a memory-barrier instruction, so that a DEC Alpha CPU will
310 And there are anti-guarantees:
313      generate code to modify these using non-atomic read-modify-write
320      non-atomic read-modify-write sequences can cause an update to one
327      "char", two-byte alignment for "short", four-byte alignment for
328      "int", and either four-byte or eight-byte alignment for "long",
329      on 32-bit and 64-bit systems, respectively.  Note that these
331      using older pre-C11 compilers (for example, gcc 4.6).  The portion
337 		of adjacent bit-fields all having nonzero width
343 		NOTE 2: A bit-field and an adjacent non-bit-field member
345 		to two bit-fields, if one is declared inside a nested
347 		are separated by a zero-length bit-field declaration,
348 		or if they are separated by a non-bit-field member
350 		bit-fields in the same structure if all members declared
351 		between them are also bit-fields, no matter what the
352 		sizes of those intervening bit-fields happen to be.
360 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
376 ---------------------------
468      This acts as a one-way permeable barrier.  It guarantees that all memory
483      This also acts as a one-way permeable barrier.  It guarantees that all
494      -not- guaranteed to act as a full memory barrier.  However, after an
505 RELEASE variants in addition to fully-ordered and relaxed (no barrier
522 ----------------------------------------------
529      access queue that accesses of the appropriate type may not cross.
541  (*) There is no guarantee that some intervening piece of off-the-CPU
548 	    Documentation/driver-api/pci/pci.rst
549 	    Documentation/core-api/dma-api-howto.rst
550 	    Documentation/core-api/dma-api.rst
554 -------------------------------------
558 to this section are those working on DEC Alpha architecture-specific code
561 data-dependency barriers.
610 even-numbered cache lines and the other bank processes odd-numbered cache
611 lines.  The pointer P might be stored in an odd-numbered cache line, and the
612 variable B might be stored in an even-numbered cache line.  Then, if the
613 even-numbered bank of the reading CPU's cache is extremely busy while the
614 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
618 A data-dependency barrier is not required to order dependent writes
635 Therefore, no data-dependency barrier is required to order the read into
637 even without a data-dependency barrier:
642 of dependency ordering is to -prevent- writes to the data structure, along
663 --------------------
669 A load-load control dependency requires a full read memory barrier, not
680 dependency, but rather a control dependency that the CPU may short-circuit
691 However, stores are not speculated.  This means that ordering -is- provided
692 for load-store control dependencies, as in the following example:
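A minimal sketch of such a load-store control dependency follows.  The
READ_ONCE() and WRITE_ONCE() macros here are userspace stand-ins built from
volatile casts, an assumption made so the fragment compiles outside the
kernel; they are not the kernel's definitions.  The function name
load_store_ctrl_dep() is likewise hypothetical.

```c
/* Userspace stand-ins (assumption): volatile accesses via GNU __typeof__. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static int a, b;

/* The store to 'b' is issued only if the load from 'a' returned non-zero. */
static void load_store_ctrl_dep(void)
{
	int q = READ_ONCE(a);

	if (q)
		WRITE_ONCE(b, 1);
}
```

The point of the sketch is only the shape: a marked load, a conditional on
its value, and a marked store inside the conditional.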
707 variable 'a' is always non-zero, it would be well within its rights
737 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
740 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
760 In contrast, without explicit memory barriers, two-legged-if control
817 You must also be careful not to rely too much on boolean short-circuit
832 out-guess your code.  More generally, although READ_ONCE() does force
836 In addition, control dependencies apply only to the then-clause and
837 else-clause of the if-statement in question.  In particular, it does
838 not necessarily apply to code following the if-statement:
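A sketch of this case, again using volatile-cast stand-ins for READ_ONCE()
and WRITE_ONCE() (an assumption for userspace illustration) and a
hypothetical function name:

```c
/* Userspace stand-ins (assumption): volatile accesses via GNU __typeof__. */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static int a, b, c;

static void store_after_if(void)
{
	int q = READ_ONCE(a);

	if (q)
		WRITE_ONCE(b, 1);	/* inside the then-clause */
	else
		WRITE_ONCE(b, 2);	/* inside the else-clause */
	WRITE_ONCE(c, 1);	/* follows the if: no ordering against the load from 'a' */
}
```

Only the two stores to 'b' sit inside the clauses of the if-statement; the
store to 'c' executes on both paths and so does not depend on the outcome of
the conditional.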
852 conditional-move instructions, as in this fanciful pseudo-assembly
865 In short, control dependencies apply only to the stores in the then-clause
866 and else-clause of the if-statement in question (including functions
867 invoked by those two clauses), not to code following that if-statement.
878       However, they do -not- guarantee any other sort of ordering:
887       to carry out the stores.  Please note that it is -not- sufficient
893   (*) Control dependencies require at least one run-time conditional
905   (*) Control dependencies apply only to the then-clause and else-clause
906       of the if-statement containing the control dependency, including
908       do -not- apply to code following the if-statement containing the
913   (*) Control dependencies do -not- provide multicopy atomicity.  If you
921 -------------------
923 When dealing with CPU-CPU interactions, certain types of memory barrier should
976 	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
980 	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
984 ------------------------------------
1003 	+-------+       :      :
1004 	|       |       +------+
1005 	|       |------>| C=3  |     }     /\
1006 	|       |  :    +------+     }-----  \  -----> Events perceptible to
1008 	|       |  :    +------+     }
1010 	|       |       +------+     }
1011 	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
1012 	|       |       +------+     }        requires all stores prior to the
1014 	|       |  :    +------+     }        further stores may take place
1015 	|       |------>| D=4  |     }
1016 	|       |       +------+
1017 	+-------+       :      :
1024 Secondly, data dependency barriers act as partial orderings on data-dependent
1040 	+-------+       :      :                :       :
1041 	|       |       +------+                +-------+  | Sequence of update
1042 	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
1043 	|       |  :    +------+     \          +-------+  | CPU 2
1044 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
1045 	|       |       +------+       |        +-------+
1047 	|       |       +------+       |        :       :
1048 	|       |  :    | C=&B |---    |        :       :       +-------+
1049 	|       |  :    +------+   \   |        +-------+       |       |
1050 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1051 	|       |       +------+       |        +-------+       |       |
1052 	+-------+       :      :       |        :       :       |       |
1055 	                               |        +-------+       |       |
1056 	    Apparently incorrect --->  |        | B->7  |------>|       |
1057 	    perception of B (!)        |        +-------+       |       |
1059 	                               |        +-------+       |       |
1060 	    The load of X holds --->    \       | X->9  |------>|       |
1061 	    up the maintenance           \      +-------+       |       |
1062 	    of coherence of B             ----->| B->2  |       +-------+
1063 	                                        +-------+
1086 	+-------+       :      :                :       :
1087 	|       |       +------+                +-------+
1088 	|       |------>| B=2  |-----       --->| Y->8  |
1089 	|       |  :    +------+     \          +-------+
1090 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
1091 	|       |       +------+       |        +-------+
1093 	|       |       +------+       |        :       :
1094 	|       |  :    | C=&B |---    |        :       :       +-------+
1095 	|       |  :    +------+   \   |        +-------+       |       |
1096 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1097 	|       |       +------+       |        +-------+       |       |
1098 	+-------+       :      :       |        :       :       |       |
1101 	                               |        +-------+       |       |
1102 	                               |        | X->9  |------>|       |
1103 	                               |        +-------+       |       |
1104 	  Makes sure all effects --->   \   ddddddddddddddddd   |       |
1105 	  prior to the store of C        \      +-------+       |       |
1106 	  are perceptible to              ----->| B->2  |------>|       |
1107 	  subsequent loads                      +-------+       |       |
1108 	                                        :       :       +-------+
1126 	+-------+       :      :                :       :
1127 	|       |       +------+                +-------+
1128 	|       |------>| A=1  |------      --->| A->0  |
1129 	|       |       +------+      \         +-------+
1130 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1131 	|       |       +------+        |       +-------+
1132 	|       |------>| B=2  |---     |       :       :
1133 	|       |       +------+   \    |       :       :       +-------+
1134 	+-------+       :      :    \   |       +-------+       |       |
1135 	                             ---------->| B->2  |------>|       |
1136 	                                |       +-------+       | CPU 2 |
1137 	                                |       | A->0  |------>|       |
1138 	                                |       +-------+       |       |
1139 	                                |       :       :       +-------+
1141 	                                  \     +-------+
1142 	                                   ---->| A->1  |
1143 	                                        +-------+
1163 	+-------+       :      :                :       :
1164 	|       |       +------+                +-------+
1165 	|       |------>| A=1  |------      --->| A->0  |
1166 	|       |       +------+      \         +-------+
1167 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1168 	|       |       +------+        |       +-------+
1169 	|       |------>| B=2  |---     |       :       :
1170 	|       |       +------+   \    |       :       :       +-------+
1171 	+-------+       :      :    \   |       +-------+       |       |
1172 	                             ---------->| B->2  |------>|       |
1173 	                                |       +-------+       | CPU 2 |
1176 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1177 	  barrier causes all effects      \     +-------+       |       |
1178 	  prior to the storage of B        ---->| A->1  |------>|       |
1179 	  to be perceptible to CPU 2            +-------+       |       |
1180 	                                        :       :       +-------+
1200 	+-------+       :      :                :       :
1201 	|       |       +------+                +-------+
1202 	|       |------>| A=1  |------      --->| A->0  |
1203 	|       |       +------+      \         +-------+
1204 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1205 	|       |       +------+        |       +-------+
1206 	|       |------>| B=2  |---     |       :       :
1207 	|       |       +------+   \    |       :       :       +-------+
1208 	+-------+       :      :    \   |       +-------+       |       |
1209 	                             ---------->| B->2  |------>|       |
1210 	                                |       +-------+       | CPU 2 |
1213 	                                |       +-------+       |       |
1214 	                                |       | A->0  |------>| 1st   |
1215 	                                |       +-------+       |       |
1216 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1217 	  barrier causes all effects      \     +-------+       |       |
1218 	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
1219 	  to be perceptible to CPU 2            +-------+       |       |
1220 	                                        :       :       +-------+
1226 	+-------+       :      :                :       :
1227 	|       |       +------+                +-------+
1228 	|       |------>| A=1  |------      --->| A->0  |
1229 	|       |       +------+      \         +-------+
1230 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1231 	|       |       +------+        |       +-------+
1232 	|       |------>| B=2  |---     |       :       :
1233 	|       |       +------+   \    |       :       :       +-------+
1234 	+-------+       :      :    \   |       +-------+       |       |
1235 	                             ---------->| B->2  |------>|       |
1236 	                                |       +-------+       | CPU 2 |
1239 	                                  \     +-------+       |       |
1240 	                                   ---->| A->1  |------>| 1st   |
1241 	                                        +-------+       |       |
1243 	                                        +-------+       |       |
1244 	                                        | A->1  |------>| 2nd   |
1245 	                                        +-------+       |       |
1246 	                                        :       :       +-------+
1255 ----------------------------------------
1259 other loads, and so do the load in advance - even though they haven't actually
1264 It may turn out that the CPU didn't actually need the value - perhaps because a
1265 branch circumvented the load - in which case it can discard the value or just
1279 	                                        :       :       +-------+
1280 	                                        +-------+       |       |
1281 	                                    --->| B->2  |------>|       |
1282 	                                        +-------+       | CPU 2 |
1284 	                                        +-------+       |       |
1285 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1286 	division speculates on the              +-------+   ~   |       |
1290 	Once the divisions are complete -->     :       :   ~-->|       |
1292 	LOAD with immediate effect              :       :       +-------+
1310 	                                        :       :       +-------+
1311 	                                        +-------+       |       |
1312 	                                    --->| B->2  |------>|       |
1313 	                                        +-------+       | CPU 2 |
1315 	                                        +-------+       |       |
1316 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1317 	division speculates on the              +-------+   ~   |       |
1324 	                                        :       :   ~-->|       |
1326 	                                        :       :       +-------+
1332 	                                        :       :       +-------+
1333 	                                        +-------+       |       |
1334 	                                    --->| B->2  |------>|       |
1335 	                                        +-------+       | CPU 2 |
1337 	                                        +-------+       |       |
1338 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1339 	division speculates on the              +-------+   ~   |       |
1345 	                                        +-------+       |       |
1346 	The speculation is discarded --->   --->| A->1  |------>|       |
1347 	and an updated value is                 +-------+       |       |
1348 	retrieved                               :       :       +-------+
1352 --------------------
1361 time to all -other- CPUs.  The remainder of this document discusses this
1385 multicopy-atomic systems, CPU B's load must return either the same value
1395 able to compensate for non-multicopy atomicity.  For example, suppose
1406 This substitution allows non-multicopy atomicity to run rampant: in
1412 example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
1417 General barriers can compensate not only for non-multicopy atomicity,
1418 but can also generate additional ordering that can ensure that -all-
1419 CPUs will perceive the same order of -all- operations.  In contrast, a
1420 chain of release-acquire pairs does not provide this additional ordering,
1461 Furthermore, because of the release-acquire relationship between cpu0()
1467 However, the ordering provided by a release-acquire chain is local
1478 writes in order, CPUs not involved in the release-acquire chain might
1480 the weak memory-barrier instructions used to implement smp_load_acquire()
1483 store to u as happening -after- cpu1()'s load from v, even though
1489 -not- ensure that any particular value will be read.  Therefore, the
1514 ----------------
1521 This is a general barrier -- there are no read-read or write-write
1531      interrupt-handler code and the code that was interrupted.
1537 optimizations that, while perfectly safe in single-threaded code, can
1566      for single-threaded code, is almost certainly not what the developer
1587      single-threaded code, but can be fatal in concurrent code:
1605      single-threaded code, so you need to tell the compiler about cases
1619      This transformation is a win for single-threaded code because it
1638      the code into near-nonexistence.  (It will still load from the
1666      between process-level code and an interrupt handler:
1682      win for single-threaded code:
1743      In single-threaded code, this is not only safe, but also saves
1745      could cause some other CPU to see a spurious value of 42 -- even
1746      if variable 'a' was never zero -- when loading variable 'b'.
1755      damaging, but they can result in cache-line bouncing and thus in
1760      with a single memory-reference instruction, prevents "load tearing"
1763      16-bit store instructions with 7-bit immediate fields, the compiler
1764      might be tempted to use two 16-bit store-immediate instructions to
1765      implement the following 32-bit store:
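A sketch of the store in question, next to a marked version, might look as
follows.  The constant 0x00010002 is chosen so that each 16-bit half fits a
small immediate field; the WRITE_ONCE() macro is a volatile-cast stand-in
(an assumption for userspace illustration), and both function names are
hypothetical.

```c
#include <stdint.h>

/* Userspace stand-in (assumption): volatile store via GNU __typeof__. */
#define WRITE_ONCE(x, v)	(*(volatile __typeof__(x) *)&(x) = (v))

static uint32_t p;

/* Plain store: a compiler could in principle emit two 16-bit stores. */
static void plain_store(void)
{
	p = 0x00010002;
}

/* Marked store: a single volatile access to an aligned 32-bit object. */
static void marked_store(void)
{
	WRITE_ONCE(p, 0x00010002);
}
```

Both functions leave the same value in 'p' when run alone.  The difference
matters only to a concurrent reader, which could observe a half-updated
value if the plain store were split.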
1772      This optimization can therefore be a win in single-threaded code.
1782 	struct __attribute__((__packed__)) foo {
1787 	struct foo foo1, foo2;
1796      implement these three assignment statements as a pair of 32-bit
1797      loads followed by a pair of 32-bit stores.  This would result in
1817 -------------------
1843 systems because it is assumed that a CPU will appear to be self-consistent,
1854 windows.  These barriers are required even on non-SMP systems as they affect
1885 	obj->dead = 1;
1887 	atomic_dec(&obj->ref_count);
1907 	if (desc->status != DEVICE_OWN) {
1912 		read_data = desc->data;
1913 		desc->data = write_data;
1919 		desc->status = DEVICE_OWN;
1935      relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for
1944      For example, after a non-temporal write to a pmem region, we use pmem_wmb()
1966 --------------------------
2013 one-way barriers is that the effects of instructions outside of a critical
2034 RELEASE may -not- be assumed to be a full memory barrier.
2059 	-could- occur.
2074 	a sleep-unlock race, but the locking primitive needs to resolve
2079 anything at all - especially with respect to I/O accesses - unless combined
2082 See also the section on "Inter-CPU acquiring barrier effects".
2112 -----------------------------
2120 SLEEP AND WAKE-UP FUNCTIONS
2121 ---------------------------
2146 	    STORE current->state
2189 	    STORE current->state	  ...
2191 	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
2192 					    STORE task->state
2237 order multiple stores before the wake-up with respect to loads of those stored
2273 -----------------------
2281 INTER-CPU ACQUIRING BARRIER EFFECTS
2290 ---------------------------
2323 be a problem as a single-threaded linear piece of code will still appear to
2337 --------------------------
2377 	LOAD waiter->list.next;
2378 	LOAD waiter->task;
2379 	STORE waiter->task;
2398 					Queue waiter
2401 	LOAD waiter->task;
2402 	STORE waiter->task;
2407 					call foo()
2408 					foo() clobbers *waiter
2410 	LOAD waiter->list.next;
2411 	--- OOPS ---
2418 	LOAD waiter->list.next;
2419 	LOAD waiter->task;
2421 	STORE waiter->task;
2431 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2438 -----------------
2449 -----------------
2458 efficient to reorder, combine or merge accesses - something that would cause
2462 routines - such as inb() or writel() - which know how to make such accesses
2468 See Documentation/driver-api/device-io.rst for more information.
2472 ----------
2478 This may be alleviated - at least in part - by disabling local interrupts (a
2480 the interrupt-disabled section in the driver.  While the driver's interrupt
2487 under interrupt-disablement and then the driver's interrupt handler is invoked:
2506 accesses performed in an interrupt - and vice versa - unless implicit or
2516 likely, then interrupt-disabling locks should be used to guarantee ordering.
2524 specific. Therefore, drivers which are inherently non-portable may rely on
2576 	The ordering properties of __iomem pointers obtained with non-default
2586 	bullets 2-5 above) but they are still guaranteed to be ordered with
2594 	register-based, memory-mapped FIFOs residing on peripherals that are not
2600 	The inX() and outX() accessors are intended to access legacy port-mapped
2611 	Device drivers may expect outX() to emit a non-posted write transaction
2629 little-endian and will therefore perform byte-swapping operations on big-endian
2637 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2641 of arch-specific code.
2644 stream in any order it feels like - or even in parallel - provided that if an
2650  [*] Some instructions have more than one effect - such as changing the
2651      condition codes, changing registers or changing memory - and different
2677 	    <--- CPU --->         :       <----------- Memory ----------->
2679 	+--------+    +--------+  :   +--------+    +-----------+
2680 	|        |    |        |  :   |        |    |           |    +--------+
2682 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2683 	|        |    | Queue  |  :   |        |    |           |--->| Memory |
2685 	+--------+    +--------+  :   +--------+    |           |    |        |
2686 	                          :                 | Cache     |    +--------+
2688 	                          :                 | Mechanism |    +--------+
2689 	+--------+    +--------+  :   +--------+    |           |    |        |
2691 	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
2692 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2693 	|        |    | Queue  |  :   |        |    |           |    |        |
2694 	|        |    |        |  :   |        |    |           |    +--------+
2695 	+--------+    +--------+  :   +--------+    +-----------+
2707 generate load and store operations which then go into the queue of memory
2708 accesses to be performed.  The core may place these in the queue in any order
2726 ----------------------
2743 See Documentation/core-api/cachetlb.rst for more information on cache management.
2747 -----------------------
2803  (*) the CPU's data cache may affect the ordering, and while cache-coherency
2804      mechanisms may alleviate this - once the store has actually hit the cache
2805      - there's no guarantee that the coherency management will be propagated in
2816 However, it is guaranteed that a CPU will be self-consistent: it will see its
2843 are -not- optional in the above example, as there are architectures
2878 --------------------------
2882 two semantically-related cache lines updated at separate times.  This is where
2893 ----------------------
2898 barriers for this use-case would be possible but is often suboptimal.
2900 To handle this case optimally, low-level virt_mb() etc macros are available.
2902 identical code for SMP and non-SMP systems.  For example, virtual machine guests
2916 ----------------
2921 	Documentation/core-api/circular-buffers.rst
2938 	Chapter 7.1: Memory-Access Ordering
2941 ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
2944 IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2959 	Chapter 15: Sparc-V9 Memory Models
2975 Solaris Internals, Core Kernel Architecture, p63-68: