linux-5.10/Documentation/memory-barriers.txt

19 documentation at tools/memory-model/.  Nevertheless, even this memory
37 Note also that it is possible that a barrier may be a no-op for an
48      - Device operations.
49      - Guarantees.
53      - Varieties of memory barrier.
54      - What may not be assumed about memory barriers?
55      - Data dependency barriers (historical).
56      - Control dependencies.
57      - SMP barrier pairing.
58      - Examples of memory barrier sequences.
59      - Read memory barriers vs load speculation.
60      - Multicopy atomicity.
64      - Compiler barrier.
65      - CPU memory barriers.
69      - Lock acquisition functions.
70      - Interrupt disabling functions.
71      - Sleep and wake-up functions.
72      - Miscellaneous functions.
74  (*) Inter-CPU acquiring barrier effects.
76      - Acquires vs memory accesses.
80      - Interprocessor interaction.
81      - Atomic operations.
82      - Accessing devices.
83      - Interrupts.
91      - Cache coherency.
92      - Cache coherency vs DMA.
93      - Cache coherency vs MMIO.
97      - And then there's the Alpha.
98      - Virtual Machine Guests.
102      - Circular buffers.
116 		+-------+   :   +--------+   :   +-------+
119 		| CPU 1 |<----->| Memory |<----->| CPU 2 |
122 		+-------+   :   +--------+   :   +-------+
127 		    |       :   +--------+   :       |
130 		    +---------->| Device |<----------+
133 		            :   +--------+   :
159 	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
160 	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
161 	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
162 	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
163 	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
164 	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
165 	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
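
The combinations listed above can be reproduced mechanically.  The following
is a minimal userspace sketch, not kernel code.  It assumes initial values
A == 1 and B == 2, treats the four operations as freely reorderable, and uses
hypothetical names (run_order(), enumerate_orders()) chosen only for
illustration.  It walks every ordering of the two stores and two loads and
collects the distinct (x, y) results.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The four memory operations, in the shorthand of the listing above. */
enum op { STORE_A, STORE_B, LOAD_B, LOAD_A };

struct outcome { int x, y; };

/* Apply one ordering of the four operations to freshly initialised memory. */
static struct outcome run_order(const enum op order[4])
{
	int A = 1, B = 2;	/* assumed initial values */
	struct outcome r = { 0, 0 };

	for (size_t i = 0; i < 4; i++) {
		switch (order[i]) {
		case STORE_A: A = 3;   break;
		case STORE_B: B = 4;   break;
		case LOAD_B:  r.x = B; break;
		case LOAD_A:  r.y = A; break;
		}
	}
	return r;
}

/*
 * Try every permutation of the four operations.  Distinct results are
 * stored in seen[] and counted in *distinct; the return value is the
 * number of orderings tried.
 */
static int enumerate_orders(struct outcome seen[24], int *distinct)
{
	int total = 0;

	*distinct = 0;
	for (int a = 0; a < 4; a++)
	for (int b = 0; b < 4; b++)
	for (int c = 0; c < 4; c++)
	for (int d = 0; d < 4; d++) {
		if (a == b || a == c || a == d || b == c || b == d || c == d)
			continue;	/* not a permutation */

		const enum op order[4] = {
			(enum op)a, (enum op)b, (enum op)c, (enum op)d
		};
		struct outcome r = run_order(order);
		bool dup = false;

		for (int i = 0; i < *distinct; i++)
			if (seen[i].x == r.x && seen[i].y == r.y)
				dup = true;
		if (!dup) {
			assert(*distinct < 24);
			seen[(*distinct)++] = r;
		}
		total++;
	}
	return total;
}
```

Under these assumptions the walk visits 24 orderings and finds four distinct
result pairs: x is 2 or 4, combined with y being 1 or 3.
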
203 -----------------
225 ----------
239      emits a memory-barrier instruction, so that a DEC Alpha CPU will
310 And there are anti-guarantees:
313      generate code to modify these using non-atomic read-modify-write
318      in a given bitfield must be protected by one lock.  If two fields
319      in a given bitfield are protected by different locks, the compiler's
320      non-atomic read-modify-write sequences can cause an update to one
327      "char", two-byte alignment for "short", four-byte alignment for
328      "int", and either four-byte or eight-byte alignment for "long",
329      on 32-bit and 64-bit systems, respectively.  Note that these
331      using older pre-C11 compilers (for example, gcc 4.6).  The portion
337 		of adjacent bit-fields all having nonzero width
343 		NOTE 2: A bit-field and an adjacent non-bit-field member
345 		to two bit-fields, if one is declared inside a nested
347 		are separated by a zero-length bit-field declaration,
348 		or if they are separated by a non-bit-field member
350 		bit-fields in the same structure if all members declared
351 		between them are also bit-fields, no matter what the
352 		sizes of those intervening bit-fields happen to be.
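
The hazard with adjacent bit-fields can be sketched without real concurrency.
The following is an illustrative model only: it assumes two 4-bit fields
packed into one byte and replays one unlucky interleaving of the
read-modify-write sequences that a compiler might emit.  The helper names
(set_a(), set_b(), racy_update(), serialized_update()) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Two 4-bit fields sharing one byte, standing in for adjacent bit-fields. */
#define FIELD_A_MASK	0x0fu
#define FIELD_B_MASK	0xf0u

/* Non-atomic update of field A: modify a private copy of the whole byte. */
static uint8_t set_a(uint8_t word, uint8_t v)
{
	assert(v <= 0x0f);
	return (uint8_t)((word & ~FIELD_A_MASK) | v);
}

/* Non-atomic update of field B, likewise on a private copy. */
static uint8_t set_b(uint8_t word, uint8_t v)
{
	assert(v <= 0x0f);
	return (uint8_t)((word & ~FIELD_B_MASK) | (uint8_t)(v << 4));
}

/* Both "CPUs" load the byte before either stores: field A's update is lost. */
static uint8_t racy_update(uint8_t shared, uint8_t a, uint8_t b)
{
	uint8_t tmp0 = shared;		/* CPU 0 loads the byte */
	uint8_t tmp1 = shared;		/* CPU 1 loads the same byte */

	shared = set_a(tmp0, a);	/* CPU 0 stores its update */
	shared = set_b(tmp1, b);	/* CPU 1 stores a stale field A */
	return shared;
}

/* With both fields under one lock, the updates are serialized. */
static uint8_t serialized_update(uint8_t shared, uint8_t a, uint8_t b)
{
	shared = set_a(shared, a);
	shared = set_b(shared, b);
	return shared;
}
```

Starting from zero with a == 5 and b == 9, the racy interleaving leaves 0x90
(field A lost), whereas the serialized version leaves 0x95.
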
360 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
376 ---------------------------
468      This acts as a one-way permeable barrier.  It guarantees that all memory
483      This also acts as a one-way permeable barrier.  It guarantees that all
494      -not- guaranteed to act as a full memory barrier.  However, after an
505 RELEASE variants in addition to fully-ordered and relaxed (no barrier
522 ----------------------------------------------
541  (*) There is no guarantee that some intervening piece of off-the-CPU
548 	    Documentation/driver-api/pci/pci.rst
549 	    Documentation/core-api/dma-api-howto.rst
550 	    Documentation/core-api/dma-api.rst
554 -------------------------------------
558 to this section are those working on DEC Alpha architecture-specific code
561 data-dependency barriers.
610 even-numbered cache lines and the other bank processes odd-numbered cache
611 lines.  The pointer P might be stored in an odd-numbered cache line, and the
612 variable B might be stored in an even-numbered cache line.  Then, if the
613 even-numbered bank of the reading CPU's cache is extremely busy while the
614 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
618 A data-dependency barrier is not required to order dependent writes
635 Therefore, no data-dependency barrier is required to order the read into
637 even without a data-dependency barrier:
642 of dependency ordering is to -prevent- writes to the data structure, along
663 --------------------
669 A load-load control dependency requires a full read memory barrier, not
680 dependency, but rather a control dependency that the CPU may short-circuit
691 However, stores are not speculated.  This means that ordering -is- provided
692 for load-store control dependencies, as in the following example:
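
A minimal sketch of such a load-store control dependency, written so that it
also builds in userspace.  The READ_ONCE() and WRITE_ONCE() definitions here
are simplified stand-ins that assume GCC or Clang (for __typeof__); they are
not the kernel's definitions.  The function name store_if_a_set() is
hypothetical.

```c
/*
 * Simplified userspace stand-ins for the kernel macros (assumption: GCC
 * or Clang).  The volatile access keeps the compiler from eliding,
 * merging or re-fetching the access.
 */
#define READ_ONCE(x)	 (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))

static int a, b;

/* The store to 'b' is executed only if the load from 'a' returned non-zero. */
static void store_if_a_set(void)
{
	int q = READ_ONCE(a);

	if (q)
		WRITE_ONCE(b, 1);
}
```
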
707 variable 'a' is always non-zero, it would be well within its rights
737 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
740 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
760 In contrast, without explicit memory barriers, two-legged-if control
817 You must also be careful not to rely too much on boolean short-circuit
832 out-guess your code.  More generally, although READ_ONCE() does force
836 In addition, control dependencies apply only to the then-clause and
837 else-clause of the if-statement in question.  In particular, they do
838 not necessarily apply to code following the if-statement:
852 conditional-move instructions, as in this fanciful pseudo-assembly
865 In short, control dependencies apply only to the stores in the then-clause
866 and else-clause of the if-statement in question (including functions
867 invoked by those two clauses), not to code following that if-statement.
878       However, they do -not- guarantee any other sort of ordering:
887       to carry out the stores.  Please note that it is -not- sufficient
893   (*) Control dependencies require at least one run-time conditional
905   (*) Control dependencies apply only to the then-clause and else-clause
906       of the if-statement containing the control dependency, including
908       do -not- apply to code following the if-statement containing the
913   (*) Control dependencies do -not- provide multicopy atomicity.  If you
921 -------------------
923 When dealing with CPU-CPU interactions, certain types of memory barrier should
976 	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
980 	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
984 ------------------------------------
1003 	+-------+       :      :
1004 	|       |       +------+
1005 	|       |------>| C=3  |     }     /\
1006 	|       |  :    +------+     }-----  \  -----> Events perceptible to
1008 	|       |  :    +------+     }
1010 	|       |       +------+     }
1011 	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
1012 	|       |       +------+     }        requires all stores prior to the
1014 	|       |  :    +------+     }        further stores may take place
1015 	|       |------>| D=4  |     }
1016 	|       |       +------+
1017 	+-------+       :      :
1024 Secondly, data dependency barriers act as partial orderings on data-dependent
1040 	+-------+       :      :                :       :
1041 	|       |       +------+                +-------+  | Sequence of update
1042 	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
1043 	|       |  :    +------+     \          +-------+  | CPU 2
1044 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
1045 	|       |       +------+       |        +-------+
1047 	|       |       +------+       |        :       :
1048 	|       |  :    | C=&B |---    |        :       :       +-------+
1049 	|       |  :    +------+   \   |        +-------+       |       |
1050 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1051 	|       |       +------+       |        +-------+       |       |
1052 	+-------+       :      :       |        :       :       |       |
1055 	                               |        +-------+       |       |
1056 	    Apparently incorrect --->  |        | B->7  |------>|       |
1057 	    perception of B (!)        |        +-------+       |       |
1059 	                               |        +-------+       |       |
1060 	    The load of X holds --->    \       | X->9  |------>|       |
1061 	    up the maintenance           \      +-------+       |       |
1062 	    of coherence of B             ----->| B->2  |       +-------+
1063 	                                        +-------+
1086 	+-------+       :      :                :       :
1087 	|       |       +------+                +-------+
1088 	|       |------>| B=2  |-----       --->| Y->8  |
1089 	|       |  :    +------+     \          +-------+
1090 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
1091 	|       |       +------+       |        +-------+
1093 	|       |       +------+       |        :       :
1094 	|       |  :    | C=&B |---    |        :       :       +-------+
1095 	|       |  :    +------+   \   |        +-------+       |       |
1096 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1097 	|       |       +------+       |        +-------+       |       |
1098 	+-------+       :      :       |        :       :       |       |
1101 	                               |        +-------+       |       |
1102 	                               |        | X->9  |------>|       |
1103 	                               |        +-------+       |       |
1104 	  Makes sure all effects --->   \   ddddddddddddddddd   |       |
1105 	  prior to the store of C        \      +-------+       |       |
1106 	  are perceptible to              ----->| B->2  |------>|       |
1107 	  subsequent loads                      +-------+       |       |
1108 	                                        :       :       +-------+
1126 	+-------+       :      :                :       :
1127 	|       |       +------+                +-------+
1128 	|       |------>| A=1  |------      --->| A->0  |
1129 	|       |       +------+      \         +-------+
1130 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1131 	|       |       +------+        |       +-------+
1132 	|       |------>| B=2  |---     |       :       :
1133 	|       |       +------+   \    |       :       :       +-------+
1134 	+-------+       :      :    \   |       +-------+       |       |
1135 	                             ---------->| B->2  |------>|       |
1136 	                                |       +-------+       | CPU 2 |
1137 	                                |       | A->0  |------>|       |
1138 	                                |       +-------+       |       |
1139 	                                |       :       :       +-------+
1141 	                                  \     +-------+
1142 	                                   ---->| A->1  |
1143 	                                        +-------+
1163 	+-------+       :      :                :       :
1164 	|       |       +------+                +-------+
1165 	|       |------>| A=1  |------      --->| A->0  |
1166 	|       |       +------+      \         +-------+
1167 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1168 	|       |       +------+        |       +-------+
1169 	|       |------>| B=2  |---     |       :       :
1170 	|       |       +------+   \    |       :       :       +-------+
1171 	+-------+       :      :    \   |       +-------+       |       |
1172 	                             ---------->| B->2  |------>|       |
1173 	                                |       +-------+       | CPU 2 |
1176 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1177 	  barrier causes all effects      \     +-------+       |       |
1178 	  prior to the storage of B        ---->| A->1  |------>|       |
1179 	  to be perceptible to CPU 2            +-------+       |       |
1180 	                                        :       :       +-------+
1200 	+-------+       :      :                :       :
1201 	|       |       +------+                +-------+
1202 	|       |------>| A=1  |------      --->| A->0  |
1203 	|       |       +------+      \         +-------+
1204 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1205 	|       |       +------+        |       +-------+
1206 	|       |------>| B=2  |---     |       :       :
1207 	|       |       +------+   \    |       :       :       +-------+
1208 	+-------+       :      :    \   |       +-------+       |       |
1209 	                             ---------->| B->2  |------>|       |
1210 	                                |       +-------+       | CPU 2 |
1213 	                                |       +-------+       |       |
1214 	                                |       | A->0  |------>| 1st   |
1215 	                                |       +-------+       |       |
1216 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1217 	  barrier causes all effects      \     +-------+       |       |
1218 	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
1219 	  to be perceptible to CPU 2            +-------+       |       |
1220 	                                        :       :       +-------+
1226 	+-------+       :      :                :       :
1227 	|       |       +------+                +-------+
1228 	|       |------>| A=1  |------      --->| A->0  |
1229 	|       |       +------+      \         +-------+
1230 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1231 	|       |       +------+        |       +-------+
1232 	|       |------>| B=2  |---     |       :       :
1233 	|       |       +------+   \    |       :       :       +-------+
1234 	+-------+       :      :    \   |       +-------+       |       |
1235 	                             ---------->| B->2  |------>|       |
1236 	                                |       +-------+       | CPU 2 |
1239 	                                  \     +-------+       |       |
1240 	                                   ---->| A->1  |------>| 1st   |
1241 	                                        +-------+       |       |
1243 	                                        +-------+       |       |
1244 	                                        | A->1  |------>| 2nd   |
1245 	                                        +-------+       |       |
1246 	                                        :       :       +-------+
1255 ----------------------------------------
1259 other loads, and so do the load in advance - even though they haven't actually
1264 It may turn out that the CPU didn't actually need the value - perhaps because a
1265 branch circumvented the load - in which case it can discard the value or just
1279 	                                        :       :       +-------+
1280 	                                        +-------+       |       |
1281 	                                    --->| B->2  |------>|       |
1282 	                                        +-------+       | CPU 2 |
1284 	                                        +-------+       |       |
1285 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1286 	division speculates on the              +-------+   ~   |       |
1290 	Once the divisions are complete -->     :       :   ~-->|       |
1292 	LOAD with immediate effect              :       :       +-------+
1310 	                                        :       :       +-------+
1311 	                                        +-------+       |       |
1312 	                                    --->| B->2  |------>|       |
1313 	                                        +-------+       | CPU 2 |
1315 	                                        +-------+       |       |
1316 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1317 	division speculates on the              +-------+   ~   |       |
1324 	                                        :       :   ~-->|       |
1326 	                                        :       :       +-------+
1332 	                                        :       :       +-------+
1333 	                                        +-------+       |       |
1334 	                                    --->| B->2  |------>|       |
1335 	                                        +-------+       | CPU 2 |
1337 	                                        +-------+       |       |
1338 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1339 	division speculates on the              +-------+   ~   |       |
1345 	                                        +-------+       |       |
1346 	The speculation is discarded --->   --->| A->1  |------>|       |
1347 	and an updated value is                 +-------+       |       |
1348 	retrieved                               :       :       +-------+
1352 --------------------
1361 time to all -other- CPUs.  The remainder of this document discusses this
1380 Because CPU 3's load from X in some sense comes after CPU 2's load, it
1385 multicopy-atomic systems, CPU B's load must return either the same value
1395 able to compensate for non-multicopy atomicity.  For example, suppose
1406 This substitution allows non-multicopy atomicity to run rampant: in
1412 example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
1417 General barriers can compensate not only for non-multicopy atomicity,
1418 but can also generate additional ordering that can ensure that -all-
1419 CPUs will perceive the same order of -all- operations.  In contrast, a
1420 chain of release-acquire pairs does not provide this additional ordering,
1461 Furthermore, because of the release-acquire relationship between cpu0()
1467 However, the ordering provided by a release-acquire chain is local
1478 writes in order, CPUs not involved in the release-acquire chain might
1480 the weak memory-barrier instructions used to implement smp_load_acquire()
1483 store to u as happening -after- cpu1()'s load from v, even though
1489 -not- ensure that any particular value will be read.  Therefore, the
1514 ----------------
1521 This is a general barrier -- there are no read-read or write-write
1531      interrupt-handler code and the code that was interrupted.
1537 optimizations that, while perfectly safe in single-threaded code, can
1565      into the following code, which, although in some sense legitimate
1566      for single-threaded code, is almost certainly not what the developer
1587      single-threaded code, but can be fatal in concurrent code:
1605      single-threaded code, so you need to tell the compiler about cases
1619      This transformation is a win for single-threaded code because it
1638      the code into near-nonexistence.  (It will still load from the
1666      between process-level code and an interrupt handler:
1682      win for single-threaded code:
1743      In single-threaded code, this is not only safe, but also saves
1745      could cause some other CPU to see a spurious value of 42 -- even
1746      if variable 'a' was never zero -- when loading variable 'b'.
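
The spurious value can be made visible by logging every store to 'b'.  The
following is an illustrative userspace model rather than actual compiler
output: as_written() performs the store the developer asked for, and
as_transformed() performs the stores that a branch-saving compiler might
emit for plain C code.  All names here are hypothetical.

```c
#include <stddef.h>

#define MAX_STORES 4

/* Record of every value stored to 'b', in program order. */
struct store_log {
	int	val[MAX_STORES];
	size_t	n;
};

/* Store to 'b' and note the value another CPU could observe. */
static void store_b(struct store_log *log, int *b, int v)
{
	*b = v;
	if (log->n < MAX_STORES)
		log->val[log->n++] = v;
}

/* As written: exactly one store to 'b' on either path. */
static void as_written(struct store_log *log, int a, int *b)
{
	if (a)
		store_b(log, b, a);
	else
		store_b(log, b, 42);
}

/* As possibly transformed: an unconditional store of 42 comes first. */
static void as_transformed(struct store_log *log, int a, int *b)
{
	store_b(log, b, 42);
	if (a)
		store_b(log, b, a);
}
```

With a == 7, both versions leave b == 7, but the transformed version first
stores 42, which a concurrent reader of 'b' could observe.
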
1755      damaging, but they can result in cache-line bouncing and thus in
1760      with a single memory-reference instruction, prevents "load tearing"
1763      16-bit store instructions with 7-bit immediate fields, the compiler
1764      might be tempted to use two 16-bit store-immediate instructions to
1765      implement the following 32-bit store:
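
Store tearing can be modelled in userspace as follows.  This sketch takes
0x00010002 as an illustrative constant, each half of which fits in a small
immediate field, and splits the 32-bit store into two 16-bit stores using
explicit shifts, so the result does not depend on endianness.  The helper
names are hypothetical, and the low-half-first order is an arbitrary choice.

```c
#include <stdint.h>

/* Replace only the low 16 bits of a 32-bit word. */
static uint32_t store_low_half(uint32_t word, uint16_t lo)
{
	return (word & 0xffff0000u) | lo;
}

/* Replace only the high 16 bits of a 32-bit word. */
static uint32_t store_high_half(uint32_t word, uint16_t hi)
{
	return (word & 0x0000ffffu) | ((uint32_t)hi << 16);
}

/*
 * Perform a 32-bit store as two 16-bit stores.  Returns the value that
 * a concurrent reader could observe between the two halves.
 */
static uint32_t torn_store(uint32_t *p, uint32_t v)
{
	uint32_t mid;

	*p = store_low_half(*p, (uint16_t)(v & 0xffffu));
	mid = *p;			/* window in which a reader sees a mix */
	*p = store_high_half(*p, (uint16_t)(v >> 16));
	return mid;
}
```

Starting from zero, the intermediate value is 0x00000002, which is neither
the old nor the new value of the variable.
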
1772      This optimization can therefore be a win in single-threaded code.
1796      implement these three assignment statements as a pair of 32-bit
1797      loads followed by a pair of 32-bit stores.  This would result in
1817 -------------------
1843 systems because it is assumed that a CPU will appear to be self-consistent,
1854 windows.  These barriers are required even on non-SMP systems as they affect
1885 	obj->dead = 1;
1887 	atomic_dec(&obj->ref_count);
1907 	if (desc->status != DEVICE_OWN) {
1912 		read_data = desc->data;
1913 		desc->data = write_data;
1919 		desc->status = DEVICE_OWN;
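
The ownership handoff in the descriptor fragments above can be sketched as a
self-contained userspace model.  Everything beyond the names visible in those
fragments is an assumption: DEVICE_OWN is given an arbitrary value, CPU_OWN
and exchange_data() are hypothetical, and dma_rmb() and dma_wmb() are mapped
onto C11 fences, which only approximate the kernel primitives.  Notifying the
device afterwards is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CPU_OWN		0u	/* hypothetical value */
#define DEVICE_OWN	1u	/* hypothetical value */

/* Userspace stand-ins (assumptions, not the kernel's definitions). */
#define dma_rmb()	atomic_thread_fence(memory_order_acquire)
#define dma_wmb()	atomic_thread_fence(memory_order_release)

struct desc {
	volatile uint32_t status;	/* who owns the descriptor */
	uint32_t	  data;		/* payload shared with the device */
};

/* Swap data with the descriptor if the CPU owns it, then hand it back. */
static bool exchange_data(struct desc *desc, uint32_t write_data,
			  uint32_t *read_data)
{
	if (desc->status == DEVICE_OWN)
		return false;

	/* do not read data until we own the descriptor */
	dma_rmb();

	*read_data = desc->data;
	desc->data = write_data;

	/* make the data update visible before the status update */
	dma_wmb();

	desc->status = DEVICE_OWN;
	return true;
}
```
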
1935      relaxed I/O accessors and the Documentation/core-api/dma-api.rst file for
1944      For example, after a non-temporal write to a pmem region, we use pmem_wmb()
1966 --------------------------
2013 one-way barriers is that the effects of instructions outside of a critical
2034 RELEASE may -not- be assumed to be a full memory barrier.
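
The one-way nature of ACQUIRE and RELEASE can be illustrated in userspace
with C11 atomics.  This is an analogue only: memory_order_release and
memory_order_acquire play roughly the roles of smp_store_release() and
smp_load_acquire(), and the names publish() and try_consume() are
hypothetical.  Neither operation acts as a full barrier: accesses after the
release store, or before the acquire load, remain free to move across it.

```c
#include <stdatomic.h>
#include <stdbool.h>

static int payload;		/* plain data, published through 'ready' */
static atomic_int ready;	/* zero-initialised: nothing published yet */

/* Writer: the release store orders the payload write before the flag. */
static void publish(int value)
{
	payload = value;
	atomic_store_explicit(&ready, 1, memory_order_release);
}

/* Reader: the acquire load orders the flag read before the payload read. */
static bool try_consume(int *out)
{
	if (!atomic_load_explicit(&ready, memory_order_acquire))
		return false;
	*out = payload;
	return true;
}
```

A reader that sees the flag set is guaranteed to see the payload written
before the release store; a reader that does not see it learns nothing.
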
2059 	-could- occur.
2074 	a sleep-unlock race, but the locking primitive needs to resolve
2079 anything at all - especially with respect to I/O accesses - unless combined
2082 See also the section on "Inter-CPU acquiring barrier effects".
2112 -----------------------------
2120 SLEEP AND WAKE-UP FUNCTIONS
2121 ---------------------------
2146 	    STORE current->state
2189 	    STORE current->state	  ...
2191 	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
2192 					    STORE task->state
2237 order multiple stores before the wake-up with respect to loads of those stored
2273 -----------------------
2281 INTER-CPU ACQUIRING BARRIER EFFECTS
2290 ---------------------------
2323 be a problem as a single-threaded linear piece of code will still appear to
2337 --------------------------
2377 	LOAD waiter->list.next;
2378 	LOAD waiter->task;
2379 	STORE waiter->task;
2401 	LOAD waiter->task;
2402 	STORE waiter->task;
2410 	LOAD waiter->list.next;
2411 	--- OOPS ---
2418 	LOAD waiter->list.next;
2419 	LOAD waiter->task;
2421 	STORE waiter->task;
2431 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2438 -----------------
2449 -----------------
2458 efficient to reorder, combine or merge accesses - something that would cause
2462 routines - such as inb() or writel() - which know how to make such accesses
2468 See Documentation/driver-api/device-io.rst for more information.
2472 ----------
2478 This may be alleviated - at least in part - by disabling local interrupts (a
2480 the interrupt-disabled section in the driver.  While the driver's interrupt
2487 under interrupt-disablement and then the driver's interrupt handler is invoked:
2506 accesses performed in an interrupt - and vice versa - unless implicit or
2516 likely, then interrupt-disabling locks should be used to guarantee ordering.
2524 specific. Therefore, drivers which are inherently non-portable may rely on
2576 	The ordering properties of __iomem pointers obtained with non-default
2586 	bullets 2-5 above) but they are still guaranteed to be ordered with
2594 	register-based, memory-mapped FIFOs residing on peripherals that are not
2600 	The inX() and outX() accessors are intended to access legacy port-mapped
2611 	Device drivers may expect outX() to emit a non-posted write transaction
2629 little-endian and will therefore perform byte-swapping operations on big-endian
2637 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2641 of arch-specific code.
2644 stream in any order it feels like - or even in parallel - provided that if an
2650  [*] Some instructions have more than one effect - such as changing the
2651      condition codes, changing registers or changing memory - and different
2677 	    <--- CPU --->         :       <----------- Memory ----------->
2679 	+--------+    +--------+  :   +--------+    +-----------+
2680 	|        |    |        |  :   |        |    |           |    +--------+
2682 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2683 	|        |    | Queue  |  :   |        |    |           |--->| Memory |
2685 	+--------+    +--------+  :   +--------+    |           |    |        |
2686 	                          :                 | Cache     |    +--------+
2688 	                          :                 | Mechanism |    +--------+
2689 	+--------+    +--------+  :   +--------+    |           |    |        |
2691 	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
2692 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2694 	|        |    |        |  :   |        |    |           |    +--------+
2695 	+--------+    +--------+  :   +--------+    +-----------+
2726 ----------------------
2743 See Documentation/core-api/cachetlb.rst for more information on cache management.
2747 -----------------------
2803  (*) the CPU's data cache may affect the ordering, and while cache-coherency
2804      mechanisms may alleviate this - once the store has actually hit the cache
2805      - there's no guarantee that the coherency management will be propagated in
2816 However, it is guaranteed that a CPU will be self-consistent: it will see its
2843 are -not- optional in the above example, as there are architectures
2878 --------------------------
2882 two semantically-related cache lines updated at separate times.  This is where
2893 ----------------------
2898 barriers for this use-case would be possible but is often suboptimal.
2900 To handle this case optimally, low-level virt_mb() etc. macros are available.
2902 identical code for SMP and non-SMP systems.  For example, virtual machine guests
2916 ----------------
2921 	Documentation/core-api/circular-buffers.rst
2938 	Chapter 7.1: Memory-Access Ordering
2941 ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
2944 IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2959 	Chapter 15: Sparc-V9 Memory Models
2975 Solaris Internals, Core Kernel Architecture, p63-68: