linux-6.8/Documentation/memory-barriers.txt

19 documentation at tools/memory-model/.  Nevertheless, even this memory
37 Note also that it is possible that a barrier may be a no-op for an
48      - Device operations.
49      - Guarantees.
53      - Varieties of memory barrier.
54      - What may not be assumed about memory barriers?
55      - Address-dependency barriers (historical).
56      - Control dependencies.
57      - SMP barrier pairing.
58      - Examples of memory barrier sequences.
59      - Read memory barriers vs load speculation.
60      - Multicopy atomicity.
64      - Compiler barrier.
65      - CPU memory barriers.
69      - Lock acquisition functions.
70      - Interrupt disabling functions.
71      - Sleep and wake-up functions.
72      - Miscellaneous functions.
74  (*) Inter-CPU acquiring barrier effects.
76      - Acquires vs memory accesses.
80      - Interprocessor interaction.
81      - Atomic operations.
82      - Accessing devices.
83      - Interrupts.
91      - Cache coherency.
92      - Cache coherency vs DMA.
93      - Cache coherency vs MMIO.
97      - And then there's the Alpha.
98      - Virtual Machine Guests.
102      - Circular buffers.
116 		+-------+   :   +--------+   :   +-------+
119 		| CPU 1 |<----->| Memory |<----->| CPU 2 |
122 		+-------+   :   +--------+   :   +-------+
127 		    |       :   +--------+   :       |
130 		    +---------->| Device |<----------+
133 		            :   +--------+   :
159 	STORE A=3,	STORE B=4,	y=LOAD A->3,	x=LOAD B->4
160 	STORE A=3,	STORE B=4,	x=LOAD B->4,	y=LOAD A->3
161 	STORE A=3,	y=LOAD A->3,	STORE B=4,	x=LOAD B->4
162 	STORE A=3,	y=LOAD A->3,	x=LOAD B->2,	STORE B=4
163 	STORE A=3,	x=LOAD B->2,	STORE B=4,	y=LOAD A->3
164 	STORE A=3,	x=LOAD B->2,	y=LOAD A->3,	STORE B=4
165 	STORE B=4,	STORE A=3,	y=LOAD A->3,	x=LOAD B->4
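The orderings listed above can be reproduced mechanically.  The following is
a minimal sketch, not part of the kernel: it assumes the initial state
{ A == 1, B == 2 } used by the upstream example and models a memory system
that may perform the four operations in any order at all.  All names in it
(enum op, struct tally, enumerate_orderings(), count_outcomes()) are
hypothetical helpers introduced for illustration.

```c
#include <stdbool.h>

/* The four abstract operations: CPU 1 stores A and B, CPU 2 loads B and A. */
enum op { STORE_A, STORE_B, LOAD_B, LOAD_A, NR_OPS };

struct tally {
	int orderings;		/* number of permutations visited */
	bool seen[2][2];	/* seen[x == 4][y == 3] */
};

/* Execute one ordering from the assumed initial state A == 1, B == 2. */
static void run_one(const enum op *order, struct tally *t)
{
	int A = 1, B = 2, x = 0, y = 0;

	for (int i = 0; i < NR_OPS; i++) {
		switch (order[i]) {
		case STORE_A: A = 3; break;
		case STORE_B: B = 4; break;
		case LOAD_B:  x = B; break;
		case LOAD_A:  y = A; break;
		default: break;
		}
	}
	t->orderings++;
	t->seen[x == 4][y == 3] = true;
}

/* Visit every permutation of order[k..NR_OPS-1] by swapping in place. */
static void permute(enum op *order, int k, struct tally *t)
{
	if (k == NR_OPS) {
		run_one(order, t);
		return;
	}
	for (int i = k; i < NR_OPS; i++) {
		enum op tmp = order[k];

		order[k] = order[i];
		order[i] = tmp;
		permute(order, k + 1, t);
		tmp = order[k];
		order[k] = order[i];
		order[i] = tmp;
	}
}

struct tally enumerate_orderings(void)
{
	enum op order[NR_OPS] = { STORE_A, STORE_B, LOAD_B, LOAD_A };
	struct tally t = { 0 };

	permute(order, 0, &t);
	return t;
}

/* Number of distinct (x, y) results observed across all orderings. */
int count_outcomes(const struct tally *t)
{
	int n = 0;

	for (int i = 0; i < 2; i++)
		for (int j = 0; j < 2; j++)
			n += t->seen[i][j];
	return n;
}
```

Calling enumerate_orderings() visits every permutation of the four
operations, and count_outcomes() reports how many distinct (x, y) pairs
those permutations produce.  The model deliberately ignores program order
within each CPU, which is the unconstrained case the listing describes.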
203 -----------------
225 ----------
239      emits a memory-barrier instruction, so that a DEC Alpha CPU will
310 And there are anti-guarantees:
313      generate code to modify these using non-atomic read-modify-write
318      in a given bitfield must be protected by one lock.  If two fields
319      in a given bitfield are protected by different locks, the compiler's
320      non-atomic read-modify-write sequences can cause an update to one
327      "char", two-byte alignment for "short", four-byte alignment for
328      "int", and either four-byte or eight-byte alignment for "long",
329      on 32-bit and 64-bit systems, respectively.  Note that these
331      using older pre-C11 compilers (for example, gcc 4.6).  The portion
337 		of adjacent bit-fields all having nonzero width
343 		NOTE 2: A bit-field and an adjacent non-bit-field member
345 		to two bit-fields, if one is declared inside a nested
347 		are separated by a zero-length bit-field declaration,
348 		or if they are separated by a non-bit-field member
350 		bit-fields in the same structure if all members declared
351 		between them are also bit-fields, no matter what the
352 		sizes of those intervening bit-fields happen to be.
360 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
376 ---------------------------
395      address-dependency barriers; see the "SMP barrier pairing" subsection.
398  (2) Address-dependency barriers (historical).
399      [!] This section is marked as HISTORICAL: it covers the long-obsolete
401      implicit in all marked accesses.  For more up-to-date information,
405      An address-dependency barrier is a weaker form of read barrier.  In the
408      the second load will be directed), an address-dependency barrier would
412      An address-dependency barrier is a partial ordering on interdependent
418      considered can then perceive.  An address-dependency barrier issued by
423      the address-dependency barrier.
435      [!] Note that address-dependency barriers should normally be paired with
438      [!] Kernel release v5.9 removed kernel APIs for explicit address-
441      address-dependency barriers.
445      A read barrier is an address-dependency barrier plus a guarantee that all
453      Read memory barriers imply address-dependency barriers, and so can
477      This acts as a one-way permeable barrier.  It guarantees that all memory
492      This also acts as a one-way permeable barrier.  It guarantees that all
503      -not- guaranteed to act as a full memory barrier.  However, after an
514 RELEASE variants in addition to fully-ordered and relaxed (no barrier
531 ----------------------------------------------
550  (*) There is no guarantee that some intervening piece of off-the-CPU
557 	    Documentation/driver-api/pci/pci.rst
558 	    Documentation/core-api/dma-api-howto.rst
559 	    Documentation/core-api/dma-api.rst
562 ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
563 ----------------------------------------
564 [!] This section is marked as HISTORICAL: it covers the long-obsolete
566 in all marked accesses.  For more up-to-date information, including
572 to this section are those working on DEC Alpha architecture-specific code
575 address-dependency barriers.
577 [!] While address dependencies are observed in both load-to-load and
578 load-to-store relations, address-dependency barriers are not necessary
579 for load-to-store situations.
581 The requirement of address-dependency barriers is a little subtle, and
594 [!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernels, which
595 doesn't imply an address-dependency barrier.
612 To deal with this, READ_ONCE() provides an implicit address-dependency barrier
622 			      <implicit address-dependency barrier>
631 even-numbered cache lines and the other bank processes odd-numbered cache
632 lines.  The pointer P might be stored in an odd-numbered cache line, and the
633 variable B might be stored in an even-numbered cache line.  Then, if the
634 even-numbered bank of the reading CPU's cache is extremely busy while the
635 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
639 An address-dependency barrier is not required to order dependent writes
656 Therefore, no address-dependency barrier is required to order the read into
658 even without an implicit address-dependency barrier of modern READ_ONCE():
663 of dependency ordering is to -prevent- writes to the data structure, along
674 The address-dependency barrier is very important to the RCU system,
684 --------------------
690 A load-load control dependency requires a full read memory barrier, not
691 simply an (implicit) address-dependency barrier to make it work correctly.
695 	<implicit address-dependency barrier>
702 dependency, but rather a control dependency that the CPU may short-circuit
713 However, stores are not speculated.  This means that ordering -is- provided
714 for load-store control dependencies, as in the following example:
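A minimal user-space sketch of the load-store shape described above.  The
READ_ONCE() and WRITE_ONCE() macros here are simplified volatile-cast
stand-ins that assume GNU C __typeof__; they are not the kernel's
definitions.  The variables 'a' and 'b' and the function name are
hypothetical.  Run on a single thread, this shows only the structure of the
code, not the cross-CPU ordering itself.

```c
/* Simplified stand-ins for the kernel macros (assumption, see above). */
#define READ_ONCE(x)		(*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

int a, b;

/* Hypothetical helper: returns the value that was loaded from 'a'. */
int store_b_if_a_set(void)
{
	int q = READ_ONCE(a);

	if (q) {
		/* This store is conditional on the load above. */
		WRITE_ONCE(b, 1);
	}
	return q;
}
```

The volatile accesses keep the compiler from fusing or eliding the load and
the store, so the conditional branch between them survives into the
generated code.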
729 variable 'a' is always non-zero, it would be well within its rights
759 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
762 		/* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
782 In contrast, without explicit memory barriers, two-legged-if control
839 You must also be careful not to rely too much on boolean short-circuit
854 out-guess your code.  More generally, although READ_ONCE() does force
858 In addition, control dependencies apply only to the then-clause and
859 else-clause of the if-statement in question.  In particular, they do
860 not necessarily apply to code following the if-statement:
874 conditional-move instructions, as in this fanciful pseudo-assembly
887 In short, control dependencies apply only to the stores in the then-clause
888 and else-clause of the if-statement in question (including functions
889 invoked by those two clauses), not to code following that if-statement.
900       However, they do -not- guarantee any other sort of ordering:
909       to carry out the stores.  Please note that it is -not- sufficient
915   (*) Control dependencies require at least one run-time conditional
927   (*) Control dependencies apply only to the then-clause and else-clause
928       of the if-statement containing the control dependency, including
930       do -not- apply to code following the if-statement containing the
935   (*) Control dependencies do -not- provide multicopy atomicity.  If you
943 -------------------
945 When dealing with CPU-CPU interactions, certain types of memory barrier should
952 with an address-dependency barrier, a control dependency, an acquire barrier,
954 read barrier, control dependency, or an address-dependency barrier pairs
973 			      <implicit address-dependency barrier>
993 match the loads after the read barrier or the address-dependency barrier, and
998 	WRITE_ONCE(a, 1);    }----   --->{  v = READ_ONCE(c);
1002 	WRITE_ONCE(d, 4);    }----   --->{  y = READ_ONCE(b);
1006 ------------------------------------
1025 	+-------+       :      :
1026 	|       |       +------+
1027 	|       |------>| C=3  |     }     /\
1028 	|       |  :    +------+     }-----  \  -----> Events perceptible to
1030 	|       |  :    +------+     }
1032 	|       |       +------+     }
1033 	|       |   wwwwwwwwwwwwwwww }   <--- At this point the write barrier
1034 	|       |       +------+     }        requires all stores prior to the
1036 	|       |  :    +------+     }        further stores may take place
1037 	|       |------>| D=4  |     }
1038 	|       |       +------+
1039 	+-------+       :      :
1046 Secondly, address-dependency barriers act as partial orderings on address-
1062 	+-------+       :      :                :       :
1063 	|       |       +------+                +-------+  | Sequence of update
1064 	|       |------>| B=2  |-----       --->| Y->8  |  | of perception on
1065 	|       |  :    +------+     \          +-------+  | CPU 2
1066 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |  V
1067 	|       |       +------+       |        +-------+
1069 	|       |       +------+       |        :       :
1070 	|       |  :    | C=&B |---    |        :       :       +-------+
1071 	|       |  :    +------+   \   |        +-------+       |       |
1072 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1073 	|       |       +------+       |        +-------+       |       |
1074 	+-------+       :      :       |        :       :       |       |
1077 	                               |        +-------+       |       |
1078 	    Apparently incorrect --->  |        | B->7  |------>|       |
1079 	    perception of B (!)        |        +-------+       |       |
1081 	                               |        +-------+       |       |
1082 	    The load of X holds --->    \       | X->9  |------>|       |
1083 	    up the maintenance           \      +-------+       |       |
1084 	    of coherence of B             ----->| B->2  |       +-------+
1085 	                                        +-------+
1092 If, however, an address-dependency barrier were to be placed between the load
1103 				<address-dependency barrier>
1108 	+-------+       :      :                :       :
1109 	|       |       +------+                +-------+
1110 	|       |------>| B=2  |-----       --->| Y->8  |
1111 	|       |  :    +------+     \          +-------+
1112 	| CPU 1 |  :    | A=1  |      \     --->| C->&Y |
1113 	|       |       +------+       |        +-------+
1115 	|       |       +------+       |        :       :
1116 	|       |  :    | C=&B |---    |        :       :       +-------+
1117 	|       |  :    +------+   \   |        +-------+       |       |
1118 	|       |------>| D=4  |    ----------->| C->&B |------>|       |
1119 	|       |       +------+       |        +-------+       |       |
1120 	+-------+       :      :       |        :       :       |       |
1123 	                               |        +-------+       |       |
1124 	                               |        | X->9  |------>|       |
1125 	                               |        +-------+       |       |
1126 	  Makes sure all effects --->   \   aaaaaaaaaaaaaaaaa   |       |
1127 	  prior to the store of C        \      +-------+       |       |
1128 	  are perceptible to              ----->| B->2  |------>|       |
1129 	  subsequent loads                      +-------+       |       |
1130 	                                        :       :       +-------+
1148 	+-------+       :      :                :       :
1149 	|       |       +------+                +-------+
1150 	|       |------>| A=1  |------      --->| A->0  |
1151 	|       |       +------+      \         +-------+
1152 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1153 	|       |       +------+        |       +-------+
1154 	|       |------>| B=2  |---     |       :       :
1155 	|       |       +------+   \    |       :       :       +-------+
1156 	+-------+       :      :    \   |       +-------+       |       |
1157 	                             ---------->| B->2  |------>|       |
1158 	                                |       +-------+       | CPU 2 |
1159 	                                |       | A->0  |------>|       |
1160 	                                |       +-------+       |       |
1161 	                                |       :       :       +-------+
1163 	                                  \     +-------+
1164 	                                   ---->| A->1  |
1165 	                                        +-------+
1185 	+-------+       :      :                :       :
1186 	|       |       +------+                +-------+
1187 	|       |------>| A=1  |------      --->| A->0  |
1188 	|       |       +------+      \         +-------+
1189 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1190 	|       |       +------+        |       +-------+
1191 	|       |------>| B=2  |---     |       :       :
1192 	|       |       +------+   \    |       :       :       +-------+
1193 	+-------+       :      :    \   |       +-------+       |       |
1194 	                             ---------->| B->2  |------>|       |
1195 	                                |       +-------+       | CPU 2 |
1198 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1199 	  barrier causes all effects      \     +-------+       |       |
1200 	  prior to the storage of B        ---->| A->1  |------>|       |
1201 	  to be perceptible to CPU 2            +-------+       |       |
1202 	                                        :       :       +-------+
1222 	+-------+       :      :                :       :
1223 	|       |       +------+                +-------+
1224 	|       |------>| A=1  |------      --->| A->0  |
1225 	|       |       +------+      \         +-------+
1226 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1227 	|       |       +------+        |       +-------+
1228 	|       |------>| B=2  |---     |       :       :
1229 	|       |       +------+   \    |       :       :       +-------+
1230 	+-------+       :      :    \   |       +-------+       |       |
1231 	                             ---------->| B->2  |------>|       |
1232 	                                |       +-------+       | CPU 2 |
1235 	                                |       +-------+       |       |
1236 	                                |       | A->0  |------>| 1st   |
1237 	                                |       +-------+       |       |
1238 	  At this point the read ---->   \  rrrrrrrrrrrrrrrrr   |       |
1239 	  barrier causes all effects      \     +-------+       |       |
1240 	  prior to the storage of B        ---->| A->1  |------>| 2nd   |
1241 	  to be perceptible to CPU 2            +-------+       |       |
1242 	                                        :       :       +-------+
1248 	+-------+       :      :                :       :
1249 	|       |       +------+                +-------+
1250 	|       |------>| A=1  |------      --->| A->0  |
1251 	|       |       +------+      \         +-------+
1252 	| CPU 1 |   wwwwwwwwwwwwwwww   \    --->| B->9  |
1253 	|       |       +------+        |       +-------+
1254 	|       |------>| B=2  |---     |       :       :
1255 	|       |       +------+   \    |       :       :       +-------+
1256 	+-------+       :      :    \   |       +-------+       |       |
1257 	                             ---------->| B->2  |------>|       |
1258 	                                |       +-------+       | CPU 2 |
1261 	                                  \     +-------+       |       |
1262 	                                   ---->| A->1  |------>| 1st   |
1263 	                                        +-------+       |       |
1265 	                                        +-------+       |       |
1266 	                                        | A->1  |------>| 2nd   |
1267 	                                        +-------+       |       |
1268 	                                        :       :       +-------+
1277 ----------------------------------------
1281 other loads, and so do the load in advance - even though they haven't actually
1286 It may turn out that the CPU didn't actually need the value - perhaps because a
1287 branch circumvented the load - in which case it can discard the value or just
1301 	                                        :       :       +-------+
1302 	                                        +-------+       |       |
1303 	                                    --->| B->2  |------>|       |
1304 	                                        +-------+       | CPU 2 |
1306 	                                        +-------+       |       |
1307 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1308 	division speculates on the              +-------+   ~   |       |
1312 	Once the divisions are complete -->     :       :   ~-->|       |
1314 	LOAD with immediate effect              :       :       +-------+
1317 Placing a read barrier or an address-dependency barrier just before the second
1332 	                                        :       :       +-------+
1333 	                                        +-------+       |       |
1334 	                                    --->| B->2  |------>|       |
1335 	                                        +-------+       | CPU 2 |
1337 	                                        +-------+       |       |
1338 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1339 	division speculates on the              +-------+   ~   |       |
1346 	                                        :       :   ~-->|       |
1348 	                                        :       :       +-------+
1354 	                                        :       :       +-------+
1355 	                                        +-------+       |       |
1356 	                                    --->| B->2  |------>|       |
1357 	                                        +-------+       | CPU 2 |
1359 	                                        +-------+       |       |
1360 	The CPU being busy doing a --->     --->| A->0  |~~~~   |       |
1361 	division speculates on the              +-------+   ~   |       |
1367 	                                        +-------+       |       |
1368 	The speculation is discarded --->   --->| A->1  |------>|       |
1369 	and an updated value is                 +-------+       |       |
1370 	retrieved                               :       :       +-------+
1374 --------------------
1383 time to all -other- CPUs.  The remainder of this document discusses this
1402 Because CPU 3's load from X in some sense comes after CPU 2's load, it
1407 multicopy-atomic systems, CPU B's load must return either the same value
1417 able to compensate for non-multicopy atomicity.  For example, suppose
1428 This substitution allows non-multicopy atomicity to run rampant: in
1434 example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
1439 General barriers can compensate not only for non-multicopy atomicity,
1440 but can also generate additional ordering that can ensure that -all-
1441 CPUs will perceive the same order of -all- operations.  In contrast, a
1442 chain of release-acquire pairs does not provide this additional ordering,
1483 Furthermore, because of the release-acquire relationship between cpu0()
1489 However, the ordering provided by a release-acquire chain is local
1500 writes in order, CPUs not involved in the release-acquire chain might
1502 the weak memory-barrier instructions used to implement smp_load_acquire()
1505 store to u as happening -after- cpu1()'s load from v, even though
1511 -not- ensure that any particular value will be read.  Therefore, the
1536 ----------------
1543 This is a general barrier -- there are no read-read or write-write
1553      interrupt-handler code and the code that was interrupted.
1559 optimizations that, while perfectly safe in single-threaded code, can
1587      into the following code, which, although in some sense legitimate
1588      for single-threaded code, is almost certainly not what the developer
1609      single-threaded code, but can be fatal in concurrent code:
1627      single-threaded code, so you need to tell the compiler about cases
1641      This transformation is a win for single-threaded code because it
1660      the code into near-nonexistence.  (It will still load from the
1688      between process-level code and an interrupt handler:
1704      win for single-threaded code:
1765      In single-threaded code, this is not only safe, but also saves
1767      could cause some other CPU to see a spurious value of 42 -- even
1768      if variable 'a' was never zero -- when loading variable 'b'.
1777      damaging, but they can result in cache-line bouncing and thus in
1782      with a single memory-reference instruction, prevents "load tearing"
1785      16-bit store instructions with 7-bit immediate fields, the compiler
1786      might be tempted to use two 16-bit store-immediate instructions to
1787      implement the following 32-bit store:
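A minimal sketch contrasting a plain 32-bit store with a marked one.  It
assumes 'p' is a 32-bit variable and uses the constant from the upstream
text (0x00010002).  WRITE_ONCE() is a simplified volatile-cast stand-in that
assumes GNU C __typeof__, not the kernel's definition, and the two function
names are hypothetical.

```c
#include <stdint.h>

/* Simplified stand-in for the kernel macro (assumption, see above). */
#define WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

uint32_t p;

/* Plain store: the compiler is free to split this into narrower stores. */
void set_p_plain(void)
{
	p = 0x00010002;
}

/* Marked store: the volatile access asks for one store of the full width. */
void set_p_once(void)
{
	WRITE_ONCE(p, 0x00010002);
}
```

Both functions leave the same value in 'p' when run on one thread.  The
difference matters only to a concurrent reader, which could observe a
half-updated value if the plain store were torn.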
1794      This optimization can therefore be a win in single-threaded code.
1818      implement these three assignment statements as a pair of 32-bit
1819      loads followed by a pair of 32-bit stores.  This would result in
1839 -------------------
1851 All memory barriers except the address-dependency barriers imply a compiler
1865 systems because it is assumed that a CPU will appear to be self-consistent,
1876 windows.  These barriers are required even on non-SMP systems as they affect
1907 	obj->dead = 1;
1909 	atomic_dec(&obj->ref_count);
1923      DMA capable device. See the Documentation/core-api/dma-api.rst file for more
1931 	if (desc->status != DEVICE_OWN) {
1936 		read_data = desc->data;
1937 		desc->data = write_data;
1943 		desc->status = DEVICE_OWN;
1967      For example, after a non-temporal write to a pmem region, we use pmem_wmb()
1978      For memory accesses with write-combining attributes (e.g. those returned
1981      write-combining memory accesses before this macro with those after it when
1997 --------------------------
2044 one-way barriers is that the effects of instructions outside of a critical
2065 RELEASE may -not- be assumed to be a full memory barrier.
2090 	-could- occur.
2105 	a sleep-unlock race, but the locking primitive needs to resolve
2110 anything at all - especially with respect to I/O accesses - unless combined
2113 See also the section on "Inter-CPU acquiring barrier effects".
2143 -----------------------------
2151 SLEEP AND WAKE-UP FUNCTIONS
2152 ---------------------------
2177 	    STORE current->state
2220 	    STORE current->state	  ...
2222 	LOAD event_indicated		  if ((LOAD task->state) & TASK_NORMAL)
2223 					    STORE task->state
2268 order multiple stores before the wake-up with respect to loads of those stored
2304 -----------------------
2312 INTER-CPU ACQUIRING BARRIER EFFECTS
2321 ---------------------------
2354 be a problem as a single-threaded linear piece of code will still appear to
2368 --------------------------
2408 	LOAD waiter->list.next;
2409 	LOAD waiter->task;
2410 	STORE waiter->task;
2432 	LOAD waiter->task;
2433 	STORE waiter->task;
2441 	LOAD waiter->list.next;
2442 	--- OOPS ---
2449 	LOAD waiter->list.next;
2450 	LOAD waiter->task;
2452 	STORE waiter->task;
2462 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2469 -----------------
2480 -----------------
2489 efficient to reorder, combine or merge accesses - something that would cause
2493 routines - such as inb() or writel() - which know how to make such accesses
2499 See Documentation/driver-api/device-io.rst for more information.
2503 ----------
2509 This may be alleviated - at least in part - by disabling local interrupts (a
2511 the interrupt-disabled section in the driver.  While the driver's interrupt
2518 under interrupt-disablement and then the driver's interrupt handler is invoked:
2537 accesses performed in an interrupt - and vice versa - unless implicit or
2547 likely, then interrupt-disabling locks should be used to guarantee ordering.
2555 specific. Therefore, drivers which are inherently non-portable may rely on
2607 	The ordering properties of __iomem pointers obtained with non-default
2617 	bullets 2-5 above) but they are still guaranteed to be ordered with
2625 	register-based, memory-mapped FIFOs residing on peripherals that are not
2631 	The inX() and outX() accessors are intended to access legacy port-mapped
2642 	Device drivers may expect outX() to emit a non-posted write transaction
2660 little-endian and will therefore perform byte-swapping operations on big-endian
2668 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2672 of arch-specific code.
2675 stream in any order it feels like - or even in parallel - provided that if an
2681  [*] Some instructions have more than one effect - such as changing the
2682      condition codes, changing registers or changing memory - and different
2708 	    <--- CPU --->         :       <----------- Memory ----------->
2710 	+--------+    +--------+  :   +--------+    +-----------+
2711 	|        |    |        |  :   |        |    |           |    +--------+
2713 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2714 	|        |    | Queue  |  :   |        |    |           |--->| Memory |
2716 	+--------+    +--------+  :   +--------+    |           |    |        |
2717 	                          :                 | Cache     |    +--------+
2719 	                          :                 | Mechanism |    +--------+
2720 	+--------+    +--------+  :   +--------+    |           |    |        |
2722 	|  CPU   |    | Memory |  :   | CPU    |    |           |--->| Device |
2723 	|  Core  |--->| Access |----->| Cache  |<-->|           |    |        |
2725 	|        |    |        |  :   |        |    |           |    +--------+
2726 	+--------+    +--------+  :   +--------+    +-----------+
2757 ----------------------
2774 See Documentation/core-api/cachetlb.rst for more information on cache
2779 -----------------------
2835  (*) the CPU's data cache may affect the ordering, and while cache-coherency
2836      mechanisms may alleviate this - once the store has actually hit the cache
2837      - there's no guarantee that the coherency management will be propagated in
2848 However, it is guaranteed that a CPU will be self-consistent: it will see its
2875 are -not- optional in the above example, as there are architectures
2910 --------------------------
2914 two semantically-related cache lines updated at separate times.  This is where
2915 the address-dependency barrier really becomes necessary as this synchronises
2925 ----------------------
2930 barriers for this use-case would be possible but is often suboptimal.
2932 To handle this case optimally, low-level virt_mb() etc. macros are available.
2934 identical code for SMP and non-SMP systems.  For example, virtual machine guests
2948 ----------------
2953 	Documentation/core-api/circular-buffers.rst
2970 	Chapter 7.1: Memory-Access Ordering
2973 ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
2976 IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2991 	Chapter 15: Sparc-V9 Memory Models
3007 Solaris Internals, Core Kernel Architecture, p63-68: