Lines Matching +full:cpu +full:- +full:read

19 documentation at tools/memory-model/.  Nevertheless, even this memory
37 Note also that it is possible that a barrier may be a no-op for an
48 - Device operations.
49 - Guarantees.
53 - Varieties of memory barrier.
54 - What may not be assumed about memory barriers?
55 - Address-dependency barriers (historical).
56 - Control dependencies.
57 - SMP barrier pairing.
58 - Examples of memory barrier sequences.
59 - Read memory barriers vs load speculation.
60 - Multicopy atomicity.
64 - Compiler barrier.
65 - CPU memory barriers.
69 - Lock acquisition functions.
70 - Interrupt disabling functions.
71 - Sleep and wake-up functions.
72 - Miscellaneous functions.
74 (*) Inter-CPU acquiring barrier effects.
76 - Acquires vs memory accesses.
80 - Interprocessor interaction.
81 - Atomic operations.
82 - Accessing devices.
83 - Interrupts.
89 (*) The effects of the CPU cache.
91 - Cache coherency.
92 - Cache coherency vs DMA.
93 - Cache coherency vs MMIO.
97 - And then there's the Alpha.
98 - Virtual Machine Guests.
102 - Circular buffers.
116 +-------+ : +--------+ : +-------+
119 | CPU 1 |<----->| Memory |<----->| CPU 2 |
122 +-------+ : +--------+ : +-------+
127 | : +--------+ : |
130 +---------->| Device |<----------+
133 : +--------+ :
136 Each CPU executes a program that generates memory access operations. In the
137 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
144 CPU are perceived by the rest of the system as the operations cross the
145 interface between the CPU and rest of the system (the dotted lines).
150 CPU 1 CPU 2
159 STORE A=3, STORE B=4, y=LOAD A->3, x=LOAD B->4
160 STORE A=3, STORE B=4, x=LOAD B->4, y=LOAD A->3
161 STORE A=3, y=LOAD A->3, STORE B=4, x=LOAD B->4
162 STORE A=3, y=LOAD A->3, x=LOAD B->2, STORE B=4
163 STORE A=3, x=LOAD B->2, STORE B=4, y=LOAD A->3
164 STORE A=3, x=LOAD B->2, y=LOAD A->3, STORE B=4
165 STORE B=4, STORE A=3, y=LOAD A->3, x=LOAD B->4
177 Furthermore, the stores committed by a CPU to the memory system may not be
178 perceived by the loads made by another CPU in the same order as the stores were
184 CPU 1 CPU 2
191 on the address retrieved from P by CPU 2. At the end of the sequence, any of
198 Note that CPU 2 will never try to load C into D because the CPU will load P
203 -----------------
209 port register (D). To read internal register 5, the following code might then
221 the address _after_ attempting to read the register.
225 ----------
227 There are some minimal guarantees that may be expected of a CPU:
229 (*) On any given CPU, dependent memory accesses will be issued in order, with
234 the CPU will issue the following memory operations:
239 emits a memory-barrier instruction, so that a DEC Alpha CPU will
247 (*) Overlapping loads and stores within a particular CPU will appear to be
248 ordered within that CPU. This means that for:
252 the CPU will only issue the following sequence of memory operations:
260 the CPU will only issue:
310 And there are anti-guarantees:
313 generate code to modify these using non-atomic read-modify-write
320 non-atomic read-modify-write sequences can cause an update to one
327 "char", two-byte alignment for "short", four-byte alignment for
328 "int", and either four-byte or eight-byte alignment for "long",
329 on 32-bit and 64-bit systems, respectively. Note that these
331 using older pre-C11 compilers (for example, gcc 4.6). The portion
337 of adjacent bit-fields all having nonzero width
343 NOTE 2: A bit-field and an adjacent non-bit-field member
345 to two bit-fields, if one is declared inside a nested
347 are separated by a zero-length bit-field declaration,
348 or if they are separated by a non-bit-field member
350 bit-fields in the same structure if all members declared
351 between them are also bit-fields, no matter what the
352 sizes of those intervening bit-fields happen to be.
360 in random order, but this can be a problem for CPU-CPU interaction and for I/O.
362 CPU to restrict the order.
376 ---------------------------
390 A CPU can be viewed as committing a sequence of store operations to the
394 [!] Note that write barriers should normally be paired with read or
395 address-dependency barriers; see the "SMP barrier pairing" subsection.
398 (2) Address-dependency barriers (historical).
399 [!] This section is marked as HISTORICAL: it covers the long-obsolete
401 implicit in all marked accesses. For more up-to-date information,
405 An address-dependency barrier is a weaker form of read barrier. In the
408 the second load will be directed), an address-dependency barrier would
412 An address-dependency barrier is a partial ordering on interdependent
417 committing sequences of stores to the memory system that the CPU being
418 considered can then perceive. An address-dependency barrier issued by
419 the CPU under consideration guarantees that for any load preceding it,
420 if that load touches one of a sequence of stores from another CPU, then
423 the address-dependency barrier.
432 a full read barrier or better is required. See the "Control dependencies"
435 [!] Note that address-dependency barriers should normally be paired with
438 [!] Kernel release v5.9 removed kernel APIs for explicit address-
441 address-dependency barriers.
443 (3) Read (or load) memory barriers.
445 A read barrier is an address-dependency barrier plus a guarantee that all
450 A read barrier is a partial ordering on loads only; it is not required to
453 Read memory barriers imply address-dependency barriers, and so can
456 [!] Note that read barriers should normally be paired with write barriers;
469 General memory barriers imply both read and write memory barriers, and so
477 This acts as a one-way permeable barrier. It guarantees that all memory
492 This also acts as a one-way permeable barrier. It guarantees that all
503 -not- guaranteed to act as a full memory barrier. However, after an
514 RELEASE variants in addition to fully-ordered and relaxed (no barrier
520 between two CPUs or between a CPU and a device. If it can be guaranteed that
531 ----------------------------------------------
537 instruction; the barrier can be considered to draw a line in that CPU's
540 (*) There is no guarantee that issuing a memory barrier on one CPU will have
541 any direct effect on another CPU or any other hardware in the system. The
542 indirect effect will be the order in which the second CPU sees the effects
543 of the first CPU's accesses occur, but see the next point:
545 (*) There is no guarantee that a CPU will see the correct order of effects
546 from a second CPU's accesses, even _if_ the second CPU uses a memory
547 barrier, unless the first CPU _also_ uses a matching memory barrier (see
550 (*) There is no guarantee that some intervening piece of off-the-CPU
551 hardware[*] will not reorder the memory accesses. CPU cache coherency
555 [*] For information on bus mastering DMA and coherency please read:
557 Documentation/driver-api/pci/pci.rst
558 Documentation/core-api/dma-api-howto.rst
559 Documentation/core-api/dma-api.rst
562 ADDRESS-DEPENDENCY BARRIERS (HISTORICAL)
563 ----------------------------------------
564 [!] This section is marked as HISTORICAL: it covers the long-obsolete
566 in all marked accesses. For more up-to-date information, including
572 to this section are those working on DEC Alpha architecture-specific code
575 address-dependency barriers.
577 [!] While address dependencies are observed in both load-to-load and
578 load-to-store relations, address-dependency barriers are not necessary
579 for load-to-store situations.
581 The requirement of address-dependency barriers is a little subtle, and
585 CPU 1 CPU 2
594 [!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernel, which
595 doesn't imply an address-dependency barrier.
603 But! CPU 2's perception of P may be updated _before_ its perception of B, thus
612 To deal with this, READ_ONCE() provides an implicit address-dependency barrier
615 CPU 1 CPU 2
622 <implicit address-dependency barrier>
631 even-numbered cache lines and the other bank processes odd-numbered cache
632 lines. The pointer P might be stored in an odd-numbered cache line, and the
633 variable B might be stored in an even-numbered cache line. Then, if the
634 even-numbered bank of the reading CPU's cache is extremely busy while the
635 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
639 An address-dependency barrier is not required to order dependent writes
643 But please carefully read the "CONTROL DEPENDENCIES" section and the
647 CPU 1 CPU 2
656 Therefore, no address-dependency barrier is required to order the read into
658 even without an implicit address-dependency barrier of modern READ_ONCE():
663 of dependency ordering is to -prevent- writes to the data structure, along
670 the CPU containing it. See the section on "Multicopy atomicity" for
674 The address-dependency barrier is very important to the RCU system,
684 --------------------
690 A load-load control dependency requires a full read memory barrier, not
691 simply an (implicit) address-dependency barrier to make it work correctly.
695 <implicit address-dependency barrier>
702 dependency, but rather a control dependency that the CPU may short-circuit
709 <read barrier>
713 However, stores are not speculated. This means that ordering -is- provided
714 for load-store control dependencies, as in the following example:
729 variable 'a' is always non-zero, it would be well within its rights
734 b = 1; /* BUG: Compiler and CPU can both reorder!!! */
759 /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
762 /* WRITE_ONCE(b, 1); -- moved up, BUG!!! */
767 'b', which means that the CPU is within its rights to reorder them:
782 In contrast, without explicit memory barriers, two-legged-if control
818 Given this transformation, the CPU is not required to respect the ordering
839 You must also be careful not to rely too much on boolean short-circuit
854 out-guess your code. More generally, although READ_ONCE() does force
858 In addition, control dependencies apply only to the then-clause and
859 else-clause of the if-statement in question. In particular, it does
860 not necessarily apply to code following the if-statement:
868 WRITE_ONCE(c, 1); /* BUG: No ordering against the read from 'a'. */
874 conditional-move instructions, as in this fanciful pseudo-assembly
884 A weakly ordered CPU would have no dependency of any sort between the load
887 In short, control dependencies apply only to the stores in the then-clause
888 and else-clause of the if-statement in question (including functions
889 invoked by those two clauses), not to code following that if-statement.
893 to the CPU containing it. See the section on "Multicopy atomicity"
900 However, they do -not- guarantee any other sort of ordering:
909 to carry out the stores. Please note that it is -not- sufficient
915 (*) Control dependencies require at least one run-time conditional
927 (*) Control dependencies apply only to the then-clause and else-clause
928 of the if-statement containing the control dependency, including
930 do -not- apply to code following the if-statement containing the
935 (*) Control dependencies do -not- provide multicopy atomicity. If you
943 -------------------
945 When dealing with CPU-CPU interactions, certain types of memory barrier should
952 with an address-dependency barrier, a control dependency, an acquire barrier,
953 a release barrier, a read barrier, or a general barrier. Similarly a
954 read barrier, control dependency, or an address-dependency barrier pairs
958 CPU 1 CPU 2
963 <read barrier>
968 CPU 1 CPU 2
973 <implicit address-dependency barrier>
978 CPU 1 CPU 2
989 Basically, the read barrier always has to be there, even though it can be of
993 match the loads after the read barrier or the address-dependency barrier, and
996 CPU 1 CPU 2
998 WRITE_ONCE(a, 1); }---- --->{ v = READ_ONCE(c);
1000 <write barrier> \ <read barrier>
1002 WRITE_ONCE(d, 4); }---- --->{ y = READ_ONCE(b);
1006 ------------------------------------
1011 CPU 1
1025 +-------+ : :
1026 | | +------+
1027 | |------>| C=3 | } /\
1028 | | : +------+ }----- \ -----> Events perceptible to
1030 | | : +------+ }
1031 | CPU 1 | : | B=2 | }
1032 | | +------+ }
1033 | | wwwwwwwwwwwwwwww } <--- At this point the write barrier
1034 | | +------+ } requires all stores prior to the
1036 | | : +------+ } further stores may take place
1037 | |------>| D=4 | }
1038 | | +------+
1039 +-------+ : :
1042 | memory system by CPU 1
1046 Secondly, address-dependency barriers act as partial orderings on address-
1049 CPU 1 CPU 2
1059 Without intervention, CPU 2 may perceive the events on CPU 1 in some
1060 effectively random order, despite the write barrier issued by CPU 1:
1062 +-------+ : : : :
1063 | | +------+ +-------+ | Sequence of update
1064 | |------>| B=2 |----- --->| Y->8 | | of perception on
1065 | | : +------+ \ +-------+ | CPU 2
1066 | CPU 1 | : | A=1 | \ --->| C->&Y | V
1067 | | +------+ | +-------+
1069 | | +------+ | : :
1070 | | : | C=&B |--- | : : +-------+
1071 | | : +------+ \ | +-------+ | |
1072 | |------>| D=4 | ----------->| C->&B |------>| |
1073 | | +------+ | +-------+ | |
1074 +-------+ : : | : : | |
1076 | : : | CPU 2 |
1077 | +-------+ | |
1078 Apparently incorrect ---> | | B->7 |------>| |
1079 perception of B (!) | +-------+ | |
1081 | +-------+ | |
1082 The load of X holds ---> \ | X->9 |------>| |
1083 up the maintenance \ +-------+ | |
1084 of coherence of B ----->| B->2 | +-------+
1085 +-------+
1089 In the above example, CPU 2 perceives that B is 7, despite the load of *C
1092 If, however, an address-dependency barrier were to be placed between the load
1093 of C and the load of *C (ie: B) on CPU 2:
1095 CPU 1 CPU 2
1103 <address-dependency barrier>
1108 +-------+ : : : :
1109 | | +------+ +-------+
1110 | |------>| B=2 |----- --->| Y->8 |
1111 | | : +------+ \ +-------+
1112 | CPU 1 | : | A=1 | \ --->| C->&Y |
1113 | | +------+ | +-------+
1115 | | +------+ | : :
1116 | | : | C=&B |--- | : : +-------+
1117 | | : +------+ \ | +-------+ | |
1118 | |------>| D=4 | ----------->| C->&B |------>| |
1119 | | +------+ | +-------+ | |
1120 +-------+ : : | : : | |
1122 | : : | CPU 2 |
1123 | +-------+ | |
1124 | | X->9 |------>| |
1125 | +-------+ | |
1126 Makes sure all effects ---> \ aaaaaaaaaaaaaaaaa | |
1127 prior to the store of C \ +-------+ | |
1128 are perceptible to ----->| B->2 |------>| |
1129 subsequent loads +-------+ | |
1130 : : +-------+
1133 And thirdly, a read barrier acts as a partial order on loads. Consider the
1136 CPU 1 CPU 2
1145 Without intervention, CPU 2 may then choose to perceive the events on CPU 1 in
1146 some effectively random order, despite the write barrier issued by CPU 1:
1148 +-------+ : : : :
1149 | | +------+ +-------+
1150 | |------>| A=1 |------ --->| A->0 |
1151 | | +------+ \ +-------+
1152 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1153 | | +------+ | +-------+
1154 | |------>| B=2 |--- | : :
1155 | | +------+ \ | : : +-------+
1156 +-------+ : : \ | +-------+ | |
1157 ---------->| B->2 |------>| |
1158 | +-------+ | CPU 2 |
1159 | | A->0 |------>| |
1160 | +-------+ | |
1161 | : : +-------+
1163 \ +-------+
1164 ---->| A->1 |
1165 +-------+
1169 If, however, a read barrier were to be placed between the load of B and the
1170 load of A on CPU 2:
1172 CPU 1 CPU 2
1179 <read barrier>
1182 then the partial ordering imposed by CPU 1 will be perceived correctly by CPU
1185 +-------+ : : : :
1186 | | +------+ +-------+
1187 | |------>| A=1 |------ --->| A->0 |
1188 | | +------+ \ +-------+
1189 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1190 | | +------+ | +-------+
1191 | |------>| B=2 |--- | : :
1192 | | +------+ \ | : : +-------+
1193 +-------+ : : \ | +-------+ | |
1194 ---------->| B->2 |------>| |
1195 | +-------+ | CPU 2 |
1198 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
1199 barrier causes all effects \ +-------+ | |
1200 prior to the storage of B ---->| A->1 |------>| |
1201 to be perceptible to CPU 2 +-------+ | |
1202 : : +-------+
1206 contained a load of A either side of the read barrier:
1208 CPU 1 CPU 2
1216 <read barrier>
1222 +-------+ : : : :
1223 | | +------+ +-------+
1224 | |------>| A=1 |------ --->| A->0 |
1225 | | +------+ \ +-------+
1226 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1227 | | +------+ | +-------+
1228 | |------>| B=2 |--- | : :
1229 | | +------+ \ | : : +-------+
1230 +-------+ : : \ | +-------+ | |
1231 ---------->| B->2 |------>| |
1232 | +-------+ | CPU 2 |
1235 | +-------+ | |
1236 | | A->0 |------>| 1st |
1237 | +-------+ | |
1238 At this point the read ----> \ rrrrrrrrrrrrrrrrr | |
1239 barrier causes all effects \ +-------+ | |
1240 prior to the storage of B ---->| A->1 |------>| 2nd |
1241 to be perceptible to CPU 2 +-------+ | |
1242 : : +-------+
1245 But it may be that the update to A from CPU 1 becomes perceptible to CPU 2
1246 before the read barrier completes anyway:
1248 +-------+ : : : :
1249 | | +------+ +-------+
1250 | |------>| A=1 |------ --->| A->0 |
1251 | | +------+ \ +-------+
1252 | CPU 1 | wwwwwwwwwwwwwwww \ --->| B->9 |
1253 | | +------+ | +-------+
1254 | |------>| B=2 |--- | : :
1255 | | +------+ \ | : : +-------+
1256 +-------+ : : \ | +-------+ | |
1257 ---------->| B->2 |------>| |
1258 | +-------+ | CPU 2 |
1261 \ +-------+ | |
1262 ---->| A->1 |------>| 1st |
1263 +-------+ | |
1265 +-------+ | |
1266 | A->1 |------>| 2nd |
1267 +-------+ | |
1268 : : +-------+
1276 READ MEMORY BARRIERS VS LOAD SPECULATION
1277 ----------------------------------------
1281 other loads, and so do the load in advance - even though they haven't actually
1283 actual load instruction to potentially complete immediately because the CPU
1286 It may turn out that the CPU didn't actually need the value - perhaps because a
1287 branch circumvented the load - in which case it can discard the value or just
1292 CPU 1 CPU 2
1301 : : +-------+
1302 +-------+ | |
1303 --->| B->2 |------>| |
1304 +-------+ | CPU 2 |
1306 +-------+ | |
1307 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1308 division speculates on the +-------+ ~ | |
1312 Once the divisions are complete --> : : ~-->| |
1313 the CPU can then perform the : : | |
1314 LOAD with immediate effect : : +-------+
1317 Placing a read barrier or an address-dependency barrier just before the second
1320 CPU 1 CPU 2
1325 <read barrier>
1332 : : +-------+
1333 +-------+ | |
1334 --->| B->2 |------>| |
1335 +-------+ | CPU 2 |
1337 +-------+ | |
1338 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1339 division speculates on the +-------+ ~ | |
1346 : : ~-->| |
1348 : : +-------+
1351 but if there was an update or an invalidation from another CPU pending, then
1354 : : +-------+
1355 +-------+ | |
1356 --->| B->2 |------>| |
1357 +-------+ | CPU 2 |
1359 +-------+ | |
1360 The CPU being busy doing a ---> --->| A->0 |~~~~ | |
1361 division speculates on the +-------+ ~ | |
1367 +-------+ | |
1368 The speculation is discarded ---> --->| A->1 |------>| |
1369 and an updated value is +-------+ | |
1370 retrieved : : +-------+
1374 --------------------
1383 time to all -other- CPUs. The remainder of this document discusses this
1388 CPU 1 CPU 2 CPU 3
1392 <general barrier> <read barrier>
1395 Suppose that CPU 2's load from X returns 1, which it then stores to Y,
1396 and CPU 3's load from Y returns 1. This indicates that CPU 1's store
1397 to X precedes CPU 2's load from X and that CPU 2's store to Y precedes
1398 CPU 3's load from Y. In addition, the memory barriers guarantee that
1399 CPU 2 executes its load before its store, and CPU 3 loads from Y before
1400 it loads from X. The question is then "Can CPU 3's load from X return 0?"
1402 Because CPU 3's load from X in some sense comes after CPU 2's load, it
1403 is natural to expect that CPU 3's load from X must therefore return 1.
1405 on CPU B follows a load from the same variable executing on CPU A (and
1406 CPU A did not originally store the value which it read), then on
1407 multicopy-atomic systems, CPU B's load must return either the same value
1408 that CPU A's load did or some later value. However, the Linux kernel
1412 for any lack of multicopy atomicity. In the example, if CPU 2's load
1413 from X returns 1 and CPU 3's load from Y returns 1, then CPU 3's load
1416 However, dependencies, read barriers, and write barriers are not always
1417 able to compensate for non-multicopy atomicity. For example, suppose
1418 that CPU 2's general barrier is removed from the above example, leaving
1421 CPU 1 CPU 2 CPU 3
1425 <data dependency> <read barrier>
1428 This substitution allows non-multicopy atomicity to run rampant: in
1429 this example, it is perfectly legal for CPU 2's load from X to return 1,
1430 CPU 3's load from Y to return 1, and its load from X to return 0.
1432 The key point is that although CPU 2's data dependency orders its load
1433 and store, it does not guarantee to order CPU 1's store. Thus, if this
1434 example runs on a non-multicopy-atomic system where CPUs 1 and 2 share a
1435 store buffer or a level of cache, CPU 2 might have early access to CPU 1's
1439 General barriers can compensate not only for non-multicopy atomicity,
1440 but can also generate additional ordering that can ensure that -all-
1441 CPUs will perceive the same order of -all- operations. In contrast, a
1442 chain of release-acquire pairs does not provide this additional ordering,
1483 Furthermore, because of the release-acquire relationship between cpu0()
1489 However, the ordering provided by a release-acquire chain is local
1500 writes in order, CPUs not involved in the release-acquire chain might
1502 the weak memory-barrier instructions used to implement smp_load_acquire()
1505 store to u as happening -after- cpu1()'s load from v, even though
1511 -not- ensure that any particular value will be read. Therefore, the
1532 (*) CPU memory barriers.
1536 ----------------
1543 This is a general barrier -- there are no read-read or write-write
1553 interrupt-handler code and the code that was interrupted.
1559 optimizations that, while perfectly safe in single-threaded code, can
1564 to the same variable, and in some cases, the CPU is within its
1572 Prevent both the compiler and the CPU from doing this as follows:
1588 for single-threaded code, is almost certainly not what the developer
1609 single-threaded code, but can be fatal in concurrent code:
1616 a was modified by some other CPU between the "while" statement and
1627 single-threaded code, so you need to tell the compiler about cases
1641 This transformation is a win for single-threaded code because it
1643 will carry out its proof assuming that the current CPU is the only
1660 the code into near-nonexistence. (It will still load from the
1665 Again, the compiler assumes that the current CPU is the only one
1676 surprise if some other CPU might have stored to variable 'a' in the
1688 between process-level code and an interrupt handler:
1704 win for single-threaded code:
1749 though the CPU of course need not do so.
1765 In single-threaded code, this is not only safe, but also saves
1767 could cause some other CPU to see a spurious value of 42 -- even
1768 if variable 'a' was never zero -- when loading variable 'b'.
1777 damaging, but they can result in cache-line bouncing and thus in
1782 with a single memory-reference instruction, prevents "load tearing"
1785 16-bit store instructions with 7-bit immediate fields, the compiler
1786 might be tempted to use two 16-bit store-immediate instructions to
1787 implement the following 32-bit store:
1794 This optimization can therefore be a win in single-threaded code.
1818 implement these three assignment statements as a pair of 32-bit
1819 loads followed by a pair of 32-bit stores. This would result in
1834 Please note that these compiler barriers have no direct effect on the CPU,
1838 CPU MEMORY BARRIERS
1839 -------------------
1841 The Linux kernel has seven basic CPU memory barriers:
1847 READ rmb() smp_rmb()
1851 All memory barriers except the address-dependency barriers imply a compiler
1865 systems because it is assumed that a CPU will appear to be self-consistent,
1876 windows. These barriers are required even on non-SMP systems as they affect
1878 compiler and the CPU from reordering them.
1907 obj->dead = 1;
1909 atomic_dec(&obj->ref_count);
1922 of writes or reads of shared memory accessible to both the CPU and a
1923 DMA capable device. See Documentation/core-api/dma-api.rst file for more
1928 to the device or the CPU, and a doorbell to notify it when new
1931 if (desc->status != DEVICE_OWN) {
1932 /* do not read data until we own descriptor */
1933 dma_rmb();
1935 /* read/modify data */
1936 read_data = desc->data;
1937 desc->data = write_data;
1940 dma_wmb();
1943 desc->status = DEVICE_OWN;
1952 before we read the data from the descriptor, and the dma_wmb() allows
1967 For example, after a non-temporal write to pmem region, we use pmem_wmb()
1973 For load from persistent memory, existing read memory barriers are sufficient
1974 to ensure read ordering.
1978 For memory accesses with write-combining attributes (e.g. those returned
1979 by ioremap_wc()), the CPU may wait for prior accesses to be merged with
1981 write-combining memory accesses before this macro with those after it when
1997 --------------------------
2044 one-way barriers is that the effects of instructions outside of a critical
2064 another CPU not holding that lock. In short, an ACQUIRE followed by a
2065 RELEASE may -not- be assumed to be a full memory barrier.
2068 not imply a full memory barrier. Therefore, the CPU's execution of the
2087 One key point is that we are only talking about the CPU doing
2090 -could- occur.
2092 But suppose the CPU reordered the operations. In this case,
2093 the unlock precedes the lock in the assembly code. The CPU
2096 try to sleep, but more on that later). The CPU will eventually
2105 a sleep-unlock race, but the locking primitive needs to resolve
2110 anything at all - especially with respect to I/O accesses - unless combined
2113 See also the section on "Inter-CPU acquiring barrier effects".
2143 -----------------------------
2151 SLEEP AND WAKE-UP FUNCTIONS
2152 ---------------------------
2173 CPU 1
2177 STORE current->state
2216 CPU 1 (Sleeper) CPU 2 (Waker)
2220 STORE current->state ...
2222 LOAD event_indicated if ((LOAD task->state) & TASK_NORMAL)
2223 STORE task->state
2225 where "task" is the thread being woken up and it equals CPU 1's "current".
2232 CPU 1 CPU 2
2268 order multiple stores before the wake-up with respect to loads of those stored
2304 -----------------------
2312 INTER-CPU ACQUIRING BARRIER EFFECTS
2321 ---------------------------
2326 CPU 1 CPU 2
2335 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
2354 be a problem as a single-threaded linear piece of code will still appear to
2368 --------------------------
2370 When there's a system with more than one processor, more than one CPU in the
2395 (1) read the next pointer from this waiter's record to know as to where the
2398 (2) read the pointer to the waiter's task structure;
2408 LOAD waiter->list.next;
2409 LOAD waiter->task;
2410 STORE waiter->task;
2420 if the task pointer is cleared _before_ the next pointer in the list is read,
2421 another CPU might start processing the waiter and might clobber the waiter's
2422 stack before the up*() function has a chance to read the next pointer.
2426 CPU 1 CPU 2
2432 LOAD waiter->task;
2433 STORE waiter->task;
2441 LOAD waiter->list.next;
2442 --- OOPS ---
2449 LOAD waiter->list.next;
2450 LOAD waiter->task;
2451 smp_mb();
2452 STORE waiter->task;
2462 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2464 right order without actually intervening in the CPU. Since there's only one
2465 CPU, that CPU's dependency ordering logic will take care of everything else.
2469 -----------------
2480 -----------------
2482 Many devices can be memory mapped, and so appear to the CPU as if they're just
2486 However, having a clever CPU or a clever compiler creates a potential problem
2488 device in the requisite order if the CPU or the compiler thinks it is more
2489 efficient to reorder, combine or merge accesses - something that would cause
2493 routines - such as inb() or writel() - which know how to make such accesses
2499 See Documentation/driver-api/device-io.rst for more information.
2503 ----------
2509 This may be alleviated - at least in part - by disabling local interrupts (a
2511 the interrupt-disabled section in the driver. While the driver's interrupt
2512 routine is executing, the driver's core may not run on the same CPU, and its
2518 under interrupt-disablement and then the driver's interrupt handler is invoked:
2537 accesses performed in an interrupt - and vice versa - unless implicit or
2547 likely, then interrupt-disabling locks should be used to guarantee ordering.
2555 specific. Therefore, drivers which are inherently non-portable may rely on
2571 by the same CPU thread to a particular device will arrive in program
2574 2. A writeX() issued by a CPU thread holding a spinlock is ordered
2575 before a writeX() to the same peripheral from another CPU thread
2581 3. A writeX() by a CPU thread to the peripheral will first wait for the
2583 propagated to, the same thread. This ensures that writes by the CPU
2585 visible to a DMA engine when the CPU writes to its MMIO control
2588 4. A readX() by a CPU thread from the peripheral will complete before
2590 ensures that reads by the CPU from an incoming DMA buffer allocated
2595 5. A readX() by a CPU thread from the peripheral will complete before
2597 This ensures that two MMIO register writes by the CPU to a peripheral
2598 will arrive at least 1us apart if the first write is immediately read
2607 The ordering properties of __iomem pointers obtained with non-default
2617 bullets 2-5 above) but they are still guaranteed to be ordered with
2618 respect to other accesses from the same CPU thread to the same
2625 register-based, memory-mapped FIFOs residing on peripherals that are not
2631 The inX() and outX() accessors are intended to access legacy port-mapped
2636 Since many CPU architectures ultimately access these peripherals via an
2642 Device drivers may expect outX() to emit a non-posted write transaction
2660 little-endian and will therefore perform byte-swapping operations on big-endian
2668 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2672 of arch-specific code.
2674 This means that it must be considered that the CPU will execute its instruction
2675 stream in any order it feels like - or even in parallel - provided that if an
2681 [*] Some instructions have more than one effect - such as changing the
2682 condition codes, changing registers or changing memory - and different
2685 A CPU may also discard any instruction sequence that winds up having no
2696 THE EFFECTS OF THE CPU CACHE
2703 As far as the way a CPU interacts with another part of the system through the
2704 caches goes, the memory system has to include the CPU's caches, and memory
2705 barriers for the most part act at the interface between the CPU and its cache
2708 <--- CPU ---> : <----------- Memory ----------->
2710 +--------+ +--------+ : +--------+ +-----------+
2711 | | | | : | | | | +--------+
2712 | CPU | | Memory | : | CPU | | | | |
2713 | Core |--->| Access |----->| Cache |<-->| | | |
2714 | | | Queue | : | | | |--->| Memory |
2716 +--------+ +--------+ : +--------+ | | | |
2717 : | Cache | +--------+
2719 : | Mechanism | +--------+
2720 +--------+ +--------+ : +--------+ | | | |
2722 | CPU | | Memory | : | CPU | | |--->| Device |
2723 | Core |--->| Access |----->| Cache |<-->| | | |
2725 | | | | : | | | | +--------+
2726 +--------+ +--------+ : +--------+ +-----------+
2731 CPU that issued it since it may have been satisfied within the CPU's own cache,
2734 cacheline over to the accessing CPU and propagate the effects upon conflict.
2736 The CPU core may execute instructions in any order it deems fit, provided the
2744 accesses cross from the CPU side of things to the memory side of things, and
2748 [!] Memory barriers are _not_ needed within a given CPU, as CPUs always see
2753 the use of any special device communication instructions the CPU may have.
2757 ----------------------
2763 the kernel must flush the overlapping bits of cache on each CPU (and maybe
2767 cache lines being written back to RAM from a CPU's cache after the device has
2768 installed its own data, or cache lines present in the CPU's cache may simply
2770 is discarded from the CPU's cache and reloaded. To deal with this, the
2772 cache on each CPU.
2774 See Documentation/core-api/cachetlb.rst for more information on cache
2779 -----------------------
2782 a window in the CPU's memory space that has different properties assigned than
2797 A programmer might take it for granted that the CPU will perform memory
2798 operations in exactly the order specified, so that if the CPU is, for example,
2807 they would then expect that the CPU will complete the memory operation for each
2828 of the CPU buses and caches;
2835 (*) the CPU's data cache may affect the ordering, and while cache-coherency
2836 mechanisms may alleviate this - once the store has actually hit the cache
2837 - there's no guarantee that the coherency management will be propagated in
2840 So what another CPU, say, might actually observe from the above piece of code
2848 However, it is guaranteed that a CPU will be self-consistent: it will see its
2867 The code above may cause the CPU to generate the full sequence of memory
2875 are -not- optional in the above example, as there are architectures
2876 where a given CPU might reorder successive loads to the same location.
2883 the CPU even sees them.
2906 and the LOAD operation never appear outside of the CPU.
2910 --------------------------
2912 The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that,
2913 some versions of the Alpha CPU have a split data cache, permitting them to have
2914 two semantically-related cache lines updated at separate times. This is where
2915 the address-dependency barrier really becomes necessary as this synchronises
2925 ----------------------
2930 barriers for this use-case would be possible but is often suboptimal.
2932 To handle this case optimally, low-level virt_mb() etc macros are available.
2934 identical code for SMP and non-SMP systems. For example, virtual machine guests
2948 ----------------
2953 Documentation/core-api/circular-buffers.rst
2967 Chapter 5.6: Read/Write Ordering
2970 Chapter 7.1: Memory-Access Ordering
2973 ARM Architecture Reference Manual (ARMv8, for ARMv8-A architecture profile)
2976 IA-32 Intel Architecture Software Developer's Manual, Volume 3:
2991 Chapter 15: Sparc-V9 Memory Models
3007 Solaris Internals, Core Kernel Architecture, p63-68: