Lines Matching +refs:is +refs:pre +refs:merge
14 This document is not a specification; it is intentionally (for the sake of
15 brevity) and unintentionally (due to being human) incomplete. This document is
23 To repeat, this document is not a specification of what Linux expects from
26 The purpose of this document is twofold:
35 that, that architecture is incorrect.
37 Note also that it is possible that a barrier may be a no-op for an
137 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
190 There is an obvious address dependency here, as the value loaded into D depends
206 locations, but the order in which the control registers are accessed is very
271 WRITE_ONCE(). Without them, the compiler is within its rights to
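     A quick illustrative sketch of the usage being described (the variable
     names here are hypothetical, not from the surrounding example):

	/* Marked accesses: the compiler may not tear, fuse or re-issue them. */
	WRITE_ONCE(shared_flag, 1);
	tmp = READ_ONCE(shared_flag);

     With plain accesses instead, the compiler could legally tear, merge,
     or reorder them relative to other plain accesses.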
331 using older pre-C11 compilers (for example, gcc 4.6). The portion
332 of the standard containing this guarantee is Section 3.14, which
345 to two bit-fields, if one is declared inside a nested
346 structure declaration and the other is not, or if the two
349 declaration. It is not safe to concurrently update two
361 What is required is some way of intervening to instruct the compiler and the
367 Such enforcement is important because the CPUs and other devices in a system
387 A write barrier is a partial ordering on stores only; it is not required
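     A minimal smp_wmb() sketch (names hypothetical): the barrier partially
     orders the two stores as seen by other CPUs, but constrains no loads:

	WRITE_ONCE(data, 42);	/* fill in the payload...		*/
	smp_wmb();		/* ...and only then publish the flag	*/
	WRITE_ONCE(ready, 1);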
399 [!] This section is marked as HISTORICAL: it covers the long-obsolete
405 An address-dependency barrier is a weaker form of read barrier. In the
409 be required to make sure that the target of the second load is updated
410 after the address obtained by the first load is accessed.
412 An address-dependency barrier is a partial ordering on interdependent
413 loads only; it is not required to have any effect on stores, independent
429 not a control dependency. If the address for the second load is dependent
430 on the first load, but the dependency is through a conditional rather than
432 a full read barrier or better is required. See the "Control dependencies"
445 A read barrier is an address-dependency barrier plus a guarantee that all
450 A read barrier is a partial ordering on loads only; it is not required to
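     The matching reader-side sketch for the smp_wmb() example above (again
     with hypothetical names):

	while (!READ_ONCE(ready))
		cpu_relax();
	smp_rmb();		/* order the flag load before the data load */
	val = READ_ONCE(data);	/* guaranteed to observe 42 */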
467 A general memory barrier is a partial ordering over both loads and stores.
502 for other sorts of memory barrier. In addition, a RELEASE+ACQUIRE pair is
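     A sketch of such a pair (hypothetical names; the acquiring CPU is
     guaranteed to see all stores made before the release):

	/* CPU 1 */				/* CPU 2 */
	WRITE_ONCE(data, 1);
	smp_store_release(&flag, 1);		while (!smp_load_acquire(&flag))
							cpu_relax();
						r1 = READ_ONCE(data); /* == 1 */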
535 (*) There is no guarantee that any of the memory accesses specified before a
540 (*) There is no guarantee that issuing a memory barrier on one CPU will have
545 (*) There is no guarantee that a CPU will see the correct order of effects
550 (*) There is no guarantee that some intervening piece of off-the-CPU
564 [!] This section is marked as HISTORICAL: it covers the long-obsolete
574 those who are interested in the history, here is the story of
581 The requirement of address-dependency barriers is a little subtle, and
594 [!] READ_ONCE_OLD() corresponds to READ_ONCE() of pre-4.15 kernels, which
634 even-numbered bank of the reading CPU's cache is extremely busy while the
635 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
639 An address-dependency barrier is not required to order dependent writes
656 Therefore, no address-dependency barrier is required to order the read into
657 Q with the store into *Q. In other words, this outcome is prohibited,
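     The scenario, reconstructed in the document's usual two-CPU form:

	CPU 1			CPU 2
	===============		===============
	{ A == 1, B == 2, P == &A }
	B = 4;
	<write barrier>
	WRITE_ONCE(P, &B);
				Q = READ_ONCE(P);
				WRITE_ONCE(*Q, 5);

     The prohibited outcome is (Q == &B) && (B == 4): once Q is observed to
     be &B, the dependent store of 5 must be ordered after the load that
     produced Q.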
663 of dependency ordering is to -prevent- writes to the data structure, along
669 Note well that the ordering provided by an address dependency is local to
674 The address-dependency barrier is very important to the RCU system,
687 not understand them. The purpose of this section is to help you prevent
701 This will not have the desired effect because there is no actual address
705 what's actually required is:
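     Reconstructed in the document's notation ('a' and 'b' are its usual
     stand-in variables), the required form places an explicit read barrier
     in the taken branch:

	q = READ_ONCE(a);
	if (q) {
		<read barrier>
		p = READ_ONCE(b);
	}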
713 However, stores are not speculated. This means that ordering -is- provided
728 Worse yet, if the compiler is able to prove (say) that the value of
729 variable 'a' is always non-zero, it would be well within its rights
738 It is tempting to try to enforce ordering on identical stores on both
766 Now there is no conditional between the load from 'a' and the store to
767 'b', which means that the CPU is within its rights to reorder them:
768 The conditional is absolutely required, and must be present in the
783 ordering is guaranteed only when the stores differ, for example:
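     A sketch of the differing-stores form meant here (do_something() and
     do_something_else() are stand-ins):

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}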
794 The initial READ_ONCE() is still required to prevent the compiler from
810 If MAX is defined to be 1, then the compiler knows that (q % MAX) is
811 equal to zero, in which case the compiler is within its rights to
818 Given this transformation, the CPU is not required to respect the ordering
819 between the load from variable 'a' and the store to variable 'b'. It is
821 is gone, and the barrier won't bring it back. Therefore, if you are
822 relying on this ordering, you should make sure that MAX is greater than
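     Concretely (a reconstruction): given

	q = READ_ONCE(a);
	if (q % MAX) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}

     a MAX of 1 lets the compiler collapse the conditional and emit only:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 2);
	do_something_else();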
846 Because the first condition cannot fault and the second condition is
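     The hazard can be sketched as follows: a conditional such as

	q = READ_ONCE(a);
	if (q || 1 > 0)
		WRITE_ONCE(b, 1);

     may be flattened by the compiler into an unconditional store, destroying
     the control dependency:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 1);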
870 It is tempting to argue that there is in fact ordering because the
892 Note well that the ordering provided by a control dependency is local
909 to carry out the stores. Please note that it is -not- sufficient
917 conditional must involve the prior load. If the compiler is able
938 (*) Compilers do not understand control dependencies. It is therefore
946 always be paired. A lack of appropriate pairing is almost certainly an error.
1020 This sequence of events is committed to the memory coherence system in an order
1089 In the above example, CPU 2 perceives that B is 7, despite the load of *C
1271 The guarantee is that the second load will always come up with A == 1 if the
1279 Many CPUs speculate with loads: that is, they see that they will need to load an
1368-1369 [diagram annotation: "The speculation is discarded and an updated value is ..."; the box artwork from the original figure is omitted]
1376 Multicopy atomicity is a deeply intuitive notion about ordering that is
1400 it loads from X. The question is then "Can CPU 3's load from X return 0?"
1403 is natural to expect that CPU 3's load from X must therefore return 1.
1418 that CPU 2's general barrier is removed from the above example, leaving
1429 this example, it is perfectly legal for CPU 2's load from X to return 1,
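     For reference, the configuration under discussion looks roughly like
     this (a reconstruction, not a quotation):

	CPU 1			CPU 2			CPU 3
	===============		===============		===============
		{ X = 0, Y = 0 }
	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
				<dependency>		<read barrier>
				STORE Y=r1		LOAD X (may read 0)

     CPU 2's dependency orders its own two accesses, but implies nothing
     about when CPU 1's store to X becomes visible to CPU 3.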
1432 The key point is that although CPU 2's data dependency orders its load
1479 is prohibited:
1485 outcome is prohibited:
1489 However, the ordering provided by a release-acquire chain is local
1491 at least aside from stores. Therefore, the following outcome is possible:
1495 As an aside, the following outcome is also possible:
1509 However, please keep in mind that smp_load_acquire() is not magic.
1512 following outcome is possible:
1517 consistent system where nothing is ever reordered.
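     A two-line sketch of why ('x' is a stand-in):

	/* CPU 1 */			/* CPU 2 */
	WRITE_ONCE(x, 1);		r1 = smp_load_acquire(&x);

     r1 == 0 is a perfectly legal outcome: acquire semantics order CPU 2's
     subsequent accesses after the load, but say nothing about which value
     that load observes.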
1543 This is a general barrier -- there are no read-read or write-write
1552 One example use for this property is to ease communication between
1563 (*) The compiler is within its rights to reorder loads and stores
1564 to the same variable, and in some cases, the CPU is within its
1580 (*) The compiler is within its rights to merge successive loads from
1588 for single-threaded code, is almost certainly not what the developer
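     The merging optimization in question, sketched (do_something_with() is
     a stand-in):

	while (tmp = a)
		do_something_with(tmp);

     may legally become, for single-threaded purposes:

	if (tmp = a)
		for (;;)
			do_something_with(tmp);

     Writing tmp = READ_ONCE(a) in the loop condition forbids the
     transformation.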
1600 (*) The compiler is within its rights to reload a variable, for example,
1608 This could result in the following code, which is perfectly safe in
1626 is why compilers reload variables. Doing so is perfectly safe for
1628 where it is not safe.
1630 (*) The compiler is within its rights to omit a load entirely if it knows
1632 the value of variable 'a' is always zero, it can optimize this code:
1641 This transformation is a win for single-threaded code because it
1642 gets rid of a load and a branch. The problem is that the compiler
1643 will carry out its proof assuming that the current CPU is the only
1644 one updating variable 'a'. If variable 'a' is shared, then the
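     The optimization sketched: once the compiler has 'proved' that 'a' is
     always zero,

	while (tmp = a)
		do_something_with(tmp);

     can be replaced outright by:

	do { } while (0);

     READ_ONCE(a) defeats the proof by admitting that some other CPU might
     have updated 'a'.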
1651 But please note that the compiler is also closely watching what you
1653 do the following and MAX is a preprocessor macro with the value 1:
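     The example meant is, reconstructed:

	while ((tmp = READ_ONCE(a)) % MAX)
		do_something_with(tmp);

     The compiler then knows that the "%" operator applied to MAX always
     yields zero, again allowing it to optimize the loop into
     near-nonexistence even though READ_ONCE() is present.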
1663 (*) Similarly, the compiler is within its rights to omit a store entirely
1665 Again, the compiler assumes that the current CPU is the only one
1674 The compiler sees that the value of variable 'a' is already zero, so
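     The omitted-store hazard, sketched:

	a = 0;
	/* Code that does not store to variable a. */
	a = 0;

     Seeing that 'a' already holds zero, the compiler may drop the second
     store, even though another CPU might have modified 'a' in the meantime;
     WRITE_ONCE(a, 0) prevents the omission.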
1686 (*) The compiler is within its rights to reorder memory accesses unless
1702 There is nothing to prevent the compiler from transforming
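     Reconstructed, the running example has process_level() fill in a
     message and then raise a flag:

	void process_level(void)
	{
		msg = get_message();
		flag = true;
	}

     and nothing stops the compiler from emitting the stores in the other
     order:

	void process_level(void)
	{
		flag = true;
		msg = get_message();
	}

     so an interrupt handler testing flag may process a stale msg.
     WRITE_ONCE() on both stores rules the transformation out.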
1751 (*) The compiler is within its rights to invent stores to a variable,
1765 In single-threaded code, this is not only safe, but also saves
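     The invented-store transformation, sketched:

	if (condition)
		b = a;
	else
		b = 42;

     may be rewritten as:

	b = 42;
	if (condition)
		b = a;

     saving a branch in single-threaded code, but allowing other CPUs to
     observe a transient 42 in 'b'; using WRITE_ONCE() for the stores to 'b'
     forbids it.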
1783 and "store tearing," in which a single large access is replaced by
1792 which is not surprising given that it would likely take more
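     Store tearing, sketched: given

	p = 0x00010002;

     the compiler might emit a pair of 16-bit store-immediate instructions
     rather than a single 32-bit store, letting a concurrent reader observe
     a half-updated value. WRITE_ONCE(p, 0x00010002) forces a single store.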
1827 All that aside, it is never necessary to use READ_ONCE() and
1829 because 'jiffies' is marked volatile, it is never necessary to
1830 say READ_ONCE(jiffies). The reason for this is that READ_ONCE() and
1832 its argument is already marked volatile.
1856 the value of b before loading a[b]); however, there is no guarantee in
1858 (eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1)
1859 tmp = a[b]; ). There is also the problem of a compiler reloading b after
1862 macro is a good place to start looking.
1865 systems because it is assumed that a CPU will appear to be self-consistent,
1871 is sufficient.
1898 barrier may be required is when atomic ops are used for reference
1911 This makes sure that the death mark on the object is perceived to be set
1912 *before* the reference counter is decremented.
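     Schematically (obj and its fields are stand-ins):

	obj->dead = 1;
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);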
1953 us to guarantee the data is written to the descriptor before the device
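     A descriptor-ownership sketch using the DMA barriers meant here (the
     field names are hypothetical):

	if (desc->status != DEVICE_OWN) {
		/* do not read data until we own the descriptor */
		dma_rmb();
		read_data = desc->data;
		desc->data = write_data;
		/* flush modifications before status update */
		dma_wmb();
		/* assign ownership */
		desc->status = DEVICE_OWN;
	}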
1963 This is for use with persistent memory to ensure that stores for which
1970 data transfer caused by subsequent instructions is initiated. This is
1991 This specification is a _minimum_ guarantee; any particular architecture may
2044 one-way barriers is that the effects of instructions outside of a critical
2048 because it is possible for an access preceding the ACQUIRE to happen after the
2087 One key point is that we are only talking about the CPU doing
2095 If there is a deadlock, this lock operation will simply spin (or
2101 But what if the lock is a sleeplock? In that case, the code will
2127 The following sequence of events is acceptable:
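     Reconstructed in the document's notation: the sequence

	*A = a;
	ACQUIRE M
	RELEASE M
	*B = b;

     may, because ACQUIRE and RELEASE are only one-way barriers, be
     perceived as:

	ACQUIRE M, STORE *B, STORE *A, RELEASE M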
2170 A general memory barrier is interpolated automatically by set_current_state()
2187 The whole sequence above is available in various canned forms, all of which
2210 A general memory barrier is executed by wake_up() if it wakes something up.
2213 is accessed, in particular, it sits between the STORE to indicate the event
2225 where "task" is the thread being woken up and it equals CPU 1's "current".
2227 To repeat, a general memory barrier is guaranteed to be executed by wake_up()
2228 if something is actually awakened, but otherwise there is no such guarantee.
2242 occurs before the task state is accessed. In particular, if the wake_up() in
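     The sleeper/waker pairing being described is roughly this (a sketch;
     event_indicated is a stand-in flag):

	CPU 1 (Sleeper)				CPU 2 (Waker)
	===============================		===============================
	set_current_state(TASK_UNINTERRUPTIBLE);
						STORE event_indicated
	LOAD event_indicated			wake_up();

     set_current_state() embeds a general barrier between the state store
     and the event load; wake_up() supplies one between the event store and
     the task-state access, but only if it actually wakes the task.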
2335 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
2353 Under normal operation, memory operation reordering is generally not going to
2372 synchronisation problems, and the usual way of dealing with them is to use
2378 Consider, for example, the R/W semaphore slow path. Here a waiting process is
2396 next waiter record is;
2419 before proceeding. Since the record is on the waiter's stack, this means that
2420 if the task pointer is cleared _before_ the next pointer in the list is read,
2447 The way to deal with this is to insert a general SMP memory barrier:
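     Reconstructed, the up_read()/up_write() path then becomes:

	LOAD waiter->list.next;
	LOAD waiter->task;
	smp_mb();
	STORE waiter->task;
	CALL wakeup
	RELEASE task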
2460 instruction itself is complete.
2462 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2488 device in the requisite order if the CPU or the compiler thinks it is more
2489 efficient to reorder, combine or merge accesses - something that would cause
2512 routine is executing, the driver's core may not run on the same CPU, and its
2513 interrupt is not permitted to happen again until the current interrupt has been
2518 under interrupt-disablement and then the driver's interrupt handler is invoked:
2546 running on separate CPUs that communicate with each other. If such a case is
2554 Interfacing with peripherals via I/O accesses is deeply architecture and device
2574 2. A writeX() issued by a CPU thread holding a spinlock is ordered
2598 will arrive at least 1us apart if the first write is immediately read
2599 back with readX() and udelay(1) is called prior to the second
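     The sequence being described, sketched (DEVICE_REGISTER_0/1 are
     placeholder device addresses):

	writel(42, DEVICE_REGISTER_0); /* Arrives at the device... */
	readl(DEVICE_REGISTER_0);
	udelay(1);
	writel(42, DEVICE_REGISTER_1); /* ...at least 1us before this. */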
2634 accessed is passed as an argument.
2644 returning. This is not guaranteed by all architectures and is therefore
2659 writesX()), all of the above assume that the underlying peripheral is
2668 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2679 causality is maintained.
2691 stream in any way it sees fit, again provided the appearance of causality is
2699 The way cached memory operations are perceived across the system is affected to
2740 it wishes, and continue execution until it is forced to wait for an instruction
2743 What memory barriers are concerned with is controlling the order in which
2770 is discarded from the CPU's cache and reloaded. To deal with this, the
2785 Amongst these properties is usually the fact that such accesses bypass the
2798 operations in exactly the order specified, so that if the CPU is, for example,
2814 Reality is, of course, much messier. With many CPUs and compilers, the above
2841 is:
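     A plausible reconstruction of the perceived sequence (the ellipsis
     stands for whatever else the memory system interleaves):

	LOAD *A, ..., LOAD {*C,*D}, STORE *E, STORE *B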
2845 (Where "LOAD {*C,*D}" is a combined load)
2848 However, it is guaranteed that a CPU will be self-consistent: it will see its
2877 On such architectures, READ_ONCE() and WRITE_ONCE() do whatever is
2895 assumed that the effect of the storage of V to *A is lost. Similarly:
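     Reconstructed, the parallel case is:

	*A = V; *A = W;

     which may, absent barriers and WRITE_ONCE(), be reduced to:

	*A = W;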
2912 The DEC Alpha CPU is one of the most relaxed CPUs there is. Not only that,
2914 two semantically-related cache lines updated at separate times. This is where
2928 the guest itself is compiled without SMP support. This is an artifact of
2930 barriers for this use-case would be possible but is often suboptimal.
2933 These have the same effect as smp_mb() etc when SMP is enabled, but generate
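     A guest-side sketch (the ring layout and names are hypothetical):

	/* Guest: publish a buffer to a host-visible ring. */
	ring->slot[idx].addr = buf_phys;
	virt_wmb();		/* order the payload before the index update */
	WRITE_ONCE(ring->idx, idx + 1);

     virt_wmb() still emits a real barrier in a CONFIG_SMP=n guest, because
     the host may be running on another physical CPU.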