Lines Matching +refs:is +refs:pre +refs:merge

14 This document is not a specification; it is intentionally (for the sake of
15 brevity) and unintentionally (due to being human) incomplete. This document is
23 To repeat, this document is not a specification of what Linux expects from
26 The purpose of this document is twofold:
35 that, that architecture is incorrect.
37 Note also that it is possible that a barrier may be a no-op for an
136 abstract CPU, memory operation ordering is very relaxed, and a CPU may actually
189 There is an obvious address dependency here, as the value loaded into D depends
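
For reference, the two-CPU example this line belongs to has the following shape in the document (a reconstructed sketch; A, B, C, P, Q and D are the document's own names):

	CPU 1		      CPU 2
	===============	      ===============
	{ A == 1, B == 2, C == 3, P == &A, Q == &C }
	B = 4;
	<write barrier>
	WRITE_ONCE(P, &B);
			      Q = READ_ONCE(P);
			      D = *Q;

The value loaded into D depends on the address that CPU 2 read from P, which is what makes this an address dependency rather than a control dependency.
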
205 locations, but the order in which the control registers are accessed is very
270 WRITE_ONCE(). Without them, the compiler is within its rights to
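
A minimal sketch of what the marked accesses buy you ('stop' and do_work() are illustrative names, not from the original): without READ_ONCE() the compiler may hoist the load out of the loop, fuse repeated loads, or tear them; without WRITE_ONCE() it may likewise tear or defer the store.

	while (!READ_ONCE(stop))	/* a fresh, single load each iteration */
		do_work();

	WRITE_ONCE(stop, 1);		/* a single, untorn store */
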
330 using older pre-C11 compilers (for example, gcc 4.6). The portion
331 of the standard containing this guarantee is Section 3.14, which
344 to two bit-fields, if one is declared inside a nested
345 structure declaration and the other is not, or if the two
348 declaration. It is not safe to concurrently update two
360 What is required is some way of intervening to instruct the compiler and the
366 Such enforcement is important because the CPUs and other devices in a system
386 A write barrier is a partial ordering on stores only; it is not required
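
To illustrate "stores only" (variable names are illustrative): smp_wmb() orders the two stores with respect to each other but makes no promise about any loads issued around them.

	WRITE_ONCE(a, 1);
	smp_wmb();		/* orders the store to 'a' before the  */
	WRITE_ONCE(b, 1);	/* store to 'b'; loads are unaffected  */
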
398 [!] This section is marked as HISTORICAL: it covers the long-obsolete
404 An address-dependency barrier is a weaker form of read barrier. In the
408 be required to make sure that the target of the second load is updated
409 after the address obtained by the first load is accessed.
411 An address-dependency barrier is a partial ordering on interdependent
412 loads only; it is not required to have any effect on stores, independent
428 not a control dependency. If the address for the second load is dependent
429 on the first load, but the dependency is through a conditional rather than
431 a full read barrier or better is required. See the "Control dependencies"
444 A read barrier is an address-dependency barrier plus a guarantee that all
449 A read barrier is a partial ordering on loads only; it is not required to
466 A general memory barrier is a partial ordering over both loads and stores.
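
A sketch of the classic store-buffering pattern in the document's two-column style (x, y, r0 and r1 are illustrative names, both locations initially zero): with a general barrier on each side, the outcome r0 == 0 && r1 == 0 is forbidden, something neither a read barrier nor a write barrier alone can provide, because each CPU's barrier must order a store against a load.

	CPU 1			CPU 2
	===============		===============
	WRITE_ONCE(x, 1);	WRITE_ONCE(y, 1);
	smp_mb();		smp_mb();
	r0 = READ_ONCE(y);	r1 = READ_ONCE(x);
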
501 for other sorts of memory barrier. In addition, a RELEASE+ACQUIRE pair is
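
Although such a pair is not a full barrier, it does order the usual message-passing pattern; a minimal sketch (names illustrative): if the acquire load observes the value written by the release store, CPU 2 is also guaranteed to observe CPU 1's earlier store to 'data'.

	CPU 1				CPU 2
	===============			===============
	WRITE_ONCE(data, 42);
	smp_store_release(&flag, 1);	if (smp_load_acquire(&flag))
						r = READ_ONCE(data);	/* r == 42 */
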
534 (*) There is no guarantee that any of the memory accesses specified before a
539 (*) There is no guarantee that issuing a memory barrier on one CPU will have
544 (*) There is no guarantee that a CPU will see the correct order of effects
549 (*) There is no guarantee that some intervening piece of off-the-CPU
563 [!] This section is marked as HISTORICAL: it covers the long-obsolete
573 those who are interested in the history, here is the story of
580 The requirement of address-dependency barriers is a little subtle, and
593 [!] READ_ONCE_OLD() corresponds to the READ_ONCE() of pre-4.15 kernels, which
633 even-numbered bank of the reading CPU's cache is extremely busy while the
634 odd-numbered bank is idle, one can see the new value of the pointer P (&B),
638 An address-dependency barrier is not required to order dependent writes
655 Therefore, no address-dependency barrier is required to order the read into
656 Q with the store into *Q. In other words, this outcome is prohibited,
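
The example in question has this shape (reconstructed from the surrounding text): the address stored through is the very value returned by the load, so the CPU cannot issue the store until the load has completed, and no barrier is needed between the two.

	Q = READ_ONCE(P);
	WRITE_ONCE(*Q, 5);	/* the store's target address depends on the load */
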
662 of dependency ordering is to -prevent- writes to the data structure, along
668 Note well that the ordering provided by an address dependency is local to
673 The address-dependency barrier is very important to the RCU system,
684 not understand them. The purpose of this section is to help you prevent
698 This will not have the desired effect because there is no actual address
702 what's actually required is:
710 However, stores are not speculated. This means that ordering -is- provided
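
The two cases read roughly as follows (a reconstructed sketch, using smp_rmb() where the document writes <read barrier>): a conditional orders a prior load against subsequent stores, but ordering a subsequent load still requires a read barrier.

	q = READ_ONCE(a);
	if (q)
		WRITE_ONCE(b, 1);	/* ordered: stores are not speculated */

	q = READ_ONCE(a);
	if (q) {
		smp_rmb();		/* required: the conditional alone   */
		p = READ_ONCE(b);	/* does not order load against load  */
	}
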
725 Worse yet, if the compiler is able to prove (say) that the value of
726 variable 'a' is always non-zero, it would be well within its rights
735 It is tempting to try to enforce ordering on identical stores on both
763 Now there is no conditional between the load from 'a' and the store to
764 'b', which means that the CPU is within its rights to reorder them:
765 The conditional is absolutely required, and must be present in the
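
The attempted-identical-stores example and its defeat read roughly as follows in the document (reconstructed):

	q = READ_ONCE(a);
	if (q) {
		barrier();
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		barrier();
		WRITE_ONCE(b, 1);
		do_something_else();
	}

which the compiler may transform into:

	q = READ_ONCE(a);
	barrier();
	WRITE_ONCE(b, 1);
	if (q)
		do_something();
	else
		do_something_else();
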
780 ordering is guaranteed only when the stores differ, for example:
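
The example introduced here looks like this in the document (reconstructed; do_something() and do_something_else() are the document's own placeholders):

	q = READ_ONCE(a);
	if (q) {
		WRITE_ONCE(b, 1);
		do_something();
	} else {
		WRITE_ONCE(b, 2);
		do_something_else();
	}
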
791 The initial READ_ONCE() is still required to prevent the compiler from
807 If MAX is defined to be 1, then the compiler knows that (q % MAX) is
808 equal to zero, in which case the compiler is within its rights to
815 Given this transformation, the CPU is not required to respect the ordering
816 between the load from variable 'a' and the store to variable 'b'. It is
818 is gone, and the barrier won't bring it back. Therefore, if you are
819 relying on this ordering, you should make sure that MAX is greater than
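
With MAX defined as 1, the transformation described above leaves roughly the following (a sketch): the conditional is gone, and with it any ordering between the load from 'a' and the store to 'b'.

	q = READ_ONCE(a);
	WRITE_ONCE(b, 2);	/* the 'if (q % MAX)' test was compiled away */
	do_something_else();
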
843 Because the first condition cannot fault and the second condition is
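
The example this refers to is roughly (reconstructed):

	q = READ_ONCE(a);
	if (q || 1 > 0)
		WRITE_ONCE(b, 1);

Because the condition as a whole is always true, the compiler may drop it, leaving an unordered pair:

	q = READ_ONCE(a);
	WRITE_ONCE(b, 1);
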
867 It is tempting to argue that there is in fact ordering because the
889 Note well that the ordering provided by a control dependency is local
906 to carry out the stores. Please note that it is -not- sufficient
914 conditional must involve the prior load. If the compiler is able
935 (*) Compilers do not understand control dependencies. It is therefore
943 always be paired. A lack of appropriate pairing is almost certainly an error.
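
A sketch of one correctly paired couple, in the document's two-column style (names illustrative): the write barrier on CPU 1 pairs with the read barrier on CPU 2, so observing x == 2 guarantees y == 1.

	CPU 1			CPU 2
	===============		===============
	WRITE_ONCE(a, 1);
	smp_wmb();
	WRITE_ONCE(b, 2);	x = READ_ONCE(b);
				smp_rmb();
				y = READ_ONCE(a);
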
1017 This sequence of events is committed to the memory coherence system in an order
1086 In the above example, CPU 2 perceives that B is 7, despite the load of *C
1268 The guarantee is that the second load will always come up with A == 1 if the
1276 Many CPUs speculate with loads: that is, they see that they will need to load an
1365 The speculation is discarded          [ASCII cache diagram fragments omitted]
1366 and an updated value is
1373 Multicopy atomicity is a deeply intuitive notion about ordering that is
1397 it loads from X. The question is then "Can CPU 3's load from X return 0?"
1400 is natural to expect that CPU 3's load from X must therefore return 1.
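
The three-CPU example under discussion has this shape in the document (reconstructed):

	CPU 1			CPU 2			CPU 3
	=======================	=======================	=======================
		{ X = 0, Y = 0 }
	STORE X=1		r1=LOAD X (reads 1)	LOAD Y (reads 1)
				<general barrier>	<read barrier>
				STORE Y=r1		LOAD X
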
1415 that CPU 2's general barrier is removed from the above example, leaving
1426 this example, it is perfectly legal for CPU 2's load from X to return 1,
1429 The key point is that although CPU 2's data dependency orders its load
1476 is prohibited:
1482 outcome is prohibited:
1486 However, the ordering provided by a release-acquire chain is local
1488 at least aside from stores. Therefore, the following outcome is possible:
1492 As an aside, the following outcome is also possible:
1506 However, please keep in mind that smp_load_acquire() is not magic.
1509 following outcome is possible:
1514 consistent system where nothing is ever reordered.
1540 This is a general barrier -- there are no read-read or write-write
1549 One example use for this property is to ease communication between
1560 (*) The compiler is within its rights to reorder loads and stores
1561 to the same variable, and in some cases, the CPU is within its
1577 (*) The compiler is within its rights to merge successive loads from
1585 for single-threaded code, is almost certainly not what the developer
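
The merged form referred to here is roughly (reconstructed): the load is performed once, so the loop never notices another CPU storing zero to 'a'.

	while (tmp = a)
		do_something_with(tmp);

being transformed into:

	if (tmp = a)
		for (;;)
			do_something_with(tmp);
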
1597 (*) The compiler is within its rights to reload a variable, for example,
1605 This could result in the following code, which is perfectly safe in
1623 is why compilers reload variables. Doing so is perfectly safe for
1625 where it is not safe.
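
The reloaded form referred to above is roughly (reconstructed):

	while (a)
		do_something_with(a);

where do_something_with() may be handed a different value of 'a' than the one the loop condition just tested.
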
1627 (*) The compiler is within its rights to omit a load entirely if it knows
1629 the value of variable 'a' is always zero, it can optimize this code:
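
The optimization reads roughly like this (reconstructed): given a proof that 'a' is always zero, the compiler may turn

	while (tmp = a)
		do_something_with(tmp);

into

	do { } while (0);

discarding both the load and the branch; READ_ONCE(a) defeats the proof.
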
1638 This transformation is a win for single-threaded code because it
1639 gets rid of a load and a branch. The problem is that the compiler
1640 will carry out its proof assuming that the current CPU is the only
1641 one updating variable 'a'. If variable 'a' is shared, then the
1648 But please note that the compiler is also closely watching what you
1650 do the following and MAX is a preprocessor macro with the value 1:
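
From the surrounding text, the example is of roughly this shape: MAX being 1, the compiler knows that '% MAX' always yields zero and may optimize the loop into near-nonexistence despite the READ_ONCE().

	while ((tmp = READ_ONCE(a)) % MAX)
		do_something_with(tmp);
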
1660 (*) Similarly, the compiler is within its rights to omit a store entirely
1662 Again, the compiler assumes that the current CPU is the only one
1671 The compiler sees that the value of variable 'a' is already zero, so
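
The surprising transformation has this shape in the document:

	a = 0;
	... Code that does not store to variable a ...
	a = 0;

The compiler may omit the second store, which is fatal if another CPU might have stored to 'a' in the meantime; WRITE_ONCE(a, 0) prevents the omission.
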
1683 (*) The compiler is within its rights to reorder memory accesses unless
1699 There is nothing to prevent the compiler from transforming
1748 (*) The compiler is within its rights to invent stores to a variable,
1762 In single-threaded code, this is not only safe, but also saves
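
The invented-store example reads roughly (reconstructed):

	if (a)
		b = a;
	else
		b = 42;

which the compiler may transform into:

	b = 42;
	if (a)
		b = a;

saving a branch in single-threaded code, but briefly exposing the spurious value 42 to other CPUs.
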
1780 and "store tearing," in which a single large access is replaced by
1789 which is not surprising given that it would likely take more
1824 All that aside, it is never necessary to use READ_ONCE() and
1826 because 'jiffies' is marked volatile, it is never necessary to
1827 say READ_ONCE(jiffies). The reason for this is that READ_ONCE() and
1829 its argument is already marked volatile.
1853 the value of b before loading a[b]), there is in fact no guarantee in
1855 (eg. is equal to 1) and load a[b] before b (eg. tmp = a[1]; if (b != 1)
1856 tmp = a[b]; ). There is also the problem of a compiler reloading b after
1859 macro is a good place to start looking.
1862 systems because it is assumed that a CPU will appear to be self-consistent,
1868 is sufficient.
1895 barrier may be required is when atomic ops are used for reference
1908 This makes sure that the death mark on the object is perceived to be set
1909 *before* the reference counter is decremented.
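
The snippet this describes reads as follows in the document:

	obj->dead = 1;
	smp_mb__before_atomic();
	atomic_dec(&obj->ref_count);
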
1950 us to guarantee the data is written to the descriptor before the device
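
A condensed sketch of the descriptor example this line belongs to (field names follow the document's example): dma_wmb() makes the data visible to the device before the ownership flip is.

	desc->data = write_data;
	dma_wmb();			/* flush the data before... */
	desc->status = DEVICE_OWN;	/* ...the device sees it owns the descriptor */
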
1960 This is for use with persistent memory to ensure that stores for which
1967 data transfer caused by subsequent instructions is initiated. This is
1988 This specification is a _minimum_ guarantee; any particular architecture may
2041 one-way barriers is that the effects of instructions outside of a critical
2045 because it is possible for an access preceding the ACQUIRE to happen after the
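
The document's illustration of this one-way property: the sequence

	*A = a;
	ACQUIRE M
	RELEASE M
	*B = b;

may legitimately be observed as:

	ACQUIRE M, STORE *B, STORE *A, RELEASE M
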
2084 One key point is that we are only talking about the CPU doing
2092 If there is a deadlock, this lock operation will simply spin (or
2098 But what if the lock is a sleeplock? In that case, the code will
2124 The following sequence of events is acceptable:
2167 A general memory barrier is interpolated automatically by set_current_state()
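
The canonical sleep sequence this refers to is:

	for (;;) {
		set_current_state(TASK_UNINTERRUPTIBLE);
		if (event_indicated)
			break;
		schedule();
	}
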
2184 The whole sequence above is available in various canned forms, all of which
2207 A general memory barrier is executed by wake_up() if it wakes something up.
2210 is accessed; in particular, it sits between the STORE to indicate the event
2222 where "task" is the thread being woken up and it equals CPU 1's "current".
2224 To repeat, a general memory barrier is guaranteed to be executed by wake_up()
2225 if something is actually awakened, but otherwise there is no such guarantee.
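
The matching waker side is the usual pattern (event_indicated and event_wait_queue as in the document's sleep example):

	event_indicated = 1;
	wake_up(&event_wait_queue);

with wake_up()'s implied barrier ordering the store to event_indicated before the sleeper's task state is accessed, and only if something is actually woken.
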
2239 occurs before the task state is accessed. In particular, if the wake_up() in
2332 Then there is no guarantee as to what order CPU 3 will see the accesses to *A
2350 Under normal operation, memory operation reordering is generally not going to
2369 synchronisation problems, and the usual way of dealing with them is to use
2375 Consider, for example, the R/W semaphore slow path. Here a waiting process is
2393 next waiter record is;
2416 before proceeding. Since the record is on the waiter's stack, this means that
2417 if the task pointer is cleared _before_ the next pointer in the list is read,
2444 The way to deal with this is to insert a general SMP memory barrier:
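
In the document's pseudo-op notation, the repaired up() sequence reads roughly (a reconstruction):

	LOAD waiter->list.next;
	LOAD waiter->task;
	smp_mb();
	STORE waiter->task;
	CALL wakeup
	RELEASE task
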
2457 instruction itself is complete.
2459 On a UP system - where this wouldn't be a problem - the smp_mb() is just a
2485 device in the requisite order if the CPU or the compiler thinks it is more
2486 efficient to reorder, combine or merge accesses - something that would cause
2509 routine is executing, the driver's core may not run on the same CPU, and its
2510 interrupt is not permitted to happen again until the current interrupt has been
2515 under interrupt-disablement and then the driver's interrupt handler is invoked:
2543 running on separate CPUs that communicate with each other. If such a case is
2551 Interfacing with peripherals via I/O accesses is deeply architecture and device
2571 2. A writeX() issued by a CPU thread holding a spinlock is ordered
2595 will arrive at least 1us apart if the first write is immediately read
2596 back with readX() and udelay(1) is called prior to the second
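
The snippet this property describes reads roughly:

	writel(42, DEVICE_REGISTER_0); // Arrives at the device...
	readl(DEVICE_REGISTER_0);
	udelay(1);
	writel(42, DEVICE_REGISTER_1); // ...at least 1us before this.
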
2631 accessed is passed as an argument.
2641 returning. This is not guaranteed by all architectures and is therefore
2656 writesX()), all of the above assume that the underlying peripheral is
2665 It has to be assumed that the conceptual CPU is weakly-ordered but that it will
2676 causality is maintained.
2688 stream in any way it sees fit, again provided the appearance of causality is
2696 The way cached memory operations are perceived across the system is affected to
2737 it wishes, and continue execution until it is forced to wait for an instruction
2740 What memory barriers are concerned with is controlling the order in which
2767 is discarded from the CPU's cache and reloaded. To deal with this, the
2782 Amongst these properties is usually the fact that such accesses bypass the
2795 operations in exactly the order specified, so that if the CPU is, for example,
2811 Reality is, of course, much messier. With many CPUs and compilers, the above
2838 is:
2842 (Where "LOAD {*C,*D}" is a combined load)
2845 However, it is guaranteed that a CPU will be self-consistent: it will see its
2874 On such architectures, READ_ONCE() and WRITE_ONCE() do whatever is
2892 assumed that the effect of the storage of V to *A is lost. Similarly:
2909 The DEC Alpha CPU is one of the most relaxed CPUs there are. Not only that,
2911 two semantically-related cache lines updated at separate times. This is where
2925 the guest itself is compiled without SMP support. This is an artifact of
2927 barriers for this use-case would be possible but would often be suboptimal.
2930 These have the same effect as smp_mb() etc when SMP is enabled, but generate