
  5. ORDERING AND CYCLES
  7. THE PROGRAM ORDER RELATION: po AND po-loc
  9. DEPENDENCY RELATIONS: data, addr, and ctrl
 10. THE READS-FROM RELATION: rf, rfi, and rfe
 11. CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
 12. THE FROM-READS RELATION: fr, fri, and fre
 19. AND THEN THERE WAS ALPHA
 22. RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
 24. PLAIN ACCESSES AND DATA RACES
 25. ODDS AND ENDS
The Linux-kernel memory consistency model (LKMM) is rather complex and
linux-kernel.bell and linux-kernel.cat files that make up the formal
version of the model; they are extremely terse and their meanings are
not go into the details of the code in the .bell and .cat files;
Sections 2 (BACKGROUND) through 5 (ORDERING AND CYCLES) are aimed
toward beginners; they explain what memory consistency models are and
not passed by pointers, and they don't have an "exists" clause at the
That is, given a piece of code and a collection of values specified
factors such as DMA and mixed-size accesses.) But on multiprocessor
Different architectures have differing memory models, and the Linux
device, stores it in a buffer, and sets a flag to indicate the buffer
ready, and if it is, copies the data back to userspace. The buffer
and the flag are memory locations shared between the two CPUs.
(the reason for using WRITE_ONCE() and READ_ONCE() instead of simple
CPU and P1() represents the read() routine running on another. The
Thus, P0 stores the data in buf and then sets flag. Meanwhile, P1
reads flag into the private variable r1, and if it is set, reads the
while and try again.)
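The MP pattern described above can be sketched as a litmus test along these lines (a sketch consistent with the discussion; the full document's version may differ in detail):

```
int buf = 0, flag = 0;

P0()
{
	WRITE_ONCE(buf, 1);
	WRITE_ONCE(flag, 1);
}

P1()
{
	int r1;
	int r2 = 0;

	r1 = READ_ONCE(flag);
	if (r1)
		r2 = READ_ONCE(buf);
}
```

The question of interest is which final values of r1 and r2 the memory model allows.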
shared memory locations and another CPU loads from those locations in
usually use sleep and wakeup mechanisms rather than polling for I/O
completion, and real code generally doesn't bother to copy values into
from flag and buf, or equivalently, what values r1 and r2 might end up
instance, P1 might run entirely before P0 begins, in which case r1 and
begins, in which case r1 and r2 will both be 1.
store to buf but before the store to flag. In this case, r1 and r2
unconditionally then we would instead have r1 = 0 and r2 = 1.)
However, the most interesting possibility is where r1 = 1 and r2 = 0.
happens, the LKMM does predict this outcome can occur, and the example
The first widely cited memory model, and the simplest to understand,
performance optimizations. On ARM and PowerPC architectures, for
instance, the MP example code really does sometimes yield r1 = 1 and
x86 and SPARC follow yet a different memory model: TSO (Total Store
each CPU stores to its own shared location and then loads from the
this outcome to occur, and in fact it does sometimes occur on x86 and
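The SB (store buffering) pattern under discussion, in which each CPU stores to its own shared location and then loads from the other's, can be sketched as follows (a sketch, not necessarily the document's exact test):

```
int x = 0, y = 0;

P0()
{
	int r0;

	WRITE_ONCE(x, 1);
	r0 = READ_ONCE(y);
}

P1()
{
	int r1;

	WRITE_ONCE(y, 1);
	r1 = READ_ONCE(x);
}
```

Under TSO, each store can wait in its CPU's store buffer while the following load executes, so the outcome r0 = 0 && r1 = 0 is allowed.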
x86, Alpha, and other architectures. However, it is different in
ORDERING AND CYCLES
as those instructions occur in the code, and there are many other
It's not possible to have X ordered before Y, Y ordered before Z, and
and a certain outcome for the loads in a piece of code can happen only
spin_lock(). However, no single event can be both a read and a write.
shared memory, do not give rise to events. Thus, arithmetic and
is concerned only with the store itself -- its value and its address
THE PROGRAM ORDER RELATION: po AND po-loc
code after branches are taken into account and loops have been
first comes before the second in program order and they access the
code, and it retains their outlook to a considerable extent. The
read, write, and fence events used by the model are close in spirit to
kernel code written in C, and the mapping from C to machine code can
by the C standard), and it will not introduce extraneous accesses.
The MP and SB examples above used READ_ONCE() and WRITE_ONCE() rather
buf and P0's write event to flag, and similarly for the other shared
The protections provided by READ_ONCE(), WRITE_ONCE(), and others are
not perfect; and under some circumstances it is possible for the
operators and function calls can be evaluated in any order. For
DEPENDENCY RELATIONS: data, addr, and ctrl
from memory by the first. The first event must be a read, and the
There are three kinds of dependencies: data, address (addr), and
A read and a write event are linked by a data dependency if the value
arbitrarily complicated computations, and a write can depend on the
A read event and another memory access event are linked by an address
Finally, a read event and another memory access event are linked by a
THE READS-FROM RELATION: rf, rfi, and rfe
further distinguish the cases where the load and the store occur on
the same CPU (internal reads-from, or rfi) and where they occur on
and some of them from another store. Fortunately, use of READ_ONCE()
and WRITE_ONCE() will prevent load-tearing; it's not possible to have:
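The elided litmus test presumably resembles the following sketch (x is assumed to start at 0, and P0 stores 0x1234):

```
int x = 0;

P0()
{
	WRITE_ONCE(x, 0x1234);
}

P1()
{
	int r1;

	r1 = READ_ONCE(x);
}
```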
and end up with r1 = 0x1200 (partly from x's initial value and partly
from both of P0's stores. It is possible to handle mixed-size and
attempt to do so. It requires all accesses to be properly aligned and
CACHE COHERENCE AND THE COHERENCE ORDER RELATION: co, coi, and coe
ordering which all the CPUs agree on (the coherence order), and this
that value comes third, and so on.
W' in program order and they access the same location), where W
and W' are two stores, then W ->co W'.
Write-read coherence: If W ->po-loc R, where W is a store and R
Read-write coherence: If R ->po-loc W, where R is a load and W
Read-read coherence: If R ->po-loc R', where R and R' are two
Wikipedia, sequential consistency per variable and cache coherence
program order, it must also come later in x's coherence order and
If r1 = 5 (reading from P0's store) and r2 = 0 (reading from the
and r2 = 0! This results from parallel execution of the operations
encoded in Itanium's Very-Long-Instruction-Word format, and it is yet
possible for a store to directly or indirectly overwrite itself! And
occur on the same CPU (internal coherence order, or coi) and stores
THE FROM-READS RELATION: fr, fri, and fre
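A minimal sketch illustrating the fr relation (assuming x starts at 0; a single CPU loads from x and then overwrites it):

```
int x = 0;

P0()
{
	int r1;

	r1 = READ_ONCE(x);
	WRITE_ONCE(x, 2);
}
```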
The value loaded from x will be 0 (assuming cache coherence!), and it
As with rf, rfi, and rfe, we subdivide the fr relation into fri (when
the load and the store are on the same CPU) and fre (when they are on
Note that the fr relation is determined entirely by the rf and co
relations; it is not independent. Given a read event R and a write
event W for the same location, we will have R ->fr W if and only if
	(R ->fr W) := (there exists W' with W' ->rf R and W' ->co W).
The system as a whole is divided into the CPUs and a memory subsystem.
in program order), and they communicate with the memory subsystem.
only internal operations. However, loads, stores, and fences involve
and we say that the store's value is forwarded to R. Otherwise, the
CPU asks the memory subsystem for the value to load and we say that R
CPUs have local caches, and propagating a store to a CPU really means
time to process the stores that it receives, and a store can't be used
First-In-First-Out order, and consequently the processing delay
Note that load instructions may be executed speculatively and may be
about the fence. However, fences do constrain the way CPUs and the
Strong fences, including smp_mb() and synchronize_rcu(), force
The propagation ordering enforced by release fences and strong fences
fence. We describe this property by saying that release fences and
rcu_read_lock(), rcu_read_unlock(), and synchronize_rcu() fences have
The fences which affect propagation order (i.e., strong, release, and
defined to link memory access events E and F whenever:
E and F are both stores on the same CPU and an smp_wmb() fence
F is a release fence and some X comes before F in program order,
A strong fence event occurs between some X and F in program
The operational model requires that whenever W and W' are both stores
and W ->cumul-fence W', then W must propagate to any given CPU
before W' does. However, for different CPUs C and C', it does not
maintaining cache coherence and the fact that a CPU can't operate on a
CPUs and to RAM in a specific order.
Rcu: This requires that RCU read-side critical sections and
The first and second are quite common; they can be found in many
memory models (such as those for C11/C++11). The "happens-before" and
"rcu" and "plain-coherence" axioms are specific to the LKMM.
each load between the store that it reads from and the following
and po-loc relations agree with this global ordering; in other words,
this case) does not get altered between the read and the write events
occurs in between CPU 1's load and store. To put it another way, the
is between the store that CPU 1 reads from and the store that CPU 1
rule: Whenever R and W are the read and write events composing an
atomic read-modify-write and W' is the write event which R reads from,
there must not be any stores coming between W' and W in the coherence
	(R ->rmw W) implies (there is no X with R ->fr X and X ->co W),
where the rmw relation links the read and write events making up each
instruction to the po-later instruction and is thus a sub-relation of
situation: Fences are a source of ppo links. Suppose X and Y are
X and Y;
X and Y are both stores and an smp_wmb() fence occurs between
X and Y are both loads and an smp_rmb() fence occurs between
X and Y are both loads, X ->addr Y (i.e., there is an address
dependency from X to Y), and X is a READ_ONCE() or an atomic
can always satisfy the second load speculatively before the first, and
be executed after all. And lastly, the real difficulties begin when
problem when the loads come from READ_ONCE(), and therefore the LKMM
store and a second, po-later load reads from that store:
it knows what that value is, or that W and R' do access the same
and W then the CPU can speculatively forward W to R' before executing
because it could tell that the store and the second load access the
(the po-loc link says that R comes before W in program order and they
and the CPU executed W' before W, then the memory subsystem would put
AND THEN THERE WAS ALPHA
and r2 = 0 at the end, in spite of the address dependency.
to ptr does. And since P1 can't execute its second load
adds this fence after every READ_ONCE() and atomic load on Alpha. The
then we would never get r1 = &x and r2 = 0. By the time P1 executed
the local cache and available for satisfying the read request. Thus
The LKMM requires that smp_rmb(), acquire fences, and strong fences
And of course, none of this matters for any architecture other than
execute in a certain order. hb includes the ppo relation and two
W ->rfe R implies that W and R are on different CPUs. It also means
must have executed before R, and so we have W ->hb R.
The equivalent fact need not hold if W ->rfi R (i.e., W and R are on
W ->coe W'. This means that W and W' are stores to the same location,
they execute on different CPUs, and W comes before W' in the coherence
cache coherence. The relation is called prop, and it links two events
the first event in the coherence order and propagates to C before the
order, and P1's store propagated to P0 before P0's load executed.
If r1 = 0 and r2 = 9 at the end then P0's accesses must have executed
load executed, and so r1 would have been 9 rather than 0. In this
because P1's store overwrote the value read by P0's first load, and
stores. If r1 = 1 and r2 = 0 at the end then there is a prop link
to buf will propagate to P1 before the store to flag does, and the
Alpha, although the details are more complicated and we will not go
would force the two loads to be executed in program order, and it
link (hence an hb link) from the first load to the second, and the
followed by two cumul-fences and an rfe link, utilizing the fact that
If x = 2, r0 = 1, and r2 = 1 after this code runs then there is a prop
before P2's load and store execute, P2's smp_store_release()
guarantees that the stores to x and y both propagate to P0 before the
store to z does (the second cumul-fence), and P0's load executes after the
features of strong fences. It links two events E and F whenever some
store is coherence-later than E and propagates to every CPU and to RAM
optional rfe link, a strong fence, and an arbitrary number of hb
of links begins with coe). Then there are events W, X, Y, and Z such
specified type, and the ? suffix indicates the link is optional (Y may
propagate to Y's CPU before X does, hence before Y executes and hence
know that W will propagate to every CPU and to RAM before Z executes.
And because of the hb links, we know that Z will execute before F.
propagate to every CPU and to RAM before F executes.
value read by P0), and a strong fence between P1's store and its load.
In this example, the sequences of cumul-fence and hb links are empty.
because it does not start and end on the same CPU.
to P0's. This means that if both r1 and r2 were 0 there would be a
RCU RELATIONS: rcu-link, rcu-gp, rcu-rscsi, rcu-order, rcu-fence, and rb
rests on two concepts: grace periods and read-side critical sections.
at the start and rcu_read_unlock() at the end. Critical sections can
For any critical section C and any grace period G, at least
(1) C ends before G does, and in addition, every store that
(2) G starts before C does, and in addition, every store that
before and end after a grace period.
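The canonical example can be sketched like this (a sketch in the spirit of the full document; x and y are assumed to start at 0):

```
P0()
{
	rcu_read_lock();
	WRITE_ONCE(x, 1);
	WRITE_ONCE(y, 1);
	rcu_read_unlock();
}

P1()
{
	int r1, r2;

	r1 = READ_ONCE(x);
	synchronize_rcu();
	r2 = READ_ONCE(y);
}
```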
never end with r1 = 1 and r2 = 0. The reasoning is as follows. r1 = 1
execute an smp_mb() fence after the end of the critical section and
And if a critical section ends after a grace period does then the
and some time before the critical section's opening rcu_read_lock()
"before": If E and F are RCU fence events (i.e., rcu_read_lock(),
event X, F is po-after some memory-access event Y, and we have any of
obscure, and we won't give it here. It is closely related to the pb
relation, and the details don't matter unless you want to comb through
The LKMM also defines the rcu-gp and rcu-rscsi relations. They bring
grace periods and read-side critical sections into the picture, in the
E ->rcu-gp F means that E and F are in fact the same event,
and that event is a synchronize_rcu() fence (i.e., a grace
E ->rcu-rscsi F means that E and F are the rcu_read_unlock()
and rcu_read_lock() fence events delimiting some read-side
rcu-gp and rcu-rscsi links separated by rcu-link links, in which the
rcu-gp links and one rcu-rscsi link. (It also implies that
X ->rcu-order T and Z ->rcu-order V.) On the other hand:
This formula means that G and W are the same event (a grace period),
and there are events X, Y and a read-side critical section C such that:
2. X comes "before" Y in some sense (including rfe, co and fr);
executing and hence before any instruction po-after F can execute.
there are fence events E and F such that:
"super-strong" fence: Unlike the original strong fences (smp_mb() and
before F, just as E ->pb F does (and for much the same reasons).
and F with E ->rcu-link F ->rcu-order E. Or to put it a third way,
the axiom requires that there are no cycles consisting of rcu-gp and
violated: A critical section starts before a grace period, and some
Putting symbols to these ideas, let L and U be the rcu_read_lock() and
question, and let S be the synchronize_rcu() fence event for the grace
are events Q and R where Q is po-after L (which marks the start of the
relation, and R is po-before the grace period S. Thus we have:
critical section and witness that W propagates to the critical
section's CPU by reading from W, and let Z on some arbitrary CPU be a
discussion of the rcu-link relation earlier) that S and U are related
Since S is a grace period we have S ->rcu-gp S, and since L and U are
the start and end of the critical section C we have U ->rcu-rscsi L.
P1's load at W reads from, so we have W ->fre Y. Since S ->po W and
so we have X ->rfe Z. Together with L ->po X and Z ->po S, this
yields L ->rcu-link S. And since L and U are the start and end of a
This requires P0 and P2 to execute their loads and stores out of
program order, but of course they are allowed to do so. And as you
section in P0 both starts before P1's grace period does and ends
before it does, and the critical section in P2 both starts after P1's
grace period does and ends after it does.
above, with new relations srcu-gp and srcu-rscsi added to represent
SRCU grace periods and read-side critical sections. There is a
restriction on the srcu-gp and srcu-rscsi links that can appear in an
the same as an int, and spin_lock(&s) is treated almost the same as:
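The elided code here is, presumably, a cmpxchg_acquire() loop along these lines:

```
while (cmpxchg_acquire(&s, 0, 1) != 0)
	cpu_relax();
```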
This waits until s is equal to 0 and then atomically sets it to 1,
and the read part of the cmpxchg operation acts as an acquire fence.
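Similarly, spin_trylock(&s) is presumably treated almost the same as:

```
return !cmpxchg_acquire(&s, 0, 1);
```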
which atomically sets s to 1 if it is currently equal to 0 and returns
store-release in a spin_unlock() and the load-acquire which forms the
spin_trylock() -- we can call these things lock-releases and
and acquires.
naturally hold if the release and acquire operations were on different
Here the second spin_lock() reads from the first spin_unlock(), and
cannot have r1 = 1 and r2 = 0 at the end (this is an instance of the
This requirement does not apply to ordinary release and acquire
and thus it could load y before x, obtaining r2 = 0 and r1 = 1.
Second, when a lock-acquire reads from a lock-release, and some other
stores W and W' occur po-before the lock-release and po-after the
before the store to y does, so we cannot have r2 = 1 and r3 = 0.
These two special requirements for lock-release and lock-acquire do
have come to expect and rely on them because they do hold on all
PLAIN ACCESSES AND DATA RACES
smp_load_acquire(&z), and so on are collectively referred to as
possible final values for r2 are 6 and 0, depending on whether or not
for the READ_ONCE()) and eliminate the temporary variable r1. The
And now it is obvious that this code runs the risk of dereferencing a
the compiler would not have performed this optimization and there
same CPU), and
1 and 2 above. We'll go a little farther and say that two accesses
a potential data race and makes no predictions about the program's
if they can be connected by a sequence of hb, pb, and rb links
If X is a load and X executes before a store Y, then indeed there is
no danger of X and Y being concurrent. After all, Y can't have any
some time after Y executes and thus after X executes. But if X is a
could very well interfere somehow with Y, and we would have to
consider X and Y to be concurrent.
Therefore when X is a store, for X and Y to be non-concurrent the LKMM
these links are present, X and Z are the same event),
and either:
Z is on the same CPU as Y and is connected to Y by a possibly
means Z and Y are the same event).
pattern (with fences and statement labels, but without the conditional
The smp_wmb() memory barrier gives a cumul-fence link from X to W, and
executes. Next, Z and Y are on the same CPU and the smp_rmb() fence
cumul-fence, pb, and so on -- including vis) apply only to marked
allowed to apply fancy transformations to marked accesses, and
split them up, duplicate them, eliminate them, invent new ones, and
race (if it could, memory models would be useless and no multithreaded
This program does not contain a data race. Although the U and V
The smp_wmb() fence in P0 is both a compiler barrier and a
X and Y are both marked accesses. Hence an rfe link from X to
executed, i.e., X ->vis Y. (And if there is no rfe link then
r1 will be 0, so V will not be executed and ipso facto won't
corresponding to the access V will be po-after the fence, and
after the fence does and hence after Y does.
general. Suppose R is a plain load and we want to show that R
marked access X such that R and X are ordered by a suitable fence and
marked access Y such that X ->xb* Y, and Y and E are ordered by a
"post-bounded" by X and E is "pre-bounded" by Y.
(i.e., smp_rmb()) and some affect only stores (smp_wmb()); otherwise
the two types of bounds are the same. And as a degenerate case, we
say that a marked access pre-bounds and post-bounds itself (e.g., if R
The need to distinguish between r- and w-bounding raises yet another
thereby adding a load (and possibly replacing the store entirely).
compiler has augmented a store with a load in this fashion, and the
(In this example the rcu_read_lock() and rcu_read_unlock() calls don't
is definitely w-post-bounded before the store to ptr, and the two
and a certain amount of care is required when programming constructs
like this one. In particular, comparisons between the pointer and
be on different CPUs, and fences don't link events on different CPUs.
P1 before the critical section started and so would have been visible
"y = 3" store, and consequently the first must propagate from P1 to P0
concurrent and there is no race, even though P1's plain store to y
race-candidate stores W and W', where W ->co W', the LKMM says the
sequence, and if W' is plain then they also have to be linked by a
sequence. For race-candidate load R and store W, the LKMM says the
	strong-fence ; xb* ; {w and/or r}-pre-bounded
sequence with no post-bounding, and in every case the LKMM also allows
happens-before, propagates-before, and rcu axioms (which state that
satisfy a load request and its determination of where a store will
If R and W are race candidates and it is possible to link R to
If W and R are race candidates and it is possible to link W to
If W and W' are race candidates and it is possible to link W
ODDS AND ENDS
optional, and it doesn't require the events linked by the relation to
because all the other parts (fences and rfe) are already included in
hb anyway, and where the formal model adds prop into hb, it includes
accesses and fences, such as those corresponding to smp_load_acquire()
and fences; rather, they are read events with an annotation marking
x, and then the increment is carried out by the memory hardware with
treated as READ_ONCE() and rcu_assign_pointer() is treated as
followed by smp_wmb() and then a marked store W', the LKMM creates a
smp_mb__before_atomic(), smp_mb__after_atomic(), and
po-later atomic updates and the events following them;
smp_mb__after_atomic() orders po-earlier atomic updates and
events and the events preceding them against all po-later
Interestingly, RCU and locking each introduce the possibility of