Documentation/filesystems/xfs-delayed-logging-design.rst

1 .. SPDX-License-Identifier: GPL-2.0
7 Introduction to Re-logging in XFS
12 logged are made up of the changes to in-core structures rather than on-disk
13 structures. Other objects - typically buffers - have their physical changes
24 "re-logging". Conceptually, this is quite simple - all it requires is that any
48 (increasing) LSN of each subsequent transaction - the LSN is effectively a
51 This relogging is also used to implement long-running, multiple-commit
65 the log - repeated operations to the same objects write the same changes to
74 doing aggregation of transactions in memory - batching them, if you like - to
79 buffers available and the size of each is 32kB - the size can be increased up
83 that can be made to the filesystem at any point in time - if all the log
85 the current batch completes. It is now common for a single current CPU core to
100 but only one of those copies needs to be there - the last one "D", as it
119 actually relatively easy to do - all the changes to logged items are already
155 	4. No on-disk format change (metadata or log format).
163 ---------------
180 The solution is relatively simple - it just took a long time to recognise it.
183 simply copies the memory these vectors point to into the log buffer during
203     Object    +---------------------------------------------+
204     Vector 1      +----+
205     Vector 2                    +----+
206     Vector 3                                   +----------+
210     Log Buffer    +-V1-+-V2-+----V3----+
214     Object    +---------------------------------------------+
215     Vector 1      +----+
216     Vector 2                    +----+
217     Vector 3                                   +----------+
221     Memory Buffer +-V1-+-V2-+----V3----+
222     Vector 1      +----+
223     Vector 2           +----+
224     Vector 3                +----------+
232 buffer is to support splitting vectors across log buffer boundaries correctly.
235 buffer writing (i.e. double encapsulation). This would be an on-disk format
242 self-describing object that can be passed to the log buffer write code to be
243 handled in exactly the same manner as the existing log vectors are handled.
244 Hence we avoid needing a new on-disk format to handle items that have been
249 ----------------
260 and as such are stored in the Active Item List (AIL) which is an LSN-ordered
278 its place in the list and re-inserted at the tail. This is entirely arbitrary
279 and done to make it easy for debugging - the last items in the list are the
286 ----------------------------
293 log replay - all the changes in all the objects in a given transaction must
311 to any other transaction - it contains a transaction header, a series of
313 perspective, the checkpoint transaction is also no different - just a lot
318 items are stored as log vectors, we can use the existing log buffer writing
322 way it separates the writing of the transaction contents (the log vectors) from
324 per-checkpoint context that travels through the log write process through to
345 to store the list of log vectors that need to be written into the transaction.
346 Hence log vectors need to be able to be chained together to allow them to be
355 	Log Item <-> log vector 1	-> memory buffer
356 	   |				-> vector array
358 	Log Item <-> log vector 2	-> memory buffer
359 	   |				-> vector array
364 	Log Item <-> log vector N-1	-> memory buffer
365 	   |				-> vector array
367 	Log Item <-> log vector N	-> memory buffer
368 					-> vector array
376 	log vector 1	-> memory buffer
377 	   |		-> vector array
378 	   |		-> Log Item
380 	log vector 2	-> memory buffer
381 	   |		-> vector array
382 	   |		-> Log Item
387 	log vector N-1	-> memory buffer
388 	   |		-> vector array
389 	   |		-> Log Item
391 	log vector N	-> memory buffer
392 			-> vector array
393 			-> Log Item
407 efficient way to track vectors, even though it seems like the natural way to do
409 vectors and break the link between the log item and the log vector means that
411 the log vector chaining. If we track by the log vectors, then we only need to
415 vectors in one checkpoint transaction. I'd guess this is a "measure and
420 --------------------------------------
427 re-using a freed metadata extent for a data extent), a special, optimised log
437 As discussed in the checkpoint section, delayed logging uses per-checkpoint
442 atomic counter - we can just take the current context sequence number and add
471 else for such serialisation - it only matters when we do a log force.
484 ------------------------------------------------
499 of log vectors in the transaction).
502 inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
504 format structure. That is, two vectors totaling roughly 150 bytes. If we modify
505 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
509 buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
527 reservation of around 150KB, which is a non-trivial amount of space.
529 A static reservation needs to manipulate the log grant counters - we can take a
548 available in their reservation for this as they have already reserved the
576 ---------------------------------
592 That is, we now have a many-to-one relationship between transaction commit and
600 pin the object the first time it is inserted into the CIL - if it is already in
617 ---------------------------------------
623 there was only one CPU using it, but it does not slow down either.
627 points in the design - the three important ones are:
634 that we have a many-to-one interaction here. That is, the only restriction on
638 128MB log, which means that it is generally one per CPU in a machine.
641 relatively long period of time - the pinning of log items needs to be done
649 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
650 want every other CPU in the machine spinning on the CIL lock. Given that
658 compared to transaction commit for asynchronous transaction workloads - only
659 time will tell if using a read-write semaphore for exclusion will limit
674 an ordering loop after writing all the log vectors into the log buffers but
696 -----------------
736 Essentially, steps 1-6 operate independently from step 7, which is also
737 independent of steps 8-9. An item can be locked in steps 1-6 or steps 8-9
738 at the same time step 7 is occurring, but only steps 1-6 or 8-9 can occur
740 and steps 1-6 are re-entered, then the item is relogged. Only when steps 8-9
767 		Chain log vectors and buffers together
770 		write log vectors into log
792 logging methods are in the middle of the life cycle - they still have the same
798 As a result of this zero-impact "insertion" of delayed logging infrastructure