
Introduction to Re-logging in XFS
Some objects, such as inodes and dquots, are logged in logical format, where the details logged are made up of the changes to in-core structures rather than on-disk structures. Other objects - typically buffers - have their physical changes logged. The reason for these differences is to reduce the amount of log space required for objects that are frequently logged.
XFS allows multiple separate modifications to a single object to be carried in the log at any given time. This allows the log to avoid needing to flush each change to disk before recording a new change to the object. XFS does this via a method called "re-logging". Conceptually, this is quite simple - all it requires is that any new change to the object is recorded with a *new copy* of all the existing changes in the new transaction that is written to the log.
That is, if we have a sequence of changes A through to F, and the object was written to disk after change D, we would see in the log the following series of transaction copies:

	Transaction     Contents        LSN
	   A               A             X
	   B              A+B           X+n
	   C             A+B+C         X+n+m
	   D            A+B+C+D       X+n+m+o
	    <object written to disk>
	   E               E             Z
	   F              E+F           Z+p
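The accumulation described above can be modelled in a few lines of C. This is an illustrative sketch, not kernel code - the structure and function names are invented for the example. Each relog writes a fresh copy of every change accumulated since the object last hit disk, which is why log traffic grows quadratically with the number of relogs:

```c
#include <assert.h>

/*
 * Toy model of XFS relogging (hypothetical names, not kernel code):
 * every new change to an object is written to the log together with a
 * fresh copy of all the changes already made, so the latest copy in
 * the log always supersedes the earlier ones.
 */
#define MAX_CHANGES 16

struct relog_item {
	char changes[MAX_CHANGES]; /* accumulated change tags, e.g. 'A'..'F' */
	int nchanges;              /* changes accumulated since last disk write */
	int log_bytes;             /* total units ever written to the log */
};

/* Record one change and relog the item: the whole accumulated set is
 * written again, which is why log traffic grows with every relog. */
static void relog(struct relog_item *it, char change)
{
	it->changes[it->nchanges++] = change;
	it->log_bytes += it->nchanges; /* new copy of all existing changes */
}

/* Writing the object to disk lets the accumulated changes be dropped;
 * later changes start a fresh copy chain (like 'E' after D hits disk). */
static void write_to_disk(struct relog_item *it)
{
	it->nchanges = 0;
}
```

Running the A-F sequence from the table through this model reproduces its shape: transactions A-D cost 1+2+3+4 units, and after the disk write E and F cost only 1+2.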
This relogging technique also allows objects to be moved forward in the log so that an object being relogged does not prevent the tail of the log from ever moving forward.
This relogging is also used to implement long-running, multiple-commit transactions.
Hence it can be seen that the relogging operation is fundamental to the correct working of the XFS journalling subsystem. From this description it should also be clear why the XFS metadata operations write so much to the log - repeated operations to the same objects write the same changes to the log over and over again. Worse is the fact that objects tend to get dirtier as they get relogged, so each subsequent transaction writes even more metadata into the log.
Another feature of the XFS transaction subsystem is that most transactions are asynchronous. That is, they don't commit to disk until either a log buffer is filled (a log buffer can hold multiple transactions) or a synchronous operation forces the log buffers holding the transactions to disk. This means that XFS is doing aggregation of transactions in memory - batching them, if you like - to minimise the impact of log IO on transaction throughput. By default there are 8 log buffers available, each 32kB in size; the size of a log buffer can be increased up to 256kB by use of a mount option.
Effectively, the log buffers give us the maximum bound of outstanding metadata changes that can be made to the filesystem at any point in time - if all the log buffers are full and under IO, then no more transactions can be committed until the current batch completes. It is now common for a single current CPU core to be able to issue enough transactions to keep the log buffers full and under IO permanently. Hence the XFS journalling subsystem can be considered to be IO bound.
The key thing to note about the asynchronous logging combined with the relogging technique XFS uses is that we can be relogging changed objects multiple times before they are committed to disk in the log buffers. If we return to the previous relogging example, it is entirely possible that transactions A through D are committed to disk in the same log buffer. In that case the log buffer contains four copies of the object, but only one of those copies needs to be there - the last one "D", as it contains all the changes from the previous transactions. The other copies are stale the moment they are written to the log buffers. It is clear that reducing the number of stale objects written to the log would greatly reduce the amount of metadata we write to the log, and this is the fundamental goal of delayed logging.
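The potential saving is easy to quantify with toy arithmetic (illustrative function names, and "units" are abstract change counts rather than real bytes): if an object accrues n changes between disk writes, buffer-based logging writes a fresh copy of the accumulated changes for every commit, while writing only the final aggregated copy costs n units.

```c
#include <assert.h>

/* Units written if every transaction's copy reaches the log buffers:
 * A, A+B, ..., A+..+N, i.e. 1 + 2 + ... + n. */
static int log_units_all_copies(int n)
{
	return n * (n + 1) / 2;
}

/* Units written if only the final aggregated copy is ever written,
 * which is what eliminating stale copies achieves. */
static int log_units_last_copy_only(int n)
{
	return n;
}
```

For the four-transaction example above, the ratio is already 10:4, and it grows linearly with the number of relogs between disk writes.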
With the existing infrastructure, we do the logical to physical formatting at transaction commit time to implement the relogging, because there is no infrastructure to keep track of logical changes in memory prior to physically formatting the changes in a transaction to the log buffer. Hence we cannot avoid writing stale objects to the log buffers.
Delayed logging is the name we've given to keeping and tracking transactional changes to objects in memory outside the log buffer infrastructure. Because of the relogging concept fundamental to the XFS journalling subsystem, this is actually relatively easy to do - all the changes to logged items are already tracked in the current infrastructure. The big problem is how to accumulate them and get them to the log in a consistent, recoverable manner.
One of the key changes that delayed logging makes to the operation of the journalling subsystem is that it decouples the amount of outstanding metadata changes from the size and number of log buffers available. In other words, instead of the changes not yet written to the log being bounded by the log buffers at any point in time, there may be a much greater amount of outstanding metadata changes held in memory. As a result, a crash can lose more metadata than with the existing mechanism, so applications must continue to use fsync where they need to ensure application level data integrity is maintained.
It should be noted that delayed logging is not some new, untried concept that warrants rigorous proofs to determine whether it is correct or not. The method of accumulating changes in memory for some period before writing them to the log is effectively used by many other filesystems and journalling subsystems. Hence no time is spent in this document trying to convince the reader that the concept is sound; it is treated as a solved problem.
1. Reduce the amount of metadata written to the log by at least
   an order of magnitude.
2. Supply sufficient statistics to validate Requirement #1.
3. Supply sufficient new tracing infrastructure to be able to debug
   problems with the new code.
The problem with accumulating changes at a logical level (i.e. just using the existing log item dirty region tracking) is that when it comes to writing the changes to the log buffers, we need to ensure that the object we are formatting is not changing while we do this. This requires locking the object to prevent concurrent modification. Hence flushing the logical changes to the log would require us to lock every object, format them, and then unlock them again.
This can deadlock: a transaction commit may hold object A locked while waiting for the delayed logging tracking lock to commit the transaction, while at the same time a flush holds the delayed logging tracking lock and is trying to get the lock on object A to flush it to the log buffer. This appears to be an unsolvable deadlock condition, and it was solving this problem that was the barrier to implementing delayed logging for so long.
The solution is relatively simple - it just took a long time to recognise it. Put simply, the current logging code formats the changes to each item into a vector array that points to the changed regions in the item. The log write code simply copies the memory these vectors point to into the log buffer during transaction commit while the item is locked in the transaction. Instead of using the log buffer as the destination of the formatting, we can use an allocated memory buffer big enough to fit the formatted vector.
If we then copy the vector into the memory buffer and rewrite the vector to point to the memory buffer rather than the object itself, we now have a copy of the changes in a format that is independent of the object and that does not require us to lock the item to access. This formatting and rewriting can all be done while the object is locked during transaction commit, resulting in a vector that is transactionally consistent and can be accessed without needing to lock the owning item.
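The copy-and-rewrite step can be sketched in C. This is a minimal illustration, not the kernel's implementation - the `log_iovec` structure and `format_into_buffer` function are invented for the example. The key point is that after formatting, the vector addresses point into the private buffer, so later log writes never touch the live object:

```c
#include <stdlib.h>
#include <string.h>

/* A vector entry initially points at a changed region inside the live
 * object; after formatting it points into the allocated buffer. */
struct log_iovec {
	void  *addr; /* region start (object memory, then buffer memory) */
	size_t len;
};

/* Copy all regions into a single allocated buffer and repoint the
 * vectors at it. Returns the buffer; the caller frees it. This runs
 * while the item is locked, so the result is a transactionally
 * consistent snapshot that needs no further locking to access. */
static char *format_into_buffer(struct log_iovec *vec, int nvec)
{
	size_t total = 0, off = 0;
	char *buf;
	int i;

	for (i = 0; i < nvec; i++)
		total += vec[i].len;

	buf = malloc(total);
	for (i = 0; i < nvec; i++) {
		memcpy(buf + off, vec[i].addr, vec[i].len);
		vec[i].addr = buf + off; /* vector is now self-describing */
		off += vec[i].len;
	}
	return buf;
}
```

After this call the object can be unlocked and modified again; the snapshot held by the vector is unaffected.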
Hence we avoid the need to lock items when we need to flush outstanding asynchronous transactions to the log. The difference from the existing method is simply the destination of the formatted changes - a private memory buffer rather than the log buffer directly.
The memory buffer and associated vector need to be passed as a single object, but still need to be associated with the parent object, so that if the object is relogged we can replace the current memory buffer with a new one containing the latest changes. The reason for keeping the vector around after we've formatted the memory buffer is to support splitting vectors across log buffer boundaries correctly.
Discarding the vector would require an on-disk log format change, and as such is not desirable. It also means we'd have to write the log region headers in the formatting stage, which is problematic as there is per region state that needs to be placed into the headers during the log write.
Hence we need to keep the vector, but by attaching the memory buffer to it and rewriting the vector addresses to point at the memory buffer we end up with a self-describing object that can be passed to the log buffer write code to be handled in exactly the same manner as the existing log vectors are handled. Hence we avoid needing a new on-disk format to handle items that have been relogged in memory.
Now that we can format changes into an independent object, for them to be used without limitations we need to be able to track and accumulate them so that they can be written to the log at some later point in time. The log item is the natural place to store this vector and buffer; it also makes sense to be the object that is used to track committed objects, as it will always exist once the object has been included in a transaction.
The log item is already used to track the log items that have been written to the log but not yet written to disk. Such log items are considered "active" and are stored in the Active Item List (AIL), an LSN-ordered double linked list. Items are inserted into this list during log buffer IO completion, after which they are unpinned and can be written to disk. An object that is in the AIL can be relogged, which causes the object to be pinned again and then moved forward in the AIL when the log buffer IO completes for that transaction.

Essentially, an item that is in the AIL can still be modified and relogged, so any tracking of committed items must be separate to the AIL infrastructure.
Similar to the AIL, tracking of committed items is done through a new list called the Committed Item List (CIL). The list tracks log items that have been committed and have formatted memory buffers attached to them. It tracks objects in the order of transaction commit; when an object is relogged it is moved to the tail of the list. This ordering is arbitrary and done to make it easy for debugging - the last items in the list are the most recently modified ones.
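The CIL ordering rule can be modelled with a small array-based sketch (the kernel uses its linked-list primitives; the structures here are invented for illustration). The invariant is that a relogged item is removed from its current position and re-inserted at the tail:

```c
#include <stddef.h>

/* Minimal model of CIL ordering: item ids in commit order, head first. */
#define CIL_MAX 8

struct cil {
	int items[CIL_MAX];
	int count;
};

/* Insert an item at the tail; if it is already present (a relog),
 * remove it from its current place first so the list stays ordered by
 * most recent commit. */
static void cil_insert(struct cil *cil, int id)
{
	int i, j;

	for (i = 0; i < cil->count; i++) {
		if (cil->items[i] == id) {
			for (j = i; j < cil->count - 1; j++)
				cil->items[j] = cil->items[j + 1];
			cil->count--;
			break;
		}
	}
	cil->items[cil->count++] = id; /* re-insert at the tail */
}
```

With items 1, 2, 3 committed in order and item 1 then relogged, the list becomes 2, 3, 1 - the most recently modified item is last, as the text describes.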
We need to write these items in the order that they exist in the CIL, and they need to be written as an atomic transaction. The need for all the objects to be written as an atomic transaction comes from the requirements of relogging and log replay - all the changes in a checkpoint must be complete in the log for recovery to be able to replay them correctly.

To fulfill this requirement, we need to write the entire CIL in a single log transaction. Fortunately, the XFS log code has no fixed limit on the size of a transaction; the only fundamental limit is that a transaction cannot be larger than half the size of the log. The reason for this limit is that to find the head and tail of the log, there must be at least one complete transaction in the log at any given time. Hence the log code limits the size of a checkpoint to be slightly less than half the log.
Apart from this size limit, a checkpoint transaction looks no different to any other transaction - it contains a transaction header, a series of formatted log items and a commit record at the tail. From a recovery perspective it is also no different, although we might need to tune the recovery transaction object hash size to handle the larger transactions.
Because the checkpoint is just another transaction and all the changes to log are stored in the log vectors, we can use the existing log buffer writing code to write the changes into the log. To do this efficiently, we need to minimise the time the CIL is locked while writing the checkpoint transaction. The current log write code enables us to do this easily with the way it separates the writing of the transaction contents (the log vectors) from the transaction commit record, but tracking this requires us to have a per-checkpoint context that travels through the log write process through to checkpoint completion.

Hence a checkpoint has a context that tracks the state of the current checkpoint from initiation to checkpoint completion. A new context is initiated at the same time a checkpoint transaction is started. That is, when we remove all the current items from the CIL during a checkpoint operation, we allocate a new context and attach that to the CIL for aggregation of new transactions.
This allows us to unlock the CIL immediately after transfer of all the committed items and effectively allow new transactions to be issued while we are formatting the checkpoint into the log. It also allows concurrent checkpoints to be written into the log buffers in the case of log force heavy workloads, just like the existing transaction commit code does.
To ensure that we can be writing an item into a checkpoint transaction at the same time as a new modification of that item is inserted into the new CIL context, the checkpoint transaction commit code cannot use log items to store the list of log vectors that need to be written into the transaction. Hence log vectors need to be able to be chained together to allow them to be detached from the log items. That is, when the CIL is flushed, the memory buffer and log vector attached to each log item needs to be attached to the checkpoint context so that the log item can be released.
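The detach-and-chain step can be sketched as follows (a hypothetical model, not the kernel's structures): each item's formatted vector is unhooked and linked onto a chain owned by the checkpoint, leaving the item free to be relogged into the next context:

```c
#include <stddef.h>

struct log_vec {
	struct log_vec *next; /* chain link used by the checkpoint */
};

struct log_item {
	struct log_vec *lv;   /* formatted vector attached at commit */
};

/* Move each item's vector onto the checkpoint's private chain and
 * clear the item's reference; returns the head of the chain. After
 * this, new transactions can attach fresh vectors to the items without
 * touching the checkpoint being written. */
static struct log_vec *chain_vectors(struct log_item **items, int n)
{
	struct log_vec *head = NULL, **tail = &head;
	int i;

	for (i = 0; i < n; i++) {
		struct log_vec *lv = items[i]->lv;

		*tail = lv;
		tail = &lv->next;
		items[i]->lv = NULL; /* item can be relogged independently */
	}
	*tail = NULL;
	return head;
}
```

The checkpoint flush then walks only this chain, so it never needs to lock the owning items.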
New transactions then attach their changes to the new CIL context from the start, while the checkpoint flush code works over the detached log vector chain to write the checkpoint into the log. Once the commit record is written, the checkpoint context is attached to the log buffer that the commit record was written to along with a completion callback that runs when the log buffer IO completes.
Discussion Point: I am uncertain as to whether the log item is the most efficient way to track vectors, even though it seems like the natural way to do it. The fact that we walk the log items (in the CIL) just to chain the log vectors and break the link between the log item and the log vector means that we take a cache line hit for the log item we are processing and then another for the log vector chaining. If we track by the log vectors, then we only need to walk the log vector chain itself.
This allows transactions to be issued asynchronously even though there may be future operations that cannot be completed until the transaction is committed to the log. In the rare case that a dependent operation occurs (e.g. re-using a freed metadata extent for a data extent), a special, optimised log force can be issued to force the dependent transaction to disk immediately.
To do this, transactions need to record the LSN of the commit record of the transaction they depend on. With delayed logging, however, there is no per-transaction commit record - there are only per-checkpoint contexts, and as such it is simple to assign a sequence number to each checkpoint instead. Because the switching of checkpoint contexts must be done atomically, it is simple to ensure that each new context has a monotonically increasing sequence number assigned to it without the need for an external atomic counter - we can just take the current context sequence number and add one to it for the new context.
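The sequence assignment is trivial once the context switch is serialised - a sketch with invented structure names:

```c
/* A checkpoint context carries the sequence number transactions are
 * stamped with; names here are illustrative, not the kernel's. */
struct cil_ctx {
	unsigned long sequence;
};

/* Called with the CIL locked while switching contexts at checkpoint
 * start: the new context's sequence is simply the old one plus one,
 * so no separate atomic counter is needed and monotonicity is
 * guaranteed by the lock that already serialises the switch. */
static struct cil_ctx switch_context(const struct cil_ctx *cur)
{
	struct cil_ctx next = { cur->sequence + 1 };
	return next;
}
```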
Then, instead of assigning a log buffer LSN to the transaction commit LSN at commit time, we can assign the current checkpoint sequence. This allows operations that track transactions that have not yet completed to know what checkpoint sequence needs to be committed before they can continue. As a result, the code that forces the log to a specific LSN now needs to ensure that the log forces to a specific checkpoint.
To ensure that we can do this, we need to track all the checkpoint contexts that are currently committing to the log. When we flush a checkpoint, the context gets added to a "committing" list which can be searched. When a checkpoint commit completes, it is removed from the committing list. Because the checkpoint context records the LSN of the commit record for the checkpoint, we can use that LSN with the existing log force mechanisms to execute synchronous forces.
It should be noted that the synchronous forces may need to be extended with mitigation algorithms similar to those in the current log buffer code to allow aggregation of multiple synchronous transactions rather than issuing a log force for each one.
The main concern with log forces is to ensure that all the previous checkpoints are also committed to disk before the one we need to wait for. Therefore we need to check that all the prior contexts in the committing list are also complete before waiting on the one we need to complete. We do this synchronisation in the log force code so that we don't need to wait anywhere else for such serialisation - it only matters when we do a log force.
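The waiting rule can be expressed as a single-threaded model (the real code blocks and is rewoken; the names here are invented): a force to sequence S may not complete while any context with a sequence at or before S is still on the committing list.

```c
#include <stdbool.h>

/* Model of the committing list: sequences of checkpoint contexts that
 * have been flushed but whose commit has not yet completed. */
#define MAX_COMMITTING 8

struct committing_list {
	unsigned long seqs[MAX_COMMITTING];
	int count;
};

/* Returns true if a force to target_seq must keep waiting: some
 * checkpoint at or before the target is still committing, so the
 * target cannot yet be considered stable on disk. */
static bool force_must_wait(const struct committing_list *cl,
			    unsigned long target_seq)
{
	int i;

	for (i = 0; i < cl->count; i++)
		if (cl->seqs[i] <= target_seq)
			return true;
	return false;
}
```

In the real subsystem this check is re-evaluated after each broadcast wakeup until it returns false.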
The only remaining complexity is that a log force now also has to handle the case where the forcing sequence number is the same as the current context. That is, we need to flush the CIL and potentially wait for it to complete. This is a simple addition to the existing log forcing code: check the sequence numbers and push if required. Indeed, placing the current sequence checkpoint flush in the log force code enables the current mechanism for issuing synchronous transactions to remain untouched (i.e. commit an asynchronous transaction, then force the log at the LSN of that transaction).
The biggest issue with a checkpoint is the log space reservation for the checkpoint transaction. We don't know how big a checkpoint transaction is going to be ahead of time, nor how many log buffers it will take to write out, nor the number of split log vector regions that are going to be used. We can track the amount of log space required as we add items to the commit item list, but we still need to reserve the space in the log for the checkpoint.
For example, if we modify 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each vector is 12 bytes, so the total to be logged is approximately 1.75MB. A static reservation sized for cases like this is not particularly flexible, and it is difficult to select the "optimal value" for all workloads.
Further, if we are going to use a static reservation, which part of the entire checkpoint does it need to cover? The data of the formatted items is already accounted for by the reservations of the transactions in which the items were logged and relogged. This allows a checkpoint reservation to only have to account for the log metadata - the region and transaction headers written around the item data. That overhead is small relative to the data (of the order of 512 bytes per 32k of data written), but the reservation still needs to be large enough to handle arbitrary sized checkpoint transactions. This reservation needs to be made before the checkpoint is started, and we need to be able to reserve the space without sleeping. For an 8MB checkpoint, we need a reservation of around 150KB.
A static reservation needs to manipulate the log grant counters - we can take a permanent reservation on the space, but we still need to make sure we refresh the write reservation (the actual space available to the transaction) after every checkpoint transaction completion.
The problem with this is that it can lead to deadlocks, as we may need to commit checkpoints to be able to free up log space (refer back to the description of rolling transactions for an example of this). Hence we *must* always have space available in the log if we are to use static reservations, and that is very difficult and complex to arrange. It is possible to do, but there is a simpler way.
The simpler way is to track the log space used by the items in the CIL and use this to dynamically calculate the amount of log metadata space the checkpoint will need. Because the metadata overhead is a predictable function of the data in the checkpoint, the amount of log metadata space we actually use will always be less than or equal to the maximal amount in the reservation. Hence we can grow the reservation dynamically as items are added to the CIL and avoid the need for reserving and regranting log space after every checkpoint.
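The accounting rule that makes dynamic reservation cheap can be sketched as follows (illustrative names): because a relogged item's new buffer replaces its old one in the CIL, only the difference in formatted size changes the running total.

```c
/* One item's contribution to the CIL's formatted-bytes total. */
struct acct_item {
	int size; /* bytes of the currently attached buffer (0 if none) */
};

/* Running total of formatted bytes held in the CIL, from which the
 * required log metadata reservation can be derived. */
struct cil_space {
	int used;
};

/* Account for inserting (or relogging) an item with a buffer of
 * new_size bytes: only the delta against the buffer being replaced is
 * added, so relogging does not double-count space. */
static void cil_account_insert(struct cil_space *cs, struct acct_item *it,
			       int new_size)
{
	cs->used += new_size - it->size;
	it->size = new_size;
}
```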
As mentioned earlier, transactions can't grow to more than half the size of the log. Hence as part of growing the reservation, we also need to check the size of the accumulated checkpoint against the maximum allowed transaction size. If we reach the maximum threshold, we need to push the CIL to the log. This is effectively a "background flush" and is done on demand. It is identical to a CIL push triggered by a log force, except that there is no waiting for the checkpoint commit to complete. This background push is checked for and executed by the transaction commit code.
If the CIL does not fill by itself, a periodic log force will push the CIL to disk, and if the transaction subsystem stays idle, allow the idle log to be covered (effectively marked clean) in exactly the same manner as the existing logging method. A discussion point is whether this log force needs to be done more frequently than the current rate, which is once every 30 seconds.
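The background push trigger itself is a simple threshold check at transaction commit time. This sketch uses an illustrative fraction of the log size as the limit; the exact fraction is a tuning choice, not something this document fixes:

```c
#include <stdbool.h>

/* Checked by transaction commit after updating the CIL space
 * accounting: once the accumulated checkpoint grows past a threshold
 * safely below the half-the-log transaction size limit, push the CIL
 * in the background without waiting for the commit to complete.
 * The 1/8 fraction here is illustrative. */
static bool cil_over_background_limit(int cil_bytes, int log_size)
{
	return cil_bytes >= log_size / 8;
}
```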
The current logging method is symmetric in that items get pinned once for every transaction that is committed to the log and unpinned once when that transaction's log buffer IO completes.
For delayed logging, however, we have an asymmetric transaction commit to checkpoint relationship: an item can be committed (and hence pinned) many times within a single checkpoint, but it is only written once by that checkpoint. That is, we now have a many-to-one relationship between transaction commit and checkpoint completion, so a pin per transaction commit no longer balances against an unpin per log IO completion.

To keep pin/unpin symmetry, the algorithm needs to change to a "pin on insertion into the CIL, unpin on checkpoint completion" model. In other words, the pinning and unpinning becomes symmetric around a checkpoint context. We have to pin the object the first time it is inserted into the current CIL context, and it stays pinned at that value according to its context until the checkpoint completes.

Just to make matters slightly more complex, this checkpoint level context for the pin count means that whether an item needs pinning depends on whether it is already present in the current CIL context, and hence we need to hold the CIL insertion lock to guarantee that we pin the items correctly.
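The checkpoint-symmetric pinning model reduces to a small amount of state per item - a hypothetical sketch (in the kernel the "first insertion" test must be made under the CIL insertion lock, as noted above):

```c
#include <stdbool.h>

/* Per-item pin state for the "pin on first insertion into the CIL,
 * unpin on checkpoint completion" model. Field names illustrative. */
struct pin_item {
	bool in_cil;   /* already in the current checkpoint context? */
	int pin_count;
};

/* Transaction commit path: pin only the first time the item enters
 * the current CIL context, no matter how many times it is relogged
 * into the same checkpoint. */
static void commit_to_cil(struct pin_item *it)
{
	if (!it->in_cil) {
		it->pin_count++; /* done under the CIL insertion lock */
		it->in_cil = true;
	}
}

/* Checkpoint completion: exactly one unpin per checkpoint the item
 * participated in, restoring pin/unpin symmetry. */
static void checkpoint_complete(struct pin_item *it)
{
	it->in_cil = false;
	it->pin_count--;
}
```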
Delayed logging transaction commits must scale to many concurrent commits - the current transaction commit code does not break down even at thousands of concurrent commits, and the delayed logging path must not regress this. As a result, the delayed logging transaction commit code needs to be designed for concurrency from the ground up. The serialisation points in a delayed logging commit are:

	1. Locking out a CIL flush while items are being inserted
	2. Adding items to the CIL and updating item space accounting

It is important to note that we have a many-to-one interaction here: the only restriction on the number of concurrent transactions that can be trying to commit at once is the number of CPUs in the machine, while only a single CIL flush runs at a time.
The amount of time a transaction commit needs to hold out a flush is a relatively long period of time - the pinning of log items needs to be done while we hold out a flush, and at the moment that means the exclusion is held across the formatting of the objects into memory buffers as well. Ultimately, an algorithm where the formatting is done separately to the pinning of objects could be used to reduce the hold time of the flush exclusion.

Because of the number of potential concurrent commit holders, the flush exclusion really needs to be a sleeping lock - if the CIL flush takes the lock, we do not want every other CPU spinning on it for the duration of the flush. The CIL flush is rare compared to transaction commit for asynchronous transaction workloads - only time will tell whether such an exclusion mechanism degrades transaction commit concurrency due to cache line bouncing of the lock on the read side.
To allow transactions to be inserted into the CIL concurrently, the CIL needs to be protected separately from the above commit/flush exclusion. It also needs to be an exclusive lock, but it is only held for a very short time per commit, so a spin lock is appropriate here.

The final serialisation points are checkpoint commit record ordering and completion waiting. A log force walks the list of committing checkpoints and needs to block waiting for checkpoints to complete their commit, and checkpoint commit record sequencing also requires the same lock, list walk, and blocking mechanism to ensure completion order. The difference is what they wait on: commit record sequencing needs to wait until checkpoint contexts contain a commit LSN (obtained through completion of the commit record write), while log force sequencing needs to wait until previous checkpoint contexts are removed from the committing list (i.e. they have completed). A simple wait mechanism using broadcast wakeups (thundering herds) has been used to implement these two serialisation queues. They use the same lock as the CIL, and the two different cases have been given separate wait lists to reduce lock contention and the number of processes woken by the wrong event.
	1. Transaction allocate
	2. Transaction reserve
	3. Lock item
	4. Join item to transaction
		If not already attached,
			Allocate log item
			Attach log item to owner item
		Attach log item to transaction
	5. Modify item
		Record modifications in log item
	6. Transaction commit
		Pin item in memory
		Format item into log buffer
		Write commit LSN into transaction
		Unlock item
		Attach transaction to log buffer

	<log buffer IO dispatched>
	<log buffer IO completes>

	7. Transaction completion
		Mark log item committed
		Insert log item into AIL
			Write commit LSN into log item
		Unpin log item
	8. AIL traversal
		Lock item
		Mark log item clean
		Flush item to disk

	<item IO completion>

	9. Log item removed from AIL
		Moves log tail
		Item unlocked
	1. Transaction allocate
	2. Transaction reserve
	3. Lock item
	4. Join item to transaction
		If not already attached,
			Allocate log item
			Attach log item to owner item
		Attach log item to transaction
	5. Modify item
		Record modifications in log item
	6. Transaction commit
		Pin item in memory if not pinned in CIL
		Format item into log vector + buffer
		Attach log vector and buffer to log item
		Insert log item into CIL
		Write CIL context sequence into transaction
		Unlock item

	<next checkpoint>

	7. CIL push
		Lock out further CIL insertions
		Chain log vectors and buffers together
		Remove items from CIL
		Unlock CIL insertions
		Write log vectors into log
		Sequence commit records in log
		Attach checkpoint context to log buffer

	<log buffer IO dispatched>
	<log buffer IO completes>

	8. Checkpoint completion
		Mark log item committed
		Insert item into AIL
			Write commit LSN into log item
		Unpin log item
	9. AIL traversal
		Lock item
		Mark log item clean
		Flush item to disk

	<item IO completion>

	10. Log item removed from AIL
		Moves log tail
		Item unlocked
Essentially, the two logging methods have the same life cycle at the beginning and the end; the only differences are in the middle - the committing of the log items to the log itself and the completion processing. Because of this, and because of the design of the internal structures to avoid on disk format changes, we can basically switch between delayed logging and the existing mechanism with a mount option. Fundamentally, there is no reason why the log manager would not be able to swap methods automatically and transparently depending on load characteristics, but this should not be necessary if delayed logging works as designed.