Documentation/filesystems/xfs-delayed-logging-design.rst

28 That is, if we have a sequence of changes A through to F, and the object was
29 written to disk after change D, we would see in the log the following series
94 relogging technique XFS uses is that we can be relogging changed objects
95 multiple times before they are committed to disk in the log buffers. If we
101 contains all the changes from the previous changes. In other words, we have one
103 wasting space. When we are doing repeated operations on the same set of
106 log would greatly reduce the amount of metadata we write to the log, and this
113 formatting the changes in a transaction to the log buffer. Hence we cannot avoid
116 Delayed logging is the name we've given to keeping and tracking transactional
167 changes to the log buffers, we need to ensure that the object we are formatting
168 is not changing while we do this. This requires locking the object to prevent
185 using the log buffer as the destination of the formatting code, we can use an
188 If we then copy the vector into the memory buffer and rewrite the vector to
189 point to the memory buffer rather than the object itself, we now have a copy of
196 Hence we avoid the need to lock items when we need to flush outstanding
228 relogged we can replace the current memory buffer with a new memory buffer that
231 The reason for keeping the vector around after we've formatted the memory
233 If we don't keep the vector around, we do not know where the region boundaries
234 are in the item, so we'd need a new encapsulation method for regions in the log
236 change and as such is not desirable.  It also means we'd have to write the log
240 Hence we need to keep the vector, but by attaching the memory buffer to it and
241 rewriting the vector addresses to point at the memory buffer we end up with a
244 Hence we avoid needing a new on-disk format to handle items that have been
251 Now that we can record transactional changes in memory in a form that allows
252 them to be used without limitations, we need to be able to track and accumulate
269 such, we cannot reuse the AIL list pointers for tracking committed items, nor
270 can we store state in any field that is protected by the AIL lock. Hence the
288 When we have a log synchronisation event, commonly known as a "log force",
290 We need to write these items in the order that they exist in the CIL, and they
298 To fulfill this requirement, we need to write the entire CIL in a single log
307 failure and an inconsistent filesystem and hence we must enforce the maximum
314 bigger with a lot more items in it. The worst case effect of this is that we
318 items are stored as log vectors, we can use the existing log buffer writing
319 code to write the changes into the log. To do this efficiently, we need to
320 minimise the time we hold the CIL locked while writing the checkpoint
329 at the same time a checkpoint transaction is started. That is, when we remove
330 all the current items from the CIL during a checkpoint operation, we move all
331 those changes into the current checkpoint context. We then initialise a new
335 committed items and effectively allow new transactions to be issued while we
339 requires that we strictly order the commit records in the log so that
342 To ensure that we can be writing an item into a checkpoint transaction at
408 it. The fact that we walk the log items (in the CIL) just to chain the log
410 we take a cache line hit for the log item list modification, then another for
411 the log vector chaining. If we track by the log vectors, then we only need to
412 break the link between the log item and the log vector, which means we should
442 atomic counter - we can just take the current context sequence number and add
446 during the commit, we can assign the current checkpoint sequence. This allows
452 To ensure that we can do this, we need to track all the checkpoint contexts
453 that are currently committing to the log. When we flush a checkpoint, the
457 we can also wait on the log buffer that contains the commit record, thereby
467 are also committed to disk before the one we need to wait for. Therefore we
469 complete before waiting on the one we need to complete. We do this
470 synchronisation in the log force code so that we don't need to wait anywhere
471 else for such serialisation - it only matters when we do a log force.
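
The ordering rule described here can be sketched as a toy model. This is not the kernel implementation: the names ``CheckpointContext`` and ``contexts_to_wait_on`` are hypothetical, and the only assumption is that each committing context carries a sequence number and a flag saying whether its commit record has reached disk. A log force for a given sequence waits on every earlier uncommitted context first, oldest first, and then on the target itself.

```python
from dataclasses import dataclass


@dataclass
class CheckpointContext:
    # Hypothetical stand-in for a checkpoint context on the committing list.
    sequence: int            # checkpoint sequence number
    committed: bool = False  # True once the commit record is on disk


def contexts_to_wait_on(committing, target_sequence):
    """Return the sequences a log force must wait on, oldest first.

    Every context up to and including the target that has not yet
    committed is included; later contexts are ignored.
    """
    pending = [
        ctx for ctx in committing
        if ctx.sequence <= target_sequence and not ctx.committed
    ]
    return [ctx.sequence for ctx in sorted(pending, key=lambda c: c.sequence)]


# Four contexts on the committing list, in arbitrary order.
committing = [
    CheckpointContext(3),
    CheckpointContext(1, committed=True),
    CheckpointContext(2),
    CheckpointContext(4),
]
wait_order = contexts_to_wait_on(committing, target_sequence=3)
```

In this sketch a force on sequence 3 skips the already committed context 1, never looks at context 4, and waits on the remaining earlier context before the target.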
475 is, we need to flush the CIL and potentially wait for it to complete. This is a
487 transaction. We don't know how big a checkpoint transaction is going to be
489 number of split log vector regions are going to be used. We can track the
490 amount of log space required as we add items to the commit item list, but we
504 format structure. That is, two vectors totaling roughly 150 bytes. If we modify
505 10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
507 comparison, if we are logging full directory buffers, they are typically 4KB
508 each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a
514 Further, if we are going to use a static reservation, which bit of the entire
515 reservation does it cover? We account for space used by the transaction
525 reservation needs to be made before the checkpoint is started, and we need to
526 be able to reserve the space without sleeping.  For an 8MB checkpoint, we need a
529 A static reservation needs to manipulate the log grant counters - we can take a
530 permanent reservation on the space, but we still need to make sure we refresh
535 The problem with this is that it can lead to deadlocks as we may need to commit
537 rolling transactions for an example of this).  Hence we *must* always have
538 space available in the log if we are to use static reservations, and that is
552 Hence we can grow the checkpoint transaction reservation dynamically as items
558 log. Hence as part of the reservation growing, we need to also check the size
559 of the reservation against the maximum allowed transaction size. If we reach
560 the maximum threshold, we need to push the CIL to the log. This is effectively
566 If the transaction subsystem goes idle while we still have items in the CIL,
589 For delayed logging, however, we have an asymmetric transaction commit to
592 That is, we now have a many-to-one relationship between transaction commit and
594 log items becomes unbalanced if we retain the "pin on transaction commit, unpin
599 pinning and unpinning becomes symmetric around a checkpoint context. We have to
601 the CIL during a transaction commit, then we do not pin it again. Because there
602 can be multiple outstanding checkpoint contexts, we can still see elevated pin
608 CIL commit/flush lock. If we pin the object outside this lock, we cannot
611 current CIL or not. If we don't pin the CIL first before we check and pin the
612 object, we have a race with CIL being flushed between the check and the pin
613 (or not pinning, as the case may be). Hence we must hold the CIL flush/commit
614 lock to guarantee that we pin the items correctly.
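
A minimal sketch of this pin accounting, assuming a single lock that stands in for the CIL flush/commit lock. The class ``ToyCIL`` and its methods are hypothetical and only model the counting; they are not the kernel code. An item is pinned on the commit that first inserts it into the current CIL, relogging it does not pin it again, and it is unpinned once per checkpoint context that completes.

```python
import threading


class ToyCIL:
    """Toy model of pin accounting around checkpoint contexts."""

    def __init__(self):
        self.lock = threading.Lock()  # stands in for the CIL flush/commit lock
        self.items = []               # items in the current checkpoint context
        self.pin_count = {}           # per-item pin count

    def commit(self, item):
        # The membership check and the pin happen under the same lock, so a
        # flush cannot slip in between the check and the pin.
        with self.lock:
            if item not in self.items:
                self.items.append(item)
                self.pin_count[item] = self.pin_count.get(item, 0) + 1

    def flush(self):
        # Detach the current context and start a new, empty one.
        with self.lock:
            ctx, self.items = self.items, []
        return ctx

    def checkpoint_complete(self, ctx):
        # Unpin once per context. Reusing the same lock here is a
        # simplification made for this toy model only.
        with self.lock:
            for item in ctx:
                self.pin_count[item] -= 1


cil = ToyCIL()
for _ in range(3):
    cil.commit("inode")            # relogged three times, pinned once
pins_after_relogging = cil.pin_count["inode"]

first_ctx = cil.flush()
cil.commit("inode")                # first commit into the new context pins again
pins_with_two_contexts = cil.pin_count["inode"]

cil.checkpoint_complete(first_ctx)
cil.checkpoint_complete(cil.flush())
pins_at_end = cil.pin_count["inode"]
```

The elevated count in the middle of the run comes from two outstanding checkpoint contexts each holding one pin on the same item, and the count returns to zero only after both contexts complete.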
634 that we have a many-to-one interaction here. That is, the only restriction on
642 while we are holding out a CIL flush, so at the moment that means it is held
649 really needs to be a sleeping lock - if the CIL flush takes the lock, we do not
688 serialisation queues. They use the same lock as the CIL, too. If we see too
799 and the design of the internal structures to avoid on disk format changes, we