=============
dm-log-writes
=============

This target takes 2 devices, one to pass all IO to normally, and one to log all
of the write operations to.  This is intended for file system developers wishing
to verify the integrity of metadata or data as the file system is written to.
There is a log_write_entry written for every WRITE request and the target is
able to take arbitrary data from userspace to insert into the log.  The data
that is in the WRITE requests is copied into the log to make the replay happen
exactly as it happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This is to make it easier for userspace to replay
the log in a way that correlates to what is on disk and not what is in cache,
to make it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH.  Only
completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures.  Consider the
following example (W means write, C means complete):

	W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

	W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
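The ordering rule above can be modelled with a small shell sketch.  This is a
toy simulation and not the kernel implementation: the function name
simulate_log and the event names (Wn, Cn, Wflush, Cflush) are made up for
illustration and simply mirror the notation of the example.  It models only
plain writes and flushes.

```shell
#!/bin/sh
# Toy model of the log ordering rule; not the kernel implementation.
# Events: Wn = write n issued, Cn = write n completed,
#         Wflush = flush issued, Cflush = flush completed.
simulate_log() {
    pending=""   # writes that completed but are not yet covered by a flush
    spliced=""   # writes attached to the flush that is currently in flight
    log=""       # what ends up in the log, in order
    for ev in "$@"; do
        case "$ev" in
            Wflush)
                # Only writes already completed are spliced onto the flush.
                spliced="$pending"
                pending=""
                ;;
            Cflush)
                # Log the spliced writes first, then the flush itself.
                log="$log$spliced flush"
                spliced=""
                ;;
            C*)
                # A completed write waits on the list for the next flush.
                pending="$pending W${ev#C}"
                ;;
            W*)
                # Issuing a write logs nothing by itself.
                ;;
        esac
    done
    echo "${log# }"
}

simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush
# prints: W3 W2 flush
```

W1 completed after the flush was issued, so in this model it stays on the
pending list and is only logged together with the next flush.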

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete as those requests will obviously bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests.  Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request.  Consider the following example:

	WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

	DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

iii) Messages

    mark <description>

	You can use a dmsetup message to set an arbitrary mark in a log.
	For example, say you want to fsck a file system after every
	write, but first you need to replay up to the mkfs to make sure
	you are fsck'ing something reasonable.  You would do something
	like this::

	  mkfs.btrfs -f /dev/mapper/log
	  dmsetup message log 0 mark mkfs
	  <run test>

	This would allow you to replay the log up to the mkfs mark and
	then replay from that point on, doing the fsck check at the
	interval that you want.

	Every log has a mark at the end labeled "dm-log-writes-end".
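As a rough illustration of the constructor and status formats above, the
following shell helpers build a table line and pick the two status fields out
of a status line.  The helper names are hypothetical, the sample numbers are
invented, and the sketch assumes the usual ``dmsetup status`` layout of
``<start> <length> <target type>`` followed by the target's own fields.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative, not part of dmsetup.

# Print a table line of the form:
#   <start> <length> log-writes <dev_path> <log_dev_path>
# $1 = device size in 512-byte sectors, $2 = dev_path, $3 = log_dev_path
log_writes_table() {
    printf '0 %s log-writes %s %s\n' "$1" "$2" "$3"
}

# Extract the log-writes fields from one status line, assuming the form:
#   <start> <length> log-writes <#logged entries> <highest allocated sector>
parse_log_writes_status() {
    set -- $1    # intentional word splitting on whitespace
    [ "$#" -eq 5 ] && [ "$3" = "log-writes" ] || return 1
    echo "entries=$4 highest_sector=$5"
}

# On a real system (needs root and real devices) this could be used as:
#   dmsetup create log --table \
#       "$(log_writes_table "$(blockdev --getsz /dev/sdb)" /dev/sdb /dev/sdc)"
#   parse_log_writes_status "$(dmsetup status log)"

# With invented sample values:
log_writes_table 2097152 /dev/sdb /dev/sdc
parse_log_writes_status "0 2097152 log-writes 17 4095"
```

The parser returns non-zero for a line that does not have exactly five fields
or whose target type is not log-writes, so it can be used in a test before
trusting the numbers.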

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify md5sum's are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
	--fsck "btrfsck /dev/sdb" --check fua

That will replay the log until it sees a FUA request and run the fsck command.
If the fsck passes, it will replay to the next FUA, and so on until the replay
is complete or the fsck command exits abnormally.
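
The replay-and-check loop described above can be pictured with the following
shell sketch.  It is a toy model only and not how replay-log is implemented:
the function name, the entry labels, and the rule that any label containing
"fua" stands for a FUA request are assumptions made for illustration.

```shell
#!/bin/sh
# Toy model of "replay until a FUA entry, run a check, continue or stop".
# $1 is the check command (standing in for the fsck); the remaining
# arguments are made-up log entry labels.
replay_with_check() {
    check_cmd=$1
    shift
    replayed=0
    for entry in "$@"; do
        replayed=$((replayed + 1))      # "replay" this entry
        case "$entry" in
            *fua*)
                # Run the check after every FUA entry; stop on failure.
                if ! $check_cmd; then
                    echo "check failed after $replayed entries"
                    return 1
                fi
                ;;
        esac
    done
    echo "replayed $replayed entries"
}

replay_with_check true w1 w2-fua w3 w4-fua
# prints: replayed 4 entries
replay_with_check false w1 w2-fua w3 w4-fua || echo "stopped early"
# prints: check failed after 2 entries
#         stopped early
```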