admin-guide/device-mapper/log-writes.rst

=============
dm-log-writes
=============

This target takes two devices: one that all I/O is passed to normally, and one
that all write operations are logged to.  It is intended for file system
developers who want to verify the integrity of metadata or data as the file
system is written to.  A log_write_entry is written for every WRITE request,
and the target can also take arbitrary data from userspace to insert into the
log.  The data in each WRITE request is copied into the log so that a replay
happens exactly as the original sequence did.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache.  This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request.  This is to make it easier for userspace to replay
the log in a way that correlates to what is on disk and not what is in cache,
to make it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH.  Only
completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures.  Consider the
following example (W means write, C means complete):

	W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

	W3,W2,flush,W1....

Again, this is to simulate what is actually on disk.  It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.
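
The ordering rule above can be illustrated with a small shell sketch.  This is
a toy model only, not the kernel implementation: the function name
``simulate_log`` and the event spelling (``W<n>``, ``C<n>``, ``Wflush``,
``Cflush``) are assumptions made for this illustration, and it covers only
plain WRITE and flush requests.

```shell
#!/bin/sh
# Toy model of the log ordering described above (not the kernel code).
# Events: W<n> = write n submitted, C<n> = write n completed,
#         Wflush = flush submitted, Cflush = flush completed.
# The function body runs in a subshell so its variables stay private.
simulate_log() (
    completed=""   # completed writes not yet covered by a flush, e.g. "W3,W2,"
    attached=""    # writes spliced onto the in-flight flush
    log=""         # entries in the order they reach the log
    for ev in "$@"; do
        case "$ev" in
            Wflush)
                # Only writes that have completed so far go with this flush.
                attached=$completed
                completed=""
                ;;
            Cflush)
                # Log the attached writes first, then the flush itself.
                log="${log}${attached}flush,"
                attached=""
                ;;
            C*)
                # A completed write is queued in completion order.
                completed="${completed}W${ev#C},"
                ;;
            W*)
                # Submitting a write logs nothing by itself.
                ;;
        esac
    done
    # Drop the trailing comma and print the log.
    printf '%s\n' "${log%,}"
)

# The example sequence, followed by a second flush that picks up W1.
simulate_log W1 W2 W3 C3 C2 Wflush C1 Cflush Wflush Cflush
# prints: W3,W2,flush,W1,flush
```

In this model W1 completes after the first flush was issued, so it only
reaches the log once a later flush completes.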

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete as those requests will obviously bypass the device cache.

REQ_OP_DISCARD requests are treated like WRITE requests: they are held until
the next flush rather than logged as soon as they complete.  Otherwise the log
would contain all the DISCARD requests, then the WRITE requests, and then the
FLUSH request.  Consider the following example:

	WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

	DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.
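
To see why the order matters, the following sketch replays both sequences
against a single imaginary block and reports what the block holds afterwards.
It is a deliberately simplified model: the function name ``replay_block1`` and
the ``data``/``empty`` states are assumptions for illustration, not part of
the target or of the replay tool.

```shell
#!/bin/sh
# Toy replay of operations on one block (illustration only).
# Prints "data" if the block holds written data at the end, else "empty".
replay_block1() (
    state=empty
    for op in "$@"; do
        case "$op" in
            WRITE)   state=data ;;    # block now holds data
            DISCARD) state=empty ;;   # block contents are dropped
            FLUSH)   ;;               # no effect on contents in this model
        esac
    done
    printf '%s\n' "$state"
)

replay_block1 WRITE DISCARD FLUSH   # what actually happened; prints: empty
replay_block1 DISCARD WRITE FLUSH   # misordered log; prints: data
```

The two orders leave the block in different states, so a replay of the
misordered log would not reproduce what was really on disk.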

Target interface
================

i) Constructor

   log-writes <dev_path> <log_dev_path>

   ============= ==============================================
   dev_path      Device that all of the IO will go to normally.
   log_dev_path  Device where the log entries are written to.
   ============= ==============================================

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================
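
As a rough sketch, the two status fields can be pulled out of a status line in
a script.  This assumes that ``dmsetup status`` for a single device prints
``<start> <length> <target type>`` ahead of the target-specific fields; the
function name ``parse_log_writes_status`` and the sample numbers are made up
for illustration.

```shell
#!/bin/sh
# Parse one status line for a log-writes target (illustrative sketch).
# Assumed layout of the line:
#   <start> <length> log-writes <#logged entries> <highest allocated sector>
parse_log_writes_status() (
    set -f        # disable globbing before word-splitting the line
    set -- $1     # split the line on whitespace into $1..$5
    [ "$3" = "log-writes" ] || exit 1   # not a log-writes status line
    printf 'entries=%s highest_sector=%s\n' "$4" "$5"
)

# Hypothetical sample line; on a live system the argument might instead be
# "$(dmsetup status log)".
parse_log_writes_status "0 2097152 log-writes 1534 20480"
# prints: entries=1534 highest_sector=20480
```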

iii) Messages

    mark <description>

	You can use a dmsetup message to set an arbitrary mark in a log.
	For example, say you want to fsck a file system after every
	write, but first you need to replay up to the mkfs to make sure
	you are fsck'ing something reasonable.  You would do something
	like this::

	  mkfs.btrfs -f /dev/mapper/log
	  dmsetup message log 0 mark mkfs
	  <run test>

	This would allow you to replay the log up to the mkfs mark and
	then replay from that point on doing the fsck check in the
	interval that you want.

	Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system.  You would do something like
this::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <some test that does fsync at the end>
  dmsetup message log 0 mark fsync
  md5sum /mnt/btrfs-test/foo
  umount /mnt/btrfs-test

  dmsetup remove log
  replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
  mount /dev/sdb /mnt/btrfs-test
  md5sum /mnt/btrfs-test/foo
  <verify the md5sums match>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with::

  TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
  dmsetup create log --table "$TABLE"
  mkfs.btrfs -f /dev/mapper/log
  dmsetup message log 0 mark mkfs

  mount /dev/mapper/log /mnt/btrfs-test
  <fsstress to dirty the fs>
  btrfs filesystem balance /mnt/btrfs-test
  umount /mnt/btrfs-test
  dmsetup remove log

  replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
  btrfsck /dev/sdb
  replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
      --fsck "btrfsck /dev/sdb" --check fua

That will replay the log until it sees a FUA request and then run the fsck
command.  If the fsck passes, it will replay to the next FUA request, and so on
until the replay is complete or the fsck command exits abnormally.