=============
dm-log-writes
=============

This target takes 2 devices, one to pass all IO to normally, and one to log all
of the write operations to. This is intended for file system developers wishing
to verify the integrity of metadata or data as the file system is written to.
There is a log_write_entry written for every WRITE request and the target is
able to take arbitrary data from userspace to insert into the log. The data
that is in the WRITE requests is copied into the log to make the replay happen
exactly as it happened originally.

Log Ordering
============

We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
next REQ_PREFLUSH request. This makes it easier for userspace to replay the log
in a way that correlates to what is on disk and not what is in cache, and so
makes it easier to detect improper waiting/flushing.

This works by attaching all WRITE requests to a list once the write completes.
Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
WRITEs that have completed at the time the REQ_PREFLUSH is issued are added, in
order to simulate the worst case scenario with regard to power failures.
Consider the following example (W means write, C means complete)::

    W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following::

    W3,W2,flush,W1....

Again, this is to simulate what is actually on disk. It allows us to detect
cases where a power failure at a particular point in time would create an
inconsistent file system.

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as
they complete, as those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise we would
have all the DISCARD requests, and then the WRITE requests and then the FLUSH
request. Consider the following example::

    WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this::

    DISCARD 1, WRITE 1, FLUSH

which isn't quite what happened and wouldn't be caught during the log replay.

Target interface
================

i) Constructor

    log-writes <dev_path> <log_dev_path>

    ============= ==============================================
    dev_path      Device that all of the IO will go to normally.
    log_dev_path  Device where the log entries are written to.
    ============= ==============================================

ii) Status

    <#logged entries> <highest allocated sector>

    =========================== ========================
    #logged entries             Number of logged entries
    highest allocated sector    Highest allocated sector
    =========================== ========================

iii) Messages

    mark <description>

        You can use a dmsetup message to set an arbitrary mark in a log.
        For example, say you want to fsck a file system after every
        write, but first you need to replay up to the mkfs to make sure
        we're fsck'ing something reasonable. You would do something like
        this::

          mkfs.btrfs -f /dev/mapper/log
          dmsetup message log 0 mark mkfs
          <run test>

        This would allow you to replay the log up to the mkfs mark and
        then replay from that point on doing the fsck check in the
        interval that you want.

        Every log has a mark at the end labeled "dm-log-writes-end".

Userspace component
===================

There is a userspace tool that will replay the log for you in various ways.
It can be found here: https://github.com/josefbacik/log-writes

Example usage
=============

Say you want to test fsync on your file system. You would do something like
this::

    TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
    dmsetup create log --table "$TABLE"
    mkfs.btrfs -f /dev/mapper/log
    dmsetup message log 0 mark mkfs

    mount /dev/mapper/log /mnt/btrfs-test
    <some test that does fsync at the end>
    dmsetup message log 0 mark fsync
    md5sum /mnt/btrfs-test/foo
    umount /mnt/btrfs-test

    dmsetup remove log
    replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
    mount /dev/sdb /mnt/btrfs-test
    md5sum /mnt/btrfs-test/foo
    <verify md5sums are correct>

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation. You could do this with::

    TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
    dmsetup create log --table "$TABLE"
    mkfs.btrfs -f /dev/mapper/log
    dmsetup message log 0 mark mkfs

    mount /dev/mapper/log /mnt/btrfs-test
    <fsstress to dirty the fs>
    btrfs filesystem balance /mnt/btrfs-test
    umount /mnt/btrfs-test
    dmsetup remove log

    replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
    btrfsck /dev/sdb
    replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
        --fsck "btrfsck /dev/sdb" --check fua

That will replay the log until it sees a FUA request and run the fsck command.
If the fsck passes, it will replay to the next FUA, and so on until the replay
is complete or the fsck command exits abnormally.
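
As a side note on the rules in the Log Ordering section, the following is a
minimal Python sketch of the ordering behaviour. It is only a userspace model
for reasoning about what the log should contain, not the kernel implementation.
The function name ``simulate_log``, the ``fua`` parameter and the event strings
("W<n>", "C<n>", "Wflush", "Cflush") are hypothetical and simply borrow the
shorthand of the example in that section. It also assumes at most one flush is
in flight at a time.

```python
def simulate_log(events, fua=frozenset()):
    """Model the log ordering rules for a sequence of events.

    Hypothetical event names: "W<n>" issues write n, "C<n>" completes it,
    "Wflush" issues a flush and "Cflush" completes it. Writes named in
    `fua` are treated as FUA writes. Returns (log, pending).
    """
    log = []      # entries in the order they would reach the log
    pending = []  # completed writes waiting for the next flush
    spliced = []  # completed writes attached to the in-flight flush

    for ev in events:
        if ev == "Wflush":
            # Only writes completed before the flush is issued go with it.
            spliced, pending = pending, []
        elif ev == "Cflush":
            # Log the spliced writes first, then the flush itself.
            log.extend(spliced)
            log.append("flush")
            spliced = []
        elif ev.startswith("C"):
            write = "W" + ev[1:]
            if write in fua:
                # FUA writes bypass the flush list and are logged at once.
                log.append(write)
            else:
                pending.append(write)
        # Issuing a write ("W<n>") logs nothing; only completion matters.
    return log, pending


log, pending = simulate_log(
    ["W1", "W2", "W3", "C3", "C2", "Wflush", "C1", "Cflush"]
)
print(",".join(log), "| still pending:", ",".join(pending))
```

For the event sequence from the Log Ordering example, the model logs W3, W2 and
then the flush, and leaves W1 pending until a later flush, which matches the
``W3,W2,flush,W1....`` log shown there. DISCARD requests can be fed to the
model as ordinary writes, since the target treats them the same way.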