From c045654a862bc1119fa4e7584fff9d2a965192ea Mon Sep 17 00:00:00 2001 From: Luke Shumaker Date: Fri, 14 Jul 2023 14:38:38 -0700 Subject: Add the btrfs-rec email This isn't quite verbatim checking in the email as I did in btrfs-progs-ng.git, I fussed with it a bit to get my blog engine to do sane things with it. --- public/btrfs-rec.md | 1234 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 1234 insertions(+) create mode 100644 public/btrfs-rec.md diff --git a/public/btrfs-rec.md b/public/btrfs-rec.md new file mode 100644 index 0000000..8519969 --- /dev/null +++ b/public/btrfs-rec.md @@ -0,0 +1,1234 @@ +Announcing: btrfs-rec: Recover (data from) a broken btrfs filesystem +==================================================================== +--- +date: "2023-07-10" +--- + +> I originally sent this email on 2023-07-10, but it has been caught +> by their bogofilter. Yes, it was +> [plaintext](https://git.lukeshu.com/btrfs-progs-ng/tree/README.md?id=18e6066c241cf3d252b6521150843ffc858d8434). +> No, I didn't use GMail. Yes, I've successfully participated in vger +> lists in the past. Yes, I've reached out to postmaster; no, I +> haven't received a reply yet (as of 2023-07-14). + +
+To: linux-btrfs@vger.kernel.org
+From: Luke T. Shumaker <lukeshu@lukeshu.com>
+Subject: btrfs-rec: Recover (data from) a broken btrfs filesystem
+Date: Mon, 10 Jul 2023 21:23:41 -0600
+Message-ID: <87jzv7uo5e.wl-lukeshu@lukeshu.com>
+
+ +Inspired by a mis-typed `dd` command, for the last year I've been +working on a tool for recovering corrupt btrfs filesystems; at first +idly here and there, but more actively in the last few months. I hope +to get it incorporated into btrfs-progs, though perhaps that is +problematic for a few reasons I'll get to. If the code can't be +incorporated into btrfs-progs, at least the ideas and algorithms +should be. + +[https://git.lukeshu.com/btrfs-progs-ng/](https://git.lukeshu.com/btrfs-progs-ng/) + +Highlights: + + - In general, it's more tolerant of corrupt filesystems than + `btrfs check --repair`, `btrfs rescue` or `btrfs restore`. + + - `btrfs-rec inspect rebuild-mappings` is a better + `btrfs rescue chunk-recover`. + + - `btrfs-rec inspect rebuild-trees` can re-attach lost branches to + broken B+ trees. + + - `btrfs-rec inspect mount` is a read-only FUSE implementation of + btrfs. This is conceptually a replacement for `btrfs restore`. + + - It's entirely written in Go. I'm not saying that's a good thing, + but it's an interesting thing. + +Hopefully some folks will find it useful, or at least neat! + + - [1. Motivation](#motivation) + - [2. Overview of use](#overview-of-use) + - [3. Prior art](#prior-art) + - [4. Internals/Design](#internalsdesign) + - [4.1. Overview of the source tree layout](#overview-of-the-source-tree-layout) + - [4.2. Base decisions: CLI structure, Go, JSON](#base-decisions-cli-structure-go-json) + - [4.3. Algorithms](#algorithms) + - [4.3.1. The `rebuild-mappings` algorithm](#the-rebuild-mappings-algorithm) + - [4.3.2. The `--rebuild` algorithm](#the---rebuild-algorithm) + - [4.3.2.1. rebuilt forrest behavior](#rebuilt-forrest-behavior-looking-up-trees) + - [4.3.2.2. rebuilt individual tree behavior](#rebuilt-individual-tree-behavior) + - [4.3.3. The `rebuild-trees` algorithm](#the-rebuild-trees-algorithm) + - [4.3.3.1. initialization](#initialization) + - [4.3.3.2. the main loop](#the-main-loop) + - [4.3.3.3. graph callbacks](#graph-callbacks) + - [5. Future work](#future-work) + - [6. Problems for merging this code into btrfs-progs](#problems-with-merging-this-code-into-btrfs) + +# 1. Motivation + +Have you ever ended up with a corrupt btrfs filesystem (through no +fault of btrfs itself, but perhaps a failing drive, or a mistaken `dd` +invocation)? Surely losing less than 100MB of data from a drive +should not render hundreds of GB of perfectly intact data unreadable! +And yet, the existing tools are unable to even attempt to read that +data: + + $ btrfs check --repair --force dump-zero.1.img + enabling repair mode + Opening filesystem to check... + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733 + ERROR: cannot read chunk root + ERROR: cannot open file system + +or + + $ btrfs check --init-extent-tree --force dump-zero.1.img + Opening filesystem to check... + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733 + ERROR: cannot read chunk root + ERROR: cannot open file system + +or + + $ btrfs check --init-csum-tree --force dump-zero.1.img + Creating a new CRC tree + Opening filesystem to check... + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733 + ERROR: cannot read chunk root + ERROR: cannot open file system + +or + + $ btrfs rescue chunk-recover dump-zero.1.img + Scanning: DONE in dev0 + corrupt node: root=1 block=160410271744 slot=0, corrupt node: root=1 block=160410271744, nritems too large, have 39 expect range [1,0] + Couldn't read tree root + open with broken chunk error + +or + + $ btrfs rescue zero-log dump-zero.1.img + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + ERROR: cannot read chunk root + ERROR: could not open ctree + +or + + $ mkdir out + $ btrfs restore dump-zero.1.img out + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733 + ERROR: cannot read chunk root + Could not open root, trying backup super + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733 + ERROR: cannot read chunk root + Could not open root, trying backup super + ERROR: superblock bytenr 274877906944 is larger than device size 256060514304 + Could not open root, trying backup super + +or + + $ btrfs restore --list-roots dump-zero.1.img + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733 + ERROR: cannot read chunk root + Could not open root, trying backup super + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0 + bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733 + ERROR: cannot read chunk root + Could not open root, trying backup super + ERROR: superblock bytenr 274877906944 is larger than device size 256060514304 + Could not open root, trying backup super + +or + + $ btrfs-find-root dump-zero.1.img + WARNING: cannot read chunk root, continue anyway + Superblock thinks the generation is 6596071 + Superblock thinks the level is 1 + +Well, have I got a tool for you! + +(FWIW, I also tried manipulating the filesystem and patching to tools +to try to get past those errors, only to get a different set of +errors. Some of these patches I am separately submitting to +btrfs-progs.) + +# 2. Overview of use + +There are two `btrfs-rec` sub-command groups: +`btrfs-rec inspect SUBCMD` and `btrfs-rec repair SUBCMD`, and you can +find out about various sub-commands with `btrfs-rec help`. These are +both told about devices or images with the `--pv` flag. + +`btrfs-rec inspect SUBCMD` commands open the filesystem read-only, and +(generally speaking) write extracted or rebuilt information to stdout. +`btrfs-rec repair SUBCMD` commands open the filesystem read+write, and +consume information from `btrfs-rec inspect SUBCMD` commands to +actually repair the filesystem (except I haven't actually implemented +any `repair` commands yet... despite the lack of `repair` commands, I +believe that `btrfs-rec` is already a useful because of the `btrfs-rec +inspect mount` command to get data out of the broken filesystem). +This split allows you to try things without being scared by WARNINGs +about not using these tools unless you're an expert or have been told +to by a developer. + +In the broken `dump-zero.1.img` example above (which has a perfectly +intact superblock, but a totally broken `CHUNK_TREE`), to "repair" it +I'd: + + 1. Start by using `btrfs-rec inspect rebuild-mappings` to rebuild the + broken chunk/dev/blockgroup trees: + + $ btrfs-rec inspect rebuild-mappings \ + --pv=dump-zero.1.img \ + > mappings-1.json + + 2. If it only mostly succeeds, but on stderr tells us about a few + regions of the image that it wasn't able to figure out the chunks + for. Using some human-level knowledge, you can write those + yourself, inserting them into the generated `mappings.json`, and + ask `rebuild-mappings` to normalize what you wrote: + + $ btrfs-rec inspect rebuild-mappings \ + --pv=dump-zero.1.img \ + --mappings=<(sed mappings-2.json + + 3. Now that it has functioning chunk/dev/blockgroup trees, we can use + `btrfs-rec inspect rebuild-trees` to rebuild other trees that rely + on those: + + $ btrfs-rec inspect rebuild-mappings \ + --pv=dump-zero.1.img \ + --mappings=mappings-2.json \ + > trees.json + + 4. Now that (hopefully) everything that was damaged has been + reconstructed, we can use `btrfs-rec inspect mount` to mount the + filesystem read-only and copy out our data: + + $ mkdir mnt + $ sudo btrfs-rec inspect mount \ + --pv=dump-zero.1.img \ + --mappings=mappings-2.json \ + --trees=trees.json \ + ./mnt + +This example is fleshed out more (and the manual edits to +`mappings.json` explained more) in `./examples/main.sh`. + +# 3. Prior art + +Comparing `btrfs-rec inspect mount` with the existing +https://github.com/adam900710/btrfs-fuse project: + + - Again, mine has better fault tolerance + - Mine is read-only + - Mine supports xattrs ("TODO" in Adam's) + - Mine supports separate inode address spaces for subvolumes; Adam's + doesn't due to limitations in FUSE, mine works around this by + lazily setting up separate mountpoints for each subvolume (though + this does mean that the process needs to run as root, which is a + bummer). + +# 4. Internals/Design + +## 4.1. Overview of the source tree layout + + - `examples/` has example scripts showing how to use `btrfs-rec`. + + - `lib/btrfs/` is the core btrfs implementation. + + - `lib/btrfscheck/` and `lib/btrfsutil/` are libraries for + "btrfs-progs" type programs, that are userland-y things that I + thought should be separate from the core implementation; something + that frustrated me about libbtrfs was having to figure out "is this + thing here in support of btrfs bits-on-disk, or in support of a + higher-level 'how btrfs-progs wants to think about things'?" + + - `cmd/btrfs-rec/` is where the command implementations live. If a + sub-command fits in a single file, it's + `cmd/btrfs-rec/inspect_SUBCMD.go`, otherwise, it's in a separate + `cmd/btrfs-rec/inspect/SUBCMD/` package. + + - `lib/textui/` is reasonably central to how the commands implement a + text/CLI user-interface. + + - `lib/binstruct/`, `lib/diskio/`, and `lib/streamio/` are + non-btrfs-specific libraries related to the problem domain. + + - `lib/containers/`, `lib/fmtutil/`, `lib/maps/`, `lib/slices/`, and + `lib/profile/` are all generic Go libraries that have nothing to do + with btrfs or the problem domain, but weren't in the Go standard + library and I didn't find/know-of exiting implementations that I + liked. Of these, all but `containers` are pretty simple utility + libraries. Also, some of these things have been added to the + standard library since I started the project. + +## 4.2. Base decisions: CLI structure, Go, JSON + +I started with trying to enhance btrfs-progs, but ended up writing a +wholy new program in Go, for several reasons: + + - writing a new thing: I was having to learn both the btrfs-progs + codebase and how btrfs-bits-on-disk work, and it got to the point + that I decided I should just focus on learning btrfs-bits-on-disk. + + - writing a new thing: It was becoming increasingly apparent to me + that it was going to be an uphill-fight of having recovery-tools + share the same code as the main-tools, as the routines used by the + main-tools rightly have validity checks, where recovery-tools want + to say "yes, I know it's invalid, can you give it to me anyway?". + + - writing it in not-C: I love me some C, but higher level languages + are good for productivity. And I was trying to write a whole lot + of code at once, I needed a productivity boost. + + - writing it in not-C: This forced me to learn btrfs-bits-on-disk + better, instead of just cribbing from btrfs-progs. That knowledge + is particularly important for having ideas on how to deal with + corrupt bits-on-disk. + + - writing it in Go: At the time I started, my day job was writing Go, + so I had Go swapped into my brain. And Go still feels close to C + but provides *a lot* of niceness and safety over C. + +It turned out that Go was perhaps not the best choice, but we'll come +back to that. + +I wanted to separate things into a pipeline. For instance: Instead of +`btrfs rescue chunk-recover` trying to do everything to rebuild a +broken chunk tree, I wanted to separate I/O from computation from +repairs. So I have `btrfs-rec inspect rebuild-mappings scan` that +reads all the info necessary to rebuild the chunk tree, then dump that +as a 2GB glob of JSON. Then I can feed that JSON to `btrfs-rec +inspect rebuild-mappings process` which actually rebuilds the mappings +in the chunk tree, and dumps them as JSON. And then other commands +can consume that `mappings.json` to use that instead of trying to read +the chunk tree from the actual FS, so that you don't have to make +potentially destructive writes to inspect an FS with a broken chunk +tree, and can inspect it more forensically. Or then use +`btrfs-rec repair SOME_SUBCMD_I_HAVENT_WRITTEN_YET` to write that +chunk tree in `mappings.json` back to the filesystem. + +(But also, the separate steps thing was useful just so I could iterate +on the algorithms of `rebuild-mappings process` separately from having +to scan the entire FS) + +So, I made the decision that `btrfs-rec inspect SUBCMD` commands +should all only open the FS read-only, and output their work to a +separate file; that writing that info back to the FS should be +separate in `btrfs-rec repair SUBCMD`. + +For connecting those parts of the pipeline, I chose JSON, for a few +reasons: + + - I wanted something reasonably human-readable, so that I could debug + it easier. + + - I wanted something reasonably human-readable, so that human + end-users could make manual edits; for example, in + `examples/main.sh` I have an example of manually editing + `mappings.json` to resolve a region that the algorithm couldn't + figure out, but with knowledge of what caused the corruption a + human can. + + - I didn't want to invent my own DSL and have to handle writing a + parser. (This part didn't pay off! See below.) + + - I wanted something that I thought would have good support in a + variety of languages, so that if Go is problematic for getting + things merged upstream it could be rewritten in C (or maybe Rust?) + piece-meal where each subcommand can be rewritten one at a time. + +It turned out that JSON was perhaps not the best choice. + +OK, so: Go and/or JSON maybe being mistakes: + + - I spent a lot of time getting the garbage collector to not just + kill performance. + + - The `btrfs-rec inspect rebuild-mappings SUBCMD` subcommands all + throw a lot of data through the JSON encoder/decoder, and I learned + that the Go stdlib `encoding/json` package has memory use that + grows O(n^2) (-ish? I didn't study the implementation, but that's + what the curve looks like just observing it) on the size of the + data being shoved through it, so I had to go take a break and go + write https://pkg.go.dev/git.lukeshu.com/go/lowmemjson which is a + mostly-drop-in-replacement that tries to be as close-as possible to + O(1) memory use. So I did end up having to write my own parser + anyway :( + +## 4.3. Algorithms + +There are 3 algorithms of note in `btrfs-rec`, that I think are worth +getting into mainline btrfs-progs even if the code of `btrfs-rec` +doesn't get in: + + 1. The `btrfs-rec inspect rebuild-mappings` algoritithm to rebuild + information from the `CHUNK_TREE`, `DEV_TREE`, and + `BLOCK_GROUP_TREE`. + + 2. The `btrfs-rec --rebuild` algorithm to cope with reading broken B+ + trees. + + 3. The `btrfs-rec inspect rebuild-trees` algorithm to re-attach lost + branches to broken B+ trees. + +### 4.3.1. The `rebuild-mappings` algorithm + +(This step-zero scan is `btrfs-rec inspect rebuild-mappings scan`, and +principally lives in `./lib/btrfsutil/scan.go` and +`./cmd/btrfs-rec/inspect/rebuidmappings/scan.go`) + + 0. Similar to `btrfs rescue chunk-recover`, scan each device for + things that look like nodes; keep track of: + - Checksums of every block on the device + - Which physical addresses contain nodes that claim to be at a + given logical addess. + - Any found Chunk items, BlockGroup items, DevExtent, and CSum + items. Keep track of the key for each of these, and for CSum + items also track the generation. + +Create a bucket of the data from Chunks, DevExtents, and BlockGroups; +since these are mostly a Chunk and a DevExtent+BlockGroup store pretty +much the same information; we can use one to reconstruct the other. +How we "merge" these and handle conflicts is in +`./lib/btrfs/btrfsvol/lvm.go:addMapping()`, I don't think this +part is particularly clever, but given that `btrfs rescue +chunk-recover` crashes if it encounters two overlapping chunks, I +suppose I should spell it out: + + - A "mapping" is represented as a group of 4 things: + + + logical address + + a list of 1 or more physical addresses (device ID and offset) + + size, and a Boolean indicator of whether the size is "locked" + + block group flags, and a Boolean presence-indicator + + - Mappings must be merged if their logical or physical regions + overlap. + + - If a mapping has a "locked" size, then when merging it may subsume + smaller mappings with unlocked sizes, but its size cannot be + changed; trying to merge a locked-size mapping with another mapping + that is not for a subset region should return an error. + + - If a mapping has block group flags present, then those flags may + not be changed; it may only be merged with another mapping that + does not have flags present, or has identical flags. + + - When returning an error because of overlapping non-mergeable + mappings, just log an error on stderr and keep going. That's an + important design thing that is different than normal filesystem + code; if there's an error, yeah, detect and notify about it, **but + don't bail out of the whole routine**. Just skip that one item or + whatever. + +Now that we know how to "add a mapping", let's do that: + +(The following main-steps are `btrfs-rec inspect rebuild-mappings +process`, and principally live in +`./cmd/btrfs-rec/inspect/rebuidmappings/process.go`) + + 1. Add all found Chunks. + + 2. Add all found DevExtents. + + 3. Add a phyical:logical mapping of length nodesize for each node + that was found. + + 4. Any mappings from steps 2 or 3 that are missing blockgroup flags + (that is: they weren't able to be merged with a mapping from step + 1), use the found BlockGroups to fill in those flags. + + 5. Now we'll merge all found CSum items into a map of the sums of the + logical address space. Sort all of the csum items by generation, + then by address. Loop over them in that order, inserting their + sums into the map. If two csum items overlap, but agree about the + sums of the overlapping region, that's fine, just take their + union. For overlaps that disagree, items with a newer generation + kick out items with an older generation. If disagreeing items + have the same generation... I don't think that can happen except + by a filesystem bug (i.e. not by a failing drive or other external + corruption), so I wasn't too concerned about it, so I just log an + error on stderr and skip the later-processed item. See + `./cmd/btrfs-rec/inspect/rebuidmappings/process_sums_logical.go`. + + Look at regions of the logical address space that meet all the 3 + criteria: + + - we have CSum items for them + - we have a BlockGroup for them + - we don't have a Chunk/DevExtent mapping them to the pysical + address space. + + Pair those CSums up with BlockGroups, and for each BlockGroup, + search the list of checksums of physical blocks to try to find a + physical region that matches the logical csums (and isn't already + mapped to a different logical region). I used a + Knuth-Morris-Pratt search, modified to handle holes in the logical + csum list as wildcards. + + Insert any found mappings into our bucket of mappings. + + 6. Do the same again, but with a fuzzy search (we can re-use the csum + map of the logical address space). My implementation of this is + comparatively time and space intensive; I just walk over the + entire unmapped physical address space, noting what % of match + each BlockGroup has if placed at that location. I keep track of + the best 2 matches for each BlockGroup. If the best match is + better than a 50% match, and the second best is less than a 50% + match, then I add the best match. In my experience, the best + match is >90% (or at whatever the maximum percent is for how much + of the BlockGroup has logical sums), and the second best is 0% or + 1%. The point of tracking both is that if there isn't a clear-cut + winner, I don't want it to commit to a potentially wrong choice. + +### 4.3.2. The `--rebuild` algorithm + +The `--rebuild` flag is implied by the `--trees=trees.json` flag, and +triggers an algorithm that allows "safely" reading from a broken B+ +tree, rather than the usual B+ tree lookup and search functions. I +probably should have tried to understand the `btrfs restore` +algorithm, maybe I reinvented the wheel... + +This algorithm requires a list of all nodes on the filesystem; we find +these using the same scan as above (`./lib/btrfsutil/scan.go`), the +same procedure as `btrfs rescue chunk-recover`. + +We walk all of those nodes, and build a reasonably lightweight +in-memory graph of all nodes (`./lib/btrfsutil/graph.go`), tracking + + - each node's + + logical address + + level + + generation + + tree + + each item's key and size + - each keypointer's + + source node + + source slot within the node + + tree of the source node + + destination node + + destination level implied by the level of the source node + + destination key + + destination generation + - logical addresses and error messages for nodes that are pointed to + by a keypointer or the superblock, but can't be read (because that + logical address isn't mapped, or it doesn't look like a node, + or...) + - an index such that for a given node we can quickly list both + keypointers both originating at that node and pointing to that + node. + +#### 4.3.2.1. rebuilt forrest behavior (looking up trees) + +(see: `./lib/btrfsutil/rebuilt_forrest.go`) + + - The `ROOT_TREE`, `CHUNK_TREE`, `TREE_LOG`, and `BLOCK_GROUP_TREE` + (the trees pointed to directy by the superblock) work as you'd + expect. + - For other trees, we (as you'd expect) look up the root item in the + rebuilt `ROOT_TREE`, and then (if rootitem.ParentUUID is non-zero) + eagerly also look up the parent tree (recursing on ourself). We + try to use the `UUID_TREE` tree to help with this, but fall back to + just doing a linear scan over the `ROOT_TREE`. If we fail to look + up the parent tree (or its parent, or a more distant ancestor), + then (depending on a flag) we either make a note of that, or error + out and fail to look up the child tree. For `--rebuild` and + `--trees=trees.json` we are permissive of this error, and just make + note of it; but we'll re-use this algorithm in the `rebuild-trees` + algorithm below, and it needs the more strict handling. + - When creating the rebuilt individual tree, we start by adding the + root node specified by the superblock/root-item. But we may also + add additional root nodes grafted on to the tree by the + `--trees=trees.json` flag or by the `rebuild-trees` algorithm + below. So a tree may have more than 1 root node. + +#### 4.3.2.2. rebuilt individual tree behavior + +(see: `./lib/btrfsutil/rebuilt_tree.go`) + +In order to read from a tree, we first have to build a few indexes. +We store these indexes in an Adaptive Replacement Cache; they are all +re-buildable based on the tree's list of roots and the above graph; if +we have a bunch of trees we don't need to keep all of this in memory +at once. Note that this is done 100% with the in-memory graph, we +don't need to read anything from the filesystem during these +procedures. + + - The first index we build is the "node index". This is an index + that for every node tells us what root(s) the tree would need to + have in order for the tree to include that node, and also what the + highest item key would be acceptable in the node if the tree + includes that root. We track both a `loMaxItem` and a `hiMaxItem`, + in case the tree is real broken and there are multiple paths from + the root to the node; as these different paths may imply different + max-item constraints. Put more concretely, the type of the index + is: + + map[ nodeID → map[ rootNodeID → {loMaxItem, hiMaxItem} ] ] + + We'll do a loop over the graph, using dynamic-programming + memoization to figure out ordering and avoid processing the same + node twice; for each node we'll + + + Check whether the owner-tree is this tree or one of this tree's + ancestors (and if it's an ancestor, that the node's generation + isn't after the point that the child tree was forked from the + parent tree). If not, we are done processing that node (record + an empty/nil set of roots for it). + + + Create an empty map of `rootID` → {`loMaxItem`, `hiMaxItem`}. + + + Look at each keypointer that that points at the node and: + + * Skip the keypointer if its expectations of the node aren't met: + if the level, generation, and min-key constraints don't match + up. If the keypointer isn't in the last slot in the source + node, we also go ahead and include checking that the + destination node's max-key is under the min-key of the + keypointer in the next slot, since that's cheap to do now. + + * Skip the keypointer if its source node's owner-tree isn't this + tree or one of this tree's ancestors (and if it's an ancestor, + that the node's generation isn't after the point that the child + tree was forked from the parent tree). + + * dynamic-programming recurse and index the keypointer's source + node. + + * for every root that would result in the keypointer's source + node being included in the tree: + + . If the keypointer is in the last slot, look at what the what + the source node's last-item constraints would be if that root + is included, and can now check the max-item of our + destination node. We check against the `hiMaxItem`; as if + there is any valid path from the root to this node, then we + want to be permissive and include it. If that check fails, + then we're done with this keypointer. Also, make node of + those `loMaxItem` and `hiMaxItem` values, we'll use them + again in just a moment. + + . Otherwise, set both `loMaxItem` and `hiMaxItem` to 1-under + the min-item of the keypointer in the next slot. + + . Insert that `loMaxItem` and `hiMaxItem` pair into the + `rootID` → {`loMaxItem`, `hiMaxItem`} map we created above. + If an entry already exists for this root (since a broken tree + might have multiple paths from the root to our node), then + set `loMaxItem` to the min of the existing entry and our + value, and `hiMaxItem` to the max. + + + If that `rootID` → {`loMaxItem`, `hiMaxItem`} map is still empty, + then consider this node to be a (potential) root, and insert + `rootID=thisNode` -> {`loMaxItem=maxKey`, `hiMaxItem=maxKey`} + (where `maxKey` is the maximum value of the key datatype). + + + Take that `rootID` → {`loMaxItem`, `hiMaxItem`} map and insert it + into the index as the entry for this node. + + - The next index we build is the "item index". This is a "sorted + map" (implemented as a red-black tree, supporting sub-range + iteration) of `key` → {`nodeID`, `slotNumber`}; a map that for each + key tells us where to find the item with that key. + + + Loop over the node index, and for each node check if both (a) it + has `level==0` (is a leaf node containing items), and (b) its set + of roots that would include it has any overlap with the tree's + set of roots. + + + Loop over each of those included leaf nodes, and loop over the + items in each node. Insert the `key` → {`nodeId`, `slot`} into + our sorted map. If there is already an entry for that key, + decide which one wins by: + + * Use the one from the node with the owner-tree that is closer to + this tree; node with owner=thisTree wins over a node with + owner=thisTree.parent, which would win over a node with + owner.thisTree.parent.parent. If that's a tie, then... + + * Use the one from the node with the higher generation. If + that's a tie, then... + + * I don't know, I have the code `panic`: + + // TODO: This is a panic because I'm not really sure what the + // best way to handle this is, and so if this happens I want the + // program to crash and force me to figure out how to handle it. + panic(fmt.Errorf("dup nodes in tree=%v: old=%v=%v ; new=%v=%v", + tree.ID, + oldNode, tree.forrest.graph.Nodes[oldNode], + newNode, tree.forrest.graph.Nodes[newNode])) + + Note that this algorithm means that for a given node we may use a + few items from that node, while having other items from that same + node be overridden by another node. + + - The final index we build is the "error index". This is an index of + what errors correspond to which range of keys, so that we can + report them, and give an idea of "there may be entries missing from + this directory" and similar. + + For each error, we'll track the min-key and max-key of the range it + applies to, the node it came from, and what the error string is. + We'll store these into an interval tree keyed on that + min-key/max-key range. + + + Create an empty set `nodesToProcess`. Now populate it: + + * Once again, we'll loop over the node index, but this time we'll + only check that there's overlap between the set of roots that + would include the node and the tree's set of roots. The nodes + that are included in this tree, insert both that node itself + and all node IDs that it has keypointers pointing to into the + `nodesToProcess` set. + + * Also insert all of the tree's roots into `nodesToProcess`; this + is in case the superblock/root-item points to an invalid node + that we couldn't read. + + + Now loop over `nodesToProcess`. For each node, create an empty + list of errors. Use the keypointers pointing to and the min + `loMaxItem` from the node index to construct a set of + expectations for the node; this should be reasonably + straight-forward, given: + + * If different keypointers have disagreeing levels, insert an + error in to the list, and don't bother with checking the node's + level. + + * If different keypointers have disagreeing generations, insert + an error in to the list, and don't bother with checking the + node's generation. + + * If different keypointers have different min-item expectations, + use the max of them. + + Then: + + * If the node is a "bad node" in the graph, insert the error + message associated with it. Otherwise, check those + expectations against the node in the graph. + + If the list of error messages is non-empty, then insert their + concatenation into the interval tree, with the range set to the + min of the min-item expectations from the keypointers through the + max of the `hiMaxItem`s from the node index. If the min min-item + expectation turns out to be higher than the max `hiMaxItem`, then + set the range to the zero-key through the max-key. + +From there, it should be trivial to implement the usual B+ tree +operations using those indexes; exact-lookup using the item index, and +range-lookups and walks using the item index together with the error +index. Efficiently searching the `CSUM_TREE` requires knowing item +sizes, so that's why we recorded the item sizes into the graph. + +### 4.3.3. The `rebuild-trees` algorithm + +The `btrfs inspect rebuild-trees` algorithm finds nodes to attach as +extra roots to trees. I think that conceptually it's the the simplest +of the 3 algorithms, but turned out to be the hardest to get right. +So... maybe more than the others reference the source code too +(`./cmd/btrfs-rec/inspect/rebuildtrees/`) because I might forget some +small but important detail. + +The core idea here is that we're just going to walk each tree, +inspecting each item in the tree, and checking for any items that are +implied by other items (e.g.: a dir entry item implies the existence +of inode item for the inode that it points at). If an implied item is +not in the tree, but is in some other node, then we look at which +potential roots we could add to the tree that would add that other +node. Then, after we've processed all of the items in the filesystem, +we go add those various roots to the various trees, keeping track of +which items are added or updated. If any of those added/updated items +have a version with a newer generation on a different node, see what +roots we could add to get that newer version. Then add those roots, +keeping track of items that are added/updated. Once we reach +steady-state with the newest version of each item has been added, loop +back and inspect all added/updated items for implied items, keeping +track of roots we could add. Repeat until a steady-state is reached. + +There are lots of little details in that process, some of which are +for correctness, and some of which are for "it should run in hours +instead of weeks." + +#### 4.3.3.1. initialization + +First up, we're going to build and in-memory graph, same as above. +But this time, while we're reading the nodes to do that, we're also +going to watch for some specific items and record a few things about +them. + +(see: `./cmd/btrfs-rec/inspect/rebuildtrees/scan.go`) + +For each {`nodeID`, `slotNumber`} pair that matches one of these item +types, we're going to record: + + - flags: + + `INODE_ITEM`s: whether it has the `INODE_NODATASUM` flag set + - names: + + `DIR_INDEX` items: the file's name + - sizes: + + `EXTENT_CSUM` items: the number of bytes that this is a sum for + (i.e. the item size over the checksum size, times the block size) + + `EXTENT_DATA` items: the number of bytes in this extent + (i.e. either the item size minus + `offsetof(btrfs_file_extent_item.disk_bytenr)` if + `FILE_EXTENT_INLINE`, or else the item's `num_bytes`). + - data backrefs: + - `EXTENT_ITEM`s and `METADATA_ITEM`s: a list of the same length as + the number of refs embedded in the item; for embeded + ExtentDataRefs, the list entry is the subvolume tree ID that the + ExtentDataRef points at, otherwise it is zero. + - `EXTENT_DATA_REF` items: a list of length 1, with the sole member + being the subvolume tree ID that the ExtentDataRef points at. + +#### 4.3.3.2. the main loop + +(see: `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go`) + +Start with that scan data (graph + info about items), and also a +rebuilt forrest from the above algorithm, but with: + + - the flag set so that it refuses to look up a tree if it can't look + up all of that tree's ancestors + + - an additional "potential-item index" that is similar to the item + index. It is generated the same way and can cache/evict the same + way; the difference is that we invert the check for if the set of + roots for a node has overlap with the tree's set of roots; we're + looking for *potential* nodes that we could add to this tree. + + - some callbacks; we'll get to what we do in these callbacks in a + bit, but for now, what the callbacks are: + + + a callback that is called for each added/updated item when we add + a root. + + + a callback that is called whenever we add a root + + + a callback that intercepts looking up a root item + + + a callback that intercepts resolving an UUID to an object ID. + + (The callbacks are in + `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go`) + +We have 5 unordered queues ("work lists"?); these are sets that when +it's time to drain them we'll sort the members and process them in +that order. + + 1. the tree queue: a list of tree IDs that we need to crawl + 2. the retry-item queue: for each tree ID, a set of items that we + should re-process if we add a root to that tree + 3. the added-item queue: a set of key/tree pairs identifying items + that have been added by adding a root to a tree + 4. the settled-item-queue: a set of key/tree pairs that have have not + just been added by adding a root, but we've also verified that + they are the newest-generation item with that key that we could + add to the tree. + 5. the augment queue: for each item that we want to add to a tree, + the list of roots that we could add to get that item. + +The roots all start out empty, except for the tree queue, which we +seed with the `ROOT_TREE`, the `CHUNK_TREE`, and the +`BLOCK_GROUP_TREE` (It is a "TODO" task that it should probably also +be seeded with the `TREE_LOG`, but as I will say below in the "future +work" section, I don't actually understand the `TREE_LOG`, so I +couldn't implement it). + +Now we're going to loop until the tree queue, added-item queue, +settled-item queue, and augment queue are all empty (all queues except +for the retry-item queue). Each loop "pass" has 3 substeps: + + 1. Crawl the trees (drain the tree queue, fill the added-item queue). + + 2. Either: + + a. if the added-item queue is non-empty: "settle" those items + (drain the added-item queue, fill the augment queue and the + settled-item queue). + + b. otherwise: process items (drain the settled-item queue, fill + the augment queue and the tree queue) + + 3. Apply augments (drain the augment queue and maybe the retry-item + queue, fill the added-item queue). + +OK, let's look at those 3 substeps in more detail: + + 1. Crawl the trees; drain the tree queue, fill the added-item queue. + + We just look up the tree in the rebuilt forrest, which will (per + the above `--rebuild` algorithm) will either fail to look up the + tree, or succeed, and add to that tree the root node from the + superblock/root-item. Because we set an item-added callback, when + adding that root it will loop over the nodes added by that root, + and call our callback for each item in one of the added nodes. + Our callback inserts each item into the added-item queue. The + forrest also calls our root-added callback, but because of the way + this algorithm works, that turns out to be a no-op at this step. + + I mentioned that we added callbacks to intercept the forrest's + looking up of root items and resolving UUIDs; we override the + forrest's "lookup root item" routine and "resolve UUID" routine to + instead of doing normal lookups on the `ROOT_TREE` and + `UUID_TREE`, use the above `WantXXX` routines that we'll define + below in the "graph callbacks" section. + + It shouldn't matter what order this queue is processed in, but I + sort tree IDs numerically. + + The crawling is fairly fast because it's just in-memory, the only + accesses to disk are looking up root items and resolving UUIDs. + + 2. Either: + + a. Settle items from the added-item queue to the settled-item queue + (and fill the augment queue). + + For each item in the queue, we look in the tree's item index to + get the {node, slot} pair for it, then we do the same in the + tree's potential-item index. If the potential-item index + contains an entry for the item's key, then we check if the + potential-item's node should "win" over the queue item's node, + deciding the "winner" using the same routine as when building + the item index. If the potential-item's node wins, then we add + the potential node's set of roots to the augment queue. If the + queue-item's node wins, then we add the item to the + settled-item queue (except, as an optimization, if the item is + of a type that cannot possibly imply the existence of another + item, then we just drop it and don't add it to the settled-item + queue). + + It shouldn't matter what order this queue is processed in, but + I sort it numerically by treeID and then by item key. + + This step is fairly fast because it's entirely in-memory, + making no accesses to disk. + + b. Process items from the settled-item queue (drain the + settled-item queue, fill the augment queue and the tree queue). + + This step accesses disk, and so the order we process the queue + in turns out to be pretty important in order to keep our disk + access patterns cache-friendly. For the most part, we just + sort each queue item by tree, then by key. But, we have + special handling for `EXTENT_ITEM`s, `METADATA_ITEM`s, and + `EXTENT_DATA_REF` items: We break `EXTENT_ITEM`s and + `METADATA_ITEM`s in to "sub-items", treating each ref embedded + in them as a separate item. For those embedded items that are + `EXTENT_DATA_REF`s, and for stand-alone `EXTENT_DATA_REF` + items, we sort them not with the `EXTENT_TREE` items, but with + the items of the tree that the extent data ref points at. + Recall that during the intitial scan step, we took note of + which tree every extent data ref points at, so we can perform + this sort without accessing disk yet. This splitting does mean + that we may visit/read an `EXTENT_ITEM` or `METADATA_ITEM` + multiple times as we process the queue, but to do otherwise is + to solve MinLA, which is NP-hard and also an optimal MinLA + solution I still think would perform worse than this; there is + a reasonably lengthy discussion of this in a comment in + `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:sortSettledItemQueue()`. + + Now we loop over that sorted queue. In the code, this loop is + deceptively simple. Read the item, then pass it to a function + that tells us what other items are implied by it. That + function is large, but simple; it's just a giant table. The + trick is how it tells us about implied items; we give it set of + callbacks that it calls to tell us these things; the real + complexity is in the callbacks. These "graph callbacks" will + be discussed in detail below, but as an illustrative example: + It may call `.WantOff()` with a tree ID, object ID, item type, + and offset to specify a precise item that it believes should + exist. + + If we encounter a `ROOT_ITEM`, add the tree described by that + item to the tree queue. + + (Both the "can this item even imply the existence of another item" + check and the "what items are implied by this item" routine are in + `./lib/btrfscheck/graph.go`) + + 3. Apply augments; drain the augment queue (and maybe the retry-item + queue), fill the added-item queuee. + + It is at this point that I call out that the augment queue isn't + implemented as a simple map/set like the others, the + `treeAugmentQueue struct` has special handling for sets of + different sizes; optimizing the space for empty and len()==1 sized + sets, and falling back to normal the usual implementation for + larger sets; this is important because those small sets are the + overwhelming majority, and otherwise there's no way the program + would be able to run on my 32GB RAM laptop. Now that I think + about it, I bet it would even be worth it to add optimized storage + for len()==2 sized sets. + + The reason is that each "want" from above is tracked in the queue + separately; if we were OK merging them, then this optimized + storage wouldn't be nescessary. But we keep them separate, so + that: + + - For all "wants", including ones with empty sets, graph callbacks + can check if a want has already been processed; avoiding + re-doing any work (see the description of the graph callbacks + below). + + - For "wants" with non-empty sets, we can see how many different + "wants" could be satisfied with a given root, in order to decide + which root to choose. + + Anyway, we loop over the trees in the augment queue. For each + tree we look at that tree's augment queue and look at all the + choices of root nodes to add (below), and decide on a list to add. + The we add each of those roots to the tree; the adding of each + root triggers several calls to our item-added callback (filling + the added-item queue), and our root-added callback. The + root-added callback moves any items from the retry-item queue for + this tree to the added-item queue. + + How do we decide between choices of root nodes to add? + `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:resolveTreeAugments()` + has a good comment explaining the criteria we'd like to optimize + for, and then code that does an OK-ish job of actually optimizing + for that: + + - It loops over the augment queue for that tree, building a list + of possible roots, for each possible root making note of 3 + things: + + a. how many "wants" that root satisfies, + + b. how far from treee the root's owner is (owner=tree is a + distance of 0, owner=tree.parent is a distance of 1, + owner=tree.parent.parent is a distance of 2, and so on), and + + c. what the generation of that root is. + + - We sort that list first by highest-count-first, then by + lowest-distance-first, then by highest-generation-first. + + - We create a "return" set and an "illegal" set. We loop over the + sorted list; for each possible root if it is in the illegal set, + we skip it, otherwise we insert it into the return set and for + each "want" that includes this root we all all roots that + satisfy that want to the illegal list. + +It is important that the rebuilt forrest have the flag set so that it +refuses to look up a tree if it can't look up all of that tree's +ancestors; otherwise the potential-items index would be garbage as we +wouldn't have a good idea of which nodes are OK to consider; but this +does have the downside that it won't even attempt to improve a tree +with a missing parent. Perhaps the algorithm should flip the flag +once the loop terminates, and then re-seed the tree queue with each +`ROOT_ITEM` from the `ROOT_TREE`? + +#### 4.3.3.3. graph callbacks + +(see: `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go`) + +The graph callbacks are what tie the above together. + +For each of these callbacks, whenever I say that it looks up something +in a tree's item index or potential-item index, that implies looking +the tree up from the forrest; if the forrest cannot look up that tree, +then the callback returns early, after either: + + - if we are in substep 1 and are processing a tree: we add the tree + that is being processed to the tree queue. (TODO: Wait, this + assumes that an augment will be applied to the `ROOT_TREE` before + the next pass... if that isn't the case, this will result in the + loop never terminating... I guess I need to add a separate + retry-tree queue?) + + - if we are in substep 2 and are processing an item: we add the item + that is being processed to the retry-item queue for the tree that + cannot be looked up + +The 6 methods in the `brfscheck.GraphCallbacks` interface are: + + 1. `FSErr()`: There's an error with the filesystem; this callback + just spits it out on stderr. I say such a trivial matter because, + again, for a recovery tool I think it's worth putting care in to + how you handle errors and where you expect them: We expect them + here, so we have to check for them to avoid reading invalid data + or whatever, but we don't actually need to do anything other than + watch our step. + + 2. `Want()`: We want an item in a given tree with a given object ID + and item type, but we don't care about what the item's offset is. + + The callback works by searching the item index to see if it can + find such an item; if so, it has nothing else to do and returns. + Otherwise, it searches the potential-item index; for each matching + item it finds it looks in the node index for the node containing + that item, and adds the roots that would add that node, and adds + those roots to a set. Once it has finished searching the + potential-item index, it adds that set to the augment queue (even + if that set is still empty). + + 3. `WantOff()`: The same, but we want a specific offset. + + 4. `WantDirIndex()`: We want a `DIR_INDEX` item for a given inode and + filename, but we don't know what the offset of that item is. + + First we scan over the item index, looking at all `DIR_INDEX` + items for that inode number. For each item, we can check the scan + data to see what the filename in that `DIR_INDEX` is, so we can + see if the item satisfies this want without accessing the disk. + If there's a match, then there is nothing else to do, so we + return. Otherwise, we do that same search over the potential-item + index; if we find any matches, then we build the set of roots to + add to the augment queue the same as in `Want`. + + 5. `WantFileExt()`: We want 1 or more `DATA_EXTENT` items in the + given tree for the given inode, and we want them to cover from 0 + to a given size bytes of that file. + + First we walk that range in the item index, to build a list of the + gaps that we need to fill ("Step 1" in + `rebuild_wantcb.go:_wantRange()`). This walk + (`rebuild_wantcb.go:_walkRange()`) requires knowing the size of + each file extent; so doing this quickly without hitting disk is + why we recorded the size of each file extent in our initialization + step. + + Then ("Step 2" in `_wantRange()`) we iterate over each of the + gaps, and for each gap do a very similar walk (again, by calling + `_walkRange()`, but this time over the potential-item index. For + each file extent we find that has is entirely within the gap, we + "want" that extent, and move the beginning of of the gap forward + to the end of that extent. This algorithm is dumb and greedy, + potentially making sub-optimal selections; and so could probably + stand to be improved; but in my real-world use, it seems to be + "good enough". + + 6. `WantCSum()`: We want 1 or more `EXTENT_CSUM` items to cover the + half-open interval [`lo_logical_addr`, `hi_logical_addr`). Well, + maybe. It also takes a subvolume ID and an inode number; and + looks up in the scan data whether that inode has the + `INODE_NODATASUM` flag set; if it does have the flag set, then it + returns early without looking for any `EXTENT_CSUM` items. If it + doesn't return early, then it performs the same want-range routine + as `WantFileExt`, but with the appropriate tree, object ID, and + item types for csums as opposed to data extents. + +For each of these callbacks, we generate a "wantKey", a tuple +representing the function and its arguments; we check the +augment-queue to see if we've already enqueued a set of roots for that +want, and if so, that callback can return early without checking the +potential-item index. + +# 5. Future work + +It's in a reasonably useful place, I think; and so now I'm going to +take a break from it for a while. But there's still lots of work to +do: + + - RAID almost certainly doesn't work. + + - Encryption is not implemented. + + - It doesn't understand (ignores) the `TREE_LOG` (because I don't + understand the `TREE_LOG`). + + - `btrfs-rec inspect mount` should add "lost+found" directories for + inodes that are included in the subvolume's tree but aren't + reachable from the tree's root inode + + - I still need to implement `btrfs-rec repair SUBCMD` subcommands to + write rebuilt-information from `btrfs-rec inspect` back to the + filesystem. + + - I need to figure out the error handling/reporting story for + `mount`. + + - It needs a lot more tests + + + I'd like to get the existing btrfs-progs fsck tests to run on + it. + + - In the process of writing this email, I realized that I probably + need to add a retry-tree queue; see the "graph callbacks" section + in the description of the `rebuild-trees` algorithm above. + + - Shere are a number of "TODO" comments or panics in the code: + + + Some of them definitely need done. + + + Some of them are `panic("TODO")` on the basis that if it's + seeing something on the filesystem that it doesn't recognize, + it's probably that I didn't get to implementing that + thing/situation, but it's possible that the thing is just + corrupt. This should only be for situations that the node + passed the checksum test, so it being corrupt would have to be + caused by a bug in btrfs rather than a failing drive or other + corruption; I wasn't too worried about btrfs bugs. + + - `btrfs-rec inspect rebuild-trees` is slow, and can probably be made + a lot faster. + + Just to give you an idea of the speeds, the run-times for the + various steps on my ThinkPad E15 for a 256GB disk image are as + follows: + + btrfs-rec inspect rebuild-mappings scan : 7m 31s + btrfs-rec inspect rebuild-mappings list-nodes : 47s + btrfs-rec inspect rebuild-mappings process : 8m 22s + btrfs-rec inspect rebuild-trees : 1h 4m 55s + btrfs-rec inspect ls-files : 29m 55s + btrfs-rec inspect ls-trees : 8m 40s + + For the most part, it's all single-threaded (with the main + exception that in several places I/O has been moved to a separate + thread from the main CPU-heavy thread), but a lot of the algorithms + could be parallelized. + + - There are a lot of "tunable" values that I haven't really spent + time tuning. These are all annotated with `textui.Tunable()`. I + sort-of intended for them to be adjustable on the CLI. + + - Perhaps the `btrfs inspect rebuild-trees` algorithm could be + adjusted to also try to rebuild trees with missing parents; see the + above discussion of the algorithm. + +# 6. Problems for merging this code into btrfs-progs + + - It's written in Go, not C. + + - It's effectively GPLv3+ (not GPLv2-only or GPLv2+) because of use + of some code under the Apache 2.0 license (2 files in the codebase + itself that are based off of Apache-licensed code, and use of + unmodified 3rd-party libraries). + + - It uses ARC (Adaptive Replacement Cache), which is patented by IBM, + and the patent doesn't expire for another 7 months. An important + property of ARC over LRU is that it is scan-resistant; the above + algorithms do a lot of scanning. On that note, now that RedHat is + owned by IBM: who in the company do we need to get to talk to + eachother so that we can get ARC into the Linux kernel before then? + +
+--
+Happy hacking,
+~ Luke Shumaker
+
-- cgit v1.2.3