Add the btrfs-rec email

This isn't quite verbatim checking in the email as I did in btrfs-progs-ng.git, I fussed with it a bit to get my blog engine to do sane things with it.
author: Luke Shumaker <lukeshu@lukeshu.com> 2023-07-14 14:38:38 -0700
committer: Luke Shumaker <lukeshu@lukeshu.com> 2023-07-14 15:24:09 -0700
commit: c045654a862bc1119fa4e7584fff9d2a965192ea (patch)
tree: 15b075deea49efdba94df64f90a29fbc5aea0ba8
parent: 8c11dbdaacd8ec0b75d9110fe326955f01c45bee (diff)
1 files changed, 1234 insertions, 0 deletions
diff --git a/public/btrfs-rec.md b/public/btrfs-rec.md
new file mode 100644
index 0000000..8519969
--- /dev/null
+++ b/public/btrfs-rec.md
@@ -0,0 +1,1234 @@
+Announcing: btrfs-rec: Recover (data from) a broken btrfs filesystem
+====================================================================
+---
+date: "2023-07-10"
+---
+
+> I originally sent this email on 2023-07-10, but it has been caught
+> by their bogofilter.  Yes, it was
+> [plaintext](https://git.lukeshu.com/btrfs-progs-ng/tree/README.md?id=18e6066c241cf3d252b6521150843ffc858d8434).
+> No, I didn't use GMail.  Yes, I've successfully participated in vger
+> lists in the past.  Yes, I've reached out to postmaster; no, I
+> haven't received a reply yet (as of 2023-07-14).
+
+<div style="font-family: monospace">
+To: linux-btrfs@vger.kernel.org<br/>
+From: Luke T. Shumaker &lt;lukeshu@lukeshu.com&gt;<br/>
+Subject: btrfs-rec: Recover (data from) a broken btrfs filesystem<br/>
+Date: Mon, 10 Jul 2023 21:23:41 -0600<br/>
+Message-ID: &lt;87jzv7uo5e.wl-lukeshu@lukeshu.com&gt;<br/>
+</div>
+
+Inspired by a mis-typed `dd` command, for the last year I've been
+working on a tool for recovering corrupt btrfs filesystems; at first
+idly here and there, but more actively in the last few months.  I hope
+to get it incorporated into btrfs-progs, though perhaps that is
+problematic for a few reasons I'll get to.  If the code can't be
+incorporated into btrfs-progs, at least the ideas and algorithms
+should be.
+
+[https://git.lukeshu.com/btrfs-progs-ng/](https://git.lukeshu.com/btrfs-progs-ng/)
+
+Highlights:
+
+ - In general, it's more tolerant of corrupt filesystems than
+   `btrfs check --repair`, `btrfs rescue` or `btrfs restore`.
+
+ - `btrfs-rec inspect rebuild-mappings` is a better
+   `btrfs rescue chunk-recover`.
+
+ - `btrfs-rec inspect rebuild-trees` can re-attach lost branches to
+   broken B+ trees.
+
+ - `btrfs-rec inspect mount` is a read-only FUSE implementation of
+   btrfs.  This is conceptually a replacement for `btrfs restore`.
+
+ - It's entirely written in Go.  I'm not saying that's a good thing,
+   but it's an interesting thing.
+
+Hopefully some folks will find it useful, or at least neat!
+
+  - [1.      Motivation](#motivation)
+  - [2.      Overview of use](#overview-of-use)
+  - [3.      Prior art](#prior-art)
+  - [4.      Internals/Design](#internalsdesign)
+  - [4.1.      Overview of the source tree layout](#overview-of-the-source-tree-layout)
+  - [4.2.      Base decisions: CLI structure, Go, JSON](#base-decisions-cli-structure-go-json)
+  - [4.3.      Algorithms](#algorithms)
+  - [4.3.1.      The `rebuild-mappings` algorithm](#the-rebuild-mappings-algorithm)
+  - [4.3.2.      The `--rebuild` algorithm](#the---rebuild-algorithm)
+  - [4.3.2.1.      rebuilt forrest behavior](#rebuilt-forrest-behavior-looking-up-trees)
+  - [4.3.2.2.      rebuilt individual tree behavior](#rebuilt-individual-tree-behavior)
+  - [4.3.3.      The `rebuild-trees` algorithm](#the-rebuild-trees-algorithm)
+  - [4.3.3.1.      initialization](#initialization)
+  - [4.3.3.2.      the main loop](#the-main-loop)
+  - [4.3.3.3.      graph callbacks](#graph-callbacks)
+  - [5.      Future work](#future-work)
+  - [6.      Problems for merging this code into btrfs-progs](#problems-with-merging-this-code-into-btrfs)
+
+# 1. Motivation
+
+Have you ever ended up with a corrupt btrfs filesystem (through no
+fault of btrfs itself, but perhaps a failing drive, or a mistaken `dd`
+invocation)?  Surely losing less than 100MB of data from a drive
+should not render hundreds of GB of perfectly intact data unreadable!
+And yet, the existing tools are unable to even attempt to read that
+data:
+
+    $ btrfs check --repair --force dump-zero.1.img
+    enabling repair mode
+    Opening filesystem to check...
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+    ERROR: cannot read chunk root
+    ERROR: cannot open file system
+
+or
+
+    $ btrfs check --init-extent-tree --force dump-zero.1.img
+    Opening filesystem to check...
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+    ERROR: cannot read chunk root
+    ERROR: cannot open file system
+
+or
+
+    $ btrfs check --init-csum-tree --force dump-zero.1.img
+    Creating a new CRC tree
+    Opening filesystem to check...
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+    ERROR: cannot read chunk root
+    ERROR: cannot open file system
+
+or
+
+    $ btrfs rescue chunk-recover dump-zero.1.img
+    Scanning: DONE in dev0
+    corrupt node: root=1 block=160410271744 slot=0, corrupt node: root=1 block=160410271744, nritems too large, have 39 expect range [1,0]
+    Couldn't read tree root
+    open with broken chunk error
+
+or
+
+    $ btrfs rescue zero-log dump-zero.1.img
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    ERROR: cannot read chunk root
+    ERROR: could not open ctree
+
+or
+
+    $ mkdir out
+    $ btrfs restore dump-zero.1.img out
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+    ERROR: cannot read chunk root
+    Could not open root, trying backup super
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+    ERROR: cannot read chunk root
+    Could not open root, trying backup super
+    ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
+    Could not open root, trying backup super
+
+or
+
+    $ btrfs restore --list-roots dump-zero.1.img
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+    ERROR: cannot read chunk root
+    Could not open root, trying backup super
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+    bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+    ERROR: cannot read chunk root
+    Could not open root, trying backup super
+    ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
+    Could not open root, trying backup super
+
+or
+
+    $ btrfs-find-root dump-zero.1.img
+    WARNING: cannot read chunk root, continue anyway
+    Superblock thinks the generation is 6596071
+    Superblock thinks the level is 1
+
+Well, have I got a tool for you!
+
+(FWIW, I also tried manipulating the filesystem and patching to tools
+to try to get past those errors, only to get a different set of
+errors.  Some of these patches I am separately submitting to
+btrfs-progs.)
+
+# 2. Overview of use
+
+There are two `btrfs-rec` sub-command groups:
+`btrfs-rec inspect SUBCMD` and `btrfs-rec repair SUBCMD`, and you can
+find out about various sub-commands with `btrfs-rec help`.  These are
+both told about devices or images with the `--pv` flag.
+
+`btrfs-rec inspect SUBCMD` commands open the filesystem read-only, and
+(generally speaking) write extracted or rebuilt information to stdout.
+`btrfs-rec repair SUBCMD` commands open the filesystem read+write, and
+consume information from `btrfs-rec inspect SUBCMD` commands to
+actually repair the filesystem (except I haven't actually implemented
+any `repair` commands yet...  despite the lack of `repair` commands, I
+believe that `btrfs-rec` is already a useful because of the `btrfs-rec
+inspect mount` command to get data out of the broken filesystem).
+This split allows you to try things without being scared by WARNINGs
+about not using these tools unless you're an expert or have been told
+to by a developer.
+
+In the broken `dump-zero.1.img` example above (which has a perfectly
+intact superblock, but a totally broken `CHUNK_TREE`), to "repair" it
+I'd:
+
+ 1. Start by using `btrfs-rec inspect rebuild-mappings` to rebuild the
+    broken chunk/dev/blockgroup trees:
+
+        $ btrfs-rec inspect rebuild-mappings \
+            --pv=dump-zero.1.img \
+            > mappings-1.json
+
+ 2. If it only mostly succeeds, but on stderr tells us about a few
+    regions of the image that it wasn't able to figure out the chunks
+    for.  Using some human-level knowledge, you can write those
+    yourself, inserting them into the generated `mappings.json`, and
+    ask `rebuild-mappings` to normalize what you wrote:
+
+        $ btrfs-rec inspect rebuild-mappings \
+            --pv=dump-zero.1.img \
+            --mappings=<(sed <mappings-1.json \
+                -e '2a{"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1},' \
+                -e '2a{"LAddr":13631488,"PAddr":{"Dev":1,"Addr":13631488},"Size":1},') \
+            > mappings-2.json
+
+ 3. Now that it has functioning chunk/dev/blockgroup trees, we can use
+    `btrfs-rec inspect rebuild-trees` to rebuild other trees that rely
+    on those:
+
+        $ btrfs-rec inspect rebuild-mappings \
+            --pv=dump-zero.1.img \
+            --mappings=mappings-2.json \
+            > trees.json
+
+ 4. Now that (hopefully) everything that was damaged has been
+    reconstructed, we can use `btrfs-rec inspect mount` to mount the
+    filesystem read-only and copy out our data:
+
+        $ mkdir mnt
+        $ sudo btrfs-rec inspect mount \
+            --pv=dump-zero.1.img \
+            --mappings=mappings-2.json \
+            --trees=trees.json \
+            ./mnt
+
+This example is fleshed out more (and the manual edits to
+`mappings.json` explained more) in `./examples/main.sh`.
+
+# 3. Prior art
+
+Comparing `btrfs-rec inspect mount` with the existing
+https://github.com/adam900710/btrfs-fuse project:
+
+ - Again, mine has better fault tolerance
+ - Mine is read-only
+ - Mine supports xattrs ("TODO" in Adam's)
+ - Mine supports separate inode address spaces for subvolumes; Adam's
+   doesn't due to limitations in FUSE, mine works around this by
+   lazily setting up separate mountpoints for each subvolume (though
+   this does mean that the process needs to run as root, which is a
+   bummer).
+
+# 4. Internals/Design
+
+## 4.1. Overview of the source tree layout
+
+ - `examples/` has example scripts showing how to use `btrfs-rec`.
+
+ - `lib/btrfs/` is the core btrfs implementation.
+
+ - `lib/btrfscheck/` and `lib/btrfsutil/` are libraries for
+   "btrfs-progs" type programs, that are userland-y things that I
+   thought should be separate from the core implementation; something
+   that frustrated me about libbtrfs was having to figure out "is this
+   thing here in support of btrfs bits-on-disk, or in support of a
+   higher-level 'how btrfs-progs wants to think about things'?"
+
+ - `cmd/btrfs-rec/` is where the command implementations live.  If a
+   sub-command fits in a single file, it's
+   `cmd/btrfs-rec/inspect_SUBCMD.go`, otherwise, it's in a separate
+   `cmd/btrfs-rec/inspect/SUBCMD/` package.
+
+ - `lib/textui/` is reasonably central to how the commands implement a
+   text/CLI user-interface.
+
+ - `lib/binstruct/`, `lib/diskio/`, and `lib/streamio/` are
+   non-btrfs-specific libraries related to the problem domain.
+
+ - `lib/containers/`, `lib/fmtutil/`, `lib/maps/`, `lib/slices/`, and
+   `lib/profile/` are all generic Go libraries that have nothing to do
+   with btrfs or the problem domain, but weren't in the Go standard
+   library and I didn't find/know-of exiting implementations that I
+   liked.  Of these, all but `containers` are pretty simple utility
+   libraries.  Also, some of these things have been added to the
+   standard library since I started the project.
+
+## 4.2. Base decisions: CLI structure, Go, JSON
+
+I started with trying to enhance btrfs-progs, but ended up writing a
+wholy new program in Go, for several reasons:
+
+ - writing a new thing: I was having to learn both the btrfs-progs
+   codebase and how btrfs-bits-on-disk work, and it got to the point
+   that I decided I should just focus on learning btrfs-bits-on-disk.
+
+ - writing a new thing: It was becoming increasingly apparent to me
+   that it was going to be an uphill-fight of having recovery-tools
+   share the same code as the main-tools, as the routines used by the
+   main-tools rightly have validity checks, where recovery-tools want
+   to say "yes, I know it's invalid, can you give it to me anyway?".
+
+ - writing it in not-C: I love me some C, but higher level languages
+   are good for productivity.  And I was trying to write a whole lot
+   of code at once, I needed a productivity boost.
+
+ - writing it in not-C: This forced me to learn btrfs-bits-on-disk
+   better, instead of just cribbing from btrfs-progs.  That knowledge
+   is particularly important for having ideas on how to deal with
+   corrupt bits-on-disk.
+
+ - writing it in Go: At the time I started, my day job was writing Go,
+   so I had Go swapped into my brain.  And Go still feels close to C
+   but provides *a lot* of niceness and safety over C.
+
+It turned out that Go was perhaps not the best choice, but we'll come
+back to that.
+
+I wanted to separate things into a pipeline.  For instance: Instead of
+`btrfs rescue chunk-recover` trying to do everything to rebuild a
+broken chunk tree, I wanted to separate I/O from computation from
+repairs.  So I have `btrfs-rec inspect rebuild-mappings scan` that
+reads all the info necessary to rebuild the chunk tree, then dump that
+as a 2GB glob of JSON.  Then I can feed that JSON to `btrfs-rec
+inspect rebuild-mappings process` which actually rebuilds the mappings
+in the chunk tree, and dumps them as JSON.  And then other commands
+can consume that `mappings.json` to use that instead of trying to read
+the chunk tree from the actual FS, so that you don't have to make
+potentially destructive writes to inspect an FS with a broken chunk
+tree, and can inspect it more forensically.  Or then use
+`btrfs-rec repair SOME_SUBCMD_I_HAVENT_WRITTEN_YET` to write that
+chunk tree in `mappings.json` back to the filesystem.
+
+(But also, the separate steps thing was useful just so I could iterate
+on the algorithms of `rebuild-mappings process` separately from having
+to scan the entire FS)
+
+So, I made the decision that `btrfs-rec inspect SUBCMD` commands
+should all only open the FS read-only, and output their work to a
+separate file; that writing that info back to the FS should be
+separate in `btrfs-rec repair SUBCMD`.
+
+For connecting those parts of the pipeline, I chose JSON, for a few
+reasons:
+
+ - I wanted something reasonably human-readable, so that I could debug
+   it easier.
+
+ - I wanted something reasonably human-readable, so that human
+   end-users could make manual edits; for example, in
+   `examples/main.sh` I have an example of manually editing
+   `mappings.json` to resolve a region that the algorithm couldn't
+   figure out, but with knowledge of what caused the corruption a
+   human can.
+
+ - I didn't want to invent my own DSL and have to handle writing a
+   parser.  (This part didn't pay off!  See below.)
+
+ - I wanted something that I thought would have good support in a
+   variety of languages, so that if Go is problematic for getting
+   things merged upstream it could be rewritten in C (or maybe Rust?)
+   piece-meal where each subcommand can be rewritten one at a time.
+
+It turned out that JSON was perhaps not the best choice.
+
+OK, so: Go and/or JSON maybe being mistakes:
+
+ - I spent a lot of time getting the garbage collector to not just
+   kill performance.
+
+ - The `btrfs-rec inspect rebuild-mappings SUBCMD` subcommands all
+   throw a lot of data through the JSON encoder/decoder, and I learned
+   that the Go stdlib `encoding/json` package has memory use that
+   grows O(n^2) (-ish? I didn't study the implementation, but that's
+   what the curve looks like just observing it) on the size of the
+   data being shoved through it, so I had to go take a break and go
+   write https://pkg.go.dev/git.lukeshu.com/go/lowmemjson which is a
+   mostly-drop-in-replacement that tries to be as close-as possible to
+   O(1) memory use.  So I did end up having to write my own parser
+   anyway :(
+
+## 4.3. Algorithms
+
+There are 3 algorithms of note in `btrfs-rec`, that I think are worth
+getting into mainline btrfs-progs even if the code of `btrfs-rec`
+doesn't get in:
+
+ 1. The `btrfs-rec inspect rebuild-mappings` algoritithm to rebuild
+    information from the `CHUNK_TREE`, `DEV_TREE`, and
+    `BLOCK_GROUP_TREE`.
+
+ 2. The `btrfs-rec --rebuild` algorithm to cope with reading broken B+
+    trees.
+
+ 3. The `btrfs-rec inspect rebuild-trees` algorithm to re-attach lost
+    branches to broken B+ trees.
+
+### 4.3.1. The `rebuild-mappings` algorithm
+
+(This step-zero scan is `btrfs-rec inspect rebuild-mappings scan`, and
+principally lives in `./lib/btrfsutil/scan.go` and
+`./cmd/btrfs-rec/inspect/rebuidmappings/scan.go`)
+
+ 0. Similar to `btrfs rescue chunk-recover`, scan each device for
+    things that look like nodes; keep track of:
+     - Checksums of every block on the device
+     - Which physical addresses contain nodes that claim to be at a
+       given logical addess.
+     - Any found Chunk items, BlockGroup items, DevExtent, and CSum
+       items.  Keep track of the key for each of these, and for CSum
+       items also track the generation.
+
+Create a bucket of the data from Chunks, DevExtents, and BlockGroups;
+since these are mostly a Chunk and a DevExtent+BlockGroup store pretty
+much the same information; we can use one to reconstruct the other.
+How we "merge" these and handle conflicts is in
+`./lib/btrfs/btrfsvol/lvm.go:addMapping()`, I don't think this
+part is particularly clever, but given that `btrfs rescue
+chunk-recover` crashes if it encounters two overlapping chunks, I
+suppose I should spell it out:
+
+ - A "mapping" is represented as a group of 4 things:
+
+   + logical address
+   + a list of 1 or more physical addresses (device ID and offset)
+   + size, and a Boolean indicator of whether the size is "locked"
+   + block group flags, and a Boolean presence-indicator
+
+ - Mappings must be merged if their logical or physical regions
+   overlap.
+
+ - If a mapping has a "locked" size, then when merging it may subsume
+   smaller mappings with unlocked sizes, but its size cannot be
+   changed; trying to merge a locked-size mapping with another mapping
+   that is not for a subset region should return an error.
+
+ - If a mapping has block group flags present, then those flags may
+   not be changed; it may only be merged with another mapping that
+   does not have flags present, or has identical flags.
+
+ - When returning an error because of overlapping non-mergeable
+   mappings, just log an error on stderr and keep going.  That's an
+   important design thing that is different than normal filesystem
+   code; if there's an error, yeah, detect and notify about it, **but
+   don't bail out of the whole routine**.  Just skip that one item or
+   whatever.
+
+Now that we know how to "add a mapping", let's do that:
+
+(The following main-steps are `btrfs-rec inspect rebuild-mappings
+process`, and principally live in
+`./cmd/btrfs-rec/inspect/rebuidmappings/process.go`)
+
+ 1. Add all found Chunks.
+
+ 2. Add all found DevExtents.
+
+ 3. Add a phyical:logical mapping of length nodesize for each node
+    that was found.
+
+ 4. Any mappings from steps 2 or 3 that are missing blockgroup flags
+    (that is: they weren't able to be merged with a mapping from step
+    1), use the found BlockGroups to fill in those flags.
+
+ 5. Now we'll merge all found CSum items into a map of the sums of the
+    logical address space.  Sort all of the csum items by generation,
+    then by address.  Loop over them in that order, inserting their
+    sums into the map.  If two csum items overlap, but agree about the
+    sums of the overlapping region, that's fine, just take their
+    union.  For overlaps that disagree, items with a newer generation
+    kick out items with an older generation.  If disagreeing items
+    have the same generation... I don't think that can happen except
+    by a filesystem bug (i.e. not by a failing drive or other external
+    corruption), so I wasn't too concerned about it, so I just log an
+    error on stderr and skip the later-processed item.  See
+    `./cmd/btrfs-rec/inspect/rebuidmappings/process_sums_logical.go`.
+
+    Look at regions of the logical address space that meet all the 3
+    criteria:
+
+     - we have CSum items for them
+     - we have a BlockGroup for them
+     - we don't have a Chunk/DevExtent mapping them to the pysical
+       address space.
+
+    Pair those CSums up with BlockGroups, and for each BlockGroup,
+    search the list of checksums of physical blocks to try to find a
+    physical region that matches the logical csums (and isn't already
+    mapped to a different logical region).  I used a
+    Knuth-Morris-Pratt search, modified to handle holes in the logical
+    csum list as wildcards.
+
+    Insert any found mappings into our bucket of mappings.
+
+ 6. Do the same again, but with a fuzzy search (we can re-use the csum
+    map of the logical address space).  My implementation of this is
+    comparatively time and space intensive; I just walk over the
+    entire unmapped physical address space, noting what % of match
+    each BlockGroup has if placed at that location.  I keep track of
+    the best 2 matches for each BlockGroup.  If the best match is
+    better than a 50% match, and the second best is less than a 50%
+    match, then I add the best match.  In my experience, the best
+    match is >90% (or at whatever the maximum percent is for how much
+    of the BlockGroup has logical sums), and the second best is 0% or
+    1%.  The point of tracking both is that if there isn't a clear-cut
+    winner, I don't want it to commit to a potentially wrong choice.
+
+### 4.3.2. The `--rebuild` algorithm
+
+The `--rebuild` flag is implied by the `--trees=trees.json` flag, and
+triggers an algorithm that allows "safely" reading from a broken B+
+tree, rather than the usual B+ tree lookup and search functions.  I
+probably should have tried to understand the `btrfs restore`
+algorithm, maybe I reinvented the wheel...
+
+This algorithm requires a list of all nodes on the filesystem; we find
+these using the same scan as above (`./lib/btrfsutil/scan.go`), the
+same procedure as `btrfs rescue chunk-recover`.
+
+We walk all of those nodes, and build a reasonably lightweight
+in-memory graph of all nodes (`./lib/btrfsutil/graph.go`), tracking
+
+ - each node's
+   + logical address
+   + level
+   + generation
+   + tree
+   + each item's key and size
+ - each keypointer's
+   + source node
+   + source slot within the node
+   + tree of the source node
+   + destination node
+   + destination level implied by the level of the source node
+   + destination key
+   + destination generation
+ - logical addresses and error messages for nodes that are pointed to
+   by a keypointer or the superblock, but can't be read (because that
+   logical address isn't mapped, or it doesn't look like a node,
+   or...)
+ - an index such that for a given node we can quickly list both
+   keypointers both originating at that node and pointing to that
+   node.
+
+#### 4.3.2.1. rebuilt forrest behavior (looking up trees)
+
+(see: `./lib/btrfsutil/rebuilt_forrest.go`)
+
+ - The `ROOT_TREE`, `CHUNK_TREE`, `TREE_LOG`, and `BLOCK_GROUP_TREE`
+   (the trees pointed to directy by the superblock) work as you'd
+   expect.
+ - For other trees, we (as you'd expect) look up the root item in the
+   rebuilt `ROOT_TREE`, and then (if rootitem.ParentUUID is non-zero)
+   eagerly also look up the parent tree (recursing on ourself).  We
+   try to use the `UUID_TREE` tree to help with this, but fall back to
+   just doing a linear scan over the `ROOT_TREE`.  If we fail to look
+   up the parent tree (or its parent, or a more distant ancestor),
+   then (depending on a flag) we either make a note of that, or error
+   out and fail to look up the child tree.  For `--rebuild` and
+   `--trees=trees.json` we are permissive of this error, and just make
+   note of it; but we'll re-use this algorithm in the `rebuild-trees`
+   algorithm below, and it needs the more strict handling.
+ - When creating the rebuilt individual tree, we start by adding the
+   root node specified by the superblock/root-item.  But we may also
+   add additional root nodes grafted on to the tree by the
+   `--trees=trees.json` flag or by the `rebuild-trees` algorithm
+   below.  So a tree may have more than 1 root node.
+
+#### 4.3.2.2. rebuilt individual tree behavior
+
+(see: `./lib/btrfsutil/rebuilt_tree.go`)
+
+In order to read from a tree, we first have to build a few indexes.
+We store these indexes in an Adaptive Replacement Cache; they are all
+re-buildable based on the tree's list of roots and the above graph; if
+we have a bunch of trees we don't need to keep all of this in memory
+at once.  Note that this is done 100% with the in-memory graph, we
+don't need to read anything from the filesystem during these
+procedures.
+
+ - The first index we build is the "node index".  This is an index
+   that for every node tells us what root(s) the tree would need to
+   have in order for the tree to include that node, and also what the
+   highest item key would be acceptable in the node if the tree
+   includes that root.  We track both a `loMaxItem` and a `hiMaxItem`,
+   in case the tree is real broken and there are multiple paths from
+   the root to the node; as these different paths may imply different
+   max-item constraints.  Put more concretely, the type of the index
+   is:
+
+       map[ nodeID → map[ rootNodeID → {loMaxItem, hiMaxItem} ] ]
+
+   We'll do a loop over the graph, using dynamic-programming
+   memoization to figure out ordering and avoid processing the same
+   node twice; for each node we'll
+
+   + Check whether the owner-tree is this tree or one of this tree's
+      ancestors (and if it's an ancestor, that the node's generation
+      isn't after the point that the child tree was forked from the
+      parent tree).  If not, we are done processing that node (record
+      an empty/nil set of roots for it).
+
+   + Create an empty map of `rootID` → {`loMaxItem`, `hiMaxItem`}.
+
+   + Look at each keypointer that that points at the node and:
+
+     * Skip the keypointer if its expectations of the node aren't met:
+       if the level, generation, and min-key constraints don't match
+       up.  If the keypointer isn't in the last slot in the source
+       node, we also go ahead and include checking that the
+       destination node's max-key is under the min-key of the
+       keypointer in the next slot, since that's cheap to do now.
+
+     * Skip the keypointer if its source node's owner-tree isn't this
+       tree or one of this tree's ancestors (and if it's an ancestor,
+       that the node's generation isn't after the point that the child
+       tree was forked from the parent tree).
+
+     * dynamic-programming recurse and index the keypointer's source
+       node.
+
+     * for every root that would result in the keypointer's source
+       node being included in the tree:
+
+       . If the keypointer is in the last slot, look at what the what
+         the source node's last-item constraints would be if that root
+         is included, and can now check the max-item of our
+         destination node.  We check against the `hiMaxItem`; as if
+         there is any valid path from the root to this node, then we
+         want to be permissive and include it.  If that check fails,
+         then we're done with this keypointer.  Also, make node of
+         those `loMaxItem` and `hiMaxItem` values, we'll use them
+         again in just a moment.
+
+       . Otherwise, set both `loMaxItem` and `hiMaxItem` to 1-under
+         the min-item of the keypointer in the next slot.
+
+       . Insert that `loMaxItem` and `hiMaxItem` pair into the
+         `rootID` → {`loMaxItem`, `hiMaxItem`} map we created above.
+         If an entry already exists for this root (since a broken tree
+         might have multiple paths from the root to our node), then
+         set `loMaxItem` to the min of the existing entry and our
+         value, and `hiMaxItem` to the max.
+
+   + If that `rootID` → {`loMaxItem`, `hiMaxItem`} map is still empty,
+     then consider this node to be a (potential) root, and insert
+     `rootID=thisNode` -> {`loMaxItem=maxKey`, `hiMaxItem=maxKey`}
+     (where `maxKey` is the maximum value of the key datatype).
+
+   + Take that `rootID` → {`loMaxItem`, `hiMaxItem`} map and insert it
+     into the index as the entry for this node.
+
+ - The next index we build is the "item index".  This is a "sorted
+   map" (implemented as a red-black tree, supporting sub-range
+   iteration) of `key` → {`nodeID`, `slotNumber`}; a map that for each
+   key tells us where to find the item with that key.
+
+   + Loop over the node index, and for each node check if both (a) it
+     has `level==0` (is a leaf node containing items), and (b) its set
+     of roots that would include it has any overlap with the tree's
+     set of roots.
+
+   + Loop over each of those included leaf nodes, and loop over the
+     items in each node.  Insert the `key` → {`nodeId`, `slot`} into
+     our sorted map.  If there is already an entry for that key,
+     decide which one wins by:
+
+     * Use the one from the node with the owner-tree that is closer to
+       this tree; node with owner=thisTree wins over a node with
+       owner=thisTree.parent, which would win over a node with
+       owner.thisTree.parent.parent.  If that's a tie, then...
+
+     * Use the one from the node with the higher generation.  If
+       that's a tie, then...
+
+     * I don't know, I have the code `panic`:
+
+           // TODO: This is a panic because I'm not really sure what the
+           // best way to handle this is, and so if this happens I want the
+           // program to crash and force me to figure out how to handle it.
+           panic(fmt.Errorf("dup nodes in tree=%v: old=%v=%v ; new=%v=%v",
+               tree.ID,
+               oldNode, tree.forrest.graph.Nodes[oldNode],
+               newNode, tree.forrest.graph.Nodes[newNode]))
+
+   Note that this algorithm means that for a given node we may use a
+   few items from that node, while having other items from that same
+   node be overridden by another node.
+
+ - The final index we build is the "error index".  This is an index of
+   what errors correspond to which range of keys, so that we can
+   report them, and give an idea of "there may be entries missing from
+   this directory" and similar.
+
+   For each error, we'll track the min-key and max-key of the range it
+   applies to, the node it came from, and what the error string is.
+   We'll store these into an interval tree keyed on that
+   min-key/max-key range.
+
+   + Create an empty set `nodesToProcess`.  Now populate it:
+
+     * Once again, we'll loop over the node index, but this time we'll
+       only check that there's overlap between the set of roots that
+       would include the node and the tree's set of roots.  The nodes
+       that are included in this tree, insert both that node itself
+       and all node IDs that it has keypointers pointing to into the
+       `nodesToProcess` set.
+
+     * Also insert all of the tree's roots into `nodesToProcess`; this
+       is in case the superblock/root-item points to an invalid node
+       that we couldn't read.
+
+   + Now loop over `nodesToProcess`.  For each node, create an empty
+     list of errors.  Use the keypointers pointing to and the min
+     `loMaxItem` from the node index to construct a set of
+     expectations for the node; this should be reasonably
+     straight-forward, given:
+
+     * If different keypointers have disagreeing levels, insert an
+       error in to the list, and don't bother with checking the node's
+       level.
+
+     * If different keypointers have disagreeing generations, insert
+       an error in to the list, and don't bother with checking the
+       node's generation.
+
+     * If different keypointers have different min-item expectations,
+       use the max of them.
+
+     Then:
+
+     * If the node is a "bad node" in the graph, insert the error
+       message associated with it.  Otherwise, check those
+       expectations against the node in the graph.
+
+     If the list of error messages is non-empty, then insert their
+     concatenation into the interval tree, with the range set to the
+     min of the min-item expectations from the keypointers through the
+     max of the `hiMaxItem`s from the node index.  If the min min-item
+     expectation turns out to be higher than the max `hiMaxItem`, then
+     set the range to the zero-key through the max-key.
+
+From there, it should be trivial to implement the usual B+ tree
+operations using those indexes; exact-lookup using the item index, and
+range-lookups and walks using the item index together with the error
+index.  Efficiently searching the `CSUM_TREE` requires knowing item
+sizes, so that's why we recorded the item sizes into the graph.
+
+### 4.3.3. The `rebuild-trees` algorithm
+
+The `btrfs inspect rebuild-trees` algorithm finds nodes to attach as
+extra roots to trees.  I think that conceptually it's the the simplest
+of the 3 algorithms, but turned out to be the hardest to get right.
+So... maybe more than the others reference the source code too
+(`./cmd/btrfs-rec/inspect/rebuildtrees/`) because I might forget some
+small but important detail.
+
+The core idea here is that we're just going to walk each tree,
+inspecting each item in the tree, and checking for any items that are
+implied by other items (e.g.: a dir entry item implies the existence
+of inode item for the inode that it points at).  If an implied item is
+not in the tree, but is in some other node, then we look at which
+potential roots we could add to the tree that would add that other
+node.  Then, after we've processed all of the items in the filesystem,
+we go add those various roots to the various trees, keeping track of
+which items are added or updated.  If any of those added/updated items
+have a version with a newer generation on a different node, see what
+roots we could add to get that newer version.  Then add those roots,
+keeping track of items that are added/updated.  Once we reach
+steady-state with the newest version of each item has been added, loop
+back and inspect all added/updated items for implied items, keeping
+track of roots we could add.  Repeat until a steady-state is reached.
+
+There are lots of little details in that process, some of which are
+for correctness, and some of which are for "it should run in hours
+instead of weeks."
+
+#### 4.3.3.1. initialization
+
+First up, we're going to build and in-memory graph, same as above.
+But this time, while we're reading the nodes to do that, we're also
+going to watch for some specific items and record a few things about
+them.
+
+(see: `./cmd/btrfs-rec/inspect/rebuildtrees/scan.go`)
+
+For each {`nodeID`, `slotNumber`} pair that matches one of these item
+types, we're going to record:
+
+ - flags:
+   + `INODE_ITEM`s: whether it has the `INODE_NODATASUM` flag set
+ - names:
+   + `DIR_INDEX` items: the file's name
+ - sizes:
+   + `EXTENT_CSUM` items: the number of bytes that this is a sum for
+     (i.e. the item size over the checksum size, times the block size)
+   + `EXTENT_DATA` items: the number of bytes in this extent
+     (i.e. either the item size minus
+     `offsetof(btrfs_file_extent_item.disk_bytenr)` if
+     `FILE_EXTENT_INLINE`, or else the item's `num_bytes`).
+ - data backrefs:
+   - `EXTENT_ITEM`s and `METADATA_ITEM`s: a list of the same length as
+     the number of refs embedded in the item; for embeded
+     ExtentDataRefs, the list entry is the subvolume tree ID that the
+     ExtentDataRef points at, otherwise it is zero.
+   - `EXTENT_DATA_REF` items: a list of length 1, with the sole member
+     being the subvolume tree ID that the ExtentDataRef points at.
+
+#### 4.3.3.2. the main loop
+
+(see: `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go`)
+
+Start with that scan data (graph + info about items), and also a
+rebuilt forrest from the above algorithm, but with:
+
+ - the flag set so that it refuses to look up a tree if it can't look
+   up all of that tree's ancestors
+
+ - an additional "potential-item index" that is similar to the item
+   index.  It is generated the same way and can cache/evict the same
+   way; the difference is that we invert the check for if the set of
+   roots for a node has overlap with the tree's set of roots; we're
+   looking for *potential* nodes that we could add to this tree.
+
+ - some callbacks; we'll get to what we do in these callbacks in a
+   bit, but for now, what the callbacks are:
+
+   + a callback that is called for each added/updated item when we add
+     a root.
+
+   + a callback that is called whenever we add a root
+
+   + a callback that intercepts looking up a root item
+
+   + a callback that intercepts resolving an UUID to an object ID.
+
+  (The callbacks are in
+  `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go`)
+
+We have 5 unordered queues ("work lists"?); these are sets that when
+it's time to drain them we'll sort the members and process them in
+that order.
+
+ 1. the tree queue: a list of tree IDs that we need to crawl
+ 2. the retry-item queue: for each tree ID, a set of items that we
+    should re-process if we add a root to that tree
+ 3. the added-item queue: a set of key/tree pairs identifying items
+    that have been added by adding a root to a tree
+ 4. the settled-item-queue: a set of key/tree pairs that have have not
+    just been added by adding a root, but we've also verified that
+    they are the newest-generation item with that key that we could
+    add to the tree.
+ 5. the augment queue: for each item that we want to add to a tree,
+    the list of roots that we could add to get that item.
+
+The roots all start out empty, except for the tree queue, which we
+seed with the `ROOT_TREE`, the `CHUNK_TREE`, and the
+`BLOCK_GROUP_TREE` (It is a "TODO" task that it should probably also
+be seeded with the `TREE_LOG`, but as I will say below in the "future
+work" section, I don't actually understand the `TREE_LOG`, so I
+couldn't implement it).
+
+Now we're going to loop until the tree queue, added-item queue,
+settled-item queue, and augment queue are all empty (all queues except
+for the retry-item queue).  Each loop "pass" has 3 substeps:
+
+ 1. Crawl the trees (drain the tree queue, fill the added-item queue).
+
+ 2. Either:
+
+    a. if the added-item queue is non-empty: "settle" those items
+       (drain the added-item queue, fill the augment queue and the
+       settled-item queue).
+
+    b. otherwise: process items (drain the settled-item queue, fill
+       the augment queue and the tree queue)
+
+ 3. Apply augments (drain the augment queue and maybe the retry-item
+    queue, fill the added-item queue).
+
+OK, let's look at those 3 substeps in more detail:
+
+ 1. Crawl the trees; drain the tree queue, fill the added-item queue.
+
+    We just look up the tree in the rebuilt forrest, which will (per
+    the above `--rebuild` algorithm) will either fail to look up the
+    tree, or succeed, and add to that tree the root node from the
+    superblock/root-item.  Because we set an item-added callback, when
+    adding that root it will loop over the nodes added by that root,
+    and call our callback for each item in one of the added nodes.
+    Our callback inserts each item into the added-item queue.  The
+    forrest also calls our root-added callback, but because of the way
+    this algorithm works, that turns out to be a no-op at this step.
+
+    I mentioned that we added callbacks to intercept the forrest's
+    looking up of root items and resolving UUIDs; we override the
+    forrest's "lookup root item" routine and "resolve UUID" routine to
+    instead of doing normal lookups on the `ROOT_TREE` and
+    `UUID_TREE`, use the above `WantXXX` routines that we'll define
+    below in the "graph callbacks" section.
+
+    It shouldn't matter what order this queue is processed in, but I
+    sort tree IDs numerically.
+
+    The crawling is fairly fast because it's just in-memory, the only
+    accesses to disk are looking up root items and resolving UUIDs.
+
+ 2. Either:
+
+    a. Settle items from the added-item queue to the settled-item queue
+       (and fill the augment queue).
+
+       For each item in the queue, we look in the tree's item index to
+       get the {node, slot} pair for it, then we do the same in the
+       tree's potential-item index.  If the potential-item index
+       contains an entry for the item's key, then we check if the
+       potential-item's node should "win" over the queue item's node,
+       deciding the "winner" using the same routine as when building
+       the item index.  If the potential-item's node wins, then we add
+       the potential node's set of roots to the augment queue.  If the
+       queue-item's node wins, then we add the item to the
+       settled-item queue (except, as an optimization, if the item is
+       of a type that cannot possibly imply the existence of another
+       item, then we just drop it and don't add it to the settled-item
+       queue).
+
+       It shouldn't matter what order this queue is processed in, but
+       I sort it numerically by treeID and then by item key.
+
+       This step is fairly fast because it's entirely in-memory,
+       making no accesses to disk.
+
+    b. Process items from the settled-item queue (drain the
+       settled-item queue, fill the augment queue and the tree queue).
+
+       This step accesses disk, and so the order we process the queue
+       in turns out to be pretty important in order to keep our disk
+       access patterns cache-friendly.  For the most part, we just
+       sort each queue item by tree, then by key.  But, we have
+       special handling for `EXTENT_ITEM`s, `METADATA_ITEM`s, and
+       `EXTENT_DATA_REF` items: We break `EXTENT_ITEM`s and
+       `METADATA_ITEM`s in to "sub-items", treating each ref embedded
+       in them as a separate item.  For those embedded items that are
+       `EXTENT_DATA_REF`s, and for stand-alone `EXTENT_DATA_REF`
+       items, we sort them not with the `EXTENT_TREE` items, but with
+       the items of the tree that the extent data ref points at.
+       Recall that during the intitial scan step, we took note of
+       which tree every extent data ref points at, so we can perform
+       this sort without accessing disk yet.  This splitting does mean
+       that we may visit/read an `EXTENT_ITEM` or `METADATA_ITEM`
+       multiple times as we process the queue, but to do otherwise is
+       to solve MinLA, which is NP-hard and also an optimal MinLA
+       solution I still think would perform worse than this; there is
+       a reasonably lengthy discussion of this in a comment in
+       `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:sortSettledItemQueue()`.
+
+       Now we loop over that sorted queue.  In the code, this loop is
+       deceptively simple.  Read the item, then pass it to a function
+       that tells us what other items are implied by it.  That
+       function is large, but simple; it's just a giant table.  The
+       trick is how it tells us about implied items; we give it set of
+       callbacks that it calls to tell us these things; the real
+       complexity is in the callbacks.  These "graph callbacks" will
+       be discussed in detail below, but as an illustrative example:
+       It may call `.WantOff()` with a tree ID, object ID, item type,
+       and offset to specify a precise item that it believes should
+       exist.
+
+       If we encounter a `ROOT_ITEM`, add the tree described by that
+       item to the tree queue.
+
+    (Both the "can this item even imply the existence of another item"
+    check and the "what items are implied by this item" routine are in
+    `./lib/btrfscheck/graph.go`)
+
+ 3. Apply augments; drain the augment queue (and maybe the retry-item
+    queue), fill the added-item queuee.
+
+    It is at this point that I call out that the augment queue isn't
+    implemented as a simple map/set like the others, the
+    `treeAugmentQueue struct` has special handling for sets of
+    different sizes; optimizing the space for empty and len()==1 sized
+    sets, and falling back to normal the usual implementation for
+    larger sets; this is important because those small sets are the
+    overwhelming majority, and otherwise there's no way the program
+    would be able to run on my 32GB RAM laptop.  Now that I think
+    about it, I bet it would even be worth it to add optimized storage
+    for len()==2 sized sets.
+
+    The reason is that each "want" from above is tracked in the queue
+    separately; if we were OK merging them, then this optimized
+    storage wouldn't be nescessary.  But we keep them separate, so
+    that:
+
+    - For all "wants", including ones with empty sets, graph callbacks
+      can check if a want has already been processed; avoiding
+      re-doing any work (see the description of the graph callbacks
+      below).
+
+    - For "wants" with non-empty sets, we can see how many different
+      "wants" could be satisfied with a given root, in order to decide
+      which root to choose.
+
+    Anyway, we loop over the trees in the augment queue.  For each
+    tree we look at that tree's augment queue and look at all the
+    choices of root nodes to add (below), and decide on a list to add.
+    The we add each of those roots to the tree; the adding of each
+    root triggers several calls to our item-added callback (filling
+    the added-item queue), and our root-added callback.  The
+    root-added callback moves any items from the retry-item queue for
+    this tree to the added-item queue.
+
+    How do we decide between choices of root nodes to add?
+    `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:resolveTreeAugments()`
+    has a good comment explaining the criteria we'd like to optimize
+    for, and then code that does an OK-ish job of actually optimizing
+    for that:
+	
+    - It loops over the augment queue for that tree, building a list
+      of possible roots, for each possible root making note of 3
+      things:
+
+       a. how many "wants" that root satisfies,
+
+       b. how far from treee the root's owner is (owner=tree is a
+          distance of 0, owner=tree.parent is a distance of 1,
+          owner=tree.parent.parent is a distance of 2, and so on), and
+          
+       c. what the generation of that root is.
+       
+    - We sort that list first by highest-count-first, then by
+      lowest-distance-first, then by highest-generation-first.
+      
+    - We create a "return" set and an "illegal" set.  We loop over the
+      sorted list; for each possible root if it is in the illegal set,
+      we skip it, otherwise we insert it into the return set and for
+      each "want" that includes this root we all all roots that
+      satisfy that want to the illegal list.
+
+It is important that the rebuilt forrest have the flag set so that it
+refuses to look up a tree if it can't look up all of that tree's
+ancestors; otherwise the potential-items index would be garbage as we
+wouldn't have a good idea of which nodes are OK to consider; but this
+does have the downside that it won't even attempt to improve a tree
+with a missing parent.  Perhaps the algorithm should flip the flag
+once the loop terminates, and then re-seed the tree queue with each
+`ROOT_ITEM` from the `ROOT_TREE`?
+
+#### 4.3.3.3. graph callbacks
+
+(see: `./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go`)
+
+The graph callbacks are what tie the above together.
+
+For each of these callbacks, whenever I say that it looks up something
+in a tree's item index or potential-item index, that implies looking
+the tree up from the forrest; if the forrest cannot look up that tree,
+then the callback returns early, after either:
+
+ - if we are in substep 1 and are processing a tree: we add the tree
+   that is being processed to the tree queue.  (TODO: Wait, this
+   assumes that an augment will be applied to the `ROOT_TREE` before
+   the next pass... if that isn't the case, this will result in the
+   loop never terminating... I guess I need to add a separate
+   retry-tree queue?)
+
+ - if we are in substep 2 and are processing an item: we add the item
+   that is being processed to the retry-item queue for the tree that
+   cannot be looked up
+
+The 6 methods in the `brfscheck.GraphCallbacks` interface are:
+
+ 1. `FSErr()`: There's an error with the filesystem; this callback
+    just spits it out on stderr.  I say such a trivial matter because,
+    again, for a recovery tool I think it's worth putting care in to
+    how you handle errors and where you expect them: We expect them
+    here, so we have to check for them to avoid reading invalid data
+    or whatever, but we don't actually need to do anything other than
+    watch our step.
+
+ 2. `Want()`: We want an item in a given tree with a given object ID
+    and item type, but we don't care about what the item's offset is.
+
+    The callback works by searching the item index to see if it can
+    find such an item; if so, it has nothing else to do and returns.
+    Otherwise, it searches the potential-item index; for each matching
+    item it finds it looks in the node index for the node containing
+    that item, and adds the roots that would add that node, and adds
+    those roots to a set.  Once it has finished searching the
+    potential-item index, it adds that set to the augment queue (even
+    if that set is still empty).
+
+ 3. `WantOff()`: The same, but we want a specific offset.
+
+ 4. `WantDirIndex()`: We want a `DIR_INDEX` item for a given inode and
+    filename, but we don't know what the offset of that item is.
+
+    First we scan over the item index, looking at all `DIR_INDEX`
+    items for that inode number.  For each item, we can check the scan
+    data to see what the filename in that `DIR_INDEX` is, so we can
+    see if the item satisfies this want without accessing the disk.
+    If there's a match, then there is nothing else to do, so we
+    return.  Otherwise, we do that same search over the potential-item
+    index; if we find any matches, then we build the set of roots to
+    add to the augment queue the same as in `Want`.
+
+ 5. `WantFileExt()`: We want 1 or more `DATA_EXTENT` items in the
+    given tree for the given inode, and we want them to cover from 0
+    to a given size bytes of that file.
+
+    First we walk that range in the item index, to build a list of the
+    gaps that we need to fill ("Step 1" in
+    `rebuild_wantcb.go:_wantRange()`).  This walk
+    (`rebuild_wantcb.go:_walkRange()`) requires knowing the size of
+    each file extent; so doing this quickly without hitting disk is
+    why we recorded the size of each file extent in our initialization
+    step.
+	
+	Then ("Step 2" in `_wantRange()`) we iterate over each of the
+    gaps, and for each gap do a very similar walk (again, by calling
+    `_walkRange()`, but this time over the potential-item index.  For
+    each file extent we find that has is entirely within the gap, we
+    "want" that extent, and move the beginning of of the gap forward
+    to the end of that extent.  This algorithm is dumb and greedy,
+    potentially making sub-optimal selections; and so could probably
+    stand to be improved; but in my real-world use, it seems to be
+    "good enough".
+
+ 6. `WantCSum()`: We want 1 or more `EXTENT_CSUM` items to cover the
+    half-open interval [`lo_logical_addr`, `hi_logical_addr`).  Well,
+    maybe.  It also takes a subvolume ID and an inode number; and
+    looks up in the scan data whether that inode has the
+    `INODE_NODATASUM` flag set; if it does have the flag set, then it
+    returns early without looking for any `EXTENT_CSUM` items.  If it
+    doesn't return early, then it performs the same want-range routine
+    as `WantFileExt`, but with the appropriate tree, object ID, and
+    item types for csums as opposed to data extents.
+
+For each of these callbacks, we generate a "wantKey", a tuple
+representing the function and its arguments; we check the
+augment-queue to see if we've already enqueued a set of roots for that
+want, and if so, that callback can return early without checking the
+potential-item index.
+
+# 5. Future work
+
+It's in a reasonably useful place, I think; and so now I'm going to
+take a break from it for a while.  But there's still lots of work to
+do:
+
+ - RAID almost certainly doesn't work.
+
+ - Encryption is not implemented.
+
+ - It doesn't understand (ignores) the `TREE_LOG` (because I don't
+   understand the `TREE_LOG`).
+
+ - `btrfs-rec inspect mount` should add "lost+found" directories for
+   inodes that are included in the subvolume's tree but aren't
+   reachable from the tree's root inode
+
+ - I still need to implement `btrfs-rec repair SUBCMD` subcommands to
+   write rebuilt-information from `btrfs-rec inspect` back to the
+   filesystem.
+
+ - I need to figure out the error handling/reporting story for
+   `mount`.
+
+ - It needs a lot more tests
+
+    + I'd like to get the existing btrfs-progs fsck tests to run on
+      it.
+
+ - In the process of writing this email, I realized that I probably
+   need to add a retry-tree queue; see the "graph callbacks" section
+   in the description of the `rebuild-trees` algorithm above.
+
+ - Shere are a number of "TODO" comments or panics in the code:
+
+    + Some of them definitely need done.
+
+    + Some of them are `panic("TODO")` on the basis that if it's
+      seeing something on the filesystem that it doesn't recognize,
+      it's probably that I didn't get to implementing that
+      thing/situation, but it's possible that the thing is just
+      corrupt.  This should only be for situations that the node
+      passed the checksum test, so it being corrupt would have to be
+      caused by a bug in btrfs rather than a failing drive or other
+      corruption; I wasn't too worried about btrfs bugs.
+
+ - `btrfs-rec inspect rebuild-trees` is slow, and can probably be made
+   a lot faster.
+
+   Just to give you an idea of the speeds, the run-times for the
+   various steps on my ThinkPad E15 for a 256GB disk image are as
+   follows:
+
+        btrfs-rec inspect rebuild-mappings scan       :     7m 31s
+        btrfs-rec inspect rebuild-mappings list-nodes :        47s
+        btrfs-rec inspect rebuild-mappings process    :     8m 22s
+        btrfs-rec inspect rebuild-trees               : 1h  4m 55s
+        btrfs-rec inspect ls-files                    :    29m 55s
+        btrfs-rec inspect ls-trees                    :     8m 40s
+
+   For the most part, it's all single-threaded (with the main
+   exception that in several places I/O has been moved to a separate
+   thread from the main CPU-heavy thread), but a lot of the algorithms
+   could be parallelized.
+
+ - There are a lot of "tunable" values that I haven't really spent
+   time tuning.  These are all annotated with `textui.Tunable()`.  I
+   sort-of intended for them to be adjustable on the CLI.
+
+ - Perhaps the `btrfs inspect rebuild-trees` algorithm could be
+   adjusted to also try to rebuild trees with missing parents; see the
+   above discussion of the algorithm.
+
+# 6. Problems for merging this code into btrfs-progs
+
+ - It's written in Go, not C.
+
+ - It's effectively GPLv3+ (not GPLv2-only or GPLv2+) because of use
+   of some code under the Apache 2.0 license (2 files in the codebase
+   itself that are based off of Apache-licensed code, and use of
+   unmodified 3rd-party libraries).
+
+ - It uses ARC (Adaptive Replacement Cache), which is patented by IBM,
+   and the patent doesn't expire for another 7 months.  An important
+   property of ARC over LRU is that it is scan-resistant; the above
+   algorithms do a lot of scanning.  On that note, now that RedHat is
+   owned by IBM: who in the company do we need to get to talk to
+   eachother so that we can get ARC into the Linux kernel before then?
+
+<div style="font-family: monospace">
+-- <br/>
+Happy hacking,<br/>
+~ Luke Shumaker<br/>
+</div>
author	Luke Shumaker <lukeshu@lukeshu.com>	2023-07-14 14:38:38 -0700
committer	Luke Shumaker <lukeshu@lukeshu.com>	2023-07-14 15:24:09 -0700
commit	c045654a862bc1119fa4e7584fff9d2a965192ea (patch)
tree	15b075deea49efdba94df64f90a29fbc5aea0ba8
parent	8c11dbdaacd8ec0b75d9110fe326955f01c45bee (diff)