diff --git a/public/btrfs-rec.html b/public/btrfs-rec.html
new file mode 100644
index 0000000..3012538
--- /dev/null
+++ b/public/btrfs-rec.html
@@ -0,0 +1,1170 @@
+<!DOCTYPE html>
+<html lang="en">
+<head>
+ <meta charset="utf-8">
+ <title>Announcing: btrfs-rec: Recover (data from) a broken btrfs filesystem — Luke Shumaker</title>
+ <link rel="stylesheet" href="assets/style.css">
+ <link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/>
+</head>
+<body>
+<header><a href="/">Luke Shumaker</a> » <a href=/blog>blog</a> » btrfs-rec</header>
+<article>
+<h1
+id="announcing-btrfs-rec-recover-data-from-a-broken-btrfs-filesystem">Announcing:
+btrfs-rec: Recover (data from) a broken btrfs filesystem</h1>
+<blockquote>
+<p>I originally sent this email on 2023-07-10, but it was caught by
+vger's bogofilter. Yes, it was <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/README.md?id=18e6066c241cf3d252b6521150843ffc858d8434">plaintext</a>.
+No, I didn't use GMail. Yes, I've successfully participated in vger
+lists in the past. Yes, I've reached out to postmaster; no, I haven't
+received a reply yet (as of 2023-07-14).</p>
+</blockquote>
+<div style="font-family: monospace">
+<p>To: linux-btrfs@vger.kernel.org<br/> From: Luke T. Shumaker
+&lt;lukeshu@lukeshu.com&gt;<br/> Subject: btrfs-rec: Recover (data from)
+a broken btrfs filesystem<br/> Date: Mon, 10 Jul 2023 21:23:41
+-0600<br/> Message-ID:
+&lt;87jzv7uo5e.wl-lukeshu@lukeshu.com&gt;<br/></p>
+</div>
+<p>Inspired by a mis-typed <code>dd</code> command, for the last year
+I've been working on a tool for recovering corrupt btrfs filesystems; at
+first idly here and there, but more actively in the last few months. I
+hope to get it incorporated into btrfs-progs, though perhaps that is
+problematic for a few reasons I'll get to. If the code can't be
+incorporated into btrfs-progs, at least the ideas and algorithms should
+be.</p>
+<p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/">https://git.lukeshu.com/btrfs-progs-ng/</a></p>
+<p>Highlights:</p>
+<ul>
+<li><p>In general, it's more tolerant of corrupt filesystems than
+<code>btrfs check --repair</code>, <code>btrfs rescue</code> or
+<code>btrfs restore</code>.</p></li>
+<li><p><code>btrfs-rec inspect rebuild-mappings</code> is a better
+<code>btrfs rescue chunk-recover</code>.</p></li>
+<li><p><code>btrfs-rec inspect rebuild-trees</code> can re-attach lost
+branches to broken B+ trees.</p></li>
+<li><p><code>btrfs-rec inspect mount</code> is a read-only FUSE
+implementation of btrfs. This is conceptually a replacement for
+<code>btrfs restore</code>.</p></li>
+<li><p>It's entirely written in Go. I'm not saying that's a good thing,
+but it's an interesting thing.</p></li>
+</ul>
+<p>Hopefully some folks will find it useful, or at least neat!</p>
+<ul>
+<li><a href="#motivation">1. Motivation</a></li>
+<li><a href="#overview-of-use">2. Overview of use</a></li>
+<li><a href="#prior-art">3. Prior art</a></li>
+<li><a href="#internalsdesign">4. Internals/Design</a></li>
+<li><a href="#overview-of-the-source-tree-layout">4.1. Overview of the
+source tree layout</a></li>
+<li><a href="#base-decisions-cli-structure-go-json">4.2. Base decisions:
+CLI structure, Go, JSON</a></li>
+<li><a href="#algorithms">4.3. Algorithms</a></li>
+<li><a href="#the-rebuild-mappings-algorithm">4.3.1. The
+<code>rebuild-mappings</code> algorithm</a></li>
+<li><a href="#the---rebuild-algorithm">4.3.2. The <code>--rebuild</code>
+algorithm</a></li>
+<li><a href="#rebuilt-forrest-behavior-looking-up-trees">4.3.2.1.
+rebuilt forrest behavior</a></li>
+<li><a href="#rebuilt-individual-tree-behavior">4.3.2.2. rebuilt
+individual tree behavior</a></li>
+<li><a href="#the-rebuild-trees-algorithm">4.3.3. The
+<code>rebuild-trees</code> algorithm</a></li>
+<li><a href="#initialization">4.3.3.1. initialization</a></li>
+<li><a href="#the-main-loop">4.3.3.2. the main loop</a></li>
+<li><a href="#graph-callbacks">4.3.3.3. graph callbacks</a></li>
+<li><a href="#future-work">5. Future work</a></li>
+<li><a href="#problems-with-merging-this-code-into-btrfs">6. Problems
+for merging this code into btrfs-progs</a></li>
+</ul>
+<h1 id="motivation">1. Motivation</h1>
+<p>Have you ever ended up with a corrupt btrfs filesystem (through no
+fault of btrfs itself, but perhaps a failing drive, or a mistaken
+<code>dd</code> invocation)? Surely losing less than 100MB of data from
+a drive should not render hundreds of GB of perfectly intact data
+unreadable! And yet, the existing tools are unable to even attempt to
+read that data:</p>
+<pre><code>$ btrfs check --repair --force dump-zero.1.img
+enabling repair mode
+Opening filesystem to check...
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+ERROR: cannot read chunk root
+ERROR: cannot open file system</code></pre>
+<p>or</p>
+<pre><code>$ btrfs check --init-extent-tree --force dump-zero.1.img
+Opening filesystem to check...
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+ERROR: cannot read chunk root
+ERROR: cannot open file system</code></pre>
+<p>or</p>
+<pre><code>$ btrfs check --init-csum-tree --force dump-zero.1.img
+Creating a new CRC tree
+Opening filesystem to check...
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+ERROR: cannot read chunk root
+ERROR: cannot open file system</code></pre>
+<p>or</p>
+<pre><code>$ btrfs rescue chunk-recover dump-zero.1.img
+Scanning: DONE in dev0
+corrupt node: root=1 block=160410271744 slot=0, corrupt node: root=1 block=160410271744, nritems too large, have 39 expect range [1,0]
+Couldn&#39;t read tree root
+open with broken chunk error</code></pre>
+<p>or</p>
+<pre><code>$ btrfs rescue zero-log dump-zero.1.img
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+ERROR: cannot read chunk root
+ERROR: could not open ctree</code></pre>
+<p>or</p>
+<pre><code>$ mkdir out
+$ btrfs restore dump-zero.1.img out
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+ERROR: cannot read chunk root
+Could not open root, trying backup super
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+ERROR: cannot read chunk root
+Could not open root, trying backup super
+ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
+Could not open root, trying backup super</code></pre>
+<p>or</p>
+<pre><code>$ btrfs restore --list-roots dump-zero.1.img
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+ERROR: cannot read chunk root
+Could not open root, trying backup super
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
+bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
+ERROR: cannot read chunk root
+Could not open root, trying backup super
+ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
+Could not open root, trying backup super</code></pre>
+<p>or</p>
+<pre><code>$ btrfs-find-root dump-zero.1.img
+WARNING: cannot read chunk root, continue anyway
+Superblock thinks the generation is 6596071
+Superblock thinks the level is 1</code></pre>
+<p>Well, have I got a tool for you!</p>
+<p>(FWIW, I also tried manipulating the filesystem and patching the
+tools to try to get past those errors, only to get a different set of
+errors. Some of these patches I am separately submitting to
+btrfs-progs.)</p>
+<h1 id="overview-of-use">2. Overview of use</h1>
+<p>There are two <code>btrfs-rec</code> sub-command groups:
+<code>btrfs-rec inspect <var>SUBCMD</var></code> and <code>btrfs-rec
+repair <var>SUBCMD</var></code>, and you can find out about various
+sub-commands with <code>btrfs-rec help</code>. These are both told about
+devices or images with the <code>--pv</code> flag.</p>
+<p><code>btrfs-rec inspect <var>SUBCMD</var></code> commands open the
+filesystem read-only, and (generally speaking) write extracted or
+rebuilt information to stdout. <code>btrfs-rec repair
+<var>SUBCMD</var></code> commands open the filesystem read+write, and
+consume information from <code>btrfs-rec inspect
+<var>SUBCMD</var></code> commands to actually repair the filesystem
+(except I haven't actually implemented any <code>repair</code> commands
+yet... despite the lack of <code>repair</code> commands, I believe that
+<code>btrfs-rec</code> is already useful because of the
+<code>btrfs-rec inspect mount</code> command to get data out of the
+broken filesystem). This split allows you to try things without being
+scared by WARNINGs about not using these tools unless you're an expert
+or have been told to by a developer.</p>
+<p>In the broken <code>dump-zero.1.img</code> example above (which has a
+perfectly intact superblock, but a totally broken
+<code>CHUNK_TREE</code>), to "repair" it I'd:</p>
+<ol type="1">
+<li><p>Start by using <code>btrfs-rec inspect rebuild-mappings</code> to
+rebuild the broken chunk/dev/blockgroup trees:</p>
+<pre><code>$ btrfs-rec inspect rebuild-mappings \
+ --pv=dump-zero.1.img \
+ &gt; mappings-1.json</code></pre></li>
+<li><p>It only mostly succeeds, and on stderr tells us about a few
+regions of the image that it wasn't able to figure out the chunks for.
+Using some human-level knowledge, you can write those yourself,
+insert them into the generated <code>mappings.json</code>, and ask
+<code>rebuild-mappings</code> to normalize what you wrote:</p>
+<pre><code>$ btrfs-rec inspect rebuild-mappings \
+ --pv=dump-zero.1.img \
+ --mappings=&lt;(sed &lt;mappings-1.json \
+ -e &#39;2a{&quot;LAddr&quot;:5242880,&quot;PAddr&quot;:{&quot;Dev&quot;:1,&quot;Addr&quot;:5242880},&quot;Size&quot;:1},&#39; \
+ -e &#39;2a{&quot;LAddr&quot;:13631488,&quot;PAddr&quot;:{&quot;Dev&quot;:1,&quot;Addr&quot;:13631488},&quot;Size&quot;:1},&#39;) \
+ &gt; mappings-2.json</code></pre></li>
+<li><p>Now that it has functioning chunk/dev/blockgroup trees, we can
+use <code>btrfs-rec inspect rebuild-trees</code> to rebuild other trees
+that rely on those:</p>
+<pre><code>$ btrfs-rec inspect rebuild-trees \
+ --pv=dump-zero.1.img \
+ --mappings=mappings-2.json \
+ &gt; trees.json</code></pre></li>
+<li><p>Now that (hopefully) everything that was damaged has been
+reconstructed, we can use <code>btrfs-rec inspect mount</code> to mount
+the filesystem read-only and copy out our data:</p>
+<pre><code>$ mkdir mnt
+$ sudo btrfs-rec inspect mount \
+ --pv=dump-zero.1.img \
+ --mappings=mappings-2.json \
+ --trees=trees.json \
+ ./mnt</code></pre></li>
+</ol>
+<p>This example is fleshed out more (and the manual edits to
+<code>mappings.json</code> explained more) in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples/main.sh?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./examples/main.sh</code></a>.</p>
+<h1 id="prior-art">3. Prior art</h1>
+<p>Comparing <code>btrfs-rec inspect mount</code> with the existing
+https://github.com/adam900710/btrfs-fuse project:</p>
+<ul>
+<li>Again, mine has better fault tolerance</li>
+<li>Mine is read-only</li>
+<li>Mine supports xattrs ("TODO" in Adam's)</li>
+<li>Mine supports separate inode address spaces for subvolumes; Adam's
+doesn't due to limitations in FUSE. Mine works around this by lazily
+setting up separate mountpoints for each subvolume (though this does
+mean that the process needs to run as root, which is a bummer).</li>
+</ul>
+<h1 id="internalsdesign">4. Internals/Design</h1>
+<h2 id="overview-of-the-source-tree-layout">4.1. Overview of the source
+tree layout</h2>
+<ul>
+<li><p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>examples/</code></a>
+has example scripts showing how to use <code>btrfs-rec</code>.</p></li>
+<li><p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfs?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfs/</code></a>
+is the core btrfs implementation.</p></li>
+<li><p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfscheck?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfscheck/</code></a>
+and <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfsutil/</code></a>
+are libraries for "btrfs-progs"-type programs: userland-y
+things that I thought should be separate from the core implementation.
+Something that frustrated me about libbtrfs was having to figure out "is
+this thing here in support of btrfs bits-on-disk, or in support of a
+higher-level 'how btrfs-progs wants to think about things'?"</p></li>
+<li><p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>cmd/btrfs-rec/</code></a>
+is where the command implementations live. If a sub-command fits in a
+single file, it's
+<code>cmd/btrfs-rec/inspect_<var>SUBCMD</var>.go</code>, otherwise, it's
+in a separate <code>cmd/btrfs-rec/inspect/<var>SUBCMD</var>/</code>
+package.</p></li>
+<li><p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/textui?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/textui/</code></a>
+is reasonably central to how the commands implement a text/CLI
+user-interface.</p></li>
+<li><p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/binstruct?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/binstruct/</code></a>,
+<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/diskio?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/diskio/</code></a>,
+and <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/streamio?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/streamio/</code></a>
+are non-btrfs-specific libraries related to the problem domain.</p></li>
+<li><p><a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/containers?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/containers/</code></a>,
+<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/fmtutil?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/fmtutil/</code></a>,
+<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/maps?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/maps/</code></a>,
+<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/slices?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/slices/</code></a>,
+and <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/profile?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/profile/</code></a>
+are all generic Go libraries that have nothing to do with btrfs or the
+problem domain, but weren't in the Go standard library and I didn't
+find/know-of existing implementations that I liked. Of these, all but
+<code>containers</code> are pretty simple utility libraries. Also, some
+of these things have been added to the standard library since I started
+the project.</p></li>
+</ul>
+<h2 id="base-decisions-cli-structure-go-json">4.2. Base decisions: CLI
+structure, Go, JSON</h2>
+<p>I started with trying to enhance btrfs-progs, but ended up writing a
+wholly new program in Go, for several reasons:</p>
+<ul>
+<li><p>writing a new thing: I was having to learn both the btrfs-progs
+codebase and how btrfs-bits-on-disk work, and it got to the point that I
+decided I should just focus on learning btrfs-bits-on-disk.</p></li>
+<li><p>writing a new thing: It was becoming increasingly apparent to me
+that it was going to be an uphill fight to have recovery-tools share
+the same code as the main-tools, as the routines used by the main-tools
+rightly have validity checks, whereas recovery-tools want to say "yes, I
+know it's invalid, can you give it to me anyway?".</p></li>
+<li><p>writing it in not-C: I love me some C, but higher level languages
+are good for productivity. And since I was trying to write a whole lot
+of code at once, I needed a productivity boost.</p></li>
+<li><p>writing it in not-C: This forced me to learn btrfs-bits-on-disk
+better, instead of just cribbing from btrfs-progs. That knowledge is
+particularly important for having ideas on how to deal with corrupt
+bits-on-disk.</p></li>
+<li><p>writing it in Go: At the time I started, my day job was writing
+Go, so I had Go swapped into my brain. And Go still feels close to C but
+provides <em>a lot</em> of niceness and safety over C.</p></li>
+</ul>
+<p>It turned out that Go was perhaps not the best choice, but we'll come
+back to that.</p>
+<p>I wanted to separate things into a pipeline. For instance: Instead of
+<code>btrfs rescue chunk-recover</code> trying to do everything to
+rebuild a broken chunk tree, I wanted to separate I/O from computation
+from repairs. So I have
+<code>btrfs-rec inspect rebuild-mappings scan</code>, which reads all the
+info necessary to rebuild the chunk tree, then dumps that as a 2GB glob
+of JSON. Then I can feed that JSON to
+<code>btrfs-rec inspect rebuild-mappings process</code>, which actually
+rebuilds the mappings in the chunk tree, and dumps them as JSON. And
+then other commands can consume that <code>mappings.json</code> instead
+of trying to read the chunk tree from the actual FS, so
+that you don't have to make potentially destructive writes to inspect an
+FS with a broken chunk tree, and can inspect it more forensically. Or
+you can then use <code>btrfs-rec repair
+<var>SOME_SUBCMD_I_HAVENT_WRITTEN_YET</var></code> to write that chunk
+tree in <code>mappings.json</code> back to the filesystem.</p>
+<p>(But also, the separate-steps thing was useful just so I could
+iterate on the algorithms of <code>rebuild-mappings process</code>
+separately from having to scan the entire FS.)</p>
+<p>So, I made the decision that <code>btrfs-rec inspect
+<var>SUBCMD</var></code> commands should all only open the FS read-only,
+and output their work to a separate file; that writing that info back to
+the FS should be separate in <code>btrfs-rec repair
+<var>SUBCMD</var></code>.</p>
+<p>For connecting those parts of the pipeline, I chose JSON, for a few
+reasons:</p>
+<ul>
+<li><p>I wanted something reasonably human-readable, so that I could
+debug it easier.</p></li>
+<li><p>I wanted something reasonably human-readable, so that human
+end-users could make manual edits; for example, in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples/main.sh?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>examples/main.sh</code></a>
+I have an example of manually editing <code>mappings.json</code> to
+resolve a region that the algorithm couldn't figure out, but with
+knowledge of what caused the corruption a human can.</p></li>
+<li><p>I didn't want to invent my own DSL and have to handle writing a
+parser. (This part didn't pay off! See below.)</p></li>
+<li><p>I wanted something that I thought would have good support in a
+variety of languages, so that if Go is problematic for getting things
+merged upstream it could be rewritten in C (or maybe Rust?) piece-meal
+where each subcommand can be rewritten one at a time.</p></li>
+</ul>
+<p>It turned out that JSON was perhaps not the best choice.</p>
+<p>OK, so: Go and/or JSON maybe being mistakes:</p>
+<ul>
+<li><p>I spent a lot of time getting the garbage collector to not just
+kill performance.</p></li>
+<li><p>The <code>btrfs-rec inspect rebuild-mappings
+<var>SUBCMD</var></code> subcommands all throw a lot of data through the
+JSON encoder/decoder, and I learned that the Go stdlib
+<code>encoding/json</code> package has memory use that grows O(n^2)
+(-ish? I didn't study the implementation, but that's what the curve
+looks like just observing it) on the size of the data being shoved
+through it, so I had to take a break and go write
+https://pkg.go.dev/git.lukeshu.com/go/lowmemjson, which is a
+mostly-drop-in replacement that tries to be as close as possible to O(1)
+memory use (see the sketch just after this list). So I did end up having
+to write my own parser anyway :(</p></li>
+</ul>
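+<p>For a sense of scale, here is a minimal sketch of the kind of call
+site involved. The <code>Mapping</code> type is hypothetical (field
+names borrowed from the JSON shown in the "Overview of use" section);
+the stdlib API is real: <code>json.NewEncoder(w).Encode(v)</code>
+marshals the entire value into an in-memory buffer before writing any
+of it, which is what hurts on a multi-GB value, and is the pattern that
+lowmemjson aims to replace drop-in.</p>
+<pre><code>import (
+    &quot;encoding/json&quot;
+    &quot;io&quot;
+)
+
+// Hypothetical type; field names match the JSON lines shown in the
+// &quot;Overview of use&quot; section above.
+type Mapping struct {
+    LAddr int64
+    PAddr struct {
+        Dev  int64
+        Addr int64
+    }
+    Size int64
+}
+
+// With the stdlib, Encode() marshals all of `mappings` into memory
+// before writing it to w; that buffering is where the memory went.
+func dumpMappings(w io.Writer, mappings []Mapping) error {
+    return json.NewEncoder(w).Encode(mappings)
+}</code></pre>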
+<h2 id="algorithms">4.3. Algorithms</h2>
+<p>There are 3 algorithms of note in <code>btrfs-rec</code>, that I
+think are worth getting into mainline btrfs-progs even if the code of
+<code>btrfs-rec</code> doesn't get in:</p>
+<ol type="1">
+<li><p>The <code>btrfs-rec inspect rebuild-mappings</code> algorithm
+to rebuild information from the <code>CHUNK_TREE</code>,
+<code>DEV_TREE</code>, and <code>BLOCK_GROUP_TREE</code>.</p></li>
+<li><p>The <code>btrfs-rec --rebuild</code> algorithm to cope with
+reading broken B+ trees.</p></li>
+<li><p>The <code>btrfs-rec inspect rebuild-trees</code> algorithm to
+re-attach lost branches to broken B+ trees.</p></li>
+</ol>
+<h3 id="the-rebuild-mappings-algorithm">4.3.1. The
+<code>rebuild-mappings</code> algorithm</h3>
+<p>(This step-zero scan is
+<code>btrfs-rec inspect rebuild-mappings scan</code>, and principally
+lives in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/scan.go</code></a>
+and <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/scan.go</code></a>)</p>
+<ol start="0" type="1">
+<li>Similar to <code>btrfs rescue chunk-recover</code>, scan each device
+for things that look like nodes; keep track of:
+<ul>
+<li>Checksums of every block on the device</li>
+<li>Which physical addresses contain nodes that claim to be at a given
+logical address.</li>
+<li>Any found Chunk items, BlockGroup items, DevExtent items, and CSum
+items. Keep track of the key for each of these, and for CSum items also
+track the generation.</li>
+</ul></li>
+</ol>
+<p>Create a bucket of the data from Chunks, DevExtents, and BlockGroups;
+these are mostly redundant with each other: a Chunk and a
+DevExtent+BlockGroup store pretty much the same information, so we can
+use one to reconstruct the other. How
+we "merge" these and handle conflicts is in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfs/btrfsvol/lvm.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n121"><code>./lib/btrfs/btrfsvol/lvm.go:addMapping()</code></a>.
+I don't think this part is particularly clever, but given that
+<code>btrfs rescue chunk-recover</code> crashes if it encounters two
+overlapping chunks, I suppose I should spell it out:</p>
+<ul>
+<li><p>A "mapping" is represented as a group of 4 things:</p>
+<ul>
+<li>logical address</li>
+<li>a list of 1 or more physical addresses (device ID and offset)</li>
+<li>size, and a Boolean indicator of whether the size is "locked"</li>
+<li>block group flags, and a Boolean presence-indicator</li>
+</ul></li>
+<li><p>Mappings must be merged if their logical or physical regions
+overlap.</p></li>
+<li><p>If a mapping has a "locked" size, then when merging it may
+subsume smaller mappings with unlocked sizes, but its size cannot be
+changed; trying to merge a locked-size mapping with another mapping that
+is not for a subset region should return an error.</p></li>
+<li><p>If a mapping has block group flags present, then those flags may
+not be changed; it may only be merged with another mapping that does not
+have flags present, or has identical flags.</p></li>
+<li><p>When returning an error because of overlapping non-mergeable
+mappings, just log an error on stderr and keep going. That's an
+important design thing that is different than normal filesystem code; if
+there's an error, yeah, detect and notify about it, <strong>but don't
+bail out of the whole routine</strong>. Just skip that one item or
+whatever.</p></li>
+</ul>
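+<p>For concreteness, a rough Go sketch of that mapping structure
+(illustrative, not the real type from
+<code>./lib/btrfs/btrfsvol/</code>; the field names are borrowed from
+the JSON shown in the "Overview of use" section above):</p>
+<pre><code>type QualifiedPhysicalAddr struct {
+    Dev  int64 // device ID
+    Addr int64 // byte offset within that device
+}
+
+type Mapping struct {
+    LAddr      int64                   // logical address
+    PAddrs     []QualifiedPhysicalAddr // 1 or more physical addresses
+    Size       int64
+    SizeLocked bool // if set, merging may not change Size
+
+    // Block group flags, with a presence indicator; once present,
+    // the flags may not be changed by a merge.
+    Flags   uint64
+    FlagsOK bool
+}</code></pre>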
+<p>Now that we know how to "add a mapping", let's do that:</p>
+<p>(The following main-steps are
+<code>btrfs-rec inspect rebuild-mappings process</code>, and principally
+live in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/process.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/process.go</code></a>)</p>
+<ol type="1">
+<li><p>Add all found Chunks.</p></li>
+<li><p>Add all found DevExtents.</p></li>
+<li><p>Add a physical:logical mapping of length nodesize for each node
+that was found.</p></li>
+<li><p>For any mappings from steps 2 or 3 that are missing blockgroup
+flags (that is: they couldn't be merged with a mapping from step 1),
+use the found BlockGroups to fill in those flags.</p></li>
+<li><p>Now we'll merge all found CSum items into a map of the sums of
+the logical address space. Sort all of the csum items by generation,
+then by address. Loop over them in that order, inserting their sums into
+the map. If two csum items overlap, but agree about the sums of the
+overlapping region, that's fine, just take their union. For overlaps
+that disagree, items with a newer generation kick out items with an
+older generation. If disagreeing items have the same generation... I
+don't think that can happen except by a filesystem bug (i.e. not by a
+failing drive or other external corruption), so I wasn't too concerned
+about it; I just log an error on stderr and skip the later-processed
+item. See <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/process_sums_logical.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/process_sums_logical.go</code></a>.</p>
+<p>Look at regions of the logical address space that meet all 3
+criteria:</p>
+<ul>
+<li>we have CSum items for them</li>
+<li>we have a BlockGroup for them</li>
+<li>we don't have a Chunk/DevExtent mapping them to the physical address
+space.</li>
+</ul>
+<p>Pair those CSums up with BlockGroups, and for each BlockGroup, search
+the list of checksums of physical blocks to try to find a physical
+region that matches the logical csums (and isn't already mapped to a
+different logical region). I used a Knuth-Morris-Pratt search, modified
+to handle holes in the logical csum list as wildcards.</p>
+<p>Insert any found mappings into our bucket of mappings.</p></li>
+<li><p>Do the same again, but with a fuzzy search (we can re-use the
+csum map of the logical address space). My implementation of this is
+comparatively time and space intensive; I just walk over the entire
+unmapped physical address space, noting what % match each BlockGroup
+would have if placed at that location. I keep track of the best 2
+matches for each BlockGroup. If the best match is better than a 50%
+match, and the second best is less than a 50% match, then I add the
+best match (this decision rule is sketched after this list). In my
+experience, the best match is &gt;90% (or at whatever the maximum
+percent is for how much of the BlockGroup has logical sums), and the
+second best is 0% or 1%. The point of tracking both is that if there
+isn't a clear-cut winner, I don't want it to commit to a potentially
+wrong choice.</p></li>
+</ol>
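+<p>A minimal sketch of that step-6 "clear-cut winner" rule
+(illustrative names, not the actual <code>rebuildmappings</code>
+code):</p>
+<pre><code>// A candidate physical placement for a BlockGroup.
+type match struct {
+    paddr int64   // candidate physical address
+    score float64 // fraction of csums that match, 0.0 to 1.0
+}
+
+// pickPlacement commits to the best candidate only if it is a
+// clear-cut winner over the runner-up.
+func pickPlacement(best, secondBest match) (paddr int64, ok bool) {
+    if best.score &gt; 0.5 &amp;&amp; secondBest.score &lt; 0.5 {
+        return best.paddr, true
+    }
+    return 0, false
+}</code></pre>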
+<h3 id="the---rebuild-algorithm">4.3.2. The <code>--rebuild</code>
+algorithm</h3>
+<p>The <code>--rebuild</code> flag is implied by the
+<code>--trees=trees.json</code> flag, and triggers an algorithm that
+allows "safely" reading from a broken B+ tree, rather than the usual B+
+tree lookup and search functions. I probably should have tried to
+understand the <code>btrfs restore</code> algorithm, maybe I reinvented
+the wheel...</p>
+<p>This algorithm requires a list of all nodes on the filesystem; we
+find these using the same scan as above (<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/scan.go</code></a>),
+the same procedure as <code>btrfs rescue chunk-recover</code>.</p>
+<p>We walk all of those nodes, and build a reasonably lightweight
+in-memory graph of all nodes (<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/graph.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/graph.go</code></a>),
+tracking (a rough Go sketch follows this list):</p>
+<ul>
+<li>each node's
+<ul>
+<li>logical address</li>
+<li>level</li>
+<li>generation</li>
+<li>tree</li>
+<li>each item's key and size</li>
+</ul></li>
+<li>each keypointer's
+<ul>
+<li>source node</li>
+<li>source slot within the node</li>
+<li>tree of the source node</li>
+<li>destination node</li>
+<li>destination level implied by the level of the source node</li>
+<li>destination key</li>
+<li>destination generation</li>
+</ul></li>
+<li>logical addresses and error messages for nodes that are pointed to
+by a keypointer or the superblock, but can't be read (because that
+logical address isn't mapped, or it doesn't look like a node,
+or...)</li>
+<li>an index such that for a given node we can quickly list the
+keypointers both originating at that node and pointing to that
+node.</li>
+</ul>
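+<p>A rough Go sketch of that graph (illustrative names; the real
+structure is in <code>./lib/btrfsutil/graph.go</code>):</p>
+<pre><code>// A btrfs key: ObjectID/ItemType/Offset, as on disk.
+type Key struct {
+    ObjectID uint64
+    ItemType uint8
+    Offset   uint64
+}
+
+type GraphNode struct {
+    LAddr      int64 // logical address of the node
+    Level      uint8
+    Generation uint64
+    Owner      int64 // tree ID
+    Items      []struct {
+        Key  Key
+        Size uint32
+    }
+}
+
+type GraphEdge struct {
+    FromNode     int64 // logical address of the source node
+    FromSlot     int   // slot within the source node
+    FromTree     int64 // owner-tree of the source node
+    ToNode       int64 // logical address of the destination node
+    ToLevel      uint8 // implied by the level of the source node
+    ToKey        Key
+    ToGeneration uint64
+}
+
+type Graph struct {
+    Nodes    map[int64]*GraphNode
+    BadNodes map[int64]error // unreadable nodes, with why
+
+    // Index such that for a given node we can quickly list the
+    // keypointers both originating at it and pointing to it.
+    EdgesFrom map[int64][]*GraphEdge
+    EdgesTo   map[int64][]*GraphEdge
+}</code></pre>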
+<h4 id="rebuilt-forrest-behavior-looking-up-trees">4.3.2.1. rebuilt
+forrest behavior (looking up trees)</h4>
+<p>(see: <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/rebuilt_forrest.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/rebuilt_forrest.go</code></a>)</p>
+<ul>
+<li>The <code>ROOT_TREE</code>, <code>CHUNK_TREE</code>,
+<code>TREE_LOG</code>, and <code>BLOCK_GROUP_TREE</code> (the trees
+pointed to directly by the superblock) work as you'd expect.</li>
+<li>For other trees, we (as you'd expect) look up the root item in the
+rebuilt <code>ROOT_TREE</code>, and then (if rootitem.ParentUUID is
+non-zero) eagerly also look up the parent tree (recursing on ourself).
+We try to use the <code>UUID_TREE</code> tree to help with this, but
+fall back to just doing a linear scan over the <code>ROOT_TREE</code>.
+If we fail to look up the parent tree (or its parent, or a more distant
+ancestor), then (depending on a flag) we either make a note of that, or
+error out and fail to look up the child tree. For <code>--rebuild</code>
+and <code>--trees=trees.json</code> we are permissive of this error, and
+just make note of it; but we'll re-use this algorithm in the
+<code>rebuild-trees</code> algorithm below, and it needs the more strict
+handling.</li>
+<li>When creating the rebuilt individual tree, we start by adding the
+root node specified by the superblock/root-item. But we may also add
+additional root nodes grafted on to the tree by the
+<code>--trees=trees.json</code> flag or by the
+<code>rebuild-trees</code> algorithm below. So a tree may have more than
+1 root node.</li>
+</ul>
+<h4 id="rebuilt-individual-tree-behavior">4.3.2.2. rebuilt individual
+tree behavior</h4>
+<p>(see: <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/rebuilt_tree.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/rebuilt_tree.go</code></a>)</p>
+<p>In order to read from a tree, we first have to build a few indexes.
+We store these indexes in an Adaptive Replacement Cache; they are all
+re-buildable based on the tree's list of roots and the above graph; if
+we have a bunch of trees we don't need to keep all of this in memory at
+once. Note that this is done 100% with the in-memory graph; we don't
+need to read anything from the filesystem during these procedures.</p>
+<ul>
+<li><p>The first index we build is the "node index". This is an index
+that for every node tells us what root(s) the tree would need to have in
+order for the tree to include that node, and also what the highest
+acceptable item key in the node would be if the tree includes that root.
+We track both a <code>loMaxItem</code> and a <code>hiMaxItem</code>, in
+case the tree is real broken and there are multiple paths from the root
+to the node, as these different paths may imply different max-item
+constraints. Put more concretely, the type of the index is:</p>
+<pre><code>map[ nodeID → map[ rootNodeID → {loMaxItem, hiMaxItem} ] ]</code></pre>
+<p>We'll do a loop over the graph, using dynamic-programming memoization
+to figure out ordering and avoid processing the same node twice; for
+each node we'll</p>
+<ul>
+<li><p>Check whether the owner-tree is this tree or one of this tree's
+ancestors (and if it's an ancestor, that the node's generation isn't
+after the point that the child tree was forked from the parent tree). If
+not, we are done processing that node (record an empty/nil set of roots
+for it).</p></li>
+<li><p>Create an empty map of <code>rootID</code> →
+{<code>loMaxItem</code>, <code>hiMaxItem</code>}.</p></li>
+<li><p>Look at each keypointer that points at the node and:</p>
+<ul>
+<li><p>Skip the keypointer if its expectations of the node aren't met:
+if the level, generation, and min-key constraints don't match up. If the
+keypointer isn't in the last slot in the source node, we also go ahead
+and include checking that the destination node's max-key is under the
+min-key of the keypointer in the next slot, since that's cheap to do
+now.</p></li>
+<li><p>Skip the keypointer if its source node's owner-tree isn't this
+tree or one of this tree's ancestors (and if it's an ancestor, that the
+node's generation isn't after the point that the child tree was forked
+from the parent tree).</p></li>
+<li><p>Dynamic-programming recurse and index the keypointer's source
+node.</p></li>
+<li><p>for every root that would result in the keypointer's source node
+being included in the tree:</p>
+<ul>
+<li><p>If the keypointer is in the last slot, look at what the
+source node's last-item constraints would be if that root is included,
+so we can now check the max-item of our destination node. We check
+against the <code>hiMaxItem</code>, since if there is any valid path
+from the root to this node, then we want to be permissive and include
+it. If that check fails, then we're done with this keypointer. Also,
+make note of those <code>loMaxItem</code> and <code>hiMaxItem</code>
+values; we'll use them again in just a moment.</p></li>
+<li><p>Otherwise, set both <code>loMaxItem</code> and
+<code>hiMaxItem</code> to 1-under the min-item of the keypointer in the
+next slot.</p></li>
+<li><p>Insert that <code>loMaxItem</code> and <code>hiMaxItem</code> pair
+into the <code>rootID</code> → {<code>loMaxItem</code>,
+<code>hiMaxItem</code>} map we created above. If an entry already exists
+for this root (since a broken tree might have multiple paths from the
+root to our node), then set <code>loMaxItem</code> to the min of the
+existing entry and our value, and <code>hiMaxItem</code> to the
+max.</p></li>
+</ul></li>
+</ul></li>
+<li><p>If that <code>rootID</code> → {<code>loMaxItem</code>,
+<code>hiMaxItem</code>} map is still empty, then consider this node to
+be a (potential) root, and insert <code>rootID=thisNode</code> -&gt;
+{<code>loMaxItem=maxKey</code>, <code>hiMaxItem=maxKey</code>} (where
+<code>maxKey</code> is the maximum value of the key datatype).</p></li>
+<li><p>Take that <code>rootID</code> → {<code>loMaxItem</code>,
+<code>hiMaxItem</code>} map and insert it into the index as the entry
+for this node.</p></li>
+</ul></li>
+<li><p>The next index we build is the "item index". This is a "sorted
+map" (implemented as a red-black tree, supporting sub-range iteration)
+of <code>key</code> → {<code>nodeID</code>, <code>slotNumber</code>}; a
+map that for each key tells us where to find the item with that key.</p>
+<ul>
+<li><p>Loop over the node index, and for each node check if both (a) it
+has <code>level==0</code> (is a leaf node containing items), and (b) its
+set of roots that would include it has any overlap with the tree's set
+of roots.</p></li>
+<li><p>Loop over each of those included leaf nodes, and loop over the
+items in each node. Insert the <code>key</code> → {<code>nodeId</code>,
+<code>slot</code>} into our sorted map. If there is already an entry for
+that key, decide which one wins by:</p>
+<ul>
+<li><p>Use the one from the node with the owner-tree that is closer to
+this tree; a node with owner=thisTree wins over a node with
+owner=thisTree.parent, which would win over a node with
+owner=thisTree.parent.parent (this rule is sketched in Go after this
+list). If that's a tie, then...</p></li>
+<li><p>Use the one from the node with the higher generation. If that's a
+tie, then...</p></li>
+<li><p>I don't know, I have the code <code>panic</code>:</p>
+<pre><code>// TODO: This is a panic because I&#39;m not really sure what the
+// best way to handle this is, and so if this happens I want the
+// program to crash and force me to figure out how to handle it.
+panic(fmt.Errorf(&quot;dup nodes in tree=%v: old=%v=%v ; new=%v=%v&quot;,
+ tree.ID,
+ oldNode, tree.forrest.graph.Nodes[oldNode],
+ newNode, tree.forrest.graph.Nodes[newNode]))</code></pre></li>
+</ul></li>
+</ul>
+<p>Note that this algorithm means that for a given node we may use a few
+items from that node, while having other items from that same node be
+overridden by another node.</p></li>
+<li><p>The final index we build is the "error index". This is an index
+of what errors correspond to which range of keys, so that we can report
+them, and give an idea of "there may be entries missing from this
+directory" and similar.</p>
+<p>For each error, we'll track the min-key and max-key of the range it
+applies to, the node it came from, and what the error string is. We'll
+store these into an interval tree keyed on that min-key/max-key
+range.</p>
+<ul>
+<li><p>Create an empty set <code>nodesToProcess</code>. Now populate
+it:</p>
+<ul>
+<li><p>Once again, we'll loop over the node index, but this time we'll
+only check that there's overlap between the set of roots that would
+include the node and the tree's set of roots. For the nodes that are
+included in this tree, insert both the node itself and all node IDs
+that it has keypointers pointing to into the <code>nodesToProcess</code>
+set.</p></li>
+<li><p>Also insert all of the tree's roots into
+<code>nodesToProcess</code>; this is in case the superblock/root-item
+points to an invalid node that we couldn't read.</p></li>
+</ul></li>
+<li><p>Now loop over <code>nodesToProcess</code>. For each node, create
+an empty list of errors. Use the keypointers pointing to it and the min
+<code>loMaxItem</code> from the node index to construct a set of
+expectations for the node; this should be reasonably straightforward,
+given:</p>
+<ul>
+<li><p>If different keypointers have disagreeing levels, insert an error
+into the list, and don't bother with checking the node's
+level.</p></li>
+<li><p>If different keypointers have disagreeing generations, insert an
+error into the list, and don't bother with checking the node's
+generation.</p></li>
+<li><p>If different keypointers have different min-item expectations,
+use the max of them.</p></li>
+</ul>
+<p>Then:</p>
+<ul>
+<li>If the node is a "bad node" in the graph, insert the error message
+associated with it. Otherwise, check those expectations against the node
+in the graph.</li>
+</ul>
+<p>If the list of error messages is non-empty, then insert their
+concatenation into the interval tree, with the range set to the min of
+the min-item expectations from the keypointers through the max of the
+<code>hiMaxItem</code>s from the node index. If the min min-item
+expectation turns out to be higher than the max <code>hiMaxItem</code>,
+then set the range to the zero-key through the max-key.</p></li>
+</ul></li>
+</ul>
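+<p>A rough Go sketch of the shapes of the first two indexes, plus the
+"which copy of this key wins" rule used when building the item index
+(illustrative names; the real code is in
+<code>./lib/btrfsutil/rebuilt_tree.go</code>):</p>
+<pre><code>// A btrfs key: ObjectID/ItemType/Offset, as on disk.
+type Key struct {
+    ObjectID uint64
+    ItemType uint8
+    Offset   uint64
+}
+
+// The node index: for each node, which roots would include it, and
+// the max-item constraint implied by each such root.
+type maxItemConstraint struct {
+    loMaxItem, hiMaxItem Key
+}
+type nodeIndex map[int64]map[int64]maxItemConstraint // node → root → constraint
+
+// The item index maps a key to the location of the winning copy.
+type itemPtr struct {
+    Node int64 // node logical address
+    Slot int
+}
+type itemIndex map[Key]itemPtr // actually a sorted map, not a hash map
+
+// newWins reports whether the new copy of an item overrides the old
+// one; dist is how far a node&#39;s owner-tree is from this tree
+// (0 = this tree, 1 = this tree&#39;s parent, and so on).
+func newWins(oldDist, newDist int, oldGen, newGen uint64) bool {
+    if newDist != oldDist {
+        return newDist &lt; oldDist // closer owner-tree wins
+    }
+    if newGen != oldGen {
+        return newGen &gt; oldGen // higher generation wins
+    }
+    panic(&quot;dup nodes in tree&quot;) // as in the real code quoted above
+}</code></pre>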
+<p>From there, it should be trivial to implement the usual B+ tree
+operations using those indexes; exact-lookup using the item index, and
+range-lookups and walks using the item index together with the error
+index. Efficiently searching the <code>CSUM_TREE</code> requires knowing
+item sizes, so that's why we recorded the item sizes into the graph.</p>
+<h3 id="the-rebuild-trees-algorithm">4.3.3. The
+<code>rebuild-trees</code> algorithm</h3>
+<p>The <code>btrfs-rec inspect rebuild-trees</code> algorithm finds
+nodes to attach as extra roots to trees. I think that conceptually it's
+the simplest of the 3 algorithms, but it turned out to be the hardest to
+get right. So... maybe more than for the others, reference the source
+code too (<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/</code></a>),
+because I might forget some small but important detail.</p>
+<p>The core idea here is that we're just going to walk each tree,
+inspecting each item in the tree, and checking for any items that are
+implied by other items (e.g.: a dir entry item implies the existence of
+an inode item for the inode that it points at). If an implied item is not
+in the tree, but is in some other node, then we look at which potential
+roots we could add to the tree that would add that other node. Then,
+after we've processed all of the items in the filesystem, we go add
+those various roots to the various trees, keeping track of which items
+are added or updated. If any of those added/updated items have a version
+with a newer generation on a different node, see what roots we could add
+to get that newer version. Then add those roots, keeping track of items
+that are added/updated. Once we reach a steady-state where the newest
+version of each item has been added, loop back and inspect all
+added/updated items for implied items, keeping track of roots we could
+add. Repeat until a steady-state is reached.</p>
+<p>There are lots of little details in that process, some of which are
+for correctness, and some of which are for "it should run in hours
+instead of weeks."</p>
+<h4 id="initialization">4.3.3.1. initialization</h4>
+<p>First up, we're going to build an in-memory graph, same as above.
+But this time, while we're reading the nodes to do that, we're also
+going to watch for some specific items and record a few things about
+them.</p>
+<p>(see: <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/scan.go</code></a>)</p>
+<p>For each {<code>nodeID</code>, <code>slotNumber</code>} pair that
+matches one of these item types, we're going to record the following (a
+Go sketch of this scan data follows the list):</p>
+<ul>
+<li>flags:
+<ul>
+<li><code>INODE_ITEM</code>s: whether it has the
+<code>INODE_NODATASUM</code> flag set</li>
+</ul></li>
+<li>names:
+<ul>
+<li><code>DIR_INDEX</code> items: the file's name</li>
+</ul></li>
+<li>sizes:
+<ul>
+<li><code>EXTENT_CSUM</code> items: the number of bytes that this is a
+sum for (i.e. the item size over the checksum size, times the block
+size)</li>
+<li><code>EXTENT_DATA</code> items: the number of bytes in this extent
+(i.e. either the item size minus
+<code>offsetof(btrfs_file_extent_item.disk_bytenr)</code> if
+<code>FILE_EXTENT_INLINE</code>, or else the item's
+<code>num_bytes</code>).</li>
+</ul></li>
+<li>data backrefs:
+<ul>
+<li><code>EXTENT_ITEM</code>s and <code>METADATA_ITEM</code>s: a list of
+the same length as the number of refs embedded in the item; for embedded
+ExtentDataRefs, the list entry is the subvolume tree ID that the
+ExtentDataRef points at, otherwise it is zero.</li>
+<li><code>EXTENT_DATA_REF</code> items: a list of length 1, with the
+sole member being the subvolume tree ID that the ExtentDataRef points
+at.</li>
+</ul></li>
+</ul>
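+<p>A rough Go sketch of that recorded scan data (illustrative names;
+the real structure is in
+<code>./cmd/btrfs-rec/inspect/rebuildtrees/scan.go</code>):</p>
+<pre><code>// Identifies one item: a node and a slot within it.
+type itemPtr struct {
+    Node int64 // node logical address
+    Slot int
+}
+
+type scanData struct {
+    // INODE_ITEMs: whether INODE_NODATASUM is set.
+    Flags map[itemPtr]bool
+    // DIR_INDEX items: the file&#39;s name.
+    Names map[itemPtr][]byte
+    // EXTENT_CSUM and EXTENT_DATA items: the byte count, computed as
+    // described above.
+    Sizes map[itemPtr]uint64
+    // EXTENT_ITEMs, METADATA_ITEMs, and EXTENT_DATA_REF items: one
+    // entry per (embedded) ref; the subvolume tree ID that an
+    // ExtentDataRef points at, or zero for other ref types.
+    DataBackrefs map[itemPtr][]int64
+}</code></pre>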
+<h4 id="the-main-loop">4.3.3.2. the main loop</h4>
+<p>(see: <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go</code></a>)</p>
+<p>Start with that scan data (graph + info about items), and also a
+rebuilt forrest from the above algorithm, but with:</p>
+<ul>
+<li><p>the flag set so that it refuses to look up a tree if it can't
+look up all of that tree's ancestors</p></li>
+<li><p>an additional "potential-item index" that is similar to the item
+index. It is generated the same way and can cache/evict the same way;
+the difference is that we invert the check for whether the set of roots
+for a node has overlap with the tree's set of roots; we're looking for
+<em>potential</em> nodes that we could add to this tree.</p></li>
+<li><p>some callbacks; we'll get to what we do in these callbacks in a
+bit, but for now, what the callbacks are:</p>
+<ul>
+<li><p>a callback that is called for each added/updated item when we add
+a root.</p></li>
+<li><p>a callback that is called whenever we add a root</p></li>
+<li><p>a callback that intercepts looking up a root item</p></li>
+<li><p>a callback that intercepts resolving an UUID to an object
+ID.</p></li>
+</ul></li>
+</ul>
+<p>(The callbacks are in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go</code></a>)</p>
+<p>We have 5 unordered queues ("work lists"?); these are sets: when
+it's time to drain them, we sort the members and process them in that
+order (a sketch of their shapes follows the list):</p>
+<ol type="1">
+<li>the tree queue: a list of tree IDs that we need to crawl</li>
+<li>the retry-item queue: for each tree ID, a set of items that we
+should re-process if we add a root to that tree</li>
+<li>the added-item queue: a set of key/tree pairs identifying items that
+have been added by adding a root to a tree</li>
+<li>the settled-item queue: a set of key/tree pairs that have not
+just been added by adding a root, but that we've also verified are
+the newest-generation item with that key that we could add to the
+tree.</li>
+<li>the augment queue: for each item that we want to add to a tree, the
+list of roots that we could add to get that item.</li>
+</ol>
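+<p>A rough Go sketch of those queue shapes (illustrative; the real
+types, including the size-optimized augment queue discussed below,
+live in <code>./cmd/btrfs-rec/inspect/rebuildtrees/</code>):</p>
+<pre><code>// A btrfs key: ObjectID/ItemType/Offset, as on disk.
+type Key struct {
+    ObjectID uint64
+    ItemType uint8
+    Offset   uint64
+}
+
+// Identifies an item within a specific tree.
+type keyAndTree struct {
+    TreeID int64
+    Key    Key
+}
+
+type queues struct {
+    treeQueue        map[int64]struct{}                // 1: tree IDs to crawl
+    retryItemQueue   map[int64]map[keyAndTree]struct{} // 2: per tree: items to re-process if it gains a root
+    addedItemQueue   map[keyAndTree]struct{}           // 3
+    settledItemQueue map[keyAndTree]struct{}           // 4
+    // 5: per tree: for each &quot;want&quot;, the set of roots that could
+    // satisfy it (each want is tracked separately; see below).
+    augmentQueue map[int64][]map[int64]struct{}
+}</code></pre>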
+<p>The queues all start out empty, except for the tree queue, which we
+seed with the <code>ROOT_TREE</code>, the <code>CHUNK_TREE</code>, and
+the <code>BLOCK_GROUP_TREE</code> (it is a "TODO" task that it should
+probably also be seeded with the <code>TREE_LOG</code>, but as I will
+say below in the "future work" section, I don't actually understand the
+<code>TREE_LOG</code>, so I couldn't implement it).</p>
+<p>Now we're going to loop until the tree queue, added-item queue,
+settled-item queue, and augment queue are all empty (all queues except
+for the retry-item queue). Each loop "pass" has 3 substeps:</p>
+<ol type="1">
+<li><p>Crawl the trees (drain the tree queue, fill the added-item
+queue).</p></li>
+<li><p>Either:</p>
+<ol type="a">
+<li><p>if the added-item queue is non-empty: "settle" those items (drain
+the added-item queue, fill the augment queue and the settled-item
+queue).</p></li>
+<li><p>otherwise: process items (drain the settled-item queue, fill the
+augment queue and the tree queue)</p></li>
+</ol></li>
+<li><p>Apply augments (drain the augment queue and maybe the retry-item
+queue, fill the added-item queue).</p></li>
+</ol>
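+<p>In Go-ish pseudocode, the overall loop looks like this (the stub
+functions stand in for the substeps detailed next; this is an
+illustrative sketch, not the real code):</p>
+<pre><code>type queues struct {
+    treeQueue, addedItemQueue, settledItemQueue, augmentQueue []int // stand-ins
+}
+
+// Stubs standing in for the real substeps.
+func crawlTrees(q *queues)    {} // drain treeQueue, fill addedItemQueue
+func settleItems(q *queues)   {} // drain addedItemQueue, fill augmentQueue + settledItemQueue
+func processItems(q *queues)  {} // drain settledItemQueue, fill augmentQueue + treeQueue
+func applyAugments(q *queues) {} // drain augmentQueue (and maybe the retry-item queue), fill addedItemQueue
+
+func mainLoop(q *queues) {
+    for len(q.treeQueue)+len(q.addedItemQueue)+
+        len(q.settledItemQueue)+len(q.augmentQueue) &gt; 0 {
+        crawlTrees(q) // substep 1
+        if len(q.addedItemQueue) &gt; 0 {
+            settleItems(q) // substep 2a
+        } else {
+            processItems(q) // substep 2b
+        }
+        applyAugments(q) // substep 3
+    }
+}</code></pre>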
+<p>OK, let's look at those 3 substeps in more detail:</p>
+<ol type="1">
+<li><p>Crawl the trees; drain the tree queue, fill the added-item
+queue.</p>
+<p>We just look up the tree in the rebuilt forrest, which (per the
+above <code>--rebuild</code> algorithm) will either fail to look up the
+tree, or succeed, and add to that tree the root node from the
+superblock/root-item. Because we set an item-added callback, when adding
+that root it will loop over the nodes added by that root, and call our
+callback for each item in one of the added nodes. Our callback inserts
+each item into the added-item queue. The forrest also calls our
+root-added callback, but because of the way this algorithm works, that
+turns out to be a no-op at this step.</p>
+<p>I mentioned that we added callbacks to intercept the forrest's
+looking up of root items and resolving UUIDs; we override the forrest's
+"lookup root item" routine and "resolve UUID" routine so that, instead of
+doing normal lookups on the <code>ROOT_TREE</code> and
+<code>UUID_TREE</code>, they use the <code>Want<var>XXX</var></code>
+routines that we'll define below in the "graph callbacks" section.</p>
+<p>It shouldn't matter what order this queue is processed in, but I sort
+tree IDs numerically.</p>
+<p>The crawling is fairly fast because it's just in-memory; the only
+accesses to disk are looking up root items and resolving UUIDs.</p></li>
+<li><p>Either:</p>
+<ol type="a">
+<li><p>Settle items from the added-item queue to the settled-item queue
+(and fill the augment queue).</p>
+<p>For each item in the queue, we look in the tree's item index to get
+the {node, slot} pair for it, then we do the same in the tree's
+potential-item index. If the potential-item index contains an entry for
+the item's key, then we check if the potential-item's node should "win"
+over the queue item's node, deciding the "winner" using the same routine
+as when building the item index. If the potential-item's node wins, then
+we add the potential node's set of roots to the augment queue. If the
+queue-item's node wins, then we add the item to the settled-item queue
+(except, as an optimization, if the item is of a type that cannot
+possibly imply the existence of another item, then we just drop it and
+don't add it to the settled-item queue).</p>
+<p>It shouldn't matter what order this queue is processed in, but I sort
+it numerically by treeID and then by item key.</p>
+<p>This step is fairly fast because it's entirely in-memory, making no
+accesses to disk.</p></li>
+<li><p>Process items from the settled-item queue (drain the settled-item
+queue, fill the augment queue and the tree queue).</p>
+<p>This step accesses disk, and so the order we process the queue in
+turns out to be pretty important in order to keep our disk access
+patterns cache-friendly. For the most part, we just sort each queue item
+by tree, then by key. But, we have special handling for
+<code>EXTENT_ITEM</code>s, <code>METADATA_ITEM</code>s, and
+<code>EXTENT_DATA_REF</code> items: We break <code>EXTENT_ITEM</code>s
+and <code>METADATA_ITEM</code>s into "sub-items", treating each ref
+embedded in them as a separate item. For those embedded items that are
+<code>EXTENT_DATA_REF</code>s, and for stand-alone
+<code>EXTENT_DATA_REF</code> items, we sort them not with the
+<code>EXTENT_TREE</code> items, but with the items of the tree that the
+extent data ref points at. Recall that during the initial scan step, we
+took note of which tree every extent data ref points at, so we can
+perform this sort without accessing disk yet. This splitting does mean
+that we may visit/read an <code>EXTENT_ITEM</code> or
+<code>METADATA_ITEM</code> multiple times as we process the queue; but
+to do otherwise would require solving MinLA, which is NP-hard, and I
+think an optimal MinLA solution would still perform worse than this; there is a
+reasonably lengthy discussion of this in a comment in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n251"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:sortSettledItemQueue()</code></a>.</p>
+<p>Now we loop over that sorted queue. In the code, this loop is
+deceptively simple. Read the item, then pass it to a function that tells
+us what other items are implied by it. That function is large, but
+simple; it's just a giant table. The trick is how it tells us about
+implied items; we give it a set of callbacks that it calls to tell us
+these things; the real complexity is in the callbacks. These "graph
+callbacks" will be discussed in detail below, but as an illustrative
+example: It may call <code>.WantOff()</code> with a tree ID, object ID,
+item type, and offset to specify a precise item that it believes should
+exist.</p>
+<p>If we encounter a <code>ROOT_ITEM</code>, add the tree described by
+that item to the tree queue.</p></li>
+</ol>
+<p>(Both the "can this item even imply the existence of another item"
+check and the "what items are implied by this item" routine are in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfscheck/graph.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfscheck/graph.go</code></a>)</p></li>
+<li><p>Apply augments; drain the augment queue (and maybe the retry-item
+queue), fill the added-item queue.</p>
+<p>It is at this point that I call out that the augment queue isn't
+implemented as a simple map/set like the others: the
+<code>treeAugmentQueue struct</code> has special handling for sets of
+different sizes, optimizing the space for empty and len()==1 sized sets,
+and falling back to the usual implementation for larger sets.
+This is important because those small sets are the overwhelming
+majority, and otherwise there's no way the program would be able to run
+on my 32GB RAM laptop. Now that I think about it, I bet it would even be
+worth it to add optimized storage for len()==2 sized sets.</p>
+<p>The reason is that each "want" from above is tracked in the queue
+separately; if we were OK merging them, then this optimized storage
+wouldn't be necessary. But we keep them separate, so that:</p>
+<ul>
+<li><p>For all "wants", including ones with empty sets, graph callbacks
+can check if a want has already been processed, avoiding re-doing any
+work (see the description of the graph callbacks below).</p></li>
+<li><p>For "wants" with non-empty sets, we can see how many different
+"wants" could be satisfied with a given root, in order to decide which
+root to choose.</p></li>
+</ul>
+<p>Anyway, we loop over the trees in the augment queue. For each tree we
+look at that tree's augment queue and look at all the choices of root
+nodes to add (below), and decide on a list to add. Then we add each of
+those roots to the tree; the adding of each root triggers several calls
+to our item-added callback (filling the added-item queue), and our
+root-added callback. The root-added callback moves any items from the
+retry-item queue for this tree to the added-item queue.</p>
+<p>How do we decide between choices of root nodes to add? <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n528"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:resolveTreeAugments()</code></a>
+has a good comment explaining the criteria we'd like to optimize for,
+and then code that does an OK-ish job of actually optimizing for
+that:</p>
+<ul>
+<li><p>It loops over the augment queue for that tree, building a list of
+possible roots, for each possible root making note of 3 things:</p>
+<ol type="a">
+<li><p>how many "wants" that root satisfies,</p></li>
+<li><p>how far from the tree the root's owner is (owner=tree is a distance
+of 0, owner=tree.parent is a distance of 1, owner=tree.parent.parent is
+a distance of 2, and so on), and</p></li>
+<li><p>what the generation of that root is.</p></li>
+</ol></li>
+<li><p>We sort that list first by highest-count-first, then by
+lowest-distance-first, then by highest-generation-first.</p></li>
+<li><p>We create a "return" set and an "illegal" set. We loop over the
+sorted list; for each possible root if it is in the illegal set, we skip
+it, otherwise we insert it into the return set and for each "want" that
+includes this root we all all roots that satisfy that want to the
+illegal list.</p></li>
+</ul></li>
+</ol>
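+<p>A rough Go sketch of that greedy selection (illustrative names; the
+real code is <code>rebuild.go:resolveTreeAugments()</code>, linked
+above):</p>
+<pre><code>import &quot;sort&quot;
+
+type rootCandidate struct {
+    root  int64  // root node logical address
+    count int    // (a) how many &quot;wants&quot; this root satisfies
+    dist  int    // (b) distance from this tree to the root&#39;s owner-tree
+    gen   uint64 // (c) the root&#39;s generation
+}
+
+// wants[i] is the set of roots that would satisfy want i.
+func resolveTreeAugments(candidates []rootCandidate, wants []map[int64]bool) map[int64]bool {
+    sort.Slice(candidates, func(i, j int) bool {
+        a, b := candidates[i], candidates[j]
+        if a.count != b.count {
+            return a.count &gt; b.count // highest count first
+        }
+        if a.dist != b.dist {
+            return a.dist &lt; b.dist // lowest distance first
+        }
+        return a.gen &gt; b.gen // highest generation first
+    })
+    ret := make(map[int64]bool)
+    illegal := make(map[int64]bool)
+    for _, cand := range candidates {
+        if illegal[cand.root] {
+            continue
+        }
+        ret[cand.root] = true
+        // Every want satisfied by this root is now covered; all
+        // other roots satisfying those wants become redundant.
+        for _, want := range wants {
+            if want[cand.root] {
+                for root := range want {
+                    illegal[root] = true
+                }
+            }
+        }
+    }
+    return ret
+}</code></pre>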
+<p>It is important that the rebuilt forrest have the flag set so that it
+refuses to look up a tree if it can't look up all of that tree's
+ancestors; otherwise the potential-items index would be garbage as we
+wouldn't have a good idea of which nodes are OK to consider; but this
+does have the downside that it won't even attempt to improve a tree with
+a missing parent. Perhaps the algorithm should flip the flag once the
+loop terminates, and then re-seed the tree queue with each
+<code>ROOT_ITEM</code> from the <code>ROOT_TREE</code>?</p>
+<h4 id="graph-callbacks">4.3.3.3. graph callbacks</h4>
+<p>(see: <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go</code></a>)</p>
+<p>The graph callbacks are what tie the above together.</p>
+<p>For each of these callbacks, whenever I say that it looks up
+something in a tree's item index or potential-item index, that implies
+looking the tree up from the forest; if the forest cannot look up that
+tree, then the callback returns early, after either:</p>
+<ul>
+<li><p>if we are in substep 1 and are processing a tree: we add the tree
+that is being processed to the tree queue. (TODO: Wait, this assumes
+that an augment will be applied to the <code>ROOT_TREE</code> before the
+next pass... if that isn't the case, this will result in the loop never
+terminating... I guess I need to add a separate retry-tree
+queue?)</p></li>
+<li><p>if we are in substep 2 and are processing an item: we add the
+item that is being processed to the retry-item queue for the tree that
+cannot be looked up</p></li>
+</ul>
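+<p>As a sketch (hypothetical names; the real logic is threaded through
+the callbacks), each callback effectively begins with something
+like:</p>
+<pre><code>tree := o.forest.Tree(treeID)
+if tree == nil { // the forest can't look up this tree
+    if o.inSubstep1 { // substep 1: processing a tree
+        o.treeQueue[treeID] = true // re-process the tree on a later pass
+    } else { // substep 2: processing an item
+        o.retryItemQueue[treeID] = append(o.retryItemQueue[treeID], curItem)
+    }
+    return
+}</code></pre>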
+<p>The 6 methods in the <code>btrfscheck.GraphCallbacks</code> interface
+are:</p>
+<ol type="1">
+<li><p><code>FSErr()</code>: There's an error with the filesystem; this
+callback just spits it out on stderr. I mention such a trivial matter
+because, again, for a recovery tool I think it's worth putting care into
+how you handle errors and where you expect them: We expect them here,
+so we have to check for them to avoid reading invalid data or whatever,
+but we don't actually need to do anything other than watch our
+step.</p></li>
+<li><p><code>Want()</code>: We want an item in a given tree with a given
+object ID and item type, but we don't care about what the item's offset
+is.</p>
+<p>The callback works by searching the item index to see if it can find
+such an item; if so, it has nothing else to do and returns. Otherwise,
+it searches the potential-item index; for each matching item it finds,
+it looks in the node index for the node containing that item, and adds
+the roots that would add that node to a set. Once it has finished
+searching the potential-item index, it adds that set to the augment
+queue (even if that set is still empty); see the sketch after this
+list.</p></li>
+<li><p><code>WantOff()</code>: The same, but we want a specific
+offset.</p></li>
+<li><p><code>WantDirIndex()</code>: We want a <code>DIR_INDEX</code>
+item for a given inode and filename, but we don't know what the offset
+of that item is.</p>
+<p>First we scan over the item index, looking at all
+<code>DIR_INDEX</code> items for that inode number. For each item, we
+can check the scan data to see what the filename in that
+<code>DIR_INDEX</code> is, so we can see if the item satisfies this want
+without accessing the disk. If there's a match, then there is nothing
+else to do, so we return. Otherwise, we do that same search over the
+potential-item index; if we find any matches, then we build the set of
+roots to add to the augment queue the same as in
+<code>Want</code>.</p></li>
+<li><p><code>WantFileExt()</code>: We want 1 or more
+<code>DATA_EXTENT</code> items in the given tree for the given inode,
+and we want them to cover bytes 0 through a given size of that file.</p>
+<p>First we walk that range in the item index, to build a list of the
+gaps that we need to fill ("Step 1" in <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n260"><code>rebuild_wantcb.go:_wantRange()</code></a>).
+This walk (<a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n195"><code>rebuild_wantcb.go:_walkRange()</code></a>)
+requires knowing the size of each file extent; so doing this quickly
+without hitting disk is why we recorded the size of each file extent in
+our initialization step.</p>
+<p>Then ("Step 2" in <code>_wantRange()</code>) we iterate over each of
+the gaps, and for each gap do a very similar walk (again, by calling
+<code>_walkRange()</code>, but this time over the potential-item index.
+For each file extent we find that has is entirely within the gap, we
+"want" that extent, and move the beginning of of the gap forward to the
+end of that extent. This algorithm is dumb and greedy, potentially
+making sub-optimal selections; and so could probably stand to be
+improved; but in my real-world use, it seems to be "good
+enough".</p></li>
+<li><p><code>WantCSum()</code>: We want 1 or more
+<code>EXTENT_CSUM</code> items to cover the half-open interval
+[<code>lo_logical_addr</code>, <code>hi_logical_addr</code>). Well,
+maybe. It also takes a subvolume ID and an inode number; and looks up in
+the scan data whether that inode has the <code>INODE_NODATASUM</code>
+flag set; if it does have the flag set, then it returns early without
+looking for any <code>EXTENT_CSUM</code> items. If it doesn't return
+early, then it performs the same want-range routine as
+<code>WantFileExt</code>, but with the appropriate tree, object ID, and
+item types for csums as opposed to data extents.</p></li>
+</ol>
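+<p>Putting <code>Want()</code> in sketch form (all of these method
+names are illustrative placeholders, not the real
+<code>rebuild_wantcb.go</code> API):</p>
+<pre><code>func (o *rebuilder) Want(treeID, objID, typ uint64) {
+    key := wantKey{treeID, objID, typ}
+    if o.augmentQueueHas(key) {
+        return // this want was already enqueued; skip the expensive part
+    }
+    tree := o.forest.Tree(treeID)
+    if tree.ItemIndexHas(objID, typ) {
+        return // the tree already contains such an item; nothing to do
+    }
+    roots := make(map[int64]bool)
+    for _, item := range tree.PotentialItems(objID, typ) {
+        node := o.graph.NodeContaining(item)
+        for _, root := range tree.RootsThatWouldAdd(node) {
+            roots[root] = true
+        }
+    }
+    o.enqueueAugment(key, roots) // enqueued even if the set is empty
+}</code></pre>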
+<p>For each of these callbacks, we generate a "wantKey", a tuple
+representing the function and its arguments; we check the augment queue
+to see if we've already enqueued a set of roots for that want, and if
+so, that callback can return early without checking the potential-item
+index.</p>
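+<p>And the greedy gap-filling from <code>WantFileExt()</code>'s "Step
+2", boiled down (simplified placeholder types; the real walk goes
+through the potential-item index):</p>
+<pre><code>type extent struct{ beg, end int64 } // half-open [beg, end), in bytes
+
+// fillGap "wants" each potential extent that fits entirely within the
+// gap, advancing the start of the gap past each chosen extent.
+// `potential` is assumed sorted by beg, as a tree walk would yield it.
+func fillGap(gap extent, potential []extent, want func(extent)) {
+    for _, ext := range potential {
+        if ext.beg &gt;= gap.beg &amp;&amp; ext.end &lt;= gap.end {
+            want(ext)
+            gap.beg = ext.end // greedy: move the gap forward
+        }
+    }
+}</code></pre>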
+<h1 id="future-work">5. Future work</h1>
+<p>It's in a reasonably useful place, I think; and so now I'm going to
+take a break from it for a while. But there's still lots of work to
+do:</p>
+<ul>
+<li><p>RAID almost certainly doesn't work.</p></li>
+<li><p>Encryption is not implemented.</p></li>
+<li><p>It doesn't understand (ignores) the <code>TREE_LOG</code>
+(because I don't understand the <code>TREE_LOG</code>).</p></li>
+<li><p><code>btrfs-rec inspect mount</code> should add "lost+found"
+directories for inodes that are included in the subvolume's tree but
+aren't reachable from the tree's root inode.</p></li>
+<li><p>I still need to implement <code>btrfs-rec repair
+<var>SUBCMD</var></code> subcommands to write rebuilt-information from
+<code>btrfs-rec inspect</code> back to the filesystem.</p></li>
+<li><p>I need to figure out the error handling/reporting story for
+<code>mount</code>.</p></li>
+<li><p>It needs a lot more tests.</p>
+<ul>
+<li>I'd like to get the existing btrfs-progs fsck tests to run on
+it.</li>
+</ul></li>
+<li><p>In the process of writing this email, I realized that I probably
+need to add a retry-tree queue; see the "graph callbacks" section in the
+description of the <code>rebuild-trees</code> algorithm above.</p></li>
+<li><p>There are a number of "TODO" comments or panics in the code:</p>
+<ul>
+<li><p>Some of them definitely need to be done.</p></li>
+<li><p>Some of them are <code>panic("TODO")</code> on the basis that if
+it's seeing something on the filesystem that it doesn't recognize, it's
+probably that I didn't get to implementing that thing/situation, but
+it's possible that the thing is just corrupt. This should only be for
+situations where the node passed the checksum test, so it being corrupt
+would have to be caused by a bug in btrfs rather than a failing drive or
+other corruption; I wasn't too worried about btrfs bugs.</p></li>
+</ul></li>
+<li><p><code>btrfs-rec inspect rebuild-trees</code> is slow, and can
+probably be made a lot faster.</p>
+<p>Just to give you an idea of the speeds, the run-times for the various
+steps on my ThinkPad E15 for a 256GB disk image are as follows:</p>
+<pre><code> btrfs-rec inspect rebuild-mappings scan : 7m 31s
+ btrfs-rec inspect rebuild-mappings list-nodes : 47s
+ btrfs-rec inspect rebuild-mappings process : 8m 22s
+ btrfs-rec inspect rebuild-trees : 1h 4m 55s
+ btrfs-rec inspect ls-files : 29m 55s
+ btrfs-rec inspect ls-trees : 8m 40s</code></pre>
+<p>For the most part, it's all single-threaded (with the main exception
+that in several places I/O has been moved to a separate thread from the
+main CPU-heavy thread), but a lot of the algorithms could be
+parallelized.</p></li>
+<li><p>There are a lot of "tunable" values that I haven't really spent
+time tuning. These are all annotated with <a
+href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/textui/tunable.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>textui.Tunable()</code></a>.
+I sort-of intended for them to be adjustable on the CLI.</p></li>
+<li><p>Perhaps the <code>btrfs-rec inspect rebuild-trees</code> algorithm
+could be adjusted to also try to rebuild trees with missing parents; see
+the above discussion of the algorithm.</p></li>
+</ul>
+<h1 id="problems-for-merging-this-code-into-btrfs-progs">6. Problems for
+merging this code into btrfs-progs</h1>
+<ul>
+<li><p>It's written in Go, not C.</p></li>
+<li><p>It's effectively GPLv3+ (not GPLv2-only or GPLv2+) because of use
+of some code under the Apache 2.0 license (2 files in the codebase
+itself that are based off of Apache-licensed code, and use of unmodified
+3rd-party libraries).</p></li>
+<li><p>It uses ARC (Adaptive Replacement Cache), which is patented by
+IBM, and the patent doesn't expire for another 7 months. An important
+property of ARC over LRU is that it is scan-resistant; the above
+algorithms do a lot of scanning. On that note, now that Red Hat is owned
+by IBM: who in the company do we need to get to talk to each other so
+that we can get ARC into the Linux kernel before then?</p></li>
+</ul>
+<div style="font-family: monospace">
+<p>-- <br/> Happy hacking,<br/> ~ Luke Shumaker<br/></p>
+</div>
+
+</article>
+<footer>
+<p>The content of this page is Copyright © 2023 <a href="mailto:lukeshu@sbcglobal.net">Luke Shumaker</a>.</p>
+<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA-3.0</a> license.</p>
+</footer>
+</body>
+</html>