<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="utf-8">
  <title>Announcing: btrfs-rec: Recover (data from) a broken btrfs filesystem — Luke Shumaker</title>
  <link rel="stylesheet" href="assets/style.css">
  <link rel="alternate" type="application/atom+xml" href="./index.atom" name="web log entries"/>
</head>
<body>
<header><a href="/">Luke Shumaker</a> » <a href=/blog>blog</a> » btrfs-rec</header>
<article>
<h1 id="announcing-btrfs-rec-recover-data-from-a-broken-btrfs-filesystem">Announcing:
btrfs-rec: Recover (data from) a broken btrfs filesystem</h1>
<blockquote>
<p>I originally sent this email on 2023-07-10, but it has been caught by
their bogofilter. Yes, it was <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/README.md?id=18e6066c241cf3d252b6521150843ffc858d8434">plaintext</a>.
No, I didn't use GMail. Yes, I've successfully participated in vger
lists in the past. Yes, I've reached out to postmaster; no, I haven't
received a reply yet (as of 2023-07-14).</p>
</blockquote>
<div style="font-family: monospace">
<p>To: linux-btrfs@vger.kernel.org<br/>
From: Luke T. Shumaker &lt;lukeshu@lukeshu.com&gt;<br/>
Subject: btrfs-rec: Recover (data from) a broken btrfs filesystem<br/>
Date: Mon, 10 Jul 2023 21:23:41 -0600<br/>
Message-ID: &lt;87jzv7uo5e.wl-lukeshu@lukeshu.com&gt;<br/></p>
</div>
<p>Inspired by a mis-typed <code>dd</code> command, for the last year
I've been working on a tool for recovering corrupt btrfs filesystems; at
first idly here and there, but more actively in the last few months. I
hope to get it incorporated into btrfs-progs, though perhaps that is
problematic for a few reasons I'll get to. If the code can't be
incorporated into btrfs-progs, at least the ideas and algorithms should
be.</p>
<p><a
href="https://git.lukeshu.com/btrfs-progs-ng/">https://git.lukeshu.com/btrfs-progs-ng/</a></p>
<p>Highlights:</p>
<ul>
<li><p>In general, it's more tolerant of corrupt filesystems than
<code>btrfs check --repair</code>, <code>btrfs rescue</code> or
<code>btrfs restore</code>.</p></li>
<li><p><code>btrfs-rec inspect rebuild-mappings</code> is a better
<code>btrfs rescue chunk-recover</code>.</p></li>
<li><p><code>btrfs-rec inspect rebuild-trees</code> can re-attach lost
branches to broken B+ trees.</p></li>
<li><p><code>btrfs-rec inspect mount</code> is a read-only FUSE
implementation of btrfs. This is conceptually a replacement for
<code>btrfs restore</code>.</p></li>
<li><p>It's entirely written in Go. I'm not saying that's a good thing,
but it's an interesting thing.</p></li>
</ul>
<p>Hopefully some folks will find it useful, or at least neat!</p>
<ul>
<li><a href="#motivation">1. Motivation</a></li>
<li><a href="#overview-of-use">2. Overview of use</a></li>
<li><a href="#prior-art">3. Prior art</a></li>
<li><a href="#internalsdesign">4. Internals/Design</a></li>
<li><a href="#overview-of-the-source-tree-layout">4.1. Overview of the
source tree layout</a></li>
<li><a href="#base-decisions-cli-structure-go-json">4.2. Base decisions:
CLI structure, Go, JSON</a></li>
<li><a href="#algorithms">4.3. Algorithms</a></li>
<li><a href="#the-rebuild-mappings-algorithm">4.3.1. The
<code>rebuild-mappings</code> algorithm</a></li>
<li><a href="#the---rebuild-algorithm">4.3.2. The <code>--rebuild</code>
algorithm</a></li>
<li><a href="#rebuilt-forrest-behavior-looking-up-trees">4.3.2.1.
rebuilt forrest behavior</a></li>
<li><a href="#rebuilt-individual-tree-behavior">4.3.2.2. rebuilt
individual tree behavior</a></li>
<li><a href="#the-rebuild-trees-algorithm">4.3.3. The
<code>rebuild-trees</code> algorithm</a></li>
<li><a href="#initialization">4.3.3.1. initialization</a></li>
<li><a href="#the-main-loop">4.3.3.2. the main loop</a></li>
<li><a href="#graph-callbacks">4.3.3.3. graph callbacks</a></li>
<li><a href="#future-work">5. Future work</a></li>
<li><a href="#problems-with-merging-this-code-into-btrfs">6. Problems
for merging this code into btrfs-progs</a></li>
</ul>
<h1 id="motivation">1. Motivation</h1>
<p>Have you ever ended up with a corrupt btrfs filesystem (through no
fault of btrfs itself, but perhaps a failing drive, or a mistaken
<code>dd</code> invocation)? Surely losing less than 100MB of data from
a drive should not render hundreds of GB of perfectly intact data
unreadable! And yet, the existing tools are unable to even attempt to
read that data:</p>
<pre><code>$ btrfs check --repair --force dump-zero.1.img
enabling repair mode
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system</code></pre>
<p>or</p>
<pre><code>$ btrfs check --init-extent-tree --force dump-zero.1.img
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system</code></pre>
<p>or</p>
<pre><code>$ btrfs check --init-csum-tree --force dump-zero.1.img
Creating a new CRC tree
Opening filesystem to check...
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
ERROR: cannot open file system</code></pre>
<p>or</p>
<pre><code>$ btrfs rescue chunk-recover dump-zero.1.img
Scanning: DONE in dev0
corrupt node: root=1 block=160410271744 slot=0, corrupt node: root=1 block=160410271744, nritems too large, have 39 expect range [1,0]
Couldn't read tree root
open with broken chunk error</code></pre>
<p>or</p>
<pre><code>$ btrfs rescue zero-log dump-zero.1.img
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
ERROR: cannot read chunk root
ERROR: could not open ctree</code></pre>
<p>or</p>
<pre><code>$ mkdir out
$ btrfs restore dump-zero.1.img out
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
Could not open root, trying backup super</code></pre>
<p>or</p>
<pre><code>$ btrfs restore --list-roots dump-zero.1.img
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
checksum verify failed on 1048576 wanted 0xf81c950a found 0xd66a46e0
bad tree block 1048576, bytenr mismatch, want=1048576, have=11553381380038442733
ERROR: cannot read chunk root
Could not open root, trying backup super
ERROR: superblock bytenr 274877906944 is larger than device size 256060514304
Could not open root, trying backup super</code></pre>
<p>or</p>
<pre><code>$ btrfs-find-root dump-zero.1.img
WARNING: cannot read chunk root, continue anyway
Superblock thinks the generation is 6596071
Superblock thinks the level is 1</code></pre>
<p>Well, have I got a tool for you!</p>
<p>(FWIW, I also tried manipulating the filesystem and patching the
tools to try to get past those errors, only to get a different set of
errors. Some of these patches I am separately submitting to
btrfs-progs.)</p>
<h1 id="overview-of-use">2. Overview of use</h1>
<p>There are two <code>btrfs-rec</code> sub-command groups:
<code>btrfs-rec inspect <var>SUBCMD</var></code> and <code>btrfs-rec
repair <var>SUBCMD</var></code>, and you can find out about various
sub-commands with <code>btrfs-rec help</code>. These are both told about
devices or images with the <code>--pv</code> flag.</p>
<p><code>btrfs-rec inspect <var>SUBCMD</var></code> commands open the
filesystem read-only, and (generally speaking) write extracted or
rebuilt information to stdout.
<code>btrfs-rec repair
<var>SUBCMD</var></code> commands open the filesystem read+write, and
consume information from <code>btrfs-rec inspect
<var>SUBCMD</var></code> commands to actually repair the filesystem
(except I haven't actually implemented any <code>repair</code> commands
yet... despite the lack of <code>repair</code> commands, I believe that
<code>btrfs-rec</code> is already useful because of the
<code>btrfs-rec inspect mount</code> command to get data out of the
broken filesystem). This split allows you to try things without being
scared by WARNINGs about not using these tools unless you're an expert
or have been told to by a developer.</p>
<p>In the broken <code>dump-zero.1.img</code> example above (which has a
perfectly intact superblock, but a totally broken
<code>CHUNK_TREE</code>), to "repair" it I'd:</p>
<ol type="1">
<li><p>Start by using <code>btrfs-rec inspect rebuild-mappings</code> to
rebuild the broken chunk/dev/blockgroup trees:</p>
<pre><code>$ btrfs-rec inspect rebuild-mappings \
    --pv=dump-zero.1.img \
    > mappings-1.json</code></pre></li>
<li><p>This may only mostly succeed, telling us on stderr about a few
regions of the image that it wasn't able to figure out the chunks for.
Using some human-level knowledge, you can write those mappings yourself
(their shape is sketched just after this list), insert them into the
generated <code>mappings.json</code>, and ask
<code>rebuild-mappings</code> to normalize what you wrote:</p>
<pre><code>$ btrfs-rec inspect rebuild-mappings \
    --pv=dump-zero.1.img \
    --mappings=&lt;(sed &lt;mappings-1.json \
        -e '2a{"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1},' \
        -e '2a{"LAddr":13631488,"PAddr":{"Dev":1,"Addr":13631488},"Size":1},') \
    > mappings-2.json</code></pre></li>
<li><p>Now that it has functioning chunk/dev/blockgroup trees, we can
use <code>btrfs-rec inspect rebuild-trees</code> to rebuild other trees
that rely on those:</p>
<pre><code>$ btrfs-rec inspect rebuild-trees \
    --pv=dump-zero.1.img \
    --mappings=mappings-2.json \
    > trees.json</code></pre></li>
<li><p>Now that (hopefully) everything that was damaged has been
reconstructed, we can use <code>btrfs-rec inspect mount</code> to mount
the filesystem read-only and copy out our data:</p>
<pre><code>$ mkdir mnt
$ sudo btrfs-rec inspect mount \
    --pv=dump-zero.1.img \
    --mappings=mappings-2.json \
    --trees=trees.json \
    ./mnt</code></pre></li>
</ol>
<p>This example is fleshed out more (and the manual edits to
<code>mappings.json</code> explained more) in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples/main.sh?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./examples/main.sh</code></a>.</p>
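<p>To give a concrete idea of the shape of those manually-written
<code>mappings.json</code> entries from step 2, here is a minimal Go
sketch that round-trips one of them. The type and field names here are
hypothetical stand-ins; the real types live in
<code>lib/btrfs/btrfsvol/</code> and may differ:</p>
<pre><code>package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical types mirroring the mappings.json entries shown above.
type QualifiedPhysicalAddr struct {
	Dev  int64 // device ID
	Addr int64 // byte offset on that device
}

type Mapping struct {
	LAddr int64                 // logical byte address
	PAddr QualifiedPhysicalAddr // where that logical address lives
	Size  int64                 // length of the mapping, in bytes
}

func main() {
	// One of the manually-written entries from step 2 above.
	raw := `{"LAddr":5242880,"PAddr":{"Dev":1,"Addr":5242880},"Size":1}`
	var m Mapping
	if err := json.Unmarshal([]byte(raw), &amp;m); err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", m)
}</code></pre>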
<h1 id="prior-art">3. Prior art</h1>
<p>Comparing <code>btrfs-rec inspect mount</code> with the existing
https://github.com/adam900710/btrfs-fuse project:</p>
<ul>
<li>Again, mine has better fault tolerance</li>
<li>Mine is read-only</li>
<li>Mine supports xattrs ("TODO" in Adam's)</li>
<li>Mine supports separate inode address spaces for subvolumes; Adam's
doesn't due to limitations in FUSE. Mine works around this by lazily
setting up separate mountpoints for each subvolume (though this does
mean that the process needs to run as root, which is a bummer).</li>
</ul>
<h1 id="internalsdesign">4. Internals/Design</h1>
<h2 id="overview-of-the-source-tree-layout">4.1. Overview of the source
tree layout</h2>
<ul>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>examples/</code></a>
has example scripts showing how to use <code>btrfs-rec</code>.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfs?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfs/</code></a>
is the core btrfs implementation.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfscheck?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfscheck/</code></a>
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/btrfsutil/</code></a>
are libraries for "btrfs-progs"-type programs: userland-y things that I
thought should be separate from the core implementation. Something that
frustrated me about libbtrfs was having to figure out "is this thing
here in support of btrfs bits-on-disk, or in support of a higher-level
'how btrfs-progs wants to think about things'?"</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>cmd/btrfs-rec/</code></a>
is where the command implementations live. If a sub-command fits in a
single file, it's
<code>cmd/btrfs-rec/inspect_<var>SUBCMD</var>.go</code>; otherwise, it's
in a separate <code>cmd/btrfs-rec/inspect/<var>SUBCMD</var>/</code>
package.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/textui?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/textui/</code></a>
is reasonably central to how the commands implement a text/CLI
user-interface.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/binstruct?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/binstruct/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/diskio?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/diskio/</code></a>,
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/streamio?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/streamio/</code></a>
are non-btrfs-specific libraries related to the problem domain.</p></li>
<li><p><a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/containers?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/containers/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/fmtutil?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/fmtutil/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/maps?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/maps/</code></a>,
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/slices?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/slices/</code></a>,
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/profile?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>lib/profile/</code></a>
are all generic Go libraries that have nothing to do with btrfs or the
problem domain, but weren't in the Go standard library and I didn't
find/know-of existing implementations that I liked. Of these, all but
<code>containers</code> are pretty simple utility libraries. Also, some
of these things have been added to the standard library since I started
the project.</p></li>
</ul>
<h2 id="base-decisions-cli-structure-go-json">4.2. Base decisions: CLI
structure, Go, JSON</h2>
<p>I started with trying to enhance btrfs-progs, but ended up writing a
wholly new program in Go, for several reasons:</p>
<ul>
<li><p>writing a new thing: I was having to learn both the btrfs-progs
codebase and how btrfs-bits-on-disk work, and it got to the point that I
decided I should just focus on learning btrfs-bits-on-disk.</p></li>
<li><p>writing a new thing: It was becoming increasingly apparent to me
that it was going to be an uphill fight to have recovery-tools share the
same code as the main-tools, as the routines used by the main-tools
rightly have validity checks, where recovery-tools want to say "yes, I
know it's invalid, can you give it to me anyway?".</p></li>
<li><p>writing it in not-C: I love me some C, but higher-level languages
are good for productivity. And since I was trying to write a whole lot
of code at once, I needed a productivity boost.</p></li>
<li><p>writing it in not-C: This forced me to learn btrfs-bits-on-disk
better, instead of just cribbing from btrfs-progs. That knowledge is
particularly important for having ideas on how to deal with corrupt
bits-on-disk.</p></li>
<li><p>writing it in Go: At the time I started, my day job was writing
Go, so I had Go swapped into my brain. And Go still feels close to C but
provides <em>a lot</em> of niceness and safety over C.</p></li>
</ul>
<p>It turned out that Go was perhaps not the best choice, but we'll come
back to that.</p>
<p>I wanted to separate things into a pipeline. For instance: instead of
<code>btrfs rescue chunk-recover</code> trying to do everything to
rebuild a broken chunk tree, I wanted to separate I/O from computation
from repairs. So I have
<code>btrfs-rec inspect rebuild-mappings scan</code>, which reads all
the info necessary to rebuild the chunk tree, then dumps that as a 2GB
glob of JSON. Then I can feed that JSON to
<code>btrfs-rec inspect rebuild-mappings process</code>, which actually
rebuilds the mappings in the chunk tree, and dumps them as JSON. And
then other commands can consume that <code>mappings.json</code> to use
instead of trying to read the chunk tree from the actual FS, so that you
don't have to make potentially destructive writes to inspect an FS with
a broken chunk tree, and can inspect it more forensically.
Or you can
then use <code>btrfs-rec repair
<var>SOME_SUBCMD_I_HAVENT_WRITTEN_YET</var></code> to write that chunk
tree in <code>mappings.json</code> back to the filesystem.</p>
<p>(But also, the separate-steps thing was useful just so I could
iterate on the algorithms of <code>rebuild-mappings process</code>
separately from having to scan the entire FS.)</p>
<p>So, I made the decision that <code>btrfs-rec inspect
<var>SUBCMD</var></code> commands should all only open the FS read-only,
and output their work to a separate file; writing that info back to the
FS should be separate, in <code>btrfs-rec repair
<var>SUBCMD</var></code>.</p>
<p>For connecting those parts of the pipeline, I chose JSON, for a few
reasons:</p>
<ul>
<li><p>I wanted something reasonably human-readable, so that I could
debug it easier.</p></li>
<li><p>I wanted something reasonably human-readable, so that human
end-users could make manual edits; for example, in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/examples/main.sh?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>examples/main.sh</code></a>
I have an example of manually editing <code>mappings.json</code> to
resolve a region that the algorithm couldn't figure out, but that a
human can, with knowledge of what caused the corruption.</p></li>
<li><p>I didn't want to invent my own DSL and have to handle writing a
parser. (This part didn't pay off! See below.)</p></li>
<li><p>I wanted something that I thought would have good support in a
variety of languages, so that if Go is problematic for getting things
merged upstream it could be rewritten in C (or maybe Rust?) piecemeal,
where each subcommand can be rewritten one at a time.</p></li>
</ul>
<p>It turned out that JSON was perhaps not the best choice.</p>
<p>OK, so: Go and/or JSON maybe being mistakes:</p>
<ul>
<li><p>I spent a lot of time getting the garbage collector to not just
kill performance.</p></li>
<li><p>The <code>btrfs-rec inspect rebuild-mappings
<var>SUBCMD</var></code> subcommands all throw a lot of data through the
JSON encoder/decoder, and I learned that the Go stdlib
<code>encoding/json</code> package has memory use that grows O(n^2)
(-ish? I didn't study the implementation, but that's what the curve
looks like just observing it) on the size of the data being shoved
through it, so I had to take a break and go write
https://pkg.go.dev/git.lukeshu.com/go/lowmemjson, which is a
mostly-drop-in replacement that tries to be as close as possible to O(1)
memory use (the streaming idea is illustrated just below). So I did end
up having to write my own parser anyway :(</p></li>
</ul>
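<p>For illustration, here is what that kind of streaming looks like with
the stdlib decoder: consuming a large JSON array one element at a time
instead of unmarshaling the whole thing. This is a minimal sketch of the
general technique only, not the <code>lowmemjson</code> API (which
differs), and the <code>Mapping</code> fields shown are a stand-in:</p>
<pre><code>package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Mapping is a stand-in element type; see the sketch in section 2 above.
type Mapping struct {
	LAddr int64
	Size  int64
}

func main() {
	r := strings.NewReader(`[{"LAddr":1048576,"Size":4096},{"LAddr":5242880,"Size":1}]`)
	dec := json.NewDecoder(r)
	if _, err := dec.Token(); err != nil { // consume the opening '['
		panic(err)
	}
	for dec.More() {
		// Decode one array element at a time, instead of unmarshaling
		// the entire (potentially multi-GB) array into memory at once.
		var m Mapping
		if err := dec.Decode(&amp;m); err != nil {
			panic(err)
		}
		fmt.Printf("%+v\n", m)
	}
}</code></pre>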
<h2 id="algorithms">4.3. Algorithms</h2>
<p>There are 3 algorithms of note in <code>btrfs-rec</code> that I think
are worth getting into mainline btrfs-progs even if the code of
<code>btrfs-rec</code> doesn't get in:</p>
<ol type="1">
<li><p>The <code>btrfs-rec inspect rebuild-mappings</code> algorithm
to rebuild information from the <code>CHUNK_TREE</code>,
<code>DEV_TREE</code>, and <code>BLOCK_GROUP_TREE</code>.</p></li>
<li><p>The <code>btrfs-rec --rebuild</code> algorithm to cope with
reading broken B+ trees.</p></li>
<li><p>The <code>btrfs-rec inspect rebuild-trees</code> algorithm to
re-attach lost branches to broken B+ trees.</p></li>
</ol>
<h3 id="the-rebuild-mappings-algorithm">4.3.1. The
<code>rebuild-mappings</code> algorithm</h3>
<p>(This step-zero scan is
<code>btrfs-rec inspect rebuild-mappings scan</code>, and principally
lives in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/scan.go</code></a>
and <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/scan.go</code></a>)</p>
<ol start="0" type="1">
<li>Similar to <code>btrfs rescue chunk-recover</code>, scan each device
for things that look like nodes; keep track of:
<ul>
<li>Checksums of every block on the device</li>
<li>Which physical addresses contain nodes that claim to be at a given
logical address.</li>
<li>Any found Chunk items, BlockGroup items, DevExtent items, and CSum
items. Keep track of the key for each of these, and for CSum items also
track the generation.</li>
</ul></li>
</ol>
<p>Create a bucket of the data from Chunks, DevExtents, and BlockGroups;
since these are mostly redundant (a Chunk and a DevExtent+BlockGroup
store pretty much the same information), we can use one to reconstruct
the other. How we "merge" these and handle conflicts is in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfs/btrfsvol/lvm.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n121"><code>./lib/btrfs/btrfsvol/lvm.go:addMapping()</code></a>.
I don't think this part is particularly clever, but given that
<code>btrfs rescue chunk-recover</code> crashes if it encounters two
overlapping chunks, I suppose I should spell it out (see also the sketch
after this list):</p>
<ul>
<li><p>A "mapping" is represented as a group of 4 things:</p>
<ul>
<li>logical address</li>
<li>a list of 1 or more physical addresses (device ID and offset)</li>
<li>size, and a Boolean indicator of whether the size is "locked"</li>
<li>block group flags, and a Boolean presence-indicator</li>
</ul></li>
<li><p>Mappings must be merged if their logical or physical regions
overlap.</p></li>
<li><p>If a mapping has a "locked" size, then when merging it may
subsume smaller mappings with unlocked sizes, but its size cannot be
changed; trying to merge a locked-size mapping with another mapping that
is not for a subset region should return an error.</p></li>
<li><p>If a mapping has block group flags present, then those flags may
not be changed; it may only be merged with another mapping that does not
have flags present, or has identical flags.</p></li>
<li><p>When returning an error because of overlapping non-mergeable
mappings, just log an error on stderr and keep going. That's an
important design thing that is different than normal filesystem code; if
there's an error, yeah, detect and notify about it, <strong>but don't
bail out of the whole routine</strong>. Just skip that one item or
whatever.</p></li>
</ul>
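<p>Condensed into Go, the overlap/compatibility part of those rules
looks roughly like the following. This is a sketch for exposition only;
the real logic is in <code>lvm.go:addMapping()</code>, and also handles
the lists of physical addresses and the actual merging:</p>
<pre><code>package main

import "fmt"

type BlockGroupFlags uint64

// Mapping is a pared-down stand-in; the real type also carries the list
// of physical addresses.
type Mapping struct {
	LAddr      int64
	Size       int64
	SizeLocked bool
	Flags      *BlockGroupFlags // nil = flags not present
}

// overlaps reports whether the half-open ranges [aBeg,aBeg+aSize) and
// [bBeg,bBeg+bSize) intersect.
func overlaps(aBeg, aSize, bBeg, bSize int64) bool {
	return aBeg &lt; bBeg+bSize &amp;&amp; bBeg &lt; aBeg+aSize
}

// checkMergeable applies two of the rules from the list above: a
// locked-size mapping may only subsume subset regions, and present
// flags must agree.
func checkMergeable(a, b Mapping) error {
	if a.SizeLocked &amp;&amp; !(b.LAddr >= a.LAddr &amp;&amp; b.LAddr+b.Size &lt;= a.LAddr+a.Size) {
		return fmt.Errorf("would change locked size of mapping at laddr=%v", a.LAddr)
	}
	if a.Flags != nil &amp;&amp; b.Flags != nil &amp;&amp; *a.Flags != *b.Flags {
		return fmt.Errorf("conflicting block group flags at laddr=%v", a.LAddr)
	}
	return nil
}

func main() {
	flags := BlockGroupFlags(1)
	a := Mapping{LAddr: 0, Size: 1024, SizeLocked: true, Flags: &amp;flags}
	b := Mapping{LAddr: 512, Size: 1024} // overlaps a, but extends past it
	if overlaps(a.LAddr, a.Size, b.LAddr, b.Size) {
		if err := checkMergeable(a, b); err != nil {
			fmt.Println("error:", err) // log-and-continue, per the design note above
		}
	}
}</code></pre>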
<p>Now that we know how to "add a mapping", let's do that:</p>
<p>(The following main-steps are
<code>btrfs-rec inspect rebuild-mappings process</code>, and principally
live in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/process.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/process.go</code></a>)</p>
<ol type="1">
<li><p>Add all found Chunks.</p></li>
<li><p>Add all found DevExtents.</p></li>
<li><p>Add a physical:logical mapping of length nodesize for each node
that was found.</p></li>
<li><p>For any mappings from steps 2 or 3 that are missing blockgroup
flags (that is: they weren't able to be merged with a mapping from step
1), use the found BlockGroups to fill in those flags.</p></li>
<li><p>Now we'll merge all found CSum items into a map of the sums of
the logical address space. Sort all of the csum items by generation,
then by address. Loop over them in that order, inserting their sums into
the map. If two csum items overlap, but agree about the sums of the
overlapping region, that's fine; just take their union. For overlaps
that disagree, items with a newer generation kick out items with an
older generation. If disagreeing items have the same generation... I
don't think that can happen except by a filesystem bug (i.e. not by a
failing drive or other external corruption), so I wasn't too concerned
about it; I just log an error on stderr and skip the later-processed
item. See <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildmappings/process_sums_logical.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildmappings/process_sums_logical.go</code></a>.</p>
<p>Look at regions of the logical address space that meet all 3
criteria:</p>
<ul>
<li>we have CSum items for them</li>
<li>we have a BlockGroup for them</li>
<li>we don't have a Chunk/DevExtent mapping them to the physical address
space.</li>
</ul>
<p>Pair those CSums up with BlockGroups, and for each BlockGroup, search
the list of checksums of physical blocks to try to find a physical
region that matches the logical csums (and isn't already mapped to a
different logical region). I used a Knuth-Morris-Pratt search, modified
to handle holes in the logical csum list as wildcards (see the sketch
after this list).</p>
<p>Insert any found mappings into our bucket of mappings.</p></li>
<li><p>Do the same again, but with a fuzzy search (we can re-use the
csum map of the logical address space). My implementation of this is
comparatively time and space intensive; I just walk over the entire
unmapped physical address space, noting what percent match each
BlockGroup has if placed at that location. I keep track of the best 2
matches for each BlockGroup. If the best match is better than a 50%
match, and the second best is less than a 50% match, then I add the best
match. In my experience, the best match is >90% (or at whatever the
maximum percent is for how much of the BlockGroup has logical sums), and
the second best is 0% or 1%. The point of tracking both is that if there
isn't a clear-cut winner, I don't want it to commit to a potentially
wrong choice.</p></li>
</ol>
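<p>To make the "holes as wildcards" matching from step 5 concrete, here
is a naive sketch of the search semantics. The real implementation uses
a modified Knuth-Morris-Pratt search rather than this O(n*m) scan, and
real csums aren't plain integers; treating 0 as a hole is an assumption
made just for this example:</p>
<pre><code>package main

import "fmt"

// findWithWildcards returns the offsets in haystack at which needle
// matches, treating zero entries in needle as wildcards (the "holes" in
// the logical csum list). Only the matching semantics are shown here.
func findWithWildcards(haystack, needle []uint32) []int {
	var matches []int
	for i := 0; i+len(needle) &lt;= len(haystack); i++ {
		ok := true
		for j, want := range needle {
			if want != 0 &amp;&amp; haystack[i+j] != want {
				ok = false
				break
			}
		}
		if ok {
			matches = append(matches, i)
		}
	}
	return matches
}

func main() {
	physical := []uint32{7, 1, 2, 3, 4, 9} // sums of physical blocks
	logical := []uint32{1, 0, 3}           // 0 = hole (no csum recorded)
	fmt.Println(findWithWildcards(physical, logical)) // [1]
}</code></pre>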
<h3 id="the---rebuild-algorithm">4.3.2. The <code>--rebuild</code>
algorithm</h3>
<p>The <code>--rebuild</code> flag is implied by the
<code>--trees=trees.json</code> flag, and triggers an algorithm that
allows "safely" reading from a broken B+ tree, rather than the usual B+
tree lookup and search functions. (I probably should have tried to
understand the <code>btrfs restore</code> algorithm; maybe I reinvented
the wheel...)</p>
<p>This algorithm requires a list of all nodes on the filesystem; we
find these using the same scan as above (<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/scan.go</code></a>),
the same procedure as <code>btrfs rescue chunk-recover</code>.</p>
<p>We walk all of those nodes, and build a reasonably lightweight
in-memory graph of all nodes (<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/graph.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/graph.go</code></a>),
tracking (a rough sketch of these structures follows the list):</p>
<ul>
<li>each node's
<ul>
<li>logical address</li>
<li>level</li>
<li>generation</li>
<li>tree</li>
<li>each item's key and size</li>
</ul></li>
<li>each keypointer's
<ul>
<li>source node</li>
<li>source slot within the node</li>
<li>tree of the source node</li>
<li>destination node</li>
<li>destination level implied by the level of the source node</li>
<li>destination key</li>
<li>destination generation</li>
</ul></li>
<li>logical addresses and error messages for nodes that are pointed to
by a keypointer or the superblock, but can't be read (because that
logical address isn't mapped, or it doesn't look like a node,
or...)</li>
<li>an index such that for a given node we can quickly list the
keypointers both originating at that node and pointing to that
node.</li>
</ul>
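<p>As promised, a rough Go sketch of what such a graph might look like;
the field and type names here are illustrative stand-ins, not the real
ones from <code>./lib/btrfsutil/graph.go</code>:</p>
<pre><code>package main

// Illustrative shapes for the in-memory graph of nodes and keypointers.
type (
	LogicalAddr uint64
	Generation  uint64
	ObjID       uint64
)

type Key struct {
	ObjectID ObjID
	ItemType uint8
	Offset   uint64
}

type GraphNode struct {
	Addr       LogicalAddr
	Level      uint8
	Generation Generation
	Owner      ObjID    // the tree that claims this node
	ItemKeys   []Key    // key of each item in the node
	ItemSizes  []uint32 // size of each item in the node
}

type GraphEdge struct {
	FromNode LogicalAddr
	FromSlot int
	FromTree ObjID
	ToNode   LogicalAddr
	ToLevel  uint8 // implied by the level of the source node
	ToKey    Key
	ToGen    Generation
}

type Graph struct {
	Nodes     map[LogicalAddr]GraphNode
	BadNodes  map[LogicalAddr]error        // pointed-to, but can't be read
	EdgesTo   map[LogicalAddr][]*GraphEdge // keypointers pointing at a node
	EdgesFrom map[LogicalAddr][]*GraphEdge // keypointers originating at a node
}

func main() {}</code></pre>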
<h4 id="rebuilt-forrest-behavior-looking-up-trees">4.3.2.1. rebuilt
forrest behavior (looking up trees)</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/rebuilt_forrest.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/rebuilt_forrest.go</code></a>)</p>
<ul>
<li>The <code>ROOT_TREE</code>, <code>CHUNK_TREE</code>,
<code>TREE_LOG</code>, and <code>BLOCK_GROUP_TREE</code> (the trees
pointed to directly by the superblock) work as you'd expect.</li>
<li>For other trees, we (as you'd expect) look up the root item in the
rebuilt <code>ROOT_TREE</code>, and then (if rootitem.ParentUUID is
non-zero) eagerly also look up the parent tree (recursing on ourself).
We try to use the <code>UUID_TREE</code> tree to help with this, but
fall back to just doing a linear scan over the <code>ROOT_TREE</code>.
If we fail to look up the parent tree (or its parent, or a more distant
ancestor), then (depending on a flag) we either make a note of that, or
error out and fail to look up the child tree. For <code>--rebuild</code>
and <code>--trees=trees.json</code> we are permissive of this error, and
just make note of it; but we'll re-use this algorithm in the
<code>rebuild-trees</code> algorithm below, and it needs the more strict
handling.</li>
<li>When creating the rebuilt individual tree, we start by adding the
root node specified by the superblock/root-item. But we may also add
additional root nodes grafted on to the tree by the
<code>--trees=trees.json</code> flag or by the
<code>rebuild-trees</code> algorithm below. So a tree may have more than
1 root node.</li>
</ul>
<h4 id="rebuilt-individual-tree-behavior">4.3.2.2. rebuilt individual
tree behavior</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfsutil/rebuilt_tree.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfsutil/rebuilt_tree.go</code></a>)</p>
<p>In order to read from a tree, we first have to build a few indexes.
We store these indexes in an Adaptive Replacement Cache; they are all
re-buildable based on the tree's list of roots and the above graph; if
we have a bunch of trees we don't need to keep all of this in memory at
once. Note that this is done 100% with the in-memory graph; we don't
need to read anything from the filesystem during these procedures.
(Go-ish sketches of the index types follow this list.)</p>
<ul>
<li><p>The first index we build is the "node index". This is an index
that for every node tells us what root(s) the tree would need to have in
order for the tree to include that node, and also what the highest item
key would be acceptable in the node if the tree includes that root. We
track both a <code>loMaxItem</code> and a <code>hiMaxItem</code>, in
case the tree is real broken and there are multiple paths from the root
to the node, as these different paths may imply different max-item
constraints. Put more concretely, the type of the index is:</p>
<pre><code>map[ nodeID → map[ rootNodeID → {loMaxItem, hiMaxItem} ] ]</code></pre>
<p>We'll do a loop over the graph, using dynamic-programming memoization
to figure out ordering and avoid processing the same node twice; for
each node we'll</p>
<ul>
<li><p>Check whether the owner-tree is this tree or one of this tree's
ancestors (and if it's an ancestor, that the node's generation isn't
after the point that the child tree was forked from the parent tree). If
not, we are done processing that node (record an empty/nil set of roots
for it).</p></li>
<li><p>Create an empty map of <code>rootID</code> →
{<code>loMaxItem</code>, <code>hiMaxItem</code>}.</p></li>
<li><p>Look at each keypointer that points at the node and:</p>
<ul>
<li><p>Skip the keypointer if its expectations of the node aren't met:
if the level, generation, and min-key constraints don't match up. If the
keypointer isn't in the last slot in the source node, we also go ahead
and include checking that the destination node's max-key is under the
min-key of the keypointer in the next slot, since that's cheap to do
now.</p></li>
<li><p>Skip the keypointer if its source node's owner-tree isn't this
tree or one of this tree's ancestors (and if it's an ancestor, that the
node's generation isn't after the point that the child tree was forked
from the parent tree).</p></li>
<li><p>dynamic-programming recurse and index the keypointer's source
node.</p></li>
<li><p>for every root that would result in the keypointer's source node
being included in the tree:</p>
<ul>
<li><p>If the keypointer is in the last slot, look at what the source
node's last-item constraints would be if that root is included, and we
can now check the max-item of our destination node. We check against the
<code>hiMaxItem</code>: if there is any valid path from the root to this
node, then we want to be permissive and include it. If that check fails,
then we're done with this keypointer. Also, make note of those
<code>loMaxItem</code> and <code>hiMaxItem</code> values; we'll use them
again in just a moment.</p></li>
<li><p>Otherwise, set both <code>loMaxItem</code> and
<code>hiMaxItem</code> to 1-under the min-item of the keypointer in the
next slot.</p></li>
<li><p>Insert that <code>loMaxItem</code> and <code>hiMaxItem</code>
pair into the <code>rootID</code> → {<code>loMaxItem</code>,
<code>hiMaxItem</code>} map we created above. If an entry already exists
for this root (since a broken tree might have multiple paths from the
root to our node), then set <code>loMaxItem</code> to the min of the
existing entry and our value, and <code>hiMaxItem</code> to the
max.</p></li>
</ul></li>
</ul></li>
<li><p>If that <code>rootID</code> → {<code>loMaxItem</code>,
<code>hiMaxItem</code>} map is still empty, then consider this node to
be a (potential) root, and insert <code>rootID=thisNode</code> ->
{<code>loMaxItem=maxKey</code>, <code>hiMaxItem=maxKey</code>} (where
<code>maxKey</code> is the maximum value of the key datatype).</p></li>
<li><p>Take that <code>rootID</code> → {<code>loMaxItem</code>,
<code>hiMaxItem</code>} map and insert it into the index as the entry
for this node.</p></li>
</ul></li>
<li><p>The next index we build is the "item index". This is a "sorted
map" (implemented as a red-black tree, supporting sub-range iteration)
of <code>key</code> → {<code>nodeID</code>, <code>slotNumber</code>}; a
map that for each key tells us where to find the item with that key.</p>
<ul>
<li><p>Loop over the node index, and for each node check if both (a) it
has <code>level==0</code> (is a leaf node containing items), and (b) its
set of roots that would include it has any overlap with the tree's set
of roots.</p></li>
<li><p>Loop over each of those included leaf nodes, and loop over the
items in each node. Insert the <code>key</code> → {<code>nodeId</code>,
<code>slot</code>} into our sorted map. If there is already an entry for
that key, decide which one wins by:</p>
<ul>
<li><p>Use the one from the node with the owner-tree that is closer to
this tree; a node with owner=thisTree wins over a node with
owner=thisTree.parent, which would win over a node with
owner=thisTree.parent.parent. If that's a tie, then...</p></li>
<li><p>Use the one from the node with the higher generation. If that's a
tie, then...</p></li>
<li><p>I don't know; I have the code <code>panic</code>:</p>
<pre><code>// TODO: This is a panic because I'm not really sure what the
// best way to handle this is, and so if this happens I want the
// program to crash and force me to figure out how to handle it.
panic(fmt.Errorf("dup nodes in tree=%v: old=%v=%v ; new=%v=%v",
    tree.ID,
    oldNode, tree.forrest.graph.Nodes[oldNode],
    newNode, tree.forrest.graph.Nodes[newNode]))</code></pre></li>
</ul></li>
</ul>
<p>Note that this algorithm means that for a given node we may use a few
items from that node, while having other items from that same node be
overridden by another node.</p></li>
<li><p>The final index we build is the "error index". This is an index
of what errors correspond to which range of keys, so that we can report
them, and give an idea of "there may be entries missing from this
directory" and similar.</p>
<p>For each error, we'll track the min-key and max-key of the range it
applies to, the node it came from, and what the error string is. We'll
store these into an interval tree keyed on that min-key/max-key
range.</p>
<ul>
<li><p>Create an empty set <code>nodesToProcess</code>. Now populate
it:</p>
<ul>
<li><p>Once again, we'll loop over the node index, but this time we'll
only check that there's overlap between the set of roots that would
include the node and the tree's set of roots.
For the nodes that are
included in this tree, insert both the node itself and all node IDs that
it has keypointers pointing to into the <code>nodesToProcess</code>
set.</p></li>
<li><p>Also insert all of the tree's roots into
<code>nodesToProcess</code>; this is in case the superblock/root-item
points to an invalid node that we couldn't read.</p></li>
</ul></li>
<li><p>Now loop over <code>nodesToProcess</code>. For each node, create
an empty list of errors. Use the keypointers pointing to it and the min
<code>loMaxItem</code> from the node index to construct a set of
expectations for the node; this should be reasonably straight-forward,
given:</p>
<ul>
<li><p>If different keypointers have disagreeing levels, insert an error
into the list, and don't bother with checking the node's
level.</p></li>
<li><p>If different keypointers have disagreeing generations, insert an
error into the list, and don't bother with checking the node's
generation.</p></li>
<li><p>If different keypointers have different min-item expectations,
use the max of them.</p></li>
</ul>
<p>Then:</p>
<ul>
<li>If the node is a "bad node" in the graph, insert the error message
associated with it. Otherwise, check those expectations against the node
in the graph.</li>
</ul>
<p>If the list of error messages is non-empty, then insert their
concatenation into the interval tree, with the range set to the min of
the min-item expectations from the keypointers through the max of the
<code>hiMaxItem</code>s from the node index. If the min min-item
expectation turns out to be higher than the max <code>hiMaxItem</code>,
then set the range to the zero-key through the max-key.</p></li>
</ul></li>
</ul>
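<p>As promised above, here are Go-ish sketches of the shapes of those
three indexes. The real code backs the item index with a red-black tree
supporting sub-range iteration and the error index with an interval tree
(both from <code>lib/containers</code>); the flat shapes below are
simplifications for exposition:</p>
<pre><code>package main

// Illustrative shapes only; see lib/btrfsutil/rebuilt_tree.go for the
// real structures.
type (
	LogicalAddr uint64
	Key         struct {
		ObjectID uint64
		ItemType uint8
		Offset   uint64
	}
)

// The node index: for each node, which roots would include it, and the
// lo/hi bounds on the acceptable max item key implied by each root.
type maxItemBounds struct {
	Lo, Hi Key
}
type nodeIndex map[LogicalAddr]map[LogicalAddr]maxItemBounds

// One entry of the item index ("sorted map" of key → location).
type itemIndexEntry struct {
	Key  Key
	Node LogicalAddr
	Slot int
}

// One entry of the error index (an interval tree over key ranges).
type errIndexEntry struct {
	MinKey, MaxKey Key
	Node           LogicalAddr
	Err            string
}

func main() {}</code></pre>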
<p>From there, it should be trivial to implement the usual B+ tree
operations using those indexes: exact-lookup using the item index, and
range-lookups and walks using the item index together with the error
index. Efficiently searching the <code>CSUM_TREE</code> requires knowing
item sizes, so that's why we recorded the item sizes into the graph.</p>
<h3 id="the-rebuild-trees-algorithm">4.3.3. The
<code>rebuild-trees</code> algorithm</h3>
<p>The <code>btrfs-rec inspect rebuild-trees</code> algorithm finds
nodes to attach as extra roots to trees. I think that conceptually it's
the simplest of the 3 algorithms, but it turned out to be the hardest to
get right. So... maybe more than for the others, reference the source
code too (<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/</code></a>),
because I might forget some small but important detail.</p>
<p>The core idea here is that we're just going to walk each tree,
inspecting each item in the tree, and checking for any items that are
implied by other items (e.g.: a dir entry item implies the existence of
an inode item for the inode that it points at). If an implied item is
not in the tree, but is in some other node, then we look at which
potential roots we could add to the tree that would add that other node.
Then, after we've processed all of the items in the filesystem, we go
add those various roots to the various trees, keeping track of which
items are added or updated. If any of those added/updated items have a
version with a newer generation on a different node, see what roots we
could add to get that newer version. Then add those roots, keeping track
of items that are added/updated. Once we reach a steady state where the
newest version of each item has been added, loop back and inspect all
added/updated items for implied items, keeping track of roots we could
add. Repeat until a steady-state is reached.</p>
<p>There are lots of little details in that process, some of which are
for correctness, and some of which are for "it should run in hours
instead of weeks."</p>
<h4 id="initialization">4.3.3.1. initialization</h4>
<p>First up, we're going to build an in-memory graph, same as above.
But this time, while we're reading the nodes to do that, we're also
going to watch for some specific items and record a few things about
them (a sketch of this scan data follows the list below).</p>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/scan.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/scan.go</code></a>)</p>
<p>For each {<code>nodeID</code>, <code>slotNumber</code>} pair that
matches one of these item types, we're going to record:</p>
<ul>
<li>flags:
<ul>
<li><code>INODE_ITEM</code>s: whether it has the
<code>INODE_NODATASUM</code> flag set</li>
</ul></li>
<li>names:
<ul>
<li><code>DIR_INDEX</code> items: the file's name</li>
</ul></li>
<li>sizes:
<ul>
<li><code>EXTENT_CSUM</code> items: the number of bytes that this is a
sum for (i.e. the item size over the checksum size, times the block
size)</li>
<li><code>EXTENT_DATA</code> items: the number of bytes in this extent
(i.e. either the item size minus
<code>offsetof(btrfs_file_extent_item.disk_bytenr)</code> if
<code>FILE_EXTENT_INLINE</code>, or else the item's
<code>num_bytes</code>).</li>
</ul></li>
<li>data backrefs:
<ul>
<li><code>EXTENT_ITEM</code>s and <code>METADATA_ITEM</code>s: a list of
the same length as the number of refs embedded in the item; for embedded
ExtentDataRefs, the list entry is the subvolume tree ID that the
ExtentDataRef points at, otherwise it is zero.</li>
<li><code>EXTENT_DATA_REF</code> items: a list of length 1, with the
sole member being the subvolume tree ID that the ExtentDataRef points
at.</li>
</ul></li>
</ul>
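<p>A sketch of what that per-item scan data might look like as Go maps
keyed on a {node, slot} pair; the names are illustrative stand-ins, and
the real structures are in <code>rebuildtrees/scan.go</code>:</p>
<pre><code>package main

// Illustrative stand-ins for the scan data recorded during initialization.
type (
	LogicalAddr uint64
	ObjID       uint64
)

// ItemPtr names an item by the node it lives in and its slot.
type ItemPtr struct {
	Node LogicalAddr
	Slot int
}

type ScanData struct {
	// INODE_ITEMs: whether INODE_NODATASUM is set.
	Flags map[ItemPtr]bool
	// DIR_INDEX items: the file's name.
	Names map[ItemPtr][]byte
	// EXTENT_CSUM and EXTENT_DATA items: the number of bytes covered.
	Sizes map[ItemPtr]uint64
	// EXTENT_ITEMs, METADATA_ITEMs, and EXTENT_DATA_REF items: the
	// subvolume tree that each (embedded) ExtentDataRef points at,
	// or zero for refs that aren't ExtentDataRefs.
	DataRefs map[ItemPtr][]ObjID
}

func main() {}</code></pre>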
<h4 id="the-main-loop">4.3.3.2. the main loop</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go</code></a>)</p>
<p>Start with that scan data (graph + info about items), and also a
rebuilt forrest from the above algorithm, but with:</p>
<ul>
<li><p>the flag set so that it refuses to look up a tree if it can't
look up all of that tree's ancestors</p></li>
<li><p>an additional "potential-item index" that is similar to the item
index. It is generated the same way and can cache/evict the same way;
the difference is that we invert the check for whether the set of roots
for a node has overlap with the tree's set of roots; we're looking for
<em>potential</em> nodes that we could add to this tree.</p></li>
<li><p>some callbacks; we'll get to what we do in these callbacks in a
bit, but for now, what the callbacks are:</p>
<ul>
<li><p>a callback that is called for each added/updated item when we add
a root.</p></li>
<li><p>a callback that is called whenever we add a root</p></li>
<li><p>a callback that intercepts looking up a root item</p></li>
<li><p>a callback that intercepts resolving a UUID to an object
ID.</p></li>
</ul></li>
</ul>
<p>(The callbacks are in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_treecb.go</code></a>)</p>
<p>We have 5 unordered queues ("work lists"?); these are sets such that,
when it's time to drain them, we'll sort the members and process them in
that order:</p>
<ol type="1">
<li>the tree queue: a list of tree IDs that we need to crawl</li>
<li>the retry-item queue: for each tree ID, a set of items that we
should re-process if we add a root to that tree</li>
<li>the added-item queue: a set of key/tree pairs identifying items that
have been added by adding a root to a tree</li>
<li>the settled-item queue: a set of key/tree pairs that have not just
been added by adding a root, but we've also verified that they are the
newest-generation item with that key that we could add to the
tree.</li>
<li>the augment queue: for each item that we want to add to a tree, the
list of roots that we could add to get that item.</li>
</ol>
<p>The queues all start out empty, except for the tree queue, which we
seed with the <code>ROOT_TREE</code>, the <code>CHUNK_TREE</code>, and
the <code>BLOCK_GROUP_TREE</code>. (It is a "TODO" task that it should
probably also be seeded with the <code>TREE_LOG</code>, but as I will
say below in the "future work" section, I don't actually understand the
<code>TREE_LOG</code>, so I couldn't implement it.)</p>
<p>Now we're going to loop until the tree queue, added-item queue,
settled-item queue, and augment queue are all empty (all queues except
for the retry-item queue). Each loop "pass" has 3 substeps (a skeleton
of this loop is sketched after this list):</p>
<ol type="1">
<li><p>Crawl the trees (drain the tree queue, fill the added-item
queue).</p></li>
<li><p>Either:</p>
<ol type="a">
<li><p>if the added-item queue is non-empty: "settle" those items (drain
the added-item queue, fill the augment queue and the settled-item
queue).</p></li>
<li><p>otherwise: process items (drain the settled-item queue, fill the
augment queue and the tree queue)</p></li>
</ol></li>
<li><p>Apply augments (drain the augment queue and maybe the retry-item
queue, fill the added-item queue).</p></li>
</ol>
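<p>Reduced to a skeleton, the control flow of that loop looks like this;
the queue element types and the step bodies are stand-ins for the real
ones in <code>rebuildtrees/rebuild.go</code>:</p>
<pre><code>package main

// A skeleton of the main loop's control flow.
type treeID uint64

type itemKey struct {
	Tree treeID
	// plus object ID, item type, and offset in the real code
}

type rebuilder struct {
	treeQueue        map[treeID]struct{}
	retryItemQueue   map[treeID]map[itemKey]struct{} // not part of the loop condition
	addedItemQueue   map[itemKey]struct{}
	settledItemQueue map[itemKey]struct{}
	augmentQueue     map[treeID]int // really: per-"want" sets of candidate roots
}

func (o *rebuilder) rebuild() {
	for len(o.treeQueue)+len(o.addedItemQueue)+len(o.settledItemQueue)+len(o.augmentQueue) > 0 {
		// 1. Crawl the trees (drain the tree queue, fill the added-item queue).
		o.processTreeQueue()
		if len(o.addedItemQueue) > 0 {
			// 2a. Settle items (drain the added-item queue, fill the
			// augment queue and the settled-item queue).
			o.processAddedItemQueue()
		} else {
			// 2b. Process items (drain the settled-item queue, fill the
			// augment queue and the tree queue).
			o.processSettledItemQueue()
		}
		// 3. Apply augments (drain the augment queue and maybe the
		// retry-item queue, fill the added-item queue).
		o.processAugmentQueue()
	}
}

// Stub step bodies, so that this skeleton compiles and terminates.
func (o *rebuilder) processTreeQueue()        { o.treeQueue = nil }
func (o *rebuilder) processAddedItemQueue()   { o.addedItemQueue = nil }
func (o *rebuilder) processSettledItemQueue() { o.settledItemQueue = nil }
func (o *rebuilder) processAugmentQueue()     { o.augmentQueue = nil }

func main() {
	(&amp;rebuilder{treeQueue: map[treeID]struct{}{1: {}}}).rebuild()
}</code></pre>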
<p>OK, let's look at those 3 substeps in more detail:</p>
<ol type="1">
<li><p>Crawl the trees; drain the tree queue, fill the added-item
queue.</p>
<p>We just look up the tree in the rebuilt forrest, which (per the above
<code>--rebuild</code> algorithm) will either fail to look up the tree,
or succeed, and add to that tree the root node from the
superblock/root-item. Because we set an item-added callback, when adding
that root it will loop over the nodes added by that root, and call our
callback for each item in one of the added nodes. Our callback inserts
each item into the added-item queue. The forrest also calls our
root-added callback, but because of the way this algorithm works, that
turns out to be a no-op at this step.</p>
<p>I mentioned that we added callbacks to intercept the forrest's
looking up of root items and resolving UUIDs; we override the forrest's
"lookup root item" routine and "resolve UUID" routine so that, instead
of doing normal lookups on the <code>ROOT_TREE</code> and
<code>UUID_TREE</code>, they use the <code>Want<var>XXX</var></code>
routines that we'll define below in the "graph callbacks" section.</p>
<p>It shouldn't matter what order this queue is processed in, but I sort
tree IDs numerically.</p>
<p>The crawling is fairly fast because it's just in-memory; the only
accesses to disk are looking up root items and resolving UUIDs.</p></li>
<li><p>Either:</p>
<ol type="a">
<li><p>Settle items from the added-item queue to the settled-item queue
(and fill the augment queue).</p>
<p>For each item in the queue, we look in the tree's item index to get
the {node, slot} pair for it, then we do the same in the tree's
potential-item index. If the potential-item index contains an entry for
the item's key, then we check if the potential-item's node should "win"
over the queue item's node, deciding the "winner" using the same routine
as when building the item index. If the potential-item's node wins, then
we add the potential node's set of roots to the augment queue. If the
queue-item's node wins, then we add the item to the settled-item queue
(except, as an optimization, if the item is of a type that cannot
possibly imply the existence of another item, then we just drop it and
don't add it to the settled-item queue).</p>
<p>It shouldn't matter what order this queue is processed in, but I sort
it numerically by treeID and then by item key.</p>
<p>This step is fairly fast because it's entirely in-memory, making no
accesses to disk.</p></li>
<li><p>Process items from the settled-item queue (drain the settled-item
queue, fill the augment queue and the tree queue).</p>
<p>This step accesses disk, and so the order we process the queue in
turns out to be pretty important in order to keep our disk access
patterns cache-friendly. For the most part, we just sort each queue item
by tree, then by key. But we have special handling for
<code>EXTENT_ITEM</code>s, <code>METADATA_ITEM</code>s, and
<code>EXTENT_DATA_REF</code> items: we break <code>EXTENT_ITEM</code>s
and <code>METADATA_ITEM</code>s into "sub-items", treating each ref
embedded in them as a separate item. For those embedded items that are
<code>EXTENT_DATA_REF</code>s, and for stand-alone
<code>EXTENT_DATA_REF</code> items, we sort them not with the
<code>EXTENT_TREE</code> items, but with the items of the tree that the
extent data ref points at. Recall that during the initial scan step, we
took note of which tree every extent data ref points at, so we can
perform this sort without accessing disk yet.
This splitting does mean
that we may visit/read an <code>EXTENT_ITEM</code> or
<code>METADATA_ITEM</code> multiple times as we process the queue, but
to do otherwise would be to solve MinLA, which is NP-hard; and I think
that even an optimal MinLA solution would perform worse than this. There
is a reasonably lengthy discussion of this in a comment in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n251"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:sortSettledItemQueue()</code></a>.</p>
<p>Now we loop over that sorted queue. In the code, this loop is
deceptively simple. Read the item, then pass it to a function that tells
us what other items are implied by it. That function is large, but
simple; it's just a giant table. The trick is how it tells us about
implied items: we give it a set of callbacks that it calls to tell us
these things, and the real complexity is in the callbacks. These "graph
callbacks" will be discussed in detail below, but as an illustrative
example: it may call <code>.WantOff()</code> with a tree ID, object ID,
item type, and offset to specify a precise item that it believes should
exist.</p>
<p>If we encounter a <code>ROOT_ITEM</code>, add the tree described by
that item to the tree queue.</p></li>
</ol>
<p>(Both the "can this item even imply the existence of another item"
check and the "what items are implied by this item" routine are in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/btrfscheck/graph.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./lib/btrfscheck/graph.go</code></a>)</p></li>
<li><p>Apply augments; drain the augment queue (and maybe the retry-item
queue), fill the added-item queue.</p>
<p>It is at this point that I call out that the augment queue isn't
implemented as a simple map/set like the others; the
<code>treeAugmentQueue struct</code> has special handling for sets of
different sizes, optimizing the space for empty and len()==1 sized sets,
and falling back to the usual implementation for larger sets. This is
important because those small sets are the overwhelming majority, and
otherwise there's no way the program would be able to run on my 32GB RAM
laptop. Now that I think about it, I bet it would even be worth it to
add optimized storage for len()==2 sized sets.</p>
<p>The reason is that each "want" from above is tracked in the queue
separately; if we were OK merging them, then this optimized storage
wouldn't be necessary. But we keep them separate, so that:</p>
<ul>
<li><p>For all "wants", including ones with empty sets, graph callbacks
can check if a want has already been processed, avoiding re-doing any
work (see the description of the graph callbacks below).</p></li>
<li><p>For "wants" with non-empty sets, we can see how many different
"wants" could be satisfied with a given root, in order to decide which
root to choose.</p></li>
</ul>
<p>Anyway, we loop over the trees in the augment queue. For each tree we
look at that tree's augment queue and look at all the choices of root
nodes to add (below), and decide on a list to add. Then we add each of
those roots to the tree; the adding of each root triggers several calls
to our item-added callback (filling the added-item queue), and our
root-added callback. The root-added callback moves any items from the
retry-item queue for this tree to the added-item queue.</p>
<p>How do we decide between choices of root nodes to add?
<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n528"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild.go:resolveTreeAugments()</code></a>
has a good comment explaining the criteria we'd like to optimize for,
and then code that does an OK-ish job of actually optimizing for that (a
condensed sketch follows this list):</p>
<ul>
<li><p>It loops over the augment queue for that tree, building a list of
possible roots, for each possible root making note of 3 things:</p>
<ol type="a">
<li><p>how many "wants" that root satisfies,</p></li>
<li><p>how far from the tree the root's owner is (owner=tree is a
distance of 0, owner=tree.parent is a distance of 1,
owner=tree.parent.parent is a distance of 2, and so on), and</p></li>
<li><p>what the generation of that root is.</p></li>
</ol></li>
<li><p>We sort that list first by highest-count-first, then by
lowest-distance-first, then by highest-generation-first.</p></li>
<li><p>We create a "return" set and an "illegal" set. We loop over the
sorted list; for each possible root, if it is in the illegal set, we
skip it; otherwise, we insert it into the return set, and for each
"want" that includes this root we add all roots that satisfy that want
to the illegal set.</p></li>
</ul></li>
</ol>
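<p>As referenced above, a condensed sketch of that selection strategy:
sort the candidates by the 3 criteria, then greedily take the best legal
root, marking every root that shares a satisfied "want" with it as
illegal. The types here are stand-ins; the real code tracks wants and
roots differently:</p>
<pre><code>package main

import (
	"fmt"
	"sort"
)

type root uint64

type candidate struct {
	Root       root
	WantCount  int    // (a) how many "wants" this root satisfies
	Distance   int    // (b) 0 = owner is this tree, 1 = the parent, ...
	Generation uint64 // (c) the root's generation
}

// resolve picks which roots to add; wants is the list of "wants", each
// entry being the set of roots that would satisfy that want.
func resolve(cands []candidate, wants [][]root) []root {
	sort.Slice(cands, func(i, j int) bool {
		a, b := cands[i], cands[j]
		if a.WantCount != b.WantCount {
			return a.WantCount > b.WantCount // highest count first
		}
		if a.Distance != b.Distance {
			return a.Distance &lt; b.Distance // lowest distance first
		}
		return a.Generation > b.Generation // highest generation first
	})
	var ret []root
	illegal := make(map[root]bool)
	for _, c := range cands {
		if illegal[c.Root] {
			continue
		}
		ret = append(ret, c.Root)
		// Every root that could satisfy one of the wants that this
		// root satisfies becomes illegal.
		for _, want := range wants {
			satisfied := false
			for _, r := range want {
				if r == c.Root {
					satisfied = true
					break
				}
			}
			if satisfied {
				for _, r := range want {
					illegal[r] = true
				}
			}
		}
	}
	return ret
}

func main() {
	wants := [][]root{{10, 11}, {11}, {12}}
	cands := []candidate{
		{Root: 10, WantCount: 1, Distance: 0, Generation: 5},
		{Root: 11, WantCount: 2, Distance: 1, Generation: 7},
		{Root: 12, WantCount: 1, Distance: 0, Generation: 6},
	}
	fmt.Println(resolve(cands, wants)) // [11 12]; choosing 11 makes 10 illegal
}</code></pre>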
<p>It is important that the rebuilt forest have the flag set so that
it refuses to look up a tree if it can't look up all of that tree's
ancestors; otherwise the potential-items index would be garbage, as we
wouldn't have a good idea of which nodes are OK to consider. But this
does have the downside that it won't even attempt to improve a tree
with a missing parent. Perhaps the algorithm should flip the flag once
the loop terminates, and then re-seed the tree queue with each
<code>ROOT_ITEM</code> from the <code>ROOT_TREE</code>?</p>
<h4 id="graph-callbacks">4.3.3.3. graph callbacks</h4>
<p>(see: <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>./cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go</code></a>)</p>
<p>The graph callbacks are what tie the above together.</p>
<p>For each of these callbacks, whenever I say that it looks up
something in a tree's item index or potential-item index, that implies
looking the tree up from the forest; if the forest cannot look up that
tree, then the callback returns early, after either:</p>
<ul>
<li><p>if we are in substep 1 and are processing a tree: adding the
tree that is being processed to the tree queue. (TODO: Wait, this
assumes that an augment will be applied to the <code>ROOT_TREE</code>
before the next pass... if that isn't the case, this will result in
the loop never terminating... I guess I need to add a separate
retry-tree queue?)</p></li>
<li><p>if we are in substep 2 and are processing an item: adding the
item that is being processed to the retry-item queue for the tree that
cannot be looked up.</p></li>
</ul>
<p>The 6 methods in the <code>btrfscheck.GraphCallbacks</code>
interface are:</p>
<ol type="1">
<li><p><code>FSErr()</code>: There's an error with the filesystem;
this callback just spits it out on stderr. I mention such a trivial
matter because, again, for a recovery tool I think it's worth putting
care into how you handle errors and where you expect them: we expect
them here, so we have to check for them to avoid reading invalid data
or whatever, but we don't actually need to do anything other than
watch our step.</p></li>
<li><p><code>Want()</code>: We want an item in a given tree with a
given object ID and item type, but we don't care about what the item's
offset is.</p>
<p>The callback works by searching the item index to see if it can
find such an item; if so, it has nothing else to do and returns.
Otherwise, it searches the potential-item index; for each matching
item it finds, it looks in the node index for the node containing that
item, and adds the roots that would attach that node to a set. Once it
has finished searching the potential-item index, it adds that set to
the augment queue (even if that set is still empty).</p></li>
<li><p><code>WantOff()</code>: The same, but we want a specific
offset.</p></li>
<li><p><code>WantDirIndex()</code>: We want a <code>DIR_INDEX</code>
item for a given inode and filename, but we don't know what the offset
of that item is.</p>
<p>First we scan over the item index, looking at all
<code>DIR_INDEX</code> items for that inode number. For each item, we
can check the scan data to see what the filename in that
<code>DIR_INDEX</code> is, so we can see whether the item satisfies
this want without accessing the disk. If there's a match, then there
is nothing else to do, so we return. Otherwise, we do that same search
over the potential-item index; if we find any matches, then we build
the set of roots to add to the augment queue the same as in
<code>Want</code>.</p></li>
<li><p><code>WantFileExt()</code>: We want 1 or more
<code>DATA_EXTENT</code> items in the given tree for the given inode,
and we want them to cover bytes 0 through a given size of that
file.</p>
<p>First we walk that range in the item index, to build a list of the
gaps that we need to fill ("Step 1" in <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n260"><code>rebuild_wantcb.go:_wantRange()</code></a>).
This walk (<a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/cmd/btrfs-rec/inspect/rebuildtrees/rebuild_wantcb.go?id=18e6066c241cf3d252b6521150843ffc858d8434#n195"><code>rebuild_wantcb.go:_walkRange()</code></a>)
requires knowing the size of each file extent; being able to do this
quickly, without hitting the disk, is why we recorded the size of each
file extent in our initialization step.</p>
<p>Then ("Step 2" in <code>_wantRange()</code>) we iterate over each
of the gaps, and for each gap do a very similar walk (again, by
calling <code>_walkRange()</code>, but this time over the
potential-item index). For each file extent we find that is entirely
within the gap, we "want" that extent, and move the beginning of the
gap forward to the end of that extent. This algorithm is dumb and
greedy, potentially making sub-optimal selections, and so could
probably stand to be improved; but in my real-world use, it seems to
be "good enough". (There is a sketch of this gap-filling after this
list.)</p></li>
<li><p><code>WantCSum()</code>: We want 1 or more
<code>EXTENT_CSUM</code> items to cover the half-open interval
[<code>lo_logical_addr</code>, <code>hi_logical_addr</code>). Well,
maybe. It also takes a subvolume ID and an inode number, and looks up
in the scan data whether that inode has the
<code>INODE_NODATASUM</code> flag set; if it does have the flag set,
then it returns early without looking for any <code>EXTENT_CSUM</code>
items. If it doesn't return early, then it performs the same
want-range routine as <code>WantFileExt</code>, but with the
appropriate tree, object ID, and item types for csums as opposed to
data extents.</p></li>
</ol>
<p>For each of these callbacks, we generate a "wantKey", a tuple
representing the function and its arguments; we check the
augment-queue to see if we've already enqueued a set of roots for that
want, and if so, that callback can return early without checking the
potential-item index.</p>
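<p>A Go sketch of that "Step 1"/"Step 2" gap-filling (simplified to
plain offset/size pairs; the real <code>_wantRange()</code> operates
on the indexes described above, and these names are made up for
illustration):</p>
<pre><code>package main

import "fmt"

type extent struct{ off, size int64 }

func (e extent) end() int64 { return e.off + e.size }

// gaps returns the sub-ranges of [0, size) that are not covered by
// the sorted, non-overlapping extents already in the item index
// ("Step 1").
func gaps(have []extent, size int64) []extent {
	var out []extent
	pos := int64(0)
	for _, e := range have {
		if e.off > pos {
			out = append(out, extent{pos, e.off - pos})
		}
		pos = e.end()
	}
	if size > pos {
		out = append(out, extent{pos, size - pos})
	}
	return out
}

// fillGap greedily "wants" candidate extents (sorted by offset, from
// the potential-item index) that lie entirely within the gap, moving
// the beginning of the gap past each selection ("Step 2").
func fillGap(gap extent, candidates []extent) []extent {
	var picked []extent
	pos := gap.off
	for _, c := range candidates {
		if c.off >= pos {
			if gap.end() >= c.end() {
				picked = append(picked, c)
				pos = c.end()
			}
		}
	}
	return picked
}

func main() {
	have := []extent{{0, 4096}, {8192, 4096}}
	for _, gap := range gaps(have, 16384) {
		fmt.Println(gap, "filled by",
			fillGap(gap, []extent{{4096, 4096}, {12288, 2048}}))
	}
}</code></pre>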
<h1 id="future-work">5. Future work</h1>
<p>It's in a reasonably useful place, I think; and so now I'm going to
take a break from it for a while. But there's still lots of work to
do:</p>
<ul>
<li><p>RAID almost certainly doesn't work.</p></li>
<li><p>Encryption is not implemented.</p></li>
<li><p>It doesn't understand (ignores) the <code>TREE_LOG</code>
(because I don't understand the <code>TREE_LOG</code>).</p></li>
<li><p><code>btrfs-rec inspect mount</code> should add "lost+found"
directories for inodes that are included in the subvolume's tree but
aren't reachable from the tree's root inode.</p></li>
<li><p>I still need to implement <code>btrfs-rec repair
<var>SUBCMD</var></code> subcommands to write rebuilt information from
<code>btrfs-rec inspect</code> back to the filesystem.</p></li>
<li><p>I need to figure out the error handling/reporting story for
<code>mount</code>.</p></li>
<li><p>It needs a lot more tests.</p>
<ul>
<li>I'd like to get the existing btrfs-progs fsck tests to run on
it.</li>
</ul></li>
<li><p>In the process of writing this email, I realized that I
probably need to add a retry-tree queue; see the "graph callbacks"
section in the description of the <code>rebuild-trees</code> algorithm
above.</p></li>
<li><p>There are a number of "TODO" comments or panics in the
code:</p>
<ul>
<li><p>Some of them definitely need to be done.</p></li>
<li><p>Some of them are <code>panic("TODO")</code> on the basis that
if it's seeing something on the filesystem that it doesn't recognize,
it's probably that I didn't get around to implementing that
thing/situation, but it's possible that the thing is just corrupt.
This should only be for situations where the node passed the checksum
test, so it being corrupt would have to be caused by a bug in btrfs
rather than by a failing drive or other corruption; I wasn't too
worried about btrfs bugs.</p></li>
</ul></li>
<li><p><code>btrfs-rec inspect rebuild-trees</code> is slow, and can
probably be made a lot faster.</p>
<p>Just to give you an idea of the speeds, the run-times for the
various steps on my ThinkPad E15 for a 256GB disk image are as
follows:</p>
<pre><code> btrfs-rec inspect rebuild-mappings scan        :  7m 31s
 btrfs-rec inspect rebuild-mappings list-nodes  :     47s
 btrfs-rec inspect rebuild-mappings process     :  8m 22s
 btrfs-rec inspect rebuild-trees                : 1h 4m 55s
 btrfs-rec inspect ls-files                     : 29m 55s
 btrfs-rec inspect ls-trees                     :  8m 40s</code></pre>
<p>For the most part, it's all single-threaded (with the main
exception that in several places I/O has been moved to a separate
thread from the main CPU-heavy thread), but a lot of the algorithms
could be parallelized (a sketch of that I/O-vs-CPU split follows this
list).</p></li>
<li><p>There are a lot of "tunable" values that I haven't really spent
time tuning. These are all annotated with <a
href="https://git.lukeshu.com/btrfs-progs-ng/tree/lib/textui/tunable.go?id=18e6066c241cf3d252b6521150843ffc858d8434"><code>textui.Tunable()</code></a>.
I sort-of intended for them to be adjustable on the CLI.</p></li>
<li><p>Perhaps the <code>btrfs inspect rebuild-trees</code> algorithm
could be adjusted to also try to rebuild trees with missing parents;
see the above discussion of the algorithm.</p></li>
</ul>
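<p>For the curious, that I/O-vs-CPU split is essentially the standard
Go pipeline pattern; a toy sketch (illustrative only, with made-up
stand-ins for the real read and process routines):</p>
<pre><code>package main

import "fmt"

func readNode(addr int) []byte { // stand-in for the real disk I/O
	return []byte{byte(addr)}
}

func process(node []byte) { // stand-in for the CPU-heavy work
	fmt.Println("processed node", node)
}

func main() {
	nodes := make(chan []byte, 64) // buffered, so I/O can run ahead

	// I/O goroutine: does the blocking reads and feeds the channel.
	go func() {
		defer close(nodes)
		for addr := 0; addr != 3; addr++ {
			nodes &lt;- readNode(addr)
		}
	}()

	// CPU-heavy loop; parallelizing would mean fanning this side
	// out across several goroutines.
	for node := range nodes {
		process(node)
	}
}</code></pre>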
<h1 id="problems-for-merging-this-code-into-btrfs-progs">6. Problems
for merging this code into btrfs-progs</h1>
<ul>
<li><p>It's written in Go, not C.</p></li>
<li><p>It's effectively GPLv3+ (not GPLv2-only or GPLv2+) because of
the use of some code under the Apache 2.0 license (2 files in the
codebase itself that are based off of Apache-licensed code, and the
use of unmodified 3rd-party libraries).</p></li>
<li><p>It uses ARC (Adaptive Replacement Cache), which is patented by
IBM, and the patent doesn't expire for another 7 months. An important
property of ARC over LRU is that it is scan-resistant; the above
algorithms do a lot of scanning. On that note, now that Red Hat is
owned by IBM: who in the company do we need to get to talk to each
other so that we can get ARC into the Linux kernel before
then?</p></li>
</ul>
<div style="font-family: monospace">
<p>-- <br/> Happy hacking,<br/> ~ Luke Shumaker<br/></p>
</div>

</article>
<footer>
<p>The content of this page is Copyright © 2023 <a href="mailto:lukeshu@sbcglobal.net">Luke Shumaker</a>.</p>
<p>This page is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC BY-SA-3.0</a> license.</p>
</footer>
</body>
</html>