summaryrefslogtreecommitdiff
path: root/Documentation/filesystems/aufs/design/02struct.txt
blob: b53a9778b2b706fb2c093f1d463ee88dcfce10e2 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258

# Copyright (C) 2005-2015 Junjiro R. Okajima
# 
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; either version 2 of the License, or
# (at your option) any later version.
# 
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
# 
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

Basic Aufs Internal Structure

Superblock/Inode/Dentry/File Objects
----------------------------------------------------------------------
As like an ordinary filesystem, aufs has its own
superblock/inode/dentry/file objects. All these objects have a
dynamically allocated array and store the same kind of pointers to the
lower filesystem, branch.
For example, when you build a union with one readwrite branch and one
readonly, mounted /au, /rw and /ro respectively.
- /au = /rw + /ro
- /ro/fileA exists but /rw/fileA

Aufs lookup operation finds /ro/fileA and gets dentry for that. These
pointers are stored in a aufs dentry. The array in aufs dentry will be,
- [0] = NULL (because /rw/fileA doesn't exist)
- [1] = /ro/fileA

This style of an array is essentially same to the aufs
superblock/inode/dentry/file objects.

Because aufs supports manipulating branches, ie. add/delete/change
branches dynamically, these objects has its own generation. When
branches are changed, the generation in aufs superblock is
incremented. And a generation in other object are compared when it is
accessed. When a generation in other objects are obsoleted, aufs
refreshes the internal array.


Superblock
----------------------------------------------------------------------
Additionally aufs superblock has some data for policies to select one
among multiple writable branches, XIB files, pseudo-links and kobject.
See below in detail.
About the policies which supports copy-down a directory, see
wbr_policy.txt too.


Branch and XINO(External Inode Number Translation Table)
----------------------------------------------------------------------
Every branch has its own xino (external inode number translation table)
file. The xino file is created and unlinked by aufs internally. When two
members of a union exist on the same filesystem, they share the single
xino file.
The struct of a xino file is simple, just a sequence of aufs inode
numbers which is indexed by the lower inode number.
In the above sample, assume the inode number of /ro/fileA is i111 and
aufs assigns the inode number i999 for fileA. Then aufs writes 999 as
4(8) bytes at 111 * 4(8) bytes offset in the xino file.

When the inode numbers are not contiguous, the xino file will be sparse
which has a hole in it and doesn't consume as much disk space as it
might appear. If your branch filesystem consumes disk space for such
holes, then you should specify 'xino=' option at mounting aufs.

Aufs has a mount option to free the disk blocks for such holes in XINO
files on tmpfs or ramdisk. But it is not so effective actually. If you
meet a problem of disk shortage due to XINO files, then you should try
"tmpfs-ino.patch" (and "vfs-ino.patch" too) in aufs4-standalone.git.
The patch localizes the assignment inumbers per tmpfs-mount and avoid
the holes in XINO files.

Also a writable branch has three kinds of "whiteout bases". All these
are existed when the branch is joined to aufs, and their names are
whiteout-ed doubly, so that users will never see their names in aufs
hierarchy.
1. a regular file which will be hardlinked to all whiteouts.
2. a directory to store a pseudo-link.
3. a directory to store an "orphan"-ed file temporary.

1. Whiteout Base
   When you remove a file on a readonly branch, aufs handles it as a
   logical deletion and creates a whiteout on the upper writable branch
   as a hardlink of this file in order not to consume inode on the
   writable branch.
2. Pseudo-link Dir
   See below, Pseudo-link.
3. Step-Parent Dir
   When "fileC" exists on the lower readonly branch only and it is
   opened and removed with its parent dir, and then user writes
   something into it, then aufs copies-up fileC to this
   directory. Because there is no other dir to store fileC. After
   creating a file under this dir, the file is unlinked.

Because aufs supports manipulating branches, ie. add/delete/change
dynamically, a branch has its own id. When the branch order changes,
aufs finds the new index by searching the branch id.


Pseudo-link
----------------------------------------------------------------------
Assume "fileA" exists on the lower readonly branch only and it is
hardlinked to "fileB" on the branch. When you write something to fileA,
aufs copies-up it to the upper writable branch. Additionally aufs
creates a hardlink under the Pseudo-link Directory of the writable
branch. The inode of a pseudo-link is kept in aufs super_block as a
simple list. If fileB is read after unlinking fileA, aufs returns
filedata from the pseudo-link instead of the lower readonly
branch. Because the pseudo-link is based upon the inode, to keep the
inode number by xino (see above) is essentially necessary.

All the hardlinks under the Pseudo-link Directory of the writable branch
should be restored in a proper location later. Aufs provides a utility
to do this. The userspace helpers executed at remounting and unmounting
aufs by default.
During this utility is running, it puts aufs into the pseudo-link
maintenance mode. In this mode, only the process which began the
maintenance mode (and its child processes) is allowed to operate in
aufs. Some other processes which are not related to the pseudo-link will
be allowed to run too, but the rest have to return an error or wait
until the maintenance mode ends. If a process already acquires an inode
mutex (in VFS), it has to return an error.


XIB(external inode number bitmap)
----------------------------------------------------------------------
Addition to the xino file per a branch, aufs has an external inode number
bitmap in a superblock object. It is also an internal file such like a
xino file.
It is a simple bitmap to mark whether the aufs inode number is in-use or
not.
To reduce the file I/O, aufs prepares a single memory page to cache xib.

As well as XINO files, aufs has a feature to truncate/refresh XIB to
reduce the number of consumed disk blocks for these files.


Virtual or Vertical Dir, and Readdir in Userspace
----------------------------------------------------------------------
In order to support multiple layers (branches), aufs readdir operation
constructs a virtual dir block on memory. For readdir, aufs calls
vfs_readdir() internally for each dir on branches, merges their entries
with eliminating the whiteout-ed ones, and sets it to file (dir)
object. So the file object has its entry list until it is closed. The
entry list will be updated when the file position is zero and becomes
obsoleted. This decision is made in aufs automatically.

The dynamically allocated memory block for the name of entries has a
unit of 512 bytes (by default) and stores the names contiguously (no
padding). Another block for each entry is handled by kmem_cache too.
During building dir blocks, aufs creates hash list and judging whether
the entry is whiteouted by its upper branch or already listed.
The merged result is cached in the corresponding inode object and
maintained by a customizable life-time option.

Some people may call it can be a security hole or invite DoS attack
since the opened and once readdir-ed dir (file object) holds its entry
list and becomes a pressure for system memory. But I'd say it is similar
to files under /proc or /sys. The virtual files in them also holds a
memory page (generally) while they are opened. When an idea to reduce
memory for them is introduced, it will be applied to aufs too.
For those who really hate this situation, I've developed readdir(3)
library which operates this merging in userspace. You just need to set
LD_PRELOAD environment variable, and aufs will not consume no memory in
kernel space for readdir(3).


Workqueue
----------------------------------------------------------------------
Aufs sometimes requires privilege access to a branch. For instance,
in copy-up/down operation. When a user process is going to make changes
to a file which exists in the lower readonly branch only, and the mode
of one of ancestor directories may not be writable by a user
process. Here aufs copy-up the file with its ancestors and they may
require privilege to set its owner/group/mode/etc.
This is a typical case of a application character of aufs (see
Introduction).

Aufs uses workqueue synchronously for this case. It creates its own
workqueue. The workqueue is a kernel thread and has privilege. Aufs
passes the request to call mkdir or write (for example), and wait for
its completion. This approach solves a problem of a signal handler
simply.
If aufs didn't adopt the workqueue and changed the privilege of the
process, then the process may receive the unexpected SIGXFSZ or other
signals.

Also aufs uses the system global workqueue ("events" kernel thread) too
for asynchronous tasks, such like handling inotify/fsnotify, re-creating a
whiteout base and etc. This is unrelated to a privilege.
Most of aufs operation tries acquiring a rw_semaphore for aufs
superblock at the beginning, at the same time waits for the completion
of all queued asynchronous tasks.


Whiteout
----------------------------------------------------------------------
The whiteout in aufs is very similar to Unionfs's. That is represented
by its filename. UnionMount takes an approach of a file mode, but I am
afraid several utilities (find(1) or something) will have to support it.

Basically the whiteout represents "logical deletion" which stops aufs to
lookup further, but also it represents "dir is opaque" which also stop
further lookup.

In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively.
In order to make several functions in a single systemcall to be
revertible, aufs adopts an approach to rename a directory to a temporary
unique whiteouted name.
For example, in rename(2) dir where the target dir already existed, aufs
renames the target dir to a temporary unique whiteouted name before the
actual rename on a branch, and then handles other actions (make it opaque,
update the attributes, etc). If an error happens in these actions, aufs
simply renames the whiteouted name back and returns an error. If all are
succeeded, aufs registers a function to remove the whiteouted unique
temporary name completely and asynchronously to the system global
workqueue.


Copy-up
----------------------------------------------------------------------
It is a well-known feature or concept.
When user modifies a file on a readonly branch, aufs operate "copy-up"
internally and makes change to the new file on the upper writable branch.
When the trigger systemcall does not update the timestamps of the parent
dir, aufs reverts it after copy-up.


Move-down (aufs3.9 and later)
----------------------------------------------------------------------
"Copy-up" is one of the essential feature in aufs. It copies a file from
the lower readonly branch to the upper writable branch when a user
changes something about the file.
"Move-down" is an opposite action of copy-up. Basically this action is
ran manually instead of automatically and internally.
For desgin and implementation, aufs has to consider these issues.
- whiteout for the file may exist on the lower branch.
- ancestor directories may not exist on the lower branch.
- diropq for the ancestor directories may exist on the upper branch.
- free space on the lower branch will reduce.
- another access to the file may happen during moving-down, including
  UDBA (see "Revalidate Dentry and UDBA").
- the file should not be hard-linked nor pseudo-linked. they should be
  handled by auplink utility later.

Sometimes users want to move-down a file from the upper writable branch
to the lower readonly or writable branch. For instance,
- the free space of the upper writable branch is going to run out.
- create a new intermediate branch between the upper and lower branch.
- etc.

For this purpose, use "aumvdown" command in aufs-util.git.