From 03dd4cb26d967f9588437b0fc9cc0e8353322bb7 Mon Sep 17 00:00:00 2001 From: André Fabian Silva Delgado Date: Fri, 25 Mar 2016 03:53:42 -0300 Subject: Linux-libre 4.5-gnu --- Documentation/filesystems/aufs/design/01intro.txt | 157 ------------- Documentation/filesystems/aufs/design/02struct.txt | 245 --------------------- .../filesystems/aufs/design/03atomic_open.txt | 72 ------ Documentation/filesystems/aufs/design/03lookup.txt | 100 --------- Documentation/filesystems/aufs/design/04branch.txt | 61 ----- .../filesystems/aufs/design/05wbr_policy.txt | 51 ----- Documentation/filesystems/aufs/design/06fhsm.txt | 105 --------- Documentation/filesystems/aufs/design/06mmap.txt | 59 ----- Documentation/filesystems/aufs/design/06xattr.txt | 81 ------- Documentation/filesystems/aufs/design/07export.txt | 45 ---- Documentation/filesystems/aufs/design/08shwh.txt | 39 ---- Documentation/filesystems/aufs/design/10dynop.txt | 34 --- 12 files changed, 1049 deletions(-) delete mode 100644 Documentation/filesystems/aufs/design/01intro.txt delete mode 100644 Documentation/filesystems/aufs/design/02struct.txt delete mode 100644 Documentation/filesystems/aufs/design/03atomic_open.txt delete mode 100644 Documentation/filesystems/aufs/design/03lookup.txt delete mode 100644 Documentation/filesystems/aufs/design/04branch.txt delete mode 100644 Documentation/filesystems/aufs/design/05wbr_policy.txt delete mode 100644 Documentation/filesystems/aufs/design/06fhsm.txt delete mode 100644 Documentation/filesystems/aufs/design/06mmap.txt delete mode 100644 Documentation/filesystems/aufs/design/06xattr.txt delete mode 100644 Documentation/filesystems/aufs/design/07export.txt delete mode 100644 Documentation/filesystems/aufs/design/08shwh.txt delete mode 100644 Documentation/filesystems/aufs/design/10dynop.txt (limited to 'Documentation/filesystems/aufs/design') diff --git a/Documentation/filesystems/aufs/design/01intro.txt b/Documentation/filesystems/aufs/design/01intro.txt deleted file mode 100644 index 5d0121439..000000000 --- a/Documentation/filesystems/aufs/design/01intro.txt +++ /dev/null @@ -1,157 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -Introduction ----------------------------------------- - -aufs [ei ju: ef es] | [a u f s] -1. abbrev. for "advanced multi-layered unification filesystem". -2. abbrev. for "another unionfs". -3. abbrev. for "auf das" in German which means "on the" in English. - Ex. "Butter aufs Brot"(G) means "butter onto bread"(E). - But "Filesystem aufs Filesystem" is hard to understand. - -AUFS is a filesystem with features: -- multi layered stackable unification filesystem, the member directory - is called as a branch. -- branch permission and attribute, 'readonly', 'real-readonly', - 'readwrite', 'whiteout-able', 'link-able whiteout', etc. and their - combination. -- internal "file copy-on-write". -- logical deletion, whiteout. -- dynamic branch manipulation, adding, deleting and changing permission. -- allow bypassing aufs, user's direct branch access. -- external inode number translation table and bitmap which maintains the - persistent aufs inode number. -- seekable directory, including NFS readdir. -- file mapping, mmap and sharing pages. -- pseudo-link, hardlink over branches. -- loopback mounted filesystem as a branch. -- several policies to select one among multiple writable branches. -- revert a single systemcall when an error occurs in aufs. -- and more... - - -Multi Layered Stackable Unification Filesystem ----------------------------------------------------------------------- -Most people already knows what it is. -It is a filesystem which unifies several directories and provides a -merged single directory. When users access a file, the access will be -passed/re-directed/converted (sorry, I am not sure which English word is -correct) to the real file on the member filesystem. The member -filesystem is called 'lower filesystem' or 'branch' and has a mode -'readonly' and 'readwrite.' And the deletion for a file on the lower -readonly branch is handled by creating 'whiteout' on the upper writable -branch. - -On LKML, there have been discussions about UnionMount (Jan Blunck, -Bharata B Rao and Valerie Aurora) and Unionfs (Erez Zadok). They took -different approaches to implement the merged-view. -The former tries putting it into VFS, and the latter implements as a -separate filesystem. -(If I misunderstand about these implementations, please let me know and -I shall correct it. Because it is a long time ago when I read their -source files last time). - -UnionMount's approach will be able to small, but may be hard to share -branches between several UnionMount since the whiteout in it is -implemented in the inode on branch filesystem and always -shared. According to Bharata's post, readdir does not seems to be -finished yet. -There are several missing features known in this implementations such as -- for users, the inode number may change silently. eg. copy-up. -- link(2) may break by copy-up. -- read(2) may get an obsoleted filedata (fstat(2) too). -- fcntl(F_SETLK) may be broken by copy-up. -- unnecessary copy-up may happen, for example mmap(MAP_PRIVATE) after - open(O_RDWR). - -In linux-3.18, "overlay" filesystem (formerly known as "overlayfs") was -merged into mainline. This is another implementation of UnionMount as a -separated filesystem. All the limitations and known problems which -UnionMount are equally inherited to "overlay" filesystem. - -Unionfs has a longer history. When I started implementing a stackable -filesystem (Aug 2005), it already existed. It has virtual super_block, -inode, dentry and file objects and they have an array pointing lower -same kind objects. After contributing many patches for Unionfs, I -re-started my project AUFS (Jun 2006). - -In AUFS, the structure of filesystem resembles to Unionfs, but I -implemented my own ideas, approaches and enhancements and it became -totally different one. - -Comparing DM snapshot and fs based implementation -- the number of bytes to be copied between devices is much smaller. -- the type of filesystem must be one and only. -- the fs must be writable, no readonly fs, even for the lower original - device. so the compression fs will not be usable. but if we use - loopback mount, we may address this issue. - for instance, - mount /cdrom/squashfs.img /sq - losetup /sq/ext2.img - losetup /somewhere/cow - dmsetup "snapshot /dev/loop0 /dev/loop1 ..." -- it will be difficult (or needs more operations) to extract the - difference between the original device and COW. -- DM snapshot-merge may help a lot when users try merging. in the - fs-layer union, users will use rsync(1). - -You may want to read my old paper "Filesystems in LiveCD" -(http://aufs.sourceforge.net/aufs2/report/sq/sq.pdf). - - -Several characters/aspects/persona of aufs ----------------------------------------------------------------------- - -Aufs has several characters, aspects or persona. -1. a filesystem, callee of VFS helper -2. sub-VFS, caller of VFS helper for branches -3. a virtual filesystem which maintains persistent inode number -4. reader/writer of files on branches such like an application - -1. Callee of VFS Helper -As an ordinary linux filesystem, aufs is a callee of VFS. For instance, -unlink(2) from an application reaches sys_unlink() kernel function and -then vfs_unlink() is called. vfs_unlink() is one of VFS helper and it -calls filesystem specific unlink operation. Actually aufs implements the -unlink operation but it behaves like a redirector. - -2. Caller of VFS Helper for Branches -aufs_unlink() passes the unlink request to the branch filesystem as if -it were called from VFS. So the called unlink operation of the branch -filesystem acts as usual. As a caller of VFS helper, aufs should handle -every necessary pre/post operation for the branch filesystem. -- acquire the lock for the parent dir on a branch -- lookup in a branch -- revalidate dentry on a branch -- mnt_want_write() for a branch -- vfs_unlink() for a branch -- mnt_drop_write() for a branch -- release the lock on a branch - -3. Persistent Inode Number -One of the most important issue for a filesystem is to maintain inode -numbers. This is particularly important to support exporting a -filesystem via NFS. Aufs is a virtual filesystem which doesn't have a -backend block device for its own. But some storage is necessary to -keep and maintain the inode numbers. It may be a large space and may not -suit to keep in memory. Aufs rents some space from its first writable -branch filesystem (by default) and creates file(s) on it. These files -are created by aufs internally and removed soon (currently) keeping -opened. -Note: Because these files are removed, they are totally gone after - unmounting aufs. It means the inode numbers are not persistent - across unmount or reboot. I have a plan to make them really - persistent which will be important for aufs on NFS server. - -4. Read/Write Files Internally (copy-on-write) -Because a branch can be readonly, when you write a file on it, aufs will -"copy-up" it to the upper writable branch internally. And then write the -originally requested thing to the file. Generally kernel doesn't -open/read/write file actively. In aufs, even a single write may cause a -internal "file copy". This behaviour is very similar to cp(1) command. - -Some people may think it is better to pass such work to user space -helper, instead of doing in kernel space. Actually I am still thinking -about it. But currently I have implemented it in kernel space. diff --git a/Documentation/filesystems/aufs/design/02struct.txt b/Documentation/filesystems/aufs/design/02struct.txt deleted file mode 100644 index 783328a75..000000000 --- a/Documentation/filesystems/aufs/design/02struct.txt +++ /dev/null @@ -1,245 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -Basic Aufs Internal Structure - -Superblock/Inode/Dentry/File Objects ----------------------------------------------------------------------- -As like an ordinary filesystem, aufs has its own -superblock/inode/dentry/file objects. All these objects have a -dynamically allocated array and store the same kind of pointers to the -lower filesystem, branch. -For example, when you build a union with one readwrite branch and one -readonly, mounted /au, /rw and /ro respectively. -- /au = /rw + /ro -- /ro/fileA exists but /rw/fileA - -Aufs lookup operation finds /ro/fileA and gets dentry for that. These -pointers are stored in a aufs dentry. The array in aufs dentry will be, -- [0] = NULL (because /rw/fileA doesn't exist) -- [1] = /ro/fileA - -This style of an array is essentially same to the aufs -superblock/inode/dentry/file objects. - -Because aufs supports manipulating branches, ie. add/delete/change -branches dynamically, these objects has its own generation. When -branches are changed, the generation in aufs superblock is -incremented. And a generation in other object are compared when it is -accessed. When a generation in other objects are obsoleted, aufs -refreshes the internal array. - - -Superblock ----------------------------------------------------------------------- -Additionally aufs superblock has some data for policies to select one -among multiple writable branches, XIB files, pseudo-links and kobject. -See below in detail. -About the policies which supports copy-down a directory, see -wbr_policy.txt too. - - -Branch and XINO(External Inode Number Translation Table) ----------------------------------------------------------------------- -Every branch has its own xino (external inode number translation table) -file. The xino file is created and unlinked by aufs internally. When two -members of a union exist on the same filesystem, they share the single -xino file. -The struct of a xino file is simple, just a sequence of aufs inode -numbers which is indexed by the lower inode number. -In the above sample, assume the inode number of /ro/fileA is i111 and -aufs assigns the inode number i999 for fileA. Then aufs writes 999 as -4(8) bytes at 111 * 4(8) bytes offset in the xino file. - -When the inode numbers are not contiguous, the xino file will be sparse -which has a hole in it and doesn't consume as much disk space as it -might appear. If your branch filesystem consumes disk space for such -holes, then you should specify 'xino=' option at mounting aufs. - -Aufs has a mount option to free the disk blocks for such holes in XINO -files on tmpfs or ramdisk. But it is not so effective actually. If you -meet a problem of disk shortage due to XINO files, then you should try -"tmpfs-ino.patch" (and "vfs-ino.patch" too) in aufs4-standalone.git. -The patch localizes the assignment inumbers per tmpfs-mount and avoid -the holes in XINO files. - -Also a writable branch has three kinds of "whiteout bases". All these -are existed when the branch is joined to aufs, and their names are -whiteout-ed doubly, so that users will never see their names in aufs -hierarchy. -1. a regular file which will be hardlinked to all whiteouts. -2. a directory to store a pseudo-link. -3. a directory to store an "orphan"-ed file temporary. - -1. Whiteout Base - When you remove a file on a readonly branch, aufs handles it as a - logical deletion and creates a whiteout on the upper writable branch - as a hardlink of this file in order not to consume inode on the - writable branch. -2. Pseudo-link Dir - See below, Pseudo-link. -3. Step-Parent Dir - When "fileC" exists on the lower readonly branch only and it is - opened and removed with its parent dir, and then user writes - something into it, then aufs copies-up fileC to this - directory. Because there is no other dir to store fileC. After - creating a file under this dir, the file is unlinked. - -Because aufs supports manipulating branches, ie. add/delete/change -dynamically, a branch has its own id. When the branch order changes, -aufs finds the new index by searching the branch id. - - -Pseudo-link ----------------------------------------------------------------------- -Assume "fileA" exists on the lower readonly branch only and it is -hardlinked to "fileB" on the branch. When you write something to fileA, -aufs copies-up it to the upper writable branch. Additionally aufs -creates a hardlink under the Pseudo-link Directory of the writable -branch. The inode of a pseudo-link is kept in aufs super_block as a -simple list. If fileB is read after unlinking fileA, aufs returns -filedata from the pseudo-link instead of the lower readonly -branch. Because the pseudo-link is based upon the inode, to keep the -inode number by xino (see above) is essentially necessary. - -All the hardlinks under the Pseudo-link Directory of the writable branch -should be restored in a proper location later. Aufs provides a utility -to do this. The userspace helpers executed at remounting and unmounting -aufs by default. -During this utility is running, it puts aufs into the pseudo-link -maintenance mode. In this mode, only the process which began the -maintenance mode (and its child processes) is allowed to operate in -aufs. Some other processes which are not related to the pseudo-link will -be allowed to run too, but the rest have to return an error or wait -until the maintenance mode ends. If a process already acquires an inode -mutex (in VFS), it has to return an error. - - -XIB(external inode number bitmap) ----------------------------------------------------------------------- -Addition to the xino file per a branch, aufs has an external inode number -bitmap in a superblock object. It is also an internal file such like a -xino file. -It is a simple bitmap to mark whether the aufs inode number is in-use or -not. -To reduce the file I/O, aufs prepares a single memory page to cache xib. - -As well as XINO files, aufs has a feature to truncate/refresh XIB to -reduce the number of consumed disk blocks for these files. - - -Virtual or Vertical Dir, and Readdir in Userspace ----------------------------------------------------------------------- -In order to support multiple layers (branches), aufs readdir operation -constructs a virtual dir block on memory. For readdir, aufs calls -vfs_readdir() internally for each dir on branches, merges their entries -with eliminating the whiteout-ed ones, and sets it to file (dir) -object. So the file object has its entry list until it is closed. The -entry list will be updated when the file position is zero and becomes -obsoleted. This decision is made in aufs automatically. - -The dynamically allocated memory block for the name of entries has a -unit of 512 bytes (by default) and stores the names contiguously (no -padding). Another block for each entry is handled by kmem_cache too. -During building dir blocks, aufs creates hash list and judging whether -the entry is whiteouted by its upper branch or already listed. -The merged result is cached in the corresponding inode object and -maintained by a customizable life-time option. - -Some people may call it can be a security hole or invite DoS attack -since the opened and once readdir-ed dir (file object) holds its entry -list and becomes a pressure for system memory. But I'd say it is similar -to files under /proc or /sys. The virtual files in them also holds a -memory page (generally) while they are opened. When an idea to reduce -memory for them is introduced, it will be applied to aufs too. -For those who really hate this situation, I've developed readdir(3) -library which operates this merging in userspace. You just need to set -LD_PRELOAD environment variable, and aufs will not consume no memory in -kernel space for readdir(3). - - -Workqueue ----------------------------------------------------------------------- -Aufs sometimes requires privilege access to a branch. For instance, -in copy-up/down operation. When a user process is going to make changes -to a file which exists in the lower readonly branch only, and the mode -of one of ancestor directories may not be writable by a user -process. Here aufs copy-up the file with its ancestors and they may -require privilege to set its owner/group/mode/etc. -This is a typical case of a application character of aufs (see -Introduction). - -Aufs uses workqueue synchronously for this case. It creates its own -workqueue. The workqueue is a kernel thread and has privilege. Aufs -passes the request to call mkdir or write (for example), and wait for -its completion. This approach solves a problem of a signal handler -simply. -If aufs didn't adopt the workqueue and changed the privilege of the -process, then the process may receive the unexpected SIGXFSZ or other -signals. - -Also aufs uses the system global workqueue ("events" kernel thread) too -for asynchronous tasks, such like handling inotify/fsnotify, re-creating a -whiteout base and etc. This is unrelated to a privilege. -Most of aufs operation tries acquiring a rw_semaphore for aufs -superblock at the beginning, at the same time waits for the completion -of all queued asynchronous tasks. - - -Whiteout ----------------------------------------------------------------------- -The whiteout in aufs is very similar to Unionfs's. That is represented -by its filename. UnionMount takes an approach of a file mode, but I am -afraid several utilities (find(1) or something) will have to support it. - -Basically the whiteout represents "logical deletion" which stops aufs to -lookup further, but also it represents "dir is opaque" which also stop -further lookup. - -In aufs, rmdir(2) and rename(2) for dir uses whiteout alternatively. -In order to make several functions in a single systemcall to be -revertible, aufs adopts an approach to rename a directory to a temporary -unique whiteouted name. -For example, in rename(2) dir where the target dir already existed, aufs -renames the target dir to a temporary unique whiteouted name before the -actual rename on a branch, and then handles other actions (make it opaque, -update the attributes, etc). If an error happens in these actions, aufs -simply renames the whiteouted name back and returns an error. If all are -succeeded, aufs registers a function to remove the whiteouted unique -temporary name completely and asynchronously to the system global -workqueue. - - -Copy-up ----------------------------------------------------------------------- -It is a well-known feature or concept. -When user modifies a file on a readonly branch, aufs operate "copy-up" -internally and makes change to the new file on the upper writable branch. -When the trigger systemcall does not update the timestamps of the parent -dir, aufs reverts it after copy-up. - - -Move-down (aufs3.9 and later) ----------------------------------------------------------------------- -"Copy-up" is one of the essential feature in aufs. It copies a file from -the lower readonly branch to the upper writable branch when a user -changes something about the file. -"Move-down" is an opposite action of copy-up. Basically this action is -ran manually instead of automatically and internally. -For desgin and implementation, aufs has to consider these issues. -- whiteout for the file may exist on the lower branch. -- ancestor directories may not exist on the lower branch. -- diropq for the ancestor directories may exist on the upper branch. -- free space on the lower branch will reduce. -- another access to the file may happen during moving-down, including - UDBA (see "Revalidate Dentry and UDBA"). -- the file should not be hard-linked nor pseudo-linked. they should be - handled by auplink utility later. - -Sometimes users want to move-down a file from the upper writable branch -to the lower readonly or writable branch. For instance, -- the free space of the upper writable branch is going to run out. -- create a new intermediate branch between the upper and lower branch. -- etc. - -For this purpose, use "aumvdown" command in aufs-util.git. diff --git a/Documentation/filesystems/aufs/design/03atomic_open.txt b/Documentation/filesystems/aufs/design/03atomic_open.txt deleted file mode 100644 index 741ad6d66..000000000 --- a/Documentation/filesystems/aufs/design/03atomic_open.txt +++ /dev/null @@ -1,72 +0,0 @@ - -# Copyright (C) 2015-2016 Junjiro R. Okajima - -Support for a branch who has its ->atomic_open() ----------------------------------------------------------------------- -The filesystems who implement its ->atomic_open() are not majority. For -example NFSv4 does, and aufs should call NFSv4 ->atomic_open, -particularly for open(O_CREAT|O_EXCL, 0400) case. Other than -->atomic_open(), NFSv4 returns an error for this open(2). While I am not -sure whether all filesystems who have ->atomic_open() behave like this, -but NFSv4 surely returns the error. - -In order to support ->atomic_open() for aufs, there are a few -approaches. - -A. Introduce aufs_atomic_open() - - calls one of VFS:do_last(), lookup_open() or atomic_open() for - branch fs. -B. Introduce aufs_atomic_open() calling create, open and chmod. this is - an aufs user Pip Cet's approach - - calls aufs_create(), VFS finish_open() and notify_change(). - - pass fake-mode to finish_open(), and then correct the mode by - notify_change(). -C. Extend aufs_open() to call branch fs's ->atomic_open() - - no aufs_atomic_open(). - - aufs_lookup() registers the TID to an aufs internal object. - - aufs_create() does nothing when the matching TID is registered, but - registers the mode. - - aufs_open() calls branch fs's ->atomic_open() when the matching - TID is registered. -D. Extend aufs_open() to re-try branch fs's ->open() with superuser's - credential - - no aufs_atomic_open(). - - aufs_create() registers the TID to an internal object. this info - represents "this process created this file just now." - - when aufs gets EACCES from branch fs's ->open(), then confirm the - registered TID and re-try open() with superuser's credential. - -Pros and cons for each approach. - -A. - - straightforward but highly depends upon VFS internal. - - the atomic behavaiour is kept. - - some of parameters such as nameidata are hard to reproduce for - branch fs. - - large overhead. -B. - - easy to implement. - - the atomic behavaiour is lost. -C. - - the atomic behavaiour is kept. - - dirty and tricky. - - VFS checks whether the file is created correctly after calling - ->create(), which means this approach doesn't work. -D. - - easy to implement. - - the atomic behavaiour is lost. - - to open a file with superuser's credential and give it to a user - process is a bad idea, since the file object keeps the credential - in it. It may affect LSM or something. This approach doesn't work - either. - -The approach A is ideal, but it hard to implement. So here is a -variation of A, which is to be implemented. - -A-1. Introduce aufs_atomic_open() - - calls branch fs ->atomic_open() if exists. otherwise calls - vfs_create() and finish_open(). - - the demerit is that the several checks after branch fs - ->atomic_open() are lost. in the ordinary case, the checks are - done by VFS:do_last(), lookup_open() and atomic_open(). some can - be implemented in aufs, but not all I am afraid. diff --git a/Documentation/filesystems/aufs/design/03lookup.txt b/Documentation/filesystems/aufs/design/03lookup.txt deleted file mode 100644 index 5b6b000b5..000000000 --- a/Documentation/filesystems/aufs/design/03lookup.txt +++ /dev/null @@ -1,100 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -Lookup in a Branch ----------------------------------------------------------------------- -Since aufs has a character of sub-VFS (see Introduction), it operates -lookup for branches as VFS does. It may be a heavy work. But almost all -lookup operation in aufs is the simplest case, ie. lookup only an entry -directly connected to its parent. Digging down the directory hierarchy -is unnecessary. VFS has a function lookup_one_len() for that use, and -aufs calls it. - -When a branch is a remote filesystem, aufs basically relies upon its -->d_revalidate(), also aufs forces the hardest revalidate tests for -them. -For d_revalidate, aufs implements three levels of revalidate tests. See -"Revalidate Dentry and UDBA" in detail. - - -Test Only the Highest One for the Directory Permission (dirperm1 option) ----------------------------------------------------------------------- -Let's try case study. -- aufs has two branches, upper readwrite and lower readonly. - /au = /rw + /ro -- "dirA" exists under /ro, but /rw. and its mode is 0700. -- user invoked "chmod a+rx /au/dirA" -- the internal copy-up is activated and "/rw/dirA" is created and its - permission bits are set to world readable. -- then "/au/dirA" becomes world readable? - -In this case, /ro/dirA is still 0700 since it exists in readonly branch, -or it may be a natively readonly filesystem. If aufs respects the lower -branch, it should not respond readdir request from other users. But user -allowed it by chmod. Should really aufs rejects showing the entries -under /ro/dirA? - -To be honest, I don't have a good solution for this case. So aufs -implements 'dirperm1' and 'nodirperm1' mount options, and leave it to -users. -When dirperm1 is specified, aufs checks only the highest one for the -directory permission, and shows the entries. Otherwise, as usual, checks -every dir existing on all branches and rejects the request. - -As a side effect, dirperm1 option improves the performance of aufs -because the number of permission check is reduced when the number of -branch is many. - - -Revalidate Dentry and UDBA (User's Direct Branch Access) ----------------------------------------------------------------------- -Generally VFS helpers re-validate a dentry as a part of lookup. -0. digging down the directory hierarchy. -1. lock the parent dir by its i_mutex. -2. lookup the final (child) entry. -3. revalidate it. -4. call the actual operation (create, unlink, etc.) -5. unlock the parent dir - -If the filesystem implements its ->d_revalidate() (step 3), then it is -called. Actually aufs implements it and checks the dentry on a branch is -still valid. -But it is not enough. Because aufs has to release the lock for the -parent dir on a branch at the end of ->lookup() (step 2) and -->d_revalidate() (step 3) while the i_mutex of the aufs dir is still -held by VFS. -If the file on a branch is changed directly, eg. bypassing aufs, after -aufs released the lock, then the subsequent operation may cause -something unpleasant result. - -This situation is a result of VFS architecture, ->lookup() and -->d_revalidate() is separated. But I never say it is wrong. It is a good -design from VFS's point of view. It is just not suitable for sub-VFS -character in aufs. - -Aufs supports such case by three level of revalidation which is -selectable by user. -1. Simple Revalidate - Addition to the native flow in VFS's, confirm the child-parent - relationship on the branch just after locking the parent dir on the - branch in the "actual operation" (step 4). When this validation - fails, aufs returns EBUSY. ->d_revalidate() (step 3) in aufs still - checks the validation of the dentry on branches. -2. Monitor Changes Internally by Inotify/Fsnotify - Addition to above, in the "actual operation" (step 4) aufs re-lookup - the dentry on the branch, and returns EBUSY if it finds different - dentry. - Additionally, aufs sets the inotify/fsnotify watch for every dir on branches - during it is in cache. When the event is notified, aufs registers a - function to kernel 'events' thread by schedule_work(). And the - function sets some special status to the cached aufs dentry and inode - private data. If they are not cached, then aufs has nothing to - do. When the same file is accessed through aufs (step 0-3) later, - aufs will detect the status and refresh all necessary data. - In this mode, aufs has to ignore the event which is fired by aufs - itself. -3. No Extra Validation - This is the simplest test and doesn't add any additional revalidation - test, and skip the revalidation in step 4. It is useful and improves - aufs performance when system surely hide the aufs branches from user, - by over-mounting something (or another method). diff --git a/Documentation/filesystems/aufs/design/04branch.txt b/Documentation/filesystems/aufs/design/04branch.txt deleted file mode 100644 index e68f4d3df..000000000 --- a/Documentation/filesystems/aufs/design/04branch.txt +++ /dev/null @@ -1,61 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -Branch Manipulation - -Since aufs supports dynamic branch manipulation, ie. add/remove a branch -and changing its permission/attribute, there are a lot of works to do. - - -Add a Branch ----------------------------------------------------------------------- -o Confirm the adding dir exists outside of aufs, including loopback - mount, and its various attributes. -o Initialize the xino file and whiteout bases if necessary. - See struct.txt. - -o Check the owner/group/mode of the directory - When the owner/group/mode of the adding directory differs from the - existing branch, aufs issues a warning because it may impose a - security risk. - For example, when a upper writable branch has a world writable empty - top directory, a malicious user can create any files on the writable - branch directly, like copy-up and modify manually. If something like - /etc/{passwd,shadow} exists on the lower readonly branch but the upper - writable branch, and the writable branch is world-writable, then a - malicious guy may create /etc/passwd on the writable branch directly - and the infected file will be valid in aufs. - I am afraid it can be a security issue, but aufs can do nothing except - producing a warning. - - -Delete a Branch ----------------------------------------------------------------------- -o Confirm the deleting branch is not busy - To be general, there is one merit to adopt "remount" interface to - manipulate branches. It is to discard caches. At deleting a branch, - aufs checks the still cached (and connected) dentries and inodes. If - there are any, then they are all in-use. An inode without its - corresponding dentry can be alive alone (for example, inotify/fsnotify case). - - For the cached one, aufs checks whether the same named entry exists on - other branches. - If the cached one is a directory, because aufs provides a merged view - to users, as long as one dir is left on any branch aufs can show the - dir to users. In this case, the branch can be removed from aufs. - Otherwise aufs rejects deleting the branch. - - If any file on the deleting branch is opened by aufs, then aufs - rejects deleting. - - -Modify the Permission of a Branch ----------------------------------------------------------------------- -o Re-initialize or remove the xino file and whiteout bases if necessary. - See struct.txt. - -o rw --> ro: Confirm the modifying branch is not busy - Aufs rejects the request if any of these conditions are true. - - a file on the branch is mmap-ed. - - a regular file on the branch is opened for write and there is no - same named entry on the upper branch. diff --git a/Documentation/filesystems/aufs/design/05wbr_policy.txt b/Documentation/filesystems/aufs/design/05wbr_policy.txt deleted file mode 100644 index 1726d5d06..000000000 --- a/Documentation/filesystems/aufs/design/05wbr_policy.txt +++ /dev/null @@ -1,51 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -Policies to Select One among Multiple Writable Branches ----------------------------------------------------------------------- -When the number of writable branch is more than one, aufs has to decide -the target branch for file creation or copy-up. By default, the highest -writable branch which has the parent (or ancestor) dir of the target -file is chosen (top-down-parent policy). -By user's request, aufs implements some other policies to select the -writable branch, for file creation several policies, round-robin, -most-free-space, and other policies. For copy-up, top-down-parent, -bottom-up-parent, bottom-up and others. - -As expected, the round-robin policy selects the branch in circular. When -you have two writable branches and creates 10 new files, 5 files will be -created for each branch. mkdir(2) systemcall is an exception. When you -create 10 new directories, all will be created on the same branch. -And the most-free-space policy selects the one which has most free -space among the writable branches. The amount of free space will be -checked by aufs internally, and users can specify its time interval. - -The policies for copy-up is more simple, -top-down-parent is equivalent to the same named on in create policy, -bottom-up-parent selects the writable branch where the parent dir -exists and the nearest upper one from the copyup-source, -bottom-up selects the nearest upper writable branch from the -copyup-source, regardless the existence of the parent dir. - -There are some rules or exceptions to apply these policies. -- If there is a readonly branch above the policy-selected branch and - the parent dir is marked as opaque (a variation of whiteout), or the - target (creating) file is whiteout-ed on the upper readonly branch, - then the result of the policy is ignored and the target file will be - created on the nearest upper writable branch than the readonly branch. -- If there is a writable branch above the policy-selected branch and - the parent dir is marked as opaque or the target file is whiteouted - on the branch, then the result of the policy is ignored and the target - file will be created on the highest one among the upper writable - branches who has diropq or whiteout. In case of whiteout, aufs removes - it as usual. -- link(2) and rename(2) systemcalls are exceptions in every policy. - They try selecting the branch where the source exists as possible - since copyup a large file will take long time. If it can't be, - ie. the branch where the source exists is readonly, then they will - follow the copyup policy. -- There is an exception for rename(2) when the target exists. - If the rename target exists, aufs compares the index of the branches - where the source and the target exists and selects the higher - one. If the selected branch is readonly, then aufs follows the - copyup policy. diff --git a/Documentation/filesystems/aufs/design/06fhsm.txt b/Documentation/filesystems/aufs/design/06fhsm.txt deleted file mode 100644 index 84b46dc5b..000000000 --- a/Documentation/filesystems/aufs/design/06fhsm.txt +++ /dev/null @@ -1,105 +0,0 @@ - -# Copyright (C) 2011-2016 Junjiro R. Okajima - -File-based Hierarchical Storage Management (FHSM) ----------------------------------------------------------------------- -Hierarchical Storage Management (or HSM) is a well-known feature in the -storage world. Aufs provides this feature as file-based with multiple -writable branches, based upon the principle of "Colder, the Lower". -Here the word "colder" means that the less used files, and "lower" means -that the position in the order of the stacked branches vertically. -These multiple writable branches are prioritized, ie. the topmost one -should be the fastest drive and be used heavily. - -o Characters in aufs FHSM story -- aufs itself and a new branch attribute. -- a new ioctl interface to move-down and to establish a connection with - the daemon ("move-down" is a converse of "copy-up"). -- userspace tool and daemon. - -The userspace daemon establishes a connection with aufs and waits for -the notification. The notified information is very similar to struct -statfs containing the number of consumed blocks and inodes. -When the consumed blocks/inodes of a branch exceeds the user-specified -upper watermark, the daemon activates its move-down process until the -consumed blocks/inodes reaches the user-specified lower watermark. - -The actual move-down is done by aufs based upon the request from -user-space since we need to maintain the inode number and the internal -pointer arrays in aufs. - -Currently aufs FHSM handles the regular files only. Additionally they -must not be hard-linked nor pseudo-linked. - - -o Cowork of aufs and the user-space daemon - During the userspace daemon established the connection, aufs sends a - small notification to it whenever aufs writes something into the - writable branch. But it may cost high since aufs issues statfs(2) - internally. So user can specify a new option to cache the - info. Actually the notification is controlled by these factors. - + the specified cache time. - + classified as "force" by aufs internally. - Until the specified time expires, aufs doesn't send the info - except the forced cases. When aufs decide forcing, the info is always - notified to userspace. - For example, the number of free inodes is generally large enough and - the shortage of it happens rarely. So aufs doesn't force the - notification when creating a new file, directory and others. This is - the typical case which aufs doesn't force. - When aufs writes the actual filedata and the files consumes any of new - blocks, the aufs forces notifying. - - -o Interfaces in aufs -- New branch attribute. - + fhsm - Specifies that the branch is managed by FHSM feature. In other word, - participant in the FHSM. - When nofhsm is set to the branch, it will not be the source/target - branch of the move-down operation. This attribute is set - independently from coo and moo attributes, and if you want full - FHSM, you should specify them as well. -- New mount option. - + fhsm_sec - Specifies a second to suppress many less important info to be - notified. -- New ioctl. - + AUFS_CTL_FHSM_FD - create a new file descriptor which userspace can read the notification - (a subset of struct statfs) from aufs. -- Module parameter 'brs' - It has to be set to 1. Otherwise the new mount option 'fhsm' will not - be set. -- mount helpers /sbin/mount.aufs and /sbin/umount.aufs - When there are two or more branches with fhsm attributes, - /sbin/mount.aufs invokes the user-space daemon and /sbin/umount.aufs - terminates it. As a result of remounting and branch-manipulation, the - number of branches with fhsm attribute can be one. In this case, - /sbin/mount.aufs will terminate the user-space daemon. - - -Finally the operation is done as these steps in kernel-space. -- make sure that, - + no one else is using the file. - + the file is not hard-linked. - + the file is not pseudo-linked. - + the file is a regular file. - + the parent dir is not opaqued. -- find the target writable branch. -- make sure the file is not whiteout-ed by the upper (than the target) - branch. -- make the parent dir on the target branch. -- mutex lock the inode on the branch. -- unlink the whiteout on the target branch (if exists). -- lookup and create the whiteout-ed temporary name on the target branch. -- copy the file as the whiteout-ed temporary name on the target branch. -- rename the whiteout-ed temporary name to the original name. -- unlink the file on the source branch. -- maintain the internal pointer array and the external inode number - table (XINO). -- maintain the timestamps and other attributes of the parent dir and the - file. - -And of course, in every step, an error may happen. So the operation -should restore the original file state after an error happens. diff --git a/Documentation/filesystems/aufs/design/06mmap.txt b/Documentation/filesystems/aufs/design/06mmap.txt deleted file mode 100644 index 991c0b1fa..000000000 --- a/Documentation/filesystems/aufs/design/06mmap.txt +++ /dev/null @@ -1,59 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -mmap(2) -- File Memory Mapping ----------------------------------------------------------------------- -In aufs, the file-mapped pages are handled by a branch fs directly, no -interaction with aufs. It means aufs_mmap() calls the branch fs's -->mmap(). -This approach is simple and good, but there is one problem. -Under /proc, several entries show the mmapped files by its path (with -device and inode number), and the printed path will be the path on the -branch fs's instead of virtual aufs's. -This is not a problem in most cases, but some utilities lsof(1) (and its -user) may expect the path on aufs. - -To address this issue, aufs adds a new member called vm_prfile in struct -vm_area_struct (and struct vm_region). The original vm_file points to -the file on the branch fs in order to handle everything correctly as -usual. The new vm_prfile points to a virtual file in aufs, and the -show-functions in procfs refers to vm_prfile if it is set. -Also we need to maintain several other places where touching vm_file -such like -- fork()/clone() copies vma and the reference count of vm_file is - incremented. -- merging vma maintains the ref count too. - -This is not a good approach. It just fakes the printed path. But it -leaves all behaviour around f_mapping unchanged. This is surely an -advantage. -Actually aufs had adopted another complicated approach which calls -generic_file_mmap() and handles struct vm_operations_struct. In this -approach, aufs met a hard problem and I could not solve it without -switching the approach. - -There may be one more another approach which is -- bind-mount the branch-root onto the aufs-root internally -- grab the new vfsmount (ie. struct mount) -- lazy-umount the branch-root internally -- in open(2) the aufs-file, open the branch-file with the hidden - vfsmount (instead of the original branch's vfsmount) -- ideally this "bind-mount and lazy-umount" should be done atomically, - but it may be possible from userspace by the mount helper. - -Adding the internal hidden vfsmount and using it in opening a file, the -file path under /proc will be printed correctly. This approach looks -smarter, but is not possible I am afraid. -- aufs-root may be bind-mount later. when it happens, another hidden - vfsmount will be required. -- it is hard to get the chance to bind-mount and lazy-umount - + in kernel-space, FS can have vfsmount in open(2) via - file->f_path, and aufs can know its vfsmount. But several locks are - already acquired, and if aufs tries to bind-mount and lazy-umount - here, then it may cause a deadlock. - + in user-space, bind-mount doesn't invoke the mount helper. -- since /proc shows dev and ino, aufs has to give vma these info. it - means a new member vm_prinode will be necessary. this is essentially - equivalent to vm_prfile described above. - -I have to give up this "looks-smater" approach. diff --git a/Documentation/filesystems/aufs/design/06xattr.txt b/Documentation/filesystems/aufs/design/06xattr.txt deleted file mode 100644 index 7bfa94f7b..000000000 --- a/Documentation/filesystems/aufs/design/06xattr.txt +++ /dev/null @@ -1,81 +0,0 @@ - -# Copyright (C) 2014-2016 Junjiro R. Okajima - -Listing XATTR/EA and getting the value ----------------------------------------------------------------------- -For the inode standard attributes (owner, group, timestamps, etc.), aufs -shows the values from the topmost existing file. This behaviour is good -for the non-dir entries since the bahaviour exactly matches the shown -information. But for the directories, aufs considers all the same named -entries on the lower branches. Which means, if one of the lower entry -rejects readdir call, then aufs returns an error even if the topmost -entry allows it. This behaviour is necessary to respect the branch fs's -security, but can make users confused since the user-visible standard -attributes don't match the behaviour. -To address this issue, aufs has a mount option called dirperm1 which -checks the permission for the topmost entry only, and ignores the lower -entry's permission. - -A similar issue can happen around XATTR. -getxattr(2) and listxattr(2) families behave as if dirperm1 option is -always set. Otherwise these very unpleasant situation would happen. -- listxattr(2) may return the duplicated entries. -- users may not be able to remove or reset the XATTR forever, - - -XATTR/EA support in the internal (copy,move)-(up,down) ----------------------------------------------------------------------- -Generally the extended attributes of inode are categorized as these. -- "security" for LSM and capability. -- "system" for posix ACL, 'acl' mount option is required for the branch - fs generally. -- "trusted" for userspace, CAP_SYS_ADMIN is required. -- "user" for userspace, 'user_xattr' mount option is required for the - branch fs generally. - -Moreover there are some other categories. Aufs handles these rather -unpopular categories as the ordinary ones, ie. there is no special -condition nor exception. - -In copy-up, the support for XATTR on the dst branch may differ from the -src branch. In this case, the copy-up operation will get an error and -the original user operation which triggered the copy-up will fail. It -can happen that even all copy-up will fail. -When both of src and dst branches support XATTR and if an error occurs -during copying XATTR, then the copy-up should fail obviously. That is a -good reason and aufs should return an error to userspace. But when only -the src branch support that XATTR, aufs should not return an error. -For example, the src branch supports ACL but the dst branch doesn't -because the dst branch may natively un-support it or temporary -un-support it due to "noacl" mount option. Of course, the dst branch fs -may NOT return an error even if the XATTR is not supported. It is -totally up to the branch fs. - -Anyway when the aufs internal copy-up gets an error from the dst branch -fs, then aufs tries removing the just copied entry and returns the error -to the userspace. The worst case of this situation will be all copy-up -will fail. - -For the copy-up operation, there two basic approaches. -- copy the specified XATTR only (by category above), and return the - error unconditionally if it happens. -- copy all XATTR, and ignore the error on the specified category only. - -In order to support XATTR and to implement the correct behaviour, aufs -chooses the latter approach and introduces some new branch attributes, -"icexsec", "icexsys", "icextr", "icexusr", and "icexoth". -They correspond to the XATTR namespaces (see above). Additionally, to be -convenient, "icex" is also provided which means all "icex*" attributes -are set (here the word "icex" stands for "ignore copy-error on XATTR"). - -The meaning of these attributes is to ignore the error from setting -XATTR on that branch. -Note that aufs tries copying all XATTR unconditionally, and ignores the -error from the dst branch according to the specified attributes. - -Some XATTR may have its default value. The default value may come from -the parent dir or the environment. If the default value is set at the -file creating-time, it will be overwritten by copy-up. -Some contradiction may happen I am afraid. -Do we need another attribute to stop copying XATTR? I am unsure. For -now, aufs implements the branch attributes to ignore the error. diff --git a/Documentation/filesystems/aufs/design/07export.txt b/Documentation/filesystems/aufs/design/07export.txt deleted file mode 100644 index c23930b49..000000000 --- a/Documentation/filesystems/aufs/design/07export.txt +++ /dev/null @@ -1,45 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -Export Aufs via NFS ----------------------------------------------------------------------- -Here is an approach. -- like xino/xib, add a new file 'xigen' which stores aufs inode - generation. -- iget_locked(): initialize aufs inode generation for a new inode, and - store it in xigen file. -- destroy_inode(): increment aufs inode generation and store it in xigen - file. it is necessary even if it is not unlinked, because any data of - inode may be changed by UDBA. -- encode_fh(): for a root dir, simply return FILEID_ROOT. otherwise - build file handle by - + branch id (4 bytes) - + superblock generation (4 bytes) - + inode number (4 or 8 bytes) - + parent dir inode number (4 or 8 bytes) - + inode generation (4 bytes)) - + return value of exportfs_encode_fh() for the parent on a branch (4 - bytes) - + file handle for a branch (by exportfs_encode_fh()) -- fh_to_dentry(): - + find the index of a branch from its id in handle, and check it is - still exist in aufs. - + 1st level: get the inode number from handle and search it in cache. - + 2nd level: if not found in cache, get the parent inode number from - the handle and search it in cache. and then open the found parent - dir, find the matching inode number by vfs_readdir() and get its - name, and call lookup_one_len() for the target dentry. - + 3rd level: if the parent dir is not cached, call - exportfs_decode_fh() for a branch and get the parent on a branch, - build a pathname of it, convert it a pathname in aufs, call - path_lookup(). now aufs gets a parent dir dentry, then handle it as - the 2nd level. - + to open the dir, aufs needs struct vfsmount. aufs keeps vfsmount - for every branch, but not itself. to get this, (currently) aufs - searches in current->nsproxy->mnt_ns list. it may not be a good - idea, but I didn't get other approach. - + test the generation of the gotten inode. -- every inode operation: they may get EBUSY due to UDBA. in this case, - convert it into ESTALE for NFSD. -- readdir(): call lockdep_on/off() because filldir in NFSD calls - lookup_one_len(), vfs_getattr(), encode_fh() and others. diff --git a/Documentation/filesystems/aufs/design/08shwh.txt b/Documentation/filesystems/aufs/design/08shwh.txt deleted file mode 100644 index ad58ebe15..000000000 --- a/Documentation/filesystems/aufs/design/08shwh.txt +++ /dev/null @@ -1,39 +0,0 @@ - -# Copyright (C) 2005-2016 Junjiro R. Okajima - -Show Whiteout Mode (shwh) ----------------------------------------------------------------------- -Generally aufs hides the name of whiteouts. But in some cases, to show -them is very useful for users. For instance, creating a new middle layer -(branch) by merging existing layers. - -(borrowing aufs1 HOW-TO from a user, Michael Towers) -When you have three branches, -- Bottom: 'system', squashfs (underlying base system), read-only -- Middle: 'mods', squashfs, read-only -- Top: 'overlay', ram (tmpfs), read-write - -The top layer is loaded at boot time and saved at shutdown, to preserve -the changes made to the system during the session. -When larger changes have been made, or smaller changes have accumulated, -the size of the saved top layer data grows. At this point, it would be -nice to be able to merge the two overlay branches ('mods' and 'overlay') -and rewrite the 'mods' squashfs, clearing the top layer and thus -restoring save and load speed. - -This merging is simplified by the use of another aufs mount, of just the -two overlay branches using the 'shwh' option. -# mount -t aufs -o ro,shwh,br:/livesys/overlay=ro+wh:/livesys/mods=rr+wh \ - aufs /livesys/merge_union - -A merged view of these two branches is then available at -/livesys/merge_union, and the new feature is that the whiteouts are -visible! -Note that in 'shwh' mode the aufs mount must be 'ro', which will disable -writing to all branches. Also the default mode for all branches is 'ro'. -It is now possible to save the combined contents of the two overlay -branches to a new squashfs, e.g.: -# mksquashfs /livesys/merge_union /path/to/newmods.squash - -This new squashfs archive can be stored on the boot device and the -initramfs will use it to replace the old one at the next boot. diff --git a/Documentation/filesystems/aufs/design/10dynop.txt b/Documentation/filesystems/aufs/design/10dynop.txt deleted file mode 100644 index 49afc5899..000000000 --- a/Documentation/filesystems/aufs/design/10dynop.txt +++ /dev/null @@ -1,34 +0,0 @@ - -# Copyright (C) 2010-2016 Junjiro R. Okajima - -Dynamically customizable FS operations ----------------------------------------------------------------------- -Generally FS operations (struct inode_operations, struct -address_space_operations, struct file_operations, etc.) are defined as -"static const", but it never means that FS have only one set of -operation. Some FS have multiple sets of them. For instance, ext2 has -three sets, one for XIP, for NOBH, and for normal. -Since aufs overrides and redirects these operations, sometimes aufs has -to change its behaviour according to the branch FS type. More importantly -VFS acts differently if a function (member in the struct) is set or -not. It means aufs should have several sets of operations and select one -among them according to the branch FS definition. - -In order to solve this problem and not to affect the behaviour of VFS, -aufs defines these operations dynamically. For instance, aufs defines -dummy direct_IO function for struct address_space_operations, but it may -not be set to the address_space_operations actually. When the branch FS -doesn't have it, aufs doesn't set it to its address_space_operations -while the function definition itself is still alive. So the behaviour -itself will not change, and it will return an error when direct_IO is -not set. - -The lifetime of these dynamically generated operation object is -maintained by aufs branch object. When the branch is removed from aufs, -the reference counter of the object is decremented. When it reaches -zero, the dynamically generated operation object will be freed. - -This approach is designed to support AIO (io_submit), Direct I/O and -XIP (DAX) mainly. -Currently this approach is applied to address_space_operations for -regular files only. -- cgit v1.2.3-54-g00ecf