From 670027c507e99521d416994a18a498def9ef2ea3 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Andr=C3=A9=20Fabian=20Silva=20Delgado?=
Date: Sat, 22 Oct 2016 19:31:08 -0300
Subject: Linux-libre 4.8.3-gnu

---
 Documentation/power/tuxonice-internals.txt | 532 -----------------------------
 1 file changed, 532 deletions(-)
 delete mode 100644 Documentation/power/tuxonice-internals.txt

diff --git a/Documentation/power/tuxonice-internals.txt b/Documentation/power/tuxonice-internals.txt
deleted file mode 100644
index 0c6a2163a..000000000
--- a/Documentation/power/tuxonice-internals.txt
+++ /dev/null
@@ -1,532 +0,0 @@

                    TuxOnIce 4.0 Internal Documentation.
                         Updated to 23 March 2015

(Please note that incremental image support mentioned in this document is work in progress. This document may need updating prior to the actual release of 4.0!)

1. Introduction.

  TuxOnIce 4.0 is an addition to the Linux Kernel, designed to allow the user to quickly shut down and quickly boot a computer, without needing to close documents or programs. It is equivalent to the hibernate facility in some laptops. This implementation, however, requires no special BIOS or hardware support.

  The code in these files is based upon the original implementation prepared by Gabor Kuti and additional work by Pavel Machek and a host of others. This code has been substantially reworked by Nigel Cunningham, again with the help and testing of many others, not the least of whom are Bernard Blackham and Michael Frank. At its heart, however, the operation is essentially the same as Gabor's version.

2. Overview of operation.

  The basic sequence of operations is as follows:

    a. Quiesce all other activity.
    b. Ensure enough memory and storage space are available, and attempt to free memory/storage if necessary.
    c. Allocate the required memory and storage space.
    d. Write the image.
    e. Power down.

  There are a number of complicating factors which mean that things are not as simple as the above would imply, however...

  o The activity of each process must be stopped at a point where it will not be holding locks necessary for saving the image, or unexpectedly restart operations due to something like a timeout and thereby make our image inconsistent.

  o It is desirable that we sync outstanding I/O to disk before calculating image statistics. This reduces corruption if one should suspend but then not resume, and also makes later parts of the operation safer (see below).

  o We need to get as close as we can to an atomic copy of the data. Inconsistencies in the image will result in inconsistent memory contents at resume time, and thus in instability of the system and/or file system corruption. This would appear to imply a maximum image size of one half of the amount of RAM, but we have a solution... (again, below).

  o In 2.6 and later, we choose to play nicely with the other suspend-to-disk implementations.

3. Detailed description of internals.

  a. Quiescing activity.

  Safely quiescing the system is achieved using three separate but related aspects.

  First, we use the vanilla kernel's support for freezing processes. This code is based on the observation that the vast majority of processes don't need to run during suspend. They can be 'frozen'. The kernel therefore implements a refrigerator routine, which processes enter and in which they remain until the cycle is complete. Processes enter the refrigerator via try_to_freeze() invocations at appropriate places. A process cannot be frozen in any old place. It must not be holding locks that will be needed for writing the image or freezing other processes. For this reason, userspace processes generally enter the refrigerator via the signal handling code, and kernel threads at the place in their event loops where they drop locks and yield to other processes or sleep.
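  To make the pattern just described concrete, the following is a minimal sketch of a freezable kernel thread using the vanilla freezer API (set_freezable(), wait_event_freezable() and try_to_freeze()). The wait queue and the two helpers are illustrative only, not actual TuxOnIce symbols.

      #include <linux/kthread.h>
      #include <linux/freezer.h>
      #include <linux/wait.h>

      /* Illustrative helpers, assumed to be implemented elsewhere. */
      bool example_work_pending(void);
      void example_do_work(void);

      static DECLARE_WAIT_QUEUE_HEAD(example_wq);

      static int example_thread(void *data)
      {
              set_freezable();        /* opt in to the freezer */

              while (!kthread_should_stop()) {
                      /*
                       * Sleep until there is work. No locks are held here,
                       * so this is a safe place for the freezer to catch us.
                       */
                      if (wait_event_freezable(example_wq,
                                               example_work_pending() ||
                                               kthread_should_stop()))
                              continue;       /* interrupted; recheck */

                      try_to_freeze();        /* enter the refrigerator if asked */

                      example_do_work();
              }

              return 0;
      }

  A userspace process reaches the equivalent point in the signal handling code, as described above.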
  The task of freezing processes is complicated by the fact that there can be interdependencies between processes. Freezing process A before process B may mean that process B cannot be frozen, because it stops while waiting for process A rather than in the refrigerator. This issue is seen where userspace waits on freezable kernel threads or fuse filesystem threads. To address this issue, we implement the following algorithm for quiescing activity:

    - Freeze filesystems (including fuse - userspace programs starting new requests are immediately frozen; programs already running requests complete their work before being frozen in the next step)
    - Freeze userspace
    - Thaw filesystems (this is safe now that userspace is frozen and no fuse requests are outstanding).
    - Invoke sys_sync (no-op on fuse).
    - Freeze filesystems
    - Freeze kernel threads

  If we need to free memory, we thaw kernel threads and filesystems, but not userspace. We can then free caches without worrying about deadlocks due to swap files being on frozen filesystems or such like.

  b. Ensure enough memory & storage are available.

  We have a number of constraints to meet in order to be able to successfully suspend and resume.

  First, the image will be written in two parts, described below. One of these parts needs to have an atomic copy made, which of course implies a maximum size of one half of the amount of system memory. The other part ('pageset') is not atomically copied, and can therefore be as large or small as desired.

  Second, we have constraints on the amount of storage available. In these calculations, we may also consider any compression that will be done. The cryptoapi module allows the user to configure an expected compression ratio.

  Third, the user can specify an arbitrary limit on the image size, in megabytes. This limit is treated as a soft limit, so that we don't fail the attempt to suspend if we cannot meet this constraint.

  c. Allocate the required memory and storage space.

  Having done the initial freeze, we determine whether the above constraints are met, and seek to allocate the metadata for the image. If the constraints are not met, or we fail to allocate the required space for the metadata, we seek to free the amount of memory that we calculate is needed and try again. We allow up to four iterations of this loop before aborting the cycle. If we do fail, it should only be because of a bug in TuxOnIce's calculations or the vanilla kernel code for freeing memory.

  These steps are merged together in the prepare_image function, found in prepare_image.c. The functions are merged because of the cyclical nature of the problem of calculating how much memory and storage is needed. Since the data structures containing the information about the image must themselves take memory and use storage, the amount of memory and storage required changes as we prepare the image. Since the changes are not large, only one or two iterations will be required to achieve a solution.
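  In outline, the retry loop just described behaves like the following sketch. The function names here are illustrative only - the real logic lives in prepare_image.c and is considerably more detailed.

      static int prepare_image_sketch(void)
      {
              int attempt;

              for (attempt = 0; attempt < 4; attempt++) {
                      recalculate_image_statistics();  /* sizes change as we go */

                      if (constraints_met() && allocate_image_metadata() == 0)
                              return 0;                /* ready to write the image */

                      /*
                       * Not enough memory or storage: thaw kernel threads and
                       * filesystems (but not userspace), try to free the
                       * calculated shortfall, then refreeze and try again.
                       */
                      thaw_kernel_threads_and_filesystems();
                      try_to_free_memory(calculate_shortfall());
                      refreeze_kernel_threads_and_filesystems();
              }

              return -ENOMEM;                          /* abort the cycle */
      }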
  The recursive nature of the algorithm is minimised by keeping user space frozen while preparing the image, and by the fact that our records of which pages are to be saved and which pageset they are saved in use bitmaps (so that changes in number or fragmentation of the pages to be saved don't feed back via changes in the amount of memory needed for metadata). The recursiveness is thus limited to any extra slab pages allocated to store the extents that record storage used, and the effects of seeking to free memory.

  d. Write the image.

  We previously mentioned the need to create an atomic copy of the data, and the half-of-memory limitation that is implied in this. This limitation is circumvented by dividing the memory to be saved into two parts, called pagesets.

  Pageset2 contains most of the page cache - the pages on the active and inactive LRU lists that aren't needed or modified while TuxOnIce is running, so they can be safely written without an atomic copy. They are therefore saved first and reloaded last. While saving these pages, TuxOnIce carefully ensures that the work of writing the pages doesn't make the image inconsistent. With the support for Kernel (Video) Mode Setting going into the kernel at the time of writing, we need to check for pages on the LRU that are used by KMS, and exclude them from pageset2. They are atomically copied as part of pageset1.

  Once pageset2 has been saved, we prepare to do the atomic copy of remaining memory. As part of the preparation, we power down drivers, thereby providing them with the opportunity to have their state recorded in the image. The amount of memory allocated by drivers for this is usually negligible, but if DRI is in use, video drivers may require significant amounts. Ideally we would be able to query drivers while preparing the image as to the amount of memory they will need. Unfortunately no such mechanism exists at the time of writing. For this reason, TuxOnIce allows the user to set an 'extra_pages_allowance', which is used to seek to ensure sufficient memory is available for drivers at this point. TuxOnIce also lets the user set this value to 0. In this case, a test driver suspend is done while preparing the image, and the difference (plus a margin) is used instead. TuxOnIce will also automatically restart the hibernation process (twice at most) if it finds that the extra pages allowance is not sufficient. It will then use what was actually needed (plus a margin, again). Failure to hibernate should thus be an extremely rare occurrence.

  Having suspended the drivers, we save the CPU context before making an atomic copy of pageset1, resuming the drivers and saving the atomic copy. After saving the two pagesets, we just need to save our metadata before powering down.

  As we mentioned earlier, the contents of pageset2 pages aren't needed once they've been saved. We therefore use them as the destination of our atomic copy. In the unlikely event that pageset1 is larger, extra pages are allocated while the image is being prepared. This is normally only a real possibility when the system has just been booted and the page cache is small.

  This is where we need to be careful about syncing, however. Pageset2 will probably contain filesystem metadata. If this is overwritten with pageset1 and then a sync occurs, the filesystem will be corrupted - at least until resume time and another sync of the restored data. Since there is a possibility that the user might not resume or (may it never be!) that TuxOnIce might oops, we do our utmost to avoid syncing filesystems after copying pageset1.
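  Putting sections 3.a to 3.d together, the overall write path amounts to roughly the following ordering. This is a simplified sketch with illustrative names; the real code is spread over several files and adds error handling, as well as the restart logic used when the extra pages allowance proves insufficient.

      static int write_image_sketch(void)
      {
              quiesce_system();                /* section 3.a */

              if (prepare_image_sketch())      /* sections 3.b and 3.c */
                      return -ENOMEM;

              write_pageset2();                /* LRU pages, no atomic copy needed */

              suspend_drivers();               /* drivers record their state */
              save_cpu_context();
              atomic_copy_pageset1();          /* copied into the freed pageset2 pages */
              resume_drivers();

              write_pageset1();                /* the atomic copy */
              write_image_header();            /* metadata, saved last */

              power_down();                    /* section 3.f */
              return 0;
      }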
  e. Incremental images

  TuxOnIce 4.0 introduces a new incremental image mode which changes things a little. When incremental images are enabled, we save a 'normal' image the first time we hibernate. On resume, however, we do not free the image or the associated storage. Instead, it is retained until the next attempt at hibernating, and a mechanism is enabled which is used to track which pages of memory are modified between the two cycles. The modified pages can then be added to the existing image, rather than unmodified pages being saved again unnecessarily.

  Incremental image support is available on 64-bit Linux only, due to the requirement for extra page flags.

  This support is accomplished in the following way:

  1) Tracking of pages.

  The tracking of changed pages is accomplished using the page fault mechanism. When we reach a point at which we want to start tracking changes, most pages are marked read-only and also flagged as being read-only because of this support. Since this cannot happen for every page of RAM, some are marked as untracked and always treated as modified when preparing an incremental image. When a process attempts to modify a page that is marked read-only in this way, a page fault occurs, with TuxOnIce code marking the page writable and dirty before allowing the write to continue. In this way, the effect of incremental images on performance is minimised - a page only causes a fault once. Small modifications to the page allocator further reduce the number of faults that occur - free pages are not tracked; they are made writable and marked as dirty as part of being allocated.

  2) Saving the incremental image / atomicity.

  The page fault mechanism is also used to improve the means by which atomicity of the image is achieved. When it is time to do an atomic copy, the flags for pages are reset, with the result being that it is no longer necessary for us to do an atomic copy of pageset1. Instead, we normally write the uncopied pages to disk. When an attempt is made to modify a page that has not yet been saved, the page-fault mechanism makes a copy of the page prior to allowing the write. This copy is then written to disk. Likewise, on resume, if a process attempts to write to a page that has been read while the rest of the image is still being loaded, a copy of that page is made prior to the write being allowed. At the end of loading the image, modified pages can thus be restored to their 'atomic copy' contents prior to restarting normal operation. We also mark pages that are yet to be read as invalid PFNs, so that we can capture as a bug any attempt by a half-restored kernel to access a page that hasn't yet been reloaded.
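  The fault-time behaviour described in 1) and 2) can be summarised by the sketch below. The helpers are illustrative only; the real implementation hooks into the architecture's page fault path and relies on the extra page flags mentioned above.

      /* Called for a write fault on a page we write-protected for tracking. */
      static void incremental_fault_sketch(struct page *page)
      {
              /* Untracked pages are always treated as modified. */
              if (!page_is_tracked(page))
                      return;

              /*
               * If an image is being written or restored and this page's
               * 'atomic' contents have not been preserved yet, copy it
               * before the write is allowed to proceed.
               */
              if (image_in_flight() && !page_already_saved(page))
                      save_copy_of_page(page);

              /* Re-enable writes and remember that the page is now dirty. */
              mark_page_writable(page);
              mark_page_dirty_for_incremental(page);
      }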
  f. Power down.

  Powering down uses standard kernel routines. TuxOnIce supports powering down using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off. Supporting suspend to RAM (S3) as a power-off option might sound strange, but it allows the user to quickly get their system up and running again if the battery doesn't run out (we just need to re-read the overwritten pages) and, if the battery does run out (or the user removes power), they can still resume.

4. Data Structures.

  TuxOnIce uses three main structures to store its metadata and configuration information:

  a) Pageflags bitmaps.

  TuxOnIce records which pages will be in pageset1, pageset2, the destination of the atomic copy and the source of the atomically restored image using bitmaps. The code used is that written for swsusp, with small improvements to match TuxOnIce's requirements.

  The pageset1 bitmap is thus easily stored in the image header for use at resume time.

  As mentioned above, using bitmaps also means that the amount of memory and storage required for recording the above information is constant. This greatly simplifies the work of preparing the image. In earlier versions of TuxOnIce, extents were used to record which pages would be stored. In that case, however, eating memory could result in greater fragmentation of the lists of pages, which in turn required more memory to store the extents and more storage in the image header. These could in turn require further freeing of memory, and another iteration. All of this complexity is removed by having bitmaps.

  Bitmaps also make a lot of sense because TuxOnIce only ever iterates through the lists. There is therefore no cost to not being able to find the nth page in O(1) time. We only need to worry about the cost of finding the (n+1)th page, given the location of the nth page. Bitwise optimisations help here.

  b) Extents for block data.

  TuxOnIce supports writing the image to multiple block devices. In the case of swap, multiple partitions and/or files may be in use, and we happily use them all (with the exception of compcache pages, which we allocate but do not use). This use of multiple block devices is accomplished as follows:

  Whatever the actual source of the allocated storage, the destination of the image can be viewed in terms of one or more block devices, and on each device, a list of sectors. To simplify matters, we only use contiguous, PAGE_SIZE aligned sectors, like the swap code does.

  Since sector numbers on each bdev may well not start at 0, it makes much more sense to use extents here. Contiguous ranges of pages can thus be represented in the extents by contiguous values.

  Variations in block size are taken into account in transforming this data into the parameters for bio submission.

  We can thus implement a layer of abstraction wherein the core of TuxOnIce doesn't have to worry about which device we're currently writing to or where in the device we are. It simply requests that the next page in the pageset or header be written, leaving the details to this lower layer. The lower layer remembers where in the sequence of devices and blocks each pageset starts. The header always starts at the beginning of the allocated storage.

  So extents are:

      struct extent {
              unsigned long minimum, maximum;
              struct extent *next;
      };

  These are combined into chains of extents for a device:

      struct extent_chain {
              int size; /* size of the chain, ie sum (max-min+1) */
              int allocs, frees;
              char *name;
              struct extent *first, *last_touched;
      };

  For each bdev, we need to store a little more info (simplified definition):

      struct toi_bdev_info {
              struct block_device *bdev;

              char uuid[17];
              dev_t dev_t;
              int bmap_shift;
              int blocks_per_page;
      };

  The uuid is the main means used to identify the device in the storage image. This means we can cope with the dev_t representation of a device changing between saving the image and restoring it, as may happen on some BIOSes or in the LVM case.
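  Returning to the extent and extent_chain structures above, the sketch below shows how contiguous storage collapses into a single extent when blocks are added in increasing order. The helper name is illustrative, last_touched is assumed to point at the most recently added extent, and kzalloc() is assumed to be available via <linux/slab.h>.

      /* Append one page-sized block, merging with the last extent if possible. */
      static int extent_chain_add_sketch(struct extent_chain *chain,
                                         unsigned long block)
      {
              struct extent *last = chain->last_touched;

              if (last && block == last->maximum + 1) {
                      last->maximum = block;          /* extend the last extent */
              } else {
                      struct extent *new = kzalloc(sizeof(*new), GFP_KERNEL);

                      if (!new)
                              return -ENOMEM;
                      new->minimum = new->maximum = block;
                      if (last)
                              last->next = new;
                      else
                              chain->first = new;
                      chain->last_touched = new;
                      chain->allocs++;
              }

              chain->size++;
              return 0;
      }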
  bmap_shift and blocks_per_page apply the effects of variations in blocks per page settings for the filesystem and underlying bdev. For most filesystems, these are the same, but for xfs, they can have independent values.

  Combining these two structures together, we have everything we need to record what devices and what blocks on each device are being used to store the image, and to submit I/O using submit_bio.

  The last elements in the picture are a means of recording how the storage is being used.

  We do this first and foremost by implementing a layer of abstraction on top of the devices and extent chains which allows us to view however many devices there might be as one long storage tape, with a single 'head' that tracks a 'current position' on the tape:

      struct extent_iterate_state {
              struct extent_chain *chains;
              int num_chains;
              int current_chain;
              struct extent *current_extent;
              unsigned long current_offset;
      };

  That is, *chains points to an array of size num_chains of extent chains. For the filewriter, this is always a single chain. For the swapwriter, the array is of size MAX_SWAPFILES.

  current_chain, current_extent and current_offset thus point to the current index in the chains array (and into a matching array of struct toi_bdev_info), the current extent in that chain (to optimise access), and the current value in the offset.

  The image is divided into three parts:
    - The header
    - Pageset 1
    - Pageset 2

  The header always starts at the first device and first block. We know its size before we begin to save the image because we carefully account for everything that will be stored in it.

  The second pageset (LRU) is stored first. It begins on the next page after the end of the header.

  The first pageset is stored second. Its start location is only known once pageset2 has been saved, since pageset2 may be compressed as it is written. This location is thus recorded at the end of saving pageset2. It is page aligned also.

  Since this information is needed at resume time, and the location of extents in memory will differ at resume time, this needs to be stored in a portable way:

      struct extent_iterate_saved_state {
              int chain_num;
              int extent_num;
              unsigned long offset;
      };

  We can thus implement a layer of abstraction wherein the core of TuxOnIce doesn't have to worry about which device we're currently writing to or where in the device we are. It simply requests that the next page in the pageset or header be written, leaving the details to this layer, and invokes the routines to remember and restore the position, without having to worry about the details of how the data is arranged on disk or such like.
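  A sketch of how the 'head' advances when the next block of storage is requested is shown below. It is simplified (the chain being moved to is assumed to be non-empty) and the function name is illustrative; the real code also takes care of switching between the underlying block devices and submitting the I/O.

      static int storage_next_block_sketch(struct extent_iterate_state *state,
                                           unsigned long *block)
      {
              struct extent *ext = state->current_extent;

              if (state->current_offset < ext->maximum) {
                      state->current_offset++;            /* same extent */
              } else if (ext->next) {
                      state->current_extent = ext->next;  /* next extent */
                      state->current_offset = ext->next->minimum;
              } else if (state->current_chain + 1 < state->num_chains) {
                      /* Move the head to the next device's chain. */
                      state->current_chain++;
                      state->current_extent =
                              state->chains[state->current_chain].first;
                      state->current_offset = state->current_extent->minimum;
              } else {
                      return -ENOSPC;                     /* out of allocated storage */
              }

              *block = state->current_offset;
              return 0;
      }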
  c) Modules

  One aim in designing TuxOnIce was to make it flexible. We wanted to allow for the implementation of different methods of transforming a page to be written to disk and different methods of getting the pages stored.

  In early versions (the betas and perhaps Suspend1), compression support was inlined in the image writing code, and the data structures and code for managing swap were intertwined with the rest of the code. A number of people had expressed interest in implementing image encryption, and alternative methods of storing the image.

  In order to achieve this, TuxOnIce was given a modular design.

  A module is a single file which encapsulates the functionality needed to transform a pageset of data (encryption or compression, for example), or to write the pageset to a device. The former type of module is called a 'page-transformer', the latter a 'writer'.

  Modules are linked together in pipeline fashion. There may be zero or more page transformers in a pipeline, and there is always exactly one writer. The pipeline follows this pattern:

              ---------------------------------
              |         TuxOnIce Core         |
              ---------------------------------
                              |
                              |
              ---------------------------------
              |      Page transformer 1       |
              ---------------------------------
                              |
                              |
              ---------------------------------
              |      Page transformer 2       |
              ---------------------------------
                              |
                              |
              ---------------------------------
              |            Writer             |
              ---------------------------------

  During the writing of an image, the core code feeds pages one at a time to the first module. This module performs whatever transformations it implements on the incoming data, completely consuming the incoming data and feeding output in a similar manner to the next module.

  All routines are SMP safe, and the final result of the transformations is written with an index (provided by the core) and size of the output by the writer. As a result, we can have multithreaded I/O without needing to worry about the sequence in which pages are written (or read).

  During reading, the pipeline works in the reverse direction. The core code calls the first module with the address of a buffer which should be filled. (Note that the buffer size is always PAGE_SIZE at this time.) This module will in turn request data from the next module and so on down until the writer is made to read from the stored image.

  Part of the definition of the structure of a module thus looks like this:

      int (*rw_init) (int rw, int stream_number);
      int (*rw_cleanup) (int rw);
      int (*write_chunk) (struct page *buffer_page);
      int (*read_chunk) (struct page *buffer_page, int sync);

  It should be noted that the _cleanup routine may be called before the full stream of data has been read or written. While writing the image, the user may (depending upon settings) choose to abort suspending, and if we are in the midst of writing the last portion of the image, a portion of the second pageset may be reread. This may also happen if an error occurs and we seek to abort the process of writing the image.

  The modular design is also useful in a number of other ways. It provides a means whereby we can add support for:

    - providing overall initialisation and cleanup routines;
    - serialising configuration information in the image header;
    - providing debugging information to the user;
    - determining memory and image storage requirements;
    - dis/enabling components at run-time;
    - configuring the module (see below);

  ...and routines for writers specific to their work:
    - Parsing a resume= location;
    - Determining whether an image exists;
    - Marking a resume as having been attempted;
    - Invalidating an image;

  Since some parts of the core - the user interface and storage manager support - have use for some of these functions, they are registered as 'miscellaneous' modules as well.
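  For illustration, a trivial pass-through page transformer might implement the hooks listed above roughly as follows. The function names, and the next_module_* helpers used to hand pages down the pipeline, are illustrative only and are not the actual TuxOnIce interfaces.

      static int passthrough_rw_init(int rw, int stream_number)
      {
              return 0;       /* nothing to set up for a pass-through module */
      }

      static int passthrough_rw_cleanup(int rw)
      {
              return 0;       /* may be called before the stream completes */
      }

      static int passthrough_write_chunk(struct page *buffer_page)
      {
              /* No transformation: feed the page to the next module as-is. */
              return next_module_write_chunk(buffer_page);
      }

      static int passthrough_read_chunk(struct page *buffer_page, int sync)
      {
              /* Ask the next module (ultimately the writer) to fill the page. */
              return next_module_read_chunk(buffer_page, sync);
      }

  A real transformer, such as the compression module, would buffer and transform data in its write_chunk routine and reverse the transformation in read_chunk.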
  d) Sysfs data structures.

  This brings us naturally to support for configuring TuxOnIce. We desired to provide a way to make TuxOnIce as flexible and configurable as possible. The user shouldn't have to reboot just because they now want to hibernate to a file instead of a partition, for example.

  To accomplish this, TuxOnIce implements a very generic means whereby the core and modules can register new sysfs entries. All TuxOnIce entries use a single _store and _show routine, both of which are found in tuxonice_sysfs.c in the kernel/power directory. These routines handle the most common operations - getting and setting the values of bits, integers, longs, unsigned longs and strings in one place, and allow overrides for customised get and set options as well as side-effect routines for all reads and writes.

  When combined with some simple macros, a new sysfs entry can then be defined in just a couple of lines:

      SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
                2048, 0, NULL),

  This defines a sysfs entry named "progress_granularity" which is rw and allows the user to access an integer stored at &progress_granularity, giving it a value between 1 and 2048 inclusive.

  Sysfs entries are registered under /sys/power/tuxonice, and entries for modules are located in a subdirectory named after the module.

--
cgit v1.2.3