TuxOnIce 4.0 Internal Documentation.
Updated to 23 March 2015
(Please note that incremental image support mentioned in this document is work
in progress. This document may need updating prior to the actual release of
4.0!)
1. Introduction.
TuxOnIce 4.0 is an addition to the Linux Kernel, designed to
allow the user to quickly shutdown and quickly boot a computer, without
needing to close documents or programs. It is equivalent to the
hibernate facility in some laptops. This implementation, however,
requires no special BIOS or hardware support.
The code in these files is based upon the original implementation
prepared by Gabor Kuti and additional work by Pavel Machek and a
host of others. This code has been substantially reworked by Nigel
Cunningham, again with the help and testing of many others, not the
least of whom are Bernard Blackham and Michael Frank. At its heart,
however, the operation is essentially the same as Gabor's version.
2. Overview of operation.
The basic sequence of operations is as follows:
a. Quiesce all other activity.
b. Ensure enough memory and storage space are available, and attempt
to free memory/storage if necessary.
c. Allocate the required memory and storage space.
d. Write the image.
e. Power down.
There are a number of complicating factors which mean that things are
not as simple as the above would imply, however...
o The activity of each process must be stopped at a point where it will
not be holding locks necessary for saving the image, or unexpectedly
restart operations due to something like a timeout and thereby make
our image inconsistent.
o It is desirable that we sync outstanding I/O to disk before calculating
image statistics. This reduces corruption if one should suspend but
then not resume, and also makes later parts of the operation safer (see
below).
o We need to get as close as we can to an atomic copy of the data.
Inconsistencies in the image will result in inconsistent memory contents at
resume time, and thus in instability of the system and/or file system
corruption. This would appear to imply a maximum image size of one half of
the amount of RAM, but we have a solution... (again, below).
o In 2.6 and later, we choose to play nicely with the other suspend-to-disk
implementations.
3. Detailed description of internals.
a. Quiescing activity.
Safely quiescing the system is achieved using three separate but related
aspects.
First, we use the vanilla kernel's support for freezing processes. This code
is based on the observation that the vast majority of processes don't need
to run during suspend. They can be 'frozen'. The kernel therefore
implements a refrigerator routine, which processes enter and in which they
remain until the cycle is complete. Processes enter the refrigerator via
try_to_freeze() invocations at appropriate places. A process cannot be
frozen in any old place. It must not be holding locks that will be needed
for writing the image or freezing other processes. For this reason,
userspace processes generally enter the refrigerator via the signal
handling code, and kernel threads at the place in their event loops where
they drop locks and yield to other processes or sleep. The task of freezing
processes is complicated by the fact that there can be interdependencies
between processes. Freezing process A before process B may mean that
process B cannot be frozen, because it stops waiting for process A
rather than in the refrigerator. This issue is seen where userspace waits
on freezeable kernel threads or fuse filesystem threads. To address this
issue, we implement the following algorithm for quiescing activity:
- Freeze filesystems (including fuse - userspace programs starting
new requests are immediately frozen; programs already running
requests complete their work before being frozen in the next
step)
- Freeze userspace
- Thaw filesystems (this is safe now that userspace is frozen and no
fuse requests are outstanding).
- Invoke sys_sync (noop on fuse).
- Freeze filesystems
- Freeze kernel threads
If we need to free memory, we thaw kernel threads and filesystems, but not
userspace. We can then free caches without worrying about deadlocks due to
swap files being on frozen filesystems or such like.
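As a rough sketch, the ordering above might look like the following. This is
illustrative only: the helper names are stand-ins for the facilities just
described (the vanilla kernel's freezer and the filesystem freezing support),
not the exact TuxOnIce symbols, and the stubs merely print the step they
represent.

#include <stdio.h>

static int freeze_filesystems(void)    { puts("freeze filesystems"); return 0; }
static void thaw_filesystems(void)     { puts("thaw filesystems"); }
static int freeze_userspace(void)      { puts("freeze userspace"); return 0; }
static int freeze_kernel_threads(void) { puts("freeze kernel threads"); return 0; }
static void sync_filesystems(void)     { puts("sys_sync (noop on fuse)"); }

static int toi_quiesce(void)
{
        if (freeze_filesystems())       /* new fuse requests block at once */
                return -1;
        if (freeze_userspace())         /* userspace stops in the refrigerator */
                return -1;
        thaw_filesystems();             /* safe: no fuse requests outstanding */
        sync_filesystems();             /* flush dirty data before imaging */
        if (freeze_filesystems())       /* refreeze for the rest of the cycle */
                return -1;
        return freeze_kernel_threads();
}

int main(void)
{
        return toi_quiesce();
}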
b. Ensure enough memory & storage are available.
We have a number of constraints to meet in order to be able to successfully
suspend and resume.
First, the image will be written in two parts, described below. One of
these parts needs to have an atomic copy made, which of course implies a
maximum size of one half of the amount of system memory. The other part
('pageset') is not atomically copied, and can therefore be as large or
small as desired.
Second, we have constraints on the amount of storage available. In these
calculations, we may also consider any compression that will be done. The
cryptoapi module allows the user to configure an expected compression ratio.
Third, the user can specify an arbitrary limit on the image size, in
megabytes. This limit is treated as a soft limit, so that we don't fail the
attempt to suspend if we cannot meet this constraint.
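To make the second and third constraints concrete, a simplified check might
look like the following. The function name, the representation of the expected
compression ratio as a percentage, and the treatment of the soft limit are
assumptions for illustration only.

#include <stdbool.h>
#include <stdio.h>

/* Does the allocated storage suffice for the image, once the user-configured
 * expected compression ratio is taken into account? */
static bool enough_storage(unsigned long image_pages,
                           unsigned long storage_pages,
                           unsigned int expected_ratio_pct,
                           unsigned long soft_limit_pages)
{
        /* Estimated pages after compression, e.g. 50% => half the size. */
        unsigned long needed = image_pages * expected_ratio_pct / 100;

        if (soft_limit_pages && needed > soft_limit_pages)
                printf("Soft image size limit exceeded; continuing anyway.\n");

        return needed <= storage_pages;
}

int main(void)
{
        /* 300000 pages of image, 200000 pages of storage, expecting ~55%. */
        printf("fits: %d\n", enough_storage(300000, 200000, 55, 0));
        return 0;
}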
c. Allocate the required memory and storage space.
Having done the initial freeze, we determine whether the above constraints
are met, and seek to allocate the metadata for the image. If the constraints
are not met, or we fail to allocate the required space for the metadata, we
seek to free the amount of memory that we calculate is needed and try again.
We allow up to four iterations of this loop before aborting the cycle. If
we do fail, it should only be because of a bug in TuxOnIce's calculations
or the vanilla kernel code for freeing memory.
These steps are merged together in the prepare_image function, found in
prepare_image.c. The functions are merged because of the cyclical nature
of the problem of calculating how much memory and storage is needed. Since
the data structures containing the information about the image must
themselves take memory and use storage, the amount of memory and storage
required changes as we prepare the image. Since the changes are not large,
only one or two iterations will be required to achieve a solution.
The recursive nature of the algorithm is minimised by keeping user space
frozen while preparing the image, and by the fact that our records of which
pages are to be saved and which pageset they are saved in use bitmaps (so
that changes in number or fragmentation of the pages to be saved don't
feedback via changes in the amount of memory needed for metadata). The
recursiveness is thus limited to any extra slab pages allocated to store the
extents that record storage used, and the effects of seeking to free memory.
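A minimal sketch of that iteration, with hypothetical stand-ins for the real
calculations, might look like this:

#include <stdbool.h>

static bool constraints_met(void)           { return true; }
static bool allocate_metadata(void)         { return true; }
static unsigned long shortfall_pages(void)  { return 0; }
static void try_to_free_pages_for_image(unsigned long pages) { (void)pages; }

static int toi_prepare_image(void)
{
        int tries;

        for (tries = 0; tries < 4; tries++) {
                if (constraints_met() && allocate_metadata())
                        return 0;               /* ready to write the image */
                /* In the real code, kernel threads and filesystems (but not
                 * userspace) are thawed here before memory is freed. */
                try_to_free_pages_for_image(shortfall_pages());
        }
        return -1;      /* should only happen on a calculation bug */
}

int main(void)
{
        return toi_prepare_image();
}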
d. Write the image.
We previously mentioned the need to create an atomic copy of the data, and
the half-of-memory limitation that is implied in this. This limitation is
circumvented by dividing the memory to be saved into two parts, called
pagesets.
Pageset2 contains most of the page cache - the pages on the active and
inactive LRU lists that aren't needed or modified while TuxOnIce is
running, so they can be safely written without an atomic copy. They are
therefore saved first and reloaded last. While saving these pages,
TuxOnIce carefully ensures that the work of writing the pages doesn't make
the image inconsistent. With the support for Kernel (Video) Mode Setting
going into the kernel at the time of writing, we need to check for pages
on the LRU that are used by KMS, and exclude them from pageset2. They are
atomically copied as part of pageset 1.
Once pageset2 has been saved, we prepare to do the atomic copy of remaining
memory. As part of the preparation, we power down drivers, thereby providing
them with the opportunity to have their state recorded in the image. The
amount of memory allocated by drivers for this is usually negligible, but if
DRI is in use, video drivers may require significant amounts. Ideally we
would be able to query drivers while preparing the image as to the amount of
memory they will need. Unfortunately no such mechanism exists at the time of
writing. For this reason, TuxOnIce allows the user to set an
'extra_pages_allowance', which is used to seek to ensure sufficient memory
is available for drivers at this point. TuxOnIce also lets the user set this
value to 0. In this case, a test driver suspend is done while preparing the
image, and the difference (plus a margin) used instead. TuxOnIce will also
automatically restart the hibernation process (twice at most) if it finds
that the extra pages allowance is not sufficient. It will then use what was
actually needed (plus a margin, again). Failure to hibernate should thus
be an extremely rare occurrence.
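A sketch of how the extra pages allowance might be derived and adjusted
follows. The helper, the margin value and the accounting are illustrative
assumptions, not the real TuxOnIce arithmetic.

#include <stdio.h>

#define EXTRA_PAGES_MARGIN 256          /* assumed safety margin, in pages */

/* Assumed cost of a test driver suspend, measured while preparing the image. */
static unsigned long test_driver_suspend_cost(void) { return 1024; }

static unsigned long effective_allowance(unsigned long user_setting)
{
        /* 0 means "measure it": do a test driver suspend and use the
         * measured cost plus a margin. */
        if (!user_setting)
                return test_driver_suspend_cost() + EXTRA_PAGES_MARGIN;
        return user_setting;
}

int main(void)
{
        unsigned long allowance = effective_allowance(0);
        unsigned long actually_needed = 2000;   /* pretend it wasn't enough */
        int restarts = 0;

        /* Restart hibernation (twice at most) with what was really needed. */
        while (actually_needed > allowance && restarts < 2) {
                allowance = actually_needed + EXTRA_PAGES_MARGIN;
                restarts++;
        }
        printf("allowance: %lu after %d restart(s)\n", allowance, restarts);
        return 0;
}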
Having suspended the drivers, we save the CPU context before making an
atomic copy of pageset1, resuming the drivers and saving the atomic copy.
After saving the two pagesets, we just need to save our metadata before
powering down.
As we mentioned earlier, the contents of pageset2 pages aren't needed once
they've been saved. We therefore use them as the destination of our atomic
copy. In the unlikely event that pageset1 is larger, extra pages are
allocated while the image is being prepared. This is normally only a real
possibility when the system has just been booted and the page cache is
small.
This is where we need to be careful about syncing, however. Pageset2 will
probably contain filesystem meta data. If this is overwritten with pageset1
and then a sync occurs, the filesystem will be corrupted - at least until
resume time and another sync of the restored data. Since there is a
possibility that the user might not resume or (may it never be!) that
TuxOnIce might oops, we do our utmost to avoid syncing filesystems after
copying pageset1.
e. Incremental images
TuxOnIce 4.0 introduces a new incremental image mode which changes things a
little. When incremental images are enabled, we save a 'normal' image the
first time we hibernate. On resume, however, we do not free the image or
the associated storage. Instead, it is retained until the next attempt at
hibernating and a mechanism is enabled which is used to track which pages
of memory are modified between the two cycles. The modified pages can then
be added to the existing image, rather than unmodified pages being saved
again unnecessarily.
Incremental image support is available in 64 bit Linux only, due to the
requirement for extra page flags.
This support is accomplished in the following way:
1) Tracking of pages.
The tracking of changed pages is accomplished using the page fault
mechanism. When we reach a point at which we want to start tracking
changes, most pages are marked read-only and also flagged as being
read-only because of this support. Since this cannot happen for every page
of RAM, some are marked as untracked and always treated as modified when
preparing an incremental image. When a process attempts to modify a page
that is marked read-only in this way, a page fault occurs, with TuxOnIce
code marking the page writable and dirty before allowing the write to
continue. In this way, the effect of incremental images on performance is
minimised - a page only causes a fault once. Small modifications to the
page allocator further reduce the number of faults that occur - free pages
are not tracked; they are made writable and marked as dirty as part of
being allocated.
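A simplified sketch of this fault-driven tracking is shown below. The flag
bits, structure and function names are assumptions made for the sake of a
compact, self-contained example; the real code works on the kernel's page
flags and page tables.

#include <stdbool.h>

enum { TOI_RO = 1, TOI_DIRTY = 2, TOI_UNTRACKED = 4 };

struct tracked_page {
        unsigned int flags;
};

/* Called from the fault path when a write hits a page that was made
 * read-only purely for tracking purposes. */
static bool toi_incremental_fault(struct tracked_page *p)
{
        if (!(p->flags & TOI_RO))
                return false;           /* not ours; normal fault handling */
        p->flags &= ~TOI_RO;
        p->flags |= TOI_DIRTY;          /* will be re-saved next cycle */
        /* The real code also clears the hardware write protection here. */
        return true;                    /* allow the write to continue */
}

/* Freshly allocated pages skip the fault entirely: they are handed out
 * writable and already marked dirty. */
static void toi_track_alloc(struct tracked_page *p)
{
        p->flags = TOI_DIRTY;
}

int main(void)
{
        struct tracked_page p = { .flags = TOI_RO };

        toi_incremental_fault(&p);
        toi_track_alloc(&p);
        return 0;
}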
2) Saving the incremental image / atomicity.
The page fault mechanism is also used to improve the means by which
atomicity of the image is achieved. When it is time to do an atomic copy,
the flags for pages are reset, with the result being that it is no longer
necessary for us to do an atomic copy of pageset1. Instead, we normally write
the uncopied pages to disk. When an attempt is made to modify a page that
has not yet been saved, the page-fault mechanism makes a copy of the page
prior to allowing the write. This copy is then written to disk. Likewise,
on resume, if a process attempts to write to a page that has been read
while the rest of the image is still being loaded, a copy of that page is
made prior to the write being allowed. At the end of loading the image,
modified pages can thus be restored to their 'atomic copy' contents prior
to restarting normal operation. We also mark pages that are yet to be read
as invalid PFNs, so that we can capture as a bug any attempt by a
half-restored kernel to access a page that hasn't yet been reloaded.
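The copy-before-write behaviour can be sketched as follows. Again, the types
and names are illustrative assumptions rather than the real implementation,
which operates on kernel pages and page table protections.

#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096

struct cow_page {
        unsigned char data[PAGE_SIZE];
        unsigned char *atomic_copy;     /* NULL until the first write */
        bool saved_or_restored;         /* already handled by the image code? */
};

/* Write fault on a page the image code has not yet dealt with: preserve its
 * current contents before the write is allowed to proceed. */
static void toi_cow_fault(struct cow_page *p)
{
        if (p->saved_or_restored || p->atomic_copy)
                return;                 /* nothing to preserve */
        p->atomic_copy = malloc(PAGE_SIZE);
        if (p->atomic_copy)
                memcpy(p->atomic_copy, p->data, PAGE_SIZE);
        /* While saving: the copy is what gets written to disk.
         * While resuming: the copy is restored once loading completes. */
}

int main(void)
{
        struct cow_page p = { .atomic_copy = NULL, .saved_or_restored = false };

        toi_cow_fault(&p);
        free(p.atomic_copy);
        return 0;
}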
f. Power down.
Powering down uses standard kernel routines. TuxOnIce supports powering down
using the ACPI S3, S4 and S5 methods or the kernel's non-ACPI power-off.
Supporting suspend to ram (S3) as a power off option might sound strange,
but it allows the user to quickly get their system up and running again if
the battery doesn't run out (we just need to re-read the overwritten pages)
and if the battery does run out (or the user removes power), they can still
resume.
4. Data Structures.
TuxOnIce uses three main structures to store its metadata and configuration
information:
a) Pageflags bitmaps.
TuxOnIce records which pages will be in pageset1, pageset2, the destination
of the atomic copy and the source of the atomically restored image using
bitmaps. The code used is that written for swsusp, with small improvements
to match TuxOnIce's requirements.
The pageset1 bitmap is thus easily stored in the image header for use at
resume time.
As mentioned above, using bitmaps also means that the amount of memory and
storage required for recording the above information is constant. This
greatly simplifies the work of preparing the image. In earlier versions of
TuxOnIce, extents were used to record which pages would be stored. In that
case, however, eating memory could result in greater fragmentation of the
lists of pages, which in turn required more memory to store the extents and
more storage in the image header. These could in turn require further
freeing of memory, and another iteration. All of this complexity is removed
by having bitmaps.
Bitmaps also make a lot of sense because TuxOnIce only ever iterates
through the lists. There is therefore no cost to not being able to find the
nth page in order 0 time. We only need to worry about the cost of finding
the n+1th page, given the location of the nth page. Bitwise optimisations
help here.
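For illustration, finding the (n+1)th saved page given the nth is just a scan
for the next set bit, which can skip empty words wholesale:

#include <limits.h>
#include <string.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Return the index of the next set bit at or after 'start', or 'nbits'. */
static unsigned long next_set_bit(const unsigned long *map,
                                  unsigned long nbits, unsigned long start)
{
        while (start < nbits) {
                unsigned long word = map[start / BITS_PER_LONG];
                unsigned long bit = start % BITS_PER_LONG;

                if (word & (1UL << bit))
                        return start;
                /* Skip the rest of an empty word in one step. */
                if (!(word >> bit))
                        start = (start / BITS_PER_LONG + 1) * BITS_PER_LONG;
                else
                        start++;
        }
        return nbits;
}

int main(void)
{
        unsigned long map[2];

        memset(map, 0, sizeof(map));
        map[1] = 1UL << 3;      /* page BITS_PER_LONG + 3 is to be saved */
        return next_set_bit(map, 2 * BITS_PER_LONG, 0) ==
               BITS_PER_LONG + 3 ? 0 : 1;
}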
b) Extents for block data.
TuxOnIce supports writing the image to multiple block devices. In the case
of swap, multiple partitions and/or files may be in use, and we happily use
them all (with the exception of compcache pages, which we allocate but do
not use). This use of multiple block devices is accomplished as follows:
Whatever the actual source of the allocated storage, the destination of the
image can be viewed in terms of one or more block devices, and on each
device, a list of sectors. To simplify matters, we only use contiguous,
PAGE_SIZE aligned sectors, like the swap code does.
Since sector numbers on each bdev may well not start at 0, it makes much
more sense to use extents here. Contiguous ranges of pages can thus be
represented in the extents by contiguous values.
Variations in block size are taken account of in transforming this data
into the parameters for bio submission.
We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this lower layer.
The lower layer remembers where in the sequence of devices and blocks each
pageset starts. The header always starts at the beginning of the allocated
storage.
So extents are:
struct extent {
      unsigned long minimum, maximum;
      struct extent *next;
};
These are combined into chains of extents for a device:
struct extent_chain {
      int size; /* size of the extent ie sum (max-min+1) */
      int allocs, frees;
      char *name;
      struct extent *first, *last_touched;
};
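As a sketch of how contiguous storage collapses into a single extent, consider
the following. The structures are simplified copies of the definitions above
(allocs, frees and name omitted) and the function name is illustrative.

#include <stdlib.h>

struct extent {
        unsigned long minimum, maximum;
        struct extent *next;
};

struct extent_chain {
        int size;                       /* sum of (max - min + 1) */
        struct extent *first, *last_touched;
};

/* Append one value (a page-sized block) to a chain. A value adjacent to the
 * last extent just bumps its maximum; anything else starts a new extent. */
static int toi_extent_chain_add(struct extent_chain *chain, unsigned long value)
{
        struct extent *last = chain->last_touched;
        struct extent *e;

        if (last && value == last->maximum + 1) {
                last->maximum = value;  /* contiguous: extend in place */
        } else {
                e = malloc(sizeof(*e));
                if (!e)
                        return -1;
                e->minimum = e->maximum = value;
                e->next = NULL;
                if (last)
                        last->next = e;
                else
                        chain->first = e;
                chain->last_touched = e;
        }
        chain->size++;
        return 0;
}

int main(void)
{
        struct extent_chain chain = { 0, NULL, NULL };

        /* Three contiguous blocks are recorded as the single extent [10,12]. */
        toi_extent_chain_add(&chain, 10);
        toi_extent_chain_add(&chain, 11);
        toi_extent_chain_add(&chain, 12);
        return (chain.first == chain.last_touched && chain.size == 3) ? 0 : 1;
}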
For each bdev, we need to store a little more info (simplified definition):
struct toi_bdev_info {
      struct block_device *bdev;
      char uuid[17];
      dev_t dev_t;
      int bmap_shift;
      int blocks_per_page;
};
The uuid is the main means used to identify the device in the storage
image. This means we can cope with the dev_t representation of a device
changing between saving the image and restoring it, as may happen on some
BIOSes or in the LVM case.
bmap_shift and blocks_per_page apply the effects of variations in blocks
per page settings for the filesystem and underlying bdev. For most
filesystems, these are the same, but for xfs, they can have independent
values.
Combining these two structures together, we have everything we need to
record what devices and what blocks on each device are being used to
store the image, and to submit i/o using bio_submit.
The last elements in the picture are a means of recording how the storage
is being used.
We do this first and foremost by implementing a layer of abstraction on
top of the devices and extent chains which allows us to view however many
devices there might be as one long storage tape, with a single 'head' that
tracks a 'current position' on the tape:
struct extent_iterate_state {
      struct extent_chain *chains;
      int num_chains;
      int current_chain;
      struct extent *current_extent;
      unsigned long current_offset;
};
That is, *chains points to an array of size num_chains of extent chains.
For the filewriter, this is always a single chain. For the swapwriter, the
array is of size MAX_SWAPFILES.
current_chain, current_extent and current_offset thus point to the current
index in the chains array (and into a matching array of struct
toi_bdev_info), the current extent in that chain (to optimise access),
and the current value in the offset.
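A sketch of stepping the 'head' to the next allocated block, using simplified
copies of the types above, might look like this (the function name is an
assumption):

struct extent {
        unsigned long minimum, maximum;
        struct extent *next;
};

struct extent_chain {
        struct extent *first;
};

struct extent_iterate_state {
        struct extent_chain *chains;
        int num_chains;
        int current_chain;
        struct extent *current_extent;
        unsigned long current_offset;
};

/* Advance to the next allocated block: stay in the current extent while we
 * can, then move to the next extent, then to the next non-empty chain.
 * Returns 0, or -1 once the storage 'tape' is exhausted. */
static int toi_next_block(struct extent_iterate_state *state)
{
        if (state->current_extent &&
            state->current_offset < state->current_extent->maximum) {
                state->current_offset++;
                return 0;
        }

        if (state->current_extent && state->current_extent->next) {
                state->current_extent = state->current_extent->next;
        } else {
                do {
                        if (++state->current_chain >= state->num_chains)
                                return -1;
                        state->current_extent =
                                state->chains[state->current_chain].first;
                } while (!state->current_extent);
        }

        state->current_offset = state->current_extent->minimum;
        return 0;
}

int main(void)
{
        struct extent e = { 5, 6, (struct extent *)0 };
        struct extent_chain c = { &e };
        struct extent_iterate_state s = { &c, 1, 0, &e, 5 };

        /* One step within the extent, then the tape runs out. */
        return (toi_next_block(&s) == 0 && s.current_offset == 6 &&
                toi_next_block(&s) == -1) ? 0 : 1;
}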
The image is divided into three parts:
- The header
- Pageset 1
- Pageset 2
The header always starts at the first device and first block. We know its
size before we begin to save the image because we carefully account for
everything that will be stored in it.
The second pageset (LRU) is stored first. It begins on the next page after
the end of the header.
The first pageset is stored second. Its start location is only known once
pageset2 has been saved, since pageset2 may be compressed as it is written.
This location is thus recorded at the end of saving pageset2. It is also
page aligned.
Since this information is needed at resume time, and the location of extents
in memory will differ at resume time, this needs to be stored in a portable
way:
struct extent_iterate_saved_state {
      int chain_num;
      int extent_num;
      unsigned long offset;
};
We can thus implement a layer of abstraction wherein the core of TuxOnIce
doesn't have to worry about which device we're currently writing to or
where in the device we are. It simply requests that the next page in the
pageset or header be written, leaving the details to this layer, and
invokes the routines to remember and restore the position, without having
to worry about the details of how the data is arranged on disk or such like.
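For illustration, converting between the in-memory iterator and the portable
saved form amounts to replacing the extent pointer with its index within the
chain. The function names below are assumptions; the types are simplified
copies of those above.

struct extent { unsigned long minimum, maximum; struct extent *next; };
struct extent_chain { struct extent *first; };

struct extent_iterate_state {
        struct extent_chain *chains;
        int num_chains;
        int current_chain;
        struct extent *current_extent;
        unsigned long current_offset;
};

struct extent_iterate_saved_state {
        int chain_num;
        int extent_num;
        unsigned long offset;
};

/* Pointers are meaningless at resume time, so record the extent's index
 * within its chain instead. */
static void toi_save_position(const struct extent_iterate_state *state,
                              struct extent_iterate_saved_state *saved)
{
        const struct extent *e = state->chains[state->current_chain].first;

        saved->chain_num = state->current_chain;
        saved->extent_num = 0;
        while (e && e != state->current_extent) {
                e = e->next;
                saved->extent_num++;
        }
        saved->offset = state->current_offset;
}

static void toi_restore_position(struct extent_iterate_state *state,
                                 const struct extent_iterate_saved_state *saved)
{
        int i;

        state->current_chain = saved->chain_num;
        state->current_extent = state->chains[saved->chain_num].first;
        for (i = 0; i < saved->extent_num; i++)
                state->current_extent = state->current_extent->next;
        state->current_offset = saved->offset;
}

int main(void)
{
        struct extent e2 = { 20, 25, (struct extent *)0 };
        struct extent e1 = { 5, 6, &e2 };
        struct extent_chain c = { &e1 };
        struct extent_iterate_state s = { &c, 1, 0, &e2, 22 };
        struct extent_iterate_saved_state saved;

        toi_save_position(&s, &saved);
        toi_restore_position(&s, &saved);
        return (s.current_extent == &e2 && s.current_offset == 22) ? 0 : 1;
}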
c) Modules
One aim in designing TuxOnIce was to make it flexible. We wanted to allow
for the implementation of different methods of transforming a page to be
written to disk and different methods of getting the pages stored.
In early versions (the betas and perhaps Suspend1), compression support was
inlined in the image writing code, and the data structures and code for
managing swap were intertwined with the rest of the code. A number of people
had expressed interest in implementing image encryption, and alternative
methods of storing the image.
In order to achieve this, TuxOnIce was given a modular design.
A module is a single file which encapsulates the functionality needed
to transform a pageset of data (encryption or compression, for example),
or to write the pageset to a device. The former type of module is called
a 'page-transformer', the latter a 'writer'.
Modules are linked together in pipeline fashion. There may be zero or more
page transformers in a pipeline, and there is always exactly one writer.
The pipeline follows this pattern:
---------------------------------
|         TuxOnIce Core         |
---------------------------------
                |
                |
---------------------------------
|      Page transformer 1       |
---------------------------------
                |
                |
---------------------------------
|      Page transformer 2       |
---------------------------------
                |
                |
---------------------------------
|            Writer             |
---------------------------------
During the writing of an image, the core code feeds pages one at a time
to the first module. This module performs whatever transformations it
implements on the incoming data, completely consuming the incoming data and
feeding output in a similar manner to the next module.
All routines are SMP safe, and the final result of the transformations is
written with an index (provided by the core) and size of the output by the
writer. As a result, we can have multithreaded I/O without needing to
worry about the sequence in which pages are written (or read).
During reading, the pipeline works in the reverse direction. The core code
calls the first module with the address of a buffer which should be filled.
(Note that the buffer size is always PAGE_SIZE at this time). This module
will in turn request data from the next module and so on down until the
writer is made to read from the stored image.
Part of definition of the structure of a module thus looks like this:
      int (*rw_init) (int rw, int stream_number);
      int (*rw_cleanup) (int rw);
      int (*write_chunk) (struct page *buffer_page);
      int (*read_chunk) (struct page *buffer_page, int sync);
It should be noted that the _cleanup routine may be called before the
full stream of data has been read or written. While writing the image,
the user may (depending upon settings) choose to abort suspending, and
if we are in the midst of writing the last portion of the image, a portion
of the second pageset may be reread. This may also happen if an error
occurs and we seek to abort the process of writing the image.
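A minimal pass-through page transformer, sketched against a cut-down version
of the interface above, might look like the following. The ops structure shown
here and the next_module hook are simplifications and assumptions about how
the pipeline links modules together, not the real TuxOnIce definitions.

struct page;                            /* opaque, as in the kernel */

struct toi_module_ops {
        int (*rw_init) (int rw, int stream_number);
        int (*rw_cleanup) (int rw);
        int (*write_chunk) (struct page *buffer_page);
        int (*read_chunk) (struct page *buffer_page, int sync);
};

static struct toi_module_ops *next_module;      /* next stage in the pipeline */

static int passthru_rw_init(int rw, int stream_number)
{
        (void)rw; (void)stream_number;
        return 0;                       /* nothing to set up */
}

static int passthru_rw_cleanup(int rw)
{
        (void)rw;                       /* may be called before the stream ends */
        return 0;
}

static int passthru_write_chunk(struct page *buffer_page)
{
        /* A real transformer would compress or encrypt here, consuming the
         * input and feeding its output onward; we forward the page untouched. */
        return next_module->write_chunk(buffer_page);
}

static int passthru_read_chunk(struct page *buffer_page, int sync)
{
        /* Reverse direction: ask the next module to fill the buffer. */
        return next_module->read_chunk(buffer_page, sync);
}

static struct toi_module_ops passthru_ops = {
        .rw_init     = passthru_rw_init,
        .rw_cleanup  = passthru_rw_cleanup,
        .write_chunk = passthru_write_chunk,
        .read_chunk  = passthru_read_chunk,
};

int main(void)
{
        /* Exercise the trivial setup/teardown paths only. */
        return passthru_ops.rw_init(1, 0) || passthru_ops.rw_cleanup(1);
}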
The modular design is also useful in a number of other ways. It provides
a means where by we can add support for:
- providing overall initialisation and cleanup routines;
- serialising configuration information in the image header;
- providing debugging information to the user;
- determining memory and image storage requirements;
- dis/enabling components at run-time;
- configuring the module (see below);
...and routines for writers specific to their work:
- Parsing a resume= location;
- Determining whether an image exists;
- Marking a resume as having been attempted;
- Invalidating an image;
Since some parts of the core - the user interface and storage manager
support - have use for some of these functions, they are registered as
'miscellaneous' modules as well.
d) Sysfs data structures.
This brings us naturally to support for configuring TuxOnIce. We desired to
provide a way to make TuxOnIce as flexible and configurable as possible.
The user shouldn't have to reboot just because they now want to hibernate to
a file instead of a partition, for example.
To accomplish this, TuxOnIce implements a very generic means whereby the
core and modules can register new sysfs entries. All TuxOnIce entries use
a single _store and _show routine, both of which are found in
tuxonice_sysfs.c in the kernel/power directory. These routines handle the
most common operations - getting and setting the values of bits, integers,
longs, unsigned longs and strings in one place, and allow overrides for
customised get and set options as well as side-effect routines for all
reads and writes.
When combined with some simple macros, a new sysfs entry can then be defined
in just a couple of lines:
SYSFS_INT("progress_granularity", SYSFS_RW, &progress_granularity, 1,
          2048, 0, NULL),
This defines a sysfs entry named "progress_granularity" which is rw and
allows the user to access an integer stored at &progress_granularity, giving
it a value between 1 and 2048 inclusive.
Sysfs entries are registered under /sys/power/tuxonice, and entries for
modules are located in a subdirectory named after the module.