summaryrefslogtreecommitdiff
path: root/Documentation/device-mapper
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation/device-mapper')
-rw-r--r--Documentation/device-mapper/cache-policies.txt73
-rw-r--r--Documentation/device-mapper/dm-flakey.txt2
-rw-r--r--Documentation/device-mapper/dm-raid.txt59
-rw-r--r--Documentation/device-mapper/log-writes.txt10
-rw-r--r--Documentation/device-mapper/statistics.txt2
5 files changed, 81 insertions, 65 deletions
diff --git a/Documentation/device-mapper/cache-policies.txt b/Documentation/device-mapper/cache-policies.txt
index d9246a32e..d3ca8af21 100644
--- a/Documentation/device-mapper/cache-policies.txt
+++ b/Documentation/device-mapper/cache-policies.txt
@@ -11,7 +11,7 @@ Every bio that is mapped by the target is referred to the policy.
The policy can return a simple HIT or MISS or issue a migration.
Currently there's no way for the policy to issue background work,
-e.g. to start writing back dirty blocks that are going to be evicte
+e.g. to start writing back dirty blocks that are going to be evicted
soon.
Because we map bios, rather than requests it's easy for the policy
@@ -28,51 +28,16 @@ Overview of supplied cache replacement policies
multiqueue (mq)
---------------
-This policy has been deprecated in favor of the smq policy (see below).
+This policy is now an alias for smq (see below).
-The multiqueue policy has three sets of 16 queues: one set for entries
-waiting for the cache and another two for those in the cache (a set for
-clean entries and a set for dirty entries).
+The following tunables are accepted, but have no effect:
-Cache entries in the queues are aged based on logical time. Entry into
-the cache is based on variable thresholds and queue selection is based
-on hit count on entry. The policy aims to take different cache miss
-costs into account and to adjust to varying load patterns automatically.
-
-Message and constructor argument pairs are:
'sequential_threshold <#nr_sequential_ios>'
'random_threshold <#nr_random_ios>'
'read_promote_adjustment <value>'
'write_promote_adjustment <value>'
'discard_promote_adjustment <value>'
-The sequential threshold indicates the number of contiguous I/Os
-required before a stream is treated as sequential. Once a stream is
-considered sequential it will bypass the cache. The random threshold
-is the number of intervening non-contiguous I/Os that must be seen
-before the stream is treated as random again.
-
-The sequential and random thresholds default to 512 and 4 respectively.
-
-Large, sequential I/Os are probably better left on the origin device
-since spindles tend to have good sequential I/O bandwidth. The
-io_tracker counts contiguous I/Os to try to spot when the I/O is in one
-of these sequential modes. But there are use-cases for wanting to
-promote sequential blocks to the cache (e.g. fast application startup).
-If sequential threshold is set to 0 the sequential I/O detection is
-disabled and sequential I/O will no longer implicitly bypass the cache.
-Setting the random threshold to 0 does _not_ disable the random I/O
-stream detection.
-
-Internally the mq policy determines a promotion threshold. If the hit
-count of a block not in the cache goes above this threshold it gets
-promoted to the cache. The read, write and discard promote adjustment
-tunables allow you to tweak the promotion threshold by adding a small
-value based on the io type. They default to 4, 8 and 1 respectively.
-If you're trying to quickly warm a new cache device you may wish to
-reduce these to encourage promotion. Remember to switch them back to
-their defaults after the cache fills though.
-
Stochastic multiqueue (smq)
---------------------------
@@ -83,7 +48,7 @@ with the multiqueue (mq) policy.
The smq policy (vs mq) offers the promise of less memory utilization,
improved performance and increased adaptability in the face of changing
-workloads. SMQ also does not have any cumbersome tuning knobs.
+workloads. smq also does not have any cumbersome tuning knobs.
Users may switch from "mq" to "smq" simply by appropriately reloading a
DM table that is using the cache target. Doing so will cause all of the
@@ -92,47 +57,45 @@ degrade slightly until smq recalculates the origin device's hotspots
that should be cached.
Memory usage:
-The mq policy uses a lot of memory; 88 bytes per cache block on a 64
+The mq policy used a lot of memory; 88 bytes per cache block on a 64
bit machine.
-SMQ uses 28bit indexes to implement it's data structures rather than
+smq uses 28bit indexes to implement it's data structures rather than
pointers. It avoids storing an explicit hit count for each block. It
-has a 'hotspot' queue rather than a pre cache which uses a quarter of
+has a 'hotspot' queue, rather than a pre-cache, which uses a quarter of
the entries (each hotspot block covers a larger area than a single
cache block).
-All these mean smq uses ~25bytes per cache block. Still a lot of
+All this means smq uses ~25bytes per cache block. Still a lot of
memory, but a substantial improvement nontheless.
Level balancing:
-MQ places entries in different levels of the multiqueue structures
-based on their hit count (~ln(hit count)). This means the bottom
-levels generally have the most entries, and the top ones have very
-few. Having unbalanced levels like this reduces the efficacy of the
+mq placed entries in different levels of the multiqueue structures
+based on their hit count (~ln(hit count)). This meant the bottom
+levels generally had the most entries, and the top ones had very
+few. Having unbalanced levels like this reduced the efficacy of the
multiqueue.
-SMQ does not maintain a hit count, instead it swaps hit entries with
-the least recently used entry from the level above. The over all
+smq does not maintain a hit count, instead it swaps hit entries with
+the least recently used entry from the level above. The overall
ordering being a side effect of this stochastic process. With this
scheme we can decide how many entries occupy each multiqueue level,
resulting in better promotion/demotion decisions.
Adaptability:
-The MQ policy maintains a hit count for each cache block. For a
+The mq policy maintained a hit count for each cache block. For a
different block to get promoted to the cache it's hit count has to
-exceed the lowest currently in the cache. This means it can take a
+exceed the lowest currently in the cache. This meant it could take a
long time for the cache to adapt between varying IO patterns.
-Periodically degrading the hit counts could help with this, but I
-haven't found a nice general solution.
-SMQ doesn't maintain hit counts, so a lot of this problem just goes
+smq doesn't maintain hit counts, so a lot of this problem just goes
away. In addition it tracks performance of the hotspot queue, which
is used to decide which blocks to promote. If the hotspot queue is
performing badly then it starts moving entries more quickly between
levels. This lets it adapt to new IO patterns very quickly.
Performance:
-Testing SMQ shows substantially better performance than MQ.
+Testing smq shows substantially better performance than mq.
cleaner
-------
diff --git a/Documentation/device-mapper/dm-flakey.txt b/Documentation/device-mapper/dm-flakey.txt
index 6ff5c2327..c43030718 100644
--- a/Documentation/device-mapper/dm-flakey.txt
+++ b/Documentation/device-mapper/dm-flakey.txt
@@ -42,7 +42,7 @@ Optional feature parameters:
<direction>: Either 'r' to corrupt reads or 'w' to corrupt writes.
'w' is incompatible with drop_writes.
<value>: The value (from 0-255) to write.
- <flags>: Perform the replacement only if bio->bi_rw has all the
+ <flags>: Perform the replacement only if bio->bi_opf has all the
selected flags set.
Examples:
diff --git a/Documentation/device-mapper/dm-raid.txt b/Documentation/device-mapper/dm-raid.txt
index df2d636b6..c75b64a85 100644
--- a/Documentation/device-mapper/dm-raid.txt
+++ b/Documentation/device-mapper/dm-raid.txt
@@ -14,8 +14,12 @@ The target is named "raid" and it accepts the following parameters:
<#raid_devs> <metadata_dev0> <dev0> [.. <metadata_devN> <devN>]
<raid_type>:
+ raid0 RAID0 striping (no resilience)
raid1 RAID1 mirroring
- raid4 RAID4 dedicated parity disk
+ raid4 RAID4 with dedicated last parity disk
+ raid5_n RAID5 with dedicated last parity disk suporting takeover
+ Same as raid4
+ -Transitory layout
raid5_la RAID5 left asymmetric
- rotating parity 0 with data continuation
raid5_ra RAID5 right asymmetric
@@ -30,7 +34,19 @@ The target is named "raid" and it accepts the following parameters:
- rotating parity N (right-to-left) with data restart
raid6_nc RAID6 N continue
- rotating parity N (right-to-left) with data continuation
+ raid6_n_6 RAID6 with dedicate parity disks
+ - parity and Q-syndrome on the last 2 disks;
+ laylout for takeover from/to raid4/raid5_n
+ raid6_la_6 Same as "raid_la" plus dedicated last Q-syndrome disk
+ - layout for takeover from raid5_la from/to raid6
+ raid6_ra_6 Same as "raid5_ra" dedicated last Q-syndrome disk
+ - layout for takeover from raid5_ra from/to raid6
+ raid6_ls_6 Same as "raid5_ls" dedicated last Q-syndrome disk
+ - layout for takeover from raid5_ls from/to raid6
+ raid6_rs_6 Same as "raid5_rs" dedicated last Q-syndrome disk
+ - layout for takeover from raid5_rs from/to raid6
raid10 Various RAID10 inspired algorithms chosen by additional params
+ (see raid10_format and raid10_copies below)
- RAID10: Striped Mirrors (aka 'Striping on top of mirrors')
- RAID1E: Integrated Adjacent Stripe Mirroring
- RAID1E: Integrated Offset Stripe Mirroring
@@ -116,10 +132,41 @@ The target is named "raid" and it accepts the following parameters:
Here we see layouts closely akin to 'RAID1E - Integrated
Offset Stripe Mirroring'.
+ [delta_disks <N>]
+ The delta_disks option value (-251 < N < +251) triggers
+ device removal (negative value) or device addition (positive
+ value) to any reshape supporting raid levels 4/5/6 and 10.
+ RAID levels 4/5/6 allow for addition of devices (metadata
+ and data device tupel), raid10_near and raid10_offset only
+ allow for device addtion. raid10_far does not support any
+ reshaping at all.
+ A minimum of devices have to be kept to enforce resilience,
+ which is 3 devices for raid4/5 and 4 devices for raid6.
+
+ [data_offset <sectors>]
+ This option value defines the offset into each data device
+ where the data starts. This is used to provide out-of-place
+ reshaping space to avoid writing over data whilst
+ changing the layout of stripes, hence an interruption/crash
+ may happen at any time without the risk of losing data.
+ E.g. when adding devices to an existing raid set during
+ forward reshaping, the out-of-place space will be allocated
+ at the beginning of each raid device. The kernel raid4/5/6/10
+ MD personalities supporting such device addition will read the data from
+ the existing first stripes (those with smaller number of stripes)
+ starting at data_offset to fill up a new stripe with the larger
+ number of stripes, calculate the redundancy blocks (CRC/Q-syndrome)
+ and write that new stripe to offset 0. Same will be applied to all
+ N-1 other new stripes. This out-of-place scheme is used to change
+ the RAID type (i.e. the allocation algorithm) as well, e.g.
+ changing from raid5_ls to raid5_n.
+
<#raid_devs>: The number of devices composing the array.
Each device consists of two entries. The first is the device
containing the metadata (if any); the second is the one containing the
- data.
+ data. A Maximum of 64 metadata/data device entries are supported
+ up to target version 1.8.0.
+ 1.9.0 supports up to 253 which is enforced by the used MD kernel runtime.
If a drive has failed or is missing at creation time, a '-' can be
given for both the metadata and data drives for a given position.
@@ -207,7 +254,6 @@ include:
"recover"- Initiate/continue a recover process.
"check" - Initiate a check (i.e. a "scrub") of the array.
"repair" - Initiate a repair of the array.
- "reshape"- Currently unsupported (-EINVAL).
Discard Support
@@ -257,3 +303,10 @@ Version History
1.5.2 'mismatch_cnt' is zero unless [last_]sync_action is "check".
1.6.0 Add discard support (and devices_handle_discard_safely module param).
1.7.0 Add support for MD RAID0 mappings.
+1.8.0 Explictely check for compatible flags in the superblock metadata
+ and reject to start the raid set if any are set by a newer
+ target version, thus avoiding data corruption on a raid set
+ with a reshape in progress.
+1.9.0 Add support for RAID level takeover/reshape/region size
+ and set size reduction.
+1.9.1 Fix activation of existing RAID 4/10 mapped devices
diff --git a/Documentation/device-mapper/log-writes.txt b/Documentation/device-mapper/log-writes.txt
index c10f30c9b..f4ebcbaf5 100644
--- a/Documentation/device-mapper/log-writes.txt
+++ b/Documentation/device-mapper/log-writes.txt
@@ -14,14 +14,14 @@ Log Ordering
We log things in order of completion once we are sure the write is no longer in
cache. This means that normal WRITE requests are not actually logged until the
-next REQ_FLUSH request. This is to make it easier for userspace to replay the
-log in a way that correlates to what is on disk and not what is in cache, to
-make it easier to detect improper waiting/flushing.
+next REQ_PREFLUSH request. This is to make it easier for userspace to replay
+the log in a way that correlates to what is on disk and not what is in cache,
+to make it easier to detect improper waiting/flushing.
This works by attaching all WRITE requests to a list once the write completes.
-Once we see a REQ_FLUSH request we splice this list onto the request and once
+Once we see a REQ_PREFLUSH request we splice this list onto the request and once
the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only
-completed WRITEs, at the time the REQ_FLUSH is issued, are added in order to
+completed WRITEs, at the time the REQ_PREFLUSH is issued, are added in order to
simulate the worst case scenario with regard to power failures. Consider the
following example (W means write, C means complete):
diff --git a/Documentation/device-mapper/statistics.txt b/Documentation/device-mapper/statistics.txt
index 6f5ef944c..170ac02a1 100644
--- a/Documentation/device-mapper/statistics.txt
+++ b/Documentation/device-mapper/statistics.txt
@@ -205,7 +205,7 @@ statistics on them:
dmsetup message vol 0 @stats_create - /100
-Set the auxillary data string to "foo bar baz" (the escape for each
+Set the auxiliary data string to "foo bar baz" (the escape for each
space must also be escaped, otherwise the shell will consume them):
dmsetup message vol 0 @stats_set_aux 0 foo\\ bar\\ baz