Diffstat (limited to 'Documentation/scheduler/sched-BFS.txt')
-rw-r--r--  Documentation/scheduler/sched-BFS.txt  216
1 file changed, 103 insertions, 113 deletions
diff --git a/Documentation/scheduler/sched-BFS.txt b/Documentation/scheduler/sched-BFS.txt
index 6470f30b0..c0282002a 100644
--- a/Documentation/scheduler/sched-BFS.txt
+++ b/Documentation/scheduler/sched-BFS.txt
@@ -13,14 +13,14 @@ one workload cause massive detriment to another.
Design summary.
-BFS is best described as a single runqueue, O(log n) insertion, O(1) lookup,
-earliest effective virtual deadline first design, loosely based on EEVDF
-(earliest eligible virtual deadline first) and my previous Staircase Deadline
-scheduler. Each component shall be described in order to understand the
-significance of, and reasoning for it. The codebase when the first stable
-version was released was approximately 9000 lines less code than the existing
-mainline linux kernel scheduler (in 2.6.31). This does not even take into
-account the removal of documentation and the cgroups code that is not used.
+BFS is best described as a single runqueue, O(n) lookup, earliest effective
+virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
+deadline first) and my previous Staircase Deadline scheduler. Each component
+shall be described in order to understand the significance of, and reasoning for
+it. The codebase when the first stable version was released was approximately
+9000 fewer lines of code than the existing mainline linux kernel scheduler (in
+2.6.31). This does not even take into account the removal of documentation and
+the cgroups code that is not used.
Design reasoning.
@@ -62,13 +62,12 @@ Design details.
Task insertion.
-BFS inserts tasks into each relevant queue as an O(log n) insertion into a
-customised skip list (as described by William Pugh). At the time of insertion,
-*every* running queue is checked to see if the newly queued task can run on any
-idle queue, or preempt the lowest running task on the system. This is how the
-cross-CPU scheduling of BFS achieves significantly lower latency per extra CPU
-the system has. In this case the lookup is, in the worst case scenario, O(k)
-where k is the number of online CPUs on the system.
+BFS inserts tasks into each relevant queue as an O(1) insertion into a doubly
+linked list. On insertion, *every* running queue is checked to see if the newly
+queued task can run on any idle queue, or preempt the lowest running task on the
+system. This is how the cross-CPU scheduling of BFS achieves significantly lower
+latency per extra CPU the system has. In this case the lookup is, in the worst
+case scenario, O(n) where n is the number of CPUs on the system.
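
As an illustration only (the struct and function names below are invented for
the sketch and are not the BFS code), an O(1) tail insertion into a circular
doubly linked list looks roughly like this in C:

	/* Illustration only; heavily simplified, not the BFS implementation. */
	struct task_node {
		struct task_node *prev, *next;
	};

	struct prio_queue {
		struct task_node head;		/* circular list anchor */
	};

	static void queue_init(struct prio_queue *q)
	{
		q->head.prev = q->head.next = &q->head;
	}

	/* O(1): splice the task in just before the anchor, i.e. at the tail,
	 * regardless of how many tasks are already queued. */
	static void enqueue_task(struct prio_queue *q, struct task_node *t)
	{
		t->prev = q->head.prev;
		t->next = &q->head;
		q->head.prev->next = t;
		q->head.prev = t;
	}
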
Data protection.
@@ -93,7 +92,7 @@ the virtual deadline mechanism is explained.
Virtual deadline.
The key to achieving low latency, scheduling fairness, and "nice level"
-distribution in BFS is entirely in the virtual deadline mechanism. The related
+distribution in BFS is entirely in the virtual deadline mechanism. The one
tunable in BFS is the rr_interval, or "round robin interval". This is the
maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
tasks of the same nice level will be running for, or looking at it the other
@@ -118,7 +117,7 @@ higher priority than a currently running task on any cpu by virtue of the fact
that it has an earlier virtual deadline than the currently running task. The
earlier deadline is the key to which task is next chosen for the first and
second cases. Once a task is descheduled, it is put back on the queue, and an
-O(1) lookup of all queued-but-not-running tasks is done to determine which has
+O(n) lookup of all queued-but-not-running tasks is done to determine which has
the earliest deadline and that task is chosen to receive CPU next.
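
A rough sketch of the idea in C follows; the helper names, the constants and
the roughly 10% per nice level ratio are assumptions made for the example, not
values taken from the BFS source:

	/* Illustration only: a task's virtual deadline is "now" plus the
	 * round robin interval stretched by its nice level, so nicer tasks
	 * end up with later deadlines and are selected less often. */
	#define RR_INTERVAL_MS	6

	static unsigned long prio_ratio(int nice)
	{
		unsigned long ratio = 100;
		int i;

		for (i = -20; i < nice; i++)
			ratio = ratio * 11 / 10;	/* assumed ~10% per level */
		return ratio;
	}

	/* An earlier (smaller) value means the task is chosen sooner. */
	static unsigned long virtual_deadline(unsigned long now_ms, int nice)
	{
		return now_ms + RR_INTERVAL_MS * prio_ratio(nice) / 100;
	}
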
The CPU proportion of different nice tasks works out to be approximately the
@@ -135,40 +134,26 @@ Task lookup.
BFS has 103 priority queues. 100 of these are dedicated to the static priority
of realtime tasks, and the remaining 3 are, in order of best to worst priority,
-SCHED_ISO (isochronous), SCHED_NORMAL/SCHED_BATCH, and SCHED_IDLEPRIO (idle
-priority scheduling).
-
-When a task of these priorities is queued, it is added to the skiplist with a
-different sorting value according to the type of task. For realtime tasks and
-isochronous tasks, it is their static priority. For SCHED_NORMAL and
-SCHED_BATCH tasks it is their virtual deadline value. For SCHED_IDLEPRIO tasks
-it is their virtual deadline value offset by an impossibly large value to ensure
-they never go before normal tasks. When isochronous or idleprio tasks do not
-meet the conditions that allow them to run with their special scheduling they
-are queued as per the remainder of the SCHED_NORMAL tasks.
-
-Lookup is performed by selecting the very first entry in the "level 0" skiplist
-as it will always be the lowest priority task having been sorted while being
-entered into the skiplist. This is usually an O(1) operation, however if there
-are tasks with limited affinity set and they are not able to run on the current
-CPU, the next in the list is checked and so on.
-
-Thus, the lookup for the common case is O(1) and O(n) in the worst case when
-the system has nothing but selectively affined tasks that can never run on the
-current CPU.
-
-
-Task removal.
-
-Removal of tasks in the skip list is an O(k) operation where 0 <= k < 16,
-corresponding with the "levels" in the skip list. 16 was chosen as the upper
-limit in the skiplist as it guarantees O(log n) insertion for up to 64k
-currently active tasks and most systems do not usually allow more than 32k
-tasks, and 16 levels makes the skiplist lookup components fit in 2 cachelines.
-The skiplist level chosen when inserting a task is pseudo-random but a minor
-optimisation is used to limit the max level based on the absolute number of
-queued tasks since high levels afford no advantage at low numbers of queued
-tasks yet increase overhead.
+SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
+scheduling). When a task of these priorities is queued, a bitmap of running
+priorities is set showing which of these priorities has tasks waiting for CPU
+time. When a CPU is made to reschedule, the lookup for the next task to get
+CPU time is performed in the following way:
+
+First the bitmap is checked to see what static priority tasks are queued. If
+any realtime priorities are found, the corresponding queue is checked and the
+first task listed there is taken (provided CPU affinity is suitable) and lookup
+is complete. If the priority corresponds to SCHED_ISO tasks, they are also
+taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds
+to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
+stage, every task in the runlist that corresponds to that priority is checked
+to see which has the earliest set deadline, and (provided it has suitable CPU
+affinity) it is taken off the runqueue and given the CPU. If a task has an
+expired deadline, it is taken and the rest of the lookup aborted (as they are
+chosen in FIFO order).
+
+Thus, the lookup is O(n) in the worst case only, where n is as described
+earlier, as tasks may be chosen before the whole task list is looked over.
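
A condensed sketch of that decision in C is shown below; the types, names and
queue index are invented for the illustration and do not match the BFS source:

	/* Illustration only.  Realtime and ISO priorities are taken FIFO;
	 * SCHED_NORMAL and SCHED_IDLEPRIO need an O(n) scan for the earliest
	 * virtual deadline among the tasks that may run on this CPU. */
	#define PRIO_NORMAL	101	/* assumed queue index for the sketch */

	struct sk_task {
		struct sk_task *next;
		unsigned long deadline;
		unsigned long cpus_allowed;	/* affinity bitmask */
	};

	static struct sk_task *pick_next(struct sk_task *queue, int prio,
					 unsigned long this_cpu_mask,
					 unsigned long now)
	{
		struct sk_task *t, *best = NULL;

		for (t = queue; t; t = t->next) {
			if (!(t->cpus_allowed & this_cpu_mask))
				continue;	/* cannot run on this CPU */
			if (prio < PRIO_NORMAL)
				return t;	/* realtime/ISO: plain FIFO */
			if (t->deadline <= now)
				return t;	/* expired deadline: take it, stop */
			if (!best || t->deadline < best->deadline)
				best = t;	/* track earliest deadline */
		}
		return best;
	}
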
Scalability.
@@ -191,17 +176,30 @@ when it has been deemed their overhead is so marginal that they're worth adding.
The first is the local copy of the running process' data to the CPU it's running
on to allow that data to be updated lockless where possible. Then there is
deference paid to the last CPU a task was running on, by trying that CPU first
-when looking for an idle CPU to use the next time it's scheduled.
-
-The real cost of migrating a task from one CPU to another is entirely dependant
-on the cache footprint of the task, how cache intensive the task is, how long
-it's been running on that CPU to take up the bulk of its cache, how big the CPU
-cache is, how fast and how layered the CPU cache is, how fast a context switch
-is... and so on. In other words, it's close to random in the real world where we
-do more than just one sole workload. The only thing we can be sure of is that
-it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
-utilising idle CPUs is more important than cache locality, and cache locality
-only plays a part after that.
+when looking for an idle CPU to use the next time it's scheduled. Finally there
+is the notion of cache locality beyond the last running CPU. The sched_domains
+information is used to determine the relative virtual "cache distance" that
+other CPUs have from the last CPU a task was running on. CPUs with shared
+caches, such as SMT siblings, or multicore CPUs with shared caches, are treated
+as cache local. CPUs without shared caches are treated as not cache local, and
+CPUs on different NUMA nodes are treated as very distant. This "relative cache
+distance" is used by modifying the virtual deadline value when doing lookups.
+Effectively, the deadline is unaltered between "cache local" CPUs, doubled for
+"cache distant" CPUs, and quadrupled for "very distant" CPUs. The reasoning
+behind the doubling of deadlines is as follows. The real cost of migrating a
+task from one CPU to another is entirely dependent on the cache footprint of
+the task, how cache intensive the task is, how long it's been running on that
+CPU to take up the bulk of its cache, how big the CPU cache is, how fast and
+how layered the CPU cache is, how fast a context switch is... and so on. In
+other words, it's close to random in the real world where we do more than just
+one sole workload. The only thing we can be sure of is that it's not free. So
+BFS uses the principle that an idle CPU is a wasted CPU and utilising idle CPUs
+is more important than cache locality, and cache locality only plays a part
+after that. Doubling the effective deadline is based on the premise that the
+"cache local" CPUs will tend to work on the same tasks up to double the number
+of cache local CPUs, and once the workload is beyond that amount, it is likely
+that none of the tasks are cache warm anywhere anyway. The quadrupling for NUMA
+is a value I pulled out of my arse.
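
A minimal sketch of that scaling follows; scaling the time remaining until the
deadline, rather than the raw value, is a simplification made here, and the
enum and function are not taken from the BFS source:

	/* Illustration only: the further away the looking CPU is, the "worse"
	 * the queued task's deadline appears, making migration less likely. */
	enum cache_distance {
		CACHE_LOCAL,		/* shared cache: SMT or multicore sibling */
		CACHE_DISTANT,		/* same NUMA node, no shared cache */
		CACHE_VERY_DISTANT	/* different NUMA node */
	};

	static unsigned long remote_deadline(unsigned long now,
					     unsigned long deadline,
					     enum cache_distance d)
	{
		unsigned long offset = deadline - now;

		if (d == CACHE_DISTANT)
			offset *= 2;		/* doubled */
		else if (d == CACHE_VERY_DISTANT)
			offset *= 4;		/* quadrupled */
		return now + offset;
	}
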
When choosing an idle CPU for a waking task, the cache locality is determined
according to where the task last ran and then idle CPUs are ranked from best
@@ -209,26 +207,31 @@ to worst to choose the most suitable idle CPU based on cache locality, NUMA
node locality and hyperthread sibling busyness. They are chosen in the
following preference (if idle):
- * Same thread, idle or busy cache, idle or busy threads
- * Other core, same cache, idle or busy cache, idle threads.
- * Same node, other CPU, idle cache, idle threads.
- * Same node, other CPU, busy cache, idle threads.
- * Other core, same cache, busy threads.
- * Same node, other CPU, busy threads.
- * Other node, other CPU, idle cache, idle threads.
- * Other node, other CPU, busy cache, idle threads.
- * Other node, other CPU, busy threads.
+* Same core, idle or busy cache, idle threads.
+* Other core, same cache, idle or busy cache, idle threads.
+* Same node, other CPU, idle cache, idle threads.
+* Same node, other CPU, busy cache, idle threads.
+* Same core, busy threads.
+* Other core, same cache, busy threads.
+* Same node, other CPU, busy threads.
+* Other node, other CPU, idle cache, idle threads.
+* Other node, other CPU, busy cache, idle threads.
+* Other node, other CPU, busy threads.
This shows the SMT or "hyperthread" awareness in the design as well, which will
choose a real idle core first before a logical SMT sibling which already has
-tasks on the physical CPU. Early benchmarking of BFS suggested scalability
-dropped off at the 16 CPU mark. However this benchmarking was performed on an
-earlier design that was far less scalable than the current one so it's hard to
-know how scalable it is in terms of number of CPUs (due to the global
-runqueue). Note that in terms of scalability, the number of _logical_ CPUs
-matters, not the number of _physical_ CPUs. Thus, a dual (2x) quad core (4X)
-hyperthreaded (2X) machine is effectively a 16X. Newer benchmark results are
-very promising indeed. Benchmark contributions are most welcome.
+tasks on the physical CPU.
+
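The preference order above can also be written down as a simple ranking; the
enumerator names below are invented for illustration and are not from the BFS
source:

	/* Illustration only: lower value == more preferred idle CPU for a
	 * waking task, mirroring the list above. */
	enum idle_cpu_rank {
		RANK_SAME_CORE_IDLE,		/* same core, idle threads */
		RANK_SHARED_CACHE_IDLE,		/* other core, same cache, idle threads */
		RANK_SAME_NODE_IDLE_CACHE,	/* same node, other CPU, idle cache */
		RANK_SAME_NODE_BUSY_CACHE,	/* same node, other CPU, busy cache */
		RANK_SAME_CORE_BUSY,		/* same core, busy threads */
		RANK_SHARED_CACHE_BUSY,		/* other core, same cache, busy threads */
		RANK_SAME_NODE_BUSY,		/* same node, other CPU, busy threads */
		RANK_OTHER_NODE_IDLE_CACHE,	/* other node, idle cache, idle threads */
		RANK_OTHER_NODE_BUSY_CACHE,	/* other node, busy cache, idle threads */
		RANK_OTHER_NODE_BUSY		/* other node, busy threads */
	};
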
+Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
+However, this benchmarking was performed on an earlier design that was far less
+scalable than the current one, so it's hard to know how scalable it is in terms
+of both CPUs (due to the global runqueue) and heavily loaded machines (due to
+O(n) lookup) at this stage. Note that in terms of scalability, the number of
+_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
+quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark
+results are very promising indeed, without needing to tweak any knobs, features
+or options. Benchmark contributions are most welcome.
+
Features
@@ -241,43 +244,30 @@ and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
support for CGROUPS. The average user should neither need to know what these
are, nor should they need to be using them to have good desktop behaviour.
-Rudimentary support for the CPU controller CGROUP in the form of filesystem
-stubs for the expected CGROUP structure to allow applications that demand their
-presence to work but they do not have any functionality.
-There are two "scheduler" tunables, the round robin interval and the
-interactive flag. These can be accessed in
+rr_interval
+
+There is only one "scheduler" tunable, the round robin interval. This can be
+accessed in
/proc/sys/kernel/rr_interval
- /proc/sys/kernel/interactive
-
-rr_interval value
-
-The value is in milliseconds, and the default value is set to 6ms. Valid values
-are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
-decreasing throughput, while increasing it will improve throughput, but at the
-cost of worsening latencies. The accuracy of the rr interval is limited by HZ
-resolution of the kernel configuration. Thus, the worst case latencies are
-usually slightly higher than this actual value. BFS uses "dithering" to try and
-minimise the effect the Hz limitation has. The default value of 6 is not an
-arbitrary one. It is based on the fact that humans can detect jitter at
-approximately 7ms, so aiming for much lower latencies is pointless under most
-circumstances. It is worth noting this fact when comparing the latency
-performance of BFS to other schedulers. Worst case latencies being higher than
-7ms are far worse than average latencies not being in the microsecond range.
-Experimentation has shown that rr intervals being increased up to 300 can
-improve throughput but beyond that, scheduling noise from elsewhere prevents
-further demonstrable throughput.
-
-interactive flag
-
-This is a simple boolean that can be set to 1 or 0, set to 1 by default. This
-sacrifices some of the interactive performance by giving tasks a degree of
-soft affinity for logical CPUs when it will lead to improved throughput, but
-enabling it also sacrifices the completely deterministic nature with respect
-to latency that BFS otherwise normally provides, and subsequently leads to
-slightly higher latencies and a noticeably less interactive system.
+The value is in milliseconds, and the default value is set to 6 on a
+uniprocessor machine, and automatically set to a progressively higher value on
+multiprocessor machines. The reasoning behind increasing the value on more CPUs
+is that the effective latency is decreased by virtue of there being more CPUs on
+BFS (for reasons explained above), and increasing the value allows for less
+cache contention and more throughput. Valid values are from 1 to 1000.
+Decreasing the value will decrease latencies at the cost of decreasing
+throughput, while increasing it will improve throughput, but at the cost of
+worsening latencies. The accuracy of the rr interval is limited by HZ resolution
+of the kernel configuration. Thus, the worst case latencies are usually slightly
+higher than this actual value. The default value of 6 is not an arbitrary one.
+It is based on the fact that humans can detect jitter at approximately 7ms, so
+aiming for much lower latencies is pointless under most circumstances. It is
+worth noting this fact when comparing the latency performance of BFS to other
+schedulers. Worst case latencies being higher than 7ms are far worse than
+average latencies not being in the microsecond range.
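
For example, a minimal userspace C program can read the current value like any
other sysctl file (this assumes a kernel running BFS, otherwise the file will
not exist):

	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/proc/sys/kernel/rr_interval", "r");
		int ms;

		if (!f || fscanf(f, "%d", &ms) != 1) {
			perror("rr_interval");
			return 1;
		}
		fclose(f);
		printf("rr_interval = %d ms\n", ms);
		return 0;
	}
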
Isochronous scheduling.
@@ -358,4 +348,4 @@ of total wall clock time taken and total work done, rather than the reported
"cpu usage".
-Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011
+Con Kolivas <kernel@kolivas.org> Fri Aug 27 2010