Diffstat (limited to 'Documentation/scheduler/sched-BFS.txt')
 -rw-r--r--  Documentation/scheduler/sched-BFS.txt | 100 +++++++++++++++++++++++++++++++++++++++++++++++-----------------------------------------------------
 1 file changed, 47 insertions(+), 53 deletions(-)
diff --git a/Documentation/scheduler/sched-BFS.txt b/Documentation/scheduler/sched-BFS.txt
index 77a1e37b7..c0282002a 100644
--- a/Documentation/scheduler/sched-BFS.txt
+++ b/Documentation/scheduler/sched-BFS.txt
@@ -92,7 +92,7 @@ the virtual deadline mechanism is explained.
 Virtual deadline.
 
 The key to achieving low latency, scheduling fairness, and "nice level"
-distribution in BFS is entirely in the virtual deadline mechanism. The related
+distribution in BFS is entirely in the virtual deadline mechanism. The one
 tunable in BFS is the rr_interval, or "round robin interval". This is the
 maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
 tasks of the same nice level will be running for, or looking at it the other
@@ -177,26 +177,29 @@ The first is the local copy of the running process' data to the CPU it's
 running on to allow that data to be updated lockless where possible. Then there
 is deference paid to the last CPU a task was running on, by trying that CPU
 first when looking for an idle CPU to use the next time it's scheduled. Finally there
-is the notion of "sticky" tasks that are flagged when they are involuntarily
-descheduled, meaning they still want further CPU time. This sticky flag is
-used to bias heavily against those tasks being scheduled on a different CPU
-unless that CPU would be otherwise idle. When a cpu frequency governor is used
-that scales with CPU load, such as ondemand, sticky tasks are not scheduled
-on a different CPU at all, preferring instead to go idle. This means the CPU
-they were bound to is more likely to increase its speed while the other CPU
-will go idle, thus speeding up total task execution time and likely decreasing
-power usage. This is the only scenario where BFS will allow a CPU to go idle
-in preference to scheduling a task on the earliest available spare CPU.
-
-The real cost of migrating a task from one CPU to another is entirely dependant
-on the cache footprint of the task, how cache intensive the task is, how long
-it's been running on that CPU to take up the bulk of its cache, how big the CPU
-cache is, how fast and how layered the CPU cache is, how fast a context switch
-is... and so on. In other words, it's close to random in the real world where we
-do more than just one sole workload. The only thing we can be sure of is that
-it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
-utilising idle CPUs is more important than cache locality, and cache locality
-only plays a part after that.
+is the notion of cache locality beyond the last running CPU. The sched_domains
+information is used to determine the relative virtual "cache distance" that
+other CPUs have from the last CPU a task was running on. CPUs with shared
+caches, such as SMT siblings, or multicore CPUs with shared caches, are treated
+as cache local. CPUs without shared caches are treated as not cache local, and
+CPUs on different NUMA nodes are treated as very distant. This "relative cache
+distance" is used by modifying the virtual deadline value when doing lookups.
+Effectively, the deadline is unaltered between "cache local" CPUs, doubled for
+"cache distant" CPUs, and quadrupled for "very distant" CPUs. The reasoning
+behind the doubling of deadlines is as follows. The real cost of migrating a
+task from one CPU to another is entirely dependant on the cache footprint of
+the task, how cache intensive the task is, how long it's been running on that
+CPU to take up the bulk of its cache, how big the CPU cache is, how fast and
+how layered the CPU cache is, how fast a context switch is... and so on. In
+other words, it's close to random in the real world where we do more than just
+one sole workload. The only thing we can be sure of is that it's not free. So
+BFS uses the principle that an idle CPU is a wasted CPU and utilising idle CPUs
+is more important than cache locality, and cache locality only plays a part
+after that. Doubling the effective deadline is based on the premise that the
+"cache local" CPUs will tend to work on the same tasks up to double the number
+of cache local CPUs, and once the workload is beyond that amount, it is likely
+that none of the tasks are cache warm anywhere anyway. The quadrupling for NUMA
+is a value I pulled out of my arse.
 
 When choosing an idle CPU for a waking task, the cache locality is determined
 according to where the task last ran and then idle CPUs are ranked from best
@@ -226,7 +229,8 @@ of both CPUs (due to the global runqueue) and heavily loaded machines (due to
 O(n) lookup) at this stage. Note that in terms of scalability, the number of
 _logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
 quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark
-results are very promising indeed. Benchmark contributions are most welcome.
+results are very promising indeed, without needing to tweak any knobs, features
+or options. Benchmark contributions are most welcome.
 
 
 Features
@@ -241,39 +245,29 @@ to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
 support for CGROUPS. The average user should neither need to know what these
 are, nor should they need to be using them to have good desktop behaviour.
 
-There are two "scheduler" tunables, the round robin interval and the
-interactive flag. These can be accessed in
+rr_interval
+
+There is only one "scheduler" tunable, the round robin interval. This can be
+accessed in
 
 	/proc/sys/kernel/rr_interval
-	/proc/sys/kernel/interactive
-
-rr_interval value
-
-The value is in milliseconds, and the default value is set to 6ms. Valid values
-are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
-decreasing throughput, while increasing it will improve throughput, but at the
-cost of worsening latencies. The accuracy of the rr interval is limited by HZ
-resolution of the kernel configuration. Thus, the worst case latencies are
-usually slightly higher than this actual value. BFS uses "dithering" to try and
-minimise the effect the Hz limitation has. The default value of 6 is not an
-arbitrary one. It is based on the fact that humans can detect jitter at
-approximately 7ms, so aiming for much lower latencies is pointless under most
-circumstances. It is worth noting this fact when comparing the latency
-performance of BFS to other schedulers. Worst case latencies being higher than
-7ms are far worse than average latencies not being in the microsecond range.
-Experimentation has shown that rr intervals being increased up to 300 can
-improve throughput but beyond that, scheduling noise from elsewhere prevents
-further demonstrable throughput.
-
-interactive flag
-
-This is a simple boolean that can be set to 1 or 0, set to 1 by default. This
-sacrifices some of the interactive performance by giving tasks a degree of
-soft affinity for logical CPUs when it will lead to improved throughput, but
-enabling it also sacrifices the completely deterministic nature with respect
-to latency that BFS otherwise normally provides, and subsequently leads to
-slightly higher latencies and a noticeably less interactive system.
+The value is in milliseconds, and the default value is set to 6 on a
+uniprocessor machine, and automatically set to a progressively higher value on
+multiprocessor machines. The reasoning behind increasing the value on more CPUs
+is that the effective latency is decreased by virtue of there being more CPUs on
+BFS (for reasons explained above), and increasing the value allows for less
+cache contention and more throughput. Valid values are from 1 to 1000
+Decreasing the value will decrease latencies at the cost of decreasing
+throughput, while increasing it will improve throughput, but at the cost of
+worsening latencies. The accuracy of the rr interval is limited by HZ resolution
+of the kernel configuration. Thus, the worst case latencies are usually slightly
+higher than this actual value. The default value of 6 is not an arbitrary one.
+It is based on the fact that humans can detect jitter at approximately 7ms, so
+aiming for much lower latencies is pointless under most circumstances. It is
+worth noting this fact when comparing the latency performance of BFS to other
+schedulers. Worst case latencies being higher than 7ms are far worse than
+average latencies not being in the microsecond range.
 
 
 Isochronous scheduling.
 
@@ -354,4 +348,4 @@
 of total wall clock time taken and total work done, rather than the reported
 "cpu usage".
 
-Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011
+Con Kolivas <kernel@kolivas.org> Fri Aug 27 2010
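The cache-distance biasing added in the second hunk maps to a small amount of
code. Below is a minimal user-space C sketch of that lookup-time deadline
modification, not the actual kernel implementation: the enum, the function
names and the millisecond timebase are all illustrative, and the deadline is
treated as a relative offset rather than the absolute timestamp BFS stores.

	/* Sketch: bias a task's virtual deadline by "cache distance"
	 * during a runqueue lookup, per the documentation above:
	 * unaltered for cache-local CPUs, doubled for cache-distant
	 * ones, quadrupled across NUMA nodes. */
	#include <stdio.h>

	enum cache_distance {
		CACHE_LOCAL,		/* shared cache: SMT sibling or multicore */
		CACHE_DISTANT,		/* no shared cache, same NUMA node */
		CACHE_VERY_DISTANT,	/* different NUMA node */
	};

	static unsigned long effective_deadline(unsigned long deadline,
						enum cache_distance dist)
	{
		switch (dist) {
		case CACHE_DISTANT:
			return deadline << 1;	/* doubled */
		case CACHE_VERY_DISTANT:
			return deadline << 2;	/* quadrupled */
		default:
			return deadline;	/* cache local: unaltered */
		}
	}

	int main(void)
	{
		unsigned long dl = 6;	/* e.g. a deadline 6ms away */

		printf("local %lu, distant %lu, NUMA %lu\n",
		       effective_deadline(dl, CACHE_LOCAL),
		       effective_deadline(dl, CACHE_DISTANT),
		       effective_deadline(dl, CACHE_VERY_DISTANT));
		return 0;
	}

Note that a remote CPU still sees the task, just with a later effective
deadline, so an otherwise idle CPU will still pick it up; the scaling only
makes cache-local CPUs win close contests, consistent with the document's
"an idle CPU is a wasted CPU" principle.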
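The new rr_interval text says the 6ms default is raised "progressively" on
multiprocessor machines but does not give the formula. The sketch below
assumes, purely for illustration, a rule of adding 6ms per doubling of CPUs,
clamped to the documented valid range of 1 to 1000; the real patch may scale
differently.

	/* Sketch: derive a default round robin interval from the CPU
	 * count. The 6ms base follows the documentation (humans detect
	 * jitter at ~7ms); the scaling rule itself is assumed. */
	#include <stdio.h>

	static int default_rr_interval(int ncpus)
	{
		int rr = 6;	/* documented uniprocessor default, in ms */

		while (ncpus > 1) {	/* +6ms per doubling of CPUs */
			rr += 6;
			ncpus >>= 1;
		}
		return rr > 1000 ? 1000 : rr;	/* valid range is 1..1000 */
	}

	int main(void)
	{
		for (int cpus = 1; cpus <= 16; cpus *= 2)
			printf("%2d CPUs -> rr_interval = %d ms\n",
			       cpus, default_rr_interval(cpus));
		return 0;
	}

Whatever value the kernel chooses, it can be inspected or overridden at
runtime through the documented sysctl, for example with
"cat /proc/sys/kernel/rr_interval".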
