diff options
Diffstat (limited to 'Documentation/scheduler/sched-BFS.txt')
-rw-r--r-- | Documentation/scheduler/sched-BFS.txt | 347 |
1 files changed, 0 insertions, 347 deletions
diff --git a/Documentation/scheduler/sched-BFS.txt b/Documentation/scheduler/sched-BFS.txt deleted file mode 100644 index c10d95601..000000000 --- a/Documentation/scheduler/sched-BFS.txt +++ /dev/null @@ -1,347 +0,0 @@ -BFS - The Brain Fuck Scheduler by Con Kolivas. - -Goals. - -The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to -completely do away with the complex designs of the past for the cpu process -scheduler and instead implement one that is very simple in basic design. -The main focus of BFS is to achieve excellent desktop interactivity and -responsiveness without heuristics and tuning knobs that are difficult to -understand, impossible to model and predict the effect of, and when tuned to -one workload cause massive detriment to another. - - -Design summary. - -BFS is best described as a single runqueue, O(n) lookup, earliest effective -virtual deadline first design, loosely based on EEVDF (earliest eligible virtual -deadline first) and my previous Staircase Deadline scheduler. Each component -shall be described in order to understand the significance of, and reasoning for -it. The codebase when the first stable version was released was approximately -9000 lines less code than the existing mainline linux kernel scheduler (in -2.6.31). This does not even take into account the removal of documentation and -the cgroups code that is not used. - -Design reasoning. - -The single runqueue refers to the queued but not running processes for the -entire system, regardless of the number of CPUs. The reason for going back to -a single runqueue design is that once multiple runqueues are introduced, -per-CPU or otherwise, there will be complex interactions as each runqueue will -be responsible for the scheduling latency and fairness of the tasks only on its -own runqueue, and to achieve fairness and low latency across multiple CPUs, any -advantage in throughput of having CPU local tasks causes other disadvantages. -This is due to requiring a very complex balancing system to at best achieve some -semblance of fairness across CPUs and can only maintain relatively low latency -for tasks bound to the same CPUs, not across them. To increase said fairness -and latency across CPUs, the advantage of local runqueue locking, which makes -for better scalability, is lost due to having to grab multiple locks. - -A significant feature of BFS is that all accounting is done purely based on CPU -used and nowhere is sleep time used in any way to determine entitlement or -interactivity. Interactivity "estimators" that use some kind of sleep/run -algorithm are doomed to fail to detect all interactive tasks, and to falsely tag -tasks that aren't interactive as being so. The reason for this is that it is -close to impossible to determine that when a task is sleeping, whether it is -doing it voluntarily, as in a userspace application waiting for input in the -form of a mouse click or otherwise, or involuntarily, because it is waiting for -another thread, process, I/O, kernel activity or whatever. Thus, such an -estimator will introduce corner cases, and more heuristics will be required to -cope with those corner cases, introducing more corner cases and failed -interactivity detection and so on. Interactivity in BFS is built into the design -by virtue of the fact that tasks that are waking up have not used up their quota -of CPU time, and have earlier effective deadlines, thereby making it very likely -they will preempt any CPU bound task of equivalent nice level. See below for -more information on the virtual deadline mechanism. Even if they do not preempt -a running task, because the rr interval is guaranteed to have a bound upper -limit on how long a task will wait for, it will be scheduled within a timeframe -that will not cause visible interface jitter. - - -Design details. - -Task insertion. - -BFS inserts tasks into each relevant queue as an O(1) insertion into a double -linked list. On insertion, *every* running queue is checked to see if the newly -queued task can run on any idle queue, or preempt the lowest running task on the -system. This is how the cross-CPU scheduling of BFS achieves significantly lower -latency per extra CPU the system has. In this case the lookup is, in the worst -case scenario, O(n) where n is the number of CPUs on the system. - -Data protection. - -BFS has one single lock protecting the process local data of every task in the -global queue. Thus every insertion, removal and modification of task data in the -global runqueue needs to grab the global lock. However, once a task is taken by -a CPU, the CPU has its own local data copy of the running process' accounting -information which only that CPU accesses and modifies (such as during a -timer tick) thus allowing the accounting data to be updated lockless. Once a -CPU has taken a task to run, it removes it from the global queue. Thus the -global queue only ever has, at most, - - (number of tasks requesting cpu time) - (number of logical CPUs) + 1 - -tasks in the global queue. This value is relevant for the time taken to look up -tasks during scheduling. This will increase if many tasks with CPU affinity set -in their policy to limit which CPUs they're allowed to run on if they outnumber -the number of CPUs. The +1 is because when rescheduling a task, the CPU's -currently running task is put back on the queue. Lookup will be described after -the virtual deadline mechanism is explained. - -Virtual deadline. - -The key to achieving low latency, scheduling fairness, and "nice level" -distribution in BFS is entirely in the virtual deadline mechanism. The one -tunable in BFS is the rr_interval, or "round robin interval". This is the -maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy) -tasks of the same nice level will be running for, or looking at it the other -way around, the longest duration two tasks of the same nice level will be -delayed for. When a task requests cpu time, it is given a quota (time_slice) -equal to the rr_interval and a virtual deadline. The virtual deadline is -offset from the current time in jiffies by this equation: - - jiffies + (prio_ratio * rr_interval) - -The prio_ratio is determined as a ratio compared to the baseline of nice -20 -and increases by 10% per nice level. The deadline is a virtual one only in that -no guarantee is placed that a task will actually be scheduled by this time, but -it is used to compare which task should go next. There are three components to -how a task is next chosen. First is time_slice expiration. If a task runs out -of its time_slice, it is descheduled, the time_slice is refilled, and the -deadline reset to that formula above. Second is sleep, where a task no longer -is requesting CPU for whatever reason. The time_slice and deadline are _not_ -adjusted in this case and are just carried over for when the task is next -scheduled. Third is preemption, and that is when a newly waking task is deemed -higher priority than a currently running task on any cpu by virtue of the fact -that it has an earlier virtual deadline than the currently running task. The -earlier deadline is the key to which task is next chosen for the first and -second cases. Once a task is descheduled, it is put back on the queue, and an -O(n) lookup of all queued-but-not-running tasks is done to determine which has -the earliest deadline and that task is chosen to receive CPU next. - -The CPU proportion of different nice tasks works out to be approximately the - - (prio_ratio difference)^2 - -The reason it is squared is that a task's deadline does not change while it is -running unless it runs out of time_slice. Thus, even if the time actually -passes the deadline of another task that is queued, it will not get CPU time -unless the current running task deschedules, and the time "base" (jiffies) is -constantly moving. - -Task lookup. - -BFS has 103 priority queues. 100 of these are dedicated to the static priority -of realtime tasks, and the remaining 3 are, in order of best to worst priority, -SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority -scheduling). When a task of these priorities is queued, a bitmap of running -priorities is set showing which of these priorities has tasks waiting for CPU -time. When a CPU is made to reschedule, the lookup for the next task to get -CPU time is performed in the following way: - -First the bitmap is checked to see what static priority tasks are queued. If -any realtime priorities are found, the corresponding queue is checked and the -first task listed there is taken (provided CPU affinity is suitable) and lookup -is complete. If the priority corresponds to a SCHED_ISO task, they are also -taken in FIFO order (as they behave like SCHED_RR). If the priority corresponds -to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this -stage, every task in the runlist that corresponds to that priority is checked -to see which has the earliest set deadline, and (provided it has suitable CPU -affinity) it is taken off the runqueue and given the CPU. If a task has an -expired deadline, it is taken and the rest of the lookup aborted (as they are -chosen in FIFO order). - -Thus, the lookup is O(n) in the worst case only, where n is as described -earlier, as tasks may be chosen before the whole task list is looked over. - - -Scalability. - -The major limitations of BFS will be that of scalability, as the separate -runqueue designs will have less lock contention as the number of CPUs rises. -However they do not scale linearly even with separate runqueues as multiple -runqueues will need to be locked concurrently on such designs to be able to -achieve fair CPU balancing, to try and achieve some sort of nice-level fairness -across CPUs, and to achieve low enough latency for tasks on a busy CPU when -other CPUs would be more suited. BFS has the advantage that it requires no -balancing algorithm whatsoever, as balancing occurs by proxy simply because -all CPUs draw off the global runqueue, in priority and deadline order. Despite -the fact that scalability is _not_ the prime concern of BFS, it both shows very -good scalability to smaller numbers of CPUs and is likely a more scalable design -at these numbers of CPUs. - -It also has some very low overhead scalability features built into the design -when it has been deemed their overhead is so marginal that they're worth adding. -The first is the local copy of the running process' data to the CPU it's running -on to allow that data to be updated lockless where possible. Then there is -deference paid to the last CPU a task was running on, by trying that CPU first -when looking for an idle CPU to use the next time it's scheduled. Finally there -is the notion of "sticky" tasks that are flagged when they are involuntarily -descheduled, meaning they still want further CPU time. This sticky flag is -used to bias heavily against those tasks being scheduled on a different CPU -unless that CPU would be otherwise idle. When a cpu frequency governor is used -that scales with CPU load, such as ondemand, sticky tasks are not scheduled -on a different CPU at all, preferring instead to go idle. This means the CPU -they were bound to is more likely to increase its speed while the other CPU -will go idle, thus speeding up total task execution time and likely decreasing -power usage. This is the only scenario where BFS will allow a CPU to go idle -in preference to scheduling a task on the earliest available spare CPU. - -The real cost of migrating a task from one CPU to another is entirely dependant -on the cache footprint of the task, how cache intensive the task is, how long -it's been running on that CPU to take up the bulk of its cache, how big the CPU -cache is, how fast and how layered the CPU cache is, how fast a context switch -is... and so on. In other words, it's close to random in the real world where we -do more than just one sole workload. The only thing we can be sure of is that -it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and -utilising idle CPUs is more important than cache locality, and cache locality -only plays a part after that. - -When choosing an idle CPU for a waking task, the cache locality is determined -according to where the task last ran and then idle CPUs are ranked from best -to worst to choose the most suitable idle CPU based on cache locality, NUMA -node locality and hyperthread sibling business. They are chosen in the -following preference (if idle): - -* Same core, idle or busy cache, idle threads -* Other core, same cache, idle or busy cache, idle threads. -* Same node, other CPU, idle cache, idle threads. -* Same node, other CPU, busy cache, idle threads. -* Same core, busy threads. -* Other core, same cache, busy threads. -* Same node, other CPU, busy threads. -* Other node, other CPU, idle cache, idle threads. -* Other node, other CPU, busy cache, idle threads. -* Other node, other CPU, busy threads. - -This shows the SMT or "hyperthread" awareness in the design as well which will -choose a real idle core first before a logical SMT sibling which already has -tasks on the physical CPU. - -Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark. -However this benchmarking was performed on an earlier design that was far less -scalable than the current one so it's hard to know how scalable it is in terms -of both CPUs (due to the global runqueue) and heavily loaded machines (due to -O(n) lookup) at this stage. Note that in terms of scalability, the number of -_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x) -quad core (4X) hyperthreaded (2X) machine is effectively a 16X. Newer benchmark -results are very promising indeed, without needing to tweak any knobs, features -or options. Benchmark contributions are most welcome. - - -Features - -As the initial prime target audience for BFS was the average desktop user, it -was designed to not need tweaking, tuning or have features set to obtain benefit -from it. Thus the number of knobs and features has been kept to an absolute -minimum and should not require extra user input for the vast majority of cases. -There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval -and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition -to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is -support for CGROUPS. The average user should neither need to know what these -are, nor should they need to be using them to have good desktop behaviour. - -rr_interval - -There is only one "scheduler" tunable, the round robin interval. This can be -accessed in - - /proc/sys/kernel/rr_interval - -The value is in milliseconds, and the default value is set to 6ms. Valid values -are from 1 to 1000. Decreasing the value will decrease latencies at the cost of -decreasing throughput, while increasing it will improve throughput, but at the -cost of worsening latencies. The accuracy of the rr interval is limited by HZ -resolution of the kernel configuration. Thus, the worst case latencies are -usually slightly higher than this actual value. BFS uses "dithering" to try and -minimise the effect the Hz limitation has. The default value of 6 is not an -arbitrary one. It is based on the fact that humans can detect jitter at -approximately 7ms, so aiming for much lower latencies is pointless under most -circumstances. It is worth noting this fact when comparing the latency -performance of BFS to other schedulers. Worst case latencies being higher than -7ms are far worse than average latencies not being in the microsecond range. -Experimentation has shown that rr intervals being increased up to 300 can -improve throughput but beyond that, scheduling noise from elsewhere prevents -further demonstrable throughput. - -Isochronous scheduling. - -Isochronous scheduling is a unique scheduling policy designed to provide -near-real-time performance to unprivileged (ie non-root) users without the -ability to starve the machine indefinitely. Isochronous tasks (which means -"same time") are set using, for example, the schedtool application like so: - - schedtool -I -e amarok - -This will start the audio application "amarok" as SCHED_ISO. How SCHED_ISO works -is that it has a priority level between true realtime tasks and SCHED_NORMAL -which would allow them to preempt all normal tasks, in a SCHED_RR fashion (ie, -if multiple SCHED_ISO tasks are running, they purely round robin at rr_interval -rate). However if ISO tasks run for more than a tunable finite amount of time, -they are then demoted back to SCHED_NORMAL scheduling. This finite amount of -time is the percentage of _total CPU_ available across the machine, configurable -as a percentage in the following "resource handling" tunable (as opposed to a -scheduler tunable): - - /proc/sys/kernel/iso_cpu - -and is set to 70% by default. It is calculated over a rolling 5 second average -Because it is the total CPU available, it means that on a multi CPU machine, it -is possible to have an ISO task running as realtime scheduling indefinitely on -just one CPU, as the other CPUs will be available. Setting this to 100 is the -equivalent of giving all users SCHED_RR access and setting it to 0 removes the -ability to run any pseudo-realtime tasks. - -A feature of BFS is that it detects when an application tries to obtain a -realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the -appropriate privileges to use those policies. When it detects this, it will -give the task SCHED_ISO policy instead. Thus it is transparent to the user. -Because some applications constantly set their policy as well as their nice -level, there is potential for them to undo the override specified by the user -on the command line of setting the policy to SCHED_ISO. To counter this, once -a task has been set to SCHED_ISO policy, it needs superuser privileges to set -it back to SCHED_NORMAL. This will ensure the task remains ISO and all child -processes and threads will also inherit the ISO policy. - -Idleprio scheduling. - -Idleprio scheduling is a scheduling policy designed to give out CPU to a task -_only_ when the CPU would be otherwise idle. The idea behind this is to allow -ultra low priority tasks to be run in the background that have virtually no -effect on the foreground tasks. This is ideally suited to distributed computing -clients (like setiathome, folding, mprime etc) but can also be used to start -a video encode or so on without any slowdown of other tasks. To avoid this -policy from grabbing shared resources and holding them indefinitely, if it -detects a state where the task is waiting on I/O, the machine is about to -suspend to ram and so on, it will transiently schedule them as SCHED_NORMAL. As -per the Isochronous task management, once a task has been scheduled as IDLEPRIO, -it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can -be set to start as SCHED_IDLEPRIO with the schedtool command like so: - - schedtool -D -e ./mprime - -Subtick accounting. - -It is surprisingly difficult to get accurate CPU accounting, and in many cases, -the accounting is done by simply determining what is happening at the precise -moment a timer tick fires off. This becomes increasingly inaccurate as the -timer tick frequency (HZ) is lowered. It is possible to create an application -which uses almost 100% CPU, yet by being descheduled at the right time, records -zero CPU usage. While the main problem with this is that there are possible -security implications, it is also difficult to determine how much CPU a task -really does use. BFS tries to use the sub-tick accounting from the TSC clock, -where possible, to determine real CPU usage. This is not entirely reliable, but -is far more likely to produce accurate CPU usage data than the existing designs -and will not show tasks as consuming no CPU usage when they actually are. Thus, -the amount of CPU reported as being used by BFS will more accurately represent -how much CPU the task itself is using (as is shown for example by the 'time' -application), so the reported values may be quite different to other schedulers. -Values reported as the 'load' are more prone to problems with this design, but -per process values are closer to real usage. When comparing throughput of BFS -to other designs, it is important to compare the actual completed work in terms -of total wall clock time taken and total work done, rather than the reported -"cpu usage". - - -Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011 |