author     André Fabian Silva Delgado <emulatorman@parabola.nu>  2015-09-08 01:01:14 -0300
committer  André Fabian Silva Delgado <emulatorman@parabola.nu>  2015-09-08 01:01:14 -0300
commit     e5fd91f1ef340da553f7a79da9540c3db711c937 (patch)
tree       b11842027dc6641da63f4bcc524f8678263304a3 /Documentation/scheduler
parent     2a9b0348e685a63d97486f6749622b61e9e3292f (diff)
Linux-libre 4.2-gnu
Diffstat (limited to 'Documentation/scheduler')
-rw-r--r--  Documentation/scheduler/sched-BFS.txt       | 347
-rw-r--r--  Documentation/scheduler/sched-deadline.txt  | 184
2 files changed, 154 insertions(+), 377 deletions(-)
diff --git a/Documentation/scheduler/sched-BFS.txt b/Documentation/scheduler/sched-BFS.txt
deleted file mode 100644
index c10d95601..000000000
--- a/Documentation/scheduler/sched-BFS.txt
+++ /dev/null
@@ -1,347 +0,0 @@
-BFS - The Brain Fuck Scheduler by Con Kolivas.
-
-Goals.
-
-The goal of the Brain Fuck Scheduler, referred to as BFS from here on, is to
-completely do away with the complex designs of the past for the cpu process
-scheduler and instead implement one that is very simple in basic design.
-The main focus of BFS is to achieve excellent desktop interactivity and
-responsiveness without heuristics and tuning knobs that are difficult to
-understand, impossible to model and predict the effect of, and when tuned to
-one workload cause massive detriment to another.
-
-
-Design summary.
-
-BFS is best described as a single runqueue, O(n) lookup, earliest effective
-virtual deadline first design, loosely based on EEVDF (earliest eligible virtual
-deadline first) and my previous Staircase Deadline scheduler. Each component
-shall be described in turn to explain its significance and the reasoning
-behind it. When the first stable version was released, the codebase was
-approximately 9000 lines smaller than the existing mainline linux kernel
-scheduler (in 2.6.31). This does not even take into account the removal of
-documentation and the unused cgroups code.
-
-Design reasoning.
-
-The single runqueue refers to the queued but not running processes for the
-entire system, regardless of the number of CPUs. The reason for going back to
-a single runqueue design is that once multiple runqueues are introduced,
-per-CPU or otherwise, there will be complex interactions, as each runqueue is
-responsible for the scheduling latency and fairness only of the tasks on its
-own runqueue. To achieve fairness and low latency across multiple CPUs, any
-throughput advantage of having CPU-local tasks is traded for other
-disadvantages: a very complex balancing system is required to at best achieve
-some semblance of fairness across CPUs, and it can only maintain relatively
-low latency for tasks bound to the same CPUs, not across them. To increase
-said fairness and latency across CPUs, the advantage of local runqueue
-locking, which makes for better scalability, is lost due to having to grab
-multiple locks.
-
-A significant feature of BFS is that all accounting is done purely based on CPU
-used and nowhere is sleep time used in any way to determine entitlement or
-interactivity. Interactivity "estimators" that use some kind of sleep/run
-algorithm are doomed to fail to detect all interactive tasks, and to falsely tag
-tasks that aren't interactive as being so. The reason for this is that it is
-close to impossible to determine, when a task is sleeping, whether it is
-doing so voluntarily, as in a userspace application waiting for input in the
-form of a mouse click or otherwise, or involuntarily, because it is waiting for
-another thread, process, I/O, kernel activity or whatever. Thus, such an
-estimator will introduce corner cases, and more heuristics will be required to
-cope with those corner cases, introducing more corner cases and failed
-interactivity detection and so on. Interactivity in BFS is built into the design
-by virtue of the fact that tasks that are waking up have not used up their quota
-of CPU time, and have earlier effective deadlines, thereby making it very likely
-they will preempt any CPU bound task of equivalent nice level. See below for
-more information on the virtual deadline mechanism. Even if a waking task does
-not preempt a running task, the rr interval guarantees a bounded upper limit
-on how long a task will wait, so it will be scheduled within a timeframe that
-will not cause visible interface jitter.
-
-
-Design details.
-
-Task insertion.
-
-BFS inserts tasks into each relevant queue as an O(1) insertion into a doubly
-linked list. On insertion, *every* running queue is checked to see if the newly
-queued task can run on any idle queue, or preempt the lowest-priority running
-task on the system. This is how the cross-CPU scheduling of BFS achieves
-significantly lower
-latency per extra CPU the system has. In this case the lookup is, in the worst
-case scenario, O(n) where n is the number of CPUs on the system.
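-
-As a rough user-space model of this path (purely illustrative; none of the
-names below are actual kernel identifiers, and the deadline comparison is
-explained in the "Virtual deadline" section further down), the preemption
-check scans every CPU looking for an idle one, or failing that, the running
-task with the latest virtual deadline:
-
-	#include <stdbool.h>
-
-	#define NR_CPUS 8
-
-	struct cpu_state {
-		bool idle;
-		unsigned long long curr_deadline;	/* of the running task */
-	};
-
-	static struct cpu_state cpus[NR_CPUS];
-
-	/* Return the CPU to reschedule for a waking task, or -1 for none. */
-	static int try_preempt(unsigned long long wakee_deadline)
-	{
-		int cpu, victim = -1;
-		unsigned long long latest = 0;
-
-		for (cpu = 0; cpu < NR_CPUS; cpu++) {	/* worst case O(n) */
-			if (cpus[cpu].idle)
-				return cpu;	/* an idle CPU always wins */
-			if (cpus[cpu].curr_deadline > latest) {
-				latest = cpus[cpu].curr_deadline;
-				victim = cpu;
-			}
-		}
-		/* Preempt only if the waking task's deadline is earlier. */
-		return wakee_deadline < latest ? victim : -1;
-	}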
-
-Data protection.
-
-BFS has a single lock protecting the process-local data of every task in the
-global queue. Thus every insertion, removal and modification of task data in the
-global runqueue needs to grab the global lock. However, once a task is taken by
-a CPU, the CPU has its own local data copy of the running process' accounting
-information which only that CPU accesses and modifies (such as during a
-timer tick) thus allowing the accounting data to be updated lockless. Once a
-CPU has taken a task to run, it removes it from the global queue. Thus the
-global queue only ever has, at most,
-
- (number of tasks requesting cpu time) - (number of logical CPUs) + 1
-
-tasks in the global queue. This value is relevant for the time taken to look
-up tasks during scheduling. It will increase if many tasks have a CPU
-affinity set in their policy limiting which CPUs they're allowed to run on,
-and those tasks outnumber the CPUs they may use. The +1 is because when
-rescheduling a task, the CPU's
-currently running task is put back on the queue. Lookup will be described after
-the virtual deadline mechanism is explained.
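-
-As a worked example of the bound above: with 40 tasks requesting cpu time on
-a machine with 8 logical CPUs, the global queue holds at most
-
-	40 - 8 + 1 = 33
-
-tasks at any instant.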
-
-Virtual deadline.
-
-The key to achieving low latency, scheduling fairness, and "nice level"
-distribution in BFS is entirely in the virtual deadline mechanism. The one
-tunable in BFS is the rr_interval, or "round robin interval". This is the
-maximum time two SCHED_OTHER (or SCHED_NORMAL, the common scheduling policy)
-tasks of the same nice level will be running for, or looking at it the other
-way around, the longest duration two tasks of the same nice level will be
-delayed for. When a task requests cpu time, it is given a quota (time_slice)
-equal to the rr_interval and a virtual deadline. The virtual deadline is
-offset from the current time in jiffies by this equation:
-
- jiffies + (prio_ratio * rr_interval)
-
-The prio_ratio is determined as a ratio compared to the baseline of nice -20
-and increases by 10% per nice level. The deadline is a virtual one only in that
-no guarantee is placed that a task will actually be scheduled by this time, but
-it is used to compare which task should go next. There are three components to
-how a task is next chosen. First is time_slice expiration. If a task runs out
-of its time_slice, it is descheduled, the time_slice is refilled, and the
-deadline reset to that formula above. Second is sleep, where a task no longer
-is requesting CPU for whatever reason. The time_slice and deadline are _not_
-adjusted in this case and are just carried over for when the task is next
-scheduled. Third is preemption, and that is when a newly waking task is deemed
-higher priority than a currently running task on any cpu by virtue of the fact
-that it has an earlier virtual deadline than the currently running task. The
-earlier deadline is the key to which task is next chosen for the first and
-second cases. Once a task is descheduled, it is put back on the queue, and an
-O(n) lookup of all queued-but-not-running tasks is done to determine which has
-the earliest deadline and that task is chosen to receive CPU next.
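-
-A minimal sketch of the deadline computation (illustrative only: the names
-are invented, the real prio_ratio table differs, and whether the 10% per
-nice level is compounded or additive is an implementation detail, compounded
-here):
-
-	/* prio_ratio in percent: 100 at nice -20, +10% per nice level. */
-	static unsigned long long
-	virtual_deadline(unsigned long long jiffies_now, int nice,
-			 unsigned int rr_interval)
-	{
-		unsigned long ratio = 100;
-		int level;
-
-		for (level = -20; level < nice; level++)
-			ratio = ratio * 110 / 100;
-
-		return jiffies_now + ratio * rr_interval / 100;
-	}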
-
-The CPU proportion of tasks at different nice levels works out to be
-approximately
-
- (prio_ratio difference)^2
-
-The reason it is squared is that a task's deadline does not change while it is
-running unless it runs out of time_slice. Thus, even if the time actually
-passes the deadline of another queued task, that task will not get CPU time
-unless the currently running task deschedules, and the time "base" (jiffies)
-is constantly moving.
-
-Task lookup.
-
-BFS has 103 priority queues. 100 of these are dedicated to the static priority
-of realtime tasks, and the remaining 3 are, in order of best to worst priority,
-SCHED_ISO (isochronous), SCHED_NORMAL, and SCHED_IDLEPRIO (idle priority
-scheduling). When a task of these priorities is queued, a bitmap of running
-priorities is set showing which of these priorities has tasks waiting for CPU
-time. When a CPU is made to reschedule, the lookup for the next task to get
-CPU time is performed in the following way:
-
-First the bitmap is checked to see what static priority tasks are queued. If
-any realtime priorities are found, the corresponding queue is checked and the
-first task listed there is taken (provided CPU affinity is suitable) and lookup
-is complete. If the priority corresponds to a SCHED_ISO task, it is also
-taken in FIFO order (as SCHED_ISO tasks behave like SCHED_RR). If the
-priority corresponds
-to either SCHED_NORMAL or SCHED_IDLEPRIO, then the lookup becomes O(n). At this
-stage, every task in the runlist that corresponds to that priority is checked
-to see which has the earliest set deadline, and (provided it has suitable CPU
-affinity) it is taken off the runqueue and given the CPU. If a task has an
-expired deadline, it is taken and the rest of the lookup aborted (as
-expired-deadline tasks are chosen in FIFO order).
-
-Thus, the lookup is O(n) in the worst case only, where n is as described
-earlier, as tasks may be chosen before the whole task list is looked over.
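-
-A condensed user-space model of this lookup (names are illustrative, and the
-CPU affinity checks and expired-deadline early exit are omitted for brevity):
-
-	#include <stddef.h>
-
-	#define PRIO_LEVELS 103	/* 100 realtime + ISO + NORMAL + IDLEPRIO */
-
-	struct qtask {
-		unsigned long long deadline;
-		struct qtask *next;
-	};
-
-	static struct qtask *queues[PRIO_LEVELS];	/* FIFO lists */
-	static unsigned char prio_bitmap[PRIO_LEVELS];	/* 1 = non-empty */
-
-	static struct qtask *earliest_deadline_task(int prio)
-	{
-		struct qtask *t, *best = queues[prio];
-
-		for (t = best; t; t = t->next)	/* the O(n) scan */
-			if (t->deadline < best->deadline)
-				best = t;
-		return best;
-	}
-
-	static struct qtask *lookup_next_task(void)
-	{
-		int prio;
-
-		for (prio = 0; prio < PRIO_LEVELS; prio++) {
-			if (!prio_bitmap[prio])
-				continue;
-			if (prio < 101)	/* realtime and ISO: FIFO head */
-				return queues[prio];
-			return earliest_deadline_task(prio);
-		}
-		return NULL;	/* nothing queued: the CPU goes idle */
-	}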
-
-
-Scalability.
-
-The major limitation of BFS is scalability, as separate runqueue designs
-have less lock contention as the number of CPUs rises.
-However they do not scale linearly even with separate runqueues as multiple
-runqueues will need to be locked concurrently on such designs to be able to
-achieve fair CPU balancing, to try and achieve some sort of nice-level fairness
-across CPUs, and to achieve low enough latency for tasks on a busy CPU when
-other CPUs would be more suited. BFS has the advantage that it requires no
-balancing algorithm whatsoever, as balancing occurs by proxy simply because
-all CPUs draw off the global runqueue, in priority and deadline order. Despite
-the fact that scalability is _not_ the prime concern of BFS, it both shows very
-good scalability to smaller numbers of CPUs and is likely a more scalable design
-at these numbers of CPUs.
-
-BFS also has some very low overhead scalability features built into the
-design, added only where their overhead was deemed so marginal as to be
-worthwhile. The first is the local copy of the running process' data kept on
-the CPU it's running on, allowing that data to be updated lockless where
-possible. Then there is
-deference paid to the last CPU a task was running on, by trying that CPU first
-when looking for an idle CPU to use the next time it's scheduled. Finally there
-is the notion of "sticky" tasks that are flagged when they are involuntarily
-descheduled, meaning they still want further CPU time. This sticky flag is
-used to bias heavily against those tasks being scheduled on a different CPU
-unless that CPU would be otherwise idle. When a cpu frequency governor is used
-that scales with CPU load, such as ondemand, sticky tasks are not scheduled
-on a different CPU at all, preferring instead to go idle. This means the CPU
-they were bound to is more likely to increase its speed while the other CPU
-will go idle, thus speeding up total task execution time and likely decreasing
-power usage. This is the only scenario where BFS will allow a CPU to go idle
-in preference to scheduling a task on the earliest available spare CPU.
-
-The real cost of migrating a task from one CPU to another is entirely dependent
-on the cache footprint of the task, how cache intensive the task is, how long
-it's been running on that CPU to take up the bulk of its cache, how big the CPU
-cache is, how fast and how layered the CPU cache is, how fast a context switch
-is... and so on. In other words, it's close to random in the real world where we
-do more than just one sole workload. The only thing we can be sure of is that
-it's not free. So BFS uses the principle that an idle CPU is a wasted CPU and
-utilising idle CPUs is more important than cache locality, and cache locality
-only plays a part after that.
-
-When choosing an idle CPU for a waking task, the cache locality is determined
-according to where the task last ran and then idle CPUs are ranked from best
-to worst to choose the most suitable idle CPU based on cache locality, NUMA
-node locality and hyperthread sibling busyness. They are chosen in the
-following preference (if idle):
-
-* Same core, idle or busy cache, idle threads
-* Other core, same cache, idle or busy cache, idle threads.
-* Same node, other CPU, idle cache, idle threads.
-* Same node, other CPU, busy cache, idle threads.
-* Same core, busy threads.
-* Other core, same cache, busy threads.
-* Same node, other CPU, busy threads.
-* Other node, other CPU, idle cache, idle threads.
-* Other node, other CPU, busy cache, idle threads.
-* Other node, other CPU, busy threads.
-
-This also shows the SMT or "hyperthread" awareness in the design, which will
-choose a truly idle core before a logical SMT sibling whose physical CPU
-already has tasks running.
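-
-The preference order above can be pictured as a simple rank, lowest value
-being the most preferred destination (a sketch only, not kernel code):
-
-	enum idle_cpu_rank {
-		SAME_CORE_IDLE_THREADS,
-		OTHER_CORE_SAME_CACHE_IDLE_THREADS,
-		SAME_NODE_IDLE_CACHE_IDLE_THREADS,
-		SAME_NODE_BUSY_CACHE_IDLE_THREADS,
-		SAME_CORE_BUSY_THREADS,
-		OTHER_CORE_SAME_CACHE_BUSY_THREADS,
-		SAME_NODE_BUSY_THREADS,
-		OTHER_NODE_IDLE_CACHE_IDLE_THREADS,
-		OTHER_NODE_BUSY_CACHE_IDLE_THREADS,
-		OTHER_NODE_BUSY_THREADS,
-	};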
-
-Early benchmarking of BFS suggested scalability dropped off at the 16 CPU mark.
-However this benchmarking was performed on an earlier design that was far less
-scalable than the current one so it's hard to know how scalable it is in terms
-of both CPUs (due to the global runqueue) and heavily loaded machines (due to
-O(n) lookup) at this stage. Note that in terms of scalability, the number of
-_logical_ CPUs matters, not the number of _physical_ CPUs. Thus, a dual (2x)
-quad core (4x) hyperthreaded (2x) machine is effectively a 16x. Newer benchmark
-results are very promising indeed, without needing to tweak any knobs, features
-or options. Benchmark contributions are most welcome.
-
-
-Features
-
-As the initial prime target audience for BFS was the average desktop user, it
-was designed to deliver its benefits without tweaking, tuning, or setting any
-features. Thus the number of knobs and features has been kept to an absolute
-minimum and should not require extra user input for the vast majority of cases.
-There are precisely 2 tunables, and 2 extra scheduling policies. The rr_interval
-and iso_cpu tunables, and the SCHED_ISO and SCHED_IDLEPRIO policies. In addition
-to this, BFS also uses sub-tick accounting. What BFS does _not_ now feature is
-support for CGROUPS. The average user should neither need to know what these
-are, nor should they need to be using them to have good desktop behaviour.
-
-rr_interval
-
-There is only one "scheduler" tunable, the round robin interval. This can be
-accessed in
-
- /proc/sys/kernel/rr_interval
-
-The value is in milliseconds, and the default value is set to 6ms. Valid values
-are from 1 to 1000. Decreasing the value will decrease latencies at the cost of
-decreasing throughput, while increasing it will improve throughput, but at the
-cost of worsening latencies. The accuracy of the rr interval is limited by the
-HZ resolution of the kernel configuration. Thus, the worst case latencies are
-usually slightly higher than the set value. BFS uses "dithering" to try and
-minimise the effect the HZ limitation has. The default value of 6 is not an
-arbitrary one. It is based on the fact that humans can detect jitter at
-approximately 7ms, so aiming for much lower latencies is pointless under most
-circumstances. It is worth noting this fact when comparing the latency
-performance of BFS to other schedulers. Worst case latencies being higher than
-7ms are far worse than average latencies not being in the microsecond range.
-Experimentation has shown that increasing the rr interval up to 300 can
-improve throughput, but beyond that, scheduling noise from elsewhere prevents
-further demonstrable throughput gains.
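-
-For example, a latency-sensitive desktop session could lower the interval at
-runtime like so (value illustrative only, root required):
-
-	echo 2 > /proc/sys/kernel/rr_interval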
-
-Isochronous scheduling.
-
-Isochronous scheduling is a unique scheduling policy designed to provide
-near-real-time performance to unprivileged (ie non-root) users without the
-ability to starve the machine indefinitely. Isochronous (meaning "same time")
-tasks are set using, for example, the schedtool application like so:
-
- schedtool -I -e amarok
-
-This will start the audio application "amarok" as SCHED_ISO. SCHED_ISO tasks
-have a priority level between true realtime tasks and SCHED_NORMAL, which
-allows them to preempt all normal tasks in a SCHED_RR fashion (ie, if
-multiple SCHED_ISO tasks are running, they purely round robin at rr_interval
-rate). However, if ISO tasks run for more than a tunable finite amount of time,
-they are then demoted back to SCHED_NORMAL scheduling. This finite amount of
-time is the percentage of _total CPU_ available across the machine, configurable
-as a percentage in the following "resource handling" tunable (as opposed to a
-scheduler tunable):
-
- /proc/sys/kernel/iso_cpu
-
-and is set to 70% by default. It is calculated over a rolling 5 second average.
-Because it is the total CPU available, it means that on a multi CPU machine, it
-is possible to have an ISO task running as realtime scheduling indefinitely on
-just one CPU, as the other CPUs will be available. Setting this to 100 is the
-equivalent of giving all users SCHED_RR access and setting it to 0 removes the
-ability to run any pseudo-realtime tasks.
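-
-For example, to allow unprivileged ISO tasks up to 90% of total CPU over the
-rolling average (illustrative value, root required):
-
-	echo 90 > /proc/sys/kernel/iso_cpu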
-
-A feature of BFS is that it detects when an application tries to obtain a
-realtime policy (SCHED_RR or SCHED_FIFO) and the caller does not have the
-appropriate privileges to use those policies. When it detects this, it will
-give the task SCHED_ISO policy instead. Thus it is transparent to the user.
-Because some applications constantly set their policy as well as their nice
-level, there is potential for them to undo an override, specified by the user
-on the command line, that sets the policy to SCHED_ISO. To counter this, once
-a task has been set to SCHED_ISO policy, it needs superuser privileges to set
-it back to SCHED_NORMAL. This will ensure the task remains ISO and all child
-processes and threads will also inherit the ISO policy.
-
-Idleprio scheduling.
-
-Idleprio scheduling is a scheduling policy designed to give out CPU to a task
-_only_ when the CPU would be otherwise idle. The idea behind this is to allow
-ultra low priority tasks to be run in the background that have virtually no
-effect on the foreground tasks. This is ideally suited to distributed computing
-clients (like setiathome, folding, mprime etc) but can also be used to start
-a video encode or so on without any slowdown of other tasks. To prevent tasks
-under this policy from grabbing shared resources and holding them
-indefinitely, if BFS detects a state where the task is waiting on I/O, or the
-machine is about to suspend to ram and so on, it will transiently schedule the
-task as SCHED_NORMAL. As
-per the Isochronous task management, once a task has been scheduled as IDLEPRIO,
-it cannot be put back to SCHED_NORMAL without superuser privileges. Tasks can
-be set to start as SCHED_IDLEPRIO with the schedtool command like so:
-
- schedtool -D -e ./mprime
-
-Subtick accounting.
-
-It is surprisingly difficult to get accurate CPU accounting, and in many cases,
-the accounting is done by simply determining what is happening at the precise
-moment a timer tick fires off. This becomes increasingly inaccurate as the
-timer tick frequency (HZ) is lowered. It is possible to create an application
-which uses almost 100% CPU, yet by being descheduled at the right time, records
-zero CPU usage. While the main problem with this is that there are possible
-security implications, it is also difficult to determine how much CPU a task
-really does use. BFS tries to use the sub-tick accounting from the TSC clock,
-where possible, to determine real CPU usage. This is not entirely reliable, but
-is far more likely to produce accurate CPU usage data than the existing designs
-and will not show tasks as consuming no CPU usage when in fact they are. Thus,
-the amount of CPU reported as being used by BFS will more accurately represent
-how much CPU the task itself is using (as is shown for example by the 'time'
-application), so the reported values may be quite different to other schedulers.
-Values reported as the 'load' are more prone to problems with this design, but
-per process values are closer to real usage. When comparing throughput of BFS
-to other designs, it is important to compare the actual completed work in terms
-of total wall clock time taken and total work done, rather than the reported
-"cpu usage".
-
-
-Con Kolivas <kernel@kolivas.org> Tue, 5 Apr 2011
diff --git a/Documentation/scheduler/sched-deadline.txt b/Documentation/scheduler/sched-deadline.txt
index 21461a044..e114513a2 100644
--- a/Documentation/scheduler/sched-deadline.txt
+++ b/Documentation/scheduler/sched-deadline.txt
@@ -8,6 +8,10 @@ CONTENTS
1. Overview
2. Scheduling algorithm
3. Scheduling Real-Time Tasks
+ 3.1 Definitions
+ 3.2 Schedulability Analysis for Uniprocessor Systems
+ 3.3 Schedulability Analysis for Multiprocessor Systems
+ 3.4 Relationship with SCHED_DEADLINE Parameters
4. Bandwidth management
4.1 System-wide settings
4.2 Task interface
@@ -43,7 +47,7 @@ CONTENTS
"deadline", to schedule tasks. A SCHED_DEADLINE task should receive
"runtime" microseconds of execution time every "period" microseconds, and
these "runtime" microseconds are available within "deadline" microseconds
- from the beginning of the period. In order to implement this behaviour,
+ from the beginning of the period. In order to implement this behavior,
every time the task wakes up, the scheduler computes a "scheduling deadline"
consistent with the guarantee (using the CBS[2,3] algorithm). Tasks are then
scheduled using EDF[1] on these scheduling deadlines (the task with the
@@ -52,7 +56,7 @@ CONTENTS
"admission control" strategy (see Section "4. Bandwidth management") is used
(clearly, if the system is overloaded this guarantee cannot be respected).
- Summing up, the CBS[2,3] algorithms assigns scheduling deadlines to tasks so
+ Summing up, the CBS[2,3] algorithm assigns scheduling deadlines to tasks so
that each task runs for at most its runtime every period, avoiding any
interference between different tasks (bandwidth isolation), while the EDF[1]
algorithm selects the task with the earliest scheduling deadline as the one
@@ -63,7 +67,7 @@ CONTENTS
In more details, the CBS algorithm assigns scheduling deadlines to
tasks in the following way:
- - Each SCHED_DEADLINE task is characterised by the "runtime",
+ - Each SCHED_DEADLINE task is characterized by the "runtime",
"deadline", and "period" parameters;
- The state of the task is described by a "scheduling deadline", and
@@ -78,7 +82,7 @@ CONTENTS
then, if the scheduling deadline is smaller than the current time, or
this condition is verified, the scheduling deadline and the
- remaining runtime are re-initialised as
+ remaining runtime are re-initialized as
scheduling deadline = current time + deadline
remaining runtime = runtime
@@ -126,31 +130,37 @@ CONTENTS
suited for periodic or sporadic real-time tasks that need guarantees on their
timing behavior, e.g., multimedia, streaming, control applications, etc.
+3.1 Definitions
+------------------------
+
A typical real-time task is composed of a repetition of computation phases
(task instances, or jobs) which are activated in a periodic or sporadic
fashion.
- Each job J_j (where J_j is the j^th job of the task) is characterised by an
+ Each job J_j (where J_j is the j^th job of the task) is characterized by an
arrival time r_j (the time when the job starts), an amount of computation
time c_j needed to finish the job, and a job absolute deadline d_j, which
is the time within which the job should be finished. The maximum execution
- time max_j{c_j} is called "Worst Case Execution Time" (WCET) for the task.
+ time max{c_j} is called "Worst Case Execution Time" (WCET) for the task.
A real-time task can be periodic with period P if r_{j+1} = r_j + P, or
sporadic with minimum inter-arrival time P if r_{j+1} >= r_j + P. Finally,
d_j = r_j + D, where D is the task's relative deadline.
- The utilisation of a real-time task is defined as the ratio between its
+ Summing up, a real-time task can be described as
+ Task = (WCET, D, P)
+
+ The utilization of a real-time task is defined as the ratio between its
WCET and its period (or minimum inter-arrival time), and represents
the fraction of CPU time needed to execute the task.
- If the total utilisation sum_i(WCET_i/P_i) is larger than M (with M equal
+ If the total utilization U=sum(WCET_i/P_i) is larger than M (with M equal
to the number of CPUs), then the scheduler is unable to respect all the
deadlines.
- Note that total utilisation is defined as the sum of the utilisations
+ Note that total utilization is defined as the sum of the utilizations
WCET_i/P_i over all the real-time tasks in the system. When considering
multiple real-time tasks, the parameters of the i-th task are indicated
with the "_i" suffix.
- Moreover, if the total utilisation is larger than M, then we risk starving
+ Moreover, if the total utilization is larger than M, then we risk starving
non-real-time tasks by real-time tasks.
- If, instead, the total utilisation is smaller than M, then non real-time
+ If, instead, the total utilization is smaller than M, then non real-time
tasks will not be starved and the system might be able to respect all the
deadlines.
As a matter of fact, in this case it is possible to provide an upper bound
@@ -159,38 +169,119 @@ CONTENTS
More precisely, it can be proven that using a global EDF scheduler the
maximum tardiness of each task is smaller than or equal to
((M − 1) · WCET_max − WCET_min)/(M − (M − 2) · U_max) + WCET_max
- where WCET_max = max_i{WCET_i} is the maximum WCET, WCET_min=min_i{WCET_i}
- is the minimum WCET, and U_max = max_i{WCET_i/P_i} is the maximum utilisation.
+ where WCET_max = max{WCET_i} is the maximum WCET, WCET_min=min{WCET_i}
+ is the minimum WCET, and U_max = max{WCET_i/P_i} is the maximum
+ utilization[12].
+
+3.2 Schedulability Analysis for Uniprocessor Systems
+------------------------
If M=1 (uniprocessor system), or in case of partitioned scheduling (each
real-time task is statically assigned to one and only one CPU), it is
possible to formally check if all the deadlines are respected.
If D_i = P_i for all tasks, then EDF is able to respect all the deadlines
- of all the tasks executing on a CPU if and only if the total utilisation
+ of all the tasks executing on a CPU if and only if the total utilization
of the tasks running on such a CPU is smaller than or equal to 1.
If D_i != P_i for some task, then it is possible to define the density of
- a task as C_i/min{D_i,T_i}, and EDF is able to respect all the deadlines
- of all the tasks running on a CPU if the sum sum_i C_i/min{D_i,T_i} of the
- densities of the tasks running on such a CPU is smaller or equal than 1
- (notice that this condition is only sufficient, and not necessary).
+ a task as WCET_i/min{D_i,P_i}, and EDF is able to respect all the deadlines
+ of all the tasks running on a CPU if the sum of the densities of the tasks
+ running on such a CPU is smaller than or equal to 1:
+ sum(WCET_i / min{D_i, P_i}) <= 1
+ It is important to notice that this condition is only sufficient, and not
+ necessary: there are task sets that are schedulable, but do not respect the
+ condition. For example, consider the task set {Task_1,Task_2} composed of
+ Task_1=(50ms,50ms,100ms) and Task_2=(10ms,100ms,100ms).
+ EDF is clearly able to schedule the two tasks without missing any deadline
+ (Task_1 is scheduled as soon as it is released, and finishes just in time
+ to respect its deadline; Task_2 is scheduled immediately after Task_1, hence
+ its response time cannot be larger than 50ms + 10ms = 60ms) even if
+ 50 / min{50,100} + 10 / min{100, 100} = 50 / 50 + 10 / 100 = 1.1
+ Of course it is possible to test the exact schedulability of tasks with
+ D_i != P_i (checking a condition that is both sufficient and necessary),
+ but this cannot be done by comparing the total utilization or density with
+ a constant. Instead, the so-called "processor demand" approach can be used,
+ computing the total amount of CPU time h(t) needed by all the tasks to
+ respect all of their deadlines in a time interval of size t, and comparing
+ such a time with the interval size t. If h(t) is smaller than t (that is,
+ the amount of time needed by the tasks in a time interval of size t is
+ smaller than the size of the interval) for all the possible values of t, then
+ EDF is able to schedule the tasks respecting all of their deadlines. Since
+ performing this check for all possible values of t is impossible, it has been
+ proven[4,5,6] that it is sufficient to perform the test for values of t
+ between 0 and a maximum value L. The cited papers contain all of the
+ mathematical details and explain how to compute h(t) and L.
+ In any case, this kind of analysis is too complex as well as too
+ time-consuming to be performed on-line. Hence, as explained in Section
+ 4, Linux uses an admission test based on the tasks' utilizations.
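+
+ As a hedged illustration of the off-line analysis described above (user-space
+ sketch only, not kernel code; for sporadic tasks the demand in an interval of
+ size t is h(t) = sum(max(0, floor((t - D_i) / P_i) + 1) * WCET_i), see
+ [4,5,6]):
+
+   #include <stdbool.h>
+
+   struct rt_task { unsigned long wcet, deadline, period; };
+
+   static unsigned long h(const struct rt_task *ts, int n, unsigned long t)
+   {
+           unsigned long demand = 0;
+           int i;
+
+           for (i = 0; i < n; i++)
+                   if (t >= ts[i].deadline)
+                           demand += ((t - ts[i].deadline) / ts[i].period + 1)
+                                     * ts[i].wcet;
+           return demand;
+   }
+
+   /* EDF-schedulable on one CPU if h(t) <= t for every t up to the bound
+    * L from the cited papers (a coarse full scan here, for clarity). */
+   static bool edf_schedulable(const struct rt_task *ts, int n,
+                               unsigned long bound)
+   {
+           unsigned long t;
+
+           for (t = 1; t <= bound; t++)
+                   if (h(ts, n, t) > t)
+                           return false;
+           return true;
+   }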
+
+3.3 Schedulability Analysis for Multiprocessor Systems
+------------------------
On multiprocessor systems with global EDF scheduling (non partitioned
systems), a sufficient test for schedulability cannot be based on the
- utilisations (it can be shown that task sets with utilisations slightly
- larger than 1 can miss deadlines regardless of the number of CPUs M).
- However, as previously stated, enforcing that the total utilisation is smaller
- than M is enough to guarantee that non real-time tasks are not starved and
- that the tardiness of real-time tasks has an upper bound.
+ utilizations or densities: it can be shown that even if D_i = P_i task
+ sets with utilizations slightly larger than 1 can miss deadlines regardless
+ of the number of CPUs.
+
+ Consider a set {Task_1,...Task_{M+1}} of M+1 tasks on a system with M
+ CPUs, with the first task Task_1=(P,P,P) having period, relative deadline
+ and WCET equal to P. The remaining M tasks Task_i=(e,P-1,P-1) have an
+ arbitrarily small worst case execution time (indicated as "e" here) and a
+ period smaller than the one of the first task. Hence, if all the tasks
+ activate at the same time t, global EDF schedules these M tasks first
+ (because their absolute deadlines are equal to t + P - 1, hence they are
+ smaller than the absolute deadline of Task_1, which is t + P). As a
+ result, Task_1 can be scheduled only at time t + e, and will finish at
+ time t + e + P, after its absolute deadline. The total utilization of the
+ task set is U = M · e / (P - 1) + P / P = M · e / (P - 1) + 1, and for small
+ values of e this can become very close to 1. This is known as "Dhall's
+ effect"[7]. Note: the example in the original paper by Dhall has been
+ slightly simplified here (for example, Dhall more correctly computed
+ lim_{e->0}U).
+
+ More complex schedulability tests for global EDF have been developed in
+ real-time literature[8,9], but they are not based on a simple comparison
+ between total utilization (or density) and a fixed constant. If all tasks
+ have D_i = P_i, a sufficient schedulability condition can be expressed in
+ a simple way:
+ sum(WCET_i / P_i) <= M - (M - 1) · U_max
+ where U_max = max{WCET_i / P_i}[10]. Notice that for U_max = 1,
+ M - (M - 1) · U_max becomes M - M + 1 = 1 and this schedulability condition
+ just confirms Dhall's effect. A more complete survey of the literature
+ about schedulability tests for multi-processor real-time scheduling can be
+ found in [11].
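+
+ A hedged sketch of the utilization-based sufficient test above (illustrative
+ user-space code, not part of the kernel):
+
+   #include <stdbool.h>
+
+   /* Sufficient global-EDF test for D_i = P_i, per [10]:
+    * schedulable if sum(U_i) <= M - (M - 1) * U_max. */
+   static bool gfb_schedulable(const double *util, int n, int m)
+   {
+           double total = 0.0, u_max = 0.0;
+           int i;
+
+           for (i = 0; i < n; i++) {
+                   total += util[i];
+                   if (util[i] > u_max)
+                           u_max = util[i];
+           }
+           return total <= m - (m - 1) * u_max;
+   }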
+
+ As seen, enforcing that the total utilization is smaller than M does not
+ guarantee that global EDF schedules the tasks without missing any deadline
+ (in other words, global EDF is not an optimal scheduling algorithm). However,
+ a total utilization smaller than M is enough to guarantee that non real-time
+ tasks are not starved and that the tardiness of real-time tasks has an upper
+ bound[12] (as previously noted). Different bounds on the maximum tardiness
+ experienced by real-time tasks have been developed in various papers[13,14],
+ but the theoretical result that is important for SCHED_DEADLINE is that if
+ the total utilization is smaller than or equal to M, then the response times
+ of
+ the tasks are limited.
+
+3.4 Relationship with SCHED_DEADLINE Parameters
+------------------------
- SCHED_DEADLINE can be used to schedule real-time tasks guaranteeing that
- the jobs' deadlines of a task are respected. In order to do this, a task
- must be scheduled by setting:
+ Finally, it is important to understand the relationship between the
+ SCHED_DEADLINE scheduling parameters described in Section 2 (runtime,
+ deadline and period) and the real-time task parameters (WCET, D, P)
+ described in this section. Note that a task's temporal constraints are
+ represented by its absolute deadlines d_j = r_j + D described above, while
+ SCHED_DEADLINE schedules the tasks according to scheduling deadlines (see
+ Section 2).
+ If an admission test is used to guarantee that the scheduling deadlines
+ are respected, then SCHED_DEADLINE can be used to schedule real-time tasks
+ guaranteeing that all the jobs' deadlines of a task are respected.
+ In order to do this, a task must be scheduled by setting:
- runtime >= WCET
- deadline = D
- period <= P
- IOW, if runtime >= WCET and if period is >= P, then the scheduling deadlines
+ IOW, if runtime >= WCET and if period is <= P, then the scheduling deadlines
and the absolute deadlines (d_j) coincide, so a proper admission control
makes it possible to respect the jobs' absolute deadlines for this task (this is what is
called "hard schedulability property" and is an extension of Lemma 1 of [2]).
@@ -206,6 +297,39 @@ CONTENTS
Symposium, 1998. http://retis.sssup.it/~giorgio/paps/1998/rtss98-cbs.pdf
3 - L. Abeni. Server Mechanisms for Multimedia Applications. ReTiS Lab
Technical Report. http://disi.unitn.it/~abeni/tr-98-01.pdf
+ 4 - J. Y. Leung and M. L. Merrill. A Note on Preemptive Scheduling of
+ Periodic, Real-Time Tasks. Information Processing Letters, vol. 11,
+ no. 3, pp. 115-118, 1980.
+ 5 - S. K. Baruah, A. K. Mok and L. E. Rosier. Preemptively Scheduling
+ Hard-Real-Time Sporadic Tasks on One Processor. Proceedings of the
+ 11th IEEE Real-time Systems Symposium, 1990.
+ 6 - S. K. Baruah, L. E. Rosier and R. R. Howell. Algorithms and Complexity
+ Concerning the Preemptive Scheduling of Periodic Real-Time tasks on
+ One Processor. Real-Time Systems Journal, vol. 4, no. 2, pp 301-324,
+ 1990.
+ 7 - S. J. Dhall and C. L. Liu. On a real-time scheduling problem. Operations
+ research, vol. 26, no. 1, pp 127-140, 1978.
+ 8 - T. Baker. Multiprocessor EDF and Deadline Monotonic Schedulability
+ Analysis. Proceedings of the 24th IEEE Real-Time Systems Symposium, 2003.
+ 9 - T. Baker. An Analysis of EDF Schedulability on a Multiprocessor.
+ IEEE Transactions on Parallel and Distributed Systems, vol. 16, no. 8,
+ pp 760-768, 2005.
+ 10 - J. Goossens, S. Funk and S. Baruah, Priority-Driven Scheduling of
+ Periodic Task Systems on Multiprocessors. Real-Time Systems Journal,
+ vol. 25, no. 2–3, pp. 187–205, 2003.
+ 11 - R. Davis and A. Burns. A Survey of Hard Real-Time Scheduling for
+ Multiprocessor Systems. ACM Computing Surveys, vol. 43, no. 4, 2011.
+ http://www-users.cs.york.ac.uk/~robdavis/papers/MPSurveyv5.0.pdf
+ 12 - U. C. Devi and J. H. Anderson. Tardiness Bounds under Global EDF
+ Scheduling on a Multiprocessor. Real-Time Systems Journal, vol. 32,
+ no. 2, pp 133-189, 2008.
+ 13 - P. Valente and G. Lipari. An Upper Bound to the Lateness of Soft
+ Real-Time Tasks Scheduled by EDF on Multiprocessors. Proceedings of
+ the 26th IEEE Real-Time Systems Symposium, 2005.
+ 14 - J. Erickson, U. Devi and S. Baruah. Improved tardiness bounds for
+ Global EDF. Proceedings of the 22nd Euromicro Conference on
+ Real-Time Systems, 2010.
+
4. Bandwidth management
=======================
@@ -218,10 +342,10 @@ CONTENTS
no guarantee can be given on the actual scheduling of the -deadline tasks.
As already stated in Section 3, a necessary condition to be respected to
- correctly schedule a set of real-time tasks is that the total utilisation
+ correctly schedule a set of real-time tasks is that the total utilization
is smaller than M. When talking about -deadline tasks, this requires that
the sum of the ratio between runtime and period for all tasks is smaller
- than M. Notice that the ratio runtime/period is equivalent to the utilisation
+ than M. Notice that the ratio runtime/period is equivalent to the utilization
of a "traditional" real-time task, and is also often referred to as
"bandwidth".
The interface used to control the CPU bandwidth that can be allocated
@@ -251,7 +375,7 @@ CONTENTS
The system wide settings are configured under the /proc virtual file system.
For now the -rt knobs are used for -deadline admission control and the
- -deadline runtime is accounted against the -rt runtime. We realise that this
+ -deadline runtime is accounted against the -rt runtime. We realize that this
isn't entirely desirable; however, it is better to have a small interface for
now, and be able to change it easily later. The ideal situation (see 5.) is to
run -rt tasks from a -deadline server; in which case the -rt bandwidth is a