In 2009, BFS was introduced and had originally used a doubly linked list data structure, but the data structure is treated like a queue. Task insertion is O(1); lookup of the next task to execute is O(n) in the worst case. The single global runqueue, as opposed to the runqueue per CPU that was used in his RSDL scheduler, was meant to ease fairness in the multiple-CPU scenario and remove the complexity of each runqueue in a multi-runqueue scenario having to maintain its own latencies and [task] fairness. Kolivas later switched to a skip list in the v0.480 release of BFS in 2016, which altered the efficiency of the scheduler: he noted O(log n) task insertion and O(1) task lookup, with O(k), k ≤ 16, for task removal.

Tasks are ordered by a virtual deadline:

{{block indent|<math>\mathtt{VD(nice) = niffies + prio\_ratio(idx(nice)) * rr\_interval * MS\_TO\_NS}</math>}}

where <math>\mathtt{VD}</math> is the virtual deadline in u64 integer nanoseconds as a function of nice and <math>\mathtt{niffies}</math>, which is the current time in niffies; <math>\mathtt{prio\_ratio}</math> is the prio ratio table lookup as a function of index; <math>\mathtt{idx}</math> is the task's nice-to-index mapping function; <math>\mathtt{rr\_interval}</math> is the round robin timeslice in milliseconds; and <math>\mathtt{MS\_TO\_NS}</math> is a constant of 1 millisecond in terms of nanoseconds, a latency-reducing approximation of the conversion factor <math>\mathrm{\frac{10^9\ ns}{10^3\ ms}} = 10^6</math>, though Kolivas uses a base-2 constant (<math>2^{20}</math>) of approximately that scale. Smaller values of <math>\mathtt{VD}</math> mean that the virtual deadline is earlier, corresponding to negative nice values; larger values of <math>\mathtt{VD}</math> indicate the virtual deadline is pushed back later, corresponding to positive nice values. The scheduler applies this formula whenever a task's timeslice expires. 128 in base 2 corresponds to 100 in base 10, possibly a "pseudo 100"; 115 in base 2 corresponds to 90 in base 10. Kolivas uses 128 for "fast shifts", as in division by 128 is a right shift in base 2.
===Scheduling policies===
BFS uses scheduling policies to determine how much of the CPU tasks may use. It uses 4 scheduling tiers (called scheduling policies or scheduling classes), ordered from best to worst, which determine how tasks are selected, with the ones on top being executed first. Each task has a special value called a prio. In the v0.462 edition (used in the -ck 4.0 kernel patchset), there are a total of 103 "priority queues" (aka prios), or allowed values that prio can take. No actual special data structure was used as the priority queue, only the doubly linked list runqueue itself. A lower prio value means a task is more important and gets executed first.
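The 103 prio values follow from the tier layout described below: 100 static realtime queues plus one queue each for the Isochronous, normal, and idle tiers. A hypothetical C encoding of that layout (the names are illustrative, not the kernel's identifiers):

<syntaxhighlight lang="c">
/* Hypothetical layout of the 103 prio levels; a lower value is
 * scheduled first.  Names are invented for illustration. */
enum bfs_prio_level {
    BFS_RT_PRIO_FIRST = 0,     /* SCHED_FIFO / SCHED_RR: prios 0..99  */
    BFS_RT_PRIO_LAST  = 99,
    BFS_ISO_PRIO      = 100,   /* Isochronous: one queue              */
    BFS_NORMAL_PRIO   = 101,   /* normal: one queue, deadline-ordered */
    BFS_IDLE_PRIO     = 102,   /* idle priority: one queue            */
    BFS_PRIO_COUNT    = 103,   /* total "priority queues"             */
};
</syntaxhighlight>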
====Realtime policy====
The realtime policy was designed for realtime tasks. This policy implies that a running task cannot be interrupted (i.e. preempted) by a less important task, that is, one with a higher prio value or from a lower priority policy tier. Priority classes considered under the realtime policy by the scheduler are those marked SCHED_RR and SCHED_FIFO; the scheduler treats realtime round robin (SCHED_RR) and realtime FIFO (SCHED_FIFO) differently. The design laid out the first 100 static priority queues. The task chosen for execution is the available one from the lowest-valued prio of the 100 queues, following FIFO scheduling within a queue. On forks, the process priority will be demoted to the normal policy. On unprivileged use (i.e. by a non-root user) of sched_setscheduler called with a request for a realtime policy class, the scheduler will demote the task to the Isochronous policy.
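The demotion of unprivileged realtime requests can be probed from user space with the standard sched_setscheduler(2) call. A hedged sketch: whether the request is accepted (and demoted to Isochronous) or rejected with EPERM depends on running a BFS-patched kernel, and the SCHED_ISO constant shown is the value used by such kernels, not part of glibc's headers.

<syntaxhighlight lang="c">
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>

#ifndef SCHED_ISO
#define SCHED_ISO 4   /* value used by BFS-patched kernels; not in glibc */
#endif

int main(void)
{
    struct sched_param sp = { .sched_priority = 1 };

    /* As a non-root user, ask for realtime round robin.  Per the
     * description above, a BFS kernel demotes the task to the
     * Isochronous policy; a mainline kernel typically returns EPERM. */
    if (sched_setscheduler(0, SCHED_RR, &sp) == -1)
        printf("SCHED_RR refused: %s\n", strerror(errno));

    printf("current policy: %d (SCHED_ISO is %d)\n",
           sched_getscheduler(0), SCHED_ISO);
    return 0;
}
</syntaxhighlight>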
====Isochronous policy====
The Isochronous policy was designed for near-realtime performance for non-root users. The design laid out one priority queue that, by default, ran tasks as pseudo-realtime, but the degree of realtime can be tuned. The policy can demote a task to the normal policy when the task exceeds a tuneable resource handling percentage (70% by default) of 5 seconds, scaled to the number of online CPUs and the timer resolution, plus 1 tick. The formula was altered in MuQSS due to the multi-runqueue design. The exact formulas are:

{{block indent|<math>\mathtt{T_{BFS} = ((5*F*n)+1)*R}</math>}}

{{block indent|<math>\mathtt{T_{MuQSS} = (5*F)*R}</math>}}

where <math>\mathtt{T}</math> is the total number of isochronous ticks, <math>\mathtt{F}</math> is the timer frequency, <math>\mathtt{n}</math> is the number of online CPUs, and <math>\mathtt{R}</math> is the tuneable resource handling percentage, not in decimal but as a whole number. The timer frequency is set to 250 Hz by default and is editable in the kernel configuration, but it is usually tuned to 100 Hz for servers and 1000 Hz for interactive desktops; 250 Hz is the balanced value. Setting <math>\mathtt{R}</math> to 100 makes tasks behave as realtime, 0 makes them not pseudo-realtime, and anything in between is pseudo-realtime. The task with the earliest virtual deadline is chosen for execution, but when multiple Isochronous tasks exist, they are scheduled round robin, each task running for the tuneable round robin interval (6 ms by default) one after another with a fair, equal chance, without considering the nice level. This behavior of the Isochronous policy is unique to BFS and MuQSS and may not be implemented in other CPU schedulers.
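As a worked example of the two budgets, a small sketch that plugs in the defaults quoted above (250 Hz timer, 70% resource handling) together with a hypothetical 4 online CPUs:

<syntaxhighlight lang="c">
#include <stdio.h>

/* Isochronous tick budgets per the two formulas above.
 * F: timer frequency in Hz, n: online CPUs, R: percentage as a whole number. */
static long iso_ticks_bfs(long F, long n, long R)  { return ((5 * F * n) + 1) * R; }
static long iso_ticks_muqss(long F, long R)        { return (5 * F) * R; }

int main(void)
{
    printf("BFS:   %ld ticks\n", iso_ticks_bfs(250, 4, 70));  /* 350070 */
    printf("MuQSS: %ld ticks\n", iso_ticks_muqss(250, 70));   /* 87500  */
    return 0;
}
</syntaxhighlight>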
====Normal policy====
The normal policy was designed for regular use and is the default policy. Newly created tasks are typically marked normal. The design laid out one priority queue, and tasks are chosen for execution based on the earliest virtual deadline.
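Selection among normal tasks amounts to a scan for the earliest virtual deadline. A minimal sketch over a singly linked queue, matching the O(n) lookup of the original list-based design (the types are illustrative):

<syntaxhighlight lang="c">
#include <stdint.h>
#include <stddef.h>

struct task {
    uint64_t     deadline;   /* virtual deadline */
    struct task *next;
};

/* O(n) scan of the normal queue: the runnable task with the earliest
 * virtual deadline wins; returns NULL on an empty queue. */
static struct task *earliest_deadline_task(struct task *head)
{
    struct task *best = head;
    for (struct task *t = head; t != NULL; t = t->next)
        if (t->deadline < best->deadline)
            best = t;
    return best;
}
</syntaxhighlight>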
====Idle priority policy====
The idle priority policy was designed for background processes such as distributed programs and transcoders, so that foreground processes or those above this scheduling policy can run uninterrupted. The design laid out one priority queue, and tasks can be promoted to the normal policy automatically to prevent an indefinite resource hold. When multiple tasks reside in the idle priority policy, the next task to execute is selected by the earliest virtual deadline.
===Preemption===
Preemption can occur when a newly ready task with a higher priority policy (i.e. a lower prio value) has an earlier virtual deadline than the currently running task; the running task is descheduled and put at the back of the queue. Descheduled means that its virtual deadline is updated. A task's time is refilled to the maximum round robin quantum when it has used up all of its time. If the scheduler finds a task at a higher priority with the earliest virtual deadline, that task will execute in place of the less important currently running task only if all logical CPUs (including hyperthreaded cores / SMT threads) are busy; the scheduler delays preemption as long as possible if there are unused logical CPUs. A task marked with the idle priority policy cannot preempt at all, not even other idle policy tasks, and instead relies on cooperative multitasking.
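A condensed sketch of that decision, with illustrative types (the real scheduler works over its runqueue structures, not a helper like this):

<syntaxhighlight lang="c">
#include <stdbool.h>
#include <stdint.h>

struct sched_entity {
    int      prio;        /* lower value = more important */
    uint64_t deadline;    /* virtual deadline */
    bool     idle_policy; /* idle priority tasks never preempt */
};

/* Preemption test per the description above: prefer an unused logical
 * CPU; only when every CPU is busy may a more important task (or an
 * equally important one with an earlier virtual deadline) displace the
 * running task. */
static bool should_preempt(const struct sched_entity *waking,
                           const struct sched_entity *running,
                           bool any_logical_cpu_idle)
{
    if (any_logical_cpu_idle || waking->idle_policy)
        return false;   /* place the task on an idle CPU instead */
    if (waking->prio != running->prio)
        return waking->prio < running->prio;
    return waking->deadline < running->deadline;
}
</syntaxhighlight>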
===Task placement, multiple cores===
When the scheduler discovers a waking task on a non-unicore system, it must determine which logical CPU to run the task on. The scheduler favors most the idle hyperthreaded cores (or idle SMT threads) on the same CPU that the task last executed on, then the other idle cores of a multicore CPU, then the other CPUs on the same NUMA node, then all busy hyperthreaded cores / SMT threads / logical CPUs (to be preempted) on the same NUMA node, then the other (remote) NUMA node; the candidates are ranked on a preference list. This special scan exists to minimize the latency overhead resulting from migrating the task.

The preemption order is similar to the placement order above: hyperthreaded cores / SMT units on the same multicore chip first, then the other cores in the multicore chip, then the other CPUs on the same NUMA node. When the scheduler scans for a task to preempt on the other, remote NUMA node, the candidate is simply any busy thread running a task of lower or equal importance, or one with a later virtual deadline, assuming that all logical CPUs (including hyperthreaded cores / SMT threads) in the machine are busy. The scheduler has to scan for a suitable task of a lower (or possibly equal) priority policy (with a later virtual deadline if necessary) to preempt, while avoiding logical CPUs running a task of a higher priority policy, which it cannot preempt. Local preemption ranks higher than scanning for an idle remote NUMA unit.
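The preference list could be modeled as a simple ranking, with lower ranks tried first. A hypothetical encoding of the scan order just described:

<syntaxhighlight lang="c">
/* Hypothetical encoding of the placement preference list; lower rank
 * is tried first.  The real scan walks CPU topology masks rather than
 * an enum like this. */
enum placement_rank {
    RANK_IDLE_SMT_SIBLING = 0,  /* idle SMT thread of the task's last CPU */
    RANK_IDLE_SAME_PACKAGE,     /* other idle core on the same multicore CPU */
    RANK_IDLE_SAME_NODE,        /* idle CPU elsewhere on the same NUMA node */
    RANK_BUSY_SAME_NODE,        /* busy logical CPU on the node, preempt it */
    RANK_REMOTE_NODE,           /* remote NUMA node, last resort */
};
</syntaxhighlight>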
When a task is involuntarily preempted while the CPU is slowed down as a result of kernel-mediated CPU frequency scaling (aka the CPU frequency governor), the task is specially marked "sticky", except for tasks marked with the realtime policy. Marked sticky indicates that the task still has unused time and that the task is restricted to executing on the same CPU. The task is marked sticky whenever the CPU scaling governor has scaled the CPU to a slower speed. The idled sticky task will either return to executing at full speed by chance or be rescheduled to execute on the best idle CPU that is not the CPU the task ran on. It is not desirable to migrate the task elsewhere, but to let it idle instead, because of the increased latency brought about by the overhead of migrating the task to another CPU or NUMA node. This sticky feature was removed in the last iteration of BFS (v0.512), corresponding to Kolivas' 4.8-ck1 patchset, and does not exist in MuQSS.
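A final hedged sketch of the sticky rule as stated (the field names are invented for illustration, not taken from the BFS source):

<syntaxhighlight lang="c">
#include <stdbool.h>

struct bfs_task {
    bool realtime;   /* realtime policy tasks are never marked sticky */
    bool sticky;     /* still has unused time; keep on the same CPU   */
    int  last_cpu;   /* CPU the task last ran on                      */
};

/* Pre-v0.512 sticky rule as described above: a task involuntarily
 * preempted while the governor has the CPU scaled down keeps its
 * leftover time and stays bound to that CPU instead of migrating. */
static void on_involuntary_preempt(struct bfs_task *t, int cpu,
                                   bool cpu_freq_scaled_down)
{
    if (cpu_freq_scaled_down && !t->realtime) {
        t->sticky   = true;
        t->last_cpu = cpu;
    }
}
</syntaxhighlight>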