Before describing the design of the pooled multiqueue scheduler, it is instructive to examine the implications of the scheduling algorithm followed by DSS and MQS. During schedule(), DSS selects the next best task to run, as determined by the goodness() function. MQ attempts to do the same though its examination of non-local tasks (tasks not on the runqueue of the invoking cpu) is not done under a lock. During reschedule_idle(), DSS tries to find the best CPU on which the newly woken/created task can be run. MQS does the same without holding a lock on the remote runqueues. Overall, whether it is choosing a task or a CPU, DSS looks at all viable candidates and choose the globally best one while MQ tries to achieve the same goal by an intelligent examination of fewer candidates. This has two implications for large SMP and NUMA machines :
To address these two problems, we propose a pooled multiqueue scheduler (PMQS) based on MQS. The central idea in PMQS is to partition the CPUs of an SMP/NUMA into pools and to limit the search for candidate CPUs and/or tasks to these pools. The size of the pools is a configurable parameter that can be dynamically changed to meet different criteria of performance, fairness etc. Restricting the scope of scheduling decisions to fixed-size pools improves the scalability of the scheduler by making it independent of NCPUS. It also increases the localilty of tasks with respect to the CPUs on which they run.
For simplicity, the implementation of PMQS is based on minor modifications to MQS. In general, DSS code which looks at all CPUs using
for (i = 0; i <= smp_num_cpus; i++)
is replaced by
for (i = first; i <= last; i++)
where first and last denote the corresponding limits of the invoking CPU's
pool. The first phase of the schedule() routine of MQS is unchanged in
PMQS and is used to find the best local candidate task. In the second phase of
the routine, the only remote runqueues examined are those of CPUs within the
same pool as the invoking CPU. The examination of remote runqueues continues to
be done without holding a lock. In the reschedule_idle() function,
only those CPUs are treated as candidates which lie within the same pool as the
task's previous CPU. The only exception to this rule is when there are idle
CPUs outside the pool. Leaving a CPU idle when other CPUs have runnable tasks
leads to a complete waste of CPU cycles and is particularly undesirable in a
server environment. Therefore, PMQS looks for idle CPUs systemwide without
regard to pool boundaries. The hunt for idle CPUs uses a simple mechanism. A
global bitmap, called node_idle_bits, records the currently idle CPUs
in the system. The bitmap is maintained by the scheduler when it switches tasks
on any CPU. During schedule(), node_idle_bits is first masked
with a local pool mask to identify idle CPUs within the local pool. If none is
found, it is masked with a remote pool mask to identify idle CPUs in remote
pools. Consistent with the pooling philosophy, idle CPUs within a pool are
preferentially chosen over those outside the pool. If an idle CPU is found at
any stage, it is selected as the target CPU. Contrary to the behaviour of DSS
and MQS, PMQS makes no attempt to select the longest idle CPU as the target. If
no idle CPU is found, reschedule_idle() proceeds as normal, examining
max_na_goodness values of CPUs within the pool.