Fair CPU Scheduler for Linux

        Borislav Deianov <borislav@users.sourceforge.net>
                  Last updated on 10 July 2000


This is an implementation of hierarchical fair CPU scheduling for
Linux. This document introduces the concept of hierarchical fair
scheduling and discusses the implementation and its relationship to
other works. It also describes the user-space interface (system
calls), although that part will be eventually moved to proper man
pages. At the end come acknowledgements and the unavoidable legalese.


Hierarchical Fair CPU Scheduling
--------------------------------

Informally, a fair CPU scheduler allocates CPU time to processes in
proportion to preassigned weights. A more precise description follows.

Partition the processes in the system into several groups A, B, and so
on. Each process belongs to exactly one group. Let weight(X) be the
assigned weight of group X, weight(X) > 0. Assume that for some length
of time two groups A and B always contain runnable processes and let
cpu(A) and cpu(B) be the total amount of CPU time allocated to groups
A and B, respectively, over that length of time. Then cpu(A)/cpu(B)
should be (approximately) equal to weight(B)/weight(A).

Note 1: We make no assumptions about the length of time that each
individual process remains runnable; the runnable processes in each
group may change over time.

Note 2: The CPU time in the fair scheduler described above is in
REVERSE proportion to the weights of the groups. This conflicts with
the usual intuition of a "heavier" process/group being "more
important". The reason for this change lies in the peculiarities of
the scheduling algorithm and its implementation. The intuition can be
changed as follows: a lighter process can move faster than a heavier
one.

Example:
--------

 weight 3     weight 2     weight 1
,--------.   ,--------.   ,--------.
| p1, p2 |   |   p3   |   | p4, p5 |
`--------'   `--------'   `--------'
 Group A      Group B      Group C

If the above system runs for 11 seconds and all depicted processes
remain runnable, then p1 and p2 should run for a total of 2 seconds,
p3 should run for 3 seconds and p4 and p5 should run for a total of 6
seconds. The relative running times of p1 and p2 (and p4 and p5) are
determined by some further scheduling algorithm and do not affect the
fairness property.

Hierarchical fair scheduling takes the above one step further.
Consider a tree structure. Groups of processes correspond to leaf
nodes. Each subtree is assigned a positive weight. We require that the
fairness property is observed at each level of the tree, i.e. the
amount of CPU time received of two subtrees of the same node is in
inverse proportion of their respective weights.

Example:
--------
                     root
          (3)         |         (2)
           ,----------^----------.
           |                     |
    (1)    |     (3)        ,----^---.
     ,-----^------.         | p4, p5 |
     |            |         `--------'  
,----^---.   ,----^---.       Group C
| p1, p2 |   |   p3   |
`--------'   `--------'
 Group A      Group B


The weights of each subtree are given in brackets above it. If the
above system runs for 10 seconds, then p1 and p2 should run for a
total of 3 seconds, p3 should run for 1 second, p4 and p5 should run
for a total of 6 seconds.

As is evident from the above examples, the fairness property does not
fully specify the behavior of the CPU scheduler. This is not
surprising, since the requirements for the scheduler of a general-
purpose OS like Linux are numerous and often contradictory (short
response times for interactive tasks, good throughput for CPU-bound
tasks, small scheduling overhead, to mention a few). The standard
Linux scheduler already does a pretty good job in most usage scenarios
and it would be unwise to discard it in order to provide the fairness
feature outlined above. This implementation combines the best of both
worlds - a fair scheduling algorithm is used to achieve the fairness
properties while the standard Linux scheduler is used for all
remaining scheduling decisions (scheduling processes within a group,
preemption decisions, etc).


What's been done before and what's new
--------------------------------------

This implementation uses a variant of the Start-time Fair Queuing
(SFQ) algorithm. The SFQ algorithm is described in the following paper:

P. Goyal and X. Guo and H.M. Vin, A Hierarchical CPU Scheduler for
Multimedia Operating Systems, Proceedings of 2nd Symposium on
Operating System Design and Implementation (OSDI'96), Seattle, WA,
pages 107-122, October 1996.
http://www.cs.utexas.edu/users/dmcl/papers/ps/OSDI96.ps

Another implementation of a hierarchical fair scheduler (based on the
algorithm in the above paper) is available as part of the QLinux
project: http://www.cs.umass.edu/~lass/software/qlinux/

The major differences between this implementation and the QLinux
implementation are as follows:

- SMP support. This implementation fully supports SMP via modifications
  to the original SFQ algorithm. QLinux does not support SMP;

- Incompatible user-space interface (system calls). Since fair
  scheduling is not yet a widely used feature, I chose to clean up and
  simplify the interface;

- Feature set. I have dropped some of the features of the QLinux
  implementation, for various reasons. In particular, this
  implementation does not support different leaf schedulers (all
  leaves use the standard Linux scheduler), named nodes and
  non-integer weights.


User-space interface
--------------------

The following is a feeble attempt to describe the system call
interface. This will eventually migrate to proper man pages, until
then look in the code for details.

The tree structure described above consists of numbered nodes. There
are two nodes that always exist: node 0 is the root of the tree, node
1 is the leaf node that all processes belong to by default. Other
nodes may be created or removed via system calls.

Every process in the system belongs to exactly one leaf node. Children
of a process belong to the same node as the parent, unless they are
subsequently moved to a new node. Each node in the tree except the
root is assigned a positive integer weight. The default weight of node
1 is 1.

The following system calls are provided:

	long fairsched_mknod(unsigned int parent, unsigned int weight,
				unsigned int desired) 

	     -- Creates a new node as a child of node parent, with initial
		weight weight. If desired is non-zero, tries to create
		a node with that id and returns 0 on success. If desired
		is zero, chooses a free id and returns it on success.

	long fairsched_rmnod(unsigned int id)

	     -- Removes a node. The node must currently be a leaf (not have
		any children) and contain no processes.

	long fairsched_chwt(unsigned int id, unsigned int weight)

	     -- Changes the weight of a node with a given id.

	long fairsched_mvpr(pid_t pid, unsigned int newid)

	     -- Moves a process to a new node. You can't move running
                processes, except the current process can move itself.

Unless otherwise noted, 0 is returned on success or a negative value
on error. All fairsched_* calls require root permissions (actually the
CAP_SYS_NICE capability).

If the /proc filesystem is mounted, /proc/fairsched gives a listing of
the current scheduling tree. The format of this file will probably
change numerous times while the implementation develops, so I will not
attempt to describe it until it stabilizes a bit. For each process
/proc/<pid>/status contains a new line "FNid: " that gives the id of
the node that process belongs to.


Acknowledgements and legalese
-----------------------------

This work was inspired by the QLinux scheduler by Pawan Goyal, Jasleen
Kaur Sahni and T R Vishwanath. Thanks to Pawan Goyal for explaining
the SFQ intricacies and encouraging me to do this.

Special thanks to WayForward Technologies.

Copyright (C) 1999-2000  Borislav Deianov
Copyright (C) 2000  Ensim Corporation

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or (at
your option) any later version.