Fair CPU Scheduler for Linux Borislav Deianov Last updated on 10 July 2000 This is an implementation of hierarchical fair CPU scheduling for Linux. This document introduces the concept of hierarchical fair scheduling and discusses the implementation and its relationship to other works. It also describes the user-space interface (system calls), although that part will be eventually moved to proper man pages. At the end come acknowledgements and the unavoidable legalese. Hierarchical Fair CPU Scheduling -------------------------------- Informally, a fair CPU scheduler allocates CPU time to processes in proportion to preassigned weights. A more precise description follows. Partition the processes in the system into several groups A, B, and so on. Each process belongs to exactly one group. Let weight(X) be the assigned weight of group X, weight(X) > 0. Assume that for some length of time two groups A and B always contain runnable processes and let cpu(A) and cpu(B) be the total amount of CPU time allocated to groups A and B, respectively, over that length of time. Then cpu(A)/cpu(B) should be (approximately) equal to weight(B)/weight(A). Note 1: We make no assumptions about the length of time that each individual process remains runnable; the runnable processes in each group may change over time. Note 2: The CPU time in the fair scheduler described above is in REVERSE proportion to the weights of the groups. This conflicts with the usual intuition of a "heavier" process/group being "more important". The reason for this change lies in the peculiarities of the scheduling algorithm and its implementation. The intuition can be changed as follows: a lighter process can move faster than a heavier one. Example: -------- weight 3 weight 2 weight 1 ,--------. ,--------. ,--------. | p1, p2 | | p3 | | p4, p5 | `--------' `--------' `--------' Group A Group B Group C If the above system runs for 11 seconds and all depicted processes remain runnable, then p1 and p2 should run for a total of 2 seconds, p3 should run for 3 seconds and p4 and p5 should run for a total of 6 seconds. The relative running times of p1 and p2 (and p4 and p5) are determined by some further scheduling algorithm and do not affect the fairness property. Hierarchical fair scheduling takes the above one step further. Consider a tree structure. Groups of processes correspond to leaf nodes. Each subtree is assigned a positive weight. We require that the fairness property is observed at each level of the tree, i.e. the amount of CPU time received of two subtrees of the same node is in inverse proportion of their respective weights. Example: -------- root (3) | (2) ,----------^----------. | | (1) | (3) ,----^---. ,-----^------. | p4, p5 | | | `--------' ,----^---. ,----^---. Group C | p1, p2 | | p3 | `--------' `--------' Group A Group B The weights of each subtree are given in brackets above it. If the above system runs for 10 seconds, then p1 and p2 should run for a total of 3 seconds, p3 should run for 1 second, p4 and p5 should run for a total of 6 seconds. As is evident from the above examples, the fairness property does not fully specify the behavior of the CPU scheduler. This is not surprising, since the requirements for the scheduler of a general- purpose OS like Linux are numerous and often contradictory (short response times for interactive tasks, good throughput for CPU-bound tasks, small scheduling overhead, to mention a few). The standard Linux scheduler already does a pretty good job in most usage scenarios and it would be unwise to discard it in order to provide the fairness feature outlined above. This implementation combines the best of both worlds - a fair scheduling algorithm is used to achieve the fairness properties while the standard Linux scheduler is used for all remaining scheduling decisions (scheduling processes within a group, preemption decisions, etc). What's been done before and what's new -------------------------------------- This implementation uses a variant of the Start-time Fair Queuing (SFQ) algorithm. The SFQ algorithm is described in the following paper: P. Goyal and X. Guo and H.M. Vin, A Hierarchical CPU Scheduler for Multimedia Operating Systems, Proceedings of 2nd Symposium on Operating System Design and Implementation (OSDI'96), Seattle, WA, pages 107-122, October 1996. http://www.cs.utexas.edu/users/dmcl/papers/ps/OSDI96.ps Another implementation of a hierarchical fair scheduler (based on the algorithm in the above paper) is available as part of the QLinux project: http://www.cs.umass.edu/~lass/software/qlinux/ The major differences between this implementation and the QLinux implementation are as follows: - SMP support. This implementation fully supports SMP via modifications to the original SFQ algorithm. QLinux does not support SMP; - Incompatible user-space interface (system calls). Since fair scheduling is not yet a widely used feature, I chose to clean up and simplify the interface; - Feature set. I have dropped some of the features of the QLinux implementation, for various reasons. In particular, this implementation does not support different leaf schedulers (all leaves use the standard Linux scheduler), named nodes and non-integer weights. User-space interface -------------------- The following is a feeble attempt to describe the system call interface. This will eventually migrate to proper man pages, until then look in the code for details. The tree structure described above consists of numbered nodes. There are two nodes that always exist: node 0 is the root of the tree, node 1 is the leaf node that all processes belong to by default. Other nodes may be created or removed via system calls. Every process in the system belongs to exactly one leaf node. Children of a process belong to the same node as the parent, unless they are subsequently moved to a new node. Each node in the tree except the root is assigned a positive integer weight. The default weight of node 1 is 1. The following system calls are provided: long fairsched_mknod(unsigned int parent, unsigned int weight, unsigned int desired) -- Creates a new node as a child of node parent, with initial weight weight. If desired is non-zero, tries to create a node with that id and returns 0 on success. If desired is zero, chooses a free id and returns it on success. long fairsched_rmnod(unsigned int id) -- Removes a node. The node must currently be a leaf (not have any children) and contain no processes. long fairsched_chwt(unsigned int id, unsigned int weight) -- Changes the weight of a node with a given id. long fairsched_mvpr(pid_t pid, unsigned int newid) -- Moves a process to a new node. You can't move running processes, except the current process can move itself. Unless otherwise noted, 0 is returned on success or a negative value on error. All fairsched_* calls require root permissions (actually the CAP_SYS_NICE capability). If the /proc filesystem is mounted, /proc/fairsched gives a listing of the current scheduling tree. The format of this file will probably change numerous times while the implementation develops, so I will not attempt to describe it until it stabilizes a bit. For each process /proc//status contains a new line "FNid: " that gives the id of the node that process belongs to. Acknowledgements and legalese ----------------------------- This work was inspired by the QLinux scheduler by Pawan Goyal, Jasleen Kaur Sahni and T R Vishwanath. Thanks to Pawan Goyal for explaining the SFQ intricacies and encouraging me to do this. Special thanks to WayForward Technologies. Copyright (C) 1999-2000 Borislav Deianov Copyright (C) 2000 Ensim Corporation This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.