Experiments in Scheduling

kernel, scheduler, preemption, Round Robin

Note: this article was written by Daniel Reinhold who unfortunately left the Haiku project and thus doesn't have his own website account.

The OS scheduler is the heart (nearly literally) of any operating system, so it's got to be in good working order. It needs to be fast, efficient, and distribute the computer's resources fairly. But what does that really mean? What makes a good scheduler?

In order to better understand this critical component, I recently wrote a userland prototype that mimics a real scheduler. This is taking advantage of the 'friendly third option' in kernel development. That is, when you want to test (and/or otherwise examine) a piece of kernel level code, you either:

get a serial cable connected to a second machine and prepare to capture and scan thru monster amounts of debug statements
boot and run under a emulator such as Bochs (funny that I say "such as" -- as if there is another one to use)
write a userland stub that simulates the portion of the kernel that you're interested in and hook your code up to that

The last option is by far the most attractive to me. I definitely prefer staying in the warmth and cozy safety of userland. Besides, simulators are kind of fun to write. And the turnaround time while editing/compiling/debugging is orders of magnitude faster.

It does have some drawbacks: you have to write the simulation stub (which may be a fair amount of work, depending on what you need) and then you have the deal with the fact that your simulation may not model the real thing well enough. Therefore, you don't want to draw too many conclusions from a prototype running in a simulated kernel.

What doth a scheduler do?

The scheduler creates the illusion that multiple programs are running simultaneously on your computer. That's probably the most succinct definition that I can give.

You do know that it's an illusion, right? After all, at any point in time, you've got dozens (maybe even hundreds) of threads running, but only one CPU to run them on. Well, maybe two (or four) CPUs if you have one of those computers, but there is still hardly a one-to-one ratio between running threads and CPUs. Something has got to act as traffic cop, sharing the resources of the CPU(s) with all the running threads that need attention. That something is the scheduler.

The basic idea of a scheduler is this: you have a pool of threads waiting to be run. The scheduler selects one, based upon some criterion, and then runs it. Then another one is selected, and so on. Since computers can switch between threads very quickly, it gives the impression of simultaneous execution (just as watching video gives the appearance of seamless motion even tho, in truth, it's just a sequence of static individual frames). Hopefully, the selection method results in all the threads being allowed their "fair share" of the CPU (whatever that means).

Pool of Waiting Threads

A	B	C	D	.	.	.	.	.	.	.	Z

			^

		(Thread 'D' is selected to run)

Once you try to start filling in the details for this basic idea, however, things get complicated very quickly. Which threads get to go first (i.e. how should they be ordered)? How long should each run last? Do some threads need to run longer or more often than others? When (and how often) should scheduling decisions be made? And I haven't even brought up the issue of prioritizing.

Now, rather than continue this discussion in a theoretical fashion, I'll cover these topics within the framework of my demonstration scheduler. We can use it to play around with various scheduling ideas. And then actually observe the effects that each one has, rather than just talking about it and running thought experiments in the mind.

A tiny virtual machine

To create a scheduler, you need some basics things: a thread structure, a queue to store them in, a thread selection algorithm (at least one), and a machine to run them on. The machine definition itself requires a machine language and an execution environment for interpreting the machine code. Additionally, as we shall see, it needs support for interrupts and basic time reporting. To support all this, I created a simulated machine called 'tvm' (for 'tiny virtual machine').

After some thought and general messing about, I came up with following fairly natural division of components in the source code:

machine - the machine definition
kernel - the pseudo kernel (including thread creation/management)
scheduler - the scheduler itself and various scheduling policies
tvm - top-level component (read command line args and then start the machine)

They are all linked together somewhat. In particular, the machine definition knows about threads, which wouldn't be true on a real machine. But heck, no harm is done. It keeps the code simpler and doesn't affect the basic design of the scheduler in any substantiative way (that I am aware of).

Preemption

Before describing the tvm machine, I need to point out one thing that has a major impact on the design of the whole thing: the scheduler is preemptive. Preemption allows the scheduler to stop running threads because their allotted burst has finished. This requires the machine to support timer interrupts.

Schedulers can be non-preemptive. In this case, a thread will run until it is finished or it decides voluntarily to relinquish control. Operating systems which are non-preemptive require the cooperation of running programs to be "nice". It's the honor system. But if just one program is allowed to run that turns out to be a selfish, resource hogging pig, the rest of the system can be reduced to a crawl, and the OS can do nothing about it (the user will have to power off the computer).

This is too grotesque for me, so I have designed a preemptive scheduler. How does this work? Each thread is allotted a fixed time slice (or quantum) in which to run. Once that quantum is finished, the thread's burst is over, and another thread is scheduled to run. The quantum is enforced by timer interrupts.

The program contains a global value called TimerQuantum. It can be set on the command line when starting the scheduler. The default value is 3 (measured in machine ticks). This value determines the frequency of the timer interrupt. When this interrupt occurs, the scheduler is invoked and another thread can then be selected to run.

But what if the thread finishes before its allotted quantum has expired? No problem, actually. The thread simply generates a timer interrupt itself as it is dying. This will invoke the scheduler right away and allow another thread to run.

The machine definition

Ok, on to the machine definition. The tvm machine consists of:

a simple stack-based machine language
an interpreter to execute the machine code
a basic interrupt handling mechanism

Interrupt handlers are stored in the Interrupt Vector Table. The interrupts themselves are processed by the Interrupt Event Queue. The event queue acts like a controller: interrupts are generated by pushing an interrupt number onto the queue, and then handled by retrieving the number from the queue. A few software interrupts are included, such as stack {under/over}flows and general errors. But the only "hardware" interrupt (using the term loosely) supported is the timer interrupt.

Timer interrupts are generated by a machine_tick() routine. This routine simply increments a couple of tick counters. These machine ticks form the basis of the timing functionality: an absolute counter keeps the system time, and a relative counter generates the timer interrupts. The march of time is somewhat quirky: a few calls to machine_tick() are sprinkled thru-out the main machine loop. There is no attempt to measure the relative timing cost of each machine operation -- it is just assumed that each basic routine will consume one tick. It's not a very sophisticated system, but I think it does its task well enough for the scheduler simulation.

What about the machine language? Well, I decided to go with something very simple here: just a tiny stack machine that supports about half a dozen operations. For a brief time (very brief), I flirted with something more in line with a real PC, such as a tiny subset of the Intel instruction set. But fortunately, I tossed that idea away very quickly. Stack machines are simple to create and demonstrate enough properties of a real machine to be useful. And you don't have to worry about registers and memory operands: everything is just pushed on and popped off the stack.

Furthermore, to keep things really simple, all that this simulated machine does is output characters and loop (to automate the output of multiple characters). This way, it will be easy to detect the effects of running threads, observe switching between them, etc. To get a feel for the machine code, consider the following small program:

 
op prog[] = {'$', opEmit, opHalt};

This program will emit a single '$' character and then halt. The 'op' type (which stands for opcode) is just a typedef for int, so the machine code can be easily placed into integer arrays. All programs for this machine will end with the 'halt' instruction to signify that the execution is finished.

An idle thread

Every computer needs an idle thread. That is, the scheduler has to have something at all times to schedule, so at startup time, an idle thread is created and run. It doesn't do anything itself. It just guarantees that the scheduler will have at least one thread in the queue. That's its whole purpose in life. Here's the idle thread that I created for my simulated machine:

 
op idle[] = {1, opLoop, opInc, opEndLoop, opHalt};

How does it work? The 'loop' instruction pops the stack to get the counter, i.e. the number of times to run the loop. In this case it finds the 1. It decrements the counter and pushes it back, along with it's own index. Next, the 'inc' instruction increments the top stack entry by one. Thus, the loop counter, which was down to 0, is back up to 1 again. The 'endloop' instruction just jumps backward to the beginning of the loop (by setting the instruction pointer to the index that 'loop' had previously saved).

In other words, the loop counter keeps toggling endlessly between 0 and 1. The 'halt' instruction is never reached. This is equivalent to the following C code:

 
i = 1;
loop:
if (i-- > 0) {
    ++i;
    goto loop;
}
halt:
exit(0);

The thread data structure

With a machine language defined, we now need a thread to encapsulate the code. Which is what a thread is -- the smallest self-contained structure that can be executed. Here is the definition of a tvm_thread:

 
// the thread structure

typedef struct
    {
    char *name;         // thread name (allocated dynamically)
    int   id;           // a unique id
   
    int   state;        // thread state (BIRTH, READY, RUNNING, DEATH)
    int   priority;     // thread priority (IDLE, LOW, NORMAL, HIGH)
    int   prio_adjust;  // priority adjustment
        
    op   *code;         // base address of the thread's machine code
    int   ip;           // instruction pointer (index into the machine code)
  
    op    stack[10];    // thread local stack
    int   sp;           // stack pointer
     
    thread_stat st;     // thread stats
    }
tvm_thread;

The thread name is, well, the thread name. Actually, it's not truly needed here, but it doesn't hurt anything and makes thread dumps more readable. The id, of course, is what truly identifies the thread: while we humans prefer textual names, the OS simply looks up threads by their id. Like most systems (probably) these thread id's are just generated sequentially (the first thread is 0, the next thread is 1, and so on.)

The thread state determines the life cycle of the thread. There are certainly more possible states than the ones I've allowed for in the tvm scheduler, but that's in keeping with the simple design. More on the priority and priority adjustment in a bit.

The 'code' and 'ip' members reference the machine code for execution. The 'code' member points to the base address of the (array of) opcodes, and the 'ip' member is the instruction pointer -- i.e. points to the next opcode to be interpreted.

Each thread contains its own local stack. This is merely expediency. Normally, a thread would just include pointers to the start and end of it assigned frame on the system stack. But I found it easier to avoid the joys (?) of carving up a global data structure, so each thread gets its own 10-opcode deep stack. More than big enough for the simple test threads that I've created. The 'sp' member is the stack pointer. It always points to the top of the thread's stack.

The thread stat structure holds a few statistics that enable the scheduler's performance to be measured. I won't cover this here.

The Thread Queue

Once threads have been created, they need to be placed into the Thread Queue. This way, the scheduler can find them! All threads in the queue will be marked with a state that indicates their readiness to be scheduled. Actually, in the case of tvm, the threads are almost always ready.

For a heavy duty thread queue capable of handling a large number (and range) of threads, you should have a dynamically allocated structure. But for my little simulator, I found it easiest to go with a simple fixed size array of thread pointers. This makes inserting and removing threads trivial: a new thread is allocated and its pointer is stored in the next available slot. When the thread is finished, it's resources are freed and a NULL is written back into the slot.

Initially then, the queue is just an array of NULL pointers. Of course, at startup time, the idle thread is created and inserted, so at least that pointer will always be present.

A global index called 'ThreadNext' is used to keep track of the location in the queue. As each thread is scheduled, the index is incremented. When it hits the maximum array slot, it just wraps around to index 0. This makes the array behave as a circular queue.

Thread Queue

After inserting the idle thread

ThreadNext ->

&idle

NULL

. . .

NULL

Thread Queue

After inserting several more threads

ThreadNext ->

&idle

&thread1

&thread2

NULL

. . .

NULL

The Scheduler

The scheduler is invoked regularly, via a timer interrupt. At that point (except for the very first schedule call), there will be a currently running thread. This thread must be stopped, and another thread selected to run (altho, in fact, it might turn out to be the same thread again). Once the next thread is selected, a context switch is made -- that is, the machine's execution environment is adjusted so that the next thread's machine code will be evaluated by the machine interpreter.

Therefore, I have decided that the following basic steps are the foundation of any thread scheduler:

'Unschedule' the current thread
Select the next thread
Switch to the next thread

This is a bit vague, but really, that's all there is to it! Switching to and from particular threads involves changing their state. Here are the thread states defined in tvm:

 
enum thread_states
    {
    BIRTH_STATE,          // just created, not yet in the queue
    READY_STATE,          // sitting in the queue (not running)
    RUNNING_STATE,        // currently executing
    DEATH_STATE           // execution is finished, marked for removal from queue
    };

Most of the time, threads sitting in the Thread Queue will be marked as "ready". Only one thread at a time will actually be in the "running" state (remember -- multiple programs running at the same time is only an illusion!). 'Unscheduling' the current thread basically means moving its state back to "ready". Unless the thread has finished, in which case its state will be "death" and the scheduler will remove it from the Thread Queue immediately.

Switching to the next thread is similarly trivial. The machine interpreter keeps a global called CurrentThread. A context switch merely involves resetting this global to the next thread that has been selected.

Therefore, most of the real work, in terms of scheduling decisions, is made in the middle step -- i.e. selecting the next thread. I decided while writing tvm that I didn't want to wire in just one algorithm for doing this selection, but rather allow a range of possibilities. So I created several scheduling policies. These are just various routines for determining how to pick the next thread for scheduling. A global hook function called get_next_thread() is set to point to one of these routines based on which command line option is specified.

Round Robin

If you have a preemptive scheduler, then you can't overlook Round Robin. This is the simplest and probably most fair (or rather, democratic) scheduling policy there is. It does nothing more than go from thread to thread, allotting each their own quantum.

For example, if you have threads A, B, and C running, then it just schedules A, then B, then C, then A, then B, then C, etc. on and on. Each thread gets completely equal access to the CPU. Here's Round Robin implemented in my scheduler:

 
static tvm_thread *
gnt_round_robin (void)
    {
    int i;
    int slot = ThreadNext;
 
    for (i = 0; i < MAX_THREADS; ++i)
        {
        tvm_thread *t = ThreadQ[slot++];
        if (slot >= MAX_THREADS)
            slot = 0;
              
        if (t && t->state == READY_STATE)
            {
            ThreadNext = slot;
            return t;
            }
        }
   
    return NULL;
}

The problem with Round Robin is that it is a bit too democratic. After all, does every thread deserve equal time? Consider that you have two threads in the queue, and one of them is (of course) the idle thread. Does the idle thread really need to have 50% of the running time? Of course not! It's a do-nothing thread and should be relegated to the background. It's a low priority operation. Aha! The notion of thread priorities.

Prioritizing

In the world of threads, there is the idea of priorities. Simply put, some threads are more important than others. For example, you really don't want the mouse or keyboard to become sluggish and lag in their response because some background process (like the mime sniffer, for example) starts to run. Thus a tiered hierarchy is defined for threads from low to high (or urgent) priority.

Categorizing threads by their urgency is great in theory, but a bit hard to determine in practice. Just exactly what kind of activity is most important? For a desktop OS, which must regularly interact with a person, user-oriented IO is usually near the top of the list. Or things which are very sensitive to CPU attention, such as running video codecs.

Well, you can't play video in tvm, so I don't dwell on this too much. Here's all the options that you get in my simulator:

 
enum thread_priorities
    {
    IDLE_PRIO,        // the lowest of the low (pond scum of the thread world)
    LOW_PRIO,         // lowly, but honorable
    NORMAL_PRIO,      // just your average thread
    HIGH_PRIO         // must be a VIT (Very Important Thread)
    };

Strict priority and starvation

Alright then, prioritizing threads is useful. So let's add a scheduling policy that is priority based. The first stab at this is a policy called "Priority Strict". This means that thread priority is the only criterion for selecting the next thread. It scans thru the Thread Queue, first looking for a HIGH priority thread to run. If not found, then it searches again for a NORMAL priority thread. Then for a LOW priority. Finally, an IDLE priority is accepted, if nothing else can be found.

This policy is harsh. It only has acceptable run time characteristics if higher priority threads always finish quickly -- or at least ahead of the lower category threads. If not, then you get "starvation". This is where the higher priority threads hog up all the CPU time, while lower priority threads languish for long periods without being run.

This simple minded priority based policy is included in the tvm scheduler only to demonstrate its bad effects and to show why you wouldn't want to actually use it for a real scheduler. Scheduling by priority is fine, but you need to have some kind of balancing mechanism to ensure that high priority threads don't steal all the processor time.

Avoiding thread starvation

I have included three variations of techniques for overcoming "thread starvation" because of priority based scheduling.

Aging
Leveling
Random Promotion

Aging refers to the idea that threads that have been overlooked (because they are not high enough priority) are "aging" or "decaying". So to compensate, whenever a thread is rejected, it is examined to see how long it has been waiting, marked as "ready" in the Thread Queue. If this waiting time exceeds some particular threshold, then its priority is boosted in the hope that this will improve its chances of being selected on the next round.

If low priority threads can be boosted when neglected, then why not do the inverse and "punish" high priority threads that have been given too much attention? That is, lower the priority of a thread hog. A reasonable idea, but... since a thread with a higher priority must be doing some kind of important work, you wouldn't want to have it reduced for too long, or you throw away the advantages of prioritizing. So it makes sense to combine any priority reduction technique along with aging.

Therefore, the next available policy is called "leveling". This refers to the fact that both lower priority threads can be boosted and higher priority threads can be diminished, thus giving somewhat of a "leveling out" of priorities. But you don't want things to get too level, because, again, the priority categories are useful. So for my implementation of leveling, the priority reduction is only in effect until the next selection. At that point, the priority returns to its original value.

For both aging and leveling, the priority change is held in the thread's 'prio_adjust' field. This can be either positive or negative, based on whether it is a boost or a reduction. In the case of both policies, the change in priority only lasts until the thread is selected again. Being selected is taken as an indicator that the priority adjustment has done its work and can be tossed away. So the 'prio_adjust' field is set back to 0 and the original priority is assumed.

The final compensation technique, which I call Random Promotion, is quite simple. When it comes time to select the next thread, then once in awhile (at some random interval) the next thread available is selected, no matter what its priority. Exactly how to define what the random interval should be is not easy -- you could try to pick an optimal value thru experimentation. I settled on using the value 5 or the ThreadCount-1 (which ever is greater). That is, about every fifth time the scheduler is invoked, it will select the next thread (marked as "ready") in the queue, regardless of its priority. This does a pretty good job of letting some lower priority threads run once in awhile without stopping the higher priority threads from getting their (higher) share of attention.

The obligatory caveats

Running the tvm scheduler can be fun. A few sample threads are created in the kernel startup, dumped, and then run. A message is displayed each time a thread dies -- this makes it easier to tell which threads are still affecting the output. When all but the idle thread is finished, a summary of the usage stats is displayed. Since the idle thread never ends, you have to kill the program with Cmd-C (Ctrl-C or Alt-C depending on your keyboard setup).

By running different policies and adjusting the timer quantum, you can see variation in the output and the usage stats. It's difficult to use this output, however, to make any conclusions about which policy is best. There is likely no such thing as a best scheduling policy -- it is too dependant on the run-time environment and the specific set of threads being run. Put another way, there are too many knobs controlling the behavior. Simply adjusting the implementation of the policies, even by something as basic as changing a threshold value, for example, will alter the results. Try it and see.

Also, as lengthy as this article has turned out to be, and as many things as appear to have been stuffed into the simulator, it is still very simplistic and limited. I never got around to adding, for example, a BLOCKED_STATE for threads. This is needed for something as trivial as asking a thread to sleep (e.g. adding an 'opSleep' opcode). Not to mention the possibility of threads engaging in some kind of disk or port IO that would require blocking them. Which would, in turn, require the presence of semaphores for serializing access to these global resources.

And what about communication? Threads in a real OS are engaged in communication all the time. There are the classic signals from Unix, and BeOS adds its own flavor of inter-thread communication. This would then entail adding a WAITING_STATE for communication, or perhaps two states: SENDING_STATE, and RECEIVING_STATE. And so on.

True threads exist in rich world of interaction, likes thousands of bees floating from flower to flower in a meadow. The threads in the tvm simulator, however, are more like little islands unto themselves. In a way, that's not bad. It makes understanding the basics of scheduling easier, which is the whole point. And in any event, you've got the code to play with, so you can extend it any way that you can think of. It's fun.

Source Code:
tvm_scheduler.zip

Attachment	Size
tvm_scheduler.zip	11.4 KB