# Going back to the basics... kinda.

Blog post by meianoite on Fri, 2007-06-22 05:52

(Or: what you finally learn to do after a number of false starts)

But first, the obligatory disclaimer, in case you missed it. Thanks.

After I admitted the basic flaws of my previous attempts, I decided to go back to the basics and use a "bottom-up" approach to come up with a solution that suited the needs of the Haiku scheduler: fairness, responsiveness, scalability, succinctness and efficiency.

Right. So let's keep in mind that the BeBook already hints at what the priority numbers should roughly mean: frequency of scheduling. We have 120 priority levels; the highest 20 are reserved for real-time threads.

This post is not going to address these directly. Instead, today I'll be telling the background story of how quite a number of pieces are being woven together to fulfill those specific needs and constraints. OK, enough metablogging ;)

After letting go of the previous approach, I decided to study the subject in the traditional literature. And while I was studying classic scheduling approaches in the Minix book, I noticed that the description of lottery-based scheduling fit very well with that notion of priorities expressing the approximate frequency of being chosen by the scheduler. Lottery scheduling was proposed by Carl Waldspurger in 1994, and it works roughly like this[1]:

Let's use a supply vs. demand analogy. Think of a very popular store you know, and suppose there's a big sale going on. Only one person can go inside the store at any time, and only when someone else gets out of it. The people outside refused to form a line, but they accepted participating in a lottery-like system, where numbers are drawn randomly and the person holding the winning ticket gets inside the store.

Suppose you could assign different priorities to people with special needs; for example, the elderly and the pregnant could receive more than one ticket to increase their chances of going inside sooner than others.

Seems fair, right? So let's go back to computers. Suppose there exists some resource that is available in limited quantities (like CPU power, I/O bandwidth, network bandwidth, and so on). Suppose there are several processes competing for those resources, and they can have different priorities. Now, suppose that you could assign those processes "tickets" that can be used to organise how they will be served.

A ticket corresponds to a slot of time during which a process can make use of those limited resources. Like in a lottery, the winning number is drawn at random, and the process holding the winning ticket is allowed to use the resource. Suppose it's possible to give many tickets to a process; it will then be allowed to use those limited resources proportionally to the number of tickets it holds. So processes holding 20 tickets will be allowed inside 20 times, processes holding 10 tickets, 10 times; and so on.

Except that those tickets don't expire, and are good for as many trips inside as the process wishes. This effectively means that, over its lifetime, a process holding 20 tickets will be allowed inside twice as many times as one holding 10 tickets. As time goes by, each process will have used the resource proportionally to the number of tickets it holds. Cool, eh? Priorities can be easily implemented by simply handing a process a number of tickets proportional to what its priority should be.
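The ticket-drawing mechanism described above can be sketched in a few lines of Python. This is just a toy simulation, not scheduler code; the process names and ticket counts are illustrative:

```python
import random
from collections import Counter

def lottery_pick(tickets):
    """Draw one winning ticket; `tickets` maps a process name to its ticket count."""
    winner = random.randrange(sum(tickets.values()))
    for proc, count in tickets.items():
        if winner < count:
            return proc
        winner -= count

random.seed(7)  # fixed seed, only so the toy run is reproducible
tickets = {"A": 20, "B": 10}
wins = Counter(lottery_pick(tickets) for _ in range(30000))
# In the long run A wins about twice as often as B, matching the 20:10 ratio.
```

Note that any single short run can look quite lopsided; the 2:1 proportion only emerges statistically, which is exactly the weakness discussed further below.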

I had to stop for a moment to really appreciate how elegant this is. I mean it; I just felt like staring at the void for a good while until things clicked together. It seems that people have no problem understanding priorities when dealing with a one-shot event (like a queue at a bank, where the elderly, the pregnant etc. have higher priority to be attended to by the cashier), but priorities when dealing with cyclical events are not this straightforward. The idea of being cyclically served, with the cycles expressed as temporal frequencies, and noticing that the ratios between frequencies can be understood as differences in priority, is very simple, very powerful, and very elegant. The concept of fairness stems naturally from it.

Not to mention how plainly intuitive this is. Isn't it great to have something palpable to relate to when discussing priorities? Isn't it better to think in terms of physical tickets instead of cold numbers?

Well, great!

I really wished things were just this simple.

So far, I've been trying to avoid sounding too technical, because until now most of the feedback I received came from a non-technical audience. Sorry, folks, I can't avoid going a little deeper into the technical details right now. But I'll try my best not to make things sound too hairy.

Lottery scheduling sounds very nice in theory, but there are some problems. The most obvious problem is the randomness: the first several rounds of scheduling will resemble anything but the proportion we'd like to attain. We only observe the desired ratios in the long run.

A possible solution to this was proposed by Waldspurger himself in 1995, when he introduced stride scheduling. In a nutshell, he replaced random lotteries with strides and passes. Every process has a property called "pass" and a property called "stride". Strides are inversely proportional to priority. Each time a process is chosen, its pass is increased by its stride. The scheduler always picks the process with the lowest pass[2].
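In Python, that pass/stride bookkeeping might look like the following toy sketch. The large constant used to keep strides integral and the tie-breaking rule are my own illustrative choices, not part of Waldspurger's formulation:

```python
from collections import Counter

def stride_schedule(priorities, rounds):
    """Each task gets a stride inversely proportional to its priority and a
    running 'pass'.  Every round the lowest pass wins and advances by its stride."""
    BIG = 10_000  # common numerator, only so the strides stay integral
    stride = {t: BIG // p for t, p in priorities.items()}
    passes = {t: 0 for t in priorities}
    order = []
    for _ in range(rounds):
        winner = min(passes, key=lambda t: (passes[t], t))  # ties broken by name
        order.append(winner)
        passes[winner] += stride[winner]
    return order

order = stride_schedule({"A": 1, "B": 2, "C": 3}, 12)
# Counter(order) comes out as A: 2, B: 4, C: 6 -- exactly the 1:2:3 ratio,
# and with no randomness: the proportions hold from the very first rounds.
```

The contrast with the lottery is the point: the same proportions, but deterministically, without waiting for the law of large numbers to kick in.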

The second problem lies in the mapping between tickets and processes. Searching for a ticket by visiting every process is very inefficient. We could use hash tables, or trees, to do this indexing. I'll tackle this in a following post.

The third problem is that one of the requirements of Haiku's scheduler is that it should have constant (O(1)) complexity with respect to the number of processes (threads, in our case). I didn't manage to find[3] whether Waldspurger's work touches this issue at all, but at least one implementation used a tree to organise the threads. And there was this thing called "Hierarchical Stride Scheduling" in his thesis, and the sound of those words rang a bell here. I'll address this one right in the next post.

The fourth and worst problem is that both lottery- and stride-based scheduling are great for processes that want to use as much of the limited resource as possible during their time share. In the case of CPU power, those would be CPU-bound threads. BeOS and Haiku are desktop operating systems, where interactive threads are prevalent. This means that there will be plenty of threads that spend most of their time waiting for user input instead of using the CPU.

Uh, quite a handful of tricky problems. Kind of discouraging, no?

But wait, all is not lost ;) There are solutions for all those problems. I have come up with some twists that I'd like to share with you.

I'm convinced that even if stride scheduling as expressed in Waldspurger's thesis can be less than optimal for interactive tasks (i.e., it doesn't approach the static version of shortest-remaining-time scheduling, which is proven to provide the best possible average response times), there's no doubt in my mind that the idea of using strides results in perfect fairness when all tasks are equal. One can actually prove this mathematically. Not necessarily his stride + pass approach, though. Simple strides will do.

For example, for modeling purposes, suppose you have an array where you store a task T as many times as T's priority (let's call this the temporal frequency property, or TFP). So let task A have priority 1, task B priority 2, and task C priority 3. One possible array where the TFP holds is
A B B C C C
but so does any permutation of this array. If you multiply the occurrences of every element by a constant factor, for example 2, the array becomes
A A B B B B C C C C C C.
The TFP still holds for this array, as well as for any possible permutation of it. This is the key of the discussion that follows, so please take a moment to convince yourself that this is true.

It turns out that any array for which the TFP holds can be reduced to the "trivial" array by ordering it by task (i.e., making it "non-permuted") and "scaling" it so that each task appears its_priority / Greatest_Common_Divisor(A.prio, B.prio, C.prio, ...) times.

In other words, doing exactly what we do when reducing fractions. Eh. Sorry if that sounded scary.
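That fraction-reducing step can be written down directly. A small sketch (the helper name and the dict-of-priorities representation are mine, for illustration):

```python
from functools import reduce
from math import gcd

def trivial_array(priorities):
    """Build the reduced TFP array: each task appears (its priority divided by
    the GCD of all priorities) times, in task order."""
    g = reduce(gcd, priorities.values())
    return [t for t in sorted(priorities) for _ in range(priorities[t] // g)]

# Priorities 2, 4 and 6 reduce to the same trivial array as 1, 2 and 3,
# exactly like reducing the fraction 2/4/6 to 1/2/3:
trivial_array({"A": 2, "B": 4, "C": 6})  # -> ['A', 'B', 'B', 'C', 'C', 'C']
```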

Nice, so now we have an invariant :) So, for clarity, let's keep the trivial array,
A B B C C C.

Ideally, we'd like those three tasks to be run with a fixed temporal frequency, to ensure fairness and to have smooth response times. If we were to access this array sequentially, the ideal permutation of this array would be
C B A C B C
or some slight variation of it. However, finding such a permutation when the array is much bigger than this one (which would be the common situation in our case) is not a cheap operation.

However, we can do some math tricks to try to approach this ideal situation. In Computer Science, those kinds of tricks are called heuristic algorithms.

Notice that any strided traversal of this array, modulo its size (which is identical to the sum of all priorities), where the stride is coprime to the size of the array (let's call this the Good Stride Property, or GSP), will return a given task proportionally to the number of times it appears in the array, divided by the size of the array (i.e., the sum of the priorities of all tasks). (Alternatively, if we don't scale the array, it's enough to calculate the stride using the trivial array, and then scale the stride using the GCD.)

So in our example A will appear 1/6 of the times, B, 2/6 (1/3) of the times, and C, 3/6 (1/2) of the times. For any stride for which the GSP holds.

Trivial case, stride = 1: coprime, GSP holds. Traversing the array outputs A, B, B, C, C, C, A, B, B, ...
Stride = 2: not coprime, GSP doesn't hold. Would output A, B, C, A, B, C, ..., and the proportions are gone.
Stride = 3: not coprime, GSP doesn't hold. Would output A, C, A, C, A, C, ..., and B doesn't even show up!
Stride = 4: not coprime, GSP doesn't hold. Would output A, C, B, A, C, B, ..., same situation as stride = 2.
Stride = 5: coprime, GSP holds. Outputs A, C, C, C, B, B, ...
Stride = 6: not coprime (doh).

(Well, unfortunately 3 tasks with respective priorities of 1, 2 and 3 are a terrible example: their sum is 6, only 1 and 5 are coprime to 6, and the modular distance of both 1 and 5 to 6 is 1... But I chose them on purpose, because they're simple enough to make the example clear, and also because they already prove that no algorithm can return a stride that displays this even-ness property in every possible case, for the simple fact that such a stride might not even exist.)

(And notice that actually it would suffice to test only half of the possible strides smaller than sum_of_priorities (let's call it SoP from now on), because since the strides are modulo SoP, the second half just mirrors the first half.)

Anyway, just trust me on this one. Picking a good stride (when available) does ensure a distribution of tasks in time that is more evenly spaced, but even in the worst case the correct proportions will have to emerge, by construction, after SoP scheduling rounds.
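You don't actually have to trust me; the claim is easy to check mechanically. A sketch that walks the trivial array with a given stride (the helper name is mine):

```python
from collections import Counter
from math import gcd

def strided_traversal(arr, stride):
    """Visit the TFP array with the given stride, modulo its size, for one
    full cycle of SoP (= len(arr)) scheduling rounds."""
    n = len(arr)
    return [arr[(i * stride) % n] for i in range(n)]

arr = ["A", "B", "B", "C", "C", "C"]  # SoP = 6
strided_traversal(arr, 5)  # -> ['A', 'C', 'C', 'C', 'B', 'B'], as in the example

# For any stride coprime to SoP, the exact proportions emerge after SoP rounds:
for stride in (1, 5):
    assert gcd(stride, len(arr)) == 1
    assert Counter(strided_traversal(arr, stride)) == Counter(A=1, B=2, C=3)
```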

Now, what if we add a new task? Should we abandon the current strided traversal and start anew? Well, I still haven't decided what to do here. So far I considered starting anew but with a different stride, starting from a random position. Perhaps even taking the opposite direction (subtracting strides if they were being added in the previous scheduling round, and vice-versa). This should eventually converge to a fair distribution. Simply finishing the current strided traversal until it completes SoP scheduling rounds and ignoring new tasks until then would be terrible for responsiveness when the number of tasks grows. I believe using the rule of thumb I described above as a compromise should be reasonable enough, but I'm open to suggestions.

I hope that by now you're feeling that the pieces are finally starting to fall together. And they are! I'll try to cut the math abstractions to the minimum on the next blog post. It will be more algorithm-oriented.

See you then! ;)

Notes:

1. I'm aware that both lottery scheduling and stride scheduling were implemented in the Linux kernel several times before as academic exercises. While there's no doubt that fairness and multimedia responsiveness improved, nobody (that I'm aware of) made an effort to reimplement these ideas in the 2.6 kernel series. Actually, I'm not being completely honest here. Those ideas did make a comeback in recent versions of the 2.6 Linux kernel series. I'll get to it in due time. Stay tuned!
2. Linux followers: does that ring a bell? *hint*
3. OK, I have to admit that I haven't really looked deep into that, but I didn't find anything that suggested he himself considered discussing the complexity regarding the number of threads in the system of his implementation. But the fact is, since the most complete implementation made for the Linux kernel so far used a tree, you can already assume that there's at least the possibility of such complexity being O(lg n) where n is the number of threads. But rest assured, my solution is even better. Or so I guess ;)

Don't miss the next chapter in the Scheduler saga! Tomorrow (hopefully), same bat time... Same bat channel (powered by RSS feeds!)... Right here. Stay tuned.

### Re: Going back to the basics... kinda.

The second problem lies in the mapping between tickets and processes. Searching for a ticket by visiting every process is very inefficient. We could use hash tables, or trees, to do this indexing.

Can't you just use a simple array - where process[ticket_id] will give you the process for the ticket. When a ticket is freed - put its id in a recycler and reuse it later.
And why pick random tickets or use strides/passes? Can't you just iterate the list of the currently active tickets and pick the next one? You can try to distribute tickets from the same process evenly across the list of active tickets.
And you could use two pools of tickets - where tickets from the first pool are say - 10 times more likely to be picked than the ones from the second. This way instead of giving a process 87 tickets - you can give it 8 tickets from the first pool and 7 from the second.

### Re: Going back to the basics... kinda.

geleto wrote:

Can't you just use a simple array - where process[ticket_id] will give you the process for the ticket.

Excellent question! And the answer would be: "no."

Quote:

When a ticket is freed - put its id in a recycler and reuse it later.

I don't see what this would buy us...

Quote:

And why pick random tickets or use strides/passes? Can't you just iterate the list of the currently active tickets and pick the next one?

It depends. If tickets are packed together, the whole pack belonging to a single process, simply iterating would select the same thread a lot of times. If tickets aren't packed but scattered around the array, and they are shuffled well enough, it could work better. But then again, shuffling is expensive, as discussed already. So the short answer is: I could, but I just won't ;)

(In case you're wondering, I'm not using strides/passes as described.)

Quote:

You can try to distribute tickets from the same process evenly across the list of active tickets.

I guess you mean that this distribution occurs when a process is no longer active. If this really is the case, I don't need to; tickets dying are not a problem. You'll see :)

Quote:

And you could use two pools of tickets - where tickets from the first pool are say - 10 times more likely to be picked than the ones from the second. This way instead of giving a process 87 tickets - you can give it 8 tickets from the first pool and 7 from the second.

This is starting to sound too complicated, no? ;)

OK, I'm being a dick.

I've been giving you witty and non-satisfactory answers on purpose. I'm sorry I'm this mean, but the fact is that the next blog post is almost entirely dedicated to the subject you brought up! I already had to deal with all the suggestions you gave, and unfortunately they're all flawed in one way or another. I'm striving not to litter the scheduler code with special cases and alternative paths. I'm striving to K.I.R.,R.,R.,R.,R.,R.,I.M.,R.S.,S.[1]

I'll try to post the follow-up entry today, but just in case I won't be able to, here's a little hint: a thread with a priority of 10 holds 10 tickets. Priority of 15, 15 tickets; priority 20, 20 tickets. This is not the same as giving a single, special ticket that has a bigger probability of being drawn. The overhead of using an algorithm to simulate the statistics of drawing "special" tickets is just way too high when you have any hopes of having low-latency context switches. (or is it? *hint* ;))

Now, consider the fact that a freshly-booted Haiku system already has around 90 threads. Let's be a little optimistic here and pretend all of those threads have B_NORMAL_PRIORITY (10) (they don't; there are plenty of threads with priorities 15, 20, 90...); well, that's 900 tickets already! And that is with only 90 threads, under an optimistic, and wrong, scenario. Ouch.

Our worst case scenario would be edging on the operating system's limit on the number of threads, with all of those threads at a priority of B_FIRST_REAL_TIME_PRIORITY minus 1; on BeOS, that would be 4096 threads with a priority of 99. The resulting number of tickets would be 405504; such an array would be 405504 * sizeof(thread_id) bytes, that is, at least 1.54MB, large. And there are hints that Haiku will lift the 4096-thread limit...
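For the record, that worst-case arithmetic, assuming a 4-byte thread_id:

```python
threads, priority = 4096, 99          # the BeOS-era worst case sketched above
tickets = threads * priority          # 405504 tickets in total
size_bytes = tickets * 4              # assuming sizeof(thread_id) == 4
size_mb = size_bytes / (1024 * 1024)  # about 1.55 MB, i.e. "at least 1.54MB"
```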

But let's suppose we don't care; whatever, we'll just use 1.54MB worth of RAM and be happy with it. But now what? We'll have to manage ticket allocation. If we let tickets be sparsely allocated, we'll have to keep track of which tickets belong to which thread. This incurs a lot of extra overhead; 1.54MB pales in comparison to it. If we don't, we just have to keep track of which ticket pack belongs to which thread. And this is still undesirable, because it's backwards and adds unnecessary complexity to the code.

But IMHO that's not even the worst offender yet. It's the fact that threads come and go from the scheduling queues all the time. So what are we going to do? Insert dozens of tickets in the mapping array every time a thread becomes ready? Clear dozens of tickets off the array every time a thread blocks? Manage the ticket sparsity? Consolidate them every time a thread blocks and leaves a hole in the array?

This just sucks, no?

Trust me, you'll see shortly how this was turned into a complete non-issue.

1. Keep It Really, Really, Really, Really, Really, Really, I Mean, Really Simple, Sir.

### Re: Going back to the basics... kinda.

What about the recent scheduler work in the Linux kernel?

Ingo Molnar's "Completely Fair Scheduler" might be a good place to look for implementation techniques. It seems very elegant and very efficient, and he is a coding genius. :-)

It's obviously not the solution for Haiku but it could be a very good source of inspiration and some of the algorithms may be applicable to the scheduler solution you decide upon.

--

Want free games?

Free Gamer - open source games resource

### Re: Going back to the basics... kinda.

Hi, Charles,

I see you overlooked the summary of the very first post I wrote ;)

Excerpt from it (with added emphasis):

I'd like to dedicate my next few blog entries to introducing myself and discussing how I got here, why I wanted to tackle this specific task, what background I have regarding the subject of thread scheduling, how I failed miserably to realise that my first attempt at designing an algorithm suitable for Haiku's needs had fundamental flaws, how far I am at my second attempt, and the obligatory comparison to Ingo Molnár's Completely Fair Scheduler that has been making the news in the Linux world.

Patience, my friend :)

### Re: Going back to the basics... kinda.

I'm unfamiliar with scheduler algorithms, but I took a look around and came across Real-Time in Plan 9: a short Overview, which led me to Scheduling Algorithms for Multiprogramming in a Hard-Real-Time Environment, both of which made interesting food for thought.

-Jack

### Re: Going back to the basics... kinda.

RTOS scheduling is NOT what Haiku needs or desires...