The first (or nth, even) attempt: a cautionary tale

Blog post by meianoite on Sun, 2007-06-17 22:27

scheduler, gsoc, scheduling algorithm, algorithm, O(1), complexity, data structure, multiprocessor, affinity, fairness, real time, CFS, Linux, comparison, threads, responsiveness

(Or: rose-coloured glasses are both the blessing and the curse of being in love)

But first, the obligatory disclaimer, in case you missed it. Thanks.

Remember where I left on the previous post? Now, with greater confidence, I set myself to improve my original algorithm’s performance even further. But that’s just because I knew it sucked. It was extremely inefficient as far as implementation goes; it looked great in the benchmarks because it was being compared to O(n) (n being the number of threads in the system) algorithms, while it had O(1) complexity, so I already had a head-start, so to speak. Still, I knew the algorithm very well and understood that there were plenty of bottlenecks to fix. I wrote it, after all.

[/pride mode=off]

Well, things started to look great as I progressed. I eliminated a nested loop, then I eliminated an outer loop, then I benchmarked a little more, and I reached extremely small overhead (only 8x slower)[1] compared to the theoretical lower bound, i.e., a standard “for” loop that reads sequentially from an array (and does nothing afterwards with the element it just read).

But then something hit me.

My algorithm made no distinction between absolute priority levels, only relative.

If you read the description of set_thread_priority() and suggest_thread_priority() on the BeBook, you’ll realise that there should be an attempt to correlate priority levels and frequency of scheduling a given thread. This means that, at the very least, that if I have two threads in the system, one with a higher priority than the other, the higher-priority one shouldn’t, for example, always be picked 66% of the time while the other one is picked 33% of the time. If those priorities happen to be, say, 20 and 10, then those numbers are OK (2:1 ratio). But if one has a priority of 40 and the other a priority of 10, the ratio they’re chosen should be 4:1. My algorithm only cared if one had a higher priority than the other, and tended to pick ratios such that the highest priority one got the largest share, then it diminished progressively as levels became lower. If there were 3 threads in the system with different priority levels, the ratios would be 50:33:16; 4 threads, 40:30:20:10, and so on.

I felt that this is just wrong. (And I guess I’m not alone. Cf. 4th paragraph. Different issue, but the motivation is quite similar. And I’m baffled at how the Linux guys didn’t realise the issue with implementing nice(2) up-front, and they certainly have way more experience with the subject than I. Let alone the semi-priority round-robin effect. Unless their interpretation of priority really is regarding urgency, not quality of service[2].)

And then I resorted to nasty workarounds. (And I’m so not alone in doing this, and there are a million different scenarios where this shows that it’s not even worth linking. Most people with IT-related jobs know exactly what I’m talking about.)

First, I tried adjusting the ratios with an exponential function. Then I tried to come up with more efficient ways to implement the exponential function. And well, although the ratios actually improved, things already started to feel way too hackish when I considered using a fixed-point library of helper functions to avoid touching the FPU. This is the kind of thing you resort to when you have absolutely no other way out. Not the kind of thing you use to fix a broken design.

You know what? I was so infatuated with my own algorithm that I didn’t see how plainly flawed it was.

So, I admit: I, childishly, spent nearly 3 weeks polishing a turd.

Here’s a lesson I’d like to share: if you consider yourself a smart and reasonable person, and you believe you know what the hell you’re trying to do, but you reach a point where you’re effectively shoehorning your square-peg approach into a round hole (no pun intended, please!), give yourself a nice break to re-think things over. You might be wasting your time polishing a turd.

Let it go.

Notes:

Yes, in that particular case an 8x slowdown was damn efficient. Remember that the bottleneck on the theoretical lower bound is memory access, and the next element of the array was almost certainly in cache by the time it was accessed, and that any accesses to eight different memory positions will give you a similar 8x slowdown. Any 8 variables not in a register will do; temporary, global, local, loop counter, array element, struct member, you name it. Simply managing a linked list will easily eat 4 or 5 of those: a temporary pointer to the element to remove/insert, adjusting the previous node to point to the next node, adjusting the element counter, the key of the node that you use for comparisons, and so on...
IMHO, the issue of urgency is better solved through the use of scheduling classes, like real-time, interactive, batch and so on. Within those classes, I believe the access to a limited resource (CPU time in this context) should be granted according to Quality of Service-like characteristics, namely by following proportionality ratios like I mentioned. Notice I'm not talking about QoS in the "special reservation" sense in that a resource is wasted when not used (bogus example here), but the saner (IMHO) weighted fair queuing (an alternative explanation is found here) approach.

Don't miss the next chapter in the Scheduler saga! Tomorrow (hopefully), same bat time... Same bat channel (powered by RSS feeds!)... Right here. Stay tuned.

The first (or nth, even) attempt: a cautionary tale

meianoite's blog