And those decisions lead us to...

(Or: knitting a delicate fabric, part II: sewing it all together)
(Or: Right, Joker, the underwear might be on the outside, but I get to drive the Batmobile!)
(Or: I just had to say something about Batman and the Batmobile. Couldn't help it.)

Cool, now we have a space- and time-efficient data structure (which is nothing more than a nice structure for handling non-overlapping numeric ranges) that we can use to implement the simplified stride scheduling scheme I've been discussing in the previous posts. It will remain small most of the time, and it won't change its shape like mad, so it's also very gentle on the CPU cache. Not only that, but the same effects hold for any tree, should we decide that red-black trees aren't adequate and set ourselves to use splay trees, AA trees, AVL trees, Huffman trees, unbalanced binary trees (please: no.), whatever. So far we have taken care of the (no longer) skewed proportions due to randomness (by using strides), the mapping of "tickets" to tasks (by using trees to store the queues, "indexes" as the tickets, offsets to simulate "special" tickets, and lazy evaluation to efficiently implement offset propagation to upper queues), and the order of complexity (by using the queues as the node elements of the tree, where the key is base priority times the number of threads in that queue).

Haha! Gotcha! You really thought you'd get rid of the infamous reminder about the disclaimer, didn't you? Take a peek if you haven't before. Thanks.

(BIG Sidenote: notice that hashtables that map index numbers to queues won't really work, for the same reason that using an array to do this mapping won't work: we're dealing with ranges, which are what's most commonly searched for. There's no decent way to handle them with arrays because of offset propagation, as discussed in the previous post, and the same logic applies to hashtables. Using discrete tickets is a no-no: you really don't want to insert a hundred tickets into the hashtable every time a thread with priority 100 is spawned... Or clear the same amount when a thread blocks or otherwise must be taken off the ready-thread queues.
For the sake of clarity, let me re-state the problem by using an analogy: it's like having an arbitrary number of sticks of different sizes that are multiples of the same unit (perhaps a centimetre), and you arrange them in a circle. Now you begin at an arbitrary point and walk R unit-large steps. Which stick will you be standing next to when you finish walking?
The problem lies in expressing the following facts: the sticks can actually be placed arbitrarily, but once you lay them down, they must be kept in that order at least until a (new) stick is removed from(/introduced into) the circle; the number of steps you take depends on how big the circumference of said circle is; the circumference will evidently change when you change the number of sticks; and after you finish taking size_of_circumference steps, you MUST have stood beside a given stick exactly size_of_stick times.
In other words: the proportions that were discussed in a previous post.)
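For what it's worth, the stick analogy can be sketched in a few lines (illustrative code of mine, not the scheduler's; in the real thing the tree replaces the linear prefix-sum build, but the lookup is the same range search):

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Which stick are you standing next to after walking `steps` unit-large
// steps from offset `start` around the circle? With prefix sums of the
// stick lengths, the answer is a range lookup -- the same operation the
// tree of queues performs in O(log n).
std::size_t stick_after_walk(const std::vector<int>& lengths,
                             int start, int steps)
{
    std::vector<int> prefix;        // prefix[i] = end offset of stick i
    int circumference = 0;
    for (int len : lengths) {
        circumference += len;
        prefix.push_back(circumference);
    }
    int position = (start + steps) % circumference;
    // Index of the first stick whose end offset lies beyond `position`.
    return std::upper_bound(prefix.begin(), prefix.end(), position)
           - prefix.begin();
}
```

With sticks of lengths 3, 1 and 5 (circumference 9), walking 3 steps from the start lands you next to the second stick, and walking a full 9 steps brings you back to the first.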

Now, responsiveness is another matter entirely, and I intend to take the "easy" way out: forget about the cap on priority levels. But hey, there's no need for that shocked expression of yours ;)

First, it's already established that, on BeOS and Haiku, there are 100 "regular" priority levels and 20 real-time priority levels. As mentioned, RT levels would be scheduled separately (the separate "realms"), and only when RT queues are active (evidently). They should be scheduled unconditionally when active, and in priority order. So strides will never touch RT queues. Keep this in mind: RT realm first, non-RT realm later.

Axel mentioned he'd like to have some mechanism to prevent starvation of non-RT queues. I believe something as simple as randomly skipping RT scheduling 5% of the time if we detect that some RT thread has gone runaway, or something like that, should do it. Maybe demoting the runaway RT thread to some non-RT level. Travis mentioned that he'd rather kill a RT thread that consumes a whole quantum in a scheduling round. He has a point there: RT threads should never go runaway, ever. I'd like to know what you people think.

But now that RT is (mostly ;)) out of our way and scheduled completely apart from "regular" threads, there's absolutely nothing stopping us from pretending there exist "regular" priority levels that are way beyond 100. The 120 priority numbers are simply part of a protocol between userland and the kernel. They are only numbers. There's nothing (other than sanity) that prevents us from internally numbering levels counting from 39254, or having them spaced by 50 units. This means we really don't need to care if we artificially inflate those numbers to well beyond what user-land would consider part of the RT-reserved levels, as long as we never expose the internal numbering scheme to user-land: we have already taken care of the RT realm.

Thus, we could practically force any thread to run by simply placing it in a "fictitious" priority level (queue) that results from adding, say, 1000 to that given thread's base priority level. This way, if we detect that a thread that just woke up is interactive (using some heuristics, like those found on FreeBSD's ULE scheduler; but I have greater plans), we could boost it through the rooftop so it will (almost) certainly run next (unless a thread with even higher priority also received this "mega-boost"). Notice that:

  • said "mega-boosts" do not interfere with RT scheduling; those are separate scheduling realms
  • they are only good for a single scheduling round; after the boosted thread gets scheduled, it should be placed right back into its "native" queue, unless it blocks once again. Thus, those boosts should be completely transparent to the thread.
  • they should only be used to boost interactive threads (or otherwise "boostable" threads), preferably those associated with the GUI; boosting CPU-bound "batch-like" threads is a sure way to have crappy responsiveness.
  • the whole "mega-boosting" thing is effectively an expression of "scheduling classes". By boosting threads to a new "base" level that's much higher than the former ceiling of 100, we're actually creating plateaus of scheduling. This way, we can transparently prioritize certain patterns like interactive behaviour, and let CPU-bound threads be (quasi-optimally) handled by the variant of stride scheduling previously described. And by doing that we can emulate the ideal, and impossible to achieve in a general setup, shortest-job-first scheduling, quite successfully! And we don't need to stop here:
  • Greater Plan Warning! I once mentioned in a message to the Haiku mailing list that I'd like to "brand" semaphores with the subsystems they belong to, like GUI, disk I/O, network I/O, HID I/O and so on. This would be as simple as adding a field to struct sem_entry, and versions of the relevant *_sem() functions that understand the extra "brand" argument. It would be a simple numeric argument that corresponds precisely to the desired boost. Like

    #define SEM_GUI  1000
    #define SEM_HDIO  500
    #define SEM_NET   200

    (or an enum version, if that's preferred) and so on. Just like how priority levels are numeric constants but are #defined with mnemonic names. And just like the priority mnemonics, they don't cover the full gamut, only "key" points, and don't prevent one from using some number in between.

    (And note that the numbers I just used don't make any sense; I picked them only for the purposes of exemplification. I still don't know what the ideal boosting for any scenario would be! And this is something that can't really be decided without empirical testing.)

    So in the future, if we get to brand semaphores with the subsystems and purposes they serve, this boosting mechanism (which effectively creates additional implicit scheduling classes), possibly even coupled with variable quanta, could be extended to transparently shape loads in any way we desire, be they I/O loads, CPU loads, network loads etc. Said transparency is in the sense that the application developer doesn't need to explicitly guesstimate how to parametrize his threads, or even find himself in the uncomfortable position of needing to create separate threads just to perform trivial operations because such operations could otherwise mischaracterise the main thread's behaviour (which would undermine the efficiency of heuristic approaches). Well, we're doing away with the heuristics entirely! The thread will always be adequately responsive depending on the code path it follows. Cool, huh?
    One obvious application that comes to mind is providing time-sensitive multiplayer games (FPS, RTS and so on) with quality-of-service-like reserved shares of resources such as network bandwidth and CPU time, while you have dozens of torrents happily downloading and any number of distributed computing projects crunching in the background. And your framerate remains undisturbed.
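To tie the internal-numbering, mega-boost and branding ideas together, here is a minimal sketch. Every name below is invented for illustration (none of it is actual Haiku kernel code), and the SEM_* values are the same placeholder numbers from above:

```cpp
#define SEM_GUI  1000   // placeholder boosts from the post, not tuned
#define SEM_HDIO  500
#define SEM_NET   200

struct ThreadInfo {
    int basePriority;    // the userland-visible 0..99 "regular" level
    int boost;           // one-round boost; 0 when not boosted
};

// Internal key used to select a queue in the non-RT realm. Userland
// never observes this number, so it may exceed 120 freely.
int InternalLevel(const ThreadInfo& t)
{
    return t.basePriority + t.boost;
}

// Hypothetical wakeup path: releasing a thread blocked on a branded
// semaphore grants it the brand as a transparent, one-round boost.
void WakeFromBrandedSem(ThreadInfo& t, int semBrand)
{
    t.boost = semBrand;          // e.g. SEM_GUI
}

// After the boosted thread is scheduled once, it snaps right back to
// its native queue -- the boost is invisible to the thread itself.
void OnScheduled(ThreadInfo& t)
{
    t.boost = 0;
}
```

A thread with base priority 10 waking up on a GUI-branded semaphore would momentarily sit at internal level 1010, far above the old ceiling of 100, and drop back to 10 after one scheduling round.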

The sky is the limit.

In a nutshell: effective scheduling classes, transparent to the application developer and not requiring pages and pages of code to implement. Always doing the right thing depending not on explicit hints or complicated window-of-time heuristics, but actual code path behaviour. Load shaping based on whatever we wish, tweakable by user-land with the simple and familiar "priority level" interface, but extended to well beyond just CPU priorities.

Sure, it probably doesn't make much sense to have the thread scheduler take care of all those tasks, but its code could be redeployed to manage many kinds of subsystems that would benefit from this mix of tweakable fairness and responsiveness. I/O scheduler, network scheduler, and so on.

Now, for SMP... Well, that definitely deserves some further discussion. I've already hinted that I'm pretty much set on local, per-CPU queues, and, if possible, every CPU should schedule its threads independently. Having a global "rescheduling time" trigger or a single CPU handing tasks to other CPUs is pretty sub-optimal, and almost nobody uses such a master/slave scheme anymore. So I'd like to disturb CPUs as little as possible, by reducing IPIs and cache line syncs.

Let's consider the direction technology is heading. Intel is suggesting their future multicore CPUs will allow disabling select cores and overclocking the active ones when load is low. We could have threads spawned on any (enabled) CPU/core (from now on I'll use the term CPU to mean "processor unit", be it a core on a multicore chip, or in a multi-socket setup. I know this generalization is far from optimal when IPI and caches enter the equation, but read on!), and each CPU is responsible for balancing the load. If one CPU "sees" another that is less loaded, according to some threshold, or even that the system could benefit from re-enabling a CPU, it then migrates a thread to that CPU (actually, things should be a little more sophisticated than this; more on that in a bit). Thread migration could be implemented by having per-CPU "incoming" queues.

CPU F would move a thread into CPU G's incoming queue, and every time CPU G runs the scheduler, it takes (one? All? Some? To be decided; suggestions are appreciated!) thread(s) from this queue and puts them in the appropriate, priority-level-related queue (at the front? At the back? TBD as well). Such incoming queues should be mutex-protected, of course, but accessed in a non-blocking fashion (i.e., if you can't grab the lock right now, forget about it, or find another CPU that's available). Alternatively, there could be "outgoing" queues instead, and "idling" processors could actively fetch those left-behind threads... But that doesn't feel very intuitive, IMHO. I'd like to know what you think.
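A sketch of what such an incoming queue could look like (illustrative only, not the actual scheduler code; std::mutex and std::deque stand in for whatever kernel primitives would really be used). The point is that both sides only ever try-lock, so nobody blocks on migration:

```cpp
#include <deque>
#include <mutex>

struct Thread { int id; };

// Per-CPU "incoming" queue. The migrating CPU pushes, the owning CPU
// pops at reschedule time; neither ever waits on the lock.
struct IncomingQueue {
    std::mutex lock;
    std::deque<Thread*> threads;

    // Called by the migrating CPU; returns false if the queue is busy,
    // in which case the caller just picks another candidate CPU.
    bool TryPush(Thread* t)
    {
        if (!lock.try_lock())
            return false;           // don't block: forget about it
        threads.push_back(t);
        lock.unlock();
        return true;
    }

    // Called by the owning CPU when it runs the scheduler; returns
    // nullptr if the queue is empty or currently being pushed to.
    Thread* TryPop()
    {
        if (!lock.try_lock())
            return nullptr;
        Thread* t = nullptr;
        if (!threads.empty()) {
            t = threads.front();
            threads.pop_front();
        }
        lock.unlock();
        return t;
    }
};
```

Whether TryPop drains one thread or the whole queue per scheduling round is exactly the open question above.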

Regarding affinity, well, using the apparatus just described, implementing it becomes surprisingly easy. With per-CPU queues, the dreaded "ping-pong" effect simply doesn't happen: threads will not be migrated unless load is unbalanced, and even then only if they don't have an "affinity flag" set. Affinity flags could come in several flavours: migrate as you wish (i.e., not set), do not migrate at all, migrate only to those CPUs in a given CPU pool, and so on.

"Hey!! CPU pools??"

Yes, but not in the NUMA/cluster sense. Remember when I mentioned that generalising processor units was a bad idea? It's because this is what I actually had in mind. Using a CPU pool abstraction is very powerful, because it allows us to encapsulate things like multi-core and multi-socket layouts, SMT/Hyperthreading, shared/discrete caches and so on behind several pre-cooked CPU pool profiles, and even allow the creation of custom ones. The CPU affinity API should be designed in such a way that exploiting those incarnations of SMP is straightforward and very close to optimal, without having to resort to the amount of exceptions and special cases that abound in Linux's and FreeBSD's schedulers.
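The affinity flavours above could surface as something like this (all names invented for illustration; a "pool" is just a set of CPUs behind which the topology details hide):

```cpp
#include <cstdint>

enum AffinityMode {
    kMigrateFreely,   // flag not set: the scheduler may move the thread
    kPinned,          // do not migrate at all
    kPoolBound        // migrate only within the given CPU pool
};

struct Affinity {
    AffinityMode mode = kMigrateFreely;
    std::uint32_t pool = 0;   // bitmask of CPUs; used by kPoolBound
};

// May this thread be migrated from CPU `from` to CPU `to`? This is the
// check a load-balancing CPU would make before pushing a thread to
// another CPU's incoming queue.
bool MayMigrate(const Affinity& a, int from, int to)
{
    switch (a.mode) {
        case kMigrateFreely: return true;
        case kPinned:        return false;
        case kPoolBound:     return (a.pool >> to) & 1;
    }
    return false;
}
```

A pre-cooked profile like "SMT siblings only" or "same shared cache" would then just be a canned pool bitmask, and custom pools are whatever mask the user builds.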

OK, those last paragraphs were surprisingly small for such hairy themes. Well, let me know what you think, and feel free to ask for further clarification on parts that weren't explained well enough, or to correct me if I goofed up.

I hope you enjoyed the ride so far! Next post... The obligatory comparison to Linux's Completely Fair Scheduler!

Don't miss the next chapter in the Scheduler saga! Tomorrow (hopefully), same bat time... Same bat channel (powered by RSS feeds!)... Right here. Stay tuned.

Comments

Re: And those decisions lead us to...

Nice!
There's only one thing I don't like though. I am not very familiar with process schedulers, but I think this makes sense:

The stride scheduling is great for interactive processes, but not so great for worker/computational ones. Having many context switches for such processes is unnecessary. Especially in some cases like when you have two worker tasks and each one trashes the cache of the other.

One way around this is to use stride scheduling where the 'pass' value indicates the exact (though sometimes overlaps may happen) scheduler tick when the ticket will be active, and the 'stride' indicates how many scheduler ticks there are between the current and the next ticket (that is used by the same process). This stride scheduling will be used only for the non-worker processes. This will leave 'holes' with unreserved ticks between the tickets which can be used by the worker tasks. For instance say we have tickets ABCDDEEEE, where D and E are worker tasks.
After the stride scheduling the tickets will look like this: AxxxBxxxC. We fill the empty slots (x) with the tickets of the worker threads: ADDEBEEEC.

Another approach would be to know for each process - how much time it can wait before it gets scheduled. Using this value - for some tasks the stride can be calculated to be X times bigger, but then - when the 'pass' comes the next X-1 strides will be 1.
A game that should run at 30 fps can declare that it should be scheduled at least 40 times per second. And a video encoder can easily wait for a very long time.

Re: And those decisions lead us to...

Slightly edited with further clarification

geleto wrote:

Nice!

:D

Quote:

There's only one thing I don't like though.

Bummer =P

Quote:

I am not very familiar with process schedulers, but I think this makes sense:

The stride scheduling is great for interactive processes, but not so great for worker/computational ones.

Backwards, no? Or are you referring to my "special" version of stride scheduling? (I guess I should come up with a catchy name, I don't want Waldspurger calling me a moron for screwing up his beautiful work ;))

Quote:

Having many context switches for such processes is unnecessary. Especially in some cases like when you have two worker tasks and each one trashes the cache of the other.

Yeah, sure, but unless you're doing SMP and tuning the affinity, this is unavoidable. I guess this was the primary motivation for wanting affinity on Haiku, as there's no way to set it on BeOS. Other than that, Haiku and BeOS strive for responsiveness, so if your computational thread gets preempted by an interactive one, well, that's the way things are, and the way things are supposed to happen given the focus on desktop responsiveness.

Either you set affinity to a less loaded CPU (and it's probably a good idea to include such things on the affinity API, like "for this application, try to ensure I get almost exclusive access to a given CPU"; call it the "peg" mode if you will) or you accept the fact that Haiku is not trying to be The One OS to Rule Them All. Remember, the focus is on desktop responsiveness.

Quote:

One way around this is to use stride scheduling where the 'pass' value indicates the exact (though sometimes overlaps may happen) scheduler tick when the ticket will be active and the 'stride' indicates how many scheduler ticks are there between the current and the next ticket(that is used by the same process).

EXCELLENT analysis. This is actually kind of what is supposed to happen there, except that those numbers are not so explicit. But I guess it would work with explicit numbers, too.

Actually, my feeling is that the problem you're trying to solve is better served by a calendar queue structure.

Quote:

This stride scheduling will be used only for the non-worker processes.

Funny, I decided to go the other way. I'm making use of the stride scheduler ideas to give fairness to CPU-bound threads. Interactive threads are boosted to another plateau.

Quote:

This will leave 'holes' with unreserved ticks between the tickets which can be used by the worker tasks. For instance say we have tickets ABCDDEEEE, where D and E are worker tasks.

I believe that what you want to achieve here is done better (calendar queue excepted) by having variable quanta for CPU-bound threads, so the context switching won't affect them so much. But this is really only good on SMP systems, where responsiveness shouldn't be hampered by this. R2 stuff, though.

(Note that affinity doesn't make a whole lot of sense for interactive threads, unless you want to put every interactive thread on the same CPU and let the other CPUs handle CPU-bound workloads.)

Quote:

After the stride scheduling the tickets will look like this: AxxxBxxxC. We fill the empty slots (x) with the tickets of the worker threads: ADDEBEEEC.

I don't know how I can achieve that without putting a lot more code on the scheduler and kiss simplicity and succinctness bye-bye. Could you tell me what's on your mind, code-wise?

Quote:

Another approach would be to know for each process - how much time it can wait before it gets scheduled. Using this value - for some tasks the stride can be calculated to be X times bigger, but then - when the 'pass' comes the next X-1 strides will be 1.

Now that's really bordering on the realm of real-time issues, no? And I did not even attempt to write a real-time scheduler; what I did instead was to give real-time unquestioned precedence over any other kind of priority. So, if a RT timer fires, the thread should wake up and get served right away. And AFAIK BeOS does just that.

For a general-purpose scheduler, it's very hard to know this timing information unless the process itself supplies it. And you can't (or shouldn't) really have the whole scheduler depend on information that isn't available for almost all of the classes of applications expected to run on Haiku, unless we want to go back to heuristic window-of-time approaches. And as I wrote, I have a broader goal of getting rid of them for good.

Quote:

A game that should run at 30 fps can declare that it should be scheduled at least 40 times per second. And a video encoder can easily wait for a very long time.

But I have addressed this on this same blog post, haven't I? See bullet point "Greater Plan Warning!". I did not make provisions for QoS-like reservations yet, because this belongs to R2 and beyond, but I did try to lay a framework to accommodate it in the future without too much trouble.

Cheers,
A.

Re: And those decisions lead us to...

Another approach would be to know for each process - how much time it can wait before it gets scheduled. Using this value - for some tasks the stride can be calculated to be X times bigger, but then - when the 'pass' comes the next X-1 strides will be 1.

Now that's really bordering on the realm of real-time issues, no?

No - more like responsiveness issues, I don't strive for real-time here. What I actually want is to make tasks that do not need responsiveness (e.g. CPU-bound computational threads) LESS responsive in order to reduce the context switches. And I think this can be achieved with a very minor adjustment to the stride/pass algorithm.

If you have two computational tasks - A and B, where each one trashes the CPU cache of the other - which is better - ABABAB or AAABBB?

GUI, networking, I/O - these tasks need responsiveness. Tasks like video encoding and non-realtime rendering don't care about responsiveness. For simplicity you can make the responsiveness inversely proportional to the task priority.

Each scheduled task will have 4 variables - stride, pass, repeat and count. We make the value of repeat(default is 1) greater for tasks that need to be less responsive.

This is the standard (a bit simplified) stride/pass algorithm:

for( i=0; i!=num_tickets; i++ )
{
	task = scheduledTasks.getFirst();//get the task with the lowest pass value
	task->run();
	task->pass += task->stride;
	scheduledTasks.add( task );
}

and this is the modified one:

for( i=0; i!=num_tickets; i++ )
{
	task = scheduledTasks.getFirst();
	task->run();
	task->count--;
	if( task->count==0 )
	{
		task->pass += task->stride*task->repeat;
		task->count = task->repeat;
	}
	scheduledTasks.add( task );
}

And one final touch - if there are more tasks with the same pass value, scheduledTasks.getFirst() returns the one that has the highest responsiveness.

Re: And those decisions lead us to...

Hi!

geleto wrote:

Another approach would be to know for each process - how much time it can wait before it gets scheduled. Using this value - for some tasks the stride can be calculated to be X times bigger, but then - when the 'pass' comes the next X-1 strides will be 1.

Now that's really bordering on the realm of real-time issues, no?

No - more like responsiveness issues, I don't strive for real-time here. What I actually want is to make tasks that do not need responsiveness (e.g. CPU-bound computational threads) LESS responsive in order to reduce the context switches.

I'm having a hard time wrapping this sentence around my mind :)

Quote:

And I think this can be achieved with a very minor adjustment to the stride/pass algorithm.

Uh... You do realise I'm not using the stride/pass algorithm, right? The classic stride/pass algorithm, even the HSS version, has this deficiency of necessarily changing the shape of the tree all the time (less of an issue with HSS, but the shape still changes). So I modified the idea of using tickets, along with a strided "traversal" of the simulated ticket array (you did realise it doesn't really exist, right? And that the arrays I used previously were just for modelling purposes?), precisely to keep the tree shape undisturbed (and thus more cache-friendly), and to avoid the need for a real ticket array (and thus cache-friendlier still!), while retaining most of the benefits that the stride/pass scheme would bring us.

Then I attempted to sidestep the issue of threads not consuming their entire quantum by assuming that, on Haiku, most of the threads that use short CPU bursts are interactive threads, not CPU-bound threads blocking temporarily on disk I/O or something like that. Then I used a boosting mechanism that effectively creates plateaus of scheduling classes, so that computational/worker/CPU-bound threads, standing on the lowest plateau, will remain undisturbed unless some other thread with interactive behaviour comes into play. The scheduler will tend to pick the interactive thread, because its (plateaued) base priority is so high that the strided traversal will inevitably land inside its range within the next few scheduling rounds.

Quote:

If you have two computational tasks - A and B, where each one trashes the CPU cache of the other - which is better - ABABAB or AAABBB?

The real answer, actually, is: what are you trying to accomplish here? To have both threads finish almost simultaneously, or to cut each thread some larger, undisturbed slack?

As you see, there is a fundamental conflict between what you're trying to accomplish, and the desire for fairness.

And one of the first things Axel told me is that he wanted an algorithm that's also fair.

Besides, this scenario you gave just screams for variable quanta, which, for reasons already explained, are just not going to happen on Haiku unless the system is SMP (if you missed it: the scheduler must not hinder responsiveness, and a large quantum does that by definition; except that on SMP you can dedicate an entire (set of) CPU(s) to computational loads, and let another (set of) CPU(s) handle interactive loads). And this is R2 stuff, period; and this is not my call, mind you. Ask anyone on the development lists.

For now, remember: I have a deadline, so at the moment I can't try every possible combination to see what works. I have to have a narrow focus now so I can deliver a scheduler that does a good job within the goals of Haiku R1.

So unless I royally screwed up something on the design and somehow it can't fulfill the goals of my GSoC assignment within the bounds of Haiku R1, I'm not going to change the design again. World domination is R2 stuff ;)

On the issue of cache trashing: if you do have such tasks, how do you know they trash each other's data in the CPU cache? You can at best try to see how they mess with the TLB, but CPU caches are implemented differently in each generation, by each vendor, sometimes even differing after a minor stepping change! I don't think we'll manage to keep a huge and up-to-date table of CPU models and the caches they implement. Anything else is guesstimation.

Not to mention that there's absolutely no hope of tracking, in software, what happens when the cache is fully associative.

Quote:

GUI, networking, I/O - these tasks need responsiveness. Tasks like video encoding and non-realtime rendering don't care about responsiveness. For simplicity you can make the responsiveness inversely proportional to the task priority.

I'm starting to get confused by what you mean by "responsiveness", but if we're still on the same page, on my design this happens already, by construction.

Quote:

Each scheduled task will have 4 variables - stride, pass, repeat and count. We make the value of repeat(default is 1) greater for tasks that need to be less responsive.

I'm having a baaaaad feeling about this... ;)

Quote:

This is the standard (a bit simplified) stride/pass algorithm:

for( i=0; i!=num_tickets; i++ )
{
	task = scheduledTasks.getFirst();//get the task with the lowest pass value
	task->run();
	task->pass += task->stride;
	scheduledTasks.add( task );
}

Okay...

Quote:

and this is the modified one:

for( i=0; i!=num_tickets; i++ )
{
	task = scheduledTasks.getFirst();
	task->run();
	task->count--;
	if( task->count==0 )
	{
		task->pass += task->stride*task->repeat;
		task->count = task->repeat;
	}
	scheduledTasks.add( task );
}

Okay. Bye, fairness. ;)

I understand that this is a simplified excerpt, but how are you going to handle some interactive task that should be allowed to run there, when you always re-insert the CPU-bound thread with lowest pass value in the queue?

BTW, since removing a thread from the queue when choosing it to run is only necessary when you have a single shared thread queue on SMP systems, the last step isn't quite necessary. I guess I forgot to mention this on the previous blog posts, but by implementing the queue with a circular list, you only need to either rotate the list to pick the next thread (and thus the former current thread "goes to the end" of the list), or you leave it unrotated if you want to leave the current thread in the head. No need to keep testing for the end of the queue, no need to track head/tail pointers. And round-robin then becomes a side effect of the underlying structure and doesn't have to be programmed separately.
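A quick sketch of the rotate-or-don't trick (std::list standing in for an intrusive circular list, and the class name is mine, purely for illustration):

```cpp
#include <list>

// Picking the next thread is just a one-element rotation: the former
// current thread "goes to the end" as a side effect, so round-robin
// falls out of the structure instead of being programmed separately.
template <typename T>
class RoundRobinQueue {
public:
    void Add(const T& t)      { fThreads.push_back(t); }
    bool IsEmpty() const      { return fThreads.empty(); }
    const T& Current() const  { return fThreads.front(); }

    // Advance to the next thread: O(1), no allocation, no head/tail
    // bookkeeping, no end-of-queue tests -- just rotate by one.
    void Rotate()
    {
        fThreads.splice(fThreads.end(), fThreads, fThreads.begin());
    }

private:
    std::list<T> fThreads;
};
```

To leave the current thread at the head (the "unrotated" case above), the scheduler simply doesn't call Rotate.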

Notice how, by choosing the queues when doing the strided walk, together with the FIFO/circular nature of threads held in the queue, fairness emerges naturally.

However, for the time being, the fairness I'm aiming for is not fairness of CPU usage, but fairness of opportunity to use the CPU. For now, I'm not going to implement the "borrowed/postponed quanta" scheme found in the original stride scheduling algorithm. If we end up having to, this could be as simple as holding the thread that yields early at the front of the queue until it gets its fair share of the CPU. We'll see if this is really going to be necessary. But notice that this is not the same as running the same thread over and over until it consumes the whole quantum: threads from other queues will probably be chosen in-between.

Quote:

And one final touch - if there are more tasks with the same pass value, scheduledTasks.getFirst() returns the one that has the highest responsiveness.

Okay, let's suppose we don't care about fairness. Even though I still don't quite get what you mean by "has the highest responsiveness" here, what if we find ourselves in a degenerate case and all the threads have the same pass value?

Welcome back to O(n) land and the Linux 2.4 goodness() scheduler call ;)

Re: And those decisions lead us to...

I understand that this is a simplified excerpt, but how are you going to handle some interactive task that should be allowed to run there, when you always re-insert the CPU-bound thread with lowest pass value in the queue?

You are right, I made a small error - the pass should be incremented when the count is not 0:

if( task->count==0 )
{
	task->pass += task->stride*task->repeat;
	task->count = task->repeat;
}
else
	task->pass++;

See my previous explanation of the stride/pass method that uses explicit values for 'pass' that indicate the exact scheduler tick when the task should be active. It's true that worker tasks may be reinserted (a limited number of times!) at the next tick, but if that tick/pass happens to be allocated for an interactive task - the interactive task will be scheduled before the worker one (because scheduledTasks is sorted not only by the 'pass' value, but also, as a secondary criterion, by the responsiveness of the thread, which we can assume is inversely proportional to the priority)

But notice that this is not the same of running the same thread over and over until it consumes the whole quantum: threads from other queues will probably be chosen in-between.

That's exactly what I had in mind.

Thank you for your patience, I understand why this is not a priority for R1.

Re: And those decisions lead us to...

geleto wrote:

That's exactly what I had in mind.

The whole shebang preceding that sentence, or just it? :)

Quote:

Thank you for your patience, I understand why this is not a priority for R1.

Not a matter of patience, your questions were very valid and I wanted to give you some feedback.

Your ideas are sound (you'll see what I mean, shortly), it's only that I believe they, at the moment (the SMP workaround excepted), won't work in the context of Haiku R1.

Cheers!

Re: And those decisions lead us to...

First of all, thank you very much for explaining your ideas on this blog. This sounds really good.

Regarding the CPU load-balancing question, have you thought about something along those lines:
- If all active CPUs are overloaded, a new CPU is activated
- If a CPU has not enough work, it looks for CPUs with too much work, and chooses and requests threads for migration. The idea is that the CPUs with less-than-full workload do the work of finding a too-high-workload CPU and choosing threads, so that the CPUs that are already too busy don't have to do this.
- If a CPU with at least a medium but not a full workload does not find a too-high-workload CPU to request threads from, it requests threads from low-workload CPUs. That way, the low-workload CPUs that are not really needed may become idle.
- If an idle CPU does not find any suitable threads for migration, it deactivates itself.

Keep up the good work! I am already waiting for the next part of the saga :-)

Re: And those decisions lead us to...

cl21 wrote:

First of all, thank you very much for explaining your ideas on this blog. This sounds really good.

Thanks!!

Quote:

Regarding the CPU load-balancing question, have you thought about something along those lines:
- If all active CPUs are overloaded, a new CPU is activated

Fair, I even said something along those lines, right? "If one CPU "sees" another that is less loaded, according to some threshold, or even that the system could benefit from re-enabling a CPU, it then migrates a thread to that CPU"

Quote:

- If a CPU has not enough work, it looks for CPUs with too much work, and chooses and requests threads for migration. The idea is that the CPUs with less-than-full workload do the work of finding a too-high-workload CPU and choosing threads, so that the CPUs that are already too busy don't have to do this.

That would be something like the "outgoing queue" I mentioned. But I don't like this idea much. Outgoing queues won't capture the whole picture, and I really don't want other CPUs mucking around in one CPU's "internal" scheduler structures, to avoid unnecessary cache syncs; otherwise we might as well go with a single shared queue. Which is a possibility, of course, but I'll try the other way first. ;)

Quote:

- If a CPU with at least a medium but not a full workload does not find a too-high-workload CPU to request threads from, it requests threads from low-workload CPUs. That way, the low-workload CPUs that are not really needed may become idle.

Hm. Yes, but see my points above. I mean, if someone decides to implement this, I'm all good; it's just that right now I believe my approach should be at least as good, if not better, and I'd rather focus on it. You know, I have a deadline ;)

Quote:

- If an idle CPU does not find any suitable threads for migration, it deactivates itself.

PERFECT. And fits the current framework!

Quote:

Keep up the good work! I am already waiting for the next part of the saga :-)

Thanks once more! ;)