Scheduler work progress

Blog post by Paweł Dziepak on Mon, 2013-10-21 16:14

Thanks to the generosity of Haiku supporters, I will be able to continue my work on the scheduler in November. It's high time I wrote a report on what has already been done. As mentioned before, my work can be tracked in the scheduler branch in my repository. Commit descriptions and some comments in the scheduler code explain in more detail the motivation behind some of the decisions I had to make. In this post, though, I will concentrate on how my work has looked so far and what I plan to do next.

I started with the machine-dependent part: detecting CPU and cache topology. I didn't need that information until recently, but I wanted to have the low-level code written before I moved on to reworking the scheduler. CPU vendors make it a bit more complicated than one might expect, since AMD and Intel provide topology information in different ways. Moreover, for both AMD and Intel, older processors report such data differently than newer ones, so there are actually four different means of detecting CPU topology. Currently, Haiku recognizes three levels of CPU topology: package (i.e. socket), core and SMT. However, it was implemented in such a way that adding another level in the future would be quite trivial.
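The x86 side of this can be illustrated with a small sketch: a logical processor's APIC ID encodes its position at each topology level, and detection largely boils down to learning the per-level bit widths, which is exactly where Intel and AMD (and older versus newer CPUs of each) differ. The struct, function name and bit widths below are illustrative assumptions, not Haiku's actual code:

```cpp
#include <cassert>

// Hypothetical representation of a logical CPU's position in the
// three-level topology the post describes: package, core, SMT.
struct CPUTopologyID {
	int package;
	int core;	// core within the package
	int smt;	// SMT thread within the core
};

// Decompose an x86-style APIC ID given per-level bit widths. On real
// hardware these widths come from vendor-specific CPUID leaves.
CPUTopologyID DecomposeAPICID(unsigned apicID, int smtBits, int coreBits)
{
	CPUTopologyID id;
	id.smt = apicID & ((1u << smtBits) - 1);
	id.core = (apicID >> smtBits) & ((1u << coreBits) - 1);
	id.package = apicID >> (smtBits + coreBits);
	return id;
}
```

With this layout, adding another topology level would only mean shifting out one more field, which matches the claim that extending the hierarchy is trivial.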

Once I was done with the low-level stuff, I started working on the simple scheduler and improving its performance on single-processor machines. I had two main goals: to make it O(1) with regard to the number of ready threads (i.e. all scheduler operations take constant time regardless of how many threads there are) and to prevent thread starvation. In order to achieve constant complexity I used a priority queue built from 121 lists (because we have 121 possible thread priorities). To improve performance I used a bitmap that allows the scheduler to quickly find the highest-priority thread.
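As a rough illustration of that design (the names and layout are mine, not Haiku's), a run queue with one list per priority and a 121-bit bitmap could look like this; finding the highest-priority thread is a couple of bit operations rather than a scan over all threads:

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

// Illustrative sketch of an O(1) run queue: one FIFO list per priority
// (121 priorities, 0-120) plus a bitmap of non-empty lists, so the
// highest-priority ready thread is found in constant time.
const int kPriorityCount = 121;

class SimpleRunQueue {
public:
	SimpleRunQueue() : fBitmap{0, 0} {}

	void Enqueue(int threadID, int priority)
	{
		fQueues[priority].push_back(threadID);
		fBitmap[priority / 64] |= uint64_t(1) << (priority % 64);
	}

	// Returns the ID of the highest-priority ready thread, or -1 if empty.
	int DequeueHighest()
	{
		for (int word = 1; word >= 0; word--) {
			if (fBitmap[word] == 0)
				continue;
			// Highest set bit == highest non-empty priority list.
			int priority = word * 64 + 63 - __builtin_clzll(fBitmap[word]);
			int threadID = fQueues[priority].front();
			fQueues[priority].pop_front();
			if (fQueues[priority].empty())
				fBitmap[word] &= ~(uint64_t(1) << (priority % 64));
			return threadID;
		}
		return -1;
	}

private:
	std::deque<int> fQueues[kPriorityCount];
	uint64_t fBitmap[2];	// 121 bits fit in two 64-bit words
};
```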

Solving the thread starvation problem was trickier. When a thread uses up its whole time slice without voluntarily giving up the processor (being pre-empted by a higher-priority thread doesn't count), its effective priority is decreased. This fixed problems like #8007, since no thread (except real-time threads, which are not affected by these changes) can prevent lower-priority threads from executing for an extended time. There was still another problem. When two CPU-bound threads earned their maximum penalty, their effective priorities were both reduced to 1, hence the relation between their priorities was lost. That wasn't right: a CPU-bound thread with initial priority of e.g. 20 should be given more processor time than a CPU-bound thread with initial priority of e.g. 5. I dealt with that problem by introducing a limit on how much a thread's priority can be reduced, which depends on the initial thread priority. After reaching this limit, the thread's priority cycles between 1 and the limit. This means that the thread still can't starve anyone, but is given more CPU time than lower-priority threads.
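A sketch of that rule might look as follows. The exact cap formula below (half the base priority) is my assumption for illustration; only the overall behaviour — penalty grows per used-up quantum, is capped based on the base priority, then cycles between the cap and 1 — follows the description above:

```cpp
#include <cassert>

// Illustrative effective-priority computation for a CPU-bound thread.
// basePriority: the thread's initial priority.
// quantaUsed: how many full time slices it has burned without yielding.
// Real-time threads would be exempt and are not modeled here.
int EffectivePriority(int basePriority, int quantaUsed)
{
	// Assumed cap: a thread may lose at most half of its base priority.
	int maxPenalty = basePriority / 2;
	if (maxPenalty == 0)
		return basePriority;

	if (quantaUsed < maxPenalty)
		return basePriority - quantaUsed;	// penalty still accumulating

	// Past the cap, the priority cycles between the capped value and 1,
	// so the thread cannot starve anyone but, on average, still gets more
	// CPU time than threads with a lower base priority.
	int cappedPriority = basePriority - maxPenalty;
	int phase = (quantaUsed - maxPenalty) % cappedPriority;
	return cappedPriority - phase;
}
```

With these assumptions a CPU-bound thread with base priority 20 cycles through effective priorities 10 down to 1, while one with base priority 5 cycles through 3 down to 1, preserving the relation between them.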

Then I could concentrate on multiprocessor systems. First I chose to add support for systems in which all logical processors share all levels of cache (e.g. a single-core processor with HyperThreading). Since the caches are shared, the scheduler doesn't have to care about cache affinity. That's why a single, shared run queue can be used, which greatly simplifies load balancing. I introduced a priority queue of logical processors, implemented as a heap, thus making the scheduler O(log n) with regard to the number of CPUs; it is used to track the priorities of the threads the processors are executing. When a thread becomes ready, its priority is checked against the minimal priority of a running thread, and if it is higher the running thread is pre-empted.
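That pre-emption check can be sketched like this (a simplified illustration built on `std::priority_queue`, not Haiku's implementation): a min-heap keyed by the priority of the thread each CPU is running lets the scheduler find the least important CPU in O(log n) and decide whether a newly ready thread should displace it.

```cpp
#include <cassert>
#include <queue>
#include <vector>

// One heap entry per logical processor: which CPU it is and the
// priority of the thread currently running on it.
struct CPUEntry {
	int cpu;
	int runningPriority;
};

// Comparator making std::priority_queue a min-heap on runningPriority,
// so top() is the CPU running the least important thread.
struct LowerPriorityFirst {
	bool operator()(const CPUEntry& a, const CPUEntry& b) const
	{
		return a.runningPriority > b.runningPriority;
	}
};

typedef std::priority_queue<CPUEntry, std::vector<CPUEntry>,
	LowerPriorityFirst> CPUHeap;

// Returns the CPU the ready thread should pre-empt, or -1 if every CPU
// is already running something at least as important.
int ChooseCPUToPreempt(CPUHeap& cpuHeap, int readyThreadPriority)
{
	if (cpuHeap.empty())
		return -1;
	CPUEntry lowest = cpuHeap.top();
	if (readyThreadPriority <= lowest.runningPriority)
		return -1;
	// Re-key the chosen CPU with the new thread's priority.
	cpuHeap.pop();
	cpuHeap.push({lowest.cpu, readyThreadPriority});
	return lowest.cpu;
}
```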

Having done that, I finally started working on the affine scheduler. It uses a separate run queue for each core so that threads rarely change the core they are running on (actually, at the moment they don't change at all, since thread migration is not ready yet). When a new thread arrives (one that hasn't been running on any core yet), the scheduler decides to which core's run queue it should be enqueued. If there are idle cores, one of them is chosen. Which one depends on the current scheduler mode of operation. There are currently two modes, which can be changed at any time: performance and power saving. In "performance" mode the scheduler first tries to wake as many packages as possible, then starts waking subsequent cores on each package. This is motivated by the fact that when only one core in a package is active, it has the shared cache levels all to itself and can enter boost mode, which increases its frequency above the standard level. In power saving mode the scheduler does the opposite: it tries to keep as many packages as possible idle and first wakes cores on packages that are active anyway. Moreover, in both modes cores to wake up are chosen in LIFO manner, i.e. the core that has been idle for the shortest time is woken first. That helps both performance and power saving.

That's pretty much what I have done so far. The affine scheduler still needs quite a lot of work before it is complete. Choosing a core for a thread when there are no idle ones is not finished yet, there is no thread migration that would ensure proper load balancing, and there is still a lot to do to reduce power consumption. Also, because the affine scheduler doesn't use a global run queue, it could help with lock contention. Unfortunately, Haiku uses gSchedulerLock, which acts like a big kernel lock and allows only one CPU to enter the scheduler at any given time. It is also used heavily by the mutex implementation, which makes things even worse. I plan to replace this giant lock with more fine-grained locking once the affine scheduler is in a more "ready to use" state.

Comments

Re: Scheduler work progress

Nice work Pawel!
An interesting feature would also be the ability to distribute (MSI) interrupts to all cores instead of only the first one, and pin the consumer thread to the same core. For instance, a network receiver thread would be pinned to the same core the interrupt is triggered on, potentially avoiding cache invalidation. Not a feature to prioritize anyway, but I thought it could fit in your current kernel work!

Re: Scheduler work progress

Thank you for mentioning that. Interrupt distribution is definitely one of the things I should work on once the scheduler is complete.

Re: Scheduler work progress

Thank you for all your hard work.
I realize I don't really understand much about what you just explained, but please, keep up the good work.

Re: Scheduler work progress

Nice to read about your progress. Do you also plan to run benchmarks, so that one can compare the progress (or regression :P)?

Re: Scheduler work progress

Yes, I plan to do some performance tests once the affine scheduler is completed and gSchedulerLock is replaced with fine-grained locking.

Re: Scheduler work progress

Very interesting!

Re: Scheduler work progress

Thanks a lot for your effort!

Do you foresee any trouble implementing good scheduling for AMD's latest Bulldozer architecture? It's a bit different from Intel's HT approach, so I'm curious.

Again, thanks for bringing Haiku one more step forward. :)

Re: Scheduler work progress

Generally, the scheduler is architecture-independent and is not supposed to contain any hardware-specific code. AFAIK, apart from the confusing naming, the only important difference between the solution used in Bulldozer and HyperThreading is the fact that SMT units do not share the L1 data cache, so it may be worth maintaining cache affinity at the logical processor level as well. (Obviously, there are also separate ALUs for each logical processor, but that "merely" improves performance; it does not require the scheduler to make different decisions.) I think supporting that at a satisfactory level would be quite straightforward.

Re: Scheduler work progress

Congratulations on the extension of your contract. I see a lot of work is being done in your branch. It would be interesting to see another blog post describing the work that's been done and what's next. Keep up the good work!