Haiku meets 9th processor

Blog post by Paweł Dziepak on Fri, 2013-12-20 20:59

It's been quite a long time since my last report, so I think it is a good time to describe what I have been doing over the last two months. The main scheduler logic has been completed, and I am now concentrating mainly on bug fixes, adjusting tunables, and some minor improvements. I also removed gSchedulerLock, a spinlock I mentioned in my last post, and replaced it with more fine-grained locking. New interfaces for cpufreq and cpuidle modules have been created, together with a cpufreq module for Intel Sandy Bridge or newer cores and a cpuidle module for all processors that support C-states and an invariant TSC. Furthermore, IRQs (including MSI) can now be directed to an arbitrary logical processor. The implementation of inter-processor interrupts has been improved so that it avoids acquiring any lock when that is not necessary, and it now supports multicast interrupts. And, last but not least, the 8 processor limit has been removed.

First things first: I have introduced load tracking. The load of each logical processor and core, as well as the load produced by each thread and IRQ, is monitored by the scheduler, which then uses that information to make all kinds of decisions. We already kept track of how much time a CPU spends doing useful work and how much CPU time a thread consumes, so extending that to compute load was a minor change. Knowing the load of all cores and the load a thread has been producing so far, the scheduler may attempt to choose the perfect core for that thread.
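
To give a rough idea of what "load" means here, the sketch below computes it as the fraction of a measurement interval spent doing useful work. It is only an illustration with made-up names and constants, not the actual kernel code.

    // Illustrative sketch of per-entity load tracking; names and the
    // measurement interval are assumptions, not taken from the Haiku kernel.
    #include <cstdint>

    struct LoadTracker {
        int64_t measureStart = 0;   // start of the current interval, microseconds
        int64_t activeTime   = 0;   // time spent active during the interval
        int32_t load         = 0;   // last computed load, 0..1000 (per mille)

        static constexpr int64_t kInterval = 50000;   // 50 ms window (assumed)

        // Called whenever the tracked entity (CPU, core, thread or IRQ)
        // finishes a burst of work; times are in microseconds.
        void AccountActive(int64_t now, int64_t ranFor)
        {
            activeTime += ranFor;
            if (now - measureStart >= kInterval) {
                load = int32_t(activeTime * 1000 / (now - measureStart));
                measureStart = now;
                activeTime = 0;
            }
        }
    };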

I also had to slightly change the way dynamic priorities work. Previously, when a thread consumed its whole time slice, its penalty was increased; when a thread ceased to be CPU bound, its penalty was cancelled. While this worked well in simple situations where a thread (or threads) does heavy computations for an extended period of time, it was still possible to starve low priority threads. Imagine a situation where two threads both do the same thing in a loop: acquire a lock, use the CPU for a period shorter than the time slice, release the lock. Neither of the threads would receive any penalty, since they never use a whole time slice, yet lower priority threads would be starved. To solve this problem I made the scheduler more strict. In addition to the previous conditions under which a thread's penalty was increased, it is now also increased when a thread voluntarily gives up the CPU (e.g. waits for a mutex). Also, when a thread goes to sleep, the thread that is currently the most starved is remembered. The penalty is cancelled only if that starved thread has had a chance to run before the penalized thread wakes up.
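
A condensed sketch of those rules might look like the following. The field and function names are hypothetical; the real scheduler code is considerably more involved.

    // Hypothetical sketch of the stricter penalty rules; not the actual
    // Haiku scheduler code. A thread is penalized when it exhausts its
    // quantum and also when it gives up the CPU voluntarily; the penalty
    // is cancelled only if the starved thread remembered at sleep time has
    // had a chance to run in the meantime.
    #include <algorithm>
    #include <cstdint>

    struct ThreadData {
        int32_t priority;                 // static priority
        int32_t penalty = 0;              // dynamic penalty

        int32_t EffectivePriority() const
        {
            return std::max(priority - penalty, int32_t(1));
        }

        void QuantumExpired()      { penalty++; }   // thread is CPU bound
        void VoluntarilyYielded()  { penalty++; }   // e.g. waits for a mutex

        void WokeUp(bool starvedThreadHasRunMeanwhile)
        {
            // Cancel the penalty only if the most starved thread recorded
            // when this thread went to sleep got a chance to run.
            if (starvedThreadHasRunMeanwhile)
                penalty = 0;
        }
    };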

These are the most important changes I made to the scheduler logic. Not directly connected with the scheduler algorithm, but still important, was replacing gSchedulerLock with more fine-grained locking. The things it used to protect are now protected by per-thread scheduler locks, per-team signal locks, and some additional locking for time tracking. Each mutex also has its own spinlock that protects its internals. In some places I took the opportunity to introduce lighter-weight locking: reader-writer spinlocks and sequential locks have been introduced and are used where a standard spinlock is not really needed.
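
As an example of those lighter-weight primitives, here is a minimal sequential lock (seqlock) sketch. It is illustrative only and assumes a single writer, which is not necessarily how the kernel's version is structured.

    // Minimal seqlock sketch: readers retry instead of blocking, the writer
    // bumps a sequence counter around its critical section.
    #include <atomic>
    #include <cstdint>

    struct SeqLock {
        std::atomic<uint32_t> sequence{0};

        // Writer side (assumes a single writer or external serialization).
        void WriteLock()   { sequence.fetch_add(1, std::memory_order_acquire); }
        void WriteUnlock() { sequence.fetch_add(1, std::memory_order_release); }

        // Reader side: spin while a write is in progress (odd sequence).
        uint32_t ReadBegin() const
        {
            uint32_t seq;
            while ((seq = sequence.load(std::memory_order_acquire)) & 1)
                ;
            return seq;
        }

        // Returns true if the read raced with a writer and must be retried.
        bool ReadRetry(uint32_t seq) const
        {
            return sequence.load(std::memory_order_acquire) != seq;
        }
    };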

Since the scheduler tracks both processor and thread load, it has all the information needed to predict how much performance is currently required to achieve the target load, which is a balance between power efficiency while the CPU is active and the amount of time spent idle. There is an interface for cpufreq modules that allows the scheduler to tell them whether it needs more or less performance. It is not guaranteed that the request will be satisfied, so I decided that instead of receiving some absolute value, the cpufreq module only gets the requested change in performance.
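
Conceptually, the request could be computed as simply as in the sketch below; the target value and names are assumptions for illustration, not the actual tunables or module interface.

    // Conceptual sketch only: the scheduler asks for a *change* in
    // performance rather than an absolute frequency.
    #include <cstdint>

    const int32_t kTargetLoad = 550;      // per mille; assumed balance point

    // Positive result means "give me more performance", negative "less".
    // The cpufreq module may not satisfy the request exactly.
    int32_t RequestedPerformanceDelta(int32_t currentLoad /* 0..1000 */)
    {
        return currentLoad - kTargetLoad;
    }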

The scheduler also has control over IRQ affinity. When interrupt handlers are installed, the IRQs are balanced across CPUs (the code is obviously aware of the CPU topology). The load produced by each IRQ handler is then tracked, and the scheduler can redirect IRQs if necessary. Also, when in power saving mode, all IRQs are redirected to the core used for small task packing, so that other cores are not woken up by them. In order for IRQ redirection to work correctly, the type of an interrupt vector has to be declared when it is reserved or allocated: exception, syscall, IRQ, or local IRQ. Only non-local IRQs can be redirected. Moreover, when several interrupt vectors are allocated at once they are treated as one IRQ by the balancing and redirection code; that was needed to properly support MSI.
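
To make the distinction concrete, the vector type declaration could look roughly like this; the identifiers are hypothetical, not the actual kernel header.

    // Hypothetical vector types. Only plain IRQs may be redirected by the
    // scheduler; exceptions, syscalls and local (per-CPU) interrupts are
    // never moved.
    enum interrupt_type {
        INTERRUPT_TYPE_EXCEPTION,
        INTERRUPT_TYPE_SYSCALL,
        INTERRUPT_TYPE_IRQ,        // redirectable, subject to balancing
        INTERRUPT_TYPE_LOCAL_IRQ   // bound to one logical processor
    };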

I also managed to improve the IPI implementation. Processing unicast IPI messages has been made lock-free - that was quite easy, since the messages are sent to the target CPU using a multiple-producer, single-consumer queue. Sending broadcast and multicast IPI messages is still blocking, but processors don't need to acquire any lock to check whether there are messages they need to process. That was achieved by introducing both a per-processor and a global message counter. The global counter keeps track of how many messages have been sent, and the per-processor counter keeps track of how many messages that CPU has already processed. Only if these values differ does the processor acquire a lock.
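
The counter trick boils down to a cheap comparison before taking the lock; a minimal sketch with assumed names:

    // Sketch of the lock-avoidance check (assumed names, not kernel code):
    // a CPU takes the broadcast/multicast lock only when the global "sent"
    // counter differs from its local "processed" counter.
    #include <atomic>
    #include <cstdint>

    std::atomic<int64_t> gSentMessages{0};   // incremented by senders

    struct PerCPUData {
        int64_t processedMessages = 0;       // only touched by this CPU
    };

    bool MustCheckMessageQueue(const PerCPUData& cpu)
    {
        return gSentMessages.load(std::memory_order_acquire)
            != cpu.processedMessages;
    }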

We also had a somewhat problematic way of obtaining the current thread information in kernel mode on x86. A pointer to the current thread was stored in the debug register dr3. Unfortunately, reads from this register are slow, slower than any read from memory that hits the cache. The situation becomes even worse when running Haiku in a virtual machine. Depending on the hypervisor configuration, access to a debug register causes VM exits (i.e. the virtual machine is stopped and the host OS handles that instruction without any hardware support), which makes it much slower than memory reads, even ones that miss the cache. That's why I decided to switch to the more common way of storing current thread data (used e.g. in the x86_64 version) and use the segment register gs for that. It involved creating a separate GDT for each CPU, but there was nothing difficult in that.
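
On 32-bit x86 with GCC, the gs-based lookup amounts to a single memory read. The snippet below is a rough sketch of the idea, not the exact Haiku code.

    // Each CPU gets its own GDT entry whose base points at its per-CPU
    // data, so %gs:0 can hold the current thread pointer (illustrative).
    struct Thread;

    static inline Thread*
    get_current_thread()
    {
        Thread* thread;
        __asm__ __volatile__("movl %%gs:0, %0" : "=r"(thread));
        return thread;
    }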

Finally, I've removed our 8 processor limit. It is currently set to 64 processors, but further increases will be much easier than this one was. The most problematic thing was that the structure system_info, which is part of the legacy interface, contained an array of 8 structures with per-CPU data. Any change to that would break the ABI. To solve this problem, a new interface has been introduced which replaces the old get_system_info() function. Though removed from the API, the old function is still available in the ABI. This means that sources may need to be updated, but legacy binaries will work without any changes. The new interface enables applications to obtain the whole CPU topology tree. It was designed with heterogeneous architectures (e.g. ARM big.LITTLE) in mind and is ready for systems where not all cores are of the same model or have the same default frequency.
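
For illustration, querying the topology tree could look like the snippet below. The function and field names follow the interface that later appeared in Haiku's OS.h, so treat the exact names here as assumptions in the context of this post.

    // Walks the CPU topology tree and prints the default frequency of
    // each core node (names assumed to match the eventual public API).
    #include <OS.h>
    #include <stdio.h>

    int main()
    {
        uint32 count = 0;
        get_cpu_topology_info(NULL, &count);          // query required size

        cpu_topology_node_info* nodes = new cpu_topology_node_info[count];
        get_cpu_topology_info(nodes, &count);

        for (uint32 i = 0; i < count; i++) {
            if (nodes[i].type == B_TOPOLOGY_CORE) {
                printf("core %" B_PRIu32 ": default frequency %" B_PRIu64 " Hz\n",
                    nodes[i].id, nodes[i].data.core.default_frequency);
            }
        }

        delete[] nodes;
        return 0;
    }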

As expected, the old interface wasn't the only obstacle to letting Haiku take advantage of up to 64 logical processors. In several places a 32 bit value was used as a CPU mask; that was replaced with a CPUSet class whose size depends on the current internal CPU limit. Also, during initialization the physical page mapper allocates a certain number of page slots for each CPU. Since at that stage memory allocation is not fully working, it used global variables and placement new to create the required objects. The memory reserved for that wasn't declared in relation to the maximum number of supported CPUs and ran out quite early.
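
A CPU mask whose storage follows the internal limit can be as simple as the sketch below; it is illustrative only, not the actual class from the branch.

    // Minimal CPUSet-style bitmask sized from the internal CPU limit.
    #include <cstdint>

    const int kMaxCPUCount = 64;   // current internal limit

    class CPUSet {
    public:
        void SetBit(int cpu)       { fBits[cpu / 32] |= 1u << (cpu % 32); }
        void ClearBit(int cpu)     { fBits[cpu / 32] &= ~(1u << (cpu % 32)); }
        bool GetBit(int cpu) const { return (fBits[cpu / 32] >> (cpu % 32)) & 1; }

    private:
        uint32_t fBits[kMaxCPUCount / 32] = {};
    };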

That's more or less what I have done since my last report. I'm now doing some minor fixes and improvements, as well as testing. Unless I find any serious problems, I think my scheduler branch will be ready for a merge quite soon.

Comments

Re: Haiku meets 9th processor

Aw, I expected a picture :)

Well, no worries, I still have the link you posted the other day:

http://www.dumpt.com/img/viewer.php?file=jzjpgr7skh9v67pz82aw.png

Re: Haiku meets 9th processor

Well, THAT'S impressive!
The scheduler truly is improved!

Re: Haiku meets 9th processor

Benchmarks are missing!

old_scheduler vs new_scheduler

I see a screenshot with 50 CPUs, and most of the CPUs are at 0%.

Haiku needs a Media Kit-like scheduler: when we have more than 2 CPU cores, one core could be dedicated to realtime calculations, and with more than 3-4 cores the others could handle all the Kits' work, e.g. app_server processes (we have no full GPU support now, and won't in the future!).

But right now, the work on the scheduler is the best new thing since Haiku was released as alpha.

Thanks

Re: Haiku meets 9th processor

Let me clarify what the screenshot is demonstrating...

There are 50 (virtual) CPUs that Haiku can recognize and use, but for binary compatibility old programs see only 8.

Notice the Pulse++ app there - which he probably copied from an old Haiku, or maybe even BeOS, I dunno - only sees 8 CPUs.

This was required in order to increase the maximum number of CPU cores that Haiku can see/use while still maintaining backward binary compatibility with old Haiku and BeOS software.

Re: Haiku meets 9th processor

stargater wrote:

I see a screenshot with 50 CPUs, and most of the CPUs are at 0%.

As expected. Because the system in the screenshot isn't heavily loaded, there aren't enough active threads to utilize all 50 logical processors. Also, in the ActivityMonitor graph you can see cache affinity in action. Processor loads don't change significantly because the scheduler tries to keep each active thread on its own core and not move it for as long as possible. If all or most of the 50 processors had non-zero load, that would mean either that there is a very high number of active threads or that the scheduler is needlessly migrating threads.

Quote:

Haiku needs a Media Kit-like scheduler: when we have more than 2 CPU cores, one core could be dedicated to realtime calculations, and with more than 3-4 cores the others could handle all the Kits' work, e.g. app_server processes (we have no full GPU support now, and won't in the future!).

Having dedicated logical processors for different kinds of tasks makes it very difficult to do proper load balancing. Basically, either you waste CPU power because some processor or processors are reserved for something else, or you end up with a load balancing algorithm that does not take such reservations into account in any significant way.

Re: Haiku meets 9th processor

OK
I hope you will soon finish the scheduler and merge it into Haiku.
I think the focus should be on x64 CPUs, and breaking the ABI is fine. Most of us want to see Haiku reach the next level; for me that means having one compiler (clang or gcc4.x), a VESA 3 GPU driver, a stable fs (BeExt4 :-) ), and a desktop and UI like ChromeOS.

For me the goal of this OS is not to "be binary compatible with BeOS R5"; it is to handle lots of streams of data (video files, images, sounds).
In other OSs the memory is reserved for an eye-candy UI and we do the same things as 20 years ago, just nicer; hmm, that is a killer for innovation.

Re: Haiku meets 9th processor

Excellent update and progress, well done! Would it be a good idea to release a few scheduler nightlies so that people can test them easily and provide feedback on your work? Maybe Matt can add your branch to the buildbot like the package management nightlies... This way big issues can be caught before you merge.

That, and now we can hold a contest on who has the most CPUs (on real hardware, of course!). My i7 only has 8, so while it is nice, that's nothing new :-)

Re: Haiku meets 9th processor

Reads very impressive, Paweł. I don't understand half of it, of course... :)

I wonder, is all of it your brain-child, plus trial&error, benchmarking, reverting and improving? Or are these all/in part/mostly commonly known strategies that have been implemented in other OSs before?

I'll echo other commenters, some before/after benchmarks would be nice, as well as test images once you have the biggest bugs fixed. :)

Thanks for your fantastic work!
Humdinger

Re: Haiku meets 9th processor

humdinger wrote:

I wonder, is all of it your brain-child, plus trial&error, benchmarking, reverting and improving? Or are these all/in part/mostly commonly known strategies that have been implemented in other OSs before?

A bit of both. Obviously, it is better to learn from the mistakes of others, so I am watching e.g. the recent discussions about a power-aware Linux scheduler. Apparently, they have noticed that having the cpufreq module and the scheduler cooperate directly is what they need, and that significantly influenced my decision to do the same in Haiku. (It's not that surprising, though: the scheduler and the cpufreq module both need to know the CPU load, so keeping them separate doesn't look like a good idea.) Some other things are simply influenced by solutions used in other kernels. For example, using an O(1) run queue in the scheduler implementation is very common; at some point all significant kernels have used it (and most of them still do). There are also things that were just intuition (and general knowledge of which kinds of behaviour processors like and which they don't) that proved to be right, or a need to solve some specific problems.
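
For readers unfamiliar with the term, an O(1) run queue keeps one FIFO per priority level plus a bitmap of non-empty levels, so picking the next thread is a find-first-set rather than a linear scan. A toy sketch, with illustrative names only, not Haiku's implementation:

    // Toy O(1) run queue: higher numeric priority runs first.
    #include <cstdint>
    #include <deque>

    struct Thread;

    const int kPriorityLevels = 120;
    const int kBitmapWords = (kPriorityLevels + 63) / 64;

    struct RunQueue {
        std::deque<Thread*> queues[kPriorityLevels];
        uint64_t bitmap[kBitmapWords] = {};

        void Enqueue(Thread* thread, int priority)
        {
            queues[priority].push_back(thread);
            bitmap[priority / 64] |= uint64_t(1) << (priority % 64);
        }

        Thread* Dequeue()
        {
            for (int word = kBitmapWords - 1; word >= 0; word--) {
                if (bitmap[word] == 0)
                    continue;
                // Highest set bit in this word = highest runnable priority.
                int bit = 63 - __builtin_clzll(bitmap[word]);
                int priority = word * 64 + bit;
                Thread* thread = queues[priority].front();
                queues[priority].pop_front();
                if (queues[priority].empty())
                    bitmap[word] &= ~(uint64_t(1) << bit);
                return thread;
            }
            return nullptr;   // run queue is empty
        }
    };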

I know that many of you are eager to see benchmarks, but I ask for a little patience. I am cleaning up the code, fixing remaining issues and doing some minor optimizations ("minor" because of the amount of code involved, not necessarily because of their impact), all of which needs to be done before I can present benchmark results that are actually worth something (fixing some issues may decrease performance, optimizing the code should increase it).

Re: Haiku meets 9th processor

Damn good job, but here I was looking for an i7 with 8 cores with hyper-threading, and you just raised the bar. Now I need a machine with 2-4 CPU sockets (16-32 virtual cores) just to keep up. :)

But in truth I have one program which supports up to 1024 threads for its search function, but with more than 256 threads on a dual core machine (4 virtual CPUs) the context switching really slows it down.

Your work will solve my problem if I can find the right machine :(

Good work, please continue - it is looking like R1 will be awesome!

Re: Haiku meets 9th processor

Progress, awesome sauce!

Great work.

It would be entertaining to see someone with a quad socket Opteron box fire it up with 48-64 cores (not as outrageously expensive as you might imagine if you eBay the CPUs coming from decommissioned servers).

Re: Haiku meets 9th processor

Impressive work Pawel. And I'll add my vote for some benchmarks :)

Re: Haiku meets 9th processor

Where are the benchmarks..... for awesomeness ?? ;)

Re: Haiku meets 9th processor

Awesome work Pawel!

If I were to integrate this with my USB blink! (blinkenlights), what should I use to get the CPU utilization on each CPU?

Glad to see the 8 CPU limit gone! Time for a 16 CPU version of blink!

Can't wait to see this branch!

Re: Haiku meets 9th processor

Nice work!

About this:

  • Imagine a situation where two threads both do the same thing in a loop: acquire a lock, use the CPU for a period shorter than the time slice, release the lock. Neither of the threads would receive any penalty, since they never use a whole time slice, yet lower priority threads would be starved. To solve this problem I made the scheduler more strict. In addition to the previous conditions under which a thread's penalty was increased, it is now also increased when a thread voluntarily gives up the CPU (e.g. waits for a mutex).

Does this make the mutex-waiting thread wait for a new time slice to start before it runs again?

Because I feel that is what should be happening in that case.

Re: Haiku meets 9th processor

Threads never *wait* for time slices. As soon as the scheduler chooses a thread from the run queue, it is run, either completing a previously started time slice or starting a new one.

When a thread wakes up it usually continues the time slice it got before sleeping.

A thread's penalty can be increased no more than once during a time slice. The algorithm has changed slightly since I wrote that, but generally a thread receives a weaker type of penalty if it has been sleeping during its time slice (in most cases that penalty will be cancelled anyway), or a stronger one if it has been CPU bound the whole time.