Haiku meets 9th processor

Blog post by Paweł Dziepak on Fri, 2013-12-20 20:59

It's been quite a long time since my last report, so I think it is a good time to describe what I have been doing over the last two months. The main scheduler logic has been completed, and I am now concentrating mainly on bug fixes, adjusting tunables, and some minor improvements. I also removed gSchedulerLock, a spinlock I mentioned in my last post, and replaced it with more fine-grained locking. New interfaces for cpufreq and cpuidle modules have been created, together with a cpufreq module for Intel Sandy Bridge or newer cores and a cpuidle module for all processors that support C-states and an invariant TSC. Furthermore, IRQs (including MSI) can now be directed to an arbitrary logical processor. The implementation of inter-processor interrupts has been improved so that it avoids acquiring any lock when that is not necessary, and it now supports multicast interrupts. And, last but not least, the 8 processor limit has been removed.

First things first: I have introduced load tracking. The load of each logical processor and core, as well as the load produced by each thread and IRQ, is monitored by the scheduler, which then uses that information to make all kinds of decisions. We already kept track of how much time a CPU spends doing useful work and how much CPU time a thread consumes, so extending that to compute load was a minor change. Knowing the load of all cores and the load a thread has been producing so far, the scheduler may attempt to choose the perfect core for that thread.
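
To give a rough idea of what "load" means here, the sketch below computes it as the fraction of a measurement interval spent doing useful work. It is only an illustration with made-up names and constants, not the actual kernel code.

    // Illustrative sketch of per-entity load tracking; names and the
    // measurement interval are assumptions, not taken from the Haiku kernel.
    #include <cstdint>

    struct LoadTracker {
        int64_t measureStart = 0;   // start of the current interval, microseconds
        int64_t activeTime   = 0;   // time spent active during the interval
        int32_t load         = 0;   // last computed load, 0..1000 (per mille)

        static constexpr int64_t kInterval = 50000;   // 50 ms window (assumed)

        // Called whenever the tracked entity (CPU, core, thread or IRQ)
        // finishes a burst of work; times are in microseconds.
        void AccountActive(int64_t now, int64_t ranFor)
        {
            activeTime += ranFor;
            if (now - measureStart >= kInterval) {
                load = int32_t(activeTime * 1000 / (now - measureStart));
                measureStart = now;
                activeTime = 0;
            }
        }
    };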

I also had to slightly change the way dynamic priorities work. Previously, when a thread consumed its whole time slice, its penalty was increased; when a thread ceased to be CPU bound, its penalty was cancelled. While this worked well in simple situations where a thread (or threads) does heavy computations for an extended period of time, it was still possible to starve low priority threads. Imagine a situation where two threads both do the same thing in a loop: acquire a lock, use the CPU for a period shorter than the time slice, release the lock. Neither of the threads would receive any penalty, since they never use a whole time slice, yet lower priority threads would be starved. To solve this problem I made the scheduler more strict. In addition to the previous conditions under which a thread's penalty was increased, it is now also increased when a thread voluntarily gives up the CPU (e.g. waits for a mutex). Also, when a thread goes to sleep, the thread that is currently the most starved is remembered. The penalty is cancelled only if that starved thread has had a chance to run before the penalized thread wakes up.
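
A condensed sketch of those rules might look like the following. The field and function names are hypothetical; the real scheduler code is considerably more involved.

    // Hypothetical sketch of the stricter penalty rules; not the actual
    // Haiku scheduler code. A thread is penalized when it exhausts its
    // quantum and also when it gives up the CPU voluntarily; the penalty
    // is cancelled only if the starved thread remembered at sleep time has
    // had a chance to run in the meantime.
    #include <algorithm>
    #include <cstdint>

    struct ThreadData {
        int32_t priority;                 // static priority
        int32_t penalty = 0;              // dynamic penalty

        int32_t EffectivePriority() const
        {
            return std::max(priority - penalty, int32_t(1));
        }

        void QuantumExpired()      { penalty++; }   // thread is CPU bound
        void VoluntarilyYielded()  { penalty++; }   // e.g. waits for a mutex

        void WokeUp(bool starvedThreadHasRunMeanwhile)
        {
            // Cancel the penalty only if the most starved thread recorded
            // when this thread went to sleep got a chance to run.
            if (starvedThreadHasRunMeanwhile)
                penalty = 0;
        }
    };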

These are the most important changes I made to the scheduler logic. Not directly connected with the scheduler algorithm, but still important, was replacing gSchedulerLock with more fine-grained locking. The things it used to protect are now protected by per-thread scheduler locks, per-team signal locks, and some additional locking for time tracking. Each mutex also has its own spinlock that protects its internals. In some places I took the opportunity to introduce lighter-weight locking: reader-writer spinlocks and sequential locks have been introduced and are used where a standard spinlock is not really needed.
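
As an example of those lighter-weight primitives, here is a minimal sequential lock (seqlock) sketch. It is illustrative only and assumes a single writer, which is not necessarily how the kernel's version is structured.

    // Minimal seqlock sketch: readers retry instead of blocking, the writer
    // bumps a sequence counter around its critical section.
    #include <atomic>
    #include <cstdint>

    struct SeqLock {
        std::atomic<uint32_t> sequence{0};

        // Writer side (assumes a single writer or external serialization).
        void WriteLock()   { sequence.fetch_add(1, std::memory_order_acquire); }
        void WriteUnlock() { sequence.fetch_add(1, std::memory_order_release); }

        // Reader side: spin while a write is in progress (odd sequence).
        uint32_t ReadBegin() const
        {
            uint32_t seq;
            while ((seq = sequence.load(std::memory_order_acquire)) & 1)
                ;
            return seq;
        }

        // Returns true if the read raced with a writer and must be retried.
        bool ReadRetry(uint32_t seq) const
        {
            return sequence.load(std::memory_order_acquire) != seq;
        }
    };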

Since the scheduler tracks both processor and thread load, it has all the information needed to predict how much performance is currently required to achieve the target load, which is a balance between power efficiency while the CPU is active and the amount of time spent idle. There is an interface for cpufreq modules that allows the scheduler to tell them whether it needs more or less performance. It is not guaranteed that the request will be satisfied, so I decided that instead of receiving some absolute value, the cpufreq module only gets the requested change in performance.
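
Conceptually, the request could be computed as simply as in the sketch below; the target value and names are assumptions for illustration, not the actual tunables or module interface.

    // Conceptual sketch only: the scheduler asks for a *change* in
    // performance rather than an absolute frequency.
    #include <cstdint>

    const int32_t kTargetLoad = 550;      // per mille; assumed balance point

    // Positive result means "give me more performance", negative "less".
    // The cpufreq module may not satisfy the request exactly.
    int32_t RequestedPerformanceDelta(int32_t currentLoad /* 0..1000 */)
    {
        return currentLoad - kTargetLoad;
    }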

The scheduler also has control over IRQ affinity. When interrupt handlers are installed, the IRQs are balanced across CPUs (the code is obviously aware of the CPU topology). The load produced by each IRQ handler is then tracked, and the scheduler can redirect IRQs if necessary. Also, when in power saving mode, all IRQs are redirected to the core used for small task packing, so that other cores are not woken up by them. In order for IRQ redirection to work correctly, the type of an interrupt vector has to be declared when it is reserved or allocated: exception, syscall, IRQ, or local IRQ. Only non-local IRQs can be redirected. Moreover, when several interrupt vectors are allocated at once they are treated as one IRQ by the balancing and redirection code; that was needed to properly support MSI.
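
To make the distinction concrete, the vector type declaration could look roughly like this; the identifiers are hypothetical, not the actual kernel header.

    // Hypothetical vector types. Only plain IRQs may be redirected by the
    // scheduler; exceptions, syscalls and local (per-CPU) interrupts are
    // never moved.
    enum interrupt_type {
        INTERRUPT_TYPE_EXCEPTION,
        INTERRUPT_TYPE_SYSCALL,
        INTERRUPT_TYPE_IRQ,        // redirectable, subject to balancing
        INTERRUPT_TYPE_LOCAL_IRQ   // bound to one logical processor
    };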

I also managed to improve the IPI implementation. Processing unicast IPI messages has been made lock-free - that was quite easy, since the messages are sent to the target CPU using a multiple-producer, single-consumer queue. Sending broadcast and multicast IPI messages is still blocking, but processors don't need to acquire any lock to check whether there are messages they need to process. That was achieved by introducing both a per-processor and a global message counter. The global counter keeps track of how many messages have been sent, and the per-processor counter keeps track of how many messages that CPU has already processed. Only if these values differ does the processor acquire a lock.
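
The counter trick boils down to a cheap comparison before taking the lock; a minimal sketch with assumed names:

    // Sketch of the lock-avoidance check (assumed names, not kernel code):
    // a CPU takes the broadcast/multicast lock only when the global "sent"
    // counter differs from its local "processed" counter.
    #include <atomic>
    #include <cstdint>

    std::atomic<int64_t> gSentMessages{0};   // incremented by senders

    struct PerCPUData {
        int64_t processedMessages = 0;       // only touched by this CPU
    };

    bool MustCheckMessageQueue(const PerCPUData& cpu)
    {
        return gSentMessages.load(std::memory_order_acquire)
            != cpu.processedMessages;
    }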

We also had a somewhat problematic way of obtaining the current thread information in kernel mode on x86. A pointer to the current thread was stored in the debug register dr3. Unfortunately, reads from this register are slow, slower than any read from memory that hits the cache. The situation becomes even worse when running Haiku in a virtual machine. Depending on the hypervisor configuration, access to a debug register causes VM exits (i.e. the virtual machine is stopped and the host OS handles that instruction without any hardware support), which makes it much slower than memory reads, even ones that miss the cache. That's why I decided to switch to the more common way of storing current thread data (used e.g. in the x86_64 version) and use the segment register gs for that. It involved creating a separate GDT for each CPU, but there was nothing difficult in that.
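
On 32-bit x86 with GCC, the gs-based lookup amounts to a single memory read. The snippet below is a rough sketch of the idea, not the exact Haiku code.

    // Each CPU gets its own GDT entry whose base points at its per-CPU
    // data, so %gs:0 can hold the current thread pointer (illustrative).
    struct Thread;

    static inline Thread*
    get_current_thread()
    {
        Thread* thread;
        __asm__ __volatile__("movl %%gs:0, %0" : "=r"(thread));
        return thread;
    }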

Finally, I've removed our 8 processor limit. It is currently set to 64 processors, but further increases will be much easier than this one was. The most problematic thing was that the structure system_info, which is part of the legacy interface, contained an array of 8 structures with per-CPU data. Any change to that would break the ABI. To solve this problem, a new interface has been introduced which replaces the old get_system_info() function. Though removed from the API, the old function is still available in the ABI. This means that sources may need to be updated, but legacy binaries will work without any changes. The new interface enables applications to obtain the whole CPU topology tree. It was designed with heterogeneous architectures (e.g. ARM big.LITTLE) in mind and is ready for systems where not all cores are of the same model or have the same default frequency.
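
For illustration, querying the topology tree could look like the snippet below. The function and field names follow the interface that later appeared in Haiku's OS.h, so treat the exact names here as assumptions in the context of this post.

    // Walks the CPU topology tree and prints the default frequency of
    // each core node (names assumed to match the eventual public API).
    #include <OS.h>
    #include <stdio.h>

    int main()
    {
        uint32 count = 0;
        get_cpu_topology_info(NULL, &count);          // query required size

        cpu_topology_node_info* nodes = new cpu_topology_node_info[count];
        get_cpu_topology_info(nodes, &count);

        for (uint32 i = 0; i < count; i++) {
            if (nodes[i].type == B_TOPOLOGY_CORE) {
                printf("core %" B_PRIu32 ": default frequency %" B_PRIu64 " Hz\n",
                    nodes[i].id, nodes[i].data.core.default_frequency);
            }
        }

        delete[] nodes;
        return 0;
    }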

As expected, the old interface wasn't the only obstacle to letting Haiku take advantage of up to 64 logical processors. In several places a 32 bit value was used as a CPU mask; that was replaced with a CPUSet class whose size depends on the current internal CPU limit. Also, during initialization the physical page mapper allocates a certain number of page slots for each CPU. Since at that stage memory allocation is not fully working, it used global variables and placement new to create the required objects. The memory reserved for that wasn't declared in relation to the maximum number of supported CPUs and ran out quite early.
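
A CPU mask whose storage follows the internal limit can be as simple as the sketch below; it is illustrative only, not the actual class from the branch.

    // Minimal CPUSet-style bitmask sized from the internal CPU limit.
    #include <cstdint>

    const int kMaxCPUCount = 64;   // current internal limit

    class CPUSet {
    public:
        void SetBit(int cpu)       { fBits[cpu / 32] |= 1u << (cpu % 32); }
        void ClearBit(int cpu)     { fBits[cpu / 32] &= ~(1u << (cpu % 32)); }
        bool GetBit(int cpu) const { return (fBits[cpu / 32] >> (cpu % 32)) & 1; }

    private:
        uint32_t fBits[kMaxCPUCount / 32] = {};
    };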

That's more or less what I have done since my last report. I'm now doing some minor fixes and improvements, as well as testing. Unless I find any serious problems, I think my scheduler branch will be ready for a merge quite soon.

Comments

Re: Haiku meets 9th processor

Aw, I expected a picture :)

Well, no worries, I still have the link you posted the other day:

http://www.dumpt.com/img/viewer.php?file=jzjpgr7skh9v67pz82aw.png

Re: Haiku meets 9th processor

Well, THAT'S impressive!
The scheduler truly is improved!

Re: Haiku meets 9th processor

Benchmarks are missing!

old_scheduler vs new_scheduler

I see a screenshot with 50 CPUs, and most of the CPUs are at 0%.

Haiku needs a Media Kit-like scheduler: when we have more than 2 CPU cores, one core could be dedicated to realtime calculations, and with more than 3-4 cores the others could handle all the Kits' work, e.g. app_server processes (we have no full GPU support now, and won't in the future!).

But right now, the work on the scheduler is the best new thing since Haiku was released as alpha.

Thanks

Re: Haiku meets 9th processor

Let me clarify what the screenshot is demonstrating...

There are 50 (virtual) CPUs that Haiku can recognize and use, but for binary compatibility old programs see only 8.

Notice the Pulse++ app there - which he probably copied from an old Haiku, or maybe even BeOS, I dunno - only sees 8 CPUs.

This was required in order to increase the maximum number of CPU cores that Haiku can see/use while still maintaining backward binary compatibility with old Haiku and BeOS software.

Re: Haiku meets 9th processor

stargater wrote:

I see a screenshot with 50 CPUs, and most of the CPUs are at 0%.

As expected. Because the system in the screenshot isn't heavily loaded, there aren't enough active threads to utilize all 50 logical processors. Also, in the ActivityMonitor graph you can see cache affinity in action. Processor loads don't change significantly because the scheduler tries to keep each active thread on its own core and not move it for as long as possible. If all or most of the 50 processors had non-zero load, that would mean either that there is a very high number of active threads or that the scheduler is needlessly migrating threads.

Quote:

Haiku needs a Media Kit-like scheduler: when we have more than 2 CPU cores, one core could be dedicated to realtime calculations, and with more than 3-4 cores the others could handle all the Kits' work, e.g. app_server processes (we have no full GPU support now, and won't in the future!).

Having dedicated logical processors for different kinds of tasks makes it very difficult to do proper load balancing. Basically, either you waste CPU power because some processor or processors are reserved for something else, or you end up with a load balancing algorithm that does not take such reservations into account in any significant way.

Re: Haiku meets 9th processor

OK
I hope you will soon finish the scheduler and merge it into Haiku.
I think the focus should be on x64 CPUs, and breaking the ABI is fine. Most of us want to see Haiku reach the next level; for me that means having one compiler (clang or gcc4.x), a VESA 3 GPU driver, a stable fs (BeExt4 :-) ), and a desktop and UI like ChromeOS.

For me the goal of this OS is not to "be binary compatible with BeOS R5"; it is to handle lots of streams of data (video files, images, sounds).
In other OSs the memory is reserved for an eye-candy UI and we do the same things as 20 years ago, just nicer; hmm, that is a killer for innovation.

Re: Haiku meets 9th processor

Excellent update and progress, well done! Would it be a good idea to release a few scheduler nightlies so that people can test them easily and provide feedback on your work? Maybe Matt can add your branch to the buildbot like the package management nightlies... This way big issues can be caught before you merge.

That, and now we can hold a contest on who has the most CPUs (on real hardware, of course!). My i7 only has 8, so while it is nice, that's nothing new :-)

Re: Haiku meets 9th processor

Reads very impressive, Paweł. I don't understand half of it, of course... :)

I wonder, is all of it your brain-child, plus trial&error, benchmarking, reverting and improving? Or are these all/in part/mostly commonly known strategies that have been implemented in other OSs before?

I'll echo other commenters, some before/after benchmarks would be nice, as well as test images once you have the biggest bugs fixed. :)

Thanks for your fantastic work!
Humdinger

Re: Haiku meets 9th processor

humdinger wrote:

I wonder, is all of it your brain-child, plus trial&error, benchmarking, reverting and improving? Or are these all/in part/mostly commonly known strategies that have been implemented in other OSs before?

A bit of both. Obviously, it is better to learn from the mistakes of others, so I am watching e.g. the recent discussions about a power-aware Linux scheduler. Apparently, they have noticed that having the cpufreq module and the scheduler cooperate directly is what they need, and that significantly influenced my decision to do the same in Haiku. (It's not that surprising, though: the scheduler and the cpufreq module both need to know the CPU load, so keeping them separate doesn't look like a good idea.) Some other things are simply influenced by solutions used in other kernels. For example, using an O(1) run queue in the scheduler implementation is very common; at some point all significant kernels have used it (and most of them still do). There are also things that were just intuition (and general knowledge of which kinds of behaviour processors like and which they don't) that proved to be right, or a need to solve some specific problems.
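
For readers unfamiliar with the term, an O(1) run queue keeps one FIFO per priority level plus a bitmap of non-empty levels, so picking the next thread is a find-first-set rather than a linear scan. A toy sketch, with illustrative names only, not Haiku's implementation:

    // Toy O(1) run queue: higher numeric priority runs first.
    #include <cstdint>
    #include <deque>

    struct Thread;

    const int kPriorityLevels = 120;
    const int kBitmapWords = (kPriorityLevels + 63) / 64;

    struct RunQueue {
        std::deque<Thread*> queues[kPriorityLevels];
        uint64_t bitmap[kBitmapWords] = {};

        void Enqueue(Thread* thread, int priority)
        {
            queues[priority].push_back(thread);
            bitmap[priority / 64] |= uint64_t(1) << (priority % 64);
        }

        Thread* Dequeue()
        {
            for (int word = kBitmapWords - 1; word >= 0; word--) {
                if (bitmap[word] == 0)
                    continue;
                // Highest set bit in this word = highest runnable priority.
                int bit = 63 - __builtin_clzll(bitmap[word]);
                int priority = word * 64 + bit;
                Thread* thread = queues[priority].front();
                queues[priority].pop_front();
                if (queues[priority].empty())
                    bitmap[word] &= ~(uint64_t(1) << bit);
                return thread;
            }
            return nullptr;   // run queue is empty
        }
    };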

I know that many of you are eager to see benchmarks, but I ask for a little patience. I am cleaning up the code, fixing remaining issues and doing some minor optimizations ("minor" because of the amount of code involved, not necessarily because of their impact), all of which needs to be done before I can present benchmark results that are actually worth something (fixing some issues may decrease performance, optimizing the code should increase it).

Re: Haiku meets 9th processor

Damn good job, but here I was looking for an i7 with 8 cores with hyper-threading, and you just raised the bar. Now I need a machine with 2-4 CPU sockets (16-32 virtual cores) just to keep up. :)

But in truth I have one program which supports up to 1024 threads for its search function, but with more than 256 threads on a dual core machine (4 virtual CPUs) the context switching really slows it down.

Your work will solve my problem if I can find the right machine :(

Good work, please continue - it is looking like R1 will be awesome!

Re: Haiku meets 9th processor

Progress, awesome sauce!

Great work.

It would be entertaining to see someone with a quad socket Opteron box fire it up with 48-64 cores (not as outrageously expensive as you might imagine if you eBay the CPUs coming from decommissioned servers).

Re: Haiku meets 9th processor

Impressive work Pawel. And I'll add my vote for some benchmarks :)

Re: Haiku meets 9th processor

Where are the benchmarks..... for awesomeness ?? ;)

Re: Haiku meets 9th processor

Awesome work Pawel!

If I were to integrate this with my USB blink! (blinkenlights), what should I use to get the CPU utilization on each CPU?

Glad to see the 8 CPU limit gone! Time for a 16 CPU version of blink!

Can't wait to see this branch!

Re: Haiku meets 9th processor

Nice work!

About this:

  • Imagine a situation where two threads both do the same thing in a loop: acquire a lock, use the CPU for a period shorter than the time slice, release the lock. Neither of the threads would receive any penalty, since they never use a whole time slice, yet lower priority threads would be starved. To solve this problem I made the scheduler more strict. In addition to the previous conditions under which a thread's penalty was increased, it is now also increased when a thread voluntarily gives up the CPU (e.g. waits for a mutex).

Does this make the mutex-waiting thread wait for a new time slice to start before it runs again?

Because I feel that is what should be happening in that case.

Re: Haiku meets 9th processor

Threads never *wait* for time slices. As soon as the scheduler chooses a thread from the run queue, it is run, either completing a previously started time slice or starting a new one.

When a thread wakes up it usually continues the time slice it got before sleeping.

A thread's penalty can be increased no more than once during a time slice. The algorithm has changed slightly since I wrote that, but generally a thread receives a weaker type of penalty if it has been sleeping during its time slice (in most cases that penalty will be cancelled anyway), or a stronger one if it has been CPU bound the whole time.