Greetings (mostly) from the Kernel (Debugging Land)

Blog post by mmlr on Thu, 2011-11-17 17:00

So what is going on right now in the time I spend on my Haiku contract? For the past two and a half weeks I've had my mind wrapped around various parts of the kernel. Things started out at BeGeistert and the coding sprint following it. The nice thing about the coding sprint is that you spend a lot of time with very knowledgeable people and can therefore tackle things that you would usually shy away from. In this case, Ingo Weinhold and I were seeing some random memory corruption problems and an apparent memory/pages leak. So we started investigating those by adding more debug functions into the relevant parts.

Debug Memory Initialization for the Slab

The old (non-slab) heap implementation Haiku used for quite a while had gained a lot of debugging features over the time. When switching to the slab those were "lost" in a sense. Since the general stability was far better at that point it didn't really hurt to lose them though, so the pressure of adding them there again was quite low.

One of the debug features was initializing all allocated memory to some non 0 value (0xcc in our case). This ensures that code assuming 0 initialized memory trips up and crashes so that these bugs can be found and fixed. On the other side freed memory is reset to (0xdeadbeef), a value some might already be familiar with. When some code uses already freed memory that value usually pops up somewhere and pretty clearly indicates what happened. Bugs like these are normally well hidden, because if the memory isn't re-used by someone else in the meantime, accesses to it usually still work, leading to very hard to trace bugs where only certain timing of things trigger an actual crash or misbehaviour. By ensuring that freed memory is "invalidated" it becomes easier to spot such problems.

So these two things were the obvious first choice in trying to track down memory corruption. Indeed it already triggered at least once and lead to bug report #8123 where apparently an already freed kernel structure was freed again. I've got some leads as to where the actual problem might be, but I haven't been able to find a way to reproduce that error yet.

Allocation Tracking

Another thing that the old heap implementation was able to do, is tracking what allocations are currently active and who allocated them. This is immensely useful when you're seeing a memory leak that slowly (or not so slowly) eats up memory. As mentioned we also saw some increasing memory usage on a supposedly completely idle system.

We used a combination of the excellent kernel tracing mechanism and some helper structures to construct the same allocation tracking that was present on the old heap. The tracing mechanism already stores the team, thread and time of when the entry is created. Also there are trace objects available that allow to get a stack trace of where an entry was created. That means all the information we were looking for would already be present in the tracing entry we created, so basically we "only" needed to associate the entries with the actual allocation. Implementing this, and the corresponding functions to analyze that collected data got us a few very neat KDL commands that can now be used to easily see who's burning through memory without returning it.

We obviously applied that functionality to check up on what was going on. But well, it didn't actually reveal anything unusual. So the problem had to be somewhere else. If it wasn't in the heap, then it had to be a pages leak, meaning that raw page allocations were causing the memory usage to increase.

Page Allocation Tracking

With the slab allocation tracking just done it was pretty easy to apply the same to page allocations. So we did, the same tracing entry leveraging tracking infos were added to the page allocations. Also the analyzing function was pretty similar.

Before being able to make use of that page allocation tracking the coding sprint was sadly already over.

Reporting Errors

After getting home I continued analyzing the supposed leak using that mechanism and reviewing some code. During coding sprint there were a lot of other changes, and one of them concerned what is included in the "used pages" number, meaning the reported amount of "used memory" in the broader sense. Since by now the numbers were off by so much it became obvious that this couldn't really be a leak, it had to be some error in reporting the numbers.

So I started getting into the code and check what as really going on and what was supposed to happen. It turns out that the number of total pages is made up of quite a few different numbers, some overlapping each other. The change in question didn't take that into account, so some pages were accounted for twice resulting in inflated numbers for the used_pages count. Since this is a rather complex matter it took a while to fully grasp what each of the values really represented, but after understanding it it was rather easy to come up with a calculation that would add up the right numbers.

This change fixed the wrong reporting. While this might look like a purely cosmetical change at first ("so the numbers are a bit off, who cares?"), but it really is more than that. The same numbers are used internally to judge how critically the system is in need of freeing up resources to continue to function. So having wrong numbers there would eventually cause wrong decisions to be made.

Diverse Bug Reports

During all that time there's pretty much always a steady stream of bug reports being created and updated. So a relatively large chunk of time again was spent to get the relevant information out of these and trying to track down their causes.

Frustratingly I am yet unable to trigger any of these bugs locally, making it really hard for me to figure out the problems behind them. I've reviewed quite a bit of code to try and spot logic errors in the code, but it is not really an all that efficient approach. Due to the size and reach of some components it is nearly impossible to find all the places involved and keeping all the interactions in mind.

Faced with that situation I've also started to let some of my other computers run through stress tests to try and trigger edge cases (running out of memory, out of pages and/or address space), fixing a debug helper along the way. So far all these machines run through the tests just fine though. There's some potential for optimization, but that isn't really my concern for now.

Eventually I think a combination of code review leading to educated guesses and stressing the corresponding parts will lead to success, so I'm keeping on it.

Outlook

I've been asked for an outlook on what I'm going to work on and why I am getting "held up" with this kernel stuff right now. I've explained it in a recent email like this:

"Why am I working on it right now? Easy: with the coding
sprint at BeGeistert I could dig into that area of the kernel with
Ingo, who has a more intimate knowledge there. I'm continuing now
because my mind is set into that mode and the knowledge is still fresh
so that I can be more productive. Pushing that work out to a later
point in time would essentially mean that I'd have to re-enter that
mindset then, simply wasting time on refreshing knowledge that would
still be here for free right now. I deemed that an inappropriate use of
my contract time."

I realize that this is work that is very hard to see from the outside, often only indirectly leading to fixes and sometimes plain simply resulting in nothing (i.e. the problem is not in the suspected part of the code at all). Believe me that this is at least as frustrating to me as it is for people watching the progress of Haiku. But eventually this work will have to be done, and since I am in the fortunate situation of having the time to really dive into these things and keep at them, I do so.

What happened to the wireless situation? My first priority when starting my contract was clearly getting the wpa_supplicant going so that WPA/WPA2 encryption would work. This task is technically done now, meaning that the encryption and authentication works. The wpa_supplicant optional package is available and to my knowledge works fine (apart from some cases where the driver involved doesn't work, which is a driver issue though). I say technically complete because it isn't quite as user friendly as it eventually needs to be. Storing the password isn't yet implemented and the manual configuration of known networks is "unhandy" at best. I am fully aware of that and will return to that work relatively soon. I felt the situation to be usable enough to let it rest for the moment though.

Comments

Re: Greetings (mostly) from the Kernel (Debugging Land)

Nice to hear another progress report, not sure why anyone feels you should be working on WPA when you have added so much already. In my opinion you have the code at a point where people can test and bug report any issues. The kernel is always more important and although fixes are not as visible, your work is much appreciated.

Re: Greetings (mostly) from the Kernel (Debugging Land)

It's always a good idea to do kernel related improvements, but perhaps after you finish with that and want to move to a new subject, perhaps you can choose the subjects you would like to work on, and then make a poll to ask what the users/developers prefer. That way you can work on the subject you like, and the users/developers are happy too, because they could have a little influence.

Re: Greetings (mostly) from the Kernel (Debugging Land)

awesome report. Thanks Michael.

Re: Greetings (mostly) from the Kernel (Debugging Land)

Despite other points of view, I think your kernel adventures are wise to start with. Then just do what you think is most important, for that will bring the most progress to HAIKU. Indeed lack of WPA/WPA2 support is the reason there is no native HAIKU in my home at this moment. WPA2 is the standard for cable modem/routers with built in Wifi around here nowadays.

Good luck Michael.

Re: Greetings (mostly) from the Kernel (Debugging Land)

And ehm create a 'stub' storing the passwords to speed up development, maybe someone else or whenever you find the time can it be filled in.

Re: Greetings (mostly) from the Kernel (Debugging Land)

In case you get frustrated with the new bugs, there are still plenty of old ones to be fixed ;-)

One particularly weak spot of Haiku is #4157 which is easily reproducible. And if you want to put the new memory corruption tracing facilities to good use, there is #7889, and #7968 waiting. Copying a few CDs should do the trick to trigger them. Or you could find out why ripping them with lame directly from disk doesn't work (it'll choke after a while).

Then there is a bug in the network stack, possibly net_timer or TCP, that's easily reproducible with very slow connections. Sometimes it will stop transferring anything even on a LAN connection - then it helps starting another connection to the same system which points to the net_timer being the culprit. At least on my desktop system, that is.

I guess I could go on and on for hours. Point is: if some bugs are frustrating or you cannot reproduce them, don't bother too much, we have so much more to offer for you.

I always enjoy reading your posts, even though I have usually a pretty good idea what you're up to.

Re: Greetings (mostly) from the Kernel (Debugging Land)

I'm not sure it's something Michael did recently, but since I updated from r42858 to now r43238, my Haiku appears more stable than before. At least after the first few days, I haven't been in KDL once. Am I just lucky? We'll see...
Oh, and my pop3ing mail_daemon hasn't hung once when sending mail, which it did constantly before. I had to kill and restart the daemon pretty much for every mail. I doubt that's Michael's doing, and I thought Axel only had IMAP in his bug-choking grasp. Which would only leave Adrien's refurbished new mail notifications... :) (Which BTW look nice, but are a bit annoying as I can't find a way to disable them...)

Enough off-topics...
Thanks, Michael for fixing these well-hidden deep down system bugs (or at least for developing tools to sniff them out while being elusive on your hardware).

Regards,
Humdinger

Re: Greetings (mostly) from the Kernel (Debugging Land)

Any developer will sympathise with your current situation, I don't see any need to explain your actions! Clearly fixing bugs in the kernel and memory management systems is going to be a lot more beneficial in the long run than getting WPA super user friendly right this minute.

Keep up the good work!