From Bugs back to Wireless and Friends

Blog post by mmlr on Mon, 2011-11-28 02:04

As this week concludes, I’d like to post an update on what I’ve been up to and what I’ll be working on next. After fixing a few kernel issues and looking into some others, I’ve come to a point where I’ll gradually refocus on some of the tasks I left open before mentally entering kernel debugging land. In this blog post I’ll also try to describe some of what I did this week, to hopefully make it a bit more accessible.

Bug Hunting

I spent most of my time this week tracking down various issues. I started by continuing where I left off in my last blog post. Thanks to feedback I’ve received since then, I also broadened my “range of interest” a bit, looking into a few older bug reports that might now be easier to debug with the new debugging features.

As usual I’ve also processed incoming new tickets and ticket updates. Especially interesting to me are obviously issues that prevent Haiku from booting and issues in the areas I worked on previously (both of which tend to coincide more often than one would expect).

For example, I’ve made the interrupt routing code stricter in deciding which devices to look at, which will hopefully resolve ticket #8111. I’ve also revisited USB legacy handoff due to ticket #8085, where I still don’t see what we do so differently from other implementations.

This ticket handling brings me to a topic I’ve written about on more than one occasion, but I’ll stress it once more: if you run into issues, especially critical ones that prevent booting or crash the system, please create tickets for them or add comments to existing tickets that describe your issue. Depending on the issue, having more data points to work with can make a real difference. Also, don’t be shy about creating tickets: search for a matching ticket, and if you don’t find one, just go ahead and open a new one. If you’re not sure about the issue you’re experiencing, someone more knowledgeable can and will simply close the ticket as a duplicate. Without tickets, we developers are more or less blind to issues we aren’t running into ourselves (often due to the limited hardware and system configurations we have available).

A good example is ticket #8153, where a certain system configuration led to a crash on boot. The issue was fixable from the information in the ticket alone, and would have been completely overlooked without a ticket, simply because this specific configuration doesn’t seem to be very common. So if you run into an issue running Haiku on some more obscure system, don’t just think “yeah I didn’t expect it to run on this config anyway”, but create a ticket instead. Thank you!

I’ve set up a secondary system to (try to) reproduce some of the issues I’m tracking. This is especially useful in cases where reproducing an error needs a long-running system or time-intensive but mostly unattended and boring preparation (#7889 for one, where one needs to copy a couple of complete audio CDs). Having a secondary system frees my main working machine so I can continue working on other things or start reviewing related code.

It was also on that other system that I ran into the issue described in ticket #8068. I remembered seeing that particular crash on my laptop before, but I was still working on native graphics support back then, so I disregarded it. After implementing native resolution setting the issue wasn’t reproducible on this laptop anymore, which has a pretty obvious reason in hindsight: it only occurred when using VESA graphics.

An app_server crash is a critical issue: from a user’s point of view it might as well be a KDL, since it essentially makes the system unusable and is unrecoverable. Since it is also triggered fairly easily, and since VESA is still a widely used reality on Haiku, I deemed it definitely worth investigating and fixing. Pretty soon it became obvious that the crash itself wasn’t where the problem was hiding; it was merely a victim of something going wrong on a deeper level. Digging further revealed that the vm86 code, which allows virtual 8086 mode to be used in Haiku to make VESA BIOS calls, had to be responsible. In some specific case it would introduce a problem when changing the resolution via such a VESA BIOS call. Luckily the problem was easy enough to reproduce: writing a script that constantly changed resolutions even allowed the crash to be “automated”, so to speak. Even so, I was mostly unfamiliar with the vm86 code, having only casually looked over it out of curiosity before, so I first had to understand how virtual 8086 mode was actually supposed to work and how it was supposed to be properly entered and left. Checking over that code already showed that some of the interrupt handling code had gotten out of sync, because it wasn’t updated the last time the other parts of that code were. Sadly, fixing that didn’t make the crash disappear. That’d have been too easy, I guess.
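The reproduction script mentioned above is not shown in this post. As a rough sketch of the idea only, a loop like the following cycles through a list of resolutions over and over. The `set_mode` callable is a hypothetical placeholder for whatever actually switches the resolution on the system under test; it is not a Haiku API.

```python
import itertools


def stress_mode_switches(set_mode, modes, iterations):
    """Call set_mode(width, height) repeatedly, cycling through modes.

    set_mode is a hypothetical placeholder supplied by the caller.
    Returns the number of switches that completed.
    """
    done = 0
    for width, height in itertools.islice(itertools.cycle(modes), iterations):
        set_mode(width, height)
        done += 1
    return done
```

The more often the mode switch runs, the more chances there are to hit a timing-dependent failure during one of the calls.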

On a hunch, I had already established that the problem wasn’t triggered if only a single CPU was in use, and also that it wasn’t a general thread preemption problem affecting code running in virtual 8086 mode. Virtual 8086 mode is pretty much transparent as far as thread scheduling is concerned anyway. So it had to be something CPU specific, most probably something we set up in Haiku that got corrupted along the way in virtual 8086 mode. Following that train of thought, the parts of the puzzle started fitting together. Having TLS in the back of my mind for some reason, and seeing that it fit the place where I had tracked down the app_server crash, made it relatively obvious: exiting virtual 8086 mode restored the CPU state to what it was before entering. This is of course correct in general, but it also restored the %fs register, which we reserve to store the CPU specific TLS (thread local storage) address. This value was wrong in the specific case where the thread doing the mode switch got preempted and rescheduled on a different CPU while still in virtual 8086 mode. After returning to userland (to the app_server in this case), the next use of anything that depends on TLS would crash, leading to a relatively unhelpful generic crash in a place that isn’t actually at fault. Once understood, the problem was relatively easy to fix by simply re-setting the register to the proper value after returning from virtual 8086 mode.
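To make the failure mode a bit more tangible, here is a toy model in Python. It is not the kernel code. It assumes a simplified picture in which each CPU has its own TLS segment selector, and a context switch repoints that CPU’s segment at the TLS of the thread it runs. The selector values and all names (`Machine`, `vm86_call`, and so on) are made up for illustration.

```python
from dataclasses import dataclass

# Hypothetical per-CPU TLS selector values; real values differ.
TLS_SELECTOR = {0: 0x33, 1: 0x3B}


@dataclass
class Thread:
    name: str
    tls_base: int
    cpu: int = 0
    fs: int = 0  # models the thread's %fs register


class Machine:
    def __init__(self):
        # TLS base that each CPU's TLS segment currently points at
        self.segment_base = {cpu: None for cpu in TLS_SELECTOR}

    def schedule(self, thread, cpu):
        """Context switch: run thread on cpu and repoint that CPU's segment."""
        thread.cpu = cpu
        thread.fs = TLS_SELECTOR[cpu]
        self.segment_base[cpu] = thread.tls_base

    def read_tls(self, thread):
        """Resolve the thread's %fs to the TLS base it would actually see."""
        cpu = next(c for c, sel in TLS_SELECTOR.items() if sel == thread.fs)
        return self.segment_base[cpu]


def vm86_call(machine, thread, migrate_to=None, other=None, fix=False):
    """Model one virtual 8086 call, optionally migrating CPUs in the middle."""
    saved_fs = thread.fs  # state captured on entry
    if migrate_to is not None:
        old_cpu = thread.cpu
        machine.schedule(thread, migrate_to)
        if other is not None:
            machine.schedule(other, old_cpu)  # old CPU now runs another thread
    thread.fs = saved_fs  # blanket state restore on exit
    if fix:
        thread.fs = TLS_SELECTOR[thread.cpu]  # re-set %fs for the current CPU
    return machine.read_tls(thread)
```

In this model, a call without migration returns the thread’s own TLS base. A call that migrates from CPU 0 to CPU 1 while another thread takes over CPU 0 returns that other thread’s TLS base, unless `fix=True` re-sets the register after the restore.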

Next Tasks

As the title of this blog post already reveals, I now intend to focus more on the other tasks I’ve left dormant for some time, namely completing what is missing to make wireless networks more usable day to day. As mentioned before, wireless encryption is already complete, so I won’t actually be working on encryption, but on the missing parts that make configuring and using a wireless network more comfortable. Since I’ve already implemented the settings backend that will be used (during the first few weeks of my contract), only one important part is left: storing the passwords/keys. As described in my first blog post, I’ll be introducing a KeyStore that can securely manage keys, passwords, certificates, etc. in a central place and in a generic way (so that it can be used by other applications as well). I’ve started working on that based on a prototype API Axel Dörfler came up with, and I’ll now concentrate on making it more generic and actually implementing it. Once this is done, the wpa_supplicant will be the first user of this new API, storing the credentials necessary to access networks.

Of course I’ll still be trying to reproduce other issues on my other system(s) and handling tickets as they come in.

Until next time, thanks for reading!