x86_64 port: midterm report

Blog post by xyzzy on Wed, 2012-07-11 11:08

Since my quarter-term report I have made a great deal of progress. The boot loader's x86_64 support is finished, and the kernel can now be booted to the point of searching for the boot volume. A screenshot of this:

This means that most of the major parts of the kernel functionality (e.g. virtual memory management, threading and interrupt handling) are now implemented. Since x86_64 is essentially an extension of x86, there is quite a lot of code that can be shared between them. What I have done is merge the x86 and x86_64 kernel code together: the arch/x86 directory contains the code for both. The parts which are completely different between 32-bit and 64-bit are under 32 and 64 subdirectories, but most of the existing code is common to both architectures. Reusing the existing code has allowed me to progress as quickly as I have: getting hardware interrupts and timers working, among other things, was just a matter of compiling the code into the x86_64 kernel.

Along the way I have run into some interesting bugs, particularly with memory management. Memory management bugs are quite fun to debug. You can get ones that lead to random memory corruption, or even triple faults: if the CPU encounters another exception while trying to handle an exception, you get a double fault, which has a special handler, and if it encounters yet another while executing that handler, it resets the machine (a triple fault). My usual testing environment is QEMU, but when I run into such bugs I usually switch over to Bochs, which, although very slow in comparison to QEMU, has an excellent built-in debugger. It provides information on the cause of a triple fault, allows you to examine memory, registers, and virtual-to-physical memory mappings, and lets you set watchpoints to find where a memory location gets written to.
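To make the escalation from exception to double fault to triple fault concrete, here is a toy model of it as a state machine. This is only a sketch: the names cpu_state and raise_exception are made up for illustration, and real hardware is more selective (only certain combinations of exceptions escalate to a double fault), so treat it as a picture of the idea rather than of the CPU's actual rules.

```c
#include <assert.h>

/* Hypothetical states for a toy CPU; not taken from any real kernel. */
typedef enum {
    CPU_RUNNING,          /* normal execution */
    CPU_IN_HANDLER,       /* handling a first exception */
    CPU_IN_DOUBLE_FAULT,  /* running the double fault handler */
    CPU_RESET             /* triple fault: the machine has been reset */
} cpu_state;

/* Returns the state the toy CPU ends up in when an exception is raised
 * while it is in state s. Simplification: every nested exception
 * escalates, which is stricter than real x86 behaviour. */
static cpu_state raise_exception(cpu_state s)
{
    assert(s != CPU_RESET); /* a reset machine is no longer executing */

    switch (s) {
    case CPU_RUNNING:
        return CPU_IN_HANDLER;       /* ordinary exception */
    case CPU_IN_HANDLER:
        return CPU_IN_DOUBLE_FAULT;  /* exception while handling one */
    default:
        return CPU_RESET;            /* exception in the double fault handler */
    }
}
```

Three consecutive calls starting from CPU_RUNNING end in CPU_RESET, which is why a broken page table that makes every handler fault in turn takes the whole machine down.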

The two most interesting bugs both arose from issues with the 64-bit paging setup code in the boot loader. The first occurred when the VM initialization code was clearing the memory mappings created by the boot loader that are not needed by the kernel: the code was attempting to free a page that was still mapped elsewhere, which caused a panic. I initially thought that this was memory corruption, as the debug output showed no sign of anything being mapped at the address where the page turned up. After spending a while using Bochs watchpoints to find where the page was being mapped, I found that the boot loader was creating invalid mappings in the 64-bit address space. The boot loader runs in 32-bit mode with 32-bit paging, so the 64-bit setup code converts from the 32-bit address space to the 64-bit one. When getting the physical address that a 32-bit address mapped to, it was not checking the present flag of the page directory entry, so it was pulling values out of a page table that didn't actually exist, resulting in the invalid mappings.
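For illustration, this is roughly what a 32-bit address lookup with the missing check looks like. It is a minimal sketch and not the actual boot loader code: lookup_physical, phys_to_ptr, build_demo_tables and the phys_mem array standing in for physical memory are all hypothetical, and it assumes plain 4 KiB pages with two-level (non-PAE) paging and no 4 MiB pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE     4096u
#define NUM_PAGES     4u
#define PAGE_PRESENT  0x1u         /* bit 0 of a page directory/table entry */
#define FRAME_MASK    0xFFFFF000u  /* physical frame address bits */

/* Hypothetical stand-in for physical memory: four simulated pages. */
static uint32_t phys_mem[NUM_PAGES][PAGE_SIZE / sizeof(uint32_t)];

static uint32_t *phys_to_ptr(uint32_t phys)
{
    assert(phys / PAGE_SIZE < NUM_PAGES);
    return phys_mem[phys / PAGE_SIZE];
}

/* Walk the two-level 32-bit page tables. Returns false if vaddr is unmapped. */
static bool lookup_physical(uint32_t page_dir_phys, uint32_t vaddr,
                            uint32_t *phys_out)
{
    uint32_t pde = phys_to_ptr(page_dir_phys)[vaddr >> 22];
    if (!(pde & PAGE_PRESENT))
        return false;  /* the check that was missing: no page table here */

    uint32_t pte = phys_to_ptr(pde & FRAME_MASK)[(vaddr >> 12) & 0x3FFu];
    if (!(pte & PAGE_PRESENT))
        return false;  /* page table exists, but this page is unmapped */

    *phys_out = (pte & FRAME_MASK) | (vaddr & 0xFFFu);
    return true;
}

/* Demo layout: directory in page 0, page table in page 1, junk in page 2. */
static void build_demo_tables(void)
{
    memset(phys_mem, 0, sizeof(phys_mem));
    phys_mem[0][1] = 0x1000u | PAGE_PRESENT;  /* valid directory entry */
    phys_mem[0][2] = 0x2000u;                 /* stale entry, not present */
    phys_mem[1][0] = 0x3000u | PAGE_PRESENT;  /* maps 0x00400000 -> 0x3000 */
    phys_mem[2][0] = 0x3000u | PAGE_PRESENT;  /* leftover junk that looks valid */
}
```

With these demo tables, looking up 0x00400123 succeeds, while 0x00800000 is correctly reported as unmapped. Without the first check, the second lookup would follow the stale directory entry into page 2 and return a mapping that was never set up, which is the kind of invalid mapping described above.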

The second bug was spurious triple faults in the kernel's slab allocator. This bug couldn't be debugged in Bochs, because for some reason it was only occurring in QEMU. I ended up moving dprintf calls through the slab code to find exactly where it was breaking. The cause of the bug was the way that the boot paging setup code was creating 64-bit page tables. It was leaving them mapped in the virtual addr