Hello kernel? You have a syscall from userland!

Note: this article was written by Daniel Reinhold.

One of the features of modern operating systems is the ability to separate application code from the critical code that implements the core of the system. Regular applications run in user mode (often referred to as userland) which means that they cannot directly manipulate the vital system data structures. This makes everything much more stable -- buggy apps may crash and burn themselves, but they can't bring down the rest of the system.

The flipside to this protection is that userland code is walled off from the kernel code. This means, for example, that your application cannot directly call a kernel function. But the kernel implements many useful services that most apps would like to take advantage of. Indeed, that is one of the main purposes of the kernel -- to abstract all those icky underlying hardware details and provide a clean, consistent interface for applications. So how does all this useful interface ever get called and used?

Well, I'm glad that you asked (ok, you didn't... I'll just pretend that you did), since that's the topic of this article. There is a mechanism that the kernel provides so that user apps may tap into the system's coffers. This is the system call interface -- aka "syscalls" (surprise! I guess the title of the article gave it away).

Syscalls are the mechanism by which requests from userland code are converted into function calls within the kernel. This involves two mode switches (often loosely called context switches): first, switching from user to kernel mode in order to run the system service, then from kernel back to user mode to return to the caller. Additionally, any data passed must be copied in both directions (from user to kernel and back again). This means that syscalls are not exactly cheap -- they incur far more overhead than a simple procedure call. But the cost of the service is balanced by the safety of maintaining system integrity.

Caveats

This article will delve into the details of how syscalls are implemented in the Haiku kernel, but there are a couple of caveats that I must lay out from the start:

Only the Intel x86 architecture is covered here. While the overall design of the syscalls mechanism is the same on any platform, the specific details of how it is implemented are highly dependent on the machine architecture.
As I write this, the kernel source code is a moving target. It has already deviated somewhat from the original version forked from NewOS. And it will continue to evolve in the months ahead. I don't think that the basic mechanism for handling syscalls will change (altho you never know), but some of the specific file names and/or code snippets referenced below may become obsolete over time.
I am still learning and studying the syscalls mechanism myself. I believe that all the information presented here is correct, but, in the end, the ultimate reference is the kernel source code itself. Believe what it says first, and take what I've said here as supplemental.

The system interface

Consider your average C program with calls to fopen(), fread(), fclose(), etc. These functions are part of the standard C library and are platform independent -- i.e. they provide the same functionality regardless of what operating system is being run. But how are those calls actually implemented? As system calls in the native OS, of course:

sys_open()
sys_read()
sys_write()
sys_lseek()
sys_close()
. . .

These file operations are so common and so fundamental that it makes sense to offer them as system services. But file operations are not the only services that the kernel provides. There are also operations available for manipulating threads, semaphores, ports, and other low-level goodies. Here's a partial list of some other syscalls defined in Haiku:

sys_system_time()
sys_snooze()
kern_create_sem()
kern_delete_sem()
kern_acquire_sem()
kern_release_sem()
kern_release_sem_etc()
kern_spawn_thread()
kern_kill_thread()
kern_suspend_thread()
kern_resume_thread()
sys_port_create()
sys_port_close()
sys_port_delete()
sys_port_find()
sys_port_get_info()
sys_exit()
. . .

These syscalls provide a good representation of what the kernel is capable of doing and how its operation can be controlled. It is the kernel equivalent of an API, only it's really an SPI (System Programming Interface). So far, as of this writing, there are 78 syscalls defined for the kernel. This number is very likely to increase over time. As a point of reference, one Linux syscalls index lists a total of 237 syscalls currently defined for that platform.

How many syscalls should an OS have then? Well, as many as it needs, I guess. It's a judgement call: more system services mean more power for userland (and possibly finer grained control), but too many complicate the interface to the kernel. The best motto would be "keep it as simple as possible, but no simpler".
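To make the layering between the C library and the kernel's file services a bit more concrete, here is a minimal sketch that does the same job twice: once with the portable stdio calls, and once with the POSIX open()/read()/close() wrappers that sit one layer closer to the kernel. The function names count_bytes_stdio() and count_bytes_posix() are made up for this illustration, and the POSIX calls merely stand in for the sys_open()/sys_read()/sys_close() style of interface listed above.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Count the bytes in a file using the portable C library (stdio). */
long count_bytes_stdio(const char *path)
{
    assert(path != NULL);

    FILE *fp = fopen(path, "rb");
    if (fp == NULL)
        return -1;

    char buf[256];
    long total = 0;
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, fp)) > 0)
        total += (long)n;

    fclose(fp);
    return total;
}

/* Same job one layer down: each call here is a thin wrapper
 * around a system service (open, read, close). */
long count_bytes_posix(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[256];
    long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += (long)n;

    close(fd);
    return total;
}
```

Both functions should report the same size for the same file. The stdio version simply adds buffering and portability on top of the calls that the second version makes directly.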

Peeking with strace

In order to get a better appreciation of the role of syscalls within user applications, you can run a program called strace. This is one of the standard /bin apps included with the BeOS. This very useful command will run a user program while printing out all syscalls as they are invoked. As an example, consider the following command:

strace ls /boot/beos

This will run the command 'ls /boot/beos' while displaying the syscalls encountered during execution. Here is a sample of the output:

user_create_sem(0x0, 0xec09cd6e "addon lock") = 0x10537  (42 us)
_user_get_next_image_info(0x0, 0xfd001788, 0xfd00178c, 0x434) = 0x0  (145 us)
_user_get_next_image_info(0x0, 0xfd000330, 0xfd000334, 0x434) = 0x0  (146 us)
. . .
area_for(0xfd00028c) = 0x2b36  (51 us)
_user_get_area_info(0x2b36, 0xfd00028c, 0x48) = 0x0  (61 us)
user_find_thread(0x0) = 0x908  (16 us)
user_create_sem(0x0, 0xec09cd3d "gen_malloc") = 0x1053a  (58 us)
. . .

This is only a fraction of the output... run it yourself to see the full glory (heck, it's even color-coded!) Each line in the output is formatted as:

syscall_function(arg1, arg2, ... argN) = return_code  (elapsed time)

If an argument is a string, its string value is displayed immediately following the address that was passed. If the return code is a standard error code, its textual tag will be displayed immediately following its integral value. The figure in parentheses at the end of each line is the time the call took, in microseconds.

For example, from the first line of output, we can surmise that something like the following was present in the 'ls' source code:

sem = user_create_sem(0, "addon lock");
// at run time:
//    sem was set to 0x10537
//    user_create_sem() took 42 microseconds to execute

Running strace is a wonderful way to get a handle on how syscalls are being used by applications. You might even find it useful to run your own programs with strace to see how your application interfaces with the kernel.
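If you want to post-process strace output (say, to total up the time spent in each syscall), the line layout shown above is regular enough to pick apart with sscanf(). The following is only a sketch: the struct and function names are invented for this example, and it assumes lines shaped exactly like the sample output (at least one argument, a hex return code, no textual error tag before the timing).

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical record for one line of strace output. */
struct strace_line {
    char name[64];          /* syscall name */
    char args[256];         /* raw text between the parentheses */
    unsigned long result;   /* return code */
    long usec;              /* elapsed time in microseconds */
};

/* Returns 1 if the line matched the expected layout, 0 otherwise. */
int parse_strace_line(const char *line, struct strace_line *out)
{
    assert(line != NULL && out != NULL);

    /* name ( args ) = hex-result ( N us ) */
    int fields = sscanf(line, " %63[^(](%255[^)]) = %lx (%ld us)",
                        out->name, out->args, &out->result, &out->usec);
    return fields == 4;
}
```

Fed the first line of the sample output, this should yield the name user_create_sem, the result 0x10537 and a time of 42. Lines that carry an error tag after the return code, or calls with an empty argument list, would need a more forgiving parser.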

Connecting thru software interrupts

Alright, the kernel offers all these wonderful services as syscalls. But how do user apps actually invoke the syscalls? Thru software interrupts.

Most of you are probably familiar with the concept of hardware interrupts. For example, you press a key and the keyboard generates a hardware interrupt, which, in turn, notifies the keyboard driver to process the input. However, interrupts are just as commonly generated by software events.

The mechanism for generating software interrupts is the INT instruction. This is an Intel x86 opcode that interrupts the current program execution, saves the system registers, and then jumps to a specific interrupt handler. After the handler has finished, the system registers are restored and the execution with the calling program is resumed (well, usually).

The INT instruction thus acts as (sort of) an alternative calling technique. Unlike ordinary procedure calls, which pass their args on the stack, interrupts store any needed args in registers. For example, a normal function call such as:

foo(a, b, c);

would be translated by the compiler into something like:

push c
push b
push a
call foo

An interrupt however, must have any needed arguments loaded into general registers first. The register assignments for the syscall handlers are as follows:

eax -- syscall #
ecx -- number of args (0-16)
edx -- pointer to buffer containing args from first to last

After these registers have been set, interrupt 99 is called. What is the significance of the value 99? None really -- this is simply the interrupt number selected by the kernel for handling syscalls. More on this later.

Mapping the syscalls

Each syscall has an entry point defined by a small assembly language function. Therefore, the syscall interface is an assembly file (called syscalls.S) containing a long list of functions, one for each syscall that has been defined. This file should look like this:

.globl sys_null
.type sys_null,@function
.align 8
sys_null:
    movl  $0, %eax        /* syscall #0 */
    movl  $0, %ecx        /* no args */
    lea   4(%esp), %edx   /* pointer to arg list */
    int   $99             /* invoke syscall handler */
    ret                   /* return */

.globl sys_mount
.type sys_mount,@function
.align 8
sys_mount:
    movl  $1, %eax        /* syscall #1 */
    movl  $4, %ecx        /* mount takes 4 args */
    lea   4(%esp), %edx   /* pointer to arg list */
    int   $99             /* invoke syscall handler */
    ret                   /* return */

. . .

Or rather, it would look like the listing above, except that the code is so boilerplate that the syscall functions actually appear in the source code as a collection of #define macros.

The assignment of system services to syscall numbers is arbitrary. That is, it doesn't really matter which function is syscall #0, syscall #1, syscall #2, etc. so long as everyone is in agreement about the mapping. This mapping is defined in the syscalls.S assembly listing above, and must be matched item-for-item in the C interface header file. For our kernel, the C header is ksyscalls.h which uses an enum to define tags for each syscall:

enum {
    SYSCALL_NULL = 0,
    SYSCALL_MOUNT,
    SYSCALL_UNMOUNT,
    SYSCALL_SYNC,
    SYSCALL_OPEN,
    . . .
};
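Keeping two hand-maintained lists in step is the kind of chore that tends to go wrong sooner or later. One common C idiom for this situation (not what the kernel sources do, as far as this article describes them) is the "X-macro": a single master list from which both the enum and any parallel table are generated. The sketch below uses only the five tags shown above; SYSCALL_LIST, SYSCALL_COUNT and syscall_name() are names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* The one and only list of syscalls; the order defines the numbering. */
#define SYSCALL_LIST(X) \
    X(SYSCALL_NULL)     \
    X(SYSCALL_MOUNT)    \
    X(SYSCALL_UNMOUNT)  \
    X(SYSCALL_SYNC)     \
    X(SYSCALL_OPEN)

/* Expand the list into enum tags: SYSCALL_NULL = 0, SYSCALL_MOUNT = 1, ... */
#define AS_ENUM(name) name,
enum { SYSCALL_LIST(AS_ENUM) SYSCALL_COUNT };

/* Expand the same list into a parallel table of printable names. */
#define AS_STRING(name) #name,
static const char *const syscall_names[] = { SYSCALL_LIST(AS_STRING) };

/* Compile-time check that the table and the enum cannot drift apart. */
static_assert(sizeof syscall_names / sizeof syscall_names[0] == SYSCALL_COUNT,
              "syscall table and enum are out of step");

/* Map a syscall number back to its tag, or NULL if it is out of range. */
const char *syscall_name(unsigned num)
{
    return (num < (unsigned)SYSCALL_COUNT) ? syscall_names[num] : NULL;
}
```

With this arrangement SYSCALL_OPEN comes out as 4, the same value the enum above assigns it, and adding a new syscall means touching exactly one list.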

Interrupt Descriptor Table (IDT)

The code above sets us up for the interrupt call. But what happens when the int $99 instruction is invoked? Quite literally, the exception handler whose address is stored at IDT[99] is called.

The software interrupts rely on the presence of a system structure called the Interrupt Descriptor Table (IDT). This is a memory area allocated and initialized at boot time that holds a table of exception handlers. The table contains exactly 256 entries.

There is an internal x86 register, IDTR, that holds the address of this table. You cannot use this register directly -- it can only be accessed thru instructions such as the lidt (load IDT) instruction. During the stage2 bootstrap, the kernel calls lidt and sets it to the virtual address of the idt descriptor. This descriptor points to a memory area that is initialized with a vector (array) of exception handlers, one for each interrupt number (0 thru 255).

The kernel has some leeway in assigning these handlers. However, certain interrupt numbers have standard, predesignated purposes or are reserved. The table below lists the interrupt numbers and their associated actions that should be implemented by the handlers:

Intel x86 interrupt numbers:

Number	Description	Type
0	Divide-by-zero	fault
1	Debug exception	trap or fault
2	Non-Maskable Interrupt (NMI)	trap
3	Breakpoint (INT 3)	trap
4	Overflow (INTO with EFlags[OF] set)	trap
5	Bound exception (an out-of-bounds access)	fault
6	Invalid Opcode	fault
7	FPU not available	fault
8*	Double Fault	abort
9	Coprocessor Segment Overrun	abort
10*	Invalid TSS	fault
11*	Segment not present	fault
12*	Stack exception	fault
13*	General Protection	fault or trap
14*	Page fault	fault
15	Reserved	. . .
16	Floating-point error	fault
17*	Alignment Check	fault
18	Machine Check	abort
19-31	Reserved By Intel	. . .
32-255	Available for software and hardware interrupts	. . .

*These exceptions have an associated error code.

Exception Types:

fault - the return address points to the instruction that caused the exception. The exception handler may fix the problem and then restart the program, making it look like nothing has happened.
trap - the return address points to the instruction after the one that has just completed.
abort - the return address is not always reliably supplied. A program which causes an abort is never meant to be continued.
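For the curious, here is roughly what one slot of the IDT looks like in 32-bit protected mode, together with a helper that fills it in. This is a sketch based on the Intel gate descriptor layout rather than a copy of the kernel's own code: the names struct idt_gate, idt_set_gate() and idt_gate_handler() are made up, and the selector value used in the usage note is just a placeholder for the kernel code segment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define IDT_ENTRIES   256
#define GATE_PRESENT  0x80                  /* P bit */
#define GATE_DPL(n)   (((n) & 3u) << 5)     /* privilege needed to use "int n" */
#define GATE_INT32    0x0E                  /* 32-bit interrupt gate */

/* One 8-byte IDT entry (interrupt gate) in 32-bit protected mode. */
struct idt_gate {
    uint16_t offset_low;    /* handler address, bits 0-15 */
    uint16_t selector;      /* code segment selector of the handler */
    uint8_t  reserved;      /* must be zero */
    uint8_t  type_attr;     /* present bit, DPL, gate type */
    uint16_t offset_high;   /* handler address, bits 16-31 */
};

static_assert(sizeof(struct idt_gate) == 8, "an IDT gate is 8 bytes");

/* Point IDT[vector] at a handler address. */
void idt_set_gate(struct idt_gate *idt, unsigned vector,
                  uint32_t handler, uint16_t selector, unsigned dpl)
{
    assert(idt != NULL && vector < IDT_ENTRIES && dpl <= 3);

    struct idt_gate *g = &idt[vector];
    g->offset_low  = (uint16_t)(handler & 0xffffu);
    g->selector    = selector;
    g->reserved    = 0;
    g->type_attr   = (uint8_t)(GATE_PRESENT | GATE_DPL(dpl) | GATE_INT32);
    g->offset_high = (uint16_t)(handler >> 16);
}

/* Reassemble the handler address from the two halves. */
uint32_t idt_gate_handler(const struct idt_gate *g)
{
    assert(g != NULL);
    return ((uint32_t)g->offset_high << 16) | g->offset_low;
}
```

A call such as idt_set_gate(idt, 99, handler_address, 0x08, 3) would install a handler that user code may reach with a software interrupt. The DPL of 3 matters here: the processor refuses an INT instruction from user mode if the gate's privilege level is more restrictive than the caller's, which is how the kernel keeps userland from invoking the other vectors directly.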

The exception handlers

The 256 exception handlers that are loaded into the IDT are almost identical. After pushing the specific interrupt number, they all implement the same code sequence:

save all registers (including system registers)
call i386_handle_trap
restore all registers previously saved
return

Because of this, the assembly file that defines these handlers, arch_interrupts.S, is also written largely as a collection of #define macros.
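The macro trick is easy to picture with a C analogue. In the sketch below, a single macro stamps out one tiny stub per interrupt number, and every stub funnels into the same common handler, passing along its own vector. All names here are hypothetical, and the real stubs are of course assembly that also saves and restores registers, which plain C cannot express.

```c
#include <assert.h>

static int last_vector = -1;   /* records what the common handler saw */

/* Stand-in for the master handler that every stub calls. */
static void common_trap_handler(int vector)
{
    assert(vector >= 0 && vector < 256);
    last_vector = vector;
}

/* One macro invocation per interrupt number generates one stub. */
#define TRAP_STUB(n) \
    void trap_stub_##n(void) { common_trap_handler(n); }

TRAP_STUB(0)
TRAP_STUB(13)
TRAP_STUB(99)

/* Lets a caller see which vector was handled most recently. */
int last_trap_vector(void)
{
    return last_vector;
}
```

Calling trap_stub_99() ends up in common_trap_handler(99), just as every entry in the IDT ends up in the one master handler with its interrupt number attached.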

The function i386_handle_trap() serves as the master exception handler. As such, it handles all system interrupts, not just syscalls. However, we're interested specifically in the section that deals with interrupt 99, the syscalls handler.

Here's a snippet of the i386_handle_trap() source code:

void
i386_handle_trap(struct int_frame frame)
{
    int ret = INT_NO_RESCHEDULE;

    switch(frame.vector) {
        case 8:
            ret = i386_double_fault(frame.error_code);
            break;
        case 13:
            ret = i386_general_protection_fault(frame.error_code);
            break;

        . . .

        case 99: {
            uint64 retcode;
            unsigned int args[MAX_ARGS];
            int rc;

            thread_atkernel_entry();

            /* only proceed if the arg count fits and the arg
               pointer is not a kernel address */
            if(frame.ecx <= MAX_ARGS) {
                if((addr)frame.edx >= KERNEL_BASE &&
                        (addr)frame.edx <= KERNEL_TOP) {
                    retcode = ERR_VM_BAD_USER_MEMORY;
                } else {
                    /* copy the args from the user stack to kernel memory */
                    rc = user_memcpy(args,
                        (void *)frame.edx,
                        frame.ecx * sizeof(unsigned int));
                    if(rc < 0)
                        retcode = ERR_VM_BAD_USER_MEMORY;
                    else
                        ret = syscall_dispatcher(frame.eax,
                            (void *)args,
                            &retcode);
                }
            }
            /* return the 64-bit result in eax (low half) and edx (high half) */
            frame.eax = retcode & 0xffffffff;
            frame.edx = retcode >> 32;
            break;
        }

        . . .
    }

    if(frame.cs == USER_CODE_SEG || frame.vector == 99) {
        thread_atkernel_exit();
    }
}

The syscalls are handled in the case of the interrupt number 99. Again, there's no particular significance to the number 99. The Intel documentation allows for interrupt numbers 32-255 to be used freely by the OS for whatever purpose. Travis Geiselbrecht, the original author of this interrupt handling technique, probably decided that 99 was easy to remember.

The highlights of the code are:

thread_atkernel_entry() is called upon entering kernel mode
the number of args (in ecx) and the argv address (in edx) are checked for validity (e.g. a kernel address is bad since syscalls are only intended for user apps)
user_memcpy is called to copy the args from the user stack to kernel memory
if all went well, the syscall dispatcher is called, passing the syscall # (stored in eax)
a 64-bit return code is returned in the [eax,edx] pair
thread_atkernel_exit() is called as kernel mode is exited
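Two of these steps, the sanity check on the request and the splitting of the 64-bit result, can be tried out in isolation. The sketch below is a standalone approximation: syscall_args_ok() and split_retcode() are hypothetical helpers, and the KERNEL_BASE and KERNEL_TOP values are placeholders chosen for the example rather than figures taken from the kernel headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ARGS     16
#define KERNEL_BASE  0x80000000u   /* placeholder value for this sketch */
#define KERNEL_TOP   0xffffffffu   /* placeholder value for this sketch */

/* Returns 1 if the request may go on to the copy step, 0 otherwise. */
int syscall_args_ok(uint32_t nargs, uint32_t user_addr)
{
    if (nargs > MAX_ARGS)
        return 0;                       /* too many args */
    if (user_addr >= KERNEL_BASE && user_addr <= KERNEL_TOP)
        return 0;                       /* pointer into kernel space */
    return 1;
}

/* Split a 64-bit result into the low (eax) and high (edx) halves. */
void split_retcode(uint64_t retcode, uint32_t *eax, uint32_t *edx)
{
    assert(eax != NULL && edx != NULL);
    *eax = (uint32_t)(retcode & 0xffffffffu);
    *edx = (uint32_t)(retcode >> 32);
}
```

One thing worth noticing when reading the handler snippet: if ecx is larger than MAX_ARGS, nothing is ever assigned to retcode before it is copied into eax and edx. A stricter handler would presumably set an error code on that path too, which is why the helper here rejects the request explicitly.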

The dispatcher

The routine syscall_dispatcher() is a core kernel function that finally binds the syscall numbers to their corresponding internal implementations. Here is a snippet of the syscalls.c file that contains the dispatcher:

int
syscall_dispatcher(unsigned long call_num, void *arg_buffer, uint64 *call_ret)
{
    switch(call_num) {
        case SYSCALL_NULL:
            *call_ret = 0;
            break;
        case SYSCALL_MOUNT:
            *call_ret = user_mount((const char *)arg0, (const char *)arg1,
                (const char *)arg2, (void *)arg3);
            break;
        case SYSCALL_UNMOUNT:
            *call_ret = user_unmount((const char *)arg0);
            break;
        case SYSCALL_SYNC:
            *call_ret = user_sync();
            break;

        . . .

    }
    return INT_RESCHEDULE;
}
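The snippet uses arg0, arg1, etc. without showing where they come from. A plausible reading is that they are macros that index into arg_buffer as an array of 32-bit words. The miniature dispatcher below is written on that assumption, with two invented demo syscalls so that it can run outside the kernel. The macro definitions, the DEMO_ names and the default case are all guesses made for the illustration, not the kernel's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed shape of the argument accessors: word i of the arg buffer. */
#define arg0 (((const uint32_t *)arg_buffer)[0])
#define arg1 (((const uint32_t *)arg_buffer)[1])

/* Invented syscall numbers for the demo. */
enum { DEMO_SYSCALL_NULL = 0, DEMO_SYSCALL_ADD };

/* Invented "service" that the dispatcher hands off to. */
static uint64_t demo_add(uint32_t a, uint32_t b)
{
    return (uint64_t)a + b;
}

/* Returns 0 if the call number was recognised, -1 otherwise. */
int demo_dispatcher(unsigned long call_num, const void *arg_buffer,
                    uint64_t *call_ret)
{
    assert(arg_buffer != NULL && call_ret != NULL);

    switch (call_num) {
        case DEMO_SYSCALL_NULL:
            *call_ret = 0;
            break;
        case DEMO_SYSCALL_ADD:
            *call_ret = demo_add(arg0, arg1);
            break;
        default:
            /* unknown call number: report failure to the caller */
            *call_ret = (uint64_t)-1;
            return -1;
    }
    return 0;
}
```

Passing the buffer { 2, 3 } with DEMO_SYSCALL_ADD leaves 5 in the return slot. The casts in the real dispatcher do the rest of the work, turning those raw 32-bit words back into the pointers and integers that each service expects.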

Naming conventions

The "user_" prefix on the dispatched functions is not a requirement, but a common convention in the kernel code. These functions do not generally contain the main implementation, but perform any fixups needed and call the true workhorse routine. There are often analogous "sys_" prefixed functions that do the same thing -- i.e. provide a wrapper for the real implementation.

For example, the user_mount() function is found in the vfs.c file since the mount service is part of the virtual filesystem layer. This function, in turn, calls vfs_mount() which actually performs the mount. Likewise, there is a corresponding sys_mount() function in vfs.c that also calls vfs_mount(). This sys_mount() is a kernel mode version of the (userland) sys_mount assembly function found in syscalls.S.

Altho this could be a point of confusion, the idea behind it is reasonable: whether the calling code is user or kernel mode, the same style of interface is used. The userland mount() function will invoke the syscall and eventually result in the user_mount() dispatch function being called. Kernel mode programs (drivers, addons, etc.) call the sys_mount() function directly and don't use the syscall mechanism. Either point of entry results in the underlying vfs_mount() function being called.
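The two-entry-points-one-workhorse pattern can be sketched in a few lines. Everything here is a toy: the demo_ functions are invented, the "mount" just checks that the path is absolute, and an ordinary memcpy() stands in for whatever user-to-kernel copy routine the real user_ function would use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_PATH_MAX 256

/* The workhorse: in this toy version, "mounting" succeeds (0)
 * only for absolute paths. */
static int demo_vfs_mount(const char *path)
{
    return (path[0] == '/') ? 0 : -1;
}

/* Kernel-mode entry point: the caller is trusted, so the
 * pointer is handed straight through. */
int demo_sys_mount(const char *path)
{
    assert(path != NULL);
    return demo_vfs_mount(path);
}

/* Syscall entry point: the pointer came from userland, so it is
 * checked and copied into a kernel-side buffer before use. */
int demo_user_mount(const char *user_path)
{
    char kpath[DEMO_PATH_MAX];

    if (user_path == NULL)
        return -1;

    size_t len = strlen(user_path);
    if (len >= sizeof kpath)
        return -1;                      /* path too long for the buffer */

    memcpy(kpath, user_path, len + 1);  /* stand-in for a user-to-kernel copy */
    return demo_vfs_mount(kpath);
}
```

Either way in, the same demo_vfs_mount() does the real work. The only difference is how much suspicion the wrapper applies to its arguments first.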

Example run-thru

Ok, we've covered a lot of ground in this article. The mechanism for generating and acting upon syscalls is anything but straightforward. But it can be followed and understood. Let's take a look at an example call.

You have a (userland) program with the following line:

int fd = open ("/some/file", O_RDONLY);

This will get translated into a syscall and the 'open' performed within a kernel mode function.
Here are the steps:

The definition of open() is found within the libc library. In the sources for libc, you will find the open.c file (in the unistd folder) that translates this into a call to sys_open(). Basically, the call has now become:
fd = sys_open ("/some/file", O_RDONLY);
The sys_open() function is defined as the assembly routine within the syscalls.S file. Thus, your app needs to be linked against libc.so to resolve this symbol. This will be true for any syscall, regardless of whether the functionality is part of "standard C library" or not. It may seem strange to be linking to libc in order to resolve a call to kern_create_sem(), for example, but the syscalls interface has to be accessible from some library, and libc, for historical reasons, makes as much sense as any other.
The sys_open() assembly routine loads 4 into eax (the syscall # for sys_open), loads 2 into ecx (the number of args), loads the address of the first arg on the user stack into edx, then invokes the instruction int $99.
The exception handler that receives the interrupt pushes the value 99 on the stack, pushes the contents of all the registers on the stack, and then calls i386_handle_trap().
Inside i386_handle_trap(), the args are copied to a local (kernel space) buffer and then passed to syscall_dispatcher().
The dispatcher has a large switch statement that farms the requests out to different kernel functions based on the syscall #. In this case, the syscall # is 4, which results in a call to user_open(). The original call has now become:
*retcode = user_open ("/some/file", O_RDONLY);
The retcode is a pointer to a 64-bit value back in the i386_handle_trap() function that is used to hold error values.
The user_open() function is compiled into the kernel (called kernel.x86 for the Intel build). The source is found within the vfs.c file since 'open' is a file operation handled by the virtual file system layer. The user_open() function creates a local copy of the file path arg and passes this on to vfs_open(), which is also defined in vfs.c.
The vfs_open() function finally performs the open command on the file... or does it? Actually, the VFS layer acts as an abstraction for handling file operations across all filesystems. So, in truth, vfs_open() simply calls the open function within the filesystem driver for the filesystem that "/some/file" is mounted on. But that process is a whole other topic...
Assuming that the file exists, resides on a valid, mounted volume, and there are no other problems, the file may then be actually opened. Now aint that something!

Well, as you can see, the entire process of executing a syscall is not exactly simplicity itself. It is definitely a layered, delegated process. But the layers are there for a reason -- to provide memory protection and to abstract the system services. Hey, if kernel programming was easy, your grandma would be doing it!

Hopefully this article has cleared up the process somewhat. Go back and peruse the sources and see if it all makes more sense now.